Lead MLOps Engineer
$225K + Bonus + Equity – Remote (USA)
About Us
People in AI is a specialized staffing agency dedicated to helping AI/ML professionals find the best career opportunities. We’re currently recruiting on behalf of a well-funded, rapidly growing SaaS company that is scaling its platform to train, deploy, and monitor thousands of machine-learning models. This role is fully remote within eligible US states and offers a base salary of around $225K plus bonus and equity.
Why This Role Is Exciting
- High-Impact MLOps: You’ll design and implement pipelines that handle ML models at massive scale, supporting thousands of customers.
- Strong DevOps Focus: Manage AWS infrastructure in depth (VPC, security groups) with Terraform and Kubernetes to ensure reliability and scalability.
- Innovate with Databricks: Integrate Databricks services into an existing AWS environment, optimizing everything from data engineering to real-time inference.
- Opportunity to Lead: Collaborate with data scientists, platform engineers, and DevOps teams, potentially mentoring junior colleagues in best practices.
Key Responsibilities
Infrastructure & DevOps
- Own and integrate AWS services (e.g., SageMaker, EKS) with an emphasis on networking, IaC (Terraform/CloudFormation), container orchestration (Kubernetes), and security.
- Oversee automation and provisioning to ensure minimal manual overhead.
MLOps Pipeline Development
- Build and enhance end-to-end pipelines for training, deployment, and monitoring of thousands of ML models.
- Champion best practices to minimize human intervention while maintaining high availability and performance.
Feature Stores & Data Management
- Evaluate and implement feature stores (Databricks Feature Store, Feast, etc.).
- Streamline data workflows to ensure efficient model development and serving.
CI/CD & Orchestration
- Refine CI/CD pipelines (GitHub Actions or similar) for secure and automated deployments.
- Collaborate with DevOps to unify build, test, and release practices across the ML lifecycle.
Monitoring & Optimization
- Deploy or integrate monitoring solutions (Prometheus, Grafana) to track model and infrastructure health.
- Optimize cost and performance at scale, from data ingestion through model serving.
Collaboration & Leadership
- Work closely with cross-functional teams (data scientists, platform engineers, DevOps) to align technical vision and goals.
- Offer mentorship and guidance in MLOps and DevOps methodologies, shaping the team’s technical roadmap.
Ideal Candidate Profile
- 5+ years’ experience in MLOps, DevOps, Machine Learning Engineering, or Data Engineering.
- Deep AWS knowledge (SageMaker, EKS, VPC networking, security groups) and hands-on Terraform/CloudFormation experience.
- Strong containerization/orchestration skills with Docker and Kubernetes, including security and advanced networking.
- Proficient in Python with exposure to ML frameworks (TensorFlow, PyTorch) and version control (Git).
- CI/CD expertise (GitHub Actions or similar) along with automated testing and deployment best practices.
- Excellent communication: Able to interact with technical and non-technical stakeholders, driving alignment on infrastructure and ML initiatives.
- Problem solver who thrives in fast-paced environments, quickly adapting to new tech and scaling challenges.
What’s in It for You
- Competitive Compensation: ~$225K base + bonus + equity.
- Fully Remote: Enjoy flexible work arrangements within eligible US states.
- High-Growth Environment: Collaborate with a passionate team to scale an ML platform used by 6,000+ customers.
- Cutting-Edge Tech Stack: Leverage AWS, Databricks, Kubernetes, Terraform, and more to shape a robust, industry-leading MLOps ecosystem.