Senior MLOps / ML Platform Engineer
Location: Remote (U.S.) | Preference for SF Bay Area
Type: Full-time, Permanent
Salary Range: $180,000 – $250,000 + Equity + Benefits
About the Opportunity
People in AI is working with a confidential, late-stage startup that’s scaling one of the most advanced ML platforms in production. The company operates at enormous scale, supporting trillions of real-time and batch interactions across its data infrastructure, and it’s hiring experienced engineers to help build the backbone of its machine learning practice.
You’ll join a high-impact ML Platform team that owns the infrastructure used by 20+ ML Engineers and Data Scientists — enabling faster experimentation, deployment, and monitoring of models in production.
What You’ll Work On
- Design, build, and operate ML infrastructure for training, deployment, and inference
- Scale and manage feature stores powering real-time and batch use cases
- Develop high-throughput pipelines using Ray, Apache Spark, and Kafka
- Improve latency and reliability of ML model serving (GPU + CPU)
- Work with tools such as MLflow, Argo, Terraform, and Kubernetes (EKS)
- Build internal tooling and automation to improve ML developer workflows
- Collaborate closely with cross-functional ML teams to enable experimentation at scale
Ideal Background
- 5+ years in MLOps, ML Platform Engineering, Data Engineering, or Infrastructure
- Strong experience with Apache Spark, Spark Structured Streaming, Kafka, Ray, or similar tools
- Proven experience building or scaling feature stores (e.g. Tecton, Feast)
- Deep understanding of online vs offline inference, and how to optimize for both
- Hands-on experience with Kubernetes (EKS), Terraform, and cloud-native infra (AWS preferred)
- Background in software engineering, with a strong focus on production-grade systems
- Bonus: experience managing GPU compute environments or working with CI/CD for ML workflows
Tech Stack Highlights
- Infra: Kubernetes (EKS), Terraform, Helm, Istio, Cloudflare
- Pipelines: Spark, Ray, Kafka, Airflow
- Languages: Python, Java, Scala
- Serving & Orchestration: MLflow, Argo Workflows, Argo CD
- Monitoring: Datadog, Prometheus
- Modeling tools: Hugging Face 🤗, PyTorch, TensorFlow, Metaflow
Why Apply
- Join at a pivotal time — huge ownership and technical influence
- Work on systems used by hundreds of millions of users
- Competitive compensation + strong equity upside
- Remote flexibility + preference for Bay Area engineers for in-person collaboration