Senior Platform Engineer, Evaluations (AI Infrastructure)
$200k–$400k base + equity
New York City (Onsite)
Company Overview:
This Series A startup is pioneering a new category in AI-native infrastructure automation. Their platform doesn’t just surface dashboards; it actively investigates, reasons over, and automates responses across production systems, giving engineers time back to build rather than firefight. With $3M+ in ARR and major pilots underway at enterprise customers, they’re entering a critical phase of scaling the platform’s intelligence and product depth.
The Role:
As the first dedicated Evaluations Engineer, you’ll own and operationalize how quality is defined for the AI agent at the heart of the product. This is a zero-to-one opportunity to build the evaluation stack (pipelines, metrics, tooling, and infrastructure) from scratch and to shape how teams across platform, product, and research iterate with confidence.
You’ll sit within Platform Engineering and partner closely with backend, ML, and research teams. This is a high-impact, high-autonomy role with strong executive buy-in and close collaboration with the CTO.
What You’ll Do:
- Build and own online and offline evaluation pipelines for agent behavior across MELT (metrics, events, logs, traces) data, code, and unstructured inputs
- Define and refine quality metrics that capture reasoning trajectories, not just outputs
- Design evaluations for messy, real-world, high-volume, hard-to-label systems
- Extend core agent infrastructure (e.g., middleware, sub-agents, orchestration) to support evaluation and iteration
- Productionize the stack with strong observability, uptime, and performance guarantees
- Develop internal tools (e.g., CLIs, validation harnesses) to accelerate iteration for platform, research, and product
- Collaborate with research engineers to bring new agent architectures (e.g., multi-path reasoning) into production
What You’ll Bring:
- Experience in ML platform, data, or backend engineering, ideally in zero-to-one or ambiguous domains
- Experience designing evaluation systems in noisy, context-rich environments (e.g., CV, research, infra ML)
- Strong systems thinking: can reason across distributed systems, data modeling, and product surface areas
- Ability to write clean, scalable code in Python and TypeScript
- Ability to define ground truth in approximate domains and defend your metrics with rigor
- Track record of turning your work into platforms that unlock other teams
Tech Stack:
- Python, TypeScript
- MELT data pipelines
- Distributed systems
- Internal agent frameworks and orchestration
Why Join?
- Own a foundational product surface with direct executive support
- Work with top-tier engineers from companies like Datadog and build tooling for real enterprise complexity
- Define what “quality” means in one of the most interesting ML applications today
- Shape the evaluation discipline from the ground up in a fast-moving, high-conviction environment
- Competitive salary, generous equity, and thoughtful perks: full coverage, retirement, fitness stipend, unlimited PTO, and more
About People In AI:
People In AI is a specialized recruiting partner for cutting-edge AI startups. We help top engineers find high-impact roles in applied AI, infrastructure, and research. All roles we share are with well-funded companies building category-defining products.
Ready to build a zero-to-one evaluation system with real production impact? Let’s talk.