Senior Platform Engineer, Evaluations (AI Infrastructure)
$200k–$400k base + equity
New York City (Onsite)
Company Overview:
This Series A startup is pioneering a new category in AI-native infrastructure automation. Their platform doesn’t just surface dashboards; it actively investigates, reasons over, and automates responses across production systems, giving engineers time back to build rather than firefight. With $3M+ in ARR and major pilots underway at enterprise customers, they’re entering a critical phase of scaling the platform’s intelligence and product depth.
The Role:
As the first dedicated Evaluations Engineer, you’ll own and operationalize how quality is defined for the AI agent at the heart of the product. This is a zero-to-one opportunity to build the evaluation stack (pipelines, metrics, tooling, and infrastructure) from scratch and to shape how teams across platform, product, and research iterate with confidence.
You’ll sit within Platform Engineering and partner closely with backend, ML, and research teams. This is a high-impact, high-autonomy role with strong executive buy-in and close collaboration with the CTO.
What You’ll Do:
- Build and own online and offline evaluation pipelines for agent behavior across MELT (metrics, events, logs, traces) data, code, and unstructured inputs
- Define and refine quality metrics that capture reasoning trajectories, not just outputs
- Design evaluations for messy, real-world, high-volume, hard-to-label systems
- Extend core agent infrastructure (e.g., middleware, sub-agents, orchestration) to support evaluation and iteration
- Productionize the stack with strong observability, uptime, and performance guarantees
- Develop internal tools (e.g., CLIs, validation harnesses) to accelerate iteration for platform, research, and product
- Collaborate with research engineers to bring new agent architectures (e.g., multi-path reasoning) into production
What You’ll Bring:
- Experience in ML platform, data, or backend engineering, ideally in zero-to-one or ambiguous domains
- Experience designing evaluation systems in noisy, context-rich environments (e.g., CV, research, infra ML)
- Strong systems thinking: can reason across distributed systems, data modeling, and product surface areas
- Ability to write clean, scalable code in Python and TypeScript
- Ability to define ground truth in approximate domains and defend your metrics with rigor
- Track record of turning your work into platforms that unlock other teams
Tech Stack:
- Python, TypeScript
- MELT data pipelines
- Distributed systems
- Internal agent frameworks and orchestration
Why Join?
- Own a foundational product surface with direct executive support
- Work with top-tier engineers from companies like Datadog and build tooling for real enterprise complexity
- Define what “quality” means in one of the most interesting ML applications today
- Shape the evaluation discipline from the ground up in a fast-moving, high-conviction environment
- Competitive salary, generous equity, and thoughtful perks: full coverage, retirement, fitness stipend, unlimited PTO, and more
About People In AI:
People In AI is a specialized recruiting partner for cutting-edge AI startups. We help top engineers find high-impact roles in applied AI, infrastructure, and research. All roles we share are with well-funded companies building category-defining products.
Ready to build a zero-to-one evaluation system with real production impact? Let’s talk.