Senior Machine Learning Evaluation Engineer (LLMs & Decision Quality)

NYC

$500,000 - $900,000 total comp

We’re working with a world-class, research-driven organization operating in a high-stakes decision-making environment to hire a Senior Machine Learning Evaluation Engineer.

This role sits at the intersection of AI, data, and judgment. The focus is not on building flashy demos or optimizing infrastructure, but on answering a harder question:

When should an AI system be trusted?

The role

You’ll be responsible for designing and owning the evaluation layer for large language models and AI systems used to support real, consequential decisions.

This includes:

Building and maintaining golden evaluation datasets grounded in expert judgment
Designing offline benchmarks to compare models, prompts, and retrieval strategies
Defining quality metrics that go beyond surface-level accuracy
Partnering closely with researchers and domain experts to translate intuition into measurable criteria
Connecting offline evaluation results with online behavior once systems are live
Detecting regressions, drift, and subtle failures over time
Iterating on evaluation frameworks as models and use cases evolve

You’ll operate as a lead individual contributor with real influence over how AI quality is defined, measured, and enforced.

What this role is not

Not an ML infrastructure or serving role
Not a prompt-engineering or chatbot UX role
Not pure research with no production ownership
Not dashboard-only analytics

This is a hands-on engineering role focused on evaluation, trust, and decision quality.

What we’re looking for

We’re looking for someone who has actually owned model evaluation, not just consumed metrics.

Strong signals include:

Experience designing offline evaluation or experimentation frameworks
Deep understanding of the mismatch between offline evaluation results and online behavior, and how to manage it
Ownership of model monitoring, drift detection, or regression analysis
Comfort working with ambiguous, qualitative outputs (e.g. LLM responses)
Experience partnering with domain experts or stakeholders to define “what good looks like”
Strong Python and data skills; comfort building lightweight pipelines and analysis tooling

Backgrounds that tend to work well:

Applied ML engineers
ML tech leads
Research engineers with production exposure
Engineers from decisioning, risk, pricing, trust & safety, or ranking systems

LLM experience

Hands-on experience with LLMs is highly relevant, especially around:

LLM evaluation and benchmarking
RAG evaluation
Hallucination or grounding checks
Prompt or system regression testing
Human-in-the-loop review workflows

That said, evaluation maturity matters more than novelty.

Why this role is interesting

You’ll shape how AI systems are measured and trusted, not just how they’re built
You’ll work on problems where mistakes matter
You’ll influence real decisions, not vanity metrics
You’ll have autonomy to define standards, not just follow them

Compensation & location

Senior individual contributor level
Total compensation is highly competitive and aligned with top-tier technology firms
Location flexible within the U.S.

If you’re excited by the idea of building the yardsticks that decide whether AI systems are actually helping or quietly harming decision-making, we’d love to hear from you.

Data Partner / Senior DP

