Senior Machine Learning Evaluation Engineer (LLMs & Decision Quality)
NYC
$500,000–$900,000 total compensation
We’re working with a world-class, research-driven organization operating in a high-stakes decision-making environment to hire a Senior Machine Learning Evaluation Engineer.
This role sits at the intersection of AI, data, and judgment. The focus is not on building flashy demos or optimizing infrastructure, but on answering a harder question:
When should an AI system be trusted?
The role
You’ll be responsible for designing and owning the evaluation layer for large language models and AI systems used to support real, consequential decisions.
This includes:
- Building and maintaining golden evaluation datasets grounded in expert judgment
- Designing offline benchmarks to compare models, prompts, and retrieval strategies
- Defining quality metrics that go beyond surface-level accuracy
- Partnering closely with researchers and domain experts to translate intuition into measurable criteria
- Connecting offline evaluation results with online behavior once systems are live
- Detecting regressions, drift, and subtle failures over time
- Iterating on evaluation frameworks as models and use cases evolve
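To make the responsibilities above concrete, here is a minimal sketch of the kind of golden-dataset regression check this role would own. All names (`GoldenExample`, `score_answer`, `detect_regression`) are illustrative, not an existing API, and the keyword-overlap scorer is deliberately a surface-level proxy of the sort this role would be expected to improve on:

```python
# Minimal sketch: score model answers against a golden dataset and
# flag regressions between a baseline model and a candidate model.
from dataclasses import dataclass
from statistics import mean


@dataclass
class GoldenExample:
    prompt: str
    expected_points: list[str]  # facts an acceptable answer must mention


def score_answer(answer: str, example: GoldenExample) -> float:
    """Fraction of expected points present in the answer (surface proxy)."""
    lowered = answer.lower()
    hits = sum(1 for point in example.expected_points if point.lower() in lowered)
    return hits / len(example.expected_points)


def detect_regression(
    baseline_scores: list[float],
    candidate_scores: list[float],
    tolerance: float = 0.02,
) -> bool:
    """Flag the candidate if its mean score drops more than `tolerance`."""
    return mean(candidate_scores) < mean(baseline_scores) - tolerance
```

In practice the scorer would be far richer (rubric-based, LLM-assisted, or expert-labeled), but the shape is the same: a fixed expert-grounded dataset, a quality metric, and an automated gate that catches drops before they reach production.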
You’ll operate as a lead individual contributor with real influence over how AI quality is defined, measured, and enforced.
What this role is not
- Not an ML infrastructure or serving role
- Not a prompt-engineering or chatbot UX role
- Not pure research with no production ownership
- Not dashboard-only analytics
This is a hands-on engineering role focused on evaluation, trust, and decision quality.
What we’re looking for
We’re looking for someone who has actually owned model evaluation, not just consumed metrics.
Strong signals include:
- Experience designing offline evaluation or experimentation frameworks
- Deep understanding of the mismatch between offline metrics and online behavior, and how to manage it
- Ownership of model monitoring, drift detection, or regression analysis
- Comfort working with ambiguous, qualitative outputs (e.g., free-form LLM responses)
- Experience partnering with domain experts or stakeholders to define “what good looks like”
- Strong Python and data skills; comfort building lightweight pipelines and analysis tooling
Backgrounds that tend to work well:
- Applied ML engineers
- ML tech leads
- Research engineers with production exposure
- Engineers from decisioning, risk, pricing, trust & safety, or ranking systems
LLM experience
Hands-on experience with LLMs is highly relevant, especially around:
- LLM evaluation and benchmarking
- RAG evaluation
- Hallucination or grounding checks
- Prompt or system regression testing
- Human-in-the-loop review workflows
That said, evaluation maturity matters more than novelty.
Why this role is interesting
- You’ll shape how AI systems are measured and trusted, not just how they’re built
- You’ll work on problems where mistakes matter
- You’ll influence real decisions, not vanity metrics
- You’ll have autonomy to define standards, not just follow them
Compensation & location
- Senior individual contributor level
- Total compensation is highly competitive and aligned with top-tier technology firms
- NYC preferred; location flexible within the U.S.
If you’re excited by the idea of building the yardsticks that decide whether AI systems are actually helping or quietly harming decision-making, we’d love to hear from you.