Head of Inference Optimization
Location: San Jose, CA (In-Person Only)
Type: Full-Time | Competitive Salary + Equity | Relocation + Housing Support
We’re partnering with a well-funded, stealth-stage AI systems company that’s building next-generation infrastructure for running large AI models with unprecedented speed and efficiency.
They’ve developed a custom inference architecture tailored specifically to transformer workloads, and it’s already demonstrating significant performance advantages over leading GPU platforms. Now they’re looking for a Head of Inference Optimization to lead the team responsible for delivering production-grade, ultra-efficient inference at the kernel and systems level.
What You’ll Do
- Design and implement fused, high-performance kernels tailored to transformer inference, including memory-aware scheduling, compute/communication overlap, and matmul fusion strategies.
- Build and lead a team of elite performance engineers focused on end-to-end inference optimization across the software stack.
- Own the mapping of state-of-the-art models (e.g., Llama-3/4, DeepSeek, Qwen, Stable Diffusion) to custom silicon for maximum throughput and minimal latency.
- Drive the development of inference-time algorithmic techniques like speculative decoding, KV-cache offloading, and batch interleaving.
- Collaborate closely with hardware, runtime, and compiler teams to drive hardware-software co-design and shape instruction-level performance strategy.
What We’re Looking For
- Proven experience designing and optimizing low-level inference kernels using CUDA, PTX, AVX-512, or similar.
- Deep understanding of memory hierarchies, throughput ceilings, and interconnect performance in modern compute architectures.
- Demonstrated history of delivering major performance wins — ideally in transformer inference or numerical compute systems.
- Experience managing and scaling high-performing engineering teams.
- Exposure to tools such as FlashAttention, Triton, or vLLM is a bonus, but deep systems-level performance engineering is the priority.
Why This Role
- Zero-to-one impact: define and build the kernel stack for a custom inference architecture from scratch.
- High degree of technical ownership and visibility across software and hardware teams.
- Fully in-person, tight-knit engineering culture: no bureaucracy, all execution.
- Competitive compensation, full benefits, $2,000/month housing subsidy, and daily in-office meals.
Ready to build the fastest inference stack in AI?
Apply now or reach out to our team to learn more.