Head of Inference Optimization

  • Permanent
  • $300,000 - $1,500,000
  • San Jose, California, United States
  • Data Infrastructure & MLOps

Head of Inference Optimization

Location: San Jose, CA (In-Person Only)
Type: Full-Time | Competitive Salary + Equity | Relocation + Housing Support

We’re partnering with a well-funded, stealth-stage AI systems company building next-generation infrastructure for running large AI models with unprecedented speed and efficiency.

They’ve developed a custom inference architecture specifically tailored to transformer workloads — and it’s already demonstrating significant performance advantages over leading GPU platforms. Now they’re looking for a Head of Inference Optimization to lead the team responsible for delivering production-grade, ultra-efficient inference performance at the kernel and systems level.


What You’ll Do

  • Design and implement fused, high-performance kernels tailored to transformer inference, including memory-aware scheduling, overlapped compute and data transfer, and matmul fusion strategies.
  • Build and lead a team of elite performance engineers focused on end-to-end inference optimization across the software stack.
  • Own the mapping of state-of-the-art models (e.g., Llama-3/4, DeepSeek, Qwen, Stable Diffusion) to custom silicon for maximum throughput and minimal latency.
  • Drive the development of inference-time algorithmic techniques such as speculative decoding, KV-cache offloading, and batch interleaving (a toy sketch of speculative decoding follows this list).
  • Collaborate closely with hardware, runtime, and compiler teams to drive hardware-software co-design and shape instruction-level performance strategy.
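
To give a flavor of the speculative decoding work mentioned above, here is a minimal, illustrative Python sketch of the draft-then-verify control flow. The draft_model and target_model below are hypothetical toy stand-ins (simple next-token functions), not any real model API; a production implementation would verify all drafted positions in a single batched forward pass on the accelerator.

```python
# Toy sketch of speculative decoding: a cheap draft model proposes k
# tokens, an expensive target model verifies them, and the longest
# accepted prefix (plus one corrected token) is kept. The two "models"
# here are hypothetical stand-ins, not a real LLM API.

def draft_model(tokens):
    # Cheap proposer: guesses "last token + 1".
    return (tokens[-1] + 1) % 50

def target_model(tokens):
    # "Ground truth": usually last + 1, occasionally jumps, so most
    # drafted tokens are accepted and some steps need a correction.
    t = tokens[-1]
    return (t * 7) % 50 if t % 5 == 0 else (t + 1) % 50

def speculative_decode(prompt, max_new_tokens=16, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft phase: propose k tokens with the cheap model.
        ctx, draft = list(tokens), []
        for _ in range(k):
            draft.append(draft_model(ctx))
            ctx.append(draft[-1])
        # 2. Verify phase: accept drafted tokens until the first
        #    mismatch, then substitute the target model's token so
        #    every iteration still emits at least one correct token.
        ctx = list(tokens)
        for t in draft:
            expected = target_model(ctx)
            tokens.append(t if t == expected else expected)
            ctx.append(tokens[-1])
            if t != expected:
                break
    return tokens[:len(prompt) + max_new_tokens]

if __name__ == "__main__":
    print(speculative_decode([7], max_new_tokens=8, k=4))
```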

What We’re Looking For

  • Proven experience designing and optimizing low-level inference kernels using CUDA, PTX, AVX-512, or similar.
  • Deep understanding of memory hierarchies, throughput ceilings, and interconnect performance in modern compute architectures.
  • Demonstrated history of delivering major performance wins — ideally in transformer inference or numerical compute systems.
  • Experience managing and scaling high-performing engineering teams.
  • Exposure to frameworks like FlashAttention, Triton, or vLLM is a bonus, but deep systems-level performance engineering is the priority (see the blocked-attention sketch after this list).
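
For the kernel-fusion and FlashAttention themes above, here is a minimal NumPy sketch of the underlying memory-aware idea: computing attention over blocks of query rows so the full score matrix is never materialized at once. This is an illustration of the concept only; real kernels do this tiling in on-chip SRAM and registers with the softmax fused into the matmul loop, not with NumPy arrays, and the block size here is an arbitrary illustrative choice.

```python
import numpy as np

def attention_unfused(Q, K, V):
    # Naive path: materializes the full (m x n) score matrix S.
    S = Q @ K.T
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attention_blocked(Q, K, V, block=32):
    # Memory-aware path: process query rows in blocks so only a
    # (block x n) slice of scores exists at any time. FlashAttention-
    # style kernels push this further, also tiling over K/V and
    # accumulating the softmax on-chip.
    out = np.empty((Q.shape[0], V.shape[1]))
    for i in range(0, Q.shape[0], block):
        S = Q[i:i + block] @ K.T
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        P /= P.sum(axis=-1, keepdims=True)
        out[i:i + block] = P @ V
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
    assert np.allclose(attention_unfused(Q, K, V),
                       attention_blocked(Q, K, V, block=8))
    print("blocked attention matches the naive reference")
```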

Why This Role

  • Zero-to-one impact — define and build the kernel stack for a custom inference architecture from scratch.
  • High degree of technical ownership and visibility across software and hardware teams.
  • Fully in-person, tight-knit engineering culture — no bureaucracy, all execution.
  • Competitive compensation, full benefits, $2,000/month housing subsidy, and daily in-office meals.

Ready to build the fastest inference stack in AI?
Apply now or reach out to our team to learn more.
