Head of Inference Optimization
Location: San Jose, CA (In-Person Only)
Type: Full-Time | Competitive Salary + Equity | Relocation + Housing Support
We’re partnering with a well-funded, stealth-stage AI systems company that’s building next-generation infrastructure for running large AI models with unprecedented speed and efficiency.
They’ve developed a custom inference architecture tailored specifically to transformer workloads, and it’s already demonstrating significant performance advantages over leading GPU platforms. Now they’re looking for a Head of Inference Optimization to lead the team responsible for delivering production-grade, ultra-efficient inference at the kernel and systems level.
What You’ll Do
- Design and implement fused, high-performance kernels tailored to transformer inference, including memory-aware scheduling, compute/communication overlap, and matmul fusion strategies.
- Build and lead a team of elite performance engineers focused on end-to-end inference optimization across the software stack.
- Own the mapping of state-of-the-art models (e.g., Llama-3/4, DeepSeek, Qwen, Stable Diffusion) to custom silicon for maximum throughput and minimal latency.
- Drive the development of inference-time algorithmic techniques like speculative decoding, KV-cache offloading, and batch interleaving.
- Collaborate closely with hardware, runtime, and compiler teams to drive hardware-software co-design and shape instruction-level performance strategy.
What We’re Looking For
- Proven experience designing and optimizing low-level inference kernels using CUDA, PTX, AVX-512, or similar.
- Deep understanding of memory hierarchies, throughput ceilings, and interconnect performance in modern compute architectures.
- Demonstrated history of delivering major performance wins — ideally in transformer inference or numerical compute systems.
- Experience managing and scaling high-performing engineering teams.
- Exposure to tools such as FlashAttention, Triton, or vLLM is a bonus, but deep systems-level performance engineering is the priority.
Why This Role
- Zero-to-one impact: define and build the kernel stack for a custom inference architecture from scratch.
- High degree of technical ownership and visibility across software and hardware teams.
- Fully in-person, tight-knit engineering culture: no bureaucracy, all execution.
- Competitive compensation, full benefits, $2,000/month housing subsidy, and daily in-office meals.
Ready to build the fastest inference stack in AI?
Apply now or reach out to our team to learn more.