Member of Research Staff, Inference

New York · San Francisco · Full-time · $150k–$350k · Posted Jul 5, 2026

About Us:

AI needs a new infrastructure layer. We're building it at Modal.

Every era of computing brought new workloads that previous infrastructure couldn't support: mainframes, databases, and the cloud. Each time, the company that rebuilt the layer underneath defined the decade. AI is no different, except it touches everything instead of one slice, and the window to build the layer underneath it is open right now.

Our customers include category-defining companies like Lovable, Ramp, Cognition, DoorDash, and Suno. They rely on Modal for instant GPU access, sub-second container starts, and native storage, so it's simple to serve low-latency inference, fine-tune models, and access production-ready sandboxes at scale.

We recently raised a $355M Series C at a $4.65B valuation, led by General Catalyst and Redpoint Ventures. We've crossed $300M in ARR and grown fivefold since September.

Our team includes creators of popular open-source projects (e.g., Seaborn, Luigi), academic researchers, international olympiad medalists, and engineering and product leaders with decades of experience.

The Role:

Most of the value of owning a model shows up at serving time. We're building a platform that covers the whole life of an LLM -- train it, deploy it, observe it -- and inference is where teams feel the difference every day. We already run elastic inference, sandboxes, distributed volumes, and multi-node training, and we control the infrastructure underneath, so the serving stack is ours to shape rather than something we resell.

You will do hands-on inference research at Modal, working with the research lead to pick high-impact bets and owning them end to end. The bets that matter most are the ones that move cost per token and tail latency on the workloads our customers actually run.

What you'll do:

Own end-to-end inference research bets: speculative decoding, disaggregated prefill/decode, quantization (FP8, INT4), KV-cache and memory management, autoscaling for spiky serverless traffic, and whatever else the research agenda calls for.
Train custom speculators against real production traffic and feed what you learn back into target models -- acceptance length is the metric that decides the win.
Work directly with customers alongside our Forward Deployed Engineers to deploy and tune models, and bring what you learn back into the research.
Carry forward and expand collaborations with outside research labs, for example:
- our work with ZLab on DFlash, a speculator design built on KV injection and blockwise parallel drafting
- our work with SGLang on speculative decoding and multimodal inference performance
- our work on Flash Attention 4 kernels
Work with engineering to turn frontier serving techniques into products: primitives for disaggregation, fast weight refresh for models that keep training after deployment, observability for quality and latency in production, or even a next-generation inference engine.
Help shape the research agenda. None of the above is prescriptive; your work will help guide our future.

Requirements:

A research-leaning or systems background in LLM inference, with work you can point to.
Fluency in the LLM serving stack, from kernels and quantization up to schedulers and autoscaling.
A record of shipping research or systems that other people build on, whether in a lab or in industry.
The drive to independently take a research bet from idea to result, working in the open with the rest of the team.
Ability to work in person in our NYC or San Francisco office.