Speculative Decoding

An inference acceleration technique in which a small draft model proposes several tokens ahead and the larger target model verifies them in parallel in a single forward pass, keeping the tokens it agrees with.
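The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration of the greedy variant only, not any particular library's implementation: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, and the "models" are toy functions that map a token list to the next token. A real target model would score all drafted positions in one batched forward pass; here that pass is simulated with a loop.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain autoregressive decoding, one model call per token (the baseline)."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: the output matches greedy_decode(target, ...)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target model gives its own next token at each of the k + 1
        #    positions. A real model does this in one forward pass; we loop.
        verified = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        # 4. Keep the accepted tokens plus one target token: a correction at
        #    the first mismatch, or a bonus token if everything was accepted.
        out.extend(proposal[:n] + [verified[n]])
    return out[:limit]


def toy_target(ctx: List[Token]) -> Token:
    """Toy 'large model': a fixed arithmetic rule on the last token."""
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[Token]) -> Token:
    """Toy 'small model': agrees with the target except after token 1."""
    return 0 if ctx[-1] == 1 else toy_target(ctx)
```

Calling `speculative_decode(toy_target, toy_draft, [2], max_new=12)` yields the same tokens as `greedy_decode(toy_target, [2], 12)`, since every emitted token is either confirmed or produced by the target. Sampling-based variants replace the equality check in step 3 with a rejection-sampling rule so that the output distribution still matches the target model's.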

1. Google uses speculative decoding in Gemini serving: a smaller Gemini Nano model drafts candidate tokens that Gemini Pro verifies in one forward pass, achieving a 2-3x throughput improvement on typical outputs.
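2. Llama.cpp implements speculative decoding with a Llama 3.2 1B draft model and a Llama 3.1 70B target model, cutting per-token generation latency by a stated 40% on developer workstations with limited GPU memory. The gain comes in the decode phase; time-to-first-token is not what speculative decoding improves.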
3. Medusa (Cai et al.) extends speculative decoding by adding multiple decoding heads to the target model itself, so no separate draft model is needed. It reports a 2-3x speedup with no quality loss and is supported in vLLM for production LLM serving.
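Speedups in the 2-3x range quoted above are consistent with the standard back-of-the-envelope analysis of speculative decoding. The sketch below assumes, as a simplification, that each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft-model call costs `cost_ratio` times one target-model call. The function names and the example numbers are illustrative, not measurements from any of the systems listed.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass.

    Geometric series 1 + alpha + ... + alpha**gamma: the accepted drafted
    tokens plus the one token the target always contributes.
    """
    if alpha >= 1.0:
        return float(gamma + 1)  # every drafted token accepted, plus the bonus
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    Each step costs gamma draft calls plus one target call, measured in
    units of a single target call.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, gamma) / step_cost


# Illustrative numbers only: 80% acceptance, 4 drafted tokens per step,
# draft model 20x cheaper than the target.
speedup = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)
```

With these assumed inputs the estimate is roughly 2.8x. When the acceptance rate drops toward zero the estimate falls below 1x, because the draft calls become pure overhead.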
