1.7: Optimization Landscape — Phase-Specific Strategies

Community Article Published February 2, 2026

The Optimization Framework

Now that we understand WHY prefill is compute-bound and decode is memory-bound, we can understand HOW various optimization techniques work. Each technique attacks a specific bottleneck in a specific phase.

The fundamental insight from our analysis:

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE TWO BOTTLENECKS                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  PREFILL (2% of time)              DECODE (98% of time)                │
│  ─────────────────────             ────────────────────                 │
│  Bottleneck: Compute               Bottleneck: Memory bandwidth         │
│  Arithmetic intensity: HIGH        Arithmetic intensity: LOW            │
│  GPU utilization: HIGH             GPU utilization: LOW                 │
│                                                                         │
│  To speed up prefill:              To speed up decode:                  │
│  → Do less computation             → Read less data from memory         │
│  → Compute faster                  → Read data faster                   │
│  → Better parallelization          → Do more work per byte read         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Let's examine how each major optimization technique maps to these bottlenecks.


Taxonomy of Optimizations

┌─────────────────────────────────────────────────────────────────────────┐
│                    OPTIMIZATION TAXONOMY                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  PRIMARILY TARGETS DECODE (memory-bound)                                │
│  ────────────────────────────────────────                               │
│  • Batching / Continuous batching     [increases arithmetic intensity]  │
│  • Quantization                       [reduces bytes to read]           │
│  • Speculative decoding               [amortizes memory reads]          │
│  • KV cache compression (MQA/GQA/MLA) [reduces cache reads]            │
│  • PagedAttention                     [better memory management]        │
│                                                                         │
│  PRIMARILY TARGETS PREFILL (compute-bound)                              │
│  ─────────────────────────────────────────                              │
│  • FlashAttention                     [reduces memory traffic]          │
│  • Chunked prefill                    [scheduling optimization]         │
│  • Prompt caching                     [avoids redundant compute]        │
│                                                                         │
│  TARGETS BOTH PHASES                                                    │
│  ───────────────────                                                    │
│  • Tensor parallelism                 [distributes work across GPUs]    │
│  • Pipeline parallelism               [overlaps compute stages]         │
│  • Better hardware                    [more bandwidth + more compute]   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Now let's examine each technique in detail.


Decode Optimizations (The High-Impact Zone)

Since decode accounts for 98% of inference time, these optimizations have the greatest impact on end-to-end latency and throughput.

1. Batching: Increasing Arithmetic Intensity

The Problem: During decode, we read 14 GB of weights to process 1 token. Arithmetic intensity is 0.5 FLOPs/byte—far below the 156 FLOPs/byte threshold.

The Solution: Process multiple requests simultaneously. If we batch 32 requests together, we read the weights once and use them for 32 tokens instead of 1.

┌─────────────────────────────────────────────────────────────────────────┐
│                         BATCHING EFFECT                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  WITHOUT BATCHING (batch_size = 1):                                     │
│  ──────────────────────────────────                                     │
│  Read 14 GB weights → process 1 token → output 1 token                  │
│  Arithmetic intensity: 0.5 FLOPs/byte                                   │
│  Time: 10 ms for 1 token = 100 tokens/sec                              │
│                                                                         │
│  WITH BATCHING (batch_size = 32):                                       │
│  ─────────────────────────────────                                      │
│  Read 14 GB weights → process 32 tokens → output 32 tokens              │
│  Arithmetic intensity: 32 × 0.5 = 16 FLOPs/byte                        │
│  Time: ~12 ms for 32 tokens = 2,666 tokens/sec                         │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Throughput improvement: 26.7× with only 1.2× latency increase!        │
│                                                                         │
│  Why it works:                                                          │
│  • Weight reads are AMORTIZED across batch                              │
│  • Matrix multiply [B, d] @ [d, d] instead of [1, d] @ [d, d]          │
│  • Same memory bandwidth, B× more computation                           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

The arithmetic intensity scales linearly with batch size:

Batch size:    1      8      32     64     128    256
Arith. int.:   0.5    4      16     32     64     128   FLOPs/byte
Status:        ✗      ✗      ✗      ✗      ✗      ~threshold

Even batch_size=256 barely reaches compute-bound territory.
This shows how severely memory-bound single-request decode is.
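
These scaling numbers are easy to reproduce. A minimal sketch, reusing the figures from the analysis above (~0.5 FLOPs/byte per sequence, a 14 GB FP16 model, and the ~156 FLOPs/byte ridge point from 312 TFLOPS / 2 TB/s); these are assumptions carried over from earlier, not measurements:

    # Decode roofline scaling with batch size, using the same assumptions as above.
    WEIGHT_BYTES = 14e9             # FP16 weights streamed once per decode step
    PEAK_FLOPS = 312e12             # A100-class peak compute
    PEAK_BW = 2e12                  # A100-class memory bandwidth (bytes/s)
    RIDGE = PEAK_FLOPS / PEAK_BW    # ~156 FLOPs/byte
    INTENSITY_PER_SEQ = 0.5         # FLOPs/byte per sequence, from the earlier analysis

    for batch in [1, 8, 32, 64, 128, 256]:
        intensity = INTENSITY_PER_SEQ * batch
        bound = "compute" if intensity >= RIDGE else "memory"
        # while memory-bound, step time stays pinned near WEIGHT_BYTES / PEAK_BW (~7 ms)
        print(f"batch={batch:4d}  intensity={intensity:6.1f} FLOPs/byte  -> {bound}-bound")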

Continuous Batching: Rather than waiting for a batch to complete, modern serving systems dynamically add/remove requests from the batch as they start/finish. This maximizes GPU utilization by keeping the batch full.

Traditional batching:            Continuous batching:

Req 1: ████████████░░░░          Req 1: ████████████
Req 2: ████████████████          Req 2: ████████████████
Req 3: ████░░░░░░░░░░░░          Req 3: ████
       ↑                         Req 4:     ████████████
       Padding waste!            Req 5:         ████████
                                        ↑
                                        No wasted slots!
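
A toy version of that scheduling loop makes the slot-refill idea concrete (the request dicts, field names, and batch size here are illustrative only, not any serving framework's API):

    from collections import deque

    # Continuous batching sketch: whenever a request finishes, its slot is refilled
    # from the waiting queue before the next decode step, keeping the batch full.
    def serve(requests, max_batch=32):
        waiting = deque(requests)            # each request: {"tokens_left": int}
        running, steps = [], 0
        while waiting or running:
            while waiting and len(running) < max_batch:
                running.append(waiting.popleft())   # refill freed slots immediately
            for req in running:                     # one decode step for the whole batch
                req["tokens_left"] -= 1
            running = [r for r in running if r["tokens_left"] > 0]
            steps += 1
        return steps

    # 100 requests with very different output lengths share slots with no padding
    print(serve([{"tokens_left": n % 50 + 1} for n in range(100)]))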

2. Quantization: Reducing Bytes to Read

The Problem: We must read 14 GB of weights every decode step. Memory bandwidth is the bottleneck.

The Solution: Represent weights with fewer bits. Instead of 16-bit floats (FP16), use 8-bit integers (INT8) or 4-bit integers (INT4).

┌─────────────────────────────────────────────────────────────────────────┐
│                       QUANTIZATION EFFECT                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Model: LLaMA-7B                                                        │
│                                                                         │
│  FP16 (baseline):                                                       │
│    Model size: 14 GB                                                    │
│    Decode step time: 14 GB / 2 TB/s = 7 ms                             │
│                                                                         │
│  INT8 quantization:                                                     │
│    Model size: 7 GB (2× smaller)                                        │
│    Decode step time: 7 GB / 2 TB/s = 3.5 ms (2× faster)                │
│                                                                         │
│  INT4 quantization:                                                     │
│    Model size: 3.5 GB (4× smaller)                                      │
│    Decode step time: 3.5 GB / 2 TB/s = 1.75 ms (4× faster)             │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Why it works for decode:                                               │
│  • Decode is memory-bound → read less = go faster                       │
│  • Same computation, fewer bytes                                        │
│  • Arithmetic intensity increases (same FLOPs, fewer bytes)            │
│                                                                         │
│  Trade-off:                                                             │
│  • Some accuracy loss (model quality degrades)                          │
│  • Careful calibration required                                         │
│  • INT4 approaches limits of acceptable quality loss                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Quantization impact by phase:

Phase     Effect of Quantization                        Why
─────     ──────────────────────                        ───
Decode    MAJOR speedup (near-linear with compression)  Memory-bound → fewer bytes = faster
Prefill   Moderate speedup                              Compute-bound → faster memory helps somewhat,
                                                        but compute is the bottleneck
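
The decode step times in the box are just weight bytes divided by bandwidth. A small sketch of that estimate (same 7B-parameter / 2 TB/s assumptions; KV cache and activation traffic are ignored, and real kernels add dequantization overhead):

    # Rough decode-step time = weight bytes / memory bandwidth, per weight precision.
    PARAMS = 7e9
    BANDWIDTH = 2e12                       # bytes/s

    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        weight_bytes = PARAMS * bits / 8
        step_ms = weight_bytes / BANDWIDTH * 1e3
        print(f"{name}: {weight_bytes / 1e9:4.1f} GB weights -> ~{step_ms:.2f} ms per decode step")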

3. Speculative Decoding: Amortizing Memory Reads

The Problem: Each decode step reads the full model (14 GB) but produces just 1 token. We pay the memory bandwidth cost 200 times to generate 200 tokens.

The Solution: Generate multiple tokens per "verification" step. Use a small, fast "draft" model to propose several tokens, then verify them in parallel with the large model.

┌─────────────────────────────────────────────────────────────────────────┐
│                    SPECULATIVE DECODING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STANDARD DECODING:                                                     │
│  ──────────────────                                                     │
│  Step 1: Read 14GB → generate token 1                                   │
│  Step 2: Read 14GB → generate token 2                                   │
│  Step 3: Read 14GB → generate token 3                                   │
│  Step 4: Read 14GB → generate token 4                                   │
│  ...                                                                    │
│  Total for 4 tokens: 4 × 14GB = 56 GB read                             │
│                                                                         │
│  SPECULATIVE DECODING:                                                  │
│  ─────────────────────                                                  │
│  Step 1: Draft model proposes: [tok1, tok2, tok3, tok4] (fast, small)  │
│  Step 2: Large model verifies all 4 in ONE forward pass                │
│          Read 14GB → verify 4 tokens in parallel                        │
│  Step 3: Accept verified tokens (say 3 of 4 accepted)                   │
│                                                                         │
│  Total for 3 tokens: ~14 GB read (+ small draft overhead)              │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Why it works:                                                          │
│  • Verification is like a mini-prefill (parallel, compute-bound)       │
│  • Multiple tokens verified per weight read                             │
│  • Draft model is small enough to be fast (fits in cache)              │
│  • Acceptance rate determines speedup                                   │
│                                                                         │
│  Speedup depends on:                                                    │
│  • Draft model quality (higher acceptance = better)                     │
│  • Draft model speed (smaller = faster)                                 │
│  • Domain match (draft trained on similar data)                         │
│                                                                         │
│  Typical speedup: 2-3× for well-matched draft models                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

The key insight: Speculative decoding converts multiple sequential decode steps into fewer parallel verification steps. Each verification is more like prefill (parallel tokens, higher arithmetic intensity) than like decode.
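
A standard back-of-the-envelope model makes the acceptance-rate dependence concrete. It assumes each of k drafted tokens is accepted independently with probability p and that a draft step costs a small fixed fraction of a target step (both assumptions, not measurements):

    # Expected tokens per verification pass with k drafted tokens and acceptance
    # probability p is 1 + p + p^2 + ... + p^k (the accepted run plus the one token
    # the target model always contributes). Dividing by the cost of one verify pass
    # plus k draft steps gives a rough speedup estimate.
    def speculative_speedup(p, k=4, draft_cost=0.05):
        expected_tokens = sum(p**i for i in range(k + 1))
        cost_in_target_steps = 1 + k * draft_cost
        return expected_tokens / cost_in_target_steps

    for p in (0.6, 0.8, 0.9):
        print(f"acceptance {p:.1f}: ~{speculative_speedup(p):.1f}x")   # ~1.9x, ~2.8x, ~3.4x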


4. KV Cache Compression (MQA, GQA, MLA)

The Problem: The KV cache grows with sequence length and must be read every decode step. For long sequences, KV cache reads become significant.

The Solution: Share Key and Value heads across multiple Query heads, reducing what must be stored and read.

┌─────────────────────────────────────────────────────────────────────────┐
│                    KV CACHE COMPRESSION TECHNIQUES                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  MULTI-HEAD ATTENTION (MHA) - Standard                                  │
│  ─────────────────────────────────────                                  │
│  32 Query heads, 32 Key heads, 32 Value heads                           │
│  KV cache per token: 32 × 2 × 128 × 2 bytes = 16 KB                    │
│  For 4K context: 64 MB per layer, 2 GB total                           │
│                                                                         │
│  MULTI-QUERY ATTENTION (MQA)                                            │
│  ───────────────────────────                                            │
│  32 Query heads, 1 Key head, 1 Value head (shared!)                    │
│  KV cache per token: 1 × 2 × 128 × 2 bytes = 0.5 KB                    │
│  For 4K context: 2 MB per layer, 64 MB total                           │
│  Reduction: 32× smaller cache!                                          │
│                                                                         │
│  GROUPED-QUERY ATTENTION (GQA)                                          │
│  ─────────────────────────────                                          │
│  32 Query heads, 8 Key heads, 8 Value heads (balanced)                 │
│  KV cache per token: 8 × 2 × 128 × 2 bytes = 4 KB                      │
│  For 4K context: 16 MB per layer, 512 MB total                         │
│  Reduction: 4× smaller cache                                            │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Visual comparison:                                                     │
│                                                                         │
│  MHA:  Q₁ Q₂ Q₃ Q₄ ... Q₃₂     Each Q has its own K,V                  │
│        ↓  ↓  ↓  ↓      ↓                                               │
│        K₁ K₂ K₃ K₄ ... K₃₂                                             │
│        V₁ V₂ V₃ V₄ ... V₃₂                                             │
│                                                                         │
│  GQA:  Q₁ Q₂ Q₃ Q₄ | Q₅ Q₆ Q₇ Q₈ | ...   Groups of Q share K,V        │
│           ↓              ↓                                              │
│           K₁             K₂         (8 groups)                         │
│           V₁             V₂                                             │
│                                                                         │
│  MQA:  Q₁ Q₂ Q₃ Q₄ ... Q₃₂         All Q share one K,V                 │
│              ↓                                                          │
│              K₁  (single)                                               │
│              V₁                                                         │
│                                                                         │
│  Trade-off: Smaller cache = faster decode, but potential quality loss  │
│  GQA is the popular middle ground (used in LLaMA-2, Mistral, etc.)    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Impact on decode:

For a 4K context sequence, decode step with MHA reads ~2 GB of KV cache. With GQA (8 groups), this drops to ~512 MB. With MQA, just ~64 MB. Since decode is memory-bound, reading less cache directly translates to faster decode.
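
The cache sizes above fall straight out of the head counts. A small calculator under the same assumptions as the box (32 layers, head dimension 128, FP16):

    # KV cache bytes = kv_heads * 2 (K and V) * head_dim * dtype_bytes, per token per layer.
    def kv_cache_gib(n_kv_heads, seq_len, n_layers=32, head_dim=128, dtype_bytes=2):
        per_token_per_layer = n_kv_heads * 2 * head_dim * dtype_bytes
        return per_token_per_layer * seq_len * n_layers / 2**30

    for name, kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
        print(f"{name:6s}: {kv_cache_gib(kv_heads, seq_len=4096):.3f} GiB at 4K context")
    # MHA: 2.000, GQA-8: 0.500, MQA: 0.062 -- matching the ~2 GB / 512 MB / 64 MB figures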


5. PagedAttention: Better Memory Management

The Problem: KV cache memory is typically allocated contiguously for each request, leading to fragmentation and waste when sequences have variable lengths.

The Solution: Manage KV cache like virtual memory—allocate in small, non-contiguous "pages" that can be dynamically assigned.

┌─────────────────────────────────────────────────────────────────────────┐
│                         PAGEDATTENTION                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  TRADITIONAL KV CACHE ALLOCATION:                                       │
│  ─────────────────────────────────                                      │
│  GPU Memory:                                                            │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │ Req1 KV (4K reserved) │ Req2 KV (4K reserved) │░░░WASTED░░░░░░│    │
│  │ [████████░░░░░░░░░░░] │ [██████████████░░░░░] │░░░░░░░░░░░░░░░│    │
│  │  2K used    2K waste  │  3.5K used  0.5K waste│  Fragmentation│    │
│  └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  Problem: Must reserve max_seq_len for each request, wasting memory.   │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  PAGEDATTENTION:                                                        │
│  ───────────────                                                        │
│  GPU Memory (paged):                                                    │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │ P1 │ P2 │ P3 │ P4 │ P5 │ P6 │ P7 │ P8 │ P9 │... │FREE│FREE│   │    │
│  └────────────────────────────────────────────────────────────────┘    │
│     ↓    ↓    ↓    ↓    ↓    ↓    ↓                                    │
│   Req1 Req1 Req2 Req1 Req2 Req2 Req2                                    │
│                                                                         │
│  Page table:                                                            │
│  Req1: [P1, P2, P4]         (3 pages, grows as needed)                 │
│  Req2: [P3, P5, P6, P7]     (4 pages, non-contiguous is fine)         │
│                                                                         │
│  Benefits:                                                              │
│  • Near-zero memory waste (allocate only what's needed)                │
│  • Higher batch sizes possible (more requests fit in memory)           │
│  • Memory sharing for common prefixes (beam search, system prompts)    │
│                                                                         │
│  Impact on decode:                                                      │
│  • Indirect: enables larger batches → higher arithmetic intensity      │
│  • Direct: no wasted memory reads for unused cache slots               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

PagedAttention (from the vLLM paper) is now standard in production serving systems. It doesn't change the fundamental compute or memory access patterns, but it enables running larger batches by using memory more efficiently.
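
A toy block-table allocator shows the bookkeeping involved (a sketch of the idea only; vLLM's real data structures and page sizes differ):

    # Paged KV-cache sketch: each request owns a list of page indices; pages are
    # handed out on demand and returned to a free pool when the request finishes.
    class PagedKVCache:
        def __init__(self, num_pages, page_size=16):     # page_size = tokens per page
            self.free = list(range(num_pages))
            self.page_size = page_size
            self.tables = {}                             # request id -> [page indices]
            self.lengths = {}                            # request id -> tokens stored

        def append_token(self, req_id):
            n = self.lengths.get(req_id, 0)
            if n % self.page_size == 0:                  # current page full (or first token)
                self.tables.setdefault(req_id, []).append(self.free.pop())
            self.lengths[req_id] = n + 1

        def release(self, req_id):
            self.free.extend(self.tables.pop(req_id, []))
            self.lengths.pop(req_id, None)

    cache = PagedKVCache(num_pages=64)
    for _ in range(20):
        cache.append_token("req1")                       # 20 tokens -> 2 pages of 16
    print(cache.tables["req1"])                          # non-contiguous pages are fine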


Prefill Optimizations

While decode dominates total time, prefill optimizations still matter for time-to-first-token (TTFT) and for very long prompt scenarios.

1. FlashAttention: Memory-Efficient Attention

The Problem: Standard attention materializes the full N×N attention matrix, which for long sequences requires massive memory and bandwidth.

The Solution: Compute attention in tiles, never materializing the full matrix. Keep intermediate results in fast SRAM instead of slow HBM.

┌─────────────────────────────────────────────────────────────────────────┐
│                         FLASHATTENTION                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STANDARD ATTENTION:                                                    │
│  ───────────────────                                                    │
│  1. Compute S = Q @ Kᵀ           [N, N] matrix in HBM                  │
│  2. Write S to HBM                                                      │
│  3. Read S from HBM                                                     │
│  4. Compute P = softmax(S)       [N, N] matrix in HBM                  │
│  5. Write P to HBM                                                      │
│  6. Read P from HBM                                                     │
│  7. Compute O = P @ V            [N, d] output                         │
│                                                                         │
│  Memory: O(N²) — prohibitive for long sequences                        │
│  HBM reads/writes: Multiple round trips for N×N matrices               │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  FLASHATTENTION:                                                        │
│  ───────────────                                                        │
│  Process in tiles that fit in SRAM:                                     │
│                                                                         │
│  for each Q_tile (fits in SRAM):                                        │
│      for each K_tile, V_tile:                                           │
│          S_tile = Q_tile @ K_tileᵀ     [in SRAM, never hits HBM]       │
│          P_tile = softmax(S_tile)       [in SRAM]                       │
│          O_tile += P_tile @ V_tile      [accumulate in SRAM]           │
│      write O_tile to HBM               [only final output]             │
│                                                                         │
│  Memory: O(N) — linear, not quadratic!                                 │
│  HBM reads/writes: Minimal (just Q, K, V in, O out)                    │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Impact:                                                                │
│  • 2-4× faster prefill (reduced memory traffic)                        │
│  • Enables much longer context (no N² memory explosion)                │
│  • Makes prefill more compute-bound (less memory overhead)             │
│                                                                         │
│  Note: FlashAttention is now standard in all major frameworks          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Why FlashAttention helps prefill more than decode:

Prefill has large attention matrices (N×N where N is prompt length). The memory savings from FlashAttention are significant. Decode has tiny attention matrices (1×seq_len), so FlashAttention helps less—the bottleneck is weight loading, not attention computation.
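
The tiling loop in the box can be written out in NumPy to show the "online softmax" trick that makes it work: keep a running row max and running denominator, and rescale the accumulated output whenever the max changes. This is an algorithmic sketch only (no causal mask, no fused GPU kernel):

    import numpy as np

    def tiled_attention(Q, K, V, tile=128):
        """softmax(Q @ K.T / sqrt(d)) @ V, streaming K/V tiles without ever
        materializing the full N x N score matrix."""
        N, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        O = np.zeros_like(Q)
        row_max = np.full(N, -np.inf)                  # running max per query row
        row_sum = np.zeros(N)                          # running softmax denominator

        for j in range(0, N, tile):                    # one K/V tile at a time ("in SRAM")
            S = (Q @ K[j:j+tile].T) * scale            # [N, tile] scores for this tile only
            new_max = np.maximum(row_max, S.max(axis=1))
            correction = np.exp(row_max - new_max)     # rescale what was accumulated so far
            P = np.exp(S - new_max[:, None])
            O = O * correction[:, None] + P @ V[j:j+tile]
            row_sum = row_sum * correction + P.sum(axis=1)
            row_max = new_max
        return O / row_sum[:, None]

    # sanity check against naive attention that builds the full N x N matrix
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
    S = Q @ K.T / np.sqrt(64)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    print(np.allclose(tiled_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V))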


2. Prompt Caching / Prefix Caching

The Problem: Many requests share common prefixes (system prompts, few-shot examples, document context). Each request recomputes KV cache for this shared content.

The Solution: Cache the KV values for common prefixes and reuse them across requests.

┌─────────────────────────────────────────────────────────────────────────┐
│                        PROMPT CACHING                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  WITHOUT PROMPT CACHING:                                                │
│  ───────────────────────                                                │
│  Request 1: [System prompt (500 tok)] + [User query A (50 tok)]        │
│             ↓                                                           │
│             Prefill ALL 550 tokens, compute KV for all                 │
│                                                                         │
│  Request 2: [System prompt (500 tok)] + [User query B (50 tok)]        │
│             ↓                                                           │
│             Prefill ALL 550 tokens again! Redundant work.              │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  WITH PROMPT CACHING:                                                   │
│  ────────────────────                                                   │
│  First request: [System prompt (500 tok)] + [User query A (50 tok)]    │
│                  ↓                                                      │
│                  Prefill all 550, CACHE KV for system prompt           │
│                                                                         │
│  Subsequent:    [System prompt] + [User query B (50 tok)]              │
│                  ↓              ↓                                       │
│                  Load KV cache  Prefill only 50 new tokens             │
│                  (instant!)     (10× less prefill work)                │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Impact:                                                                │
│  • TTFT dramatically reduced for repeated prefixes                     │
│  • Great for chatbots (system prompts), RAG (document context)         │
│  • Memory cost: must store cached KV in GPU/CPU memory                 │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
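
A minimal sketch of the lookup side of this (the class, method names, and exact-prefix matching are illustrative; production systems typically match cached prefixes at KV-page granularity):

    # Toy prefix cache: map a tuple of prompt tokens to its precomputed KV state,
    # then prefill only the part of each new prompt that the cache doesn't cover.
    class PrefixCache:
        def __init__(self):
            self.store = {}                              # tuple(tokens) -> KV blob (opaque here)

        def add(self, tokens, kv_state):
            self.store[tuple(tokens)] = kv_state

        def longest_prefix(self, tokens):
            for n in range(len(tokens), 0, -1):          # longest match wins
                hit = self.store.get(tuple(tokens[:n]))
                if hit is not None:
                    return hit, n
            return None, 0

    cache = PrefixCache()
    system_prompt = list(range(500))                     # stand-in for 500 system-prompt tokens
    cache.add(system_prompt, kv_state="kv-for-system-prompt")

    prompt = system_prompt + list(range(1000, 1050))     # same prefix + 50 new user tokens
    kv, covered = cache.longest_prefix(prompt)
    print(covered, len(prompt) - covered)                # 500 reused, only 50 left to prefill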

3. Chunked Prefill

The Problem: In serving systems processing mixed prefill and decode, a long prefill can block decode steps for other requests, causing latency spikes.

The Solution: Break prefill into chunks and interleave with decode steps from other requests.

┌─────────────────────────────────────────────────────────────────────────┐
│                        CHUNKED PREFILL                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  WITHOUT CHUNKING:                                                      │
│  ─────────────────                                                      │
│  Time: ──────────────────────────────────────────────────────────►      │
│                                                                         │
│  GPU:  [═══════ PREFILL (2000 tokens, 200ms) ═══════][decode][decode]  │
│                                                       ↑                 │
│                     Request B's decode waits 200ms ──┘                  │
│                     (latency spike!)                                    │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  WITH CHUNKED PREFILL:                                                  │
│  ─────────────────────                                                  │
│  Time: ──────────────────────────────────────────────────────────►      │
│                                                                         │
│  GPU:  [prefill][decode][prefill][decode][prefill][decode][prefill]    │
│        chunk 1  Req B   chunk 2  Req B   chunk 3  Req B   chunk 4      │
│                 ↑                ↑                ↑                     │
│                 No long waits! Decode interleaved                      │
│                                                                         │
│  Impact:                                                                │
│  • Smoother latency distribution                                        │
│  • Decode requests not blocked by large prefills                       │
│  • Small overhead from chunking boundaries                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
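
A toy scheduler conveys the interleaving (the chunk size and the strict alternation are illustrative choices, not a specific framework's policy):

    # Chunked prefill sketch: split a long prompt into fixed-size chunks and run one
    # batched decode step between chunks so ongoing requests never wait ~200 ms.
    def interleave(prefill_tokens=2000, chunk=256, pending_decode_steps=8):
        timeline, remaining, decodes = [], prefill_tokens, pending_decode_steps
        while remaining > 0 or decodes > 0:
            if remaining > 0:
                timeline.append(f"prefill[{min(chunk, remaining)}]")
                remaining -= chunk
            if decodes > 0:
                timeline.append("decode")
                decodes -= 1
        return timeline

    print(" -> ".join(interleave()))
    # prefill[256] -> decode -> prefill[256] -> decode -> ... -> prefill[208] -> decode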

Hardware Considerations

The optimal hardware differs based on which phase you're optimizing for:

┌─────────────────────────────────────────────────────────────────────────┐
│                    HARDWARE FOR EACH PHASE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  FOR PREFILL (compute-bound):                                           │
│  ────────────────────────────                                           │
│  Want: More FLOPS                                                       │
│  • Higher compute throughput (TFLOPS)                                   │
│  • More tensor cores                                                    │
│  • Example: H100 (989 TFLOPS) vs A100 (312 TFLOPS) = 3.2× prefill      │
│                                                                         │
│  FOR DECODE (memory-bound):                                             │
│  ───────────────────────────                                            │
│  Want: More memory bandwidth                                            │
│  • Higher GB/s bandwidth                                                │
│  • More memory channels                                                 │
│  • Example: H100 (3.35 TB/s) vs A100 (2 TB/s) = 1.67× decode           │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  GPU COMPARISON FOR LLM INFERENCE:                                      │
│                                                                         │
│  │ GPU      │ Compute    │ Bandwidth │ Prefill │ Decode │              │
│  │          │ (TFLOPS)   │ (TB/s)    │ Speed   │ Speed  │              │
│  ├──────────┼────────────┼───────────┼─────────┼────────┤              │
│  │ A100     │ 312        │ 2.0       │ 1.0×    │ 1.0×   │              │
│  │ H100 SXM │ 989        │ 3.35      │ 3.2×    │ 1.7×   │              │
│  │ H100 NVL │ 835        │ 3.9       │ 2.7×    │ 2.0×   │              │
│  │ AMD MI300│ 1300       │ 5.3       │ 4.2×    │ 2.7×   │              │
│                                                                         │
│  Notice: Compute scales faster than bandwidth across generations.       │
│  This means decode remains the bottleneck even with newer GPUs.        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
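
The per-phase speedups in the table follow from a simple roofline estimate: prefill time scales with FLOPs / peak compute, decode step time with bytes / bandwidth. A sketch using the table's published peak numbers (idealized, so real gains will be somewhat lower):

    # Idealized phase speedups vs A100, from peak specs alone.
    GPUS = {                         # name: (peak FP16 TFLOPS, memory bandwidth TB/s)
        "A100":      (312,  2.0),
        "H100 SXM":  (989,  3.35),
        "H100 NVL":  (835,  3.9),
        "AMD MI300": (1300, 5.3),
    }
    base_tflops, base_tbs = GPUS["A100"]
    for name, (tflops, tbs) in GPUS.items():
        prefill_speedup = tflops / base_tflops   # compute-bound phase scales with FLOPS
        decode_speedup = tbs / base_tbs          # memory-bound phase scales with bandwidth
        print(f"{name:9s}: prefill {prefill_speedup:3.1f}x, decode {decode_speedup:3.1f}x")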

Putting It Together: Optimization Decision Framework

┌─────────────────────────────────────────────────────────────────────────┐
│              OPTIMIZATION DECISION FRAMEWORK                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STEP 1: PROFILE YOUR WORKLOAD                                          │
│  ─────────────────────────────                                          │
│  • What's your prompt length distribution?                              │
│  • What's your output length distribution?                              │
│  • What's your batch size / concurrent requests?                        │
│  • What's your latency vs throughput priority?                          │
│                                                                         │
│  STEP 2: IDENTIFY YOUR BOTTLENECK                                       │
│  ────────────────────────────────                                       │
│                                                                         │
│       Long prompts + short outputs?          Short/medium prompts +     │
│       ────────────────────────────           long outputs?              │
│       Prefill may be significant             ──────────────────────     │
│       (but decode still matters)             Decode dominates           │
│              │                                      │                   │
│              ▼                                      ▼                   │
│       Consider:                              Consider:                  │
│       • FlashAttention                       • Batching                 │
│       • Prompt caching                       • Quantization             │
│       • Chunked prefill                      • Speculative decoding     │
│       • Tensor parallelism                   • GQA/MQA models           │
│                                              • PagedAttention           │
│                                                                         │
│  STEP 3: APPLY OPTIMIZATIONS IN ORDER OF IMPACT                         │
│  ──────────────────────────────────────────────                         │
│                                                                         │
│  ALWAYS DO (baseline optimizations):                                    │
│  1. FlashAttention (no downside)                                        │
│  2. Efficient batching / continuous batching                            │
│  3. PagedAttention for memory efficiency                                │
│                                                                         │
│  FOR THROUGHPUT (maximize tokens/sec):                                  │
│  4. Increase batch size as much as possible                             │
│  5. Quantization (INT8 or INT4)                                         │
│  6. Tensor parallelism across GPUs                                      │
│                                                                         │
│  FOR LATENCY (minimize time per request):                               │
│  4. Speculative decoding                                                │
│  5. Quantization                                                        │
│  6. Smaller models if quality allows                                    │
│                                                                         │
│  FOR LONG CONTEXT:                                                      │
│  4. GQA/MQA models (smaller KV cache)                                   │
│  5. KV cache quantization                                               │
│  6. Sliding window attention                                            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Summary: Why These Optimizations Work

Every optimization we discussed attacks a fundamental bottleneck identified in our analysis:

Optimization           Targets          Mechanism                   Why It Works
────────────           ───────          ─────────                   ────────────
Batching               Decode           ↑ Arithmetic intensity      Amortize weight reads across tokens
Quantization           Decode (mainly)  ↓ Bytes to read             Less memory bandwidth needed
Speculative decoding   Decode           Amortize per-step cost      Multiple tokens per weight read
GQA/MQA                Decode           ↓ KV cache size             Less cache to read per step
PagedAttention         Decode           Better memory utilization   Enables larger batches
FlashAttention         Prefill          ↓ Memory traffic            Keep attention computation in SRAM
Prompt caching         Prefill          Skip redundant compute      Reuse KV for common prefixes
Chunked prefill        Both             Better scheduling           Reduce latency spikes

The unifying theme: Decode is memory-bound, so successful optimizations either read less data (quantization, KV cache compression) or do more work per byte read (batching, speculative decoding). Prefill is compute-bound, so successful optimizations either skip redundant computation (prompt caching) or cut memory overhead around the attention computation so the compute units stay busy (FlashAttention).


Check Your Understanding

  1. If you're serving a chatbot with short responses (20-30 tokens) but want to maximize throughput, which optimizations would have the biggest impact?
  2. Why does quantization help decode more than prefill, even though the model weights are the same?
  3. A colleague suggests: "Let's use speculative decoding AND large batches together." Is this a good combination? Why or why not?
  4. For a long-document QA system (10K token documents, 100 token answers), would you prioritize prefill or decode optimizations?

Community

Sign up or log in to comment

1.7: Optimization Landscape — Phase-Specific Strategies

1.7: Optimization Landscape — Phase-Specific Strategies

Community Article Published February 2, 2026

The Optimization Framework

Now that we understand WHY prefill is compute-bound and decode is memory-bound, we can understand HOW various optimization techniques work. Each technique attacks a specific bottleneck in a specific phase.

The fundamental insight from our analysis:

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE TWO BOTTLENECKS                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  PREFILL (2% of time)              DECODE (98% of time)                │
│  ─────────────────────             ────────────────────                 │
│  Bottleneck: Compute               Bottleneck: Memory bandwidth         │
│  Arithmetic intensity: HIGH        Arithmetic intensity: LOW            │
│  GPU utilization: HIGH             GPU utilization: LOW                 │
│                                                                         │
│  To speed up prefill:              To speed up decode:                  │
│  → Do less computation             → Read less data from memory         │
│  → Compute faster                  → Read data faster                   │
│  → Better parallelization          → Do more work per byte read         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Let's examine how each major optimization technique maps to these bottlenecks.


Taxonomy of Optimizations

┌─────────────────────────────────────────────────────────────────────────┐
│                    OPTIMIZATION TAXONOMY                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  PRIMARILY TARGETS DECODE (memory-bound)                                │
│  ────────────────────────────────────────                               │
│  • Batching / Continuous batching     [increases arithmetic intensity]  │
│  • Quantization                       [reduces bytes to read]           │
│  • Speculative decoding               [amortizes memory reads]          │
│  • KV cache compression (MQA/GQA/MLA) [reduces cache reads]            │
│  • PagedAttention                     [better memory management]        │
│                                                                         │
│  PRIMARILY TARGETS PREFILL (compute-bound)                              │
│  ─────────────────────────────────────────                              │
│  • FlashAttention                     [reduces memory traffic]          │
│  • Chunked prefill                    [scheduling optimization]         │
│  • Prompt caching                     [avoids redundant compute]        │
│                                                                         │
│  TARGETS BOTH PHASES                                                    │
│  ───────────────────                                                    │
│  • Tensor parallelism                 [distributes work across GPUs]    │
│  • Pipeline parallelism               [overlaps compute stages]         │
│  • Better hardware                    [more bandwidth + more compute]   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Now let's examine each technique in detail.


Decode Optimizations (The High-Impact Zone)

Since decode accounts for 98% of inference time, these optimizations have the greatest impact on end-to-end latency and throughput.

1. Batching: Increasing Arithmetic Intensity

The Problem: During decode, we read 14 GB of weights to process 1 token. Arithmetic intensity is 0.5 FLOPs/byte—far below the 156 FLOPs/byte threshold.

The Solution: Process multiple requests simultaneously. If we batch 32 requests together, we read the weights once and use them for 32 tokens instead of 1.

┌─────────────────────────────────────────────────────────────────────────┐
│                         BATCHING EFFECT                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  WITHOUT BATCHING (batch_size = 1):                                     │
│  ──────────────────────────────────                                     │
│  Read 14 GB weights → process 1 token → output 1 token                  │
│  Arithmetic intensity: 0.5 FLOPs/byte                                   │
│  Time: 10 ms for 1 token = 100 tokens/sec                              │
│                                                                         │
│  WITH BATCHING (batch_size = 32):                                       │
│  ─────────────────────────────────                                      │
│  Read 14 GB weights → process 32 tokens → output 32 tokens              │
│  Arithmetic intensity: 32 × 0.5 = 16 FLOPs/byte                        │
│  Time: ~12 ms for 32 tokens = 2,666 tokens/sec                         │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Throughput improvement: 26.7× with only 1.2× latency increase!        │
│                                                                         │
│  Why it works:                                                          │
│  • Weight reads are AMORTIZED across batch                              │
│  • Matrix multiply [B, d] @ [d, d] instead of [1, d] @ [d, d]          │
│  • Same memory bandwidth, B× more computation                           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

The arithmetic intensity scales linearly with batch size:

Batch size:    1      8      32     64     128    256
Arith. int.:   0.5    4      16     32     64     128   FLOPs/byte
Status:        ✗      ✗      ✗      ✗      ✗      ~threshold

Even batch_size=256 barely reaches compute-bound territory.
This shows how severely memory-bound single-request decode is.

Continuous Batching: Rather than waiting for a batch to complete, modern serving systems dynamically add/remove requests from the batch as they start/finish. This maximizes GPU utilization by keeping the batch full.

Traditional batching:            Continuous batching:

Req 1: ████████████░░░░          Req 1: ████████████
Req 2: ████████████████          Req 2: ████████████████
Req 3: ████░░░░░░░░░░░░          Req 3: ████
       ↑                         Req 4:     ████████████
       Padding waste!            Req 5:         ████████
                                        ↑
                                        No wasted slots!

2. Quantization: Reducing Bytes to Read

The Problem: We must read 14 GB of weights every decode step. Memory bandwidth is the bottleneck.

The Solution: Represent weights with fewer bits. Instead of 16-bit floats (FP16), use 8-bit integers (INT8) or 4-bit integers (INT4).

┌─────────────────────────────────────────────────────────────────────────┐
│                       QUANTIZATION EFFECT                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Model: LLaMA-7B                                                        │
│                                                                         │
│  FP16 (baseline):                                                       │
│    Model size: 14 GB                                                    │
│    Decode step time: 14 GB / 2 TB/s = 7 ms                             │
│                                                                         │
│  INT8 quantization:                                                     │
│    Model size: 7 GB (2× smaller)                                        │
│    Decode step time: 7 GB / 2 TB/s = 3.5 ms (2× faster)                │
│                                                                         │
│  INT4 quantization:                                                     │
│    Model size: 3.5 GB (4× smaller)                                      │
│    Decode step time: 3.5 GB / 2 TB/s = 1.75 ms (4× faster)             │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Why it works for decode:                                               │
│  • Decode is memory-bound → read less = go faster                       │
│  • Same computation, fewer bytes                                        │
│  • Arithmetic intensity increases (same FLOPs, fewer bytes)            │
│                                                                         │
│  Trade-off:                                                             │
│  • Some accuracy loss (model quality degrades)                          │
│  • Careful calibration required                                         │
│  • INT4 approaches limits of acceptable quality loss                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Quantization impact by phase:

Phase Effect of Quantization Why
Decode MAJOR speedup (near-linear with compression) Memory-bound → fewer bytes = faster
Prefill Moderate speedup Compute-bound → helps somewhat via faster memory, but compute is bottleneck

3. Speculative Decoding: Amortizing Memory Reads

The Problem: Each decode step reads the full model (14 GB) but produces just 1 token. We pay the memory bandwidth cost 200 times to generate 200 tokens.

The Solution: Generate multiple tokens per "verification" step. Use a small, fast "draft" model to propose several tokens, then verify them in parallel with the large model.

┌─────────────────────────────────────────────────────────────────────────┐
│                    SPECULATIVE DECODING                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STANDARD DECODING:                                                     │
│  ──────────────────                                                     │
│  Step 1: Read 14GB → generate token 1                                   │
│  Step 2: Read 14GB → generate token 2                                   │
│  Step 3: Read 14GB → generate token 3                                   │
│  Step 4: Read 14GB → generate token 4                                   │
│  ...                                                                    │
│  Total for 4 tokens: 4 × 14GB = 56 GB read                             │
│                                                                         │
│  SPECULATIVE DECODING:                                                  │
│  ─────────────────────                                                  │
│  Step 1: Draft model proposes: [tok1, tok2, tok3, tok4] (fast, small)  │
│  Step 2: Large model verifies all 4 in ONE forward pass                │
│          Read 14GB → verify 4 tokens in parallel                        │
│  Step 3: Accept verified tokens (say 3 of 4 accepted)                   │
│                                                                         │
│  Total for 3 tokens: ~14 GB read (+ small draft overhead)              │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Why it works:                                                          │
│  • Verification is like a mini-prefill (parallel, compute-bound)       │
│  • Multiple tokens verified per weight read                             │
│  • Draft model is small enough to be fast (fits in cache)              │
│  • Acceptance rate determines speedup                                   │
│                                                                         │
│  Speedup depends on:                                                    │
│  • Draft model quality (higher acceptance = better)                     │
│  • Draft model speed (smaller = faster)                                 │
│  • Domain match (draft trained on similar data)                         │
│                                                                         │
│  Typical speedup: 2-3× for well-matched draft models                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

The key insight: Speculative decoding converts multiple sequential decode steps into fewer parallel verification steps. Each verification is more like prefill (parallel tokens, higher arithmetic intensity) than like decode.


4. KV Cache Compression (MQA, GQA, MLA)

The Problem: The KV cache grows with sequence length and must be read every decode step. For long sequences, KV cache reads become significant.

The Solution: Share Key and Value heads across multiple Query heads, reducing what must be stored and read.

┌─────────────────────────────────────────────────────────────────────────┐
│                    KV CACHE COMPRESSION TECHNIQUES                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  MULTI-HEAD ATTENTION (MHA) - Standard                                  │
│  ─────────────────────────────────────                                  │
│  32 Query heads, 32 Key heads, 32 Value heads                           │
│  KV cache per token: 32 × 2 × 128 × 2 bytes = 16 KB                    │
│  For 4K context: 64 MB per layer, 2 GB total                           │
│                                                                         │
│  MULTI-QUERY ATTENTION (MQA)                                            │
│  ───────────────────────────                                            │
│  32 Query heads, 1 Key head, 1 Value head (shared!)                    │
│  KV cache per token: 1 × 2 × 128 × 2 bytes = 0.5 KB                    │
│  For 4K context: 2 MB per layer, 64 MB total                           │
│  Reduction: 32× smaller cache!                                          │
│                                                                         │
│  GROUPED-QUERY ATTENTION (GQA)                                          │
│  ─────────────────────────────                                          │
│  32 Query heads, 8 Key heads, 8 Value heads (balanced)                 │
│  KV cache per token: 8 × 2 × 128 × 2 bytes = 4 KB                      │
│  For 4K context: 16 MB per layer, 512 MB total                         │
│  Reduction: 4× smaller cache                                            │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Visual comparison:                                                     │
│                                                                         │
│  MHA:  Q₁ Q₂ Q₃ Q₄ ... Q₃₂     Each Q has its own K,V                  │
│        ↓  ↓  ↓  ↓      ↓                                               │
│        K₁ K₂ K₃ K₄ ... K₃₂                                             │
│        V₁ V₂ V₃ V₄ ... V₃₂                                             │
│                                                                         │
│  GQA:  Q₁ Q₂ Q₃ Q₄ | Q₅ Q₆ Q₇ Q₈ | ...   Groups of Q share K,V        │
│           ↓              ↓                                              │
│           K₁             K₂         (8 groups)                         │
│           V₁             V₂                                             │
│                                                                         │
│  MQA:  Q₁ Q₂ Q₃ Q₄ ... Q₃₂         All Q share one K,V                 │
│              ↓                                                          │
│              K₁  (single)                                               │
│              V₁                                                         │
│                                                                         │
│  Trade-off: Smaller cache = faster decode, but potential quality loss  │
│  GQA is the popular middle ground (used in LLaMA-2, Mistral, etc.)    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Impact on decode:

For a 4K context sequence, decode step with MHA reads ~2 GB of KV cache. With GQA (8 groups), this drops to ~512 MB. With MQA, just ~64 MB. Since decode is memory-bound, reading less cache directly translates to faster decode.


5. PagedAttention: Better Memory Management

The Problem: KV cache memory is typically allocated contiguously for each request, leading to fragmentation and waste when sequences have variable lengths.

The Solution: Manage KV cache like virtual memory—allocate in small, non-contiguous "pages" that can be dynamically assigned.

┌─────────────────────────────────────────────────────────────────────────┐
│                         PAGEDATTENTION                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  TRADITIONAL KV CACHE ALLOCATION:                                       │
│  ─────────────────────────────────                                      │
│  GPU Memory:                                                            │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │ Req1 KV (4K reserved) │ Req2 KV (4K reserved) │░░░WASTED░░░░░░│    │
│  │ [████████░░░░░░░░░░░] │ [██████████████░░░░░] │░░░░░░░░░░░░░░░│    │
│  │  2K used    2K waste  │  3.5K used  0.5K waste│  Fragmentation│    │
│  └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  Problem: Must reserve max_seq_len for each request, wasting memory.   │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  PAGEDATTENTION:                                                        │
│  ───────────────                                                        │
│  GPU Memory (paged):                                                    │
│  ┌────────────────────────────────────────────────────────────────┐    │
│  │ P1 │ P2 │ P3 │ P4 │ P5 │ P6 │ P7 │ P8 │ P9 │... │FREE│FREE│   │    │
│  └────────────────────────────────────────────────────────────────┘    │
│     ↓    ↓    ↓    ↓    ↓    ↓    ↓                                    │
│   Req1 Req1 Req2 Req1 Req2 Req2 Req2                                    │
│                                                                         │
│  Page table:                                                            │
│  Req1: [P1, P2, P4]         (3 pages, grows as needed)                 │
│  Req2: [P3, P5, P6, P7]     (4 pages, non-contiguous is fine)         │
│                                                                         │
│  Benefits:                                                              │
│  • Near-zero memory waste (allocate only what's needed)                │
│  • Higher batch sizes possible (more requests fit in memory)           │
│  • Memory sharing for common prefixes (beam search, system prompts)    │
│                                                                         │
│  Impact on decode:                                                      │
│  • Indirect: enables larger batches → higher arithmetic intensity      │
│  • Direct: no wasted memory reads for unused cache slots               │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

PagedAttention (from the vLLM paper) is now standard in production serving systems. It doesn't change the fundamental compute or memory access patterns, but it enables running larger batches by using memory more efficiently.
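
To make the page-table bookkeeping concrete, here is a toy Python sketch. The class name, page size, and methods are invented for illustration; this shows the idea behind PagedAttention, not vLLM's actual interface:

```python
class PagedKVCache:
    """Toy block manager illustrating PagedAttention-style bookkeeping.

    A sketch of the idea only: KV memory is split into fixed-size pages, and
    each request holds a page table mapping logical token positions to
    physical pages, which need not be contiguous.
    """
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}   # request_id -> list of physical page ids
        self.lengths = {}       # request_id -> number of tokens written

    def append_token(self, request_id):
        """Reserve a slot for one new token, allocating a page only on demand."""
        table = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.page_size == 0:      # current page full (or first token)
            if not self.free_pages:
                raise MemoryError("no free KV pages: preempt or swap a request")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return table[-1], length % self.page_size   # (page, slot) for this token's K/V

    def release(self, request_id):
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

The attention kernel then gathers K/V from the (page, slot) locations listed in each request's page table, which is what makes non-contiguous storage workable.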


Prefill Optimizations

While decode dominates total time, prefill optimizations still matter for time-to-first-token (TTFT) and for very long prompt scenarios.

1. FlashAttention: Memory-Efficient Attention

The Problem: Standard attention materializes the full N×N attention matrix, which for long sequences requires massive memory and bandwidth.

The Solution: Compute attention in tiles, never materializing the full matrix. Keep intermediate results in fast SRAM instead of slow HBM.

┌─────────────────────────────────────────────────────────────────────────┐
│                         FLASHATTENTION                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STANDARD ATTENTION:                                                    │
│  ───────────────────                                                    │
│  1. Compute S = Q @ Kᵀ           [N, N] matrix in HBM                  │
│  2. Write S to HBM                                                      │
│  3. Read S from HBM                                                     │
│  4. Compute P = softmax(S)       [N, N] matrix in HBM                  │
│  5. Write P to HBM                                                      │
│  6. Read P from HBM                                                     │
│  7. Compute O = P @ V            [N, d] output                         │
│                                                                         │
│  Memory: O(N²) — prohibitive for long sequences                        │
│  HBM reads/writes: Multiple round trips for N×N matrices               │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  FLASHATTENTION:                                                        │
│  ───────────────                                                        │
│  Process in tiles that fit in SRAM:                                     │
│                                                                         │
│  for each Q_tile (fits in SRAM):                                        │
│      for each K_tile, V_tile:                                           │
│          S_tile = Q_tile @ K_tileᵀ     [in SRAM, never hits HBM]       │
│          P_tile = softmax(S_tile)       [in SRAM]                       │
│          O_tile += P_tile @ V_tile      [accumulate in SRAM]           │
│      write O_tile to HBM               [only final output]             │
│                                                                         │
│  Memory: O(N) — linear, not quadratic!                                 │
│  HBM reads/writes: Minimal (just Q, K, V in, O out)                    │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Impact:                                                                │
│  • 2-4× faster attention (reduced memory traffic)                      │
│  • Enables much longer context (no N² memory explosion)                │
│  • Makes prefill more compute-bound (less memory overhead)             │
│                                                                         │
│  Note: FlashAttention is now standard in all major frameworks          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
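
One detail the tile loop above glosses over: the softmax denominator spans all K tiles, so partial results must be rescaled as each new tile arrives ("online softmax"). Below is a minimal NumPy sketch of that rescaling for a single head with no masking; `tiled_attention` and its `tile` parameter are illustrative, with the tile size standing in for SRAM capacity:

```python
import numpy as np

def tiled_attention(Q, K, V, tile=128):
    """Attention computed tile by tile with online softmax rescaling.

    Numerically equivalent to softmax(Q @ K.T / sqrt(d)) @ V, but the full
    N x N score matrix is never materialized. A sketch of the idea only; the
    real FlashAttention kernel also tiles Q and runs fused on the GPU.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, V.shape[1]))
    m = np.full(N, -np.inf)      # running row-wise max of scores
    l = np.zeros(N)              # running softmax denominator

    for j in range(0, K.shape[0], tile):
        S = Q @ K[j:j + tile].T * scale            # [N, tile] partial scores
        m_new = np.maximum(m, S.max(axis=1))       # update running max
        alpha = np.exp(m - m_new)                  # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])             # stabilized partial softmax
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ V[j:j + tile]
        m = m_new
    return O / l[:, None]
```

This reproduces softmax(QKᵀ/√d)V up to floating-point error without ever holding an N×N matrix; the production kernel fuses the whole loop into one GPU kernel so the tiles genuinely stay in SRAM.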

Why FlashAttention helps prefill more than decode:

Prefill has large attention matrices (N×N, where N is the prompt length), so FlashAttention's reduction in memory traffic pays off substantially. Decode has tiny attention matrices (1×seq_len), so it helps much less: the decode bottleneck is streaming weights and KV cache from HBM, not the attention arithmetic itself.


2. Prompt Caching / Prefix Caching

The Problem: Many requests share common prefixes (system prompts, few-shot examples, document context). Each request recomputes KV cache for this shared content.

The Solution: Cache the KV values for common prefixes and reuse them across requests.

┌─────────────────────────────────────────────────────────────────────────┐
│                        PROMPT CACHING                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  WITHOUT PROMPT CACHING:                                                │
│  ───────────────────────                                                │
│  Request 1: [System prompt (500 tok)] + [User query A (50 tok)]        │
│             ↓                                                           │
│             Prefill ALL 550 tokens, compute KV for all                 │
│                                                                         │
│  Request 2: [System prompt (500 tok)] + [User query B (50 tok)]        │
│             ↓                                                           │
│             Prefill ALL 550 tokens again! Redundant work.              │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  WITH PROMPT CACHING:                                                   │
│  ────────────────────                                                   │
│  First request: [System prompt (500 tok)] + [User query A (50 tok)]    │
│                  ↓                                                      │
│                  Prefill all 550, CACHE KV for system prompt           │
│                                                                         │
│  Subsequent:    [System prompt] + [User query B (50 tok)]              │
│                  ↓              ↓                                       │
│                  Load KV cache  Prefill only 50 new tokens             │
│                  (instant!)     (10× less prefill work)                │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  Impact:                                                                │
│  • TTFT dramatically reduced for repeated prefixes                     │
│  • Great for chatbots (system prompts), RAG (document context)         │
│  • Memory cost: must store cached KV in GPU/CPU memory                 │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
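
Here is a toy sketch of the lookup side, keyed on the token prefix itself. The `PrefixCache` class and its methods are invented for illustration; production systems (vLLM's prefix caching, for instance) track this at page granularity with reference counting and eviction:

```python
class PrefixCache:
    """Toy prefix cache: map a token prefix to its precomputed KV cache.

    A sketch only; real systems also handle eviction, memory budgets,
    and sharing cached pages across concurrent requests.
    """
    def __init__(self):
        self._store = {}   # tuple(prefix token ids) -> KV cache (opaque object)

    def insert(self, prefix_tokens, kv_cache):
        self._store[tuple(prefix_tokens)] = kv_cache

    def lookup(self, prompt_tokens):
        """Return (cached_kv, n_cached) for the longest cached prefix, if any."""
        for end in range(len(prompt_tokens), 0, -1):        # longest match first
            kv = self._store.get(tuple(prompt_tokens[:end]))
            if kv is not None:
                return kv, end          # prefill only prompt_tokens[end:]
        return None, 0                  # miss: prefill the whole prompt
```

On a hit, the engine loads the cached KV and prefills only the suffix (the 50 user-query tokens in the diagram above).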

3. Chunked Prefill

The Problem: In serving systems processing mixed prefill and decode, a long prefill can block decode steps for other requests, causing latency spikes.

The Solution: Break prefill into chunks and interleave with decode steps from other requests.

┌─────────────────────────────────────────────────────────────────────────┐
│                        CHUNKED PREFILL                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  WITHOUT CHUNKING:                                                      │
│  ─────────────────                                                      │
│  Time: ──────────────────────────────────────────────────────────►      │
│                                                                         │
│  GPU:  [═══════ PREFILL (2000 tokens, 200ms) ═══════][decode][decode]  │
│                                                       ↑                 │
│                     Request B's decode waits 200ms ──┘                  │
│                     (latency spike!)                                    │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  WITH CHUNKED PREFILL:                                                  │
│  ─────────────────────                                                  │
│  Time: ──────────────────────────────────────────────────────────►      │
│                                                                         │
│  GPU:  [prefill][decode][prefill][decode][prefill][decode][prefill]    │
│        chunk 1  Req B   chunk 2  Req B   chunk 3  Req B   chunk 4      │
│                 ↑                ↑                ↑                     │
│                 No long waits! Decode interleaved                      │
│                                                                         │
│  Impact:                                                                │
│  • Smoother latency distribution                                        │
│  • Decode requests not blocked by large prefills                       │
│  • Small overhead from chunking boundaries                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
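
The scheduling policy can be sketched as a per-step batching function. The `Request` class, the `remaining_prompt_tokens` field, and the token budget below are invented for illustration, not any particular engine's API:

```python
from dataclasses import dataclass

@dataclass
class Request:                        # minimal stand-in for a serving-engine request
    request_id: str
    remaining_prompt_tokens: int = 0  # > 0 while the prompt is still being prefilled

def build_step_batch(decode_requests, prefill_queue, token_budget=512):
    """Assemble one engine step: all pending decodes plus a bounded prefill chunk.

    A sketch of the scheduling idea only. Each decoding request contributes
    one token; whatever budget remains goes to a chunk of the oldest waiting
    prefill, so no single prefill monopolizes a step.
    """
    batch = [(req, 1) for req in decode_requests]       # 1 new token per decode
    budget = token_budget - len(decode_requests)

    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req.remaining_prompt_tokens)
        batch.append((req, chunk))
        req.remaining_prompt_tokens -= chunk
        if req.remaining_prompt_tokens == 0:
            prefill_queue.pop(0)      # prompt fully prefilled; it decodes next step
    return batch
```

Every engine step serves all waiting decodes plus at most one bounded prefill chunk, which is exactly the interleaving shown in the diagram.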

Hardware Considerations

The optimal hardware differs based on which phase you're optimizing for:

┌─────────────────────────────────────────────────────────────────────────┐
│                    HARDWARE FOR EACH PHASE                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  FOR PREFILL (compute-bound):                                           │
│  ────────────────────────────                                           │
│  Want: More FLOPS                                                       │
│  • Higher compute throughput (TFLOPS)                                   │
│  • More tensor cores                                                    │
│  • Example: H100 (989 TFLOPS) vs A100 (312 TFLOPS) = 3.2× prefill      │
│                                                                         │
│  FOR DECODE (memory-bound):                                             │
│  ───────────────────────────                                            │
│  Want: More memory bandwidth                                            │
│  • Higher GB/s bandwidth                                                │
│  • More memory channels                                                 │
│  • Example: H100 (3.35 TB/s) vs A100 (2 TB/s) = 1.67× decode           │
│                                                                         │
│  ───────────────────────────────────────────────────────────────────    │
│                                                                         │
│  GPU COMPARISON FOR LLM INFERENCE:                                      │
│                                                                         │
│  │ GPU      │ Compute    │ Bandwidth │ Prefill │ Decode │              │
│  │          │ (TFLOPS)   │ (TB/s)    │ Speed   │ Speed  │              │
│  ├──────────┼────────────┼───────────┼─────────┼────────┤              │
│  │ A100     │ 312        │ 2.0       │ 1.0×    │ 1.0×   │              │
│  │ H100 SXM │ 989        │ 3.35      │ 3.2×    │ 1.7×   │              │
│  │ H100 NVL │ 835        │ 3.9       │ 2.7×    │ 2.0×   │              │
│  │ AMD MI300│ 1300       │ 5.3       │ 4.2×    │ 2.7×   │              │
│                                                                         │
│  Notice: Compute scales faster than bandwidth across generations.       │
│  This means decode remains the bottleneck even with newer GPUs.        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
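
The speedup columns in the table are just ratios of these peak numbers: under this chapter's model, prefill scales with compute and decode with bandwidth. A quick sketch (specs copied from the table; the MI300 row corresponds to the MI300X part):

```python
GPUS = {  # peak dense FP16 TFLOPS, HBM bandwidth in TB/s (numbers from the table above)
    "A100": (312, 2.0),
    "H100 SXM": (989, 3.35),
    "H100 NVL": (835, 3.9),
    "AMD MI300X": (1300, 5.3),
}

base_tflops, base_bw = GPUS["A100"]
for name, (tflops, bw) in GPUS.items():
    # Prefill is compute-bound, so it scales with TFLOPS;
    # decode is memory-bound, so it scales with bandwidth.
    print(f"{name:11s} prefill ~{tflops / base_tflops:.2f}x   decode ~{bw / base_bw:.2f}x   (vs A100)")
```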

Putting It Together: Optimization Decision Framework

┌─────────────────────────────────────────────────────────────────────────┐
│              OPTIMIZATION DECISION FRAMEWORK                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STEP 1: PROFILE YOUR WORKLOAD                                          │
│  ─────────────────────────────                                          │
│  • What's your prompt length distribution?                              │
│  • What's your output length distribution?                              │
│  • What's your batch size / concurrent requests?                        │
│  • What's your latency vs throughput priority?                          │
│                                                                         │
│  STEP 2: IDENTIFY YOUR BOTTLENECK                                       │
│  ────────────────────────────────                                       │
│                                                                         │
│       Long prompts + short outputs?          Short/medium prompts +     │
│       ────────────────────────────           long outputs?              │
│       Prefill may be significant             ──────────────────────     │
│       (but decode still matters)             Decode dominates           │
│              │                                      │                   │
│              ▼                                      ▼                   │
│       Consider:                              Consider:                  │
│       • FlashAttention                       • Batching                 │
│       • Prompt caching                       • Quantization             │
│       • Chunked prefill                      • Speculative decoding     │
│       • Tensor parallelism                   • GQA/MQA models           │
│                                              • PagedAttention           │
│                                                                         │
│  STEP 3: APPLY OPTIMIZATIONS IN ORDER OF IMPACT                         │
│  ──────────────────────────────────────────────                         │
│                                                                         │
│  ALWAYS DO (baseline optimizations):                                    │
│  1. FlashAttention (no downside)                                        │
│  2. Efficient batching / continuous batching                            │
│  3. PagedAttention for memory efficiency                                │
│                                                                         │
│  FOR THROUGHPUT (maximize tokens/sec):                                  │
│  4. Increase batch size as much as possible                             │
│  5. Quantization (INT8 or INT4)                                         │
│  6. Tensor parallelism across GPUs                                      │
│                                                                         │
│  FOR LATENCY (minimize time per request):                               │
│  4. Speculative decoding                                                │
│  5. Quantization                                                        │
│  6. Smaller models if quality allows                                    │
│                                                                         │
│  FOR LONG CONTEXT:                                                      │
│  4. GQA/MQA models (smaller KV cache)                                   │
│  5. KV cache quantization                                               │
│  6. Sliding window attention                                            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
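
For Step 2, a back-of-the-envelope estimate usually suffices. The sketch below assumes a 7B-parameter fp16 model on A100-like peak numbers, with prefill compute-bound at roughly 2 FLOPs per parameter per token and decode memory-bound at one full weight read per generated token; batching and KV cache traffic are ignored, so treat the output as rough orientation only:

```python
def prefill_fraction(prompt_tokens, output_tokens, n_params=7e9,
                     bytes_per_param=2, tflops=312.0, bandwidth_tbs=2.0):
    """Rough share of a single request's time spent in prefill.

    Assumptions (sketch only): prefill runs compute-bound at peak TFLOPS with
    ~2 FLOPs per parameter per token; decode runs memory-bound, reading all
    weights once per generated token. Batching and KV cache traffic ignored.
    """
    prefill_s = 2 * n_params * prompt_tokens / (tflops * 1e12)
    decode_s = output_tokens * n_params * bytes_per_param / (bandwidth_tbs * 1e12)
    return prefill_s / (prefill_s + decode_s)

print(f"Chatbot (200 in, 300 out):     {prefill_fraction(200, 300):.1%} prefill")
print(f"Long-doc QA (10k in, 100 out): {prefill_fraction(10_000, 100):.1%} prefill")
```

Under these assumptions, the chatbot workload is almost entirely decode, while the long-document QA workload spends a substantial fraction of its time in prefill.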

Summary: Why These Optimizations Work

Every optimization we discussed attacks a fundamental bottleneck identified in our analysis:

| Optimization | Targets | Mechanism | Why It Works |
|---|---|---|---|
| Batching | Decode | ↑ Arithmetic intensity | Amortize weight reads across tokens |
| Quantization | Decode (mainly) | ↓ Bytes to read | Less memory bandwidth needed |
| Speculative decoding | Decode | Amortize per-step cost | Multiple tokens per weight read |
| GQA/MQA | Decode | ↓ KV cache size | Less cache to read per step |
| PagedAttention | Decode | Better memory utilization | Enables larger batches |
| FlashAttention | Prefill | ↓ Memory traffic | Keep attention computation in SRAM |
| Prompt caching | Prefill | Skip redundant compute | Reuse KV for common prefixes |
| Chunked prefill | Both | Better scheduling | Reduce latency spikes |

The unifying theme: decode is memory-bound, so successful optimizations either read less data (quantization, KV cache compression) or do more work per byte read (batching, speculative decoding). Prefill is compute-bound, so successful optimizations either skip redundant computation (prompt caching) or cut the memory overhead that keeps the compute units waiting (FlashAttention).


Check Your Understanding

  1. If you're serving a chatbot with short responses (20-30 tokens) but want to maximize throughput, which optimizations would have the biggest impact?
  2. Why does quantization help decode more than prefill, even though the model weights are the same?
  3. A colleague suggests: "Let's use speculative decoding AND large batches together." Is this a good combination? Why or why not?
  4. For a long-document QA system (10K token documents, 100 token answers), would you prioritize prefill or decode optimizations?
