---
id: "concept-kv-cache"
type: "concept"
source_timestamps: ["01:05:00", "07:34:00", "09:27:00"]
tags: ["llm-architecture", "memory", "inference"]
related: ["concept-turboquant", "claim-memory-bottleneck", "action-evaluate-full-stack-concurrency", "prereq-llm-transformer-architecture"]
definition: "The working memory of an LLM during inference, storing previously processed tokens as key-value pairs to prevent redundant computation, which becomes a primary bottleneck at scale."
sources: ["s49-killed-ram-limits"]
sourceVaultSlug: "s49-killed-ram-limits"
originDay: 49
---
# Key-Value (KV) Cache

The Key-Value (KV) cache is the fundamental working memory mechanism for Large Language Models during inference. Because LLMs compute autoregressively (generating one token at a time), re-evaluating the entire preceding context for every new token would be computationally ruinous.

To solve this, the key and value projections of every token the model processes are stored in the KV cache (one pair per layer and per attention head). Each subsequent token then attends over these stored pairs instead of recomputing them. The speaker [[entity-nate-b-jones]] uses the analogy that the model **weights** are the 'processor' while the KV cache is the 'hard drive' or 'RAM' that allows the model to hold a conversation, follow an argument, or track a codebase.
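The reuse pattern can be sketched in a toy decode loop. This is an illustrative numpy sketch with made-up shapes and random vectors, not any real model's API: each step projects only the *new* token into key/value space, appends it to the cache, and attends over everything cached so far.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (toy size)

def attend(q, K, V):
    # Scaled dot-product attention: one query against all cached keys/values.
    scores = (K @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Simulate autoregressive decoding: one new token per step.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for step in range(4):
    # Project the new token once, then cache it — never recompute old tokens.
    k_new = rng.normal(size=d)
    v_new = rng.normal(size=d)
    K_cache = np.vstack([K_cache, k_new])
    V_cache = np.vstack([V_cache, v_new])
    q = rng.normal(size=d)  # query for the token being generated
    outputs.append(attend(q, K_cache, V_cache))

# After 4 steps the cache holds one key/value pair per processed token.
```

Without the cache, every step would re-project the entire prefix, turning decoding quadratic in context length; with it, each step does work proportional to the cache size only.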

As context windows grow to millions of tokens and agentic loops burn through 100M-1B tokens per task, the KV cache expands **linearly** with context length. This creates a massive memory bottleneck that dictates the profitability and concurrency limits of GPU deployments — see [[claim-memory-bottleneck]] and the broader [[concept-ai-memory-crisis]].
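The linear growth is easy to make concrete with a back-of-the-envelope sizing formula. The 32-layer / 8-KV-head / 128-dim fp16 configuration below is a hypothetical Llama-style assumption for illustration, not a figure from the source:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, each of shape
    (n_kv_heads, seq_len, head_dim), at bytes_per_elem precision (fp16 = 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# Hypothetical Llama-style config (assumption): 32 layers, 8 grouped KV heads,
# head_dim 128, fp16.
per_token = kv_cache_bytes(1, 32, 8, 128)        # 131072 bytes = 128 KiB/token
at_1m = kv_cache_bytes(1_000_000, 32, 8, 128)    # ~122 GiB at a 1M-token context
```

At roughly 128 KiB per token under these assumptions, a single 1M-token context already outweighs the HBM of most single GPUs, which is why cache size, not compute, caps how many concurrent sessions a deployment can serve.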

The KV cache is the direct target of [[concept-turboquant]] (compression), [[concept-multi-head-latent-attention]] (architectural redesign), and the eviction/tiering approaches catalogued in [[framework-memory-optimization-landscape]].

Understanding why the KV cache exists requires familiarity with [[prereq-llm-transformer-architecture]]. Operationalizing compression on top of the KV cache requires the audit discipline of [[action-evaluate-full-stack-concurrency]].
