---
id: "concept-multi-head-latent-attention"
type: "concept"
source_timestamps: ["17:20:00"]
tags: ["llm-architecture", "attention-mechanism"]
related: ["entity-deepseek-v2", "framework-memory-optimization-landscape", "concept-kv-cache", "prereq-llm-transformer-architecture"]
definition: "An architectural redesign that projects keys and values into a lower-dimensional latent space during training, structurally reducing the KV cache footprint by design."
sources: ["s49-killed-ram-limits"]
sourceVaultSlug: "s49-killed-ram-limits"
originDay: 49
---
# Multi-Head Latent Attention (MLA)

Multi-Head Latent Attention (MLA) is an architectural redesign of the transformer attention mechanism, notably implemented in [[entity-deepseek-v2]].

Instead of caching full Key and Value vectors for every token, MLA **projects keys and values into a lower-dimensional latent space during the training phase itself**. Because only the small latent vectors need to be cached (keys and values are reconstructed from them at attention time), the footprint of the [[concept-kv-cache]] shrinks **by design**, rather than being compressed after the fact during inference.
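
A minimal, single-head sketch of the idea (dimensions, layer names, and the joint down-projection are illustrative assumptions; the actual DeepSeek-V2 design is multi-head and adds a decoupled rotary-embedding path):

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Sketch of latent KV attention: cache one small latent vector per
    token instead of full keys and values."""

    def __init__(self, d_model: int = 4096, d_latent: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Joint down-projection: this small latent vector is what gets cached.
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections reconstruct keys/values from the latent at attention time.
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        # x: (batch, new_tokens, d_model)
        q = self.q_proj(x)
        latent = self.kv_down(x)                        # (batch, new_tokens, d_latent)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        k = self.k_up(latent)                           # reconstructed keys
        v = self.v_up(latent)                           # reconstructed values
        # Plain scaled dot-product attention (causal mask omitted for brevity).
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        out = attn @ v
        return out, latent                              # only the latent is cached
```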

This is the critical contrast with [[concept-turboquant]]:
- **Turboquant** = post-hoc compression applied at inference to a model trained normally
- **MLA** = architectural — model is trained from scratch with a low-dim latent space for KV

They are **complementary**: an MLA-architected model can additionally use Turboquant compression for further savings.
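
A rough back-of-the-envelope illustration of how the two savings stack (all dimensions and the 4-bit figure are assumptions for the example, not DeepSeek-V2 or Turboquant specifics):

```python
# Illustrative per-token KV-cache arithmetic.
n_layers, n_heads, d_head = 32, 32, 128
d_model = n_heads * d_head          # 4096
d_latent = 512                      # assumed MLA latent width

def mib(n_values, bytes_per_value):
    return n_values * bytes_per_value / 2**20

# Standard attention: full keys + values per layer, fp16.
standard = mib(2 * d_model * n_layers, 2)
# MLA: one latent vector per layer, fp16.
mla = mib(d_latent * n_layers, 2)
# MLA plus post-hoc 4-bit quantization (stand-in for a Turboquant-style step).
mla_quant = mib(d_latent * n_layers, 0.5)

print(f"standard KV cache : {standard:.4f} MiB/token")   # 0.5000
print(f"MLA latent cache  : {mla:.4f} MiB/token")         # 0.0312
print(f"MLA + 4-bit quant : {mla_quant:.4f} MiB/token")   # 0.0078
```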

MLA falls into bucket 3 ('Architectural Redesign') of the [[framework-memory-optimization-landscape]], alongside efforts like IBM Granite 4.0. It represents a fundamental shift from post-hoc software compression to structural architectural efficiency.
