Most LLMs waste processing power by activating every parameter for every token. DeepSeek V3 flips that approach completely, firing up just 37 billion of its 671 billion parameters per token through a Mixture of Experts system that routes inputs to specialized networks instead of running full computations everywhere. This architecture combines selective expert activation with compressed attention caching and FP8 precision training to cut memory usage and inference costs while maintaining quality. Here’s how the layers, routing mechanism, and attention cache actually work under the hood.

Core Components of DeepSeek V3’s Architecture

DeepSeek V3 runs on 671 billion total parameters but only activates 37 billion per token. It’s a Mixture of Experts model that selectively fires up computational resources instead of throwing everything at every input. This sparse activation setup works completely differently from dense models where all parameters process every token, letting DeepSeek V3 stay massive while keeping inference actually doable. The architecture pulls this off through a routing mechanism that sends each token to a small subset of specialized expert networks, activating roughly 5.5% of total parameters per forward pass.

The model builds on traditional Transformer blocks with three specific component choices that improve training stability and computational efficiency: SwiGLU activations, RoPE positional encoding, and RMSNorm normalization.

DeepSeek V3’s decoder architecture spans 61 layers. Each MoE layer contains 256 routed experts plus 1 shared expert that processes all inputs. By this count, the routing system activates 1,293 experts in total during token processing: 58 layers select 9 experts each (8 routed plus 1 shared, 522 in all), while 3 layers are counted as activating all 257 experts (256 routed plus 1 shared, 771 in all). The DeepSeek-V3 technical report describes those 3 layers, the first three in the network, as dense feed-forward layers rather than routed MoE layers, so "all experts activated" is best read as a parameter-accounting convention.

Each individual expert contains 29.36 million parameters. The selective activation pattern results in 37.96 billion activated feed-forward network parameters per token. This parameter distribution enables specialized computation where different expert networks learn distinct patterns in the training data, while shared experts maintain consistency across all processing paths by capturing universal patterns that apply to every input.
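The counts in this section can be checked with a few lines of arithmetic. This is a sketch that uses only the figures quoted above; the variable names are illustrative and not taken from any DeepSeek code.

```python
# Figures quoted in the text above; variable names are illustrative.
SPARSE_LAYERS = 58            # layers that activate 8 routed + 1 shared expert
FULL_LAYERS = 3               # layers counted as activating every expert
ROUTED_PER_LAYER = 256
SHARED_PER_LAYER = 1
TOP_K_ROUTED = 8
PARAMS_PER_EXPERT = 29.36e6

active_sparse = SPARSE_LAYERS * (TOP_K_ROUTED + SHARED_PER_LAYER)
active_full = FULL_LAYERS * (ROUTED_PER_LAYER + SHARED_PER_LAYER)
activated_experts = active_sparse + active_full

activated_ffn_params = activated_experts * PARAMS_PER_EXPERT
active_ratio = 37e9 / 671e9

print(f"activated experts per token: {activated_experts}")
print(f"activated FFN parameters:    {activated_ffn_params / 1e9:.2f}B")
print(f"active share of 671B total:  {active_ratio:.1%}")
```

With these inputs the per-layer counts sum to 522 + 771 = 1,293 experts, and 1,293 experts at 29.36 million parameters each gives the 37.96 billion activated feed-forward parameters quoted above.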

Multi-Head Latent Attention Mechanism in DeepSeek V3

Multi-Head Latent Attention compresses the key-value cache that builds up during inference, tackling a critical memory bottleneck that grows linearly with sequence length in standard attention implementations.

Traditional attention stores full key and value vectors for every token in the context window. This creates substantial memory overhead during generation tasks. MLA stores a compressed latent vector per token instead. In a small illustrative configuration, that means a latent of size 32 in place of full 512-dimension key and value vectors, for roughly a 10x memory reduction (the production model uses larger dimensions). The compression works by projecting high-dimensional key and value representations into a shared low-dimensional latent space, then reconstructing the full representations only when computing attention scores.
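The compress-then-reconstruct idea can be sketched in a few lines of NumPy. This is a minimal illustration, not DeepSeek's implementation: the weight names (W_down, W_up_k, W_up_v) are hypothetical, the weights are random rather than trained, and the sizes are the toy values from the paragraph above.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 512     # hidden size from the example above
D_LATENT = 32     # compressed latent size from the example above
SEQ_LEN = 10      # illustrative sequence length

# Hypothetical projection weights; a trained model would learn these.
W_down = rng.standard_normal((D_MODEL, D_LATENT)) / np.sqrt(D_MODEL)
W_up_k = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_LATENT)
W_up_v = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_LATENT)

hidden = rng.standard_normal((SEQ_LEN, D_MODEL))

# Only this latent tensor is kept in the cache between decoding steps.
latent_cache = hidden @ W_down              # (SEQ_LEN, D_LATENT)

# Keys and values are rebuilt from the latent when attention is computed.
keys = latent_cache @ W_up_k                # (SEQ_LEN, D_MODEL)
values = latent_cache @ W_up_v              # (SEQ_LEN, D_MODEL)

print(latent_cache.shape, keys.shape, values.shape)
```

The point of the sketch is the shapes: the cache grows by 32 numbers per token, while the 512-dimension keys and values exist only transiently.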

For a concrete calculation: a model with hidden dimension 512, 8 attention heads, and 1000-token sequence length stores only 96 numbers per token under MLA (32 compressed latent dimensions plus 64 for RoPE positional encoding) versus 1,024 numbers per token (a 512-dimension key plus a 512-dimension value) for full key-value storage, a reduction of roughly 10.7x.
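The calculation above can be written out explicitly. This sketch assumes that "full key-value storage" means one 512-dimension key and one 512-dimension value per token per layer; the constant names are illustrative.

```python
HIDDEN_DIM = 512     # 8 heads x 64 dimensions per head
LATENT_DIM = 32      # compressed latent stored by MLA
ROPE_DIM = 64        # shared key RoPE component stored alongside the latent
SEQ_LEN = 1000

full_kv_per_token = 2 * HIDDEN_DIM            # one key vector + one value vector
mla_per_token = LATENT_DIM + ROPE_DIM
reduction = full_kv_per_token / mla_per_token

full_kv_total = full_kv_per_token * SEQ_LEN   # numbers cached per layer
mla_total = mla_per_token * SEQ_LEN

print(f"per token: {full_kv_per_token} vs {mla_per_token} numbers")
print(f"per 1000 tokens: {full_kv_total} vs {mla_total} numbers")
print(f"reduction: {reduction:.2f}x")
```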

The architecture makes a specific design choice for positional information. All attention heads share a single key RoPE component to reduce memory usage, while each head maintains its own query RoPE component for position-sensitive pattern matching. This asymmetric approach works because query vectors get recomputed at each decoding step and don’t pile up in the cache, while key vectors persist throughout the sequence. The shared key RoPE lets multiple heads reference the same positional encoding without redundant storage.
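A minimal RoPE sketch shows how one shared key component can serve several heads. It assumes the common pairwise-rotation formulation with base 10000; the function name and sizes are illustrative, not taken from DeepSeek's code.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of the last axis by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
N_HEADS, ROPE_DIM = 8, 64
q_rope = rng.standard_normal((N_HEADS, ROPE_DIM))  # one query component per head
k_rope = rng.standard_normal(ROPE_DIM)             # a single key component, shared

# Every head scores against the same cached, rotated key vector.
scores_a = rope(q_rope, 5) @ rope(k_rope, 2)       # query at 5, key at 2
scores_b = rope(q_rope, 13) @ rope(k_rope, 10)     # same offset, shifted by 8
scores_c = rope(q_rope, 5) @ rope(k_rope, 3)       # different offset

print(np.allclose(scores_a, scores_b), np.allclose(scores_a, scores_c))
```

The scores depend only on the offset between query and key positions, which is why a single rotated key vector in the cache is enough for all heads.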

Multi-head attention enables the model to capture diverse relationship types between tokens at the same time. One head might focus on syntactic dependencies while another tracks semantic relationships or long-range discourse structure. This parallel processing across multiple subspaces reduces the information bottleneck that happens when forcing all relationships through a single attention mechanism. The scaled dot-product attention formula softmax(QK^T/√d_k)V divides the query-key dot products by the square root of the key dimension before applying softmax, which keeps their magnitudes from growing so large that softmax is pushed into regions with small gradients. The normalized weights then multiply the values to produce the final attention output.
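As a reference point, here is a small NumPy sketch of multi-head scaled dot-product attention. It leaves out the causal mask and the learned projections to stay short, and the shapes are toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k, v: arrays of shape (heads, seq_len, d_head)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
N_HEADS, SEQ_LEN, D_HEAD = 8, 4, 64
q = rng.standard_normal((N_HEADS, SEQ_LEN, D_HEAD))
k = rng.standard_normal((N_HEADS, SEQ_LEN, D_HEAD))
v = rng.standard_normal((N_HEADS, SEQ_LEN, D_HEAD))

out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)
```

Each of the 8 heads produces its own attention pattern over the 4 tokens, which is what lets different heads specialize in different relationships.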

Values in MLA use only content components without positional encoding. Position information is already incorporated through the key-query attention scores, so adding RoPE to values would redundantly encode position twice and mess with the content-based weighted combination that attention performs.

Deployment efficiency advantages from the MLA architecture include:

10x key-value cache reduction enables serving longer context windows within fixed GPU memory budgets, supporting 128k token contexts that would otherwise require proportionally more memory

FP8 precision savings provide additional 50% memory reduction compared to bf16, compounding with MLA compression for total memory efficiency gains

Sparse activation benefits from the MoE framework mean only 37 billion parameters activate per token despite 671 billion total, reducing both memory bandwidth and computation

Long-sequence processing capabilities become practical for applications like document analysis and multi-turn conversations where context accumulates over thousands of tokens

Reduced GPU requirements for deployment lower inference costs, making the model viable on hardware configurations that couldn’t support equivalent dense models with full KV caching

DeepSeekMoE Framework and Expert Routing

DeepSeekMoE combines shared experts that process every input token with routed specialized experts that activate selectively based on input characteristics. It uses smaller and more numerous expert networks than traditional sparse models. The shared experts capture universal patterns applicable across all inputs while routed experts develop specializations for specific content types, syntax patterns, or semantic domains. This hybrid structure balances consistent baseline processing with adaptive specialized computation.
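The shared-plus-routed structure can be sketched for a single token as follows. This is a toy illustration under stated assumptions: sigmoid affinity scores normalized over the selected experts, a ReLU feed-forward as a stand-in for the real expert network, tiny dimensions, random weights, and no residual connection. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_EXPERT = 16, 32     # toy sizes
N_ROUTED, TOP_K = 8, 2         # toy counts (the article cites 256 and 8 for V3)

def make_expert():
    return (rng.standard_normal((D_MODEL, D_EXPERT)) * 0.1,
            rng.standard_normal((D_EXPERT, D_MODEL)) * 0.1)

def run_expert(expert, x):
    w_in, w_out = expert
    return np.maximum(x @ w_in, 0.0) @ w_out   # ReLU stand-in for the expert FFN

shared_expert = make_expert()
routed_experts = [make_expert() for _ in range(N_ROUTED)]
W_gate = rng.standard_normal((D_MODEL, N_ROUTED)) * 0.1

def moe_forward(x):
    """x: one token vector of shape (D_MODEL,)."""
    affinity = 1.0 / (1.0 + np.exp(-(x @ W_gate)))   # token-to-expert scores
    chosen = np.argsort(affinity)[-TOP_K:]           # top-k routed experts
    gates = affinity[chosen] / affinity[chosen].sum()
    out = run_expert(shared_expert, x)               # shared expert: always on
    for gate, idx in zip(gates, chosen):
        out = out + gate * run_expert(routed_experts[idx], x)
    return out, chosen

token = rng.standard_normal(D_MODEL)
output, chosen = moe_forward(token)
print(output.shape, sorted(chosen.tolist()))
```

Only TOP_K of the routed experts run for a given token, while the shared expert contributes to every output.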

The architecture scales significantly from v2 to v3. Routed experts per layer jumped 60% from 160 to 256, expanding the pool of specializations available at each network depth. All-experts-activated layers expanded from 1 to 3, creating additional points where the model processes inputs through its complete set of feed-forward parameters rather than routing to a subset. In both versions these are the first layers of the network (the DeepSeek technical reports describe them as dense feed-forward layers), so the extra full-capacity processing is applied early, while the input representation is still being built.

The total expert activation pattern distributes computation unevenly across layers by design. 58 layers activate 9 experts each (8 routed plus 1 shared), while 3 layers activate all 257 experts (256 routed plus 1 shared). This totals 58 × 9 + 3 × 257 = 1,293 activated experts per forward pass. The selective activation in most layers provides computational efficiency, while full activation in designated layers ensures no specialized knowledge gets excluded when comprehensive processing matters most.

Traditional Mixture of Experts models use auxiliary loss functions to prevent routing collapse, where all inputs route to a small subset of experts and leave the others undertrained. DeepSeek V3 eliminates these auxiliary losses entirely, replacing them with a dynamic bias adjustment mechanism that influences routing decisions without adding training objectives that might conflict with the primary language modeling loss. The bias-based approach adds a per-expert bias term directly to the routing scores. The bias is not learned through gradients; it is nudged up or down during training by a simple rule based on each expert’s load.

Key advantages of auxiliary-loss-free load balancing include:

Simplified training dynamics by removing competing optimization objectives that require hyperparameter tuning to balance auxiliary loss weight against primary task performance

Dynamic bias adjustment using hyperparameter γ provides direct control: bias decreases by γ for overloaded experts (reducing their routing probability) and increases by γ for underloaded experts (making them more likely to receive tokens)

Routing influence without gradient interference because bias terms affect only which experts are selected (they are added to the routing scores before top-k selection) but don’t modify the gating values used to weight expert outputs, preserving expert specialization patterns

Maintained expert specialization occurs naturally as the bias mechanism prevents collapse without forcing artificial load distribution that might route incompatible inputs to inappropriate experts

Performance stability improves because the primary language modeling objective faces no competition from auxiliary terms that might pull gradients in conflicting directions during backpropagation
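The bias mechanism described above can be simulated in a few lines. This is a toy sketch, not DeepSeek's training code: the skewed random affinities stand in for a router that favors some experts, and the expert count, batch size, step count, and GAMMA value are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, GAMMA = 8, 2, 0.01   # toy values; GAMMA is the bias step size
N_TOKENS, N_STEPS = 512, 200

# Skewed affinities: low-index experts start out more popular.
skew = np.linspace(1.0, -1.0, N_EXPERTS)
bias = np.zeros(N_EXPERTS)

def sample_affinity():
    logits = rng.standard_normal((N_TOKENS, N_EXPERTS)) + skew
    return 1.0 / (1.0 + np.exp(-logits))

def route(affinity, bias):
    # The bias influences which experts are chosen...
    chosen = np.argsort(affinity + bias, axis=-1)[:, -TOP_K:]
    # ...but gating weights come from the unbiased affinities.
    picked = np.take_along_axis(affinity, chosen, axis=-1)
    gates = picked / picked.sum(axis=-1, keepdims=True)
    return chosen, gates

def expert_load(chosen):
    return np.bincount(chosen.ravel(), minlength=N_EXPERTS)

initial_load = expert_load(route(sample_affinity(), bias)[0])
for _ in range(N_STEPS):
    chosen, gates = route(sample_affinity(), bias)
    load = expert_load(chosen)
    # Overloaded experts: bias - GAMMA. Underloaded experts: bias + GAMMA.
    bias -= GAMMA * np.sign(load - load.mean())
final_load = expert_load(route(sample_affinity(), bias)[0])

print("load before:", initial_load)
print("load after: ", final_load)
```

In this simulation the per-expert load evens out over the update steps even though no loss term was added, and the gating weights still sum to 1 because they never see the bias.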

Training Efficiency Through FP8 Precision

DeepSeek V3 represents the first open-source large language model to use FP8 (8-bit floating point) precision during pre-training. It achieves 2x compute efficiency compared to FP16 on Nvidia Tensor Cores through reduced bit-width operations that process more data per clock cycle.

FP8 reduces memory consumption by half compared to bf16 (16-bit brain float) while maintaining relative loss error below 0.25% compared to the bf16 baseline. This shows that carefully implemented low-precision training preserves model quality. This memory reduction compounds throughout training. Activations, intermediate computations, and cached attention values all consume half the space, enabling larger batch sizes or longer sequences within fixed GPU memory budgets.

| Component | Precision Format | Purpose |
| --- | --- | --- |
| Feed-forward layers and most operations | FP8 (E4M3) | Compute efficiency and memory reduction during forward/backward passes |
| Embedding and attention layers | bf16 | Preserve precision for critical representational bottlenecks where information flows between layers |
| Master weights and gradients | fp32 | Maintain high precision for parameter updates to prevent accumulated rounding errors over training |
| Linear layers (E4M3 choice over E5M2) | FP8 E4M3 | Allocate more mantissa bits for better Mean-Absolute-Error performance in linear operations |
| Post-attention activations | E5M6 (custom fp12) | Balance precision and memory for activation tensors where attention outputs feed downstream layers |

The tile-wise quantization strategy replaces per-tensor scaling with fine-grained scaling, giving each region of a tensor its own scale because different regions may have different dynamic ranges. Activations are scaled in 1×128 tiles and weights in 128×128 blocks. Power-of-2 scaling factors apply to critical areas following attention operations: multiplying a floating-point value by a power of 2 changes only its exponent, so, barring overflow or underflow, the frequent rescaling adds no mantissa rounding error and stays cheap.
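To see why fine-grained scales help, the sketch below compares per-tensor and per-tile scaling on a tensor with one outlier. It is a software simulation under stated assumptions: fake_e4m3 is a hypothetical helper that rounds onto an E4M3-like grid (3 mantissa bits, maximum 448, saturating), NumPy has no FP8 dtype, and the outlier is deliberately exaggerated.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value in the FP8 E4M3 format

def fake_e4m3(x):
    """Round onto a simulated E4M3 grid (a software model, not a real FP8 dtype)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    # Exponent of each value, clamped at the smallest normal exponent (-6).
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)        # 3 mantissa bits: 8 steps per power of two
    return np.round(x / step) * step

def quantize_per_tile(x, tile=128):
    """Scale each 1 x tile slice so its largest magnitude maps to E4M3_MAX."""
    rows, cols = x.shape
    tiles = x.reshape(rows, cols // tile, tile)
    absmax = np.abs(tiles).max(axis=-1, keepdims=True)
    scale = E4M3_MAX / np.maximum(absmax, 1e-12)
    dequantized = fake_e4m3(tiles * scale) / scale
    return dequantized.reshape(rows, cols), scale.squeeze(-1)

def quantize_per_tensor(x):
    scale = E4M3_MAX / np.abs(x).max()
    return fake_e4m3(x * scale) / scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))
x[0, 0] = 1e5                      # one exaggerated outlier

tile_out, tile_scales = quantize_per_tile(x)
tile_err = np.abs(tile_out - x).mean()
tensor_err = np.abs(quantize_per_tensor(x) - x).mean()

print(f"per-tile scales shape: {tile_scales.shape}")
print(f"mean abs error, per-tensor: {tensor_err:.4f}")
print(f"mean abs error, per-tile:   {tile_err:.4f}")
```

With a single scale, the outlier squeezes every other value toward the bottom of the representable range. With per-tile scales, only the 128 values that share a tile with the outlier pay that price.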

Resource utilization improvements emerge from multiple architectural decisions working together. Activating 37.96 billion parameters (29.36 million per expert) out of 671 billion total enables selective computation where only relevant expert networks engage for each token. The architecture design helps distributed system performance for long-sequence processing because the reduced memory footprint from FP8 and MLA allows individual GPUs to hold larger context windows locally, minimizing cross-device communication that becomes a bottleneck in distributed training and inference.

| Performance Metric | Value | Architectural Component |
| --- | --- | --- |
| Compute efficiency multiplier | 2x vs FP16 | FP8 precision on Nvidia Tensor Cores |
| Memory reduction factor | 50% vs bf16 | 8-bit vs 16-bit representation for activations and weights |
| Relative loss error | Below 0.25% | Mixed precision strategy with selective high-precision components |
| Active parameter ratio | 5.5% (37B/671B) | Mixture of Experts sparse activation routing |

Quantization Strategy and Numerical Stability

The tile-wise quantization approach represents a shift from per-tensor scaling to 128×128 blocks, enabling fine-grained precision control that adapts to local tensor characteristics rather than applying uniform scaling across entire weight matrices or activation tensors.

The E4M3 format selection over E5M2 allocates more bits to the mantissa (3 bits) at the expense of exponent range (4 bits versus 5 in E5M2). This provides finer precision in the linear scale that improves Mean-Absolute-Error performance for gradient updates and weight representations. This trade-off accepts a smaller representable range (adequate for normalized neural network values) in exchange for more granular precision between representable numbers, reducing quantization error when values cluster within a narrower range. The format choice particularly benefits feed-forward network weights and activations where values typically fall within predictable bounds after normalization.
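The precision-versus-range trade-off can be illustrated with a software model of the two formats. This is a simulation under stated assumptions: fake_fp8 is a hypothetical helper that rounds to a grid with the given mantissa width, smallest normal exponent, and maximum value, and it saturates instead of modeling special values.

```python
import numpy as np

def fake_fp8(x, man_bits, min_exp, max_val):
    """Round x onto a simulated FP8 grid (software model, not a hardware dtype)."""
    x = np.clip(x, -max_val, max_val)
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** min_exp)))
    step = 2.0 ** (exp - man_bits)     # spacing between neighbors at this exponent
    return np.round(x / step) * step

E4M3 = dict(man_bits=3, min_exp=-6, max_val=448.0)
E5M2 = dict(man_bits=2, min_exp=-14, max_val=57344.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)       # values in a narrow, normalized range

mae_e4m3 = np.abs(fake_fp8(x, **E4M3) - x).mean()
mae_e5m2 = np.abs(fake_fp8(x, **E5M2) - x).mean()

print(f"MAE with E4M3: {mae_e4m3:.5f}")
print(f"MAE with E5M2: {mae_e5m2:.5f}")
print("1000.0 in E4M3 saturates to", fake_fp8(np.array([1000.0]), **E4M3)[0])
print("1000.0 in E5M2 rounds to   ", fake_fp8(np.array([1000.0]), **E5M2)[0])
```

For values that stay in a narrow range, the extra mantissa bit gives E4M3 the lower mean absolute error, while E5M2 keeps its advantage only for values that exceed E4M3’s range.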

Tensor Core accumulation for FP8 operations retains only about 14 bits of precision, which creates a potential accuracy issue. When many FP8×FP8 products are summed in a long dot product, the dropped low-order bits can add up to a relative error of up to 2% in worst-case scenarios. The solution promotes MMA (Matrix Multiply-Accumulate) partial results from Tensor Cores to CUDA Cores at regular intervals, where they are accumulated at higher precision. CUDA Cores provide a wider accumulator at the cost of throughput, creating a precision-speed trade-off that DeepSeek V3 manages by keeping the bulk of the multiplication on Tensor Cores and promoting only the accumulation step.
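The effect of a narrow accumulator, and of periodically promoting partial sums, can be imitated on a CPU. This sketch makes loud simplifications: float16 stands in for the limited-precision accumulator, float32 for the wider one, and the interval of 128 is an illustrative choice. It does not model real Tensor Core behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4096            # inner (reduction) dimension of the dot product
INTERVAL = 128      # how often partial sums are promoted (illustrative)

a = rng.random(K).astype(np.float32)
b = rng.random(K).astype(np.float32)
products = a * b
reference = float(np.sum(products, dtype=np.float64))

def accumulate_low_precision(values):
    """Running sum kept in float16, a stand-in for a narrow accumulator."""
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return acc

# Naive: the whole dot product stays in the narrow accumulator.
naive = float(accumulate_low_precision(products))

# Promoted: every INTERVAL products, flush the partial sum into float32.
promoted = np.float32(0.0)
for start in range(0, K, INTERVAL):
    chunk = products[start:start + INTERVAL]
    promoted += np.float32(accumulate_low_precision(chunk))
promoted = float(promoted)

naive_err = abs(naive - reference) / reference
promoted_err = abs(promoted - reference) / reference
print(f"relative error, narrow accumulator only: {naive_err:.4%}")
print(f"relative error, with promotion:          {promoted_err:.4%}")
```

Once the running sum grows large, the narrow accumulator can no longer represent small increments, so the naive sum drifts. Promotion keeps each partial sum small enough to stay accurate.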

Online (dynamic range) quantization applies to all layers, not just activation layers. It computes scaling factors during the forward pass based on actual tensor value distributions rather than using static pre-computed scales. Power-of-2 scaling factors for critical post-attention areas enable efficient implementation. Scaling the attention output by 4, for example, simply adds 2 to each value’s exponent and leaves the mantissa untouched, so no precision is lost to the rescaling itself.
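One way to see why power-of-2 scales are lossless for floating-point data is to split values into mantissa and exponent with NumPy's frexp. The input values below are arbitrary examples.

```python
import numpy as np

x = np.array([0.1, 0.7, 1.3, 3.14159], dtype=np.float32)

mant_before, exp_before = np.frexp(x)
mant_after, exp_after = np.frexp(x * np.float32(4.0))   # 4 = 2**2

# Scaling by a non-power-of-2 generally changes the mantissa as well.
mant_other, _ = np.frexp(x * np.float32(3.0))

print("mantissa unchanged by x4:", np.array_equal(mant_before, mant_after))
print("exponent shift from x4:  ", exp_after - exp_before)
print("mantissa unchanged by x3:", np.array_equal(mant_before, mant_other))
print("x4 round trip is exact:  ", np.array_equal(x * np.float32(4.0) / np.float32(4.0), x))
```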

Multi-Token Prediction Architecture

Multi-Token Prediction maintains causal chain relationships when predicting multiple future tokens simultaneously. This contrasts with independent parallel predictions that treat each future position as a separate task. Traditional next-token prediction generates one token at a time in autoregressive fashion: predict t₁, append it to context, predict t₂, append it, and continue sequentially. Parallel prediction might attempt to predict t₁, t₂, and t₃ simultaneously from the same context, but this approach breaks causal dependencies because t₂’s prediction ignores t₁ and t₃ ignores both predecessors.

The depth-dependent causal dependency mechanism in DeepSeek V3’s multi-token prediction preserves the sequential relationship. Predictions for token t₃ use both t₁ and t₂ as context, creating interconnected prediction chains where each future token conditions on all previous predictions in the same forward pass. At each position, the model predicts not just the immediate next token but several steps ahead, with each prediction head incorporating outputs from earlier prediction heads. This creates a cascade where the t₂ prediction head receives the t₁ prediction as input, and the t₃ prediction head receives both t₁ and t₂ predictions, maintaining causal structure across multiple prediction depths.
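The chained structure can be sketched with a toy model. This is a heavily simplified illustration, not the actual module design: a single tanh projection stands in for each prediction module, the weights are random, and every name and size is hypothetical. The only property it demonstrates is the dependency chain between prediction depths.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, DEPTHS = 50, 16, 2    # toy sizes; DEPTHS = extra tokens predicted

embed = rng.standard_normal((VOCAB, D_MODEL)) * 0.1     # shared embedding table
head = rng.standard_normal((D_MODEL, VOCAB)) * 0.1      # shared output head
# One hypothetical combining projection per prediction depth.
proj = [rng.standard_normal((2 * D_MODEL, D_MODEL)) * 0.1 for _ in range(DEPTHS)]

def rms_norm(x):
    return x / np.sqrt(np.mean(x * x) + 1e-6)

def mtp_logits(h_main, next_tokens):
    """Logits for each extra depth, chaining hidden states from depth to depth.

    h_main: hidden state from the main model at one position, shape (D_MODEL,)
    next_tokens: token ids that precede each extra prediction
    """
    h, all_logits = h_main, []
    for k in range(DEPTHS):
        tok = embed[next_tokens[k]]
        # Depth k sees the previous depth's state plus the next token's embedding.
        h = np.tanh(np.concatenate([rms_norm(h), rms_norm(tok)]) @ proj[k])
        all_logits.append(h @ head)
    return all_logits

h_main = rng.standard_normal(D_MODEL)
logits_a = mtp_logits(h_main, [3, 7])
logits_b = mtp_logits(h_main, [4, 7])   # change only the first token
logits_c = mtp_logits(h_main, [3, 9])   # change only the second token

print("depth 2 depends on first token:   ", not np.allclose(logits_a[1], logits_b[1]))
print("depth 1 independent of second one:", np.allclose(logits_a[0], logits_c[0]))
```

Changing the first token alters the deeper prediction, while changing the second token leaves the shallower prediction untouched, which is the causal ordering the text describes.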

Computational benefits include more efficient processing of longer sequences. The model amortizes the cost of encoding the input context across multiple token predictions rather than encoding separately for each generation step. Resource utilization improves across distributed training systems because multi-token prediction increases the computational intensity per forward pass, improving the ratio of computation to communication in distributed settings where cross-GPU synchronization overhead becomes a bottleneck. The architecture extracts more training signal per sequence by generating supervisory signals at multiple future positions simultaneously, effectively increasing the number of training examples derived from each input sequence without additional data collection.

Layer Configuration and Network Design

The 61-layer depth represents a minimal increase from v2’s 60 layers. This suggests that DeepSeek researchers found additional depth provided diminishing returns compared to other scaling dimensions like expert count. Each layer follows the standard Transformer block foundation: multi-head attention followed by feed-forward network, with residual connections and normalization surrounding each sub-block.

Component choices reflect specific design rationale. SwiGLU activation functions in feed-forward networks provide gating mechanisms that allow the model to dynamically control information flow based on input content. RoPE (Rotary Position Embedding) encodes positional information through rotations of the query and key vectors, so attention scores depend on relative position, which makes extension to longer sequences easier. RMSNorm (Root Mean Square Layer Normalization) simplifies standard LayerNorm by removing the mean centering operation while maintaining training stability with reduced computational cost. These components inherited from v2 proved effective enough that v3 maintained them unchanged while scaling other dimensions.
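Two of these components are short enough to write out in full. The sketch below uses the standard textbook definitions of RMSNorm and a SwiGLU feed-forward block with toy dimensions and random weights; the function and weight names are illustrative.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Scale by the root mean square of x; no mean subtraction, unlike LayerNorm."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu(x):
    return x / (1.0 + np.exp(-x))      # x * sigmoid(x)

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward: silu(x @ w_gate) scales (x @ w_up) elementwise."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
D_MODEL, D_FF = 16, 64
x = rng.standard_normal((4, D_MODEL)) * 3.0 + 1.0
w_gate = rng.standard_normal((D_MODEL, D_FF)) * 0.1
w_up = rng.standard_normal((D_MODEL, D_FF)) * 0.1
w_down = rng.standard_normal((D_FF, D_MODEL)) * 0.1

normed = rms_norm(x, np.ones(D_MODEL))
ffn_out = swiglu_ffn(normed, w_gate, w_up, w_down)

print("RMS after normalization:", np.sqrt((normed ** 2).mean(axis=-1)))
print("FFN output shape:", ffn_out.shape)
```

Note that a SwiGLU block uses three weight matrices (gate, up, down) rather than the two of a plain feed-forward layer.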

The strategic increase in all-experts-activated layers from 1 to 3 creates additional points where comprehensive processing occurs. In V3 these are the first three layers of the network, so every token passes through full, unrouted feed-forward computation before any expert routing takes place. This ensures a rich initial representation that captures diverse input characteristics, which the routed layers that follow can then specialize on.

Key layer configuration elements include:

Residual connections bypass each sub-layer, adding the sub-layer output to its input to enable gradient flow through deep networks and allow lower layers to preserve information that upper layers might otherwise compress away.

Pre-normalization architecture applies RMSNorm before attention and feed-forward operations rather than after, stabilizing training in deep networks by ensuring normalized inputs to each transformation.

Feed-forward expansion ratio scales the intermediate dimension larger than the model dimension within each expert’s feed-forward network, creating a bottleneck-expansion-bottleneck pattern that increases representational capacity.

Expert isolation maintains separate weight parameters for each expert network within a layer, preventing interference between specialized functions while allowing the routing mechanism to direct computation appropriately.
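The residual and pre-normalization pattern from the list above fits in a few lines. In this sketch the attention and feed-forward sub-layers are passed in as plain functions, and the placeholder sub-layers are hypothetical stand-ins for the real ones.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, attention, feed_forward):
    """Normalize the input to each sub-layer, then add its output to the residual."""
    x = x + attention(rms_norm(x))       # residual around attention
    x = x + feed_forward(rms_norm(x))    # residual around feed-forward
    return x

rng = np.random.default_rng(0)
D_MODEL, D_FF = 16, 64
w_in = rng.standard_normal((D_MODEL, D_FF)) * 0.1
w_out = rng.standard_normal((D_FF, D_MODEL)) * 0.1

def zero_sublayer(h):
    return np.zeros_like(h)              # placeholder that contributes nothing

def toy_ffn(h):
    return np.maximum(h @ w_in, 0.0) @ w_out

x = rng.standard_normal((4, D_MODEL))
y = pre_norm_block(x, zero_sublayer, toy_ffn)
passthrough = pre_norm_block(x, zero_sublayer, zero_sublayer)

print(y.shape, np.allclose(passthrough, x))
```

When both sub-layers output zeros, the block returns its input unchanged, which is the sense in which residual connections let lower-layer information pass through untouched.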

Distributed Training and Parallelism Strategies

Training 671 billion parameters with selective activation of 37.96 billion parameters per token requires sophisticated parallelism strategies that distribute computation, memory, and communication across GPU clusters efficiently.

Training at this scale combines multiple parallelism dimensions. Pipeline parallelism divides the 61 layers across devices in a pipeline where different stages process different micro-batches concurrently. Tensor parallelism splits individual layer computations across devices by partitioning weight matrices and distributing matrix multiplications. Data parallelism replicates the model across devices with each replica processing different training examples in parallel. Expert parallelism assigns different expert networks to different devices so that expert computations distribute across available hardware. These parallelism strategies combine multiplicatively. As a generic illustration, a 4-way pipeline split with 8-way tensor parallelism and 16-way data parallelism spans 512 GPUs, with each dimension addressing a different bottleneck.
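The multiplicative example can be checked directly. The even split of layers across pipeline stages is an assumption made for illustration; real schedules may balance stages differently.

```python
PIPELINE, TENSOR, DATA = 4, 8, 16     # parallelism degrees from the example above
N_LAYERS = 61

total_gpus = PIPELINE * TENSOR * DATA

# Split the layers as evenly as possible across pipeline stages.
base, extra = divmod(N_LAYERS, PIPELINE)
layers_per_stage = [base + (1 if stage < extra else 0) for stage in range(PIPELINE)]

print(f"total GPUs: {total_gpus}")
print(f"layers per pipeline stage: {layers_per_stage}")
```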

Computational efficiency emerges from activating only 1,293 experts per forward pass from the total expert pool. Despite the model’s massive parameter count, any single training step engages a tractable subset of parameters.

The architecture design works well for distributed resource utilization across GPU clusters through several mechanisms. The MoE structure naturally partitions computation because different experts can reside on different devices with the routing mechanism directing tokens to appropriate locations. The reduced memory footprint from FP8 precision and MLA compression allows each device to maintain larger working sets locally rather than constantly swapping data across interconnects. The multi-token prediction objective increases computational intensity per example (more work per data point), which improves the computation-to-communication ratio by amortizing synchronization costs across more useful work. Training remains stable despite distributed execution for two reasons: the bias-based load balancing operates on routing scores computed locally before cross-device communication, and master weights stored in fp32 accumulate gradient updates with enough precision to prevent divergence from accumulated quantization errors across thousands of update steps.

Architectural Improvements from DeepSeek V2 to V3

V3 builds on V2’s foundation components (Multi-Head Latent Attention and Mixture of Experts) that proved effective at scale, while introducing targeted improvements for enhanced efficiency and scalability rather than wholesale architectural redesign.

Four major categories of improvements define the v2 to v3 transition. Expert scaling expands the pool of specialized networks available at each layer. Load balancing simplification removes auxiliary training objectives in favor of direct bias adjustment. Precision optimization introduces FP8 training as the first open-source implementation. Layer refinement adds minimal depth while strategically expanding comprehensive processing points.

| Architectural Component | DeepSeek V2 | DeepSeek V3 | Impact |
| --- | --- | --- | --- |
| Layer count | 60 | 61 | Minimal depth increase focuses scaling on other dimensions |
| Routed experts per layer | 160 | 256 | 60% increase expands specialization capacity and model expressiveness |
| All-expert layers | 1 | 3 | Triple the comprehensive processing points for richer representations |
| Load balancing approach | Three auxiliary losses (expert-level, device-level, communication-level) | Bias-based adjustment with hyperparameter γ | Simplified training dynamics without competing optimization objectives |
| Training precision | bf16/fp16 | FP8 (first open-source LLM) | 2x compute efficiency on Tensor Cores, 50% memory reduction |
| Auxiliary losses | 3 distinct loss terms | 0 (eliminated entirely) | Cleaner gradients focused solely on language modeling objective |

The load balancing innovation eliminates three separate auxiliary losses that v2 used to maintain expert utilization balance. Expert-level loss prevented individual experts from being ignored. Device-level loss ensured even distribution across hardware to prevent bottlenecks. Communication-level loss minimized expensive cross-device routing. V3 replaces this complex multi-objective optimization with dynamic bias adjustment using hyperparameter γ that affects routing decisions but not final gating values. When an expert becomes overloaded (receiving more tokens than its share), its bias term decreases by γ, making it less likely to receive additional tokens in subsequent routing decisions. Underloaded experts have their bias increased by γ, raising their routing probability. The bias is added to the routing scores only when choosing the top-k experts, so it steers the routing distribution without interfering with the gating values that weight expert outputs. Expert specialization is maintained without the performance penalties of forced load distribution.

Synthesizing the improvements reveals a coherent scaling strategy. Bias-based routing simplifies training dynamics by removing hyperparameter tuning for auxiliary loss weights and eliminating gradient conflicts between competing objectives. The 60% expert increase from 160 to 256 per layer enhances model capacity through greater specialization without proportionally increasing active parameters (still routing to similar top-k subset). FP8 training as the first open-source implementation doubles compute efficiency while maintaining quality within 0.25% relative loss error. The minimal layer increase from 60 to 61 suggests that depth scaling provided diminishing returns compared to expert scaling and precision optimization. The architectural evolution focuses on efficiency gains, doing more with existing computation through better precision management and cleaner training objectives, rather than simply adding more layers or parameters.

Final Words

DeepSeek V3’s architecture combines proven Transformer foundations with targeted innovations that directly address scale, efficiency, and deployment challenges.

The hybrid approach activating 37.96 billion parameters per token from a 671 billion parameter base delivers practical compute savings without sacrificing capability.

Multi-Head Latent Attention, bias-based load balancing, and FP8 training represent engineering decisions grounded in measurable improvements: 10x memory reduction, eliminated auxiliary losses, and 2x compute efficiency.

The DeepSeek V3 architecture demonstrates that scaling intelligent systems requires more than parameter counts. It demands careful optimization across attention mechanisms, expert routing, and numerical precision to make large-scale models actually deployable.

FAQ

What model does DeepSeek-V3 use?

DeepSeek-V3 uses a Mixture of Experts model architecture built on traditional Transformer blocks. The model contains 671 billion total parameters with 37 billion active per token, employing a decoder-only design with 61 layers and specialized components including SwiGLU activation functions, RoPE positional encoding, and RMSNorm layer normalization.

What is the architecture of DeepSeek model?

The DeepSeek V3 architecture combines Multi-Head Latent Attention with a Mixture of Experts framework across 61 layers. Each MoE layer contains 256 routed experts plus 1 shared expert, and the per-layer counts total 1,293 activated experts per forward pass (58 layers × 9 experts plus 3 layers × 257 experts). The architecture uses compressed latent vectors for roughly 10x memory reduction and activates 37.96 billion parameters per token from the 671 billion total.

How was DeepSeek-V3 trained?

DeepSeek-V3 was trained using FP8 precision during pre-training, making it the first open-source large language model to do so. The training approach provides 2x compute efficiency compared to FP16 on Nvidia Tensor Cores while maintaining relative loss error below 0.25%. The model uses tile-wise quantization with 128×128 blocks and dynamic bias adjustment for load balancing across distributed systems.

What is the difference between DeepSeek-V3 and R1 architecture?

DeepSeek-V3 and R1 use identical architectures based on traditional Transformer blocks with the same core components. Both models share the SwiGLU activation functions, RoPE positional encoding, and RMSNorm layer normalization design. There are no architectural differences between the two models in terms of fundamental structure or component choices. The difference lies in training: R1 starts from the V3 base model and adds reinforcement-learning-based post-training aimed at reasoning.

How does Multi-Head Latent Attention work in DeepSeek-V3?

Multi-Head Latent Attention reduces key-value cache memory by storing a compressed latent vector per token instead of full key and value vectors. In the small illustrative configuration used in this article, that is a latent of size 32 in place of 512-dimension key and value vectors, for roughly a 10x reduction. All attention heads share a single key RoPE component to reduce memory usage while each head maintains its own query RoPE component for position-sensitive patterns, enabling efficient long-sequence processing.

What is the expert routing mechanism in DeepSeek-V3?

The expert routing mechanism in DeepSeek-V3 uses dynamic bias adjustment instead of auxiliary loss functions to balance expert load. The bias term is adjusted by hyperparameter γ, decreasing for overloaded experts and increasing for underloaded experts. This approach affects routing decisions but not final gating values, maintaining expert specialization without performance penalties.

How many experts does DeepSeek-V3 activate per token?

DeepSeek-V3 activates 1,293 experts per token across its 61 layers. This includes 58 layers that activate 9 experts each (522 in total) and 3 all-experts-activated layers that engage 257 experts each (771 in total). The selective activation enables efficient computation by using only 37.96 billion parameters per token from the 671 billion total.

What precision formats does DeepSeek-V3 use during training?

DeepSeek-V3 uses multiple precision formats strategically across components. General layers use FP8, embedding and attention layers use bf16, and master weights and gradients are stored in fp32. The model implements E4M3 format for better Mean-Absolute-Error performance and custom E5M6 format for activations after attention operations.

How does DeepSeek-V3 improve upon DeepSeek-V2?

DeepSeek-V3 improves upon V2 by increasing routed experts per layer by 60 percent from 160 to 256 and expanding all-experts-activated layers from 1 to 3. V3 eliminates three auxiliary losses used in V2, replacing them with bias-based load balancing, and introduces FP8 training for doubled compute efficiency while adding one layer for a total of 61.

What is Multi-Token Prediction in DeepSeek-V3?

Multi-Token Prediction in DeepSeek-V3 maintains causal chain relationships when predicting multiple future tokens instead of making independent parallel predictions. At each position, predictions for token t₃ use both t₁ and t₂ as context, creating depth-dependent causal dependencies that enable more efficient processing of longer sequences.
