DeepSeek V4 Pro
DeepSeek-V4-Pro is a large language model in the DeepSeek-V4 series that maintains strong performance while substantially reducing long-context inference cost.
The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization:
Hybrid Attention Architecture: We design a hybrid attention mechanism that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.
Manifold-Constrained Hyper-Connections (mHC): We incorporate mHC to strengthen conventional residual connections, enhancing stability of signal propagation across layers while preserving model expressivity.
Muon Optimizer: We employ the Muon optimizer for faster convergence and greater training stability.
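To make the efficiency figures for the hybrid attention architecture concrete, the short sketch below converts the stated fractions (27% of single-token inference FLOPs, 10% of KV cache, both relative to DeepSeek-V3.2 at 1M tokens) into reduction factors. The function names and the 200 GiB baseline are hypothetical, chosen only for illustration; they are not measured values for either model.

```python
# Relative costs stated in the text for the 1M-token context setting,
# expressed as fractions of the DeepSeek-V3.2 baseline.
FLOPS_FRACTION = 0.27
KV_CACHE_FRACTION = 0.10


def reduction_factor(fraction: float) -> float:
    """Convert 'uses this fraction of the baseline' into an 'N times less' factor."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


def scaled_cost(baseline: float, fraction: float) -> float:
    """Scale a baseline cost (in any unit) by the stated fraction."""
    return baseline * fraction


flops_reduction = reduction_factor(FLOPS_FRACTION)   # roughly 3.7x fewer FLOPs per token
kv_reduction = reduction_factor(KV_CACHE_FRACTION)   # 10x smaller KV cache

# Hypothetical baseline KV cache size, for illustration only (not a measurement).
hypothetical_baseline_kv_gib = 200.0
v4_kv_gib = scaled_cost(hypothetical_baseline_kv_gib, KV_CACHE_FRACTION)

print(f"FLOPs reduction: {flops_reduction:.1f}x")
print(f"KV cache reduction: {kv_reduction:.1f}x")
print(f"KV cache: {hypothetical_baseline_kv_gib:.0f} GiB -> {v4_kv_gib:.0f} GiB")
```

Read this way, the 10% KV cache figure means that a long-context workload limited by cache memory could in principle hold about ten times as many cached tokens in the same memory budget, assuming nothing else changes.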