
MiniMax-01

Scaling Foundation Models with Lightning Attention

MiniMax Team

MiniMax

Academic Conference Presentation

Breaking the Context Window Barrier

  • Foundation models limited by quadratic attention complexity
  • Context window sizes typically capped at 128K tokens
  • Key goal: Match SOTA performance while extending context to 1M+ tokens
  • Practical applications requiring long context: document analysis, extended reasoning

Introducing the MiniMax-01 Family

  • MiniMax-Text-01: Text-only foundation model
  • MiniMax-VL-01: Vision-language model
  • Core innovation: Lightning Attention at scale
  • 456B total parameters (MoE architecture, 45.9B activated per token)
  • Context window: 1M tokens (training), 4M tokens (inference)

Hybrid Architecture for Optimal Performance

  • Mixture of Experts (MoE): 32 experts, top-2 routing
  • Hybrid attention mechanism:
    • 7 TransNormer blocks with Lightning Attention
    • 1 Transformer block with Softmax Attention
  • Additional components: DeepNorm, Group Query Attention, RoPE
  • Vision component: ViT-L/14 encoder with MLP projector
Figure 3: MiniMax-Text-01 architecture
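The 7:1 block ratio above can be sketched as a simple layer-pattern generator; the total layer count passed in is illustrative, not necessarily the model's actual depth.

```python
# Sketch of the hybrid attention layout: one softmax-attention block
# follows every seven lightning-attention (TransNormer) blocks.
# The layer count used below is illustrative.

def hybrid_pattern(num_layers, period=8):
    """Return the attention type of each layer: the last layer in each
    group of `period` uses softmax attention, the rest lightning attention."""
    return ["softmax" if (i + 1) % period == 0 else "lightning"
            for i in range(num_layers)]
```

With this pattern, a 16-layer stack contains exactly two softmax blocks, at positions 8 and 16.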

Lightning Attention: Linear Complexity with High Performance

  • Linear O(n) vs. Softmax's quadratic O(n²) complexity
  • Replaces the full n×n attention score matrix with a running key-value state updated per token
  • Overcomes historical limitation of linear attention performance
  • Why hybrid approach: balances efficiency and representation power
Figure 5: Comparing Softmax and Linear Attention computation
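The O(n) claim above follows from the right-product trick: instead of materializing an n×n score matrix, causal linear attention maintains a d×d cumulative key-value state. A minimal pure-Python sketch (plain dot-product kernel, no feature map or block tiling; MiniMax's production kernel differs):

```python
# Causal linear attention via a running KV state.
# Per-token cost is O(d^2), so total cost is O(n * d^2) instead of O(n^2 * d).

def linear_attention(Q, K, V):
    """Q, K, V: lists of n rows, each a list of d floats.
    Computes o_t = q_t @ S_t with S_t = sum_{s<=t} k_s^T v_s."""
    d = len(Q[0])
    S = [[0.0] * d for _ in range(d)]  # running sum of outer products k^T v
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d):
            for j in range(d):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d)) for j in range(d)])
    return out
```

Expanding the state shows this equals the quadratic form o_t = Σ_{s≤t} (q_t·k_s) v_s, computed without ever forming the score matrix.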

Enabling Efficient Training at Scale

  • Expert Parallel (EP) and Expert Tensor Parallel (ETP)
  • Computation-communication overlap strategies
  • Varlen Ring Attention for variable sequence lengths
  • Custom CUDA kernels (>75% MFU achieved)
Figure 9: Expert Parallel overlap
Figure 10: EP-ETP overlap efficiency

Multi-Stage Training Approach

  • Large-scale pre-training (11.4T tokens)
  • Three-stage long-context extension (up to 1M tokens)
  • Data curation and quality filtering
  • Adaptive batch sizing based on critical batch size
Figure 13: Power-law fit for training loss vs. critical batch size
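A power-law fit like the one in Figure 13 reduces to linear regression in log-log space. A minimal sketch with synthetic points (not the paper's measurements):

```python
# Fit y = a * x**c by least squares on (log x, log y).
import math

def fit_power_law(xs, ys):
    """Return (a, c) such that y is approximately a * x**c."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    # slope of the log-log regression line is the exponent c
    c = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - c * mx)
    return a, c
```

Applied to (training loss, critical batch size) pairs, the fitted exponent drives the adaptive batch-size schedule mentioned above.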

From Base Model to Aligned Assistant

  • Multi-stage alignment: SFT → DPO → GRPO
  • Multi-dimensional reward modeling
  • Safety alignment using Constitutional AI principles
  • Vision-language training: 4-stage process
Figure 17: Tag distribution in VLM instruction data
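The DPO stage in the pipeline above optimizes a contrastive objective over preference pairs. A minimal scalar sketch (the log-probabilities here are illustrative numbers; in practice they are sequence log-likelihoods under the policy and a frozen reference model):

```python
# DPO loss: -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Loss shrinks as the policy prefers the chosen response more than
    the reference model does, relative to the rejected response."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2; widening the chosen-vs-rejected margin monotonically lowers it.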

Matching SOTA Performance on Standard Tasks

  • MMLU: 88.5
  • MATH: 77.4
  • IFEval: 89.1
  • Vision benchmarks: MMMU 68.5, ChartQA 91.7
  • Comparable to GPT-4o, Claude-3.5-Sonnet
Figure 1: Benchmark performance comparison

Superior Performance on Long-Context Tasks

  • RULER benchmark performance
  • Successful extrapolation to 4M tokens on NIAH
  • MR-NIAH and MTOB results
  • LongBench-v2 performance
Figure 14: 4M token NIAH results
Figure 15: MR-NIAH benchmark

Computational Benefits of Lightning Attention

  • Near-constant training throughput as sequence length grows
  • Significantly lower latency for long sequences
  • IsoFLOP comparison with dense models
Figure 2: Prefill latency comparison
Figure 8: Training speed comparison across sequence lengths
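A back-of-the-envelope comparison explains the latency gap: per layer, softmax attention costs on the order of n²·d FLOPs while linear attention costs n·d², so linear attention wins whenever the sequence length n exceeds the head dimension d. The d value below is illustrative.

```python
# Order-of-magnitude attention FLOP counts per layer (constants dropped).

def attention_flops(n, d):
    softmax = n * n * d  # n x n score matrix plus weighted sum of values
    linear = n * d * d   # one d x d state update and readout per token
    return softmax, linear
```

At n = d the two costs coincide; at a 1M-token context with d = 128 the softmax variant needs n/d ≈ 7800× more FLOPs.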

Implications & Next Steps

  • Successfully scaled linear attention in foundation models
  • Demonstrated feasibility of million-token context windows
  • Public release: github.com/MiniMax-AI
  • Future directions:
    • Improved long-context evaluation metrics
    • Fully eliminating softmax attention
    • Enhanced programming capabilities

References & Contact

Key references available in the paper

Contact: model@minimaxi.com

Project: github.com/MiniMax-AI

Thank You!

Questions?
