Hugging Face Blog · 9 Mar
Ulysses Sequence Parallelism: Training with Million-Token Contexts
Training large language models on long sequences has become essential for document analysis, code understanding, complex reasoning, and RAG workloads. Ulysses Sequence Parallelism provides an elegant solution for processing sequences containing hundreds of thousands or even millions of tokens.
The attention mechanism in transformers scales quadratically with sequence length, requiring O(n²) FLOPs and memory for the attention score matrix. Even with FlashAttention reducing memory to O(n), training on sequences beyond 32k tokens pushes single-GPU memory to its limits.
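To make the quadratic cost concrete, here is a rough back-of-envelope calculation of the attention score matrix size, assuming bf16 (2-byte) activations and that the full matrix is materialized (as it is without FlashAttention); the function name and parameter choices are illustrative, not from any library:

```python
# Back-of-envelope size of the materialized attention score matrix,
# assuming bf16 (2 bytes per element). Real frameworks differ in what
# they actually keep in memory.
def score_matrix_bytes(seq_len: int, num_heads: int, bytes_per_el: int = 2) -> int:
    # One (seq_len x seq_len) score matrix per head, per layer.
    return seq_len * seq_len * num_heads * bytes_per_el

GIB = 1024 ** 3
print(score_matrix_bytes(32_768, 32) / GIB)     # 64.0 GiB per layer at 32k tokens
print(score_matrix_bytes(1_048_576, 32) / GIB)  # 65536.0 GiB (64 TiB) at 1M tokens
```

Even at 32k tokens a single layer's score matrices would exceed an 80 GB GPU without FlashAttention, and at a million tokens the quadratic term is hopeless on any single device.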
Ulysses, originally introduced as DeepSpeed-Ulysses and later extended in Snowflake's Arctic Long Sequence Training (ALST) work, takes a clever approach: it splits sequences along the sequence dimension while also partitioning attention heads across GPUs. Each GPU holds a portion of the tokens and computes attention for its assigned subset of heads.
The method uses all-to-all communication operations to redistribute data so that each GPU holds all sequence positions, but only for its subset of attention heads. Trading sequence locality for head locality in this way enables efficient parallelization with relatively low communication overhead.
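The redistribution described above can be sketched in a single process. The following is a pure-NumPy stand-in for what `torch.distributed` all-to-all would do across real GPUs (the shard list plays the role of the P ranks; the function name is illustrative):

```python
import numpy as np

# Single-process simulation of the Ulysses all-to-all: P "GPUs", a sequence
# of n tokens, h attention heads, head dimension d. Requires n % P == 0 and
# h % P == 0, as in the real implementation.
P = 4
n, h, d = 8, 4, 2

x = np.arange(n * h * d).reshape(n, h, d)   # the full activation tensor
seq_shards = np.split(x, P, axis=0)         # before: each GPU holds (n/P, h, d)

def all_to_all_seq_to_head(shards):
    # Each source GPU slices its local tokens into P head-groups and sends
    # group g to GPU g. Afterwards GPU g holds ALL n tokens, but only the
    # heads in group g -- so it can run ordinary full-sequence attention.
    P = len(shards)
    head_groups = [np.split(s, P, axis=1) for s in shards]  # [src][dst]
    return [
        np.concatenate([head_groups[src][dst] for src in range(P)], axis=0)
        for dst in range(P)
    ]

head_shards = all_to_all_seq_to_head(seq_shards)  # after: each GPU holds (n, h/P, d)
```

After local attention, a second all-to-all with the roles of sequence and heads swapped restores the original sequence-sharded layout for the rest of the transformer block.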
Ulysses requires only two all-to-all operations per attention layer, with a total communication volume of O(n·d/P) per GPU. This is significantly lower than Ring Attention, which communicates O(n·d) per GPU through P-1 sequential point-to-point transfers.
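A rough cost model makes the gap visible. The functions below are illustrative (constants and gradient traffic omitted): Ulysses performs two all-to-alls of roughly n·d/P elements per GPU, while Ring Attention's P-1 steps each ship a key and value block of (n/P)·d elements:

```python
# Illustrative per-GPU communication volumes (in elements) for one attention
# layer. These are simplified models of the O(.) terms above, not measurements.
def ulysses_comm(n: int, d: int, P: int) -> int:
    # Two all-to-alls, each moving ~n*d/P elements per GPU.
    return 2 * n * d // P

def ring_comm(n: int, d: int, P: int) -> int:
    # P-1 sequential steps, each transferring a K and a V block of (n/P)*d.
    return 2 * (P - 1) * (n // P) * d

n, d, P = 131_072, 4096, 8
print(ring_comm(n, d, P) // ulysses_comm(n, d, P))  # 7: Ulysses moves ~1/(P-1) as much
```

Under this model the ratio is simply P-1, which is why Ulysses' advantage grows with the degree of parallelism.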
The technique is now integrated across the Hugging Face ecosystem, including Accelerate, Transformers Trainer, and TRL's SFTTrainer. Developers can configure Ulysses through straightforward settings to enable training on contexts far beyond single-GPU capacity.
Best practices include ensuring the sequence length (and the number of attention heads) is divisible by the sequence-parallel degree, using Flash Attention, and combining Ulysses with DeepSpeed ZeRO for optimal performance. Benchmarks demonstrate significant memory reduction while maintaining throughput across multi-GPU configurations.
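The divisibility requirement is easy to satisfy in pre-processing. A minimal sketch, assuming a plain list of token ids and a hypothetical helper (this is not an HF API; real pipelines would also extend the attention mask so padded positions are ignored):

```python
# Hypothetical pre-processing helper: pad a tokenized sequence so its length
# is divisible by the sequence-parallel degree (e.g. the number of GPUs).
def pad_to_multiple(input_ids: list[int], multiple: int, pad_token_id: int = 0) -> list[int]:
    remainder = len(input_ids) % multiple
    if remainder:
        input_ids = input_ids + [pad_token_id] * (multiple - remainder)
    return input_ids

ids = list(range(10))                    # 10 tokens, 4-way sequence parallelism
padded = pad_to_multiple(ids, multiple=4)
print(len(padded))                       # 12 -- now evenly splittable across 4 ranks
```

Already-divisible sequences pass through unchanged, so the helper can be applied unconditionally in a data collator.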