
# Performance Summary

This document provides performance benchmarks for various large language models using NeMo Automodel with the PyTorch backend.

## Pre-Training Performance

The table below shows pre-training performance on full sequences with no padding, across different model architectures and scales.

System: DGX-H100, Precision: BF16

| Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron V3 Super 120B (26.02) | 64 | 512 | 2 | 2 | 4 | 4096 | 1 | 1 | 1 | 64 | - | 64 | TE + DeepEP + TorchSDPA | 7.286 | 334 | 4,497 |
| Nemotron V3 Nano 30B (26.02) | 8 | 512 | 4 | 4 | 16 | 4096 | 1 | 1 | 1 | 8 | - | 8 | TE + DeepEP + TorchSDPA | 15.614 | 328 | 16,789 |
| DeepSeek V3 671B | 1024 | 8192 | 1 | 8 | 4 | 4096 | 1 | 4 | 1 | 64 | 8 | 256 | TE + DeepEP | 37.87 | 216 | 865 |
| DeepSeek V3 671B | 256 | 512 | 1 | 8 | 1 | 4096 | 1 | 4 | 1 | 64 | 8 | 64 | TE + DeepEP | 8.18 | 250 | 1,002 |
| Kimi K2 | 256 | 512 | 1 | 8 | 2 | 4096 | 1 | 8 | 1 | 32 | 4 | 32 | TE + DeepEP | 8.86 | 189 | 924 |
| Qwen3 MoE 30B | 8 | 512 | 4 | 4 | 16 | 4096 | 1 | 1 | 1 | 8 | - | 8 | TE + DeepEP | 21.773 | 277 | 12,040 |
| GPT-OSS 20B | 8 | 256 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | - | 8 | TE + DeepEP + FlexAttn | 10.04 | 279 | 13,058 |
| GPT-OSS 120B | 64 | 512 | 2 | 2 | 4 | 4096 | 1 | 1 | 1 | - | - | 64 | TE + DeepEP + FlexAttn | 4.30 | 231 | 7,626 |
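As a quick sanity check, the reported Tokens/sec/GPU values can be re-derived from the other columns. The sketch below assumes the straightforward accounting (tokens per global step divided by step time and GPU count); the function name is illustrative, not part of NeMo Automodel:

```python
def tokens_per_sec_per_gpu(gbs: int, seq_len: int, step_time_s: float, n_gpus: int) -> float:
    """Tokens processed per second per GPU for one global step.

    Each global step consumes gbs * seq_len tokens (full sequences, no
    padding), spread across n_gpus GPUs over step_time_s seconds.
    """
    return gbs * seq_len / (step_time_s * n_gpus)

# Nemotron V3 Super 120B row: GBS=512, seq length 4096, 7.286 s/step on 64 GPUs
print(round(tokens_per_sec_per_gpu(512, 4096, 7.286, 64)))  # 4497, matching the table
```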

## Fine-Tuning (LoRA) Performance

The table below shows fine-tuning (LoRA) performance on full sequences with no padding, across different model architectures and scales.

System: DGX-H100, Precision: BF16

| Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3 8B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | - | 10.51 | 402 | 12472.87 |
| Qwen2.5 7B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | - | 9.29 | 423 | 14110.05 |
| Llama3 70B | 8 | 32 | 1 | 4 | 4 | 4096 | 2 | 4 | 1 | - | 10 | 1 | - | 24.87 | 190 | 658.62 |
| Qwen2.5 32B | 8 | 32 | 1 | 8 | 2 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 2 | 8.40 | 261 | 1950.93 |
| Llama3 70B 2-node | 16 | 32 | 1 | 4 | 2 | 4096 | 2 | 4 | 1 | - | 10 | 1 | 2 | 12.03 | 197 | 680.74 |
| Qwen2.5 32B 2-node | 16 | 32 | 1 | 8 | 1 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 4 | 4.48 | 244 | 1826.49 |

## Glossary

- **MFU**: Model FLOPs Utilization - the ratio of achieved compute to peak hardware capability
- **TP**: Tensor Parallelism - splits individual layers across GPUs
- **PP**: Pipeline Parallelism - splits the model's layers into stages
- **CP**: Context Parallelism - splits the sequence (context) dimension across GPUs
- **EP**: Expert Parallelism - distributes MoE experts across GPUs
- **DP**: Data Parallelism - replicates the model and splits the data
- **FSDP**: Fully Sharded Data Parallelism - shards parameters, gradients, and optimizer states across data-parallel ranks
- **VP**: Virtual Pipeline - number of pipeline stages per GPU for interleaving
- **MBS**: Micro-Batch Size - number of samples in one forward pass through the pipeline
- **LBS**: Local Batch Size - number of samples in one step per GPU
- **GBS**: Global Batch Size - total batch size across all GPUs
- **GA**: Gradient Accumulation - number of local batches accumulated before each optimizer step
- **TE**: Transformer Engine kernel optimizations - RMSNorm, Linear, and DotProductAttention
- **DeepEP**: Deep Expert Parallelism - advanced EP routing for MoE models
- **FlexAttn**: PyTorch's FlexAttention
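Taken together with the definitions above, the batch-size columns in the pre-training table are mutually consistent: every row satisfies GBS = LBS × GA × FSDP, with the FSDP column read as the number of data-parallel shards. A minimal sketch; the relation is inferred from the table rows, not taken from NeMo Automodel source:

```python
def global_batch_size(lbs: int, ga: int, dp_shards: int) -> int:
    """GBS = per-GPU local batch x gradient-accumulation steps x data-parallel (FSDP) shards."""
    return lbs * ga * dp_shards

# Spot-checks against pre-training table rows:
assert global_batch_size(2, 4, 64) == 512    # Nemotron V3 Super 120B
assert global_batch_size(8, 4, 256) == 8192  # DeepSeek V3 671B on 1024 GPUs
assert global_batch_size(2, 16, 8) == 256    # GPT-OSS 20B
```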

## Configuration Files

Pre-training benchmark configurations are available in `examples/benchmark/configs/`, and fine-tuning (LoRA) configurations are in `examples/llm_finetune/`.

:::{note}
- All benchmarks use mock data for consistent performance measurement.
- A fake balanced gate is enabled to simulate ideal expert routing.
- No gradient clipping is applied, to keep the measurement purely about performance.
- MFU is calculated using the system's peak TFLOPs (989 TFLOPs for BF16 on H100).
- Step times include the forward and backward passes plus the optimizer step for the full global batch.
:::
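Given the peak figure in the note, MFU follows directly from the Model TFLOPs/sec/GPU column. A minimal sketch; the constant is the BF16 H100 peak stated above, and the function name is illustrative:

```python
H100_BF16_PEAK_TFLOPS = 989  # per-GPU peak for BF16 on H100, per the note above

def mfu(achieved_tflops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved compute / peak hardware capability."""
    return achieved_tflops_per_gpu / H100_BF16_PEAK_TFLOPS

# e.g. the Llama3 8B LoRA row at 402 TFLOPs/sec/GPU:
print(f"{mfu(402):.1%}")  # 40.6%
```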

## Version Information

- Last Updated: 2025-10-02
- NeMo AutoModel Version: `main` branch