How many parameters fit on your GPU cluster?

Adam mixed-precision training. Each GPU's memory is split between model state (parameters, gradients, optimizer state) and activations. Edit any value to recompute.

Cluster

Number of GPUs (G) Memory per GPU in GB (M_GPU)

Bytes / parameter

Weights (b_w) Gradients (b_g) Master weights (b_master) First moment (b_m) Second moment (b_v)

Total: 16 B / param

Architecture & training

Sequence length (s) Batch size (b) Aspect ratio h/L (α) Head dim (d)

Solved: —

How the constraint works

Cluster is treated as one big pooled GPU with G · M_GPU of total memory, holding one copy of the model and one forward pass worth of activations. No parallelism scheme (DP / TP / PP / ZeRO) is modeled.
The model architecture is coupled to N: standard transformer has ~12 h² parameters per layer (4h² attention QKVO + 8h² FFN with 4× intermediate dim), so N ≈ 12 · L · h². Aspect ratio fixes the shape: h = α · L; head count follows from a = h / d.
Activations use the Korthikanti et al. (2022) formula A = L · s · b · (34h + 5as) bytes (bf16, full storage, no checkpointing).
Constraint: N · B_param + A ≤ G · M_GPU. We bisect over N (everything else follows from N and the architectural ratios) to find the largest self-consistent model that fits.
Ignored: embedding / vocab params, comms buffers, framework overhead.

How many parameters fit on your GPU cluster?

Cluster

Bytes / parameter

Architecture & training

Computation

How the constraint works