How many parameters fit on your GPU cluster?

Adam mixed-precision training. Each GPU's memory is split between model state (parameters, gradients, optimizer state) and activations. Edit any value to recompute.

Cluster

Bytes / parameter

Total: 16 B / param

Architecture & training

Solved:

Computation

1. Bytes per parameter (Bparam) — sum of bytes per parameter for weights (bw), gradients (bg), master weights (bmaster), and the two Adam moments (bm, bv).

2. Model architecture. A standard transformer has roughly 12 h² parameters per layer (4h² attention QKVO + 8h² FFN with 4× intermediate dim). The aspect ratio (α = h / L) fixes the depth-vs-width shape, and the head dimension (d) sets the number of attention heads (a). So total parameters (N), layer count (L), and hidden dim (h) are all coupled.

3. Activation memory (A) — bytes needed to store all forward activations for one pass through the model (bf16, no recomputation), per Korthikanti et al. (2022). Scales with sequence length (s) and batch size (b).

4. Constraint. The pooled cluster — G GPUs of MGPU memory each — must hold both the model state (N · Bparam) and the activations (A). We bisect over N to find the largest self-consistent model that fits.

How the constraint works