Skip to main content

[A] Demystifying NVIDIA Ampere Architecture: Notes

· 7 min read
The CUDA Cache Maintainer

My notes on the article Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis. I prefer to use it as a datasheet. You can find:

  • The relation between the number of instructions and the average cycles for ADD.U32 instruction (This reveal the existence of addition hardware pipeline)
  • The CPI for dependent and independent instructions
  • The Tensor Cores Latencies and Throughput
  • The memory accesses latencies
  • Instructions Clock Cycles for the (Ampere A100) GPU

Summary of Results and Conclusions

The paper demystifies the microarchitecture of the Nvidia Ampere A100 GPU by running low-level microbenchmarks to calculate the exact clock cycles required for various instructions, memory access latencies, and Tensor Core (TC) throughput. The authors discovered several critical insights regarding

  • how the compiler handles code,
  • how hardware dependencies affect performance,
  • and the exact clock cycle cost of operations.

Key Findings on Instruction Latency & Pipeline Behavior

  • Dependency Penalty: The latency of an instruction increases significantly if it depends on the output of a previous instruction. For example, a single-precision add.f32 takes 4 cycles when dependent, but only 2 cycles when independent.

  • Pipeline Utilization: The mad (multiply-add) instruction executes on the floating-point pipeline, even when used with integer values. The researchers proved this by running add and mad instructions simultaneously and observing that both executed in parallel without bottlenecking the integer pipeline.

  • Instruction Overheads: Running a single instruction incurs a "first launch overhead." To get accurate measurements, the authors ran multiple independent instructions to find the true average cycles per instruction (CPI).

  • Compiler Translations (PTX to SASS): Many PTX instructions map 1-to-1 to hardware SASS instructions, but complex math operations (like div, sinf, cosf) are broken down into multiple SASS instructions. Furthermore, signed and unsigned instructions generally execute in the same number of cycles and map identically, with few exceptions like bfind, min, and max.

Key Findings on Memory Latency

  • Ampere's Global Memory access latency is approximately 290 cycles (bypassing caches), which is a notable improvement over the Turing architecture's 434 cycles.

  • L2 Cache latency is measured at 200 cycles, slightly slower than Turing's 188 cycles.

  • L1 Cache latency remains fast and highly comparable to previous generations at 33 cycles.

  • Shared Memory is slightly faster for store operations (19 cycles) than for load operations (23 cycles).

Key Findings on Tensor Cores (TC)

  • Ampere introduces broad support for new data types including FP64, U8, U4, and TF32, which require different underlying SASS instructions (e.g., DMMA.884 for FP64, IMMA.8832 for U4).

  • Unlike older architectures where the matrix shape impacted latency, the Ampere architecture's latency is primarily tied to the data type rather than the shape of the matrix being computed.

Extracted Data and Full Tables

Below are the complete tables detailing the exact measurements collected by the authors.

Table 1: The relation between the number of instructions and the average cycles for ADD.U32 instruction

# instrsCPI
15
23
32
42

Table 2: The CPI for dependent and independent instructions

InstructionCPI for dependentCPI for independent
add.f1632
add.u3242
add.f6454
mul.lo.u3232
mad.rn.f3242

Table 3: The Tensor Cores Latencies and Throughput

Supported shapesInputsAccumulatorsCyclesMeasured-theoreticalInstructions
m16n16k16, m8n32k16, m32n8k16f16f1616311-312 GB/sPTX: wmma.mma.sync.aligned.row.row.m16n16k16.f16.f16 SASS: 2 HMMA.16816.F16 - each inst. is 8 cycles
m16n16k16, m8n32k16, m32n8k16f16f3216310-312 GB/sPTX: wmma.mma.sync.aligned.row.row.m16n16k16.f16.f32 SASS: 2 HMMA.16816.F32 - each inst. is 8 cycles
m16n16k16, m8n32k16, m32n8k16bf16f3216310-312 GB/sPTX: wmma.mma.sync.aligned.row.row.m16n16k16.f32.bf16.bf16.f32 SASS: 2 HMMA.16816.F32.BF16 - each inst. is 8 cycles
m16n16k8tf32f3216132-156 GB/sPTX: wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32 SASS: 4 HMMA.1684.F32.TF32 - each inst. is 4 cycles
m8n8k4f64f641619-19.5 GB/sPTX: wmma.mma.sync.aligned.row.row.m8n8k4.f64.f64.f64.f64 SASS: 1 DMMA.884 - each inst. is 16 cycles
m16n16k16, m32n8k16, m8n32k16u8u328594-624 GB/sPTX: wmma.mma.sync.aligned.row.row.m16n16k16.s32.u8.u8.s32 SASS: 2 IMMA.16816.U8.U8 - each inst. is 4 cycles
m8n8k32u4u3241229-1248 GB/sPTX: wmma.mma.sync.aligned.row.col.m8n8k32.s32.u4.u4.s32 SASS: 1 IMMA.8832.U4.U4 - each inst. is 4 cycles

Table 4: The memory accesses latencies

Memory typeCPI (cycles)
Global memory290
L2 cache200
L1 cache33
Shared Memory (ld/st)(23/19)

Table 5: Instructions Clock Cycles for the (Ampere A100) GPU (Note: Consolidating the dual-column layout from the source into a single clear list for readability)

PTX InstructionSASS TranslationCycles
Add/Sub Instructions
add.u16UIADD32
addc.u32IADD3.X2
add.u32IADD2
add.u64UIADD3.X + UIADD34
add.s64UIADD3.X + UIADD34
add.f16HADD2
add.f32FADD2
add.f64DADD4
Mul Instructions
mul.wide.u16LOP3.LUT + IMAD4
mul.wide.u32IMAD4
mul.lo.u16LOP3.LUT + IMAD4
mul.lo.u32IMAD2
mul.lo.u64IMAD2
mul24.lo.u32PRMT + IMAD3
mul24.hi.u32UPRMT + USHF.R.U32.HI + IMAD.U32 + PRMT9
mul.rn.f16HMUL22
mul.rn.f32FMUL2
mul.rn.f64DMUL4
MAD Instructions
mad.lo.u16LOP3.LUT + IMAD4
mad.lo.u32FFMA2
mad.lo.u64IMAD2
mad24.lo.u32SGXT.U32 + IMAD4
mad24.hi.u32USHF.R.U32.HI + UIMAD.WIDE.U32 + 2*UPRMT + IADD311
mad.rn.f32FFMA2
mad.rn.f64DFMA4
Sad Instructions
sad.u16/s16(2 LOP3) + ULOP3 + VABSDIFF6
sad.u32/s32VABSDIFF + IMAD (1 IMAD + 1 Umov for 3 instrs)3
sad.u64/s64UISETP.GE.U32.AND + UIADD + IADD10
Rem/Div Instructions
rem/div.u16/s16multiple instructions290
rem/div.s32/u32multiple instructions66
rem/div.u64/s64multiple instructions420
div.rn.f32multiple instructions525
div.rn.f64multiple instructions426
Abs Instructions
abs.s16PRMT + IABS + PRMT4
abs.s32IABS2
abs.s64UISETP.LT.AND + UIADD3.X + UIADD3 + 2 USEL11
abs.f16PRMT1
abs.ftz.f32FADD.FTZ2
abs.f64DADD or (DADD+UMOV)4
Brev Instructions
brev.b32BREV + SGXT.U322
brev.b642 UBREV + MOV6
Copysign Instructions
copysign.f322 LOP3.LUT or 1.5 LOP3.LUT4
copysign.f642 ULOP3.LUT + IMAD.U32 + MOV6
And/Or/Xor Instructions
and.b16LOP3.LUT or 1.5 LOP3.LUT2
and.b32LOP3.LUT2
and.b64ULOP3.LUT2-3
Not Instructions
not.b16LOP3.LUT2
not.b32LOP3.LUT2
not.b642 ULOP3.LUT4
Lop3 Instructions
lop3.b32IMAD.MOV.U32 + LOP3.LUT4
Cnot Instructions
cnot.b16 / cnot.b32ULOP3.LUT+ISETP.EQ.U32.AND+SEL / UISETP.EQ.U32.AND+USEL5 / 4
cnot.b64multiple instructions11
Bfe Instructions
bfe.s32/.u323*PRMT + 2 IMAD.MOV + SHF.R.U32.HI + SGXT/.U3211
bfe.u64UMOV + USHF.L.U32 + (UIADD3 + ULOP3.LUT)5
bfe.s64multiple instructions14
Min/Max Instructions
min.u16ULOP3.LUT + UISETPLT.U32.AND + USEL8
min.u32IMNMX.U322
min.u64UISETP.LT.U32.AND + 2*USEL8
min.s16PRMT + IMNMX4
min.s32IMNMX2
min.s64UISETPLT.U32.AND + UISETP.LT.AND.EX + 2 USEL8
min.f16HMNMX2 + PRMT4
min.f32FMNMX2
min.f64DSETP.MIN.AND + IMAD.MOV.U32 + UMOV + FSEL10
Neg Instructions
neg.s16UIADD3 + UPRMT5
neg.s32IADD32
neg.s64IMAD.MOV.U32 + HFMA2.MMA + MOV + UIADD310
neg.f32FADD or IMAD.MOV.U322
neg.f64DADD + (UMOV)4
FMA Instructions
fma.rn.f16HFMA22
fma.rn.f32FFMA2
fma.rn.f64DFMA4
Sqrt Instructions
sqrt.rn.f32[multiple instrs including MUFU.RSQ]190-235
sqrt.approx.f32[multiple instrs including MUFU.SQRT]2-18
sqrt.rn.f64[multiple instrs including MUFU.RSQ64]260-340
Rsqrt Instructions
rsqrt.approx.f32[multiple instrs including MUFU.RSQ]2-18
rsqrt.approx.f64MUFU.RSQ64H8-11
Rcp Instructions
rcp.rn.f32[multiple instrs including MUFU.RCP]198
rcp.approx.f32[multiple instrs including MUFU.RCP]23
rcp.rn.f64[multiple instrs including MUFU.RCP64H]244
Pop Instructions
popc.b32POPC6
popc.b642 UPOPC + UIADD37
Clz Instructions
clz.b32FLO.U32 + IADD7
clz.b64UISETP.NE.U32.AND + USEL + UFLO.U32 + 2 UIADD313
Bfind Instructions
bfind.u32FLO.U326
bfind.u64FLO.U32 + ISETP.NE.U32.AND + IADD3 + BRA164
bfind.s32FLO6
bfind.s64multiple instructions195
Testp Instructions
testp.normal.f32IMAD.MOV.U32 + 2*ISETP.GE.U32.AND0 or 6
testp.subnor.f32ISETP.LT.U32.AND0 or 6
testp.normal.f642 UISETP.LE.U32.AND + 2 UISETP.GE.U32.AND13
testp.subnor.f64UISETP.LT.U32.AND + 2 UISETP.GE.U32.AND.EX8
Other Instructions
sin.approx.f32FMUL + MUFU.SIN8
cos.approx.f32FMUL.RZ + MUFU.COS8
lg2.approx.f32FSETP.GEU.AND + FMUL + MUFU.LG2 + FADD18
ex2.approx.f32FSTEP + FMUL + MUFU.EX2 + FMUL14
ex2.approx.f16MUFU.EX2.F166
tanh.approx.f32MUFU.TANH6
tanh.approx.f16MUFU.TANH.F166
bar.warp.syncNOPchanges
fns.b32multiple instructions79
cvt.rzi.s32.f32F2I.TRUNC.NTZ6
setp.ne.s32ISETP.NE.AND10
mov.u32 clockCS2R.322
Bfi Instructions
bfi.b323 PRMT + 2 IMAD.MOV + SHF.L.U32 + BMSK + LOP3.LUT11
bfi.b64UMOV + USHFL.U32 + (UIADD3 + ULOP3.LUT)5
dp4a/dp2a Instructions
dp4a.u32.u32IMAD.MOV.U32 + IDP.4A.U8.U8135-170
dp2a.lo.u32.u32IMAD.MOV.U32 + IDP.2A.LO.U16.U8135-170