The CUDA Cache Blog

[A] Dissecting the Volta Architecture: Notes

2026-05-13T00:00:00.000Z

My notes on the article Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. Volta changed how GPUs run AI workloads, and many of its design choices still shape the latest architectures. Here I dive into the two topics that Dissecting the NVIDIA Ampere GPU Architecture via Microbenchmarking reports as unchanged since Volta:

Instruction Encoding
Dual-Port Register

Instruction

Volta uses one 128-bit word to encode each instruction together with its control information. Previous architectures use a 64-bit word to encode each instruction, and a separate 64-bit word to encode the control information for multiple instructions. The authors find that, of the 128 bits:

at least 91 bits are used to encode the instruction;
at least 23 bits are used to encode control information;
the remaining 14 bits appeared to be unused in the authors' experiments.

Control Information

On Volta, the control information occupies 23 bits: 2 zero bits in the most significant positions, followed by 1 section of 21 bits. In each 128-bit word, the control information is preceded and followed by instruction encoding bits.

The 21-bit section is divided into 6 fields, organized as follows:

Width (bits)	4	6	3	3	1	4
Meaning	Reuse flags	Wait barrier mask	Read barrier index	Write barrier index	Yield flag	Stall Cycles
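The field widths above can be unpacked with a few shifts and masks. The sketch below assumes the six fields are packed in the table's order, from most significant (reuse flags) to least significant (stall cycles); that ordering, the helper name `decode_control`, and the sample value are my own illustrative assumptions, not something the notes confirm.

```python
# Field names and widths, assumed to be listed from most to least significant bit.
CONTROL_FIELDS = [
    ("reuse_flags", 4),
    ("wait_barrier_mask", 6),
    ("read_barrier_index", 3),
    ("write_barrier_index", 3),
    ("yield_flag", 1),
    ("stall_cycles", 4),
]


def decode_control(section: int) -> dict:
    """Split a 21-bit control section into its six fields (hypothetical helper)."""
    if not 0 <= section < (1 << 21):
        raise ValueError("control section must fit in 21 bits")
    fields = {}
    shift = 21
    for name, width in CONTROL_FIELDS:
        shift -= width
        fields[name] = (section >> shift) & ((1 << width) - 1)
    return fields


# Made-up example value: reuse bit 0 set, wait on barrier bit 0, read barrier
# index 7, write barrier index 2, yield flag set, stall for 4 cycles.
sample = (0b0001 << 17) | (0b000001 << 11) | (7 << 8) | (2 << 5) | (1 << 4) | 4
decoded = decode_control(sample)
print(decoded)
```

The widths sum to 21, so the loop ends with a shift of 0 and every bit of the section lands in exactly one field.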

Reuse flags

Each of the 4 reuse flag bits corresponds to one of the 8-byte source operand slots. When a flag is set, the value of the register in the corresponding slot is stored in the reuse cache for future instructions to consume. Reuse mitigates register bank conflicts. The least significant bit of the reuse flags controls the cache for the first source operand slot; the most significant bit controls the fourth.

Wait barrier mask & Write barrier index

While most instructions have fixed latency and can be statically scheduled by the assembler, instructions involving memory and shared resources typically have variable latency. On Volta, dependency barriers are used to track the completion of variable-latency instructions and resolve data hazards. There are 6 available barriers in total, and each maps to one bit in the Wait barrier mask.

During compilation, when a variable-latency instruction writes to a register, the assembler associates the register with one of the barriers by setting the corresponding Write barrier index field. In a later instruction that depends on this write result, the assembler sets the corresponding bit in its Wait barrier mask.

The hardware will stall the later instruction until the results of the earlier one are available. An instruction may wait on multiple barriers, which explains why the Wait barrier mask is a bitmask, not an index.
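Because the write barrier is an index while the wait barrier is a mask, the relationship between the two can be sketched in a few lines. This is a toy model only: it assumes barrier i corresponds to bit i of the mask, and the helper names `wait_mask` and `must_stall` are hypothetical.

```python
NUM_BARRIERS = 6  # six dependency barriers, one bit each in the wait mask


def wait_mask(barrier_indices) -> int:
    """Build a 6-bit wait barrier mask from the barriers an instruction depends on."""
    mask = 0
    for index in barrier_indices:
        if not 0 <= index < NUM_BARRIERS:
            raise ValueError(f"barrier index {index} out of range")
        mask |= 1 << index  # assumption: barrier i maps to bit i
    return mask


def must_stall(mask: int, pending: set) -> bool:
    """An instruction stalls while any barrier named in its mask is still pending."""
    return any(mask & (1 << index) for index in pending)


# A consumer that depends on two earlier variable-latency instructions,
# which the assembler tagged with write barrier indices 0 and 3.
mask = wait_mask([0, 3])
print(f"{mask:06b}")          # 001001
print(must_stall(mask, {3}))  # True: barrier 3 has not been released yet
print(must_stall(mask, {1}))  # False: barrier 1 is not in the mask
```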

Read barrier index (Read dependency barriers)

Read dependency barriers protect against write-after-read hazards. Unbuffered instructions that write the contents of registers to memory need those registers to remain unchanged during the operation. To guarantee that, the assembler associates them with a barrier by populating the corresponding Read barrier index field. Later instructions that write to the same register wait on that barrier.

Stall Cycles

This 4-bit field indicates how long the scheduler should wait before issuing the next instruction, ranging from 0 to 15 cycles. On Pascal and Maxwell, if this field and the yield flag together contain a special combination of bits, the two dispatchers in a processing block can dispatch two consecutive instructions of a warp at the same time (dual issue). On Volta there is only one dispatcher per processing block, and the authors do not observe dual issue in the generated code.

Yield flag

This flag balances the workload by controlling warp switching. If set, the scheduler prefers to issue the next instruction from the current warp. If cleared, it prefers switching to a different warp, which costs an extra cycle and renders the next instruction's register reuse flags ineffective.

Encoding

Volta uses more bits to encode its instructions than previous architectures. The Volta architecture places the opcode in the least significant bits of the first 64-bit part of the code bundle. Volta opcodes vary in length from 10 to 13 bits.

As in previous architectures, operands on Volta can be registers (general purpose, special or predication), memory addresses (constant, shared or global), or an immediate value. Predication is regulated by 4 bits: the first bit is a negation flag, and the remaining 3 bits encode a predicate register index.
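To make the 4-bit predication field concrete, here is a minimal sketch. It assumes that "the first bit" means the most significant bit of the field; the notes do not pin down the bit position, and both helper names are hypothetical.

```python
def decode_predicate(field: int):
    """Split a 4-bit predicate field into (negated, register_index)."""
    if not 0 <= field < 16:
        raise ValueError("predicate field must fit in 4 bits")
    negated = bool(field >> 3)  # assumption: the negation flag is the top bit
    index = field & 0b111       # remaining 3 bits: predicate register index
    return negated, index


def format_predicate(field: int) -> str:
    """Render the field in the usual @P / @!P guard notation."""
    negated, index = decode_predicate(field)
    return f"@{'!' if negated else ''}P{index}"


print(format_predicate(0b0010))  # @P2
print(format_predicate(0b1010))  # @!P2
```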

Dual-Port Register

The register file on Volta is divided into 2 banks, each 64 bits wide. The bank of a register is the register's index modulo 2. Because the banks are 64 bits wide, a conflict only happens when all three 32-bit source registers of an FFMA instruction are in the same bank. For example, R97 and R99 are both in bank 1; if the third source register RX also sits in bank 1 (that is, X is odd), a conflict occurs.
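The bank rule is simple enough to check mechanically. The sketch below only models the modulo-2 rule for the three source registers of an FFMA; it ignores the reuse cache, and the function names are mine.

```python
def bank(register_index: int) -> int:
    """Volta register bank: the register index modulo 2."""
    return register_index % 2


def has_bank_conflict(src_registers) -> bool:
    """True when all three 32-bit source registers of an FFMA share one bank."""
    if len(src_registers) != 3:
        raise ValueError("expected the three source registers of an FFMA")
    return len({bank(r) for r in src_registers}) == 1


print(has_bank_conflict([97, 99, 101]))  # True: all three are in bank 1
print(has_bank_conflict([97, 99, 100]))  # False: R100 is in bank 0
```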

[A] Demystifying NVIDIA Ampere Architecture: Notes

2026-05-09T00:00:00.000Z

My notes on the article Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis. I prefer to use it as a datasheet. You can find:

The relation between the number of instructions and the average cycles for the ADD.U32 instruction (this reveals the existence of a hardware pipeline for addition)
The CPI for dependent and independent instructions
The Tensor Cores Latencies and Throughput
The memory accesses latencies
Instructions Clock Cycles for the (Ampere A100) GPU

Summary of Results and Conclusions

The paper demystifies the microarchitecture of the Nvidia Ampere A100 GPU by running low-level microbenchmarks to calculate the exact clock cycles required for various instructions, memory access latencies, and Tensor Core (TC) throughput. The authors discovered several critical insights regarding

how the compiler handles code,
how hardware dependencies affect performance,
and the exact clock cycle cost of operations.

Key Findings on Instruction Latency & Pipeline Behavior

Dependency Penalty: The latency of an instruction increases significantly if it depends on the output of a previous instruction. For example, add.u32 takes 4 cycles when dependent, but only 2 cycles when independent (see Table 2).
Pipeline Utilization: The mad (multiply-add) instruction executes on the floating-point pipeline, even when used with integer values. The researchers proved this by running add and mad instructions simultaneously and observing that both executed in parallel without bottlenecking the integer pipeline.
Instruction Overheads: Running a single instruction incurs a "first launch overhead." To get accurate measurements, the authors ran multiple independent instructions to find the true average cycles per instruction (CPI).
Compiler Translations (PTX to SASS): Many PTX instructions map 1-to-1 to hardware SASS instructions, but complex math operations (like div, sinf, cosf) are broken down into multiple SASS instructions. Furthermore, signed and unsigned instructions generally execute in the same number of cycles and map identically, with few exceptions like bfind, min, and max.

Key Findings on Memory Latency

Ampere's Global Memory access latency is approximately 290 cycles (bypassing caches), which is a notable improvement over the Turing architecture's 434 cycles.
L2 Cache latency is measured at 200 cycles, slightly slower than Turing's 188 cycles.
L1 Cache latency remains fast and highly comparable to previous generations at 33 cycles.
Shared Memory is slightly faster for store operations (19 cycles) than for load operations (23 cycles).

Key Findings on Tensor Cores (TC)

Ampere introduces broad support for new data types including FP64, U8, U4, and TF32, which require different underlying SASS instructions (e.g., DMMA.884 for FP64, IMMA.8832 for U4).
Unlike older architectures where the matrix shape impacted latency, the Ampere architecture's latency is primarily tied to the data type rather than the shape of the matrix being computed.

Extracted Data and Full Tables

Below are the complete tables detailing the exact measurements collected by the authors.

Table 1: The relation between the number of instructions and the average cycles for ADD.U32 instruction

# instrs	CPI
1	5
2	3
3	2
4	2

Table 2: The CPI for dependent and independent instructions

Instruction	CPI for dependent	CPI for independent
add.f16	3	2
add.u32	4	2
add.f64	5	4
mul.lo.u32	3	2
mad.rn.f32	4	2
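As a rough illustration of what Table 2 implies, the sketch below estimates the cycle cost of a run of identical instructions. It is a back-of-the-envelope model that ignores the first launch overhead and any interaction with other instructions; the helper names are hypothetical.

```python
# (dependent CPI, independent CPI) copied from Table 2.
CPI = {
    "add.f16": (3, 2),
    "add.u32": (4, 2),
    "add.f64": (5, 4),
    "mul.lo.u32": (3, 2),
    "mad.rn.f32": (4, 2),
}


def estimated_cycles(instruction: str, count: int, dependent: bool) -> int:
    """Rough cycle estimate for `count` back-to-back copies of one instruction."""
    dep_cpi, indep_cpi = CPI[instruction]
    return count * (dep_cpi if dependent else indep_cpi)


def dependency_penalty(instruction: str) -> float:
    """How many times slower a dependent chain is than independent issue."""
    dep_cpi, indep_cpi = CPI[instruction]
    return dep_cpi / indep_cpi


print(estimated_cycles("add.u32", 100, dependent=True))   # 400
print(estimated_cycles("add.u32", 100, dependent=False))  # 200
print(dependency_penalty("add.f64"))                      # 1.25
```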

Table 3: The Tensor Cores Latencies and Throughput

Supported shapes	Inputs	Accumulators	Cycles	Throughput (measured / theoretical)	Instructions
m16n16k16, m8n32k16, m32n8k16	f16	f16	16	311 / 312 TFLOPS	PTX: wmma.mma.sync.aligned.row.row.m16n16k16.f16.f16 SASS: 2 HMMA.16816.F16 - each inst. is 8 cycles
m16n16k16, m8n32k16, m32n8k16	f16	f32	16	310 / 312 TFLOPS	PTX: wmma.mma.sync.aligned.row.row.m16n16k16.f16.f32 SASS: 2 HMMA.16816.F32 - each inst. is 8 cycles
m16n16k16, m8n32k16, m32n8k16	bf16	f32	16	310 / 312 TFLOPS	PTX: wmma.mma.sync.aligned.row.row.m16n16k16.f32.bf16.bf16.f32 SASS: 2 HMMA.16816.F32.BF16 - each inst. is 8 cycles
m16n16k8	tf32	f32	16	132 / 156 TFLOPS	PTX: wmma.mma.sync.aligned.row.row.m16n16k8.f32.tf32.tf32.f32 SASS: 4 HMMA.1684.F32.TF32 - each inst. is 4 cycles
m8n8k4	f64	f64	16	19 / 19.5 TFLOPS	PTX: wmma.mma.sync.aligned.row.row.m8n8k4.f64.f64.f64.f64 SASS: 1 DMMA.884 - each inst. is 16 cycles
m16n16k16, m32n8k16, m8n32k16	u8	u32	8	594 / 624 TOPS	PTX: wmma.mma.sync.aligned.row.row.m16n16k16.s32.u8.u8.s32 SASS: 2 IMMA.16816.U8.U8 - each inst. is 4 cycles
m8n8k32	u4	u32	4	1229 / 1248 TOPS	PTX: wmma.mma.sync.aligned.row.col.m8n8k32.s32.u4.u4.s32 SASS: 1 IMMA.8832.U4.U4 - each inst. is 4 cycles
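Two quick sanity checks fall out of Table 3: how close each measured throughput is to its theoretical value, and whether the per-instruction SASS cycles add up to the reported total. The sketch below copies the numbers from the table; the dictionary keys are just my own labels.

```python
# (measured, theoretical) throughput pairs from Table 3, keyed by input type.
THROUGHPUT = {
    "f16 (f16 acc)": (311, 312),
    "f16 (f32 acc)": (310, 312),
    "bf16": (310, 312),
    "tf32": (132, 156),
    "f64": (19, 19.5),
    "u8": (594, 624),
    "u4": (1229, 1248),
}

# (SASS instructions per WMMA, cycles per SASS instruction, total cycles) from Table 3.
SASS_BREAKDOWN = {
    "f16 (f16 acc)": (2, 8, 16),
    "tf32": (4, 4, 16),
    "f64": (1, 16, 16),
    "u8": (2, 4, 8),
    "u4": (1, 4, 4),
}

for name, (measured, theoretical) in THROUGHPUT.items():
    print(f"{name:14s} {measured / theoretical:6.1%} of theoretical")

for name, (count, cycles_each, total) in SASS_BREAKDOWN.items():
    # The per-instruction cycles should add up to the reported total.
    assert count * cycles_each == total, name
```

TF32 stands out as the data type with the largest gap between measured and theoretical throughput.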

Table 4: The memory accesses latencies

Memory type	Latency (cycles)
Global memory	290
L2 cache	200
L1 cache	33
Shared Memory (ld/st)	(23/19)

Table 5: Instructions Clock Cycles for the (Ampere A100) GPU (Note: Consolidating the dual-column layout from the source into a single clear list for readability)

PTX Instruction	SASS Translation	Cycles
Add/Sub Instructions
add.u16	UIADD3	2
addc.u32	IADD3.X	2
add.u32	IADD	2
add.u64	UIADD3.X + UIADD3	4
add.s64	UIADD3.X + UIADD3	4
add.f16	HADD	2
add.f32	FADD	2
add.f64	DADD	4
Mul Instructions
mul.wide.u16	LOP3.LUT + IMAD	4
mul.wide.u32	IMAD	4
mul.lo.u16	LOP3.LUT + IMAD	4
mul.lo.u32	IMAD	2
mul.lo.u64	IMAD	2
mul24.lo.u32	PRMT + IMAD	3
mul24.hi.u32	UPRMT + USHF.R.U32.HI + IMAD.U32 + PRMT	9
mul.rn.f16	HMUL2	2
mul.rn.f32	FMUL	2
mul.rn.f64	DMUL	4
MAD Instructions
mad.lo.u16	LOP3.LUT + IMAD	4
mad.lo.u32	FFMA	2
mad.lo.u64	IMAD	2
mad24.lo.u32	SGXT.U32 + IMAD	4
mad24.hi.u32	USHF.R.U32.HI + UIMAD.WIDE.U32 + 2*UPRMT + IADD3	11
mad.rn.f32	FFMA	2
mad.rn.f64	DFMA	4
Sad Instructions
sad.u16/s16	(2 LOP3) + ULOP3 + VABSDIFF	6
sad.u32/s32	VABSDIFF + IMAD (1 IMAD + 1 Umov for 3 instrs)	3
sad.u64/s64	UISETP.GE.U32.AND + UIADD + IADD	10
Rem/Div Instructions
rem/div.u16/s16	multiple instructions	290
rem/div.s32/u32	multiple instructions	66
rem/div.u64/s64	multiple instructions	420
div.rn.f32	multiple instructions	525
div.rn.f64	multiple instructions	426
Abs Instructions
abs.s16	PRMT + IABS + PRMT	4
abs.s32	IABS	2
abs.s64	UISETP.LT.AND + UIADD3.X + UIADD3 + 2 USEL	11
abs.f16	PRMT	1
abs.ftz.f32	FADD.FTZ	2
abs.f64	DADD or (DADD+UMOV)	4
Brev Instructions
brev.b32	BREV + SGXT.U32	2
brev.b64	2 UBREV + MOV	6
Copysign Instructions
copysign.f32	2 LOP3.LUT or 1.5 LOP3.LUT	4
copysign.f64	2 ULOP3.LUT + IMAD.U32 + MOV	6
And/Or/Xor Instructions
and.b16	LOP3.LUT or 1.5 LOP3.LUT	2
and.b32	LOP3.LUT	2
and.b64	ULOP3.LUT	2-3
Not Instructions
not.b16	LOP3.LUT	2
not.b32	LOP3.LUT	2
not.b64	2 ULOP3.LUT	4
Lop3 Instructions
lop3.b32	IMAD.MOV.U32 + LOP3.LUT	4
Cnot Instructions
cnot.b16 / cnot.b32	ULOP3.LUT+ISETP.EQ.U32.AND+SEL / UISETP.EQ.U32.AND+USEL	5 / 4
cnot.b64	multiple instructions	11
Bfe Instructions
bfe.s32/.u32	3*PRMT + 2 IMAD.MOV + SHF.R.U32.HI + SGXT/.U32	11
bfe.u64	UMOV + USHF.L.U32 + (UIADD3 + ULOP3.LUT)	5
bfe.s64	multiple instructions	14
Min/Max Instructions
min.u16	ULOP3.LUT + UISETP.LT.U32.AND + USEL	8
min.u32	IMNMX.U32	2
min.u64	UISETP.LT.U32.AND + 2*USEL	8
min.s16	PRMT + IMNMX	4
min.s32	IMNMX	2
min.s64	UISETP.LT.U32.AND + UISETP.LT.AND.EX + 2 USEL	8
min.f16	HMNMX2 + PRMT	4
min.f32	FMNMX	2
min.f64	DSETP.MIN.AND + IMAD.MOV.U32 + UMOV + FSEL	10
Neg Instructions
neg.s16	UIADD3 + UPRMT	5
neg.s32	IADD3	2
neg.s64	IMAD.MOV.U32 + HFMA2.MMA + MOV + UIADD3	10
neg.f32	FADD or IMAD.MOV.U32	2
neg.f64	DADD + (UMOV)	4
FMA Instructions
fma.rn.f16	HFMA2	2
fma.rn.f32	FFMA	2
fma.rn.f64	DFMA	4
Sqrt Instructions
sqrt.rn.f32	[multiple instrs including MUFU.RSQ]	190-235
sqrt.approx.f32	[multiple instrs including MUFU.SQRT]	2-18
sqrt.rn.f64	[multiple instrs including MUFU.RSQ64]	260-340
Rsqrt Instructions
rsqrt.approx.f32	[multiple instrs including MUFU.RSQ]	2-18
rsqrt.approx.f64	MUFU.RSQ64H	8-11
Rcp Instructions
rcp.rn.f32	[multiple instrs including MUFU.RCP]	198
rcp.approx.f32	[multiple instrs including MUFU.RCP]	23
rcp.rn.f64	[multiple instrs including MUFU.RCP64H]	244
Popc Instructions
popc.b32	POPC	6
popc.b64	2 UPOPC + UIADD3	7
Clz Instructions
clz.b32	FLO.U32 + IADD	7
clz.b64	UISETP.NE.U32.AND + USEL + UFLO.U32 + 2 UIADD3	13
Bfind Instructions
bfind.u32	FLO.U32	6
bfind.u64	FLO.U32 + ISETP.NE.U32.AND + IADD3 + BRA	164
bfind.s32	FLO	6
bfind.s64	multiple instructions	195
Testp Instructions
testp.normal.f32	IMAD.MOV.U32 + 2*ISETP.GE.U32.AND	0 or 6
testp.subnor.f32	ISETP.LT.U32.AND	0 or 6
testp.normal.f64	2 UISETP.LE.U32.AND + 2 UISETP.GE.U32.AND	13
testp.subnor.f64	UISETP.LT.U32.AND + 2 UISETP.GE.U32.AND.EX	8
Other Instructions
sin.approx.f32	FMUL + MUFU.SIN	8
cos.approx.f32	FMUL.RZ + MUFU.COS	8
lg2.approx.f32	FSETP.GEU.AND + FMUL + MUFU.LG2 + FADD	18
ex2.approx.f32	FSETP + FMUL + MUFU.EX2 + FMUL	14
ex2.approx.f16	MUFU.EX2.F16	6
tanh.approx.f32	MUFU.TANH	6
tanh.approx.f16	MUFU.TANH.F16	6
bar.warp.sync	NOP	changes
fns.b32	multiple instructions	79
cvt.rzi.s32.f32	F2I.TRUNC.NTZ	6
setp.ne.s32	ISETP.NE.AND	10
mov.u32 clock	CS2R.32	2
Bfi Instructions
bfi.b32	3 PRMT + 2 IMAD.MOV + SHF.L.U32 + BMSK + LOP3.LUT	11
bfi.b64	UMOV + USHF.L.U32 + (UIADD3 + ULOP3.LUT)	5
dp4a/dp2a Instructions
dp4a.u32.u32	IMAD.MOV.U32 + IDP.4A.U8.U8	135-170
dp2a.lo.u32.u32	IMAD.MOV.U32 + IDP.2A.LO.U16.U8	135-170

[GTC] Dissecting the Ampere Architecture: Notes

2026-05-08T00:00:00.000Z

My notes on the GTC 2021 talk Dissecting the Ampere GPU Architecture through Microbenchmarking. The research uses microbenchmarking to reveal internal details about Ampere's L2 cache design, atomic operations, fine-grained sparsity, and memory improvements over the V100.

Grouping Work by L2 Partition

Ampere features a split L2 design. Relative to each Streaming Multiprocessor (SM), the L2 cache is divided into a "near" partition and a "far" partition. Each partition holds 20 MB of space, totaling 40 MB across the GPU.

The hardware resolves cache line requests using the following hierarchy:

Check the L1 cache. If there is a miss, proceed to the L2 cache.
Check the near L2 partition. If there is a miss, query the far L2 partition.
Check the far L2 partition. If the data is found here, the cache line is moved/copied to the near partition. If it is a miss, the request goes to global memory.
Query global memory. Each memory address corresponds to one specific L2 partition. Note that when data is fetched from global memory, it goes to its assigned partition. The cache line does not automatically migrate to the near partition of the SM that requested it. This behavior will be demonstrated in later experiments.
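The lookup order above can be sketched as a toy model. Everything here is a simplification for illustration: caches are plain sets with no capacity or eviction, L1 fills are not modeled, a far hit is treated as a copy (not a move), and the `home_partition` mapping is a hypothetical stand-in for the hardware's address-to-partition assignment.

```python
def lookup(address, sm_partition, l1, l2, home_partition):
    """Return where a request is served from, updating the toy L2 state.

    l1: set of addresses cached in the requesting SM's L1 (read-only here).
    l2: list of two sets, one per L2 partition.
    home_partition: function mapping an address to its assigned partition.
    """
    near, far = sm_partition, 1 - sm_partition
    if address in l1:
        return "L1"
    if address in l2[near]:
        return "near L2"
    if address in l2[far]:
        l2[near].add(address)  # a far hit copies the line into the near partition
        return "far L2"
    # Miss everywhere: the line is filled into its assigned partition only.
    l2[home_partition(address)].add(address)
    return "global memory"


l1 = set()
l2 = [set(), set()]
home = lambda address: 1  # hypothetical: this address is assigned to partition 1
trace = [lookup(0x1000, 0, l1, l2, home) for _ in range(3)]
print(trace)  # ['global memory', 'far L2', 'near L2']
```

After the three accesses, the same line sits in both partitions, which is the duplication effect that reduces effective capacity.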

Failing to group work effectively introduces two major hazards:

Latency: There is a 1.75x slowdown when an SM has to access data from the far partition.
Cache Thrashing/Duplication: If data is constantly moved from the far partition to the near partition, the same cache line ends up duplicated, consuming valuable space in multiple partitions.

Developers usually want SMs to populate the cache cooperatively instead of thrashing it. Effective cache capacity can be doubled when accesses are managed well.

Case study: pairs of partition-aware blocks

The presentation author conducted an experiment by iterating through large chunks of global memory data and performing simple computations. They assigned a pair of thread blocks to operate on a chunk of data, ensuring the dataset size exceeded the L1 cache capacity to force L2 cache accesses. Two pairing combinations were tested:

One block on the near partition and one block on the far partition.
Both blocks on the near partition.

The results demonstrated that co-locating both blocks on the near partition yielded an 85% increase in throughput due to more efficient utilization of the L2 cache. The accompanying figure in the talk shows that when the requested data size exceeds 20 MB (the capacity of a single L2 partition), relying on the far partition causes a significant drop in throughput.

Hit latency (Access Cycles)

This experiment demonstrates the Non-Uniform Cache Access (NUCA) architecture of NVIDIA Ampere GPUs. The test involved a single thread within a block scanning a large global array twice, deliberately bypassing the L1 cache.

The first scan loaded the data from global memory into the L2 cache partition that corresponds to the physical address.
The second scan measured the actual L2 cache hit latency.

The results revealed a clear bimodal distribution, with access times grouping around either 200 cycles or 350 cycles.

This dual-peak pattern confirms that the L2 cache is physically partitioned across the massive GPU die. Data located in the near partition is retrieved faster than data residing in the far partition across the chip.
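With two well-separated peaks, each measured hit can be labeled near or far with a single threshold. The sketch below assumes a split at the midpoint between the two peaks; the threshold choice and the sample latencies are made up for illustration and are not measurements from the talk.

```python
NEAR_CYCLES = 200  # approximate near-partition peak
FAR_CYCLES = 350   # approximate far-partition peak
THRESHOLD = (NEAR_CYCLES + FAR_CYCLES) / 2  # assumption: split at the midpoint


def classify(latency_cycles: float) -> str:
    """Label one L2 hit as served by the near or the far partition."""
    return "near" if latency_cycles < THRESHOLD else "far"


# Made-up latency samples, for illustration only.
samples = [198, 203, 349, 352, 201, 347]
labels = [classify(s) for s in samples]
print(labels)
print(FAR_CYCLES / NEAR_CYCLES)  # 1.75, matching the far-partition slowdown
```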

Latency distribution over all SMs

The Setup: Two Groups, Two Segments

The author notes there are 108 SMs in total on the chip. They split these into two groups based on assumed physical location:

Group 0: SMs assumed to be physically near Partition 0. Let's say there are X SMs in this group.
Group 1: SMs assumed to be physically near Partition 1. Let's say there are Y SMs in this group.
We know that X + Y = 108.

The experiment dictates that Group 0 only reads "Memory Segment A" and Group 1 only reads "Memory Segment B".

The Variable: Tricking the Hardware Hash

By constantly changing the offset (the empty space between Segment A and Segment B), the author forces the GPU's hardware memory controller to randomly map those two segments into the two L2 cache partitions.

Because there are two memory segments and two cache partitions, there are only four possible ways the hardware can map them during any given run.

The Four Permutations (The Deduction)

By running the experiment many times with different offsets, the author captures all four possible mapping scenarios. Let's look at the expected "Far L2 Hits" for each:

Scenario 1: Perfect Match (0 Far Hits)
- Segment A maps to Partition 0. Segment B maps to Partition 1.
- Group 0 reads from its near partition. Group 1 reads from its near partition.
- Result: 0 SMs experience far hits. This matches the spike at 0 on the graph.
Scenario 2: Total Mismatch (108 Far Hits)
- Segment A maps to Partition 1. Segment B maps to Partition 0.
- Group 0 has to cross the chip. Group 1 has to cross the chip.
- Result: All 108 SMs experience far hits. This matches the spike at 108 on the graph (mentioned in text).
Scenario 3: Both Segments in Partition 0 (Y Far Hits)
- Segment A maps to Partition 0. Segment B maps to Partition 0.
- Group 0 gets a near hit (0 far hits). Group 1 has to cross the chip to get its data.
- Result: Only the SMs in Group 1 experience far hits. Therefore, the number of far hits equals Y.
Scenario 4: Both Segments in Partition 1 (X Far Hits)
- Segment A maps to Partition 1. Segment B maps to Partition 1.
- Group 0 has to cross the chip to get its data. Group 1 gets a near hit (0 far hits).
- Result: Only the SMs in Group 0 experience far hits. Therefore, the number of far hits equals X.

The Conclusion

Looking at the graph, the only two numbers that appear between 0 and 108 are 46 and 62.

Because Scenario 3 and Scenario 4 uniquely isolate the exact number of SMs in Group 1 (Y) and Group 0 (X), the author can definitively conclude that the two groups consist of 46 SMs and 62 SMs.

Furthermore, checking the math confirms the deduction: 46 + 62 = 108. This perfectly accounts for all SMs on the chip, proving the asymmetrical physical design of the Ampere GPU where one cache partition serves 62 SMs and the other serves 46.
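The deduction can be replayed in code: enumerate the four mappings, count the far hits for each, and recover the group sizes from the two interior counts. Which group gets 62 and which gets 46 is chosen arbitrarily below, since the notes only give the two sizes; the function names are mine.

```python
def far_hits(segment_a_partition, segment_b_partition, group0_size, group1_size):
    """Number of SMs that see far hits for one mapping of the two segments.

    Group 0 sits next to partition 0 and reads segment A;
    group 1 sits next to partition 1 and reads segment B.
    """
    hits = 0
    if segment_a_partition != 0:
        hits += group0_size
    if segment_b_partition != 1:
        hits += group1_size
    return hits


def deduce_group_sizes(observed_counts, total_sms):
    """Recover the two group sizes from the set of observed far-hit counts."""
    interior = sorted(c for c in set(observed_counts) if 0 < c < total_sms)
    if len(interior) != 2 or sum(interior) != total_sms:
        raise ValueError("expected two interior counts that sum to the SM total")
    return tuple(interior)


# Enumerate all four mappings (assigning 62 SMs to group 0 is arbitrary).
counts = [far_hits(a, b, 62, 46) for a in (0, 1) for b in (0, 1)]
print(sorted(counts))                   # [0, 46, 62, 108]
print(deduce_group_sizes(counts, 108))  # (46, 62)
```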

Global address space evenly distributed across L2 partitions

Finally, the author compared the access times of different global memory addresses at a finer granularity. They discovered that virtual addresses are mapped to L2 partitions iteratively, in 8 KB chunks. Furthermore, sub-slices of the L2 cache are directly associated with specific HBM memory controllers, corroborating the architectural descriptions provided in NVIDIA's official whitepaper.

Atomics

Shared memory atomics: a new, contention-free increment

Ampere introduces a powerful new atomic increment instruction: ATOMS.POPC.INC. Traditionally, atomic instructions can bottleneck execution; when multiple threads attempt to operate on a single memory address simultaneously, hardware contention occurs. This leads to thread stalls and a significant drop in throughput.

The author ran the experiment on multiple GPU generations and found that only Ampere's atomic increment avoids the throughput drop. However, adding any value other than 1 still relies on the legacy ATOMS.ADD operation, so contention penalties remain in that case.

Experiment Setup

To test this, the author designed a benchmark in which N threads within a 256-thread block atomically increment a shared memory location by an immediate value of 1.

The value of N was varied from 1 to 256,
and every group of N threads was assigned to sequential memory addresses.

All other atomics still suffer from contention

The author also designed experiments for other atomic operations.

When benchmarking other atomics, throughput consistently drops due to contention across both shared and global memory. Comparing Ampere's performance to older architectures reveals the following:

Shared Memory: Ampere performs slightly worse than the V100.
Global Memory: Ampere performs slightly better than the V100.
P4 and T4 GPUs: Interestingly, these older architectures yield the best throughput in these specific scenarios, primarily due to their higher base clock frequencies.

Sparse Matrix Multiplication

For a deep dive into Ampere's Sparse Matrix Multiplication capabilities, please refer to the official NVIDIA architecture whitepapers.

A100 vs V100

When comparing the A100 directly to its predecessor, the V100, several core architectural elements remain unchanged, while memory and caching have seen massive overhauls.

Key Similarities to V100

Instruction encoding
Dual-port registers
Shared memory latency behavior (specifically regarding bank conflicts)
Overall memory topology

Key Differences from V100

Bank conflicts on the A100 result in higher latency penalties than on the V100.

Memory Capacity

The A100 features significantly expanded memory structures across the board:

L0 Instruction Cache: 2.7x larger (verified via micro-benchmarking).
L1 & Shared Memory: Total combined capacity is 1.5x larger.
- Achieves 96% utilization of the theoretical L1 maximum.
- Allows 100% utilization of shared memory (with only 1 KiB reserved).
Unified L2 Cache: More than 6x larger overall, with each individual partition growing by more than 3x.
Global Memory: 2.5x larger capacity.

Memory Bandwidth

Bandwidth has been scaled up significantly to feed the larger caches and higher SM count:

Global Memory Bandwidth: 1.7x faster
- More than 1.4x increase in memory clock speed
L2 Memory Bandwidth: 2.6x faster
- More than 3% increase in graphics clock
Shared memory bandwidth: 1.4x faster
- Same as L1 which is co-located
- Proportional to increase in SM count from 80 to 108 and the increase in graphics clock
Observed-to-theoretical ratio is higher than on the V100
- Global memory and L1 reach 92% of the theoretical maximum
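The 1.4x shared memory figure above can be cross-checked against the two factors the talk names: the SM count and the graphics clock. The sketch takes the "more than 3%" clock increase at exactly its 3% lower bound, which is my assumption.

```python
V100_SMS = 80
A100_SMS = 108
CLOCK_GAIN = 1.03  # assumption: exactly the "more than 3%" lower bound

sm_ratio = A100_SMS / V100_SMS
expected_speedup = sm_ratio * CLOCK_GAIN
print(f"SM count ratio: {sm_ratio:.2f}")            # 1.35
print(f"Expected speedup: {expected_speedup:.2f}")  # 1.39
```

The product rounds to the reported 1.4x, so the shared memory bandwidth gain is consistent with scaling per SM and per clock.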

[WP] L2 Cache and DRAM Architecture: Summary

2026-05-08T00:00:00.000Z

This post summarizes basic architectural information about Device Memory and the L2 Cache from NVIDIA's whitepapers.

The global and local memory areas accessed by CUDA programs reside in HBM memory space, i.e., “device memory”.

Constant memory space resides in device memory and is cached in the constant cache.
Texture and surface memory spaces reside in device memory. They are cached in texture cache.
The Level 2 (L2) cache caches reads from and writes to HBM (device) memory. It services memory requests from various subsystems within the GPU.

HBM and L2 memory spaces are accessible to all SMs and all applications running on the GPU.

Device Memory (DRAM) Overview

	Ampere (SXM4)	Hopper (SXM5)	Hopper (PCIe)
DRAM	40GB (HBM2, 5 stacks, 8 memory dies per stack)	80GB (HBM3, 5 stacks)	80GB (HBM2e, 5 stacks)
Data Rate	1215 MHz DDR	2619 MHz DDR	1593 MHz DDR
Bandwidth	1555 GB/sec	3352 GB/sec	2039 GB/sec
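The bandwidth row follows from the data rate row once the memory bus width is known. The sketch below assumes a 5120-bit bus (five active HBM stacks at 1024 bits each); that width is not stated in the table above, so treat it as my assumption.

```python
BUS_WIDTH_BITS = 5 * 1024  # assumption: five active HBM stacks, 1024 bits each


def peak_bandwidth_gb_s(memory_clock_mhz: float) -> float:
    """Peak bandwidth in GB/s for a DDR memory clock given in MHz."""
    transfers_per_second = memory_clock_mhz * 1e6 * 2  # DDR: two transfers per clock
    bytes_per_transfer = BUS_WIDTH_BITS / 8
    return transfers_per_second * bytes_per_transfer / 1e9


for name, clock in [("Ampere (SXM4)", 1215), ("Hopper (SXM5)", 2619), ("Hopper (PCIe)", 1593)]:
    print(f"{name}: {peak_bandwidth_gb_s(clock):.0f} GB/s")
```

Under that assumption the three results round to 1555, 3352, and 2039 GB/s, matching the table.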

For more, please check the "H100 HBM and L2 Cache Memory Architectures" section of the Hopper Whitepaper and Hopper Architecture In-depth.

L2 Cache

	Ampere (SXM4)	Hopper (SXM5)	Hopper (PCIe)
Cache Size	40MB	50MB	50MB
Organization	The L2 cache is divided into two partitions to enable higher bandwidth and lower latency memory access. Each L2 partition localizes and caches data for memory accesses from SMs in the GPCs directly connected to the partition. Each L2 cache partition is divided into 40 L2 cache slices. Eight 512 KB L2 slices are associated with each memory controller.	Partitioned Crossbar but not necessarily 2-way split.	Partitioned Crossbar but not necessarily 2-way split.
Read Bandwidth	5120 Bytes/clk	Unknown	Unknown
Data Compression	The NVIDIA Ampere architecture adds Compute Data Compression to accelerate unstructured sparsity and other compressible data patterns.	Supported	Supported
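The Ampere numbers in the table are mutually consistent, which a few lines of arithmetic confirm. All inputs below come from the table; the derived values (slice count, controller count, bytes per clock per slice) are my own calculation, not quoted figures.

```python
L2_TOTAL_MB = 40
PARTITIONS = 2
SLICES_PER_PARTITION = 40
SLICES_PER_CONTROLLER = 8
READ_BYTES_PER_CLK = 5120

partition_mb = L2_TOTAL_MB / PARTITIONS
slice_kb = partition_mb * 1024 / SLICES_PER_PARTITION
total_slices = PARTITIONS * SLICES_PER_PARTITION
controllers = total_slices // SLICES_PER_CONTROLLER

print(partition_mb)                       # 20.0
print(slice_kb)                           # 512.0
print(controllers)                        # 10
print(READ_BYTES_PER_CLK / total_slices)  # 64.0
```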

For detailed info, please refer to whitepapers.
