Introduction
Challenges for Kernel Devs
Developing highly optimized CUDA kernels for modern GPU architectures requires more than just algorithmic efficiency. It demands a precise understanding of how the underlying hardware translates and executes PTX instructions. To write truly hardware-centric, high-performance code, developers must be able to conceptualize how every line of CUDA C++ maps to physical execution units and memory hierarchies.
While official resources like the CUDA C++ Programming Guide, the Best Practices Guide, and architecture whitepapers provide an essential foundation, they purposefully abstract away low-level mechanisms and details. Consequently, developers seeking an in-depth understanding of the hardware are often forced to piece together fragmented information scattered across reverse-engineering analyses, disparate technical blogs, and sub-component specifications like HBM standards.
Primary Goal
This knowledge base is built to bridge that gap. It consolidates these widely dispersed, low-level hardware details into a single, structured resource. The objective is to
- provide CUDA or CUTLASS/CuTe developers with convenient hardware references
- and contribute to the broader HPC community.
A Note from the editor: This knowledge base is optimized for hardware details. Please refer to official CUDA docs for any mentioned programming concepts, such as CUDA Graphs and thread hierarchy.