Notes on Blackwell SM100: What Actually Changed for Kernel Writers
These are rough notes from spending a few weeks porting kernels to SM100. Not a comprehensive architecture guide — more like “here’s what actually mattered in practice.”
The big shifts
The headline features are TCGEN05 (the new tensor core instruction family), its UMMA matmul instructions (tcgen05.mma), and TMEM (tensor memory). If you’ve been writing kernels on Hopper, the mental model changes are significant but not insurmountable.
TMEM is the one that took the most getting used to. It’s a new level in the memory hierarchy: on-chip storage that sits alongside shared memory, but it’s allocated per CTA, and each warp can only touch its own fixed 32-lane slice of it. Think of it as an addressable register-file extension. Allocation and deallocation happen via PTX instructions (tcgen05.alloc / tcgen05.dealloc), which feels unusual if you’re used to everything being either registers or smem.
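As a mental model, here is a tiny sketch of TMEM addressing in plain Python. The lane-in-upper-16-bits / column-in-lower-16-bits encoding and the per-warp 32-lane restriction are from the PTX ISA’s tcgen05 description; the helper names themselves are hypothetical, not a real API.

```python
# Sketch (helper names are hypothetical): a 32-bit TMEM address packs
# the lane index into bits 31:16 and the column into bits 15:0, per
# the PTX ISA tcgen05 description. TMEM is 128 lanes x 512 columns.

def tmem_addr(lane: int, column: int) -> int:
    """Pack a (lane, column) pair into a 32-bit TMEM address."""
    assert 0 <= lane < 128 and 0 <= column < 512
    return (lane << 16) | column

def warp_lane_slice(warp_in_group: int) -> range:
    """Each warp may only access a fixed 32-lane slice of TMEM:
    warp 0 -> lanes 0-31, warp 1 -> lanes 32-63, and so on."""
    base = (warp_in_group % 4) * 32
    return range(base, base + 32)
```

So, for example, `warp_lane_slice(2)` covers lanes 64 through 95, which is why a warp can’t casually read a result another warp’s MMA wrote to a different lane slice.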
What you can mostly ignore
If you’re coming from Hopper, a lot of the CTA cluster stuff works similarly. DSMEM is still there. TMA descriptors are still there. The warp specialization patterns from Hopper carry over conceptually, though the details differ.
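To make “the patterns carry over conceptually” concrete, here is the producer/consumer shape of warp specialization sketched with Python threads. This is an analogy only: on the GPU the queue would be a shared-memory ring buffer guarded by mbarriers, the producer would issue TMA loads, and the consumers would issue MMAs.

```python
# Conceptual sketch only: the producer/consumer split behind warp
# specialization, modeled with Python threads instead of warps.
import queue
import threading

tiles = queue.Queue(maxsize=2)   # ~ a double-buffered smem pipeline
result = []

def producer(n_tiles):
    for i in range(n_tiles):     # ~ the TMA-load warp fetching tiles
        tiles.put(i)
    tiles.put(None)              # done sentinel

def consumer():
    while (tile := tiles.get()) is not None:   # ~ the MMA warps
        result.append(tile * tile)  # stand-in for per-tile math

t = threading.Thread(target=producer, args=(4,))
t.start()
consumer()
t.join()
```

The point is the division of labor, not the mechanics: one role keeps data moving, the other keeps the math units busy, and the bounded queue is what overlaps the two.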
CuTe DSL vs C++ CUTLASS
This is where I’ve been spending most of my time. The Python CuTe DSL (CUTLASS 4.x) is genuinely a different thing from the C++ CUTLASS templates. It’s more expressive for layout algebra — hierarchical shapes and strides, mixed-radix decomposition, compatibility checks — but the learning curve is real.
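The core idea behind that layout algebra can be sketched in a few lines of plain Python. This is a toy model, not the real CuTe API: a layout is a (shape, stride) pair where both may be nested tuples, and it maps a coordinate to a linear index by summing coordinate-times-stride over the leaves.

```python
# Toy model of CuTe-style layouts (not the real API): a layout is a
# (shape, stride) pair, both possibly nested tuples, mapping a
# multi-dimensional coordinate to a linear index leaf-wise.

def crd2idx(coord, shape, stride):
    """Map a (possibly nested) coordinate to a linear index."""
    if isinstance(shape, tuple):
        return sum(crd2idx(c, s, d) for c, s, d in zip(coord, shape, stride))
    return coord * stride

def size(shape):
    """Total number of elements in a (possibly nested) shape."""
    if isinstance(shape, tuple):
        total = 1
        for s in shape:
            total *= size(s)
        return total
    return shape

# Column-major 4x8 layout: shape (4, 8), stride (1, 4).
assert crd2idx((1, 2), (4, 8), (1, 4)) == 9

# Hierarchical (mixed-radix) mode: shape ((2, 2), 8), stride ((1, 16), 2).
assert crd2idx(((1, 1), 3), ((2, 2), 8), ((1, 16), 2)) == 1 + 16 + 6
assert size(((2, 2), 8)) == 32
```

The nested-tuple trick is what makes the algebra compose: a single “mode” of a layout can itself be a whole layout, which is how tiling and swizzling get expressed without special cases.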
More notes to come as I get deeper into Flash Attention 4 on this architecture.