GL_NV_shader_sm_builtins is no longer in the Nvidia beta-driver-jail, and can allow shaders to access the index of the physical Streaming multi-processor and warp that a particular compute kernel dispatch is executing upon.

My laptop has a Nidia GTX 1650 Max-Q which features the `TU117`

Turing-architecture chip.

Image from anandtech.com

Here’s a deeper dive into the Turing **S**treaming-**M**ultiprocessors

The `GTX 1650`

is a special-case implementation of the Turing architecture that does not implement the “True Turing” video encoding engine or any of the new Tensor Cores or Ray Tracing Cores, but the `SM`

is generally the same, save for the absense of the die-space that these non-essential features would have taken up.

…each SM has a total of 64 FP32 Cores and 64 INT32 Cores…

The Turing SM is partitioned into four processing blocks, each with 16 FP32 Cores, 16 INT32 Cores, two Tensor Cores, one warp scheduler, and one dispatch unit.

So what’s a

warp?

A `warp`

is what Nvidia uses to describe a bundle of `32`

“GPU threads” or kernel-invocations all executing in lock-step together(they are all sharing a program-counter, with the exception of execution masks, and are all executing the same kernel) within an `SM`

. A turing `SM`

is capable of executing `32`

*warp s* together in parallel(that’s

`32`

* `32`

= `1024`

gpu-threads, in parallel within any given `SM`

).

- Vulkan/OpenCL calls it a
`subgroup`

- Nvidia calls it a
`warp`

- AMD calls it a
`wavefront`

The interesting part about always having subgroups of

`32`

threads always executing in parallel with each other is that you can have them each communicate, synchronize, and even share data with each other very quickly!

At any given moment, a single Streaming Multiprocessor is in the middle of executing `32`

*warps*(again, `1024`

threads), all at the same time. Depending on your architecture and CUDA compatibility, this number may differ a bit.

AMD hardware can chew through threads in groups of `64`

rather than `32`

, thus their `subgroupSize`

is `64`

rather than `32`

.

So, on my GTX 1650, we’re expecting `16`

**S**treaming-**M**ultiprocessors, and each `SM`

to have `32`

warps per `SM`

. And now with this extension we have a means to identify both the index of the `SM`

(`[0,15]`

) and the identifier of the * Warp executing within that SM*(

`[0,31]`

) within my particular GPU.Here, I’ve repurposed one of my Adobe After Effects plugins to write these new shader builtins into the Red and Green channels of the image. The intensity of the color channel indicates index of the value within its respective domain.

From the spec:

```
Overview
This extension adds the functionality of NV_shader_thread_group
that was not subsumed by KHR_shader_subgroup.
In particular, this extension adds new shader inputs to query
* the number of warps running on a SM (gl_WarpsPerSMNV),
* the number of SMs on the GPU (gl_SMCountNV),
* the warp id (gl_WarpIDNV), and
* the SM id (gl_SMIDNV).
```

Color Channel | Value |
---|---|

Red | `gl_SMIDNV` / `gl_SMCountNV` |

Green | `gl_WarpIDNV` / `gl_WarpsPerSMNV` |

Blue | 0 |

```
PixelColor.rgb = vec3(
gl_SMIDNV / float(gl_SMCountNV), // Red Channel
gl_WarpIDNV / float(gl_WarpsPerSMNV), // Green Channel
0
);
```

In Vulkan’s abstraction of these concepts, the GTX 1650 in my laptop exposes these details about itself.

`maxComputeWorkGroupInvocations`

is the maximum amount of invocations(read: threads) that a single workgroup can have. This matches up precisely with the fact that a single Turing-`SM`

can have a max of `1024`

resident-threads.
So for a 2-Dimensional compute dispatch, I want some `width * height = 1024`

workgroup-size to fully occupy the `SM`

with threads. Thankfully, most power-of-two numbers lend themselves to things like this. A `sqrt(1024)`

will get me both a width and a height. `sqrt(1024) = 32`

, so my workgroup size is `32 * 32 = 1024`

, which fully occupies each `SM`

with `1024`

threads. Now to dispatch enough 1024-thread-workgroups to account for every pixel of an image and ensuring that we emit an upper-bound in each dimension in case any of the image dimensions are smaller than `32`

or not an even multiple of `32`

.

`maxComputeWorkGroupInvocations`

may be different for your GPU.

```
...
const float OptimalWorkgroupSize2D = std::sqrt(maxComputeWorkGroupInvocations);
vkCmdDispatch( myCmdBuf,
std::uint32_t(std::ceil(imageWidth / OptimalWorkgroupSize2D)),
std::uint32_t(std::ceil(imageHeight / OptimalWorkgroupSize2D)),
1
);
...
```

So, lets do a 2D dispatch across the pixels of a blank `256x256`

image a bunch of items to see how work is being distributed between runs.

A quick re-cap:

A Turing-

`SM`

has`32`

warpsA warp has

`32`

threadsWhich means a Turing-

`SM`

can have up to`32 * 32 = 1024`

threads at any given momentVulkan exposes this value as

`maxComputeWorkGroupInvocations`

, the largest number of threads(invocations) that can sustain the concept of a`workgroup`

.To process a

`256x256`

image, and optimally utilize an`SM`

’s`1024`

-thread capabilities, we can distribute workgroups in 2D-groups of`32x32 = 1024`

threads because`32`

is the square-root of`1024`

(`maxComputeWorkGroupInvocations`

).

So for a `256x256`

image we would do a 2D dispatch of `8 * 8 = 64`

workgroups with each workgroup being the size `32x32x1`

. A total of `65536`

invocations for this particular dispatch

Dispatch Result | Red Channel | Green Channel |
---|---|---|

A lot’s happening here. I’ll zoom into a small 64x64 tile around the middle too to dissect what’s going on.

Dispatch Result | Red Channel | Green Channel |
---|---|---|

Dispatch Red Channel | Zoomed/Slowed down |
---|---|

A characteristic to notice in the Red channel of the full dispatch is that the first `16`

(this matches up precisely with the number of `SM`

we were expecting on this chip) tiles at the top of the 2D compute dispatch has a steady ascending change in brightness(`SM`

index is increasing).

The scheduler dispatched the `SM`

very determinately, then after that it’s total noise as the scheduler has to take over and re-assigns the `SM`

to work on another part of the dispatch based on which ever `SM`

becomes available. The GPU only has `16`

`SM`

and we asked it to do `64`

`SM`

-worth of work, it has to make do.

Dispatch Green Channel | Zoomed/Slowed down |
---|---|

In the Green channel, we see regular horizontal streaks where the `gl_WarpIDNV`

is consistent.

We see *rows* of `32`

pixels that all have the same pixel intensity, where a group of 32-threads are all bundled together into a warp under the same `gl_WarpIDNV`

.

We see *columns* of `32`

pixels, each of seemingly randomly ordered intensity but where where each row identifies a *unique* bundle of a *32-thread warp*.

The `SM`

is scheduling its `32`

-warp-capacity across the vertical axis of this `32x32`

tile, and each 32-thread-warp is taking up a full scanline on the horizontal axis.

Just for fun, I set the workgroup’s size to be `13`

rather than `32`

.
Meaning each work-group is now going to be of size `13 * 13 = 169`

and we will need to dispatch *more* workgroups to handle the entire `256x256`

image.

I picked `13`

since this is a nice and small prime number. Nothing will divide it evenly, especially in a world where hardware likes to work with power-of-two numbers.

So for a `256x256`

image we would do a 2D dispatch of `20 * 20 = 400`

workgroups with each workgroup being the size `13x13x1`

.
A total of `67600`

invocations for this particular dispatch.

Dispatch Result | Red Channel | Green Channel |
---|---|---|

And here it is zoomed in on a particular 4x4 section of workgroups.

Dispatch Result | Red Channel | Green Channel |
---|---|---|

This is a mess, but it’s actually not as bad as you think.

One

`SM`

can sustain`1024`

threads, and we have a workgroup size of`169`

, so each`SM`

can sustain6(`floor(1024 / 169 = 6.059171597633136)`

) of our weird`13x13x1`

workgroups.One of these

`13x13x1`

workgroups will take the capacity of6(`ceil( 32 / 6 = 5.333333333333333)`

) warps.So at any given moment an

`SM`

will have5workgroups, each taking6warps each, utilizing30(`5 * 6 = 30`

) of the`32`

-warp-capacity of the SM.At

`30/32`

warp-usage,`2`

of the`32`

warps of an`SM`

will be unused. This is a under-utilization.

Surprisingly though, that means that we are utilizing **93.75%** of the `SM`

’s full capacity(`30/32 = 0.9375`

). Which isn’t so bad, but we still had to dispatch `20x20=400`

workgroups rather than the more-ideal `8x8=64`

workgroups.

This is an example of what Nvidia calls *occupancy*.

Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.

WorkgroupSize | |
---|---|

1024 | |

169 |

So if there is one thing to take away from all this it’s:

To achieve full chip-utilization in Vulkan:

Your workgroup size(

`x * y * z`

) should be some multiple of your device’s`subgroupSize`

(Intel/Nvidia:`32`

, AMD:`64`

), but no greater than`maxComputeWorkGroupInvocations`

.