-
tdpbuud: Average Color
January 23, 2023
To go beyond vectors, Intel introduced Advanced Matrix eXtensions into their Sapphire Rapids processors, allowing massive matrix-operation instructions into the ISA. Apple also implemented a similar instruction set extension.1 Intel Architecture Day 2021 The shape of these matrices are configured by a ldtilecfg -instruction that refers to a 64-byte structure in memory defining the AMX register-state. The 64-byte tile-configration structure is pretty optimistic, allowing a total of 16 tiles to be configured and with 14 reserved bytes for future configuration data.…
-
vfixupimm: signum
October 10, 2022
Signum The signum function, also known as the sign function, is defined by the piecewise function: $$ signum(x) = \begin{cases} -1.0 &\text{if } x \lt 0.0 \\ 1.0 &\text{if } x \gt 0.0 \\ 0.0 &\text{if } x = 0.0 \end{cases} $$ Put simply, it gets the sign of the function. Returning 1.0 if it is positive, -1.0 if it is negative, and 0.0 if it is 0.0. But floating point values are weird.…
-
GL_EXT_fragment_shader_barycentric: Wireframe
July 27, 2022
A render of Derelict from Bungie’s Halo: Combat Evolved. Rendered using an experimental Vulkan-renderer I am making utilizing GL_EXT_fragment_shader_barycentric for rendering high-quality wireframes. GL_EXT_fragment_shader_barycentric is a more vendor-neutral version of both Nvidia’s GL_NV_fragment_shader_barycentric extension, and AMD’s GL_AMD_shader_explicit_vertex_parameter extension which allows fragment shaders to directly access the barycentric weights of the current sample. Coming to a GPU near you! Now that this feature is no longer exclusive to a particular vendor’s hardware, I figure now is a good time to finally write a cool use-case for this extension.…
-
pavgb: most-significant-bit constant
June 24, 2022
This is a quirky way to generate 8-bit or 16-bit sign-masks (0x808080..., 0x800080008000...) in an x86 SIMD-register using the pavg{b,w} instructions without touching memory. pavg only supports an 8-bit form(pavgb) and a 16-bit form(pavgw) unfortunately. PAVGB/PAVGW — Average Packed Integers Performs a SIMD average of the packed unsigned integers from the source operand (second operand) and the destination operand (first operand), and stores the results in the destination operand. For each corresponding pair of data elements in the first and second operands, the elements are added together, a 1 is added to the temporary sum, and that result is shifted right one bit position.…
-
Memory-Size Literals
February 8, 2022
This is a very subjective “please use this C++ feature”-post where I try to convince you to use user-literals to solve a particular issue of concisely declaring memory sizes. You’ve probably seen or done stuff like this before in a code base when dealing with units of memory. const std::size_t BUFFER_SIZE = 1024 * 1024 * 8; ... const std::size_t ScratchSpace = 1000 * 1000 * 4; ... #define KILOBYTES * 1024UL #define MEGABYTES * 1048576UL #define GIGABYTES * 1073741824UL .…
-
gf2p8affineqb: int8 shifting
November 14, 2020
gf2p8affineqb is becoming a new favorite. Previously I have a short introduction to the instruction and showed an example of it being used to reverse all 128-bits within a vector-register by first reversing the bits within each bytes, and then reversing the bytes. For my next “trick”, I’ll implement some intrinsics that are curiously missing from the x86 ISA. Lets say you’re doing some cool SIMD work and while you’re designing your kernels, you find yourself needing to bit-shift some bytes around.…
-
gf2p8affineqb: Bit reversal
November 14, 2020
gf2p8affineqb is the latest and one of the longest-named instructions of the x86 ISA, featured in the GFNI extension(but is pretty much paired with AVX512VL as well). At the moment, this instruction is only available in Sunny Cove (Icelake) based Intel architectures or newer by detecting CPUID (EAX=7,ECX=0):ECX[bit 08] or checking for the gfni flag in your /proc/cpuinfo. Synopsis __m128i _mm_gf2p8affine_epi64_epi8 (__m128i x, __m128i A, int b) #include <immintrin.h> Instruction: vgf2p8affineqb xmm, xmm, xmm, imm8…
-
pclmulqdq Tricks
May 10, 2020
Carry-less products find their natural habitat in finite field arithmetic, common in error checking codes and in encryption and checksums . Finite field arithmetic(aka Galois Fields ) is very different from the usual addition and multiplication rules that you know, especially GF(2)(Binary Galois fields) which are very easy to implement with pre-existing binary computers. Galois(pronounced gal-wah) fields are named after Évariste Galois . A very young French activist that was also a very talented mathematician that died at the young age of 20 due to a duel.…
-
Visualizing GL_NV_shader_sm_builtins
February 13, 2020
GL_NV_shader_sm_builtins is no longer in the Nvidia beta-driver-jail, and can allow shaders to access the index of the physical Streaming multi-processor and warp that a particular compute kernel dispatch is executing upon. My laptop has a Nidia GTX 1650 Max-Q which features the TU117 Turing-architecture chip. Image from anandtech.com Here’s a deeper dive into the Turing Streaming-Multiprocessors From the Nvidia Turing Architecture Whitepaper The GTX 1650 is a special-case implementation of the Turing architecture that does not implement the “True Turing” video encoding engine or any of the new Tensor Cores or Ray Tracing Cores , but the SM is generally the same, save for the absense of the die-space that these non-essential features would have taken up.…
-
Online Compilers and __cpuid
February 12, 2020
Here is some very minimal code I wrote that gets your (x86) processor’s brand name using the cpuid instruction. #include <cpuid.h> #include <cstdio> #include <cstdint> int main() { uint32_t Name[4]; for( uint32_t i = 2; i < 5; ) { __cpuid(0x80000000 | i++, Name[0], Name[1], Name[2], Name[3]); fwrite(Name, 16, 1, stdout); } } main: push rbp mov ebp, 2 push rbx sub rsp, 24 .L2: mov eax, ebp mov esi, 16 add ebp, 1 mov rdi, rsp or eax, -2147483648 >>>> cpuid mov DWORD PTR [rsp+8], ecx mov rcx, QWORD PTR stdout[rip] mov DWORD PTR [rsp+12], edx mov edx, 1 mov DWORD PTR [rsp], eax mov DWORD PTR [rsp+4], ebx call fwrite cmp ebp, 5 jne .…