Wunk

vpternlog: Signed Saturation
December 4, 2025
vpternlog{d,q} is one of those instructions that a time traveler should have told Intel to put into the original x86 ISA years ago once you understand what it does. vpternlog{d,q} is an AVX512F (and ideally with AVX512VL) instruction that implements ternary-logic across its three operands, overwriting one of its inputs. Note that this is a bit different from the understanding of ternary-logic that operates on trits, but it is ternary in the sense that it is doing three-operand logic much like the ternary conditional operator. …
vecint: Average Color
September 30, 2024
In a previous post, I used Intel’s AMX instructions intended for AI/ML use-cases to take the average color of an image. This was primarily a proof-of-concept since pedestrians like me generally don’t have access to Intel’s’ AMX-enabled hardware. The cost-of-entry for Intel’s Sapphire Rapids chips is pretty steep too. Maybe some day it will be ubiquitous in consumer-hardware and share a similar story as AVX-512. Pedestrians like me do have access to an Apple M2 Mini though, after some frustration with trying to sustain a development-environment with a MacOS VM: …
GPU Debug Scopes
September 17, 2024
Rendering APIs these days tend to capture their gpu workloads into a serialized form such as a command-buffer or command-list to be dispatched at a later time into a work-queue. Diagnostic tooling such as RenderDoc or Nsight-Graphics allows the disecting of these command-buffers, but it’s not very obvious to determine what is happening at a high level from the list of API commands alone: RenderDoc(Before) Nsight-Graphics(Before) Without any additional debugging information, RenderDoc and Nsight will show a flat list of command-buffer API-calls and will provide some filtering and categorization of these commands to help track down the ones that you care about. This process is slow, especially when working with multiple captures and need to draw some kind of comparisons between them. …
tdpbuud: Average Color
January 23, 2023
To go beyond vectors, Intel introduced Advanced Matrix eXtensions into their Sapphire Rapids processors, allowing massive matrix-operation instructions into the ISA. Apple also implemented a similar instruction set extension.1 Intel Architecture Day 2021 The shape of these matrices are configured by a ldtilecfg-instruction that refers to a 64-byte structure in memory defining the AMX register-state. The 64-byte tile-configration structure is pretty optimistic, allowing a total of 16 tiles to be configured and with 14 reserved bytes for future configuration data. …
vfixupimm: signum
October 10, 2022
Signum The signum function, also known as the sign function, is defined by the piecewise function: $$ signum(x) = \begin{cases} -1.0 &\text{if } x \lt 0.0 \\ 1.0 &\text{if } x \gt 0.0 \\ 0.0 &\text{if } x = 0.0 \end{cases} $$ Put simply, it gets the sign of the function. Returning 1.0 if it is positive, -1.0 if it is negative, and 0.0 if it is 0.0. But floating point values are weird. There is ±Infinity, ±0.0, QNaN, and SNaN to account for. An initial implementation might look something like this. You’ll likely find something like this on Google or StackOverflow. …
GL_EXT_fragment_shader_barycentric: Wireframe
July 27, 2022
A render of Derelict from Bungie’s Halo: Combat Evolved. Rendered using an experimental Vulkan-renderer I am making utilizing GL_EXT_fragment_shader_barycentric for rendering high-quality wireframes. GL_EXT_fragment_shader_barycentric is a more vendor-neutral version of both Nvidia’s GL_NV_fragment_shader_barycentric extension, and AMD’s GL_AMD_shader_explicit_vertex_parameter extension which allows fragment shaders to directly access the barycentric weights of the current sample. Coming to a GPU near you! Now that this feature is no longer exclusive to a particular vendor’s hardware, I figure now is a good time to finally write a cool use-case for this extension. Wireframe rendering! …
pavgb: most-significant-bit constant
June 24, 2022
This is a quirky way to generate 8-bit or 16-bit sign-masks (0x808080..., 0x800080008000...) in an x86 SIMD-register using the pavg{b,w} instructions without touching memory. pavg only supports an 8-bit form(pavgb) and a 16-bit form(pavgw) unfortunately. PAVGB/PAVGW — Average Packed Integers Performs a SIMD average of the packed unsigned integers from the source operand (second operand) and the destination operand (first operand), and stores the results in the destination operand. For each corresponding pair of data elements in the first and second operands, the elements are added together, a 1 is added to the temporary sum, and that result is shifted right one bit position. …
Memory-Size Literals
February 8, 2022
This is a very subjective “please use this C++ feature”-post where I try to convince you to use user-literals to solve a particular issue of concisely declaring memory sizes. You’ve probably seen or done stuff like this before in a code base when dealing with units of memory. const std::size_t BUFFER_SIZE = 1024 * 1024 * 8; ... const std::size_t ScratchSpace = 1000 * 1000 * 4; ... #define KILOBYTES * 1024UL #define MEGABYTES * 1048576UL #define GIGABYTES * 1073741824UL ... #define KILO * 1000 #define MEGA * (1000 * 1000) #define GIGA * (1000 * 1000 * 1000) ... #define KB(x) ((size_t) (x) << 10) #define MB(x) ((size_t) (x) << 20) ... const unsigned long KILOBYTE = 1024; const unsigned long MEGABYTE = 1024 * 1024; const unsigned long GIGABYTE = 1024 * 1024 * 1024; const auto BUFFER_SIZE = MEGABYTE * 8; I’ve certainly dealt with this enough to where I’ve wanted a much better pattern. …
gf2p8affineqb: int8 shifting
November 14, 2020
gf2p8affineqb is becoming a new favorite. Previously I have a short introduction to the instruction and showed an example of it being used to reverse all 128-bits within a vector-register by first reversing the bits within each bytes, and then reversing the bytes. For my next “trick”, I’ll implement some intrinsics that are curiously missing from the x86 ISA. Lets say you’re doing some cool SIMD work and while you’re designing your kernels, you find yourself needing to bit-shift some bytes around. Maybe you want to multiply your vector of 8-bit integers by a power of two or something. You want to do a logical shift left by some immediate value. So you have the Intel Intrinsics Guide open on the other screen like a good SIMD programmer and you curiously start typing in _mm_srli_epi, hoping to see _mm_srli_epi8 at some point. …
gf2p8affineqb: Bit reversal
November 14, 2020
gf2p8affineqb is the latest and one of the longest-named instructions of the x86 ISA, featured in the GFNI extension(but is pretty much paired with AVX512VL as well). At the moment, this instruction is only available in Sunny Cove(Icelake) based Intel architectures or newer by detecting CPUID (EAX=7,ECX=0):ECX[bit 08] or checking for the gfni flag in your /proc/cpuinfo. Synopsis __m128i _mm_gf2p8affine_epi64_epi8 (__m128i x, __m128i A, int b) #include <immintrin.h> Instruction: vgf2p8affineqb xmm, xmm, xmm, imm8 …

1
2