
vpternlog{d,q} is one of those instructions that, once you understand what it does, makes you wish a time traveler had told Intel to put it into the original x86 ISA years ago.
vpternlog{d,q} is an AVX512F instruction (with 128- and 256-bit forms under AVX512VL) that implements ternary logic across its three operands, overwriting one of its inputs.
Note that this is a bit different from the notion of ternary logic that operates on trits; it is ternary in the sense that it performs three-operand logic, much like the ternary conditional operator.
An intuition to have about this instruction is that it takes one lane of bits from each of its three operands to form a 3-bit value, which then indexes a single bit within an 8-bit look-up-table value encoded in the instruction's immediate.
This 8-bit look-up operation can generalize many different three-operand boolean operations, such as combining several XOR, AND, NOT, and OR operations into a single instruction.
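To make that concrete, here is a scalar sketch of how a single bit position is processed (the helper name is hypothetical, not any real intrinsic):

```cpp
#include <cstdint>

// Scalar model of vpternlog for a single bit position: the bits from
// operands A, B, and C form a 3-bit index (A is the most significant),
// which selects one bit out of the 8-bit immediate look-up table.
std::uint8_t TernlogBit(
	std::uint8_t ABit, std::uint8_t BBit, std::uint8_t CBit, std::uint8_t Lut
)
{
	const std::uint8_t Index = (ABit << 2) | (BBit << 1) | CBit;
	return (Lut >> Index) & 1;
}
```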
The instruction itself is bitwise across the entire register, though the d (32-bit) and q (64-bit) specifiers of vpternlog{d,q} provide additional element-granular masking with a mask register:
Synopsis

```
__m128i _mm_ternarylogic_epi32 (__m128i a, __m128i b, __m128i c, int imm8)
#include <immintrin.h>
Instruction: vpternlogd xmm, xmm, xmm, imm8
CPUID Flags: AVX512F + AVX512VL
```

Description

Bitwise ternary logic that provides the capability to implement any three-operand binary function; the specific binary function is specified by value in imm8. For each bit in each packed 32-bit integer, the corresponding bit from a, b, and c are used according to imm8, and the result is written to the corresponding bit in dst.

Operation

```
DEFINE TernaryOP(imm8, a, b, c) {
	CASE imm8[7:0] OF
	0:   dst[0] := 0                 // imm8[7:0] := 0
	1:   dst[0] := NOT (a OR b OR c) // imm8[7:0] := NOT (_MM_TERNLOG_A OR _MM_TERNLOG_B OR _MM_TERNLOG_C)
	// ...
	254: dst[0] := a OR b OR c       // imm8[7:0] := _MM_TERNLOG_A OR _MM_TERNLOG_B OR _MM_TERNLOG_C
	255: dst[0] := 1                 // imm8[7:0] := 1
	ESAC
}
imm8[7:0] = LogicExp(_MM_TERNLOG_A, _MM_TERNLOG_B, _MM_TERNLOG_C)
FOR j := 0 to 3
	i := j*32
	FOR h := 0 to 31
		dst[i+h] := TernaryOP(imm8[7:0], a[i+h], b[i+h], c[i+h])
	ENDFOR
ENDFOR
dst[MAX:128] := 0
```
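The masked forms take an extra source operand and a mask; here is a minimal sketch (hypothetical wrapper name) assuming AVX512F + AVX512VL, using the classic bit-select LUT 0xCA:

```cpp
#include <immintrin.h>

// Minimal sketch of the masked form: lanes whose mask bit is 0 keep
// their value from `sel` untouched. Note `sel` is also the first logic
// operand (A), so 0xCA computes (A & B) | (~A & C): a bitwise select.
__m128i MaskedBitSelect(__m128i sel, __mmask8 k, __m128i b, __m128i c)
{
	return _mm_mask_ternarylogic_epi32(sel, k, b, c, 0xCA);
}
```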
Based on https://www.sandpile.org/x86/opc_3.htm, there was apparently an intent to implement vpternlogb and vpternlogw variants for 8-bit and 16-bit elements. sandpile.org also provides a helpful table of ternary logic operations with their matching vpternlog LUT values: https://www.sandpile.org/x86/ternlog.htm
Nvidia's PTX instruction set implements a similar instruction called LOP3.LUT for their GPUs.
Intel's Virtual Instruction Set Architecture, used for their GPUs, implements a similar instruction called BFN.
Compile time constants
Rather than spending time fumbling up the look-up-table bits yourself, you can express the operation directly in high-level code and generate the LUT bits at compile time!
GCC and Clang both provide the enum constants _MM_TERNLOG_{A,B,C} in avx512fintrin.h (included by immintrin.h), which let you express your boolean operation in a way that directly generates the 8-bit LUT value:
```cpp
const std::uint8_t TernLut = (_MM_TERNLOG_A | ~_MM_TERNLOG_B) & _MM_TERNLOG_C;
// = 0b10100010
// = 0xa2

// later...
... = _mm_ternarylogic_epi32(a, b, c, TernLut);
// vpternlogd a, b, c, 0xa2
```

```cpp
const std::uint8_t TernLut = ~(_MM_TERNLOG_A ^ _MM_TERNLOG_B) & _MM_TERNLOG_C;
// = 0b10000010
// = 0x82

// later...
... = _mm_ternarylogic_epi32(a, b, c, TernLut);
// vpternlogd a, b, c, 0x82
```
MSVC does not currently provide these constants, though, so you are probably better off defining them yourself for portability across compilers. It's just three integer constants!
I did something like this for both Dynarmic and Xenia when emitting the vpternlog{d,q} instructions for their x64 JIT backends:
```cpp
namespace Tern {
	constexpr std::uint8_t a = 0b11110000;
	constexpr std::uint8_t b = 0b11001100;
	constexpr std::uint8_t c = 0b10101010;
} // namespace Tern
```
With these constants, you can express your ternary operation directly within your C or C++ code, and the resulting compile-time constant will be the appropriate 8-bit look-up-table value for vpternlog{d,q} to perform the same operation upon its operands!
```cpp
//// This implements "(a | ~b) & c" as a single instruction!
const std::uint8_t TernLut = (Tern::a | ~Tern::b) & Tern::c;
// = 0b10100010
// = 0xa2
... = _mm_ternarylogic_epi32(a, b, c, TernLut);
// vpternlogd a, b, c, 0xa2
```

```cpp
//// This implements "~(a ^ b) & c" as a single instruction!
const std::uint8_t TernLut = ~(Tern::a ^ Tern::b) & Tern::c;
// = 0b10000010
// = 0x82
... = _mm_ternarylogic_epi32(a, b, c, TernLut);
// vpternlogd a, b, c, 0x82
```
Intel uses a similar mechanism of compile-time constants when explicitly emitting the bfn instruction with SYCL.
Signed Saturated arithmetic
Saturated arithmetic is pretty commonplace in image processing, digital signal processing, fixed-point arithmetic, safety-critical code, control algorithms, and anywhere else where a predictable underflow/overflow result is better than modular arithmetic's wrap-around.
Wrapping vs. Saturation on an underflowing/overflowing input signal
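For a scalar frame of reference, the difference between the two behaviors looks like this (just a sketch to illustrate; this widening-compare approach is not how the SIMD version below detects saturation):

```cpp
#include <cstdint>
#include <limits>

// Scalar reference: wrapping vs. saturating 32-bit signed addition.
std::int32_t WrappingAdd(std::int32_t a, std::int32_t b)
{
	// Unsigned arithmetic wraps around by definition (modulo 2^32)
	return static_cast<std::int32_t>(
		static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b)
	);
}

std::int32_t SaturatingAdd(std::int32_t a, std::int32_t b)
{
	const std::int64_t Sum = std::int64_t{a} + std::int64_t{b};
	if( Sum > std::numeric_limits<std::int32_t>::max() )
		return std::numeric_limits<std::int32_t>::max(); // 0x7FFFFFFF
	if( Sum < std::numeric_limits<std::int32_t>::min() )
		return std::numeric_limits<std::int32_t>::min(); // 0x80000000
	return static_cast<std::int32_t>(Sum);
}
// WrappingAdd(INT32_MAX, 1)   == INT32_MIN (wraps around)
// SaturatingAdd(INT32_MAX, 1) == INT32_MAX (clamps)
```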
Alright so where are the SIMD instructions…
On ARM, there are NEON instructions for SIMD vector addition and subtraction with the {s,u}qadd and {s,u}qsub instructions. They support signed and unsigned 8-bit, 16-bit, 32-bit, and 64-bit elements!
Learn the architecture - Neon programmers' guide: C.7. List of saturating instructions
On RISC-V, saturated vector operations are also supported by the V vector extension, specifically the vsadd{,u} and vssub{,u} instructions:
RISC-V V vector extension specification 1.0: Vector Single-Width Saturating Add and Subtract
How about on x64?
There is 8-bit and 16-bit saturation (padds{b,w} and psubs{b,w}), but no 32-bit or 64-bit saturation! In the absence of native 32-bit and 64-bit saturated SIMD instructions, we are left to implement them ourselves. Saturated arithmetic depends on doing a regular integer addition/subtraction and then determining whether saturation needs to occur, based on the sign bit (the most significant bit) of both input operands and of the intermediate result.
Using vpternlog{d,q}, we can concisely determine whether saturation needs to occur by utilizing the sign bit of each SIMD element; the rest of the bits of each element are just along for the ride. The resulting signed-saturation value is either an overflow (0x7FFFFF...) or an underflow (0x80000...).
In the case of addition, signed saturation occurs if…
- The sign bits of the two inputs are the same.
- The sign bit of the resulting sum differs from the inputs.
```
Signed Saturated Add: Saturation happens if...

a = Operand A sign bit
b = Operand B sign bit
c = Sum sign bit

Interpreted as binary operations:

"Input sign bits are the same"                    = ~(a ^ b)
"Result has a different sign bit from the inputs" = (a ^ c)

Saturation needed = ~(a ^ b) & (a ^ c)

TernlogLut = (~(_MM_TERNLOG_A ^ _MM_TERNLOG_B)) & (_MM_TERNLOG_A ^ _MM_TERNLOG_C)
```
| A (OpA sign bit) | B (OpB sign bit) | C (Sum sign bit) | LUT (Saturation is needed) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 (If Pos + Pos = Neg, a signed overflow occurred) |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 (If Neg + Neg = Pos, a signed underflow occurred) |
| 1 | 1 | 1 | 0 |
In the case of subtraction, signed saturation occurs if…
- The sign bits of the two inputs are different.
- The sign bit of the resulting difference differs from the first operand.
```
Signed Saturated Subtract: Saturation happens if...

a = Operand A sign bit
b = Operand B sign bit
c = Subtraction sign bit

Interpreted as binary operations:

"Input signs are different"                          = (a ^ b)
"Result has a different sign from the first operand" = (a ^ c)

Saturation needed = (a ^ b) & (a ^ c)

TernlogLut = (_MM_TERNLOG_A ^ _MM_TERNLOG_B) & (_MM_TERNLOG_A ^ _MM_TERNLOG_C)
```
| A (OpA sign bit) | B (OpB sign bit) | C (Subtraction sign bit) | LUT (Saturation is needed) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 (If Pos - Neg = Neg, a signed overflow occurred) |
| 1 | 0 | 0 | 1 (If Neg - Pos = Pos, a signed underflow occurred) |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 |
This results in the vpternlog{d,q} constants
0b01000010 for detecting the need to saturate lanes during addition and
0b00011000 for detecting the need to saturate lanes during subtraction.
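If you would rather derive these constants than trust them, a small compile-time sketch (the MakeTernLut helper is hypothetical, not part of any header) can enumerate all eight sign-bit combinations and verify both values:

```cpp
#include <cstdint>

// Build an 8-bit LUT by evaluating a 3-input predicate over all
// combinations; bit index is (a << 2) | (b << 1) | c, matching vpternlog.
template <typename Predicate>
constexpr std::uint8_t MakeTernLut(Predicate Pred)
{
	std::uint8_t Lut = 0;
	for( std::uint8_t i = 0; i < 8; ++i )
	{
		const bool a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
		if( Pred(a, b, c) ) Lut |= std::uint8_t(1u << i);
	}
	return Lut;
}

static_assert(
	MakeTernLut([](bool a, bool b, bool c) { return !(a ^ b) && (a ^ c); })
	== 0b01000010 // Addition LUT
);
static_assert(
	MakeTernLut([](bool a, bool b, bool c) { return (a ^ b) && (a ^ c); })
	== 0b00011000 // Subtraction LUT
);
```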
```cpp
// Detect if a signed overflow/underflow has occurred by using ternary
// logic on the sign bit of each element; other bits can be ignored.

// Addition:
... = _mm_ternarylogic_epi32(OpA, OpB, Addition, 0b01000010);
// vpternlogd OpA, OpB, Addition, 0b01000010

// Subtraction:
... = _mm_ternarylogic_epi32(OpA, OpB, Subtraction, 0b00011000);
// vpternlogd OpA, OpB, Subtraction, 0b00011000
```
With AVX512, this value can then be converted directly into an execution mask using _mm_movepi32_mask (vpmovd2m), which moves the most significant bit (the sign bit) of each element into a mask register.
With this write-mask of where underflows/overflows have occurred, we now need to write the correct overflow/underflow value into those lanes. Either an underflow (0x80000000) or an overflow (0x7FFFFFFF) value is written, depending on the sign bit of the sum. This can be achieved in several ways, depending on how much you want to touch memory or which execution ports you want to use among neighboring work. I'll show one way that emits a small number of instructions while avoiding additional masking and reading constants from memory.
The sign bit of the addition/subtraction result we calculated before can be repurposed directly to create the desired overflow/underflow value. Each lane will hold either a 0xxxxxx... or a 1xxxxxx... value, and these can be turned into either a 0x80000000 (underflow) or a 0x7FFFFFFF (overflow) respectively. We do this with an arithmetic shift to the right by 31 bits to "spread" the sign bit to all other bits. Then we flip the sign bit again to get either a 0x80000000 or a 0x7FFFFFFF value:
- Arithmetic-shift the addition/subtraction value to the right by 31 bits
  - Creates either a 0x00000000 or a 0xFFFFFFFF value in each lane
- Invert the sign bit
  - Creates either a 0x80000000 or a 0x7FFFFFFF value in each lane
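In scalar terms, the trick for one 32-bit lane looks like this (a sketch; it assumes arithmetic right-shift behavior for signed values, which mainstream compilers provide):

```cpp
#include <cstdint>

// Turn the sign bit of a wrapped result into the saturation value:
// negative result -> 0x7FFFFFFF (overflowed upward),
// non-negative result -> 0x80000000 (underflowed downward).
std::uint32_t SaturationValue(std::int32_t WrappedResult)
{
	// Arithmetic shift spreads the sign bit: 0x00000000 or 0xFFFFFFFF
	const std::uint32_t Spread =
		static_cast<std::uint32_t>(WrappedResult >> 31);
	// Flipping the top bit yields 0x80000000 or 0x7FFFFFFF
	return Spread ^ 0x80000000u;
}
```

The masked vector versions of these two steps are exactly the _mm_mask_srai_epi32 and _mm_mask_xor_epi32 calls in the code below.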
Both the saturated addition and subtraction implementations are done similarly; the only differences are the ternary look-up-table constant used and whether the intermediate result uses a 32-bit addition or subtraction with _mm_{add,sub}_epi32.
The 64-bit version is implemented the same way, with careful attention to the data types of the masks, arithmetic operations, and constants.
Anyways you’re probably here to copy-paste some code:
```cpp
// #include <immintrin.h>
// #include <cstdint>

__m128i _mm_adds_epi32(__m128i a, __m128i b)
{
	__m128i Result = _mm_add_epi32(a, b);

	// Resolves to 0b01000010 (0x42)
	const std::uint32_t Lut = 0b01000010;
	// Not supported on MSVC:
	// = (~(_MM_TERNLOG_A ^ _MM_TERNLOG_B)) // OpA and OpB are the same
	// & (_MM_TERNLOG_A ^ _MM_TERNLOG_C);   // OpA and OpC are different

	// Each element's sign bit is `1` if an overflow/underflow has occurred.
	// Other bits are just along for the ride
	const __m128i SaturationCheck = _mm_ternarylogic_epi32(a, b, Result, Lut);
	const __mmask8 SaturationMask = _mm_movepi32_mask(SaturationCheck);

	// "Spread" the sign bit to the rest of the bits, creating either
	// a `0x00000000` or a `0xFFFFFFFF`
	Result = _mm_mask_srai_epi32(Result, SaturationMask, Result, 31);
	// Flip the top-most bit, creating either a `0x80000000` or a `0x7FFFFFFF`
	Result = _mm_mask_xor_epi32(
		Result, SaturationMask, Result, _mm_set1_epi32(0x80000000u)
	);
	return Result;
}

__m128i _mm_subs_epi32(__m128i a, __m128i b)
{
	__m128i Result = _mm_sub_epi32(a, b);

	// Resolves to 0b00011000 (0x18)
	const std::uint32_t Lut = 0b00011000;
	// Not supported on MSVC:
	// = (_MM_TERNLOG_A ^ _MM_TERNLOG_B)  // OpA and OpB are different
	// & (_MM_TERNLOG_A ^ _MM_TERNLOG_C); // OpA and OpC are different

	// Each element's sign bit is `1` if an overflow/underflow has occurred.
	// Other bits are just along for the ride
	const __m128i SaturationCheck = _mm_ternarylogic_epi32(a, b, Result, Lut);
	const __mmask8 SaturationMask = _mm_movepi32_mask(SaturationCheck);

	// "Spread" the sign bit to the rest of the bits, creating either
	// a `0x00000000` or a `0xFFFFFFFF`
	Result = _mm_mask_srai_epi32(Result, SaturationMask, Result, 31);
	// Flip the top-most bit, creating either a `0x80000000` or a `0x7FFFFFFF`
	Result = _mm_mask_xor_epi32(
		Result, SaturationMask, Result, _mm_set1_epi32(0x80000000u)
	);
	return Result;
}

__m128i _mm_adds_epi64(__m128i a, __m128i b)
{
	// Same general idea as _mm_adds_epi32
	__m128i Result = _mm_add_epi64(a, b);
	const __m128i SaturationCheck =
		_mm_ternarylogic_epi64(a, b, Result, 0b01000010);
	const __mmask8 SaturationMask = _mm_movepi64_mask(SaturationCheck);
	Result = _mm_mask_srai_epi64(Result, SaturationMask, Result, 63);
	Result = _mm_mask_xor_epi64(
		Result, SaturationMask, Result, _mm_set1_epi64x(0x8000000000000000)
	);
	return Result;
}

__m128i _mm_subs_epi64(__m128i a, __m128i b)
{
	// Same general idea as _mm_subs_epi32
	__m128i Result = _mm_sub_epi64(a, b);
	const __m128i SaturationCheck =
		_mm_ternarylogic_epi64(a, b, Result, 0b00011000);
	const __mmask8 SaturationMask = _mm_movepi64_mask(SaturationCheck);
	Result = _mm_mask_srai_epi64(Result, SaturationMask, Result, 63);
	Result = _mm_mask_xor_epi64(
		Result, SaturationMask, Result, _mm_set1_epi64x(0x8000000000000000)
	);
	return Result;
}
```
GCC seems to emit the best assembly for the 32-bit case: it generates the 0x80000000 constant within SIMD registers, while Clang and MSVC reach out to memory for it. For the 64-bit case, all three compilers read the constant from memory. Depending on where this operation lands, you might want to use a different method to generate these constants.
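For instance, the all-ones-and-shift idiom that GCC picks can be written out explicitly with intrinsics if you would rather avoid the memory load (a sketch with a hypothetical helper name; whether it wins depends on surrounding port pressure):

```cpp
#include <immintrin.h>

// Materialize 0x80000000 in each 32-bit lane without touching memory:
// comparing a register with itself yields all-ones, and shifting left
// by 31 leaves only the sign bit set. A 64-bit variant shifts by 63.
__m128i SignBitConstant32()
{
	const __m128i Undef = _mm_undefined_si128();
	const __m128i AllOnes = _mm_cmpeq_epi32(Undef, Undef);
	return _mm_slli_epi32(AllOnes, 31);
}
```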
g++ 16.0.0 20251203
-march=skylake-avx512 -O2

```asm
"_mm_adds_epi32(long long __vector(2), long long __vector(2))":
        vpaddd     xmm2, xmm0, xmm1
        vpternlogd xmm0, xmm1, xmm2, 66
        vpmovd2m   k1, xmm0
        vpcmpeqd   xmm0, xmm0, xmm0
        vpslld     xmm0, xmm0, 31
        vpsrad     xmm2{k1}, xmm2, 31
        vpxord     xmm2{k1}, xmm2, xmm0
        vmovdqa    xmm0, xmm2
        ret
"_mm_subs_epi32(long long __vector(2), long long __vector(2))":
        vpsubd     xmm2, xmm0, xmm1
        vpternlogd xmm0, xmm1, xmm2, 24
        vpmovd2m   k1, xmm0
        vpcmpeqd   xmm0, xmm0, xmm0
        vpslld     xmm0, xmm0, 31
        vpsrad     xmm2{k1}, xmm2, 31
        vpxord     xmm2{k1}, xmm2, xmm0
        vmovdqa    xmm0, xmm2
        ret
"_mm_adds_epi64(long long __vector(2), long long __vector(2))":
        vpaddq     xmm2, xmm0, xmm1
        vpternlogq xmm0, xmm1, xmm2, 66
        vpmovq2m   k1, xmm0
        vpsraq     xmm2{k1}, xmm2, 63
        vpxorq     xmm2{k1}, xmm2, XMMWORD PTR .LC1[rip]
        vmovdqa    xmm0, xmm2
        ret
"_mm_subs_epi64(long long __vector(2), long long __vector(2))":
        vpsubq     xmm2, xmm0, xmm1
        vpternlogq xmm0, xmm1, xmm2, 24
        vpmovq2m   k1, xmm0
        vpsraq     xmm2{k1}, xmm2, 63
        vpxorq     xmm2{k1}, xmm2, XMMWORD PTR .LC1[rip]
        vmovdqa    xmm0, xmm2
        ret
.LC1:
        .quad -9223372036854775808
        .quad -9223372036854775808
```
clang version 22.0.0
-march=skylake-avx512 -O2

```asm
.LCPI0_0:
        .long 2147483647
.LCPI0_1:
        .long 2147483648
_mm_adds_epi32(long long vector[2], long long vector[2]):
        vpaddd       xmm2, xmm1, xmm0
        vpternlogd   xmm0, xmm1, xmm2, 66
        vpmovd2m     k1, xmm0
        vpcmpeqd     xmm0, xmm0, xmm0
        vpcmpgtd     k2, xmm2, xmm0
        vpbroadcastd xmm0, dword ptr [rip + .LCPI0_0]
        vpbroadcastd xmm0 {k2}, dword ptr [rip + .LCPI0_1]
        vmovdqa32    xmm2 {k1}, xmm0
        vmovdqa      xmm0, xmm2
        ret
.LCPI1_0:
        .long 2147483647
.LCPI1_1:
        .long 2147483648
_mm_subs_epi32(long long vector[2], long long vector[2]):
        vpsubd       xmm2, xmm0, xmm1
        vpternlogd   xmm0, xmm1, xmm2, 24
        vpmovd2m     k1, xmm0
        vpcmpeqd     xmm0, xmm0, xmm0
        vpcmpgtd     k2, xmm2, xmm0
        vpbroadcastd xmm0, dword ptr [rip + .LCPI1_0]
        vpbroadcastd xmm0 {k2}, dword ptr [rip + .LCPI1_1]
        vmovdqa32    xmm2 {k1}, xmm0
        vmovdqa      xmm0, xmm2
        ret
.LCPI2_0:
        .quad 9223372036854775807
.LCPI2_1:
        .quad -9223372036854775808
_mm_adds_epi64(long long vector[2], long long vector[2]):
        vpaddq       xmm2, xmm1, xmm0
        vpternlogq   xmm0, xmm1, xmm2, 66
        vpmovq2m     k1, xmm0
        vpcmpeqd     xmm0, xmm0, xmm0
        vpcmpgtq     k2, xmm2, xmm0
        vpbroadcastq xmm0, qword ptr [rip + .LCPI2_0]
        vpbroadcastq xmm0 {k2}, qword ptr [rip + .LCPI2_1]
        vmovdqa64    xmm2 {k1}, xmm0
        vmovdqa      xmm0, xmm2
        ret
.LCPI3_0:
        .quad 9223372036854775807
.LCPI3_1:
        .quad -9223372036854775808
_mm_subs_epi64(long long vector[2], long long vector[2]):
        vpsubq       xmm2, xmm0, xmm1
        vpternlogq   xmm0, xmm1, xmm2, 24
        vpmovq2m     k1, xmm0
        vpcmpeqd     xmm0, xmm0, xmm0
        vpcmpgtq     k2, xmm2, xmm0
        vpbroadcastq xmm0, qword ptr [rip + .LCPI3_0]
        vpbroadcastq xmm0 {k2}, qword ptr [rip + .LCPI3_1]
        vmovdqa64    xmm2 {k1}, xmm0
        vmovdqa      xmm0, xmm2
        ret
```
MSVC 19.44.35219
/arch:AVX512 /Ox

```asm
__xmm@80000000800000008000000080000000 DB 00H, 00H, 00H, 080H, 00H, 00H, 00H
        DB 080H, 00H, 00H, 00H, 080H, 00H, 00H, 00H, 080H
__xmm@80000000000000008000000000000000 DB 00H, 00H, 00H, 00H, 00H, 00H, 00H
        DB 080H, 00H, 00H, 00H, 00H, 00H, 00H, 00H, 080H
a$ = 8
b$ = 16
__m128i _mm_adds_epi32(__m128i,__m128i) PROC            ; _mm_adds_epi32
        vmovdqu    xmm1, XMMWORD PTR [rdx]
        vmovdqu    xmm2, XMMWORD PTR [rcx]
        vpaddd     xmm0, xmm1, xmm2
        vpternlogd xmm2, xmm1, xmm0, 66                 ; 00000042H
        vpmovd2m   k1, xmm2
        vpsrad     xmm0 {k1}, xmm0, 31
        vpxord     xmm0 {k1}, xmm0, XMMWORD PTR __xmm@80000000800000008000000080000000
        ret 0
__m128i _mm_adds_epi32(__m128i,__m128i) ENDP            ; _mm_adds_epi32
a$ = 8
b$ = 16
__m128i _mm_subs_epi32(__m128i,__m128i) PROC            ; _mm_subs_epi32
        vmovdqu    xmm1, XMMWORD PTR [rdx]
        vmovdqu    xmm2, XMMWORD PTR [rcx]
        vpsubd     xmm0, xmm2, xmm1
        vpternlogd xmm2, xmm1, xmm0, 24
        vpmovd2m   k1, xmm2
        vpsrad     xmm0 {k1}, xmm0, 31
        vpxord     xmm0 {k1}, xmm0, XMMWORD PTR __xmm@80000000800000008000000080000000
        ret 0
__m128i _mm_subs_epi32(__m128i,__m128i) ENDP            ; _mm_subs_epi32
a$ = 8
b$ = 16
__m128i _mm_adds_epi64(__m128i,__m128i) PROC            ; _mm_adds_epi64
        vmovdqu    xmm0, XMMWORD PTR [rdx]
        vmovdqu    xmm1, XMMWORD PTR [rcx]
        vpaddq     xmm2, xmm0, xmm1
        vpternlogq xmm1, xmm0, xmm2, 66                 ; 00000042H
        vpmovq2m   k1, xmm1
        mov        eax, 63                              ; 0000003fH
        vmovd      xmm1, eax
        vpsraq     xmm2 {k1}, xmm2, xmm1
        vpxorq     xmm2 {k1}, xmm2, XMMWORD PTR __xmm@80000000000000008000000000000000
        vmovdqu    xmm0, xmm2
        ret 0
__m128i _mm_adds_epi64(__m128i,__m128i) ENDP            ; _mm_adds_epi64
a$ = 8
b$ = 16
__m128i _mm_subs_epi64(__m128i,__m128i) PROC            ; _mm_subs_epi64
        vmovdqu    xmm0, XMMWORD PTR [rdx]
        vmovdqu    xmm1, XMMWORD PTR [rcx]
        vpsubq     xmm2, xmm1, xmm0
        vpternlogq xmm1, xmm0, xmm2, 24
        vpmovq2m   k1, xmm1
        mov        eax, 63                              ; 0000003fH
        vmovd      xmm1, eax
        vpsraq     xmm2 {k1}, xmm2, xmm1
        vpxorq     xmm2 {k1}, xmm2, XMMWORD PTR __xmm@80000000000000008000000000000000
        vmovdqu    xmm0, xmm2
        ret 0
__m128i _mm_subs_epi64(__m128i,__m128i) ENDP            ; _mm_subs_epi64
```