
vpternlog{d,q} is one of those instructions that, once you understand what it does, makes you wish a time traveler had told Intel to put it into the original x86 ISA years ago.
vpternlog{d,q} is an AVX512F instruction (with 128- and 256-bit forms under AVX512VL) that implements ternary logic across its three operands, overwriting one of its inputs.
Note that this is a bit different from the notion of ternary logic that operates on trits; it is ternary in the sense that it performs three-operand logic, much like the ternary conditional operator.
An intuition to have about this instruction is that it takes one lane of bits from each of its three operands to form a 3-bit value, which then indexes a single bit within an 8-bit look-up-table value encoded in the instruction's immediate.
This 8-bit look-up operation can generalize many different three-operand boolean operations, such as combining several XOR, AND, NOT, and OR operations into a single instruction.
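To make that concrete, here is a scalar sketch of how a single bit position is processed (the helper name is hypothetical, not any real intrinsic):

```cpp
#include <cstdint>

// Scalar model of vpternlog for a single bit position: the bits from
// operands A, B, and C form a 3-bit index (A is the most significant),
// which selects one bit out of the 8-bit immediate look-up table.
std::uint8_t TernlogBit(
	std::uint8_t ABit, std::uint8_t BBit, std::uint8_t CBit, std::uint8_t Lut
)
{
	const std::uint8_t Index = (ABit << 2) | (BBit << 1) | CBit;
	return (Lut >> Index) & 1;
}
```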
The instruction itself is bitwise across the entire register, though the d (32-bit) and q (64-bit) specifiers of vpternlog{d,q} provide additional element-granular masking with a mask register:
Synopsis

```
__m128i _mm_ternarylogic_epi32 (__m128i a, __m128i b, __m128i c, int imm8)
#include <immintrin.h>
Instruction: vpternlogd xmm, xmm, xmm, imm8
CPUID Flags: AVX512F + AVX512VL
```

Description

Bitwise ternary logic that provides the capability to implement any three-operand binary function; the specific binary function is specified by value in imm8. For each bit in each packed 32-bit integer, the corresponding bit from a, b, and c are used according to imm8, and the result is written to the corresponding bit in dst.

Operation

```
DEFINE TernaryOP(imm8, a, b, c) {
	CASE imm8[7:0] OF
	0:   dst[0] := 0                 // imm8[7:0] := 0
	1:   dst[0] := NOT (a OR b OR c) // imm8[7:0] := NOT (_MM_TERNLOG_A OR _MM_TERNLOG_B OR _MM_TERNLOG_C)
	// ...
	254: dst[0] := a OR b OR c       // imm8[7:0] := _MM_TERNLOG_A OR _MM_TERNLOG_B OR _MM_TERNLOG_C
	255: dst[0] := 1                 // imm8[7:0] := 1
	ESAC
}
imm8[7:0] = LogicExp(_MM_TERNLOG_A, _MM_TERNLOG_B, _MM_TERNLOG_C)
FOR j := 0 to 3
	i := j*32
	FOR h := 0 to 31
		dst[i+h] := TernaryOP(imm8[7:0], a[i+h], b[i+h], c[i+h])
	ENDFOR
ENDFOR
dst[MAX:128] := 0
```
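The masked forms take an extra source operand and a mask; here is a minimal sketch (hypothetical wrapper name) assuming AVX512F + AVX512VL, using the classic bit-select LUT 0xCA:

```cpp
#include <immintrin.h>

// Minimal sketch of the masked form: lanes whose mask bit is 0 keep
// their value from `sel` untouched. Note `sel` is also the first logic
// operand (A), so 0xCA computes (A & B) | (~A & C): a bitwise select.
__m128i MaskedBitSelect(__m128i sel, __mmask8 k, __m128i b, __m128i c)
{
	return _mm_mask_ternarylogic_epi32(sel, k, b, c, 0xCA);
}
```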
Based on https://www.sandpile.org/x86/opc_3.htm, there was apparently an intent to implement vpternlogb and vpternlogw variants for 8-bit and 16-bit elements. sandpile.org also provides a helpful table of ternary logic operations with their matching vpternlog LUT values: https://www.sandpile.org/x86/ternlog.htm
Nvidia's PTX instruction set implements a similar instruction called LOP3.LUT for their GPUs.
Intel's Virtual Instruction Set Architecture, used for their GPUs, implements a similar instruction called BFN.
Compile time constants
Rather than spending time fumbling up the look-up-table bits yourself, you can express the operation directly in high-level code and generate the LUT bits at compile time!
GCC and Clang both provide the enum constants _MM_TERNLOG_{A,B,C} in avx512fintrin.h (included by immintrin.h), which let you express your boolean operation in a way that directly generates the 8-bit LUT value:
```cpp
const std::uint8_t TernLut = (_MM_TERNLOG_A | ~_MM_TERNLOG_B) & _MM_TERNLOG_C;
// = 0b10100010
// = 0xa2

// later...
... = _mm_ternarylogic_epi32(a, b, c, TernLut);
// vpternlogd a, b, c, 0xa2
```

```cpp
const std::uint8_t TernLut = ~(_MM_TERNLOG_A ^ _MM_TERNLOG_B) & _MM_TERNLOG_C;
// = 0b10000010
// = 0x82

// later...
... = _mm_ternarylogic_epi32(a, b, c, TernLut);
// vpternlogd a, b, c, 0x82
```
MSVC does not currently provide these constants, though, so you are probably better off defining them yourself for portability across compilers. It's just three integer constants!
I did something like this for both Dynarmic and Xenia when emitting the vpternlog{d,q} instructions for their x64 JIT backends:
```cpp
namespace Tern {
	constexpr std::uint8_t a = 0b11110000;
	constexpr std::uint8_t b = 0b11001100;
	constexpr std::uint8_t c = 0b10101010;
} // namespace Tern
```
With these constants, you can express your ternary operation directly within your C or C++ code, and the resulting compile-time constant will be the appropriate 8-bit look-up-table value for vpternlog{d,q} to perform the same operation upon its operands!
```cpp
//// This implements "(a | ~b) & c" as a single instruction!
const std::uint8_t TernLut = (Tern::a | ~Tern::b) & Tern::c;
// = 0b10100010
// = 0xa2
... = _mm_ternarylogic_epi32(a, b, c, TernLut);
// vpternlogd a, b, c, 0xa2
```

```cpp
//// This implements "~(a ^ b) & c" as a single instruction!
const std::uint8_t TernLut = ~(Tern::a ^ Tern::b) & Tern::c;
// = 0b10000010
// = 0x82
... = _mm_ternarylogic_epi32(a, b, c, TernLut);
// vpternlogd a, b, c, 0x82
```
Intel uses a similar mechanism of compile-time constants when explicitly emitting the bfn instruction with SYCL.
Signed Saturated arithmetic
Saturated arithmetic is pretty commonplace in image processing, digital signal processing, fixed-point arithmetic, safety-critical code, control algorithms, and anywhere else where a predictable underflow/overflow result is better than modular arithmetic's wrap-around.
Wrapping vs. Saturation on an underflowing/overflowing input signal
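For a scalar frame of reference, the difference between the two behaviors looks like this (just a sketch to illustrate; this widening-compare approach is not how the SIMD version below detects saturation):

```cpp
#include <cstdint>
#include <limits>

// Scalar reference: wrapping vs. saturating 32-bit signed addition.
std::int32_t WrappingAdd(std::int32_t a, std::int32_t b)
{
	// Unsigned arithmetic wraps around by definition (modulo 2^32)
	return static_cast<std::int32_t>(
		static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b)
	);
}

std::int32_t SaturatingAdd(std::int32_t a, std::int32_t b)
{
	const std::int64_t Sum = std::int64_t{a} + std::int64_t{b};
	if( Sum > std::numeric_limits<std::int32_t>::max() )
		return std::numeric_limits<std::int32_t>::max(); // 0x7FFFFFFF
	if( Sum < std::numeric_limits<std::int32_t>::min() )
		return std::numeric_limits<std::int32_t>::min(); // 0x80000000
	return static_cast<std::int32_t>(Sum);
}
// WrappingAdd(INT32_MAX, 1)   == INT32_MIN (wraps around)
// SaturatingAdd(INT32_MAX, 1) == INT32_MAX (clamps)
```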
Alright so where are the SIMD instructions…
On ARM, there are NEON instructions for SIMD vector addition and subtraction with the {s,u}qadd and {s,u}qsub instructions. They support signed and unsigned 8-bit, 16-bit, 32-bit, and 64-bit elements!
Learn the architecture - Neon programmers' guide: C.7. List of saturating instructions
On RISC-V, saturated vector operations are also supported by the V vector extension, specifically the vsadd{,u} and vssub{,u} instructions:
RISC-V V vector extension specification 1.0: Vector Single-Width Saturating Add and Subtract
How about on x64?
There is 8-bit and 16-bit saturation (padds{b,w} and psubs{b,w}), but no 32-bit or 64-bit saturation! In the absence of native 32-bit and 64-bit saturated SIMD instructions, we are left to implement them ourselves. Saturated arithmetic depends on doing a regular integer addition/subtraction and then determining whether saturation needs to occur, based on the sign bit (the most significant bit) of both input operands and of the intermediate result.
Using vpternlog{d,q}, we can concisely determine whether saturation needs to occur by utilizing the sign bit of each SIMD element; the rest of the bits of each element are just along for the ride. The resulting signed-saturation value is either an overflow (0x7FFFFF...) or an underflow (0x80000...).
In the case of addition, signed saturation occurs if…
- The sign bits of the two inputs are the same.
- The sign bit of the resulting sum differs from the inputs.
```
Signed Saturated Add: Saturation happens if...

a = Operand A sign bit
b = Operand B sign bit
c = Sum sign bit

Interpreted as binary operations:

"Input sign bits are the same"                    = ~(a ^ b)
"Result has a different sign bit from the inputs" = (a ^ c)

Saturation needed = ~(a ^ b) & (a ^ c)

TernlogLut = (~(_MM_TERNLOG_A ^ _MM_TERNLOG_B)) & (_MM_TERNLOG_A ^ _MM_TERNLOG_C)
```
| A (OpA sign bit) | B (OpB sign bit) | C (Sum sign bit) | LUT (Saturation is needed) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 (If Pos + Pos = Neg, a signed overflow occurred) |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 (If Neg + Neg = Pos, a signed underflow occurred) |
| 1 | 1 | 1 | 0 |
In the case of subtraction, signed saturation occurs if…
- The sign bits of the two inputs are different.
- The sign bit of the resulting difference differs from the first operand.
```
Signed Saturated Subtract: Saturation happens if...

a = Operand A sign bit
b = Operand B sign bit
c = Subtraction sign bit

Interpreted as binary operations:

"Input signs are different"                          = (a ^ b)
"Result has a different sign from the first operand" = (a ^ c)

Saturation needed = (a ^ b) & (a ^ c)

TernlogLut = (_MM_TERNLOG_A ^ _MM_TERNLOG_B) & (_MM_TERNLOG_A ^ _MM_TERNLOG_C)
```
| A (OpA sign bit) | B (OpB sign bit) | C (Subtraction sign bit) | LUT (Saturation is needed) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 (If Pos - Neg = Neg, a signed overflow occurred) |
| 1 | 0 | 0 | 1 (If Neg - Pos = Pos, a signed underflow occurred) |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 |
This results in the vpternlog{d,q} constants
0b01000010 for detecting the need to saturate lanes during addition and
0b00011000 for detecting the need to saturate lanes during subtraction.
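If you would rather derive these constants than trust them, a small compile-time sketch (the MakeTernLut helper is hypothetical, not part of any header) can enumerate all eight sign-bit combinations and verify both values:

```cpp
#include <cstdint>

// Build an 8-bit LUT by evaluating a 3-input predicate over all
// combinations; bit index is (a << 2) | (b << 1) | c, matching vpternlog.
template <typename Predicate>
constexpr std::uint8_t MakeTernLut(Predicate Pred)
{
	std::uint8_t Lut = 0;
	for( std::uint8_t i = 0; i < 8; ++i )
	{
		const bool a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
		if( Pred(a, b, c) ) Lut |= std::uint8_t(1u << i);
	}
	return Lut;
}

static_assert(
	MakeTernLut([](bool a, bool b, bool c) { return !(a ^ b) && (a ^ c); })
	== 0b01000010 // Addition LUT
);
static_assert(
	MakeTernLut([](bool a, bool b, bool c) { return (a ^ b) && (a ^ c); })
	== 0b00011000 // Subtraction LUT
);
```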
```cpp
// Detect if a signed overflow/underflow has occurred by using ternary
// logic on the sign bit of each element; other bits can be ignored.

// Addition:
... = _mm_ternarylogic_epi32(OpA, OpB, Addition, 0b01000010);
// vpternlogd OpA, OpB, Addition, 0b01000010

// Subtraction:
... = _mm_ternarylogic_epi32(OpA, OpB, Subtraction, 0b00011000);
// vpternlogd OpA, OpB, Subtraction, 0b00011000
```
With AVX512, this value can then be converted directly into an execution mask using _mm_movepi32_mask (vpmovd2m), which moves the most significant bit (the sign bit) of each element into a mask register.
With this write-mask of where underflows/overflows have occurred, we now need to write the correct overflow/underflow value into those lanes. Either an underflow (0x80000000) or an overflow (0x7FFFFFFF) value is written, depending on the sign bit of the sum. This can be achieved in several ways, depending on how much you want to touch memory or which execution ports you want to use among neighboring work. I'll show one way that emits a small number of instructions while avoiding additional masking and reading constants from memory.
The sign bit of the addition/subtraction result we calculated before can be repurposed directly to create the desired overflow/underflow value. Each lane will hold either a 0xxxxxx... or a 1xxxxxx... value, and these can be turned into either a 0x80000000 (underflow) or a 0x7FFFFFFF (overflow) respectively. We do this with an arithmetic shift to the right by 31 bits to "spread" the sign bit to all other bits. Then we flip the sign bit again to get either a 0x80000000 or a 0x7FFFFFFF value:
- Arithmetic-shift the addition/subtraction value to the right by 31 bits
  - Creates either a 0x00000000 or a 0xFFFFFFFF value in each lane
- Invert the sign bit
  - Creates either a 0x80000000 or a 0x7FFFFFFF value in each lane
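In scalar terms, the trick for one 32-bit lane looks like this (a sketch; it assumes arithmetic right-shift behavior for signed values, which mainstream compilers provide):

```cpp
#include <cstdint>

// Turn the sign bit of a wrapped result into the saturation value:
// negative result -> 0x7FFFFFFF (overflowed upward),
// non-negative result -> 0x80000000 (underflowed downward).
std::uint32_t SaturationValue(std::int32_t WrappedResult)
{
	// Arithmetic shift spreads the sign bit: 0x00000000 or 0xFFFFFFFF
	const std::uint32_t Spread =
		static_cast<std::uint32_t>(WrappedResult >> 31);
	// Flipping the top bit yields 0x80000000 or 0x7FFFFFFF
	return Spread ^ 0x80000000u;
}
```

The masked vector versions of these two steps are exactly the _mm_mask_srai_epi32 and _mm_mask_xor_epi32 calls in the code below.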
Both the saturated addition and subtraction implementations are done similarly; the only differences are the ternary look-up-table constant used and whether the intermediate result uses a 32-bit addition or subtraction with _mm_{add,sub}_epi32.
The 64-bit version is implemented the same way, with careful attention to the data types of the masks, arithmetic operations, and constants.
Anyways you’re probably here to copy-paste some code:
```cpp
// #include <immintrin.h>
// #include <cstdint>

__m128i _mm_adds_epi32(__m128i a, __m128i b)
{
	__m128i Result = _mm_add_epi32(a, b);

	// Resolves to 0b01000010 (0x42)
	const std::uint32_t Lut = 0b01000010;
	// Not supported on MSVC:
	// = (~(_MM_TERNLOG_A ^ _MM_TERNLOG_B)) // OpA and OpB are the same
	// & (_MM_TERNLOG_A ^ _MM_TERNLOG_C);   // OpA and OpC are different

	// Each element's sign bit is `1` if an overflow/underflow has occurred.
	// Other bits are just along for the ride
	const __m128i SaturationCheck = _mm_ternarylogic_epi32(a, b, Result, Lut);
	const __mmask8 SaturationMask = _mm_movepi32_mask(SaturationCheck);

	// "Spread" the sign bit to the rest of the bits, creating either
	// a `0x00000000` or a `0xFFFFFFFF`
	Result = _mm_mask_srai_epi32(Result, SaturationMask, Result, 31);
	// Flip the top-most bit, creating either a `0x80000000` or a `0x7FFFFFFF`
	Result = _mm_mask_xor_epi32(
		Result, SaturationMask, Result, _mm_set1_epi32(0x80000000u)
	);
	return Result;
}

__m128i _mm_subs_epi32(__m128i a, __m128i b)
{
	__m128i Result = _mm_sub_epi32(a, b);

	// Resolves to 0b00011000 (0x18)
	const std::uint32_t Lut = 0b00011000;
	// Not supported on MSVC:
	// = (_MM_TERNLOG_A ^ _MM_TERNLOG_B)  // OpA and OpB are different
	// & (_MM_TERNLOG_A ^ _MM_TERNLOG_C); // OpA and OpC are different

	// Each element's sign bit is `1` if an overflow/underflow has occurred.
	// Other bits are just along for the ride
	const __m128i SaturationCheck = _mm_ternarylogic_epi32(a, b, Result, Lut);
	const __mmask8 SaturationMask = _mm_movepi32_mask(SaturationCheck);

	// "Spread" the sign bit to the rest of the bits, creating either
	// a `0x00000000` or a `0xFFFFFFFF`
	Result = _mm_mask_srai_epi32(Result, SaturationMask, Result, 31);
	// Flip the top-most bit, creating either a `0x80000000` or a `0x7FFFFFFF`
	Result = _mm_mask_xor_epi32(
		Result, SaturationMask, Result, _mm_set1_epi32(0x80000000u)
	);
	return Result;
}

__m128i _mm_adds_epi64(__m128i a, __m128i b)
{
	// Same general idea as _mm_adds_epi32
	__m128i Result = _mm_add_epi64(a, b);
	const __m128i SaturationCheck =
		_mm_ternarylogic_epi64(a, b, Result, 0b01000010);
	const __mmask8 SaturationMask = _mm_movepi64_mask(SaturationCheck);
	Result = _mm_mask_srai_epi64(Result, SaturationMask, Result, 63);
	Result = _mm_mask_xor_epi64(
		Result, SaturationMask, Result, _mm_set1_epi64x(0x8000000000000000)
	);
	return Result;
}

__m128i _mm_subs_epi64(__m128i a, __m128i b)
{
	// Same general idea as _mm_subs_epi32
	__m128i Result = _mm_sub_epi64(a, b);
	const __m128i SaturationCheck =
		_mm_ternarylogic_epi64(a, b, Result, 0b00011000);
	const __mmask8 SaturationMask = _mm_movepi64_mask(SaturationCheck);
	Result = _mm_mask_srai_epi64(Result, SaturationMask, Result, 63);
	Result = _mm_mask_xor_epi64(
		Result, SaturationMask, Result, _mm_set1_epi64x(0x8000000000000000)
	);
	return Result;
}
```
GCC seems to emit the best assembly for the 32-bit case: it generates the 0x80000000 constant within SIMD registers, while Clang and MSVC reach out to memory for it. For the 64-bit case, all three compilers read the constant from memory. Depending on where this operation lands, you might want to use a different method to generate these constants.
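For instance, the all-ones-and-shift idiom that GCC picks can be written out explicitly with intrinsics if you would rather avoid the memory load (a sketch with a hypothetical helper name; whether it wins depends on surrounding port pressure):

```cpp
#include <immintrin.h>

// Materialize 0x80000000 in each 32-bit lane without touching memory:
// comparing a register with itself yields all-ones, and shifting left
// by 31 leaves only the sign bit set. A 64-bit variant shifts by 63.
__m128i SignBitConstant32()
{
	const __m128i Undef = _mm_undefined_si128();
	const __m128i AllOnes = _mm_cmpeq_epi32(Undef, Undef);
	return _mm_slli_epi32(AllOnes, 31);
}
```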
g++ 16.0.0 20251203
-march=skylake-avx512 -O2

```asm
"_mm_adds_epi32(long long __vector(2), long long __vector(2))":
        vpaddd     xmm2, xmm0, xmm1
        vpternlogd xmm0, xmm1, xmm2, 66
        vpmovd2m   k1, xmm0
        vpcmpeqd   xmm0, xmm0, xmm0
        vpslld     xmm0, xmm0, 31
        vpsrad     xmm2{k1}, xmm2, 31
        vpxord     xmm2{k1}, xmm2, xmm0
        vmovdqa    xmm0, xmm2
        ret
"_mm_subs_epi32(long long __vector(2), long long __vector(2))":
        vpsubd     xmm2, xmm0, xmm1
        vpternlogd xmm0, xmm1, xmm2, 24
        vpmovd2m   k1, xmm0
        vpcmpeqd   xmm0, xmm0, xmm0
        vpslld     xmm0, xmm0, 31
        vpsrad     xmm2{k1}, xmm2, 31
        vpxord     xmm2{k1}, xmm2, xmm0
        vmovdqa    xmm0, xmm2
        ret
"_mm_adds_epi64(long long __vector(2), long long __vector(2))":
        vpaddq     xmm2, xmm0, xmm1
        vpternlogq xmm0, xmm1, xmm2, 66
        vpmovq2m   k1, xmm0
        vpsraq     xmm2{k1}, xmm2, 63
        vpxorq     xmm2{k1}, xmm2, XMMWORD PTR .LC1[rip]
        vmovdqa    xmm0, xmm2
        ret
"_mm_subs_epi64(long long __vector(2), long long __vector(2))":
        vpsubq     xmm2, xmm0, xmm1
        vpternlogq xmm0, xmm1, xmm2, 24
        vpmovq2m   k1, xmm0
        vpsraq     xmm2{k1}, xmm2, 63
        vpxorq     xmm2{k1}, xmm2, XMMWORD PTR .LC1[rip]
        vmovdqa    xmm0, xmm2
        ret
.LC1:
        .quad -9223372036854775808
        .quad -9223372036854775808
```
clang version 22.0.0
-march=skylake-avx512 -O2

```asm
.LCPI0_0:
        .long 2147483647
.LCPI0_1:
        .long 2147483648
_mm_adds_epi32(long long vector[2], long long vector[2]):
        vpaddd       xmm2, xmm1, xmm0
        vpternlogd   xmm0, xmm1, xmm2, 66
        vpmovd2m     k1, xmm0
        vpcmpeqd     xmm0, xmm0, xmm0
        vpcmpgtd     k2, xmm2, xmm0
        vpbroadcastd xmm0, dword ptr [rip + .LCPI0_0]
        vpbroadcastd xmm0 {k2}, dword ptr [rip + .LCPI0_1]
        vmovdqa32    xmm2 {k1}, xmm0
        vmovdqa      xmm0, xmm2
        ret
.LCPI1_0:
        .long 2147483647
.LCPI1_1:
        .long 2147483648
_mm_subs_epi32(long long vector[2], long long vector[2]):
        vpsubd       xmm2, xmm0, xmm1
        vpternlogd   xmm0, xmm1, xmm2, 24
        vpmovd2m     k1, xmm0
        vpcmpeqd     xmm0, xmm0, xmm0
        vpcmpgtd     k2, xmm2, xmm0
        vpbroadcastd xmm0, dword ptr [rip + .LCPI1_0]
        vpbroadcastd xmm0 {k2}, dword ptr [rip + .LCPI1_1]
        vmovdqa32    xmm2 {k1}, xmm0
        vmovdqa      xmm0, xmm2
        ret
.LCPI2_0:
        .quad 9223372036854775807
.LCPI2_1:
        .quad -9223372036854775808
_mm_adds_epi64(long long vector[2], long long vector[2]):
        vpaddq       xmm2, xmm1, xmm0
        vpternlogq   xmm0, xmm1, xmm2, 66
        vpmovq2m     k1, xmm0
        vpcmpeqd     xmm0, xmm0, xmm0
        vpcmpgtq     k2, xmm2, xmm0
        vpbroadcastq xmm0, qword ptr [rip + .LCPI2_0]
        vpbroadcastq xmm0 {k2}, qword ptr [rip + .LCPI2_1]
        vmovdqa64    xmm2 {k1}, xmm0
        vmovdqa      xmm0, xmm2
        ret
.LCPI3_0:
        .quad 9223372036854775807
.LCPI3_1:
        .quad -9223372036854775808
_mm_subs_epi64(long long vector[2], long long vector[2]):
        vpsubq       xmm2, xmm0, xmm1
        vpternlogq   xmm0, xmm1, xmm2, 24
        vpmovq2m     k1, xmm0
        vpcmpeqd     xmm0, xmm0, xmm0
        vpcmpgtq     k2, xmm2, xmm0
        vpbroadcastq xmm0, qword ptr [rip + .LCPI3_0]
        vpbroadcastq xmm0 {k2}, qword ptr [rip + .LCPI3_1]
        vmovdqa64    xmm2 {k1}, xmm0
        vmovdqa      xmm0, xmm2
        ret
```
MSVC 19.44.35219
/arch:AVX512 /Ox

```asm
__xmm@80000000800000008000000080000000 DB 00H, 00H, 00H, 080H, 00H, 00H, 00H
        DB 080H, 00H, 00H, 00H, 080H, 00H, 00H, 00H, 080H
__xmm@80000000000000008000000000000000 DB 00H, 00H, 00H, 00H, 00H, 00H, 00H
        DB 080H, 00H, 00H, 00H, 00H, 00H, 00H, 00H, 080H
a$ = 8
b$ = 16
__m128i _mm_adds_epi32(__m128i,__m128i) PROC            ; _mm_adds_epi32
        vmovdqu    xmm1, XMMWORD PTR [rdx]
        vmovdqu    xmm2, XMMWORD PTR [rcx]
        vpaddd     xmm0, xmm1, xmm2
        vpternlogd xmm2, xmm1, xmm0, 66                 ; 00000042H
        vpmovd2m   k1, xmm2
        vpsrad     xmm0 {k1}, xmm0, 31
        vpxord     xmm0 {k1}, xmm0, XMMWORD PTR __xmm@80000000800000008000000080000000
        ret 0
__m128i _mm_adds_epi32(__m128i,__m128i) ENDP            ; _mm_adds_epi32
a$ = 8
b$ = 16
__m128i _mm_subs_epi32(__m128i,__m128i) PROC            ; _mm_subs_epi32
        vmovdqu    xmm1, XMMWORD PTR [rdx]
        vmovdqu    xmm2, XMMWORD PTR [rcx]
        vpsubd     xmm0, xmm2, xmm1
        vpternlogd xmm2, xmm1, xmm0, 24
        vpmovd2m   k1, xmm2
        vpsrad     xmm0 {k1}, xmm0, 31
        vpxord     xmm0 {k1}, xmm0, XMMWORD PTR __xmm@80000000800000008000000080000000
        ret 0
__m128i _mm_subs_epi32(__m128i,__m128i) ENDP            ; _mm_subs_epi32
a$ = 8
b$ = 16
__m128i _mm_adds_epi64(__m128i,__m128i) PROC            ; _mm_adds_epi64
        vmovdqu    xmm0, XMMWORD PTR [rdx]
        vmovdqu    xmm1, XMMWORD PTR [rcx]
        vpaddq     xmm2, xmm0, xmm1
        vpternlogq xmm1, xmm0, xmm2, 66                 ; 00000042H
        vpmovq2m   k1, xmm1
        mov        eax, 63                              ; 0000003fH
        vmovd      xmm1, eax
        vpsraq     xmm2 {k1}, xmm2, xmm1
        vpxorq     xmm2 {k1}, xmm2, XMMWORD PTR __xmm@80000000000000008000000000000000
        vmovdqu    xmm0, xmm2
        ret 0
__m128i _mm_adds_epi64(__m128i,__m128i) ENDP            ; _mm_adds_epi64
a$ = 8
b$ = 16
__m128i _mm_subs_epi64(__m128i,__m128i) PROC            ; _mm_subs_epi64
        vmovdqu    xmm0, XMMWORD PTR [rdx]
        vmovdqu    xmm1, XMMWORD PTR [rcx]
        vpsubq     xmm2, xmm1, xmm0
        vpternlogq xmm1, xmm0, xmm2, 24
        vpmovq2m   k1, xmm1
        mov        eax, 63                              ; 0000003fH
        vmovd      xmm1, eax
        vpsraq     xmm2 {k1}, xmm2, xmm1
        vpxorq     xmm2 {k1}, xmm2, XMMWORD PTR __xmm@80000000000000008000000000000000
        vmovdqu    xmm0, xmm2
        ret 0
__m128i _mm_subs_epi64(__m128i,__m128i) ENDP            ; _mm_subs_epi64
```