This is a quirky way to generate 8-bit or 16-bit sign-masks
(0x808080...
, 0x800080008000...
) in an x86 SIMD-register using the
pavg{b,w} instructions without
touching memory. pavg
only supports an 8-bit form(pavgb
) and a 16-bit
form(pavgw
) unfortunately.
PAVGB/PAVGW — Average Packed Integers
Performs a SIMD average of the packed unsigned integers from the source operand (second operand) and the destination operand (first operand), and stores the results in the destination operand. For each corresponding pair of data elements in the first and second operands, the elements are added together, a 1 is added to the temporary sum, and that result is shifted right one bit position.
The (V)PAVGB instruction operates on packed unsigned bytes and the (V)PAVGW instruction operates on packed unsigned words.
The core algorithm of pavg*
operates on n
-bit elements, and creates an
n+1
-bit intermediate sum before bit-shifting to the right again into an
n
-bit average.
$$(a + b + 1) \gg 1$$
If you set b
to be the bitwise-inverse of a
(~a
), then the addition
effectively just becomes a bitwise-or operation as the individual bits will
never carry over into its neighbors. This generates a 0xFFF...
for the
intermediate n+1
-bit sum, leaving the most-significant-bit
clear(011111111...
):
$$(a + \neg a + 1) \gg 1$$
$$(\verb|0xFFF…| + 1) \gg 1$$
This intermediate 0xFFF...
gets incremented once, overflows all of the
0xFFF..
, and sets the most-significant-bit while clearing all of the others
which creates a 0x800...
in the intermediate sum. This sum is then
bit-shifted to the right once, causing each element to equal 0x80
or 0x8000
depending on which instruction you used.
$$(a + \lnot a + 1) \gg 1$$
$$(\verb|0xFFF…| + 1) \gg 1$$
$$(\verb|0x800…|) \gg 1$$
0x80
in the case ofpavgb
,0x8000
in the case ofpavgw
It does not matter what value a
is in this case. So long as you provide
pavg
a register and a bit-inverted version of that register, then you will
always get 0x80...
-elements as a result with this operation. This includes
just providing a register of 0x000...
(pxor xmm, xmm
) and a register of
0xFFF...
(pcmpeqd xmm, xmm
) which are also constants that can be generated
without touching memory.
Mathematically, this property makes a lot of sense. The “average” is the value found at the “mid-point” of its two inputs.
- The average between
0x00
and0xFF
is0x80
. - The average between
0x0F
and0xF0
is0x80
. - The average between
0xF0
and0x0F
is0x80
. - The average between
0xAA
and0x55
is0x80
.
Generating 64,32,16-bit sign-bit elements without touching memory is nothing new.
You could just do an element-wise logical-shift-left with a register of
all-ones(0xFFF...
) to set the upper bit of each element.
But the 8-bit version presented here is new as there is no trivial way to do an 8-bit logical bit-shift to the left within a simd register.
I suppose unless you include the gf2p8affineqb instruction provided by
GFNI
). But at that point, havinggf2p8affineqb
available means you can just fill a SIMD register with any arbitrary byte like so:_mm_gf2p8affine_epi64_epi8(_mm_setzero_si128(), _mm_setzero_si128(), 0x80); // 0x80808080... # pxor xmm0, xmm0 # gf2p8affineqb xmm0, xmm0, 0x80
pavgb
is a baseline SSE2 instruction, so it is available for all x86-64
platforms.
### Set all bits
pcmpeqd xmm0, xmm0 # 0xFFFFFFFFFFFF...
### Bit-shift to the left such that the only set bit is the most-significant-bit
# 8-bit
# There is no "psllb"!
pxor xmm1, xmm1 # vector of all-zeros
pavgb xmm0, xmm1 # 0x808080...
# 16-bit
# pxor xmm1, xmm1 # vector of all-zeros
# pavgw xmm0, xmm1 # 0x800080008000...
# Use this rather than the pavgw method
psllw xmm0, 15 # 0x800080008000...
# 32-bit
pslld xmm0, 31 # 0x800000008000000080000000...
# 64-bit
psllq xmm0, 63 # 0x80000000000000008000000000000000
I’ve redundantly included the 16-bit implementation to further illustrate this
pavg*
behavior and just in case this pavg(a, ~a) = 0x80...
-pattern calls
something out to the reader.
#include <immintrin.h>
// Creates a simd register of 0x80808080...
__m128i _mm_msb_epu8()
{
return _mm_avg_epu8(_mm_setzero_si128(), _mm_set1_epi8(0xFF));
}
// _mm_msb_epu8():
// pcmpeqd xmm1, xmm1
// pxor xmm0, xmm0
// pavgb xmm0, xmm1
// ret
// Creates a simd register of 0x800080008000...
// __m128i _mm_msb_epu16()
// {
// return _mm_avg_epu16(_mm_setzero_si128(), _mm_set1_epi8(0xFF));
// }
// _mm_msb_epu16():
// pcmpeqd xmm1, xmm1
// pxor xmm0, xmm0
// pavgw xmm0, xmm1
// ret