pavgb: most-significant-bit constant

June 24, 2022

This is a quirky way to generate 8-bit or 16-bit sign-masks (0x808080..., 0x800080008000...) in an x86 SIMD register using the pavg{b,w} instructions without touching memory. Unfortunately, pavg only comes in an 8-bit form (pavgb) and a 16-bit form (pavgw).

PAVGB/PAVGW — Average Packed Integers

Performs a SIMD average of the packed unsigned integers from the source operand (second operand) and the destination operand (first operand), and stores the results in the destination operand. For each corresponding pair of data elements in the first and second operands, the elements are added together, a 1 is added to the temporary sum, and that result is shifted right one bit position.

The (V)PAVGB instruction operates on packed unsigned bytes and the (V)PAVGW instruction operates on packed unsigned words.

The core algorithm of pavg* operates on n-bit elements, and creates an n+1-bit intermediate sum before bit-shifting to the right again into an n-bit average.

$$(a + b + 1) \gg 1$$
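As a rough scalar sketch of the formula above, one element of pavgb or pavgw can be modeled in plain C. The helper names `avg_round_u8`, `avg_round_u16`, and `avg_round_self_check` are made up for this illustration and are not intrinsics. The sketch only assumes the widened intermediate sum described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one pavgb element: (a + b + 1) >> 1. */
static uint8_t avg_round_u8(uint8_t a, uint8_t b)
{
    /* a and b are promoted to int, so the 9-bit intermediate sum is kept. */
    uint16_t sum = (uint16_t)(a + b + 1);
    return (uint8_t)(sum >> 1);
}

/* Hypothetical scalar model of one pavgw element. */
static uint16_t avg_round_u16(uint16_t a, uint16_t b)
{
    /* Widen explicitly so the 17-bit intermediate sum is kept. */
    uint32_t sum = (uint32_t)a + (uint32_t)b + 1u;
    return (uint16_t)(sum >> 1);
}

/* Small self-check of the model. */
static void avg_round_self_check(void)
{
    assert(avg_round_u8(1, 2) == 2);          /* 1.5 rounds up to 2 */
    assert(avg_round_u8(0xFF, 0xFF) == 0xFF); /* widened sum cannot overflow */
    assert(avg_round_u16(0, 0xFFFF) == 0x8000);
}
```

The `+ 1` is what makes the average round halves up instead of truncating them.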

If you set b to the bitwise inverse of a (~a), the addition effectively becomes a bitwise-or: no bit position is set in both inputs, so nothing ever carries into a neighboring bit. The low n bits of the n+1-bit intermediate sum are therefore all ones (0xFFF...), and its extra most-significant bit stays clear (011111111...):

$$(a + \neg a + 1) \gg 1$$

$$(\verb|0xFFF…| + 1) \gg 1$$

Adding the 1 to this intermediate 0xFFF... carries through every set bit, clearing all of them and setting only the most-significant bit of the n+1-bit intermediate sum. That is the 0x800... pattern (concretely 0x100 for pavgb and 0x10000 for pavgw). This sum is then shifted right by one bit, so each element equals 0x80 or 0x8000 depending on which instruction you used.

$$(a + \neg a + 1) \gg 1$$

$$(\verb|0xFFF…| + 1) \gg 1$$

$$(\verb|0x800…|) \gg 1$$

0x80 in the case of pavgb, 0x8000 in the case of pavgw

It does not matter what value a has in this case. So long as you give pavg a register and a bit-inverted copy of that register, you will always get elements of 0x80... as a result. This includes simply providing a register of 0x000... (pxor xmm, xmm) and a register of 0xFFF... (pcmpeqd xmm, xmm), both of which are constants that can also be generated without touching memory.

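Mathematically, this property makes a lot of sense. The “average” is the value found at the “mid-point” of its two inputs, and pavg rounds halves up, so the exact mid-point of 0x00 and 0xFF (127.5) becomes 0x80 rather than 0x7F.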

  • The average between 0x00 and 0xFF is 0x80.
  • The average between 0x0F and 0xF0 is 0x80.
  • The average between 0xF0 and 0x0F is 0x80.
  • The average between 0xAA and 0x55 is 0x80.
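The examples above are only a handful of inputs. As a hedged sketch, the claim can be checked exhaustively in scalar C for every 8-bit and 16-bit value of a. The function names below are hypothetical, and the code models the arithmetic of pavg rather than executing the instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if (a + ~a + 1) >> 1 == 0x80 for every 8-bit a, else 0. */
static int complement_avg_is_0x80(void)
{
    for (uint32_t a = 0; a <= 0xFFu; ++a) {
        uint32_t inv = ~a & 0xFFu;   /* 8-bit bitwise inverse of a */
        uint32_t sum = a + inv + 1u; /* 9-bit intermediate sum */
        if (sum != 0x100u || (sum >> 1) != 0x80u)
            return 0;
    }
    return 1;
}

/* Returns 1 if (a + ~a + 1) >> 1 == 0x8000 for every 16-bit a, else 0. */
static int complement_avg_is_0x8000(void)
{
    for (uint32_t a = 0; a <= 0xFFFFu; ++a) {
        uint32_t inv = ~a & 0xFFFFu; /* 16-bit bitwise inverse of a */
        uint32_t sum = a + inv + 1u; /* 17-bit intermediate sum */
        if (sum != 0x10000u || (sum >> 1) != 0x8000u)
            return 0;
    }
    return 1;
}

/* Small self-check that runs both exhaustive loops. */
static void complement_avg_self_check(void)
{
    assert(complement_avg_is_0x80());
    assert(complement_avg_is_0x8000());
}
```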

Generating 64-, 32-, and 16-bit sign-bit elements without touching memory is nothing new.

You can just do an element-wise logical shift left on a register of all-ones (0xFFF...), shifting by the element width minus one, so that only the upper bit of each element remains set.

But the 8-bit version presented here is new, as there is no trivial way to do an 8-bit logical shift left within a SIMD register.

Unless, I suppose, you include the gf2p8affineqb instruction provided by GFNI. But at that point, having gf2p8affineqb available means you can just fill a SIMD register with any arbitrary byte like so:

_mm_gf2p8affine_epi64_epi8(_mm_setzero_si128(), _mm_setzero_si128(), 0x80); // 0x80808080...

# pxor xmm0, xmm0
# gf2p8affineqb xmm0, xmm0, 0x80

pavgb on XMM registers is part of SSE2, which every x86-64 processor supports, so it is available on all x86-64 platforms.

### Set all bits
pcmpeqd xmm0, xmm0 # 0xFFFFFFFFFFFF...

### Each variant below starts from the all-ones xmm0 above and leaves only the most-significant bit of each element set

# 8-bit
# There is no "psllb"!
pxor    xmm1, xmm1 # vector of all-zeros
pavgb   xmm0, xmm1 # 0x808080...

# 16-bit
# pxor    xmm1, xmm1 # vector of all-zeros
# pavgw   xmm0, xmm1 # 0x800080008000...
# Use this rather than the pavgw method
psllw xmm0, 15 # 0x800080008000...

# 32-bit
pslld xmm0, 31 # 0x800000008000000080000000...

# 64-bit
psllq xmm0, 63 # 0x80000000000000008000000000000000

I’ve redundantly included the 16-bit implementation to further illustrate this pavg* behavior, in case the pavg(a, ~a) = 0x80... pattern sparks an idea for the reader.


#include <immintrin.h>

// Creates a simd register of 0x80808080...
__m128i _mm_msb_epu8()
{
	return _mm_avg_epu8(_mm_setzero_si128(), _mm_set1_epi8(-1)); // -1 sets all bits (0xFF)
}
// _mm_msb_epu8():
// 	pcmpeqd xmm1, xmm1
// 	pxor    xmm0, xmm0
// 	pavgb   xmm0, xmm1
// 	ret

// Creates a simd register of 0x800080008000...
// __m128i _mm_msb_epu16()
// {
// 	return _mm_avg_epu16(_mm_setzero_si128(), _mm_set1_epi8(-1)); // -1 sets all bits (0xFF)
// }
// _mm_msb_epu16():
// 	pcmpeqd xmm1, xmm1
// 	pxor    xmm0, xmm0
// 	pavgw   xmm0, xmm1
// 	ret
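
For completeness, the shift-based 16-, 32-, and 64-bit variants from the assembly listing can be sketched with SSE2 intrinsics too. The helper names here (`msb_mask_epi16`, `msb_mask_epi32`, `msb_mask_epi64`, `is_msb_mask`, `msb_mask_self_check`) are hypothetical and chosen only for this illustration. Whether a compiler emits exactly pcmpeqd followed by psllw/pslld/psllq depends on the compiler and its flags.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helpers: shift an all-ones register left by (width - 1). */
static __m128i msb_mask_epi16(void) { return _mm_slli_epi16(_mm_set1_epi32(-1), 15); }
static __m128i msb_mask_epi32(void) { return _mm_slli_epi32(_mm_set1_epi32(-1), 31); }
static __m128i msb_mask_epi64(void) { return _mm_slli_epi64(_mm_set1_epi32(-1), 63); }

/* Returns 1 if every `width`-byte element of v has only its top bit set. */
static int is_msb_mask(__m128i v, int width)
{
    uint8_t bytes[16];
    _mm_storeu_si128((__m128i *)bytes, v);
    for (int i = 0; i < 16; ++i) {
        /* x86 is little-endian: the top byte of an element is stored last. */
        uint8_t expect = (i % width == width - 1) ? 0x80 : 0x00;
        if (bytes[i] != expect)
            return 0;
    }
    return 1;
}

/* Small self-check, including the pavgb-based 8-bit mask for comparison. */
static void msb_mask_self_check(void)
{
    __m128i msb8 = _mm_avg_epu8(_mm_setzero_si128(), _mm_set1_epi32(-1));
    assert(is_msb_mask(msb8, 1));
    assert(is_msb_mask(msb_mask_epi16(), 2));
    assert(is_msb_mask(msb_mask_epi32(), 4));
    assert(is_msb_mask(msb_mask_epi64(), 8));
}
```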
