gf2p8affineqb: int8 shifting

gf2p8affineqb is becoming a new favorite. Previously I have a short introduction to the instruction and showed an example of it being used to reverse all 128-bits within a vector-register by first reversing the bits within each bytes, and then reversing the bytes.

For my next “trick”, I’ll implement some intrinsics that are curiously missing from the x86 ISA.

Lets say you’re doing some cool SIMD work and while you’re designing your kernels, you find yourself needing to bit-shift some bytes around. Maybe you want to multiply your vector of 8-bit integers by a power of two or something. You want to do a logical shift left by some immediate value. So you have the Intel Intrinsics Guide open on the other screen like a good SIMD programmer and you curiously start typing in _mm_srli_epi, hoping to see _mm_srli_epi8 at some point.

Yep, no _mm_srli_epi8, no _mm_slli_epi8, not even a _mm_srai_epi8. epi16/32/64 gets plenty of ISA attention, but epi8 is pretty neglected.

Lets change that

With gf2p8affineqb, we can calculate matrices ahead of time or at runtime to transform each byte to accomplish a bit-shift.

An identity matrix, that just takes in the bits and puts them where they were originally, is just the 64-bit constant 0x0102040810204080.

      [ 0 0 0 0 0 0 0 1 ] < Byte 7, Bit 0
      [ 0 0 0 0 0 0 1 0 ] < Byte 6, Bit 1
      [ 0 0 0 0 0 1 0 0 ] < Byte 5, Bit 2
 X' = [ 0 0 0 0 1 0 0 0 ] < Byte 4, Bit 3  *  X + 0
      [ 0 0 0 1 0 0 0 0 ] < Byte 3, Bit 4
      [ 0 0 1 0 0 0 0 0 ] < Byte 2, Bit 5
      [ 0 1 0 0 0 0 0 0 ] < Byte 1, Bit 6
      [ 1 0 0 0 0 0 0 0 ] < Byte 0, Bit 7

0b00000001'00000010'00000100'00001000'00010000'00100000'01000000'10000000
0x0102040810204080

In typical matrix fashion, each of the rows of the matrix determines how the resulting bits of the new 8-bit integer is defined. The bytes are defined in reverse order though, as the definition of the function defines its byte-indexing in reverse(7 - i) order. Byte 0 defines how bit-7 is built, byte 1 defines how bit-6 is built, byte 2 defines how bit-5 is built, and so on.

So, if we wanted to simulate a single logical shift to the left by one bit as a matrix operation, the bit-0 row has to result in ‘0’, the bit-1 row has to retrieve the incoming-bit-0, the bit-2 row has to retrieve the incoming-bit-1, and so on:

A new matrix, that defines a logical shift to the left by 1 bit.
      [ 0 0 0 0 0 0 0 0 ] < Byte 7, Bit 0
      [ 0 0 0 0 0 0 0 1 ] < Byte 6, Bit 1
      [ 0 0 0 0 0 0 1 0 ] < Byte 5, Bit 3
 X' = [ 0 0 0 0 0 1 0 0 ] < Byte 4, Bit 3  *  X + 0
      [ 0 0 0 0 1 0 0 0 ] < Byte 3, Bit 4
      [ 0 0 0 1 0 0 0 0 ] < Byte 2, Bit 5
      [ 0 0 1 0 0 0 0 0 ] < Byte 1, Bit 6
      [ 0 1 0 0 0 0 0 0 ] < Byte 0, Bit 7

0b00000000'00000001'00000010'00000100'00001000'00010000'00100000'01000000
0x0001020408102040

Here’s one to the right by 2 bits too:

A new matrix, that defines an arithmetic shift to the right by 2 bits
      [ 0 0 0 0 0 1 0 0 ] < Byte 7, Bit 0
      [ 0 0 0 0 1 0 0 0 ] < Byte 6, Bit 1
      [ 0 0 0 1 0 0 0 0 ] < Byte 5, Bit 3
 X' = [ 0 0 1 0 0 0 0 0 ] < Byte 4, Bit 3  *  X + 0
      [ 0 1 0 0 0 0 0 0 ] < Byte 3, Bit 4
      [ 1 0 0 0 0 0 0 0 ] < Byte 2, Bit 5
      [ 0 0 0 0 0 0 0 0 ] < Byte 1, Bit 6 < Repeat bit 7
      [ 0 0 0 0 0 0 0 0 ] < Byte 0, Bit 7 < Repeat bit 7

0b00000100'00001000'00010000'00100000'01000000'10000000'00000000'00000000
0x408102040800000

And if you look closely at the matrix constants:

Identity-Matrix:					0x0102040810204080
"Logical shift left once"-Matrix:	0x0001020408102040
"Logical shift right twice"-Matrix:	0x0408102040800000

An 8-bit logical shift left by n bits is bit-shifting the identity matrix itself to the right by i*8 bits! An 8-bit logical shift right by n bits is bit-shifting the identity matrix itself to the left by i*8 bits!

And with that, both _mm_slli_epi8 and _mm_srli_epi8 can be derived!

inline __m128i _mm_slli_epi8(__m128i a, std::uint8_t imm8)
{
	return _mm_gf2p8affine_epi64_epi8(
		a, _mm_set1_epi64x(0x0102040810204080 >> (imm8 * 8)), 0
	);
}

inline __m128i _mm_srli_epi8(__m128i a, std::uint8_t imm8)
{
	return _mm_gf2p8affine_epi64_epi8(
		a, _mm_set1_epi64x(0x0102040810204080 << (imm8 * 8)), 0
	);
}

So what about _mm_srai_epi8? An arithmetic shift is different in that not only do we shift the bits to the right, but we also replicate the most significant bit to preserve signed-ness of the original integer. So an arithmetic shift to the right by two bits means that we do a logical shift to the right twice, but rather than shifting in new 0-bits we want to repeat the bit in bit-7.

A new matrix, that defines an arithmetic shift to the right by 2 bits
      [ 0 0 0 0 0 1 0 0 ] < Byte 7, Bit 0
      [ 0 0 0 0 1 0 0 0 ] < Byte 6, Bit 1
      [ 0 0 0 1 0 0 0 0 ] < Byte 5, Bit 3
 X' = [ 0 0 1 0 0 0 0 0 ] < Byte 4, Bit 3  *  X + 0
      [ 0 1 0 0 0 0 0 0 ] < Byte 3, Bit 4
      [ 1 0 0 0 0 0 0 0 ] < Byte 2, Bit 5
      [ 1 0 0 0 0 0 0 0 ] < Byte 1, Bit 6 < Repeat bit 7
      [ 1 0 0 0 0 0 0 0 ] < Byte 0, Bit 7 < Repeat bit 7

0b00000100'00001000'00010000'00100000'01000000'10000000'10000000'10000000
0x408102040808080

Compared to the other constants-

Identity-Matrix:						0x0102040810204080
"Logical shift left once"-Matrix:		0x0001020408102040
"Logical shift right twice"-Matrix:		0x0408102040800000
"Arithmetic shift right twice"-Matrix:	0x0408102040808080

-the only difference this has from the the matrix bit-shifting for the logical shift right, is that while we shift the identity matrix to the left by i * 8 bits, we also shift in bytes of 0x80!

So after we bit-shift the identity matrix like the logical-step, we want to do a logical OR of this matrix with a i bytes of 0x80 to sign-extend the bytes:

inline __m128i _mm_srai_epi8(__m128i a, std::uint8_t imm8)
{
	// Shift a mask of 1111... bits to the left, invert it, and use this to mask
	// against 0x808080... to generate our sign-extended matrix elements
	// Latest compilers and architectures will reduce this into a `bzhi`
	// instruction
	const std::uint64_t sign_extend = ~(0xFFFFFFFFFFFFFFFF << (imm8 * 8)) & 0x8080808080808080;
	return _mm_gf2p8affine_epi64_epi8(
		a,
		// Perform a logical shift right, but shift in 0x80 bytes
		// to replicate the most significant bit
		_mm_set1_epi64x(
			0x0102040810204080 << (imm8 * 8)
			// Simulate shifting in 0x80 bytes by doing a bit-wise OR
			// against masked 0x80 bytes
			| sign_extend
			// If you're targetting a BMI2-enabled environment, you can
			// use bzhi directly too
			//| _bzhi_u64(0x8080808080808080ull, imm8 * 8) // BMI2
		),
		0
	);
}

And with that, all the missing epi8 bit-shifting functions. Ready to copy-paste.

// If you are unable to guarantee that 'imm8' is properly bounded to [0,8)
// then uncomment the safety-check lines at the top of each function to safely
// handle invalid shift-values of imm8. See final note

inline __m128i _mm_slli_epi8(__m128i a, std::uint8_t imm8)
{
	// if(imm8 > 7) return  _mm_setzero_si128();
	return _mm_gf2p8affine_epi64_epi8(
		a, _mm_set1_epi64x(0x0102040810204080 >> (imm8 * 8)), 0
	);
}

inline __m128i _mm_srli_epi8(__m128i a, std::uint8_t imm8)
{
	// if(imm8 > 7) return  _mm_setzero_si128();
	return _mm_gf2p8affine_epi64_epi8(
		a, _mm_set1_epi64x(0x0102040810204080 << (imm8 * 8)), 0
	);
}

inline __m128i _mm_srai_epi8(__m128i a, std::uint8_t imm8)
{
	// imm8 = imm8 > 7 ? 7 : imm8;
	const std::uint64_t sign_extend = ~(0xFFFFFFFFFFFFFFFF << (imm8 * 8)) & 0x8080808080808080;
	return _mm_gf2p8affine_epi64_epi8(
		a,
		_mm_set1_epi64x(0x0102040810204080 << (imm8 * 8) | sign_extend), 0
	);
}

A final note, these functions assume you are only passing in valid shift values for the imm8 argument(shift-values of [0,8)). Due to the arithmetic of generating the matrix, passing in values larger than the acceptable range of values is considered undefined behavior in C and C++. So, uint64(x) << imm8 * 8, where imm8 >= 8 is undefined, which the matrix calcuation seen here is susceptible to.

On x86, bit-shift instructions will only interpret the lower bits of your shift-value for the value-type that is being shifted. So trying to bit-shift a 32-bit register by 35 would equate to shifting it by 4, because it will only interpret the lower 5 bits of your shift value(modulo-32). This would be how the undefined-behavior would manifest itself on x86, which is not desirable.

Logical shifting an 8 bit value by 2000 should simply be 0, and arithmetic shift-overflows should behave just like the shift-by-7 case(the 7th bit is copied to all other bits).

I used these functions to accelerate the yuzu emulator’s ARM-recompiler backend!