vfixupimm: signum

October 10, 2022

Signum

The signum function, also known as the sign function, is defined by the piecewise function:

$$ signum(x) = \begin{cases} -1.0 &\text{if } x \lt 0.0 \\ 1.0 &\text{if } x \gt 0.0 \\ 0.0 &\text{if } x = 0.0 \end{cases} $$

Put simply, it gets the sign of the input value: returning 1.0 if it is positive, -1.0 if it is negative, and 0.0 if it is 0.0.

But floating point values are weird. There are ±Infinity, ±0.0, QNaN, and SNaN to account for. An initial implementation might look something like this; you'll likely find something similar on Google or StackOverflow.

float signum(float Value)
{
	if( Value < 0.0 )
		return -1.0;
	if( Value > 0.0 )
		return 1.0;
	return 0.0;
}

Let's put it through some interesting float values and see what the compiler output looks like:

#include <cstdio>
#include <limits>
#include <vector>

float signum(float Value);

int main()
{
	std::vector<float> SampleData;

	SampleData.push_back(std::numeric_limits<float>::quiet_NaN());
	SampleData.push_back(std::numeric_limits<float>::signaling_NaN());
	SampleData.push_back(std::numeric_limits<float>::epsilon());
	SampleData.push_back(std::numeric_limits<float>::infinity());
	SampleData.push_back(-std::numeric_limits<float>::infinity());
	SampleData.push_back(std::numeric_limits<float>::min());
	SampleData.push_back(std::numeric_limits<float>::max());
	SampleData.push_back(std::numeric_limits<float>::denorm_min());
	SampleData.push_back(std::numeric_limits<float>::lowest());
	SampleData.push_back(std::numeric_limits<float>::round_error());
	SampleData.push_back(-0.0f);
	SampleData.push_back(+0.0f);

	for( const float& CurFloat : SampleData )
	{
		std::printf("%16.2g ==> %2.g\n", CurFloat, signum(CurFloat));
	}
}
float signum(float Value)
{
	if( Value < 0.0 )
		return -1.0;
	if( Value > 0.0 )
		return 1.0;
	return 0.0;
}

// gcc 12.2: -O2 -march=skylake-avx512
// signum(float):
//         vxorps  xmm1, xmm1, xmm1
//         vcomiss xmm1, xmm0
//         ja      .L3
//         vcmpnltss       xmm0, xmm1, xmm0
//         vmovss  xmm2, DWORD PTR .LC1[rip]
//         vblendvps       xmm0, xmm2, xmm1, xmm0
//         ret
// .L3:
//         vmovss  xmm0, DWORD PTR .LC0[rip]
//         ret
// .LC0:
//         .long   -1082130432
// .LC1:
//         .long   1065353216

//      nan ==>  0 ❔
//      nan ==>  0 ❔
//  1.2e-07 ==>  1 ✅
//      inf ==>  1 ✅
//     -inf ==> -1 ✅
//  1.2e-38 ==>  1 ✅
//  3.4e+38 ==>  1 ✅
//  1.4e-45 ==>  1 ✅
// -3.4e+38 ==> -1 ✅
//      0.5 ==>  1 ✅
//       -0 ==>  0 ✅
//        0 ==>  0 ✅

It gets the job done, but it doesn't handle QNaN or SNaN correctly. It can be hard to define how a function that gets the sign of a number should handle something that is… Not a Number. Generally, you should propagate NaN values forward untouched so that later code can explicitly handle them or raise an exception.

This is the behavior you would see in other languages' and libraries' signum implementations as well. Some even throw exceptions in the case of NaN, but for now we'll just try to preserve it.

Rust          Java          C#          JavaScript   Python (numpy)   Clojure
num::signum   Math.signum   Math.Sign   Math.sign    numpy.sign       signum

Utilizing std::isnan provided by the cmath header, an additional branch can be added to handle and propagate NaN values to the output.

#include <cmath>

float signum_safe(float Value)
{
	if( Value < 0.0 )
		return -1.0;
	if( Value > 0.0 )
		return 1.0;
	if( std::isnan(Value) )
		return Value;
	return 0.0;
}

// gcc 12.2: -O2 -march=skylake-avx512
// signum_safe(float):
//         vxorps  xmm1, xmm1, xmm1
//         vcomiss xmm1, xmm0
//         ja      .L3
//         vcomiss xmm0, xmm1
//         vmovss  xmm2, DWORD PTR .LC1[rip]
//         ja      .L1
//         vucomiss        xmm0, xmm0
//         vmovd   eax, xmm0
//         vmovd   edx, xmm1
//         cmovnp  eax, edx
//         vmovd   xmm2, eax
// .L1:
//         vmovaps xmm0, xmm2
//         ret
// .L3:
//         vmovss  xmm2, DWORD PTR .LC0[rip]
//         vmovaps xmm0, xmm2
//         ret
// .LC0:
//         .long   -1082130432
// .LC1:
//         .long   1065353216

//      nan ==> nan ✅
//      nan ==> nan ✅
//  1.2e-07 ==>  1  ✅
//      inf ==>  1  ✅
//     -inf ==> -1  ✅
//  1.2e-38 ==>  1  ✅
//  3.4e+38 ==>  1  ✅
//  1.4e-45 ==>  1  ✅
// -3.4e+38 ==> -1  ✅
//      0.5 ==>  1  ✅
//       -0 ==>  0  ✅
//        0 ==>  0  ✅

Where the std::isnan check is placed is pretty subjective, but more often than not you are dealing with an ordinary finite negative or positive number rather than the more specific bit-patterns of a QNaN, SNaN, or ±0.0. So I placed those particular checks last, as they are less likely to hit than the bit-patterns for regular negative or positive numbers.
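For comparison, here is a sketch of an alternative formulation using std::copysign (signum_copysign is just an illustrative name; it should behave the same as signum_safe, though it is not necessarily any faster):

#include <cmath>

float signum_copysign(float Value)
{
	if( std::isnan(Value) )
		return Value; // propagate NaN untouched
	if( Value == 0.0f )
		return 0.0f; // both +0.0 and -0.0 map to +0.0
	return std::copysign(1.0f, Value); // ±1.0 with the sign of Value
}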

vfixupimm (AVX512)

The vfixupimm instructions are provided by the AVX512F x86 ISA extension. This is the core extension required by all implementations of AVX512, so you'll find it on any processor that supports AVX512: Skylake-X, Cannon Lake, Cascade Lake, Cooper Lake, Ice Lake, Tiger Lake, Rocket Lake, Alder Lake (if you're lucky), and even AMD's recent Zen 4 processors.

vfixupimm is quite a unique instruction. Operating upon either a vector of values or a single value, it first classifies the floating point element(s), and then substitutes each element with another value according to the encoded substitution codes provided by the corresponding 32-bit element(s) in the last operand (either another SIMD register or a memory address; unfortunately there is no embedded-broadcast support).

The intrinsic that maps to operating upon a single floating point value is _mm_fixupimm_ss, provided by the immintrin.h header. There are also double-precision variants as well as packed-element variants for bulk processing. We'll look at the scalar single-precision variant.

Synopsis
__m128 _mm_fixupimm_ss (__m128 a, __m128 b, __m128i c, int imm8)
#include <immintrin.h>
Instruction: vfixupimmss xmm, xmm, xmm, imm8
CPUID Flags: AVX512F
Description
Fix up the lower single-precision (32-bit) floating-point elements in a and b using the lower 32-bit integer in c, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst. imm8 is used to set the required flags reporting.

Intel intrinsics guide vers. 3.6.3

With this, it is possible to implement a single-instruction version of signum that handles all of the edge cases as well.

Each 32-bit lane of operand c encodes how each class of floating point value should be substituted. The final 8-bit immediate value determines whether certain floating point exception flags should be reported; it can be left as 0 for this use-case.

Each 32-bit value has eight four-bit lanes for how each classification of floating point value should be substituted.

Bits    31..28    27..24    23..20   19..16   15..12   11..8   7..4   3..0
Class   +Finite   -Finite   +Inf     -Inf     +1.0     ±0.0    SNaN   QNaN

Within these four-bit classification lanes, the possible substitutions are:

Substitution                               Value
Preserve the destination value             0000
Source Value†                              0001
QNaN with the sign of the Source Value†    0010
Negative QNaN with no payload              0011
-Infinity                                  0100
+Infinity                                  0101
Infinity with sign of source operand†      0110
-0.0                                       0111
+0.0                                       1000
-1.0                                       1001
+1.0                                       1010
+0.5                                       1011
+90.0                                      1100
PI/2                                       1101
+{FLT_MAX,DBL_MAX}                         1110
-{FLT_MAX,DBL_MAX}                         1111

† Denormal values may be interpreted as +0.0 if the DAZ flag is enabled, which is often the case.
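If your code depends on that behavior, the DAZ bit of MXCSR can be toggled with the usual SSE intrinsics. A minimal sketch (enable_daz is an illustrative name; whether you actually want DAZ on depends on the rest of your code):

#include <pmmintrin.h>

void enable_daz()
{
	// Treat denormal inputs as +0.0; the current setting can be queried with
	// _MM_GET_DENORMALS_ZERO_MODE()
	_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}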

It might be worth making some utility code to handle this encoding more easily:

#include <cstdint>

// Substitution codes for use with vfixupimm
enum class FpFixup : std::uint8_t
{
	Dest     = 0b0000, // Preserve destination
	Norm_Src = 0b0001, // Source operand (Denormal as positive-zero)
	QNaN_Src = 0b0010, // QNaN with sign of source (Denormal as positive-zero)
	IndefNaN = 0b0011, // Indefinite QNaN (Negative QNaN with no payload)
	NegInf   = 0b0100, // -Infinity
	PosInf   = 0b0101, // +Infinity
	Inf_Src	 = 0b0110, // Infinity with sign of source (Denormal as positive-zero)
	NegZero  = 0b0111, // -0.0
	PosZero  = 0b1000, // +0.0
	NegOne   = 0b1001, // -1.0
	PosOne   = 0b1010, // +1.0
	Half     = 0b1011, // 0.5
	Ninety   = 0b1100, // 90.0
	HalfPi   = 0b1101, // PI/2
	PosMax   = 0b1110, // +{FLT_MAX,DBL_MAX}
	NegMax   = 0b1111, // -{FLT_MAX,DBL_MAX}
};

// Generates a 32-bit LUT element for the vfixupimm instruction
constexpr std::uint32_t FixupLUT(
	FpFixup src_qnan = FpFixup::Dest, FpFixup src_snan = FpFixup::Dest,
	FpFixup src_zero = FpFixup::Dest, FpFixup src_posone = FpFixup::Dest,
	FpFixup src_neginf = FpFixup::Dest, FpFixup src_posinf = FpFixup::Dest,
	FpFixup src_neg = FpFixup::Dest, FpFixup src_pos = FpFixup::Dest)
{
	std::uint32_t fixup_lut = 0;
	fixup_lut |= ((static_cast<std::uint32_t>(src_qnan  ) & 0b1111) <<  0);
	fixup_lut |= ((static_cast<std::uint32_t>(src_snan  ) & 0b1111) <<  4);
	fixup_lut |= ((static_cast<std::uint32_t>(src_zero  ) & 0b1111) <<  8);
	fixup_lut |= ((static_cast<std::uint32_t>(src_posone) & 0b1111) << 12);
	fixup_lut |= ((static_cast<std::uint32_t>(src_neginf) & 0b1111) << 16);
	fixup_lut |= ((static_cast<std::uint32_t>(src_posinf) & 0b1111) << 20);
	fixup_lut |= ((static_cast<std::uint32_t>(src_neg   ) & 0b1111) << 24);
	fixup_lut |= ((static_cast<std::uint32_t>(src_pos   ) & 0b1111) << 28);
	return fixup_lut;
}

To implement the signum function, the kind of substitutions to be made would be:

Class          +Finite   -Finite   +Inf   -Inf   +1.0   ±0.0   SNaN       QNaN
Substitution   +1.0      -1.0      +1.0   -1.0   +1.0   +0.0   Preserve   Preserve

Using the utility function, the 32-bit lookup table can be generated at compile-time for the intrinsic to consume. FixupOp here resolves to the value 0xA9A9A800.
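That value can be double-checked with a quick static_assert, using the FpFixup and FixupLUT definitions above (just a sanity check of the bit packing):

// Sanity check: the signum substitution table assembles to 0xA9A9A800
static_assert(
	FixupLUT(
		FpFixup::Dest, FpFixup::Dest, FpFixup::PosZero, FpFixup::PosOne,
		FpFixup::NegOne, FpFixup::PosOne, FpFixup::NegOne, FpFixup::PosOne)
	== 0xA9A9A800);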

#include <immintrin.h>

float signum_avx512(float Value)
{
	const __m128            Value128 = _mm_set_ss(Value);

	constexpr std::uint32_t FixupOp  = FixupLUT(
		 FpFixup::Dest, FpFixup::Dest, FpFixup::PosZero, FpFixup::PosOne,
		 FpFixup::NegOne, FpFixup::PosOne, FpFixup::NegOne, FpFixup::PosOne);

	const __m128 FixUp
		= _mm_fixupimm_ss(Value128, Value128, _mm_cvtsi32_si128(FixupOp), 0);

	return _mm_cvtss_f32(FixUp);
}

// gcc 12.2: -O2 -march=skylake-avx512

// signum_avx512(float):
//         vinsertps       xmm0, xmm0, xmm0, 0xe
//         vfixupimmss     xmm0, xmm0, DWORD PTR .LC0[rip], 0
//         ret
// .LC0:
//         .long   -1448499200
//         .long   0
//         .long   0
//         .long   0

//      nan ==> nan ✅
//      nan ==> nan ✅
//  1.2e-07 ==>  1  ✅
//      inf ==>  1  ✅
//     -inf ==> -1  ✅
//  1.2e-38 ==>  1  ✅
//  3.4e+38 ==>  1  ✅
//  1.4e-45 ==>  1  ✅
// -3.4e+38 ==> -1  ✅
//      0.5 ==>  1  ✅
//       -0 ==>  0  ✅
//        0 ==>  0  ✅

Other than the zero-extension needed to get the float value into the lower lane of an __m128, this reduces the core of the algorithm to just a single vfixupimmss instruction.

The particular vfixupimmss xmm, xmm, m32, 0 form being used here has pretty good latency and throughput on Intel implementations of AVX512 compared to the alternatives that require a longer, explicit sequence of instructions.

               Latency   Throughput    Uops    Ports
Skylake-X      [4;≤11]   0.50 / 0.55   2 / 2   1_p01+1_p23
Ice Lake       [4;≤11]   0.50 / 0.50   2 / 2   1_p01+1_p23
Tiger Lake     [4;≤11]   0.50 / 0.50   2 / 2   1_p01+1_p23
Rocket Lake    [4;≤11]   0.50 / 0.50   2 / 2   1_p01+1_p23
Alder Lake-P   [4;≤11]   0.50 / 0.50   2 / 2   1_p01+1_p23A

Using measured data from uops.info. Hope to see Zen 4 measurements soon!

Benchmarks

The benchmarking code generates 1'000'000 random values and runs the various signum implementations through them. The average evaluation time across 100 trials is then output in nanoseconds:

// gcc 12.2: -O2 -march=skylake-avx512

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

template<typename T>
inline void DoNotOptimize(const T& Value)
{
	asm volatile("" : : "r,m"(Value) : "memory");
}

int main()
{
	const std::size_t SampleCount = 1'000'000;
	const std::size_t Trials      = 100;

	static std::vector<float> BenchData(SampleCount, 0.0f);
	std::mt19937              MersenneTwister;
	for( float& CurFloat : BenchData )
	{
		CurFloat = (float)MersenneTwister();
	}

	auto Average = std::chrono::high_resolution_clock::duration::zero();

	for( std::size_t CurTrial = 0; CurTrial < Trials; ++CurTrial )
	{

		const auto Start = std::chrono::high_resolution_clock::now();
		for( const float& CurFloat : BenchData )
		{
			const float Sign = signum(CurFloat);
			// const float Sign = signum_safe(CurFloat);
			// const float Sign = signum_avx512(CurFloat);
			DoNotOptimize(Sign);
		}
		const auto End = std::chrono::high_resolution_clock::now();
		Average += (End - Start);
	}

	Average /= Trials;

	std::printf(
		"%zu\n",
		std::chrono::duration_cast<std::chrono::nanoseconds>(Average).count());

	return 0;
}
% inxi -IC
CPU:
  Info: 10-core model: Intel Core i9-7900X bits: 64 type: MT MCP cache: L2: 10 MiB
  Speed (MHz): avg: 4500 min/max: 1200/4500 cores: 1: 4500 2: 4500 3: 4500 4: 4500 5: 4500
    6: 4500 7: 4500 8: 4500 9: 4500 10: 4500 11: 4500 12: 4500 13: 4500 14: 4500 15: 4500 16: 4500
    17: 4500 18: 4500 19: 4500 20: 4500
signum     signum_safe   signum_avx512
696126ns   706325ns      468256ns

The vfixupimmss-accelerated signum_avx512 runs in roughly two-thirds of the time of both signum and signum_safe (about 1.5× the throughput). This is even without utilizing the packed variants of these instructions (vfixupimmp{s,d}) for bulk-array processing, which would allow 4, 8, or 16 elements to be processed at once! More often than not though, you are probably using a signum function within a sequence of other unrelated arithmetic.
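As a rough sketch of what such a packed version might look like, reusing the FpFixup and FixupLUT helpers from earlier (signum_avx512_packed is a hypothetical name, and the loop assumes Count is a multiple of 16 just to keep the sketch short):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void signum_avx512_packed(float* Values, std::size_t Count)
{
	constexpr std::uint32_t FixupOp = FixupLUT(
		FpFixup::Dest, FpFixup::Dest, FpFixup::PosZero, FpFixup::PosOne,
		FpFixup::NegOne, FpFixup::PosOne, FpFixup::NegOne, FpFixup::PosOne);

	// Broadcast the same 32-bit substitution table into all 16 lanes
	const __m512i FixupOp16 = _mm512_set1_epi32(static_cast<int>(FixupOp));

	for( std::size_t i = 0; i < Count; i += 16 )
	{
		const __m512 Cur = _mm512_loadu_ps(Values + i);
		_mm512_storeu_ps(Values + i, _mm512_fixupimm_ps(Cur, Cur, FixupOp16, 0));
	}
}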


I’ve used this acceleration in particular for a contribution to Jolt Physics in which I added initial support for AVX 512.

In a contribution to Dynarmic (the JIT backend for Yuzu, Citra, and Vita3k), I first made the FixupLUT utility function and the FpFixup enum to make handling vfixupimm operations easier. In that case though, I was using vfixupimm to help convert NaN values to zero.
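For reference, a minimal sketch of that NaN-to-zero use-case using the same helpers (nan_to_zero_avx512 is an illustrative name, not the actual Dynarmic code):

#include <immintrin.h>

float nan_to_zero_avx512(float Value)
{
	// QNaN and SNaN are replaced with +0.0; every other class preserves the
	// destination value, which here is the input itself
	constexpr std::uint32_t FixupOp
		= FixupLUT(FpFixup::PosZero, FpFixup::PosZero);

	const __m128 Value128 = _mm_set_ss(Value);
	const __m128 FixUp
		= _mm_fixupimm_ss(Value128, Value128, _mm_cvtsi32_si128(FixupOp), 0);
	return _mm_cvtss_f32(FixUp);
}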
