Signum
The signum function, also known as the sign function, is defined by the piecewise function:
$$ signum(x) = \begin{cases} -1.0 &\text{if } x \lt 0.0 \\ 1.0 &\text{if } x \gt 0.0 \\ 0.0 &\text{if } x = 0.0 \end{cases} $$
Put simply, it gets the sign of its input: returning 1.0 if it is positive, -1.0 if it is negative, and 0.0 if it is 0.0.
But floating-point values are weird. There are ±Infinity, ±0.0, QNaN, and SNaN to account for. An initial implementation might look something like this. You'll likely find something like this on Google or StackOverflow:
float signum(float Value)
{
    if( Value < 0.0 )
        return -1.0;
    if( Value > 0.0 )
        return 1.0;
    return 0.0;
}
Let's put it through some interesting float values and see what the compiler output looks like:
#include <cstdio>
#include <limits>
#include <vector>

int main()
{
    std::vector<float> SampleData;
    SampleData.push_back(std::numeric_limits<float>::quiet_NaN());
    SampleData.push_back(std::numeric_limits<float>::signaling_NaN());
    SampleData.push_back(std::numeric_limits<float>::epsilon());
    SampleData.push_back(std::numeric_limits<float>::infinity());
    SampleData.push_back(-std::numeric_limits<float>::infinity());
    SampleData.push_back(std::numeric_limits<float>::min());
    SampleData.push_back(std::numeric_limits<float>::max());
    SampleData.push_back(std::numeric_limits<float>::denorm_min());
    SampleData.push_back(std::numeric_limits<float>::lowest());
    SampleData.push_back(std::numeric_limits<float>::round_error());
    SampleData.push_back(-0.0f);
    SampleData.push_back(+0.0f);
    for( const float& CurFloat : SampleData )
    {
        std::printf("%16.2g ==> %2.g\n", CurFloat, signum(CurFloat));
    }
}
float signum(float Value)
{
    if( Value < 0.0 )
        return -1.0;
    if( Value > 0.0 )
        return 1.0;
    return 0.0;
}
// gcc 12.2: -O2 -march=skylake-avx512
// signum(float):
// vxorps xmm1, xmm1, xmm1
// vcomiss xmm1, xmm0
// ja .L3
// vcmpnltss xmm0, xmm1, xmm0
// vmovss xmm2, DWORD PTR .LC1[rip]
// vblendvps xmm0, xmm2, xmm1, xmm0
// ret
// .L3:
// vmovss xmm0, DWORD PTR .LC0[rip]
// ret
// .LC0:
// .long -1082130432
// .LC1:
// .long 1065353216
// nan ==> 0 ❔
// nan ==> 0 ❔
// 1.2e-07 ==> 1 ✅
// inf ==> 1 ✅
// -inf ==> -1 ✅
// 1.2e-38 ==> 1 ✅
// 3.4e+38 ==> 1 ✅
// 1.4e-45 ==> 1 ✅
// -3.4e+38 ==> -1 ✅
// 0.5 ==> 1 ✅
// -0 ==> 0 ✅
// 0 ==> 0 ✅
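The NaN rows fall out of IEEE-754 comparison semantics: every ordered comparison involving a NaN evaluates to false, so both `if` statements are skipped and control falls through to `return 0.0;`. A small sketch demonstrating this (`IsOrderedWithZero` is just an illustrative helper name, not from the implementation above):

```cpp
#include <limits>

// True for any value that compares less-than, greater-than, or equal to
// zero. NaN satisfies none of these, which is exactly why the naive
// signum returns its final 0.0 for NaN inputs.
inline bool IsOrderedWithZero(float Value)
{
    return (Value < 0.0f) || (Value > 0.0f) || (Value == 0.0f);
}
```

Any finite or infinite value is ordered with zero; `IsOrderedWithZero` of a quiet NaN is false.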
It gets the job done, but it doesn't handle QNaN or SNaN correctly. It can be hard to define how a function that gets the sign of a number should handle something that is… Not a Number. Generally, you should propagate NaN values forward untouched so that later code can explicitly handle them or signal an exception on them.
This is the behavior you would see in other language and library implementations of signum as well. Some even throw exceptions in the case of NaN, but for now we'll just try to preserve it.

| Rust | Java | C# | Javascript | Python (numpy) | Clojure |
|---|---|---|---|---|---|
| num::signum | Math.signum | Math.Sign | Math.sign | numpy.sign | signum |
Utilizing std::isnan, provided by the cmath header, an additional branch can be added to handle and propagate NaN values to the output.
#include <cmath>
float signum_safe(float Value)
{
    if( Value < 0.0 )
        return -1.0;
    if( Value > 0.0 )
        return 1.0;
    if( std::isnan(Value) )
        return Value;
    return 0.0;
}
// gcc 12.2: -O2 -march=skylake-avx512
// signum_safe(float):
// vxorps xmm1, xmm1, xmm1
// vcomiss xmm1, xmm0
// ja .L3
// vcomiss xmm0, xmm1
// vmovss xmm2, DWORD PTR .LC1[rip]
// ja .L1
// vucomiss xmm0, xmm0
// vmovd eax, xmm0
// vmovd edx, xmm1
// cmovnp eax, edx
// vmovd xmm2, eax
// .L1:
// vmovaps xmm0, xmm2
// ret
// .L3:
// vmovss xmm2, DWORD PTR .LC0[rip]
// vmovaps xmm0, xmm2
// ret
// .LC0:
// .long -1082130432
// .LC1:
// .long 1065353216
// nan ==> nan ✅
// nan ==> nan ✅
// 1.2e-07 ==> 1 ✅
// inf ==> 1 ✅
// -inf ==> -1 ✅
// 1.2e-38 ==> 1 ✅
// 3.4e+38 ==> 1 ✅
// 1.4e-45 ==> 1 ✅
// -3.4e+38 ==> -1 ✅
// 0.5 ==> 1 ✅
// -0 ==> 0 ✅
// 0 ==> 0 ✅
Where the std::isnan check is placed is pretty subjective, but more often than not you are dealing with a finite negative or positive number rather than the more specific bit patterns of a QNaN, SNaN, or ±0.0. So I placed those particular checks last, as they are less likely than the bit patterns for regular negative or positive numbers.
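As an aside, a similar NaN-preserving version can be sketched with std::copysign in place of the two ordered comparisons. Note one behavioral difference: this variant returns -0.0 for a -0.0 input, since copysign transfers the input's sign bit onto the zero magnitude (`signum_copysign` is just an illustrative name):

```cpp
#include <cmath>

// Sketch: NaN propagates through the early return; non-zero inputs get a
// magnitude of 1.0f carrying the input's sign; zeros keep their own sign
// (so -0.0f maps to -0.0f here, unlike signum_safe).
float signum_copysign(float Value)
{
    if( std::isnan(Value) )
        return Value;
    return std::copysign(Value != 0.0f ? 1.0f : 0.0f, Value);
}
```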
vfixupimm(AVX512)
The vfixupimm instructions are provided by the AVX512F x86 ISA extension. This is the core extension required by all implementations of AVX512, so you'll find it on any processor that supports AVX512, such as Skylake-X, Cannon Lake, Cascade Lake, Cooper Lake, Ice Lake, Tiger Lake, Rocket Lake, Alder Lake (if you're lucky), and even AMD's recent Zen 4 processors.
vfixupimm is quite a unique instruction. Operating upon either a vector of values or a single value, this instruction will first classify the floating-point element(s), and will then substitute each element with another depending on the encoded substitution values provided by the 32-bit element(s) in the last operand (either another SIMD register or a memory address; no embedded-broadcast support, unfortunately).
The intrinsic that maps to operating upon a single floating-point value is _mm_fixupimm_ss, provided by the immintrin.h header. There are also double-precision variants available, as well as packed-element variations for bulk processing. We'll look at the scalar single-precision variant.
Synopsis
__m128 _mm_fixupimm_ss (__m128 a, __m128 b, __m128i c, int imm8)
#include <immintrin.h>
Instruction: vfixupimmss xmm, xmm, xmm, imm8
CPUID Flags: AVX512F

Description
Fix up the lower single-precision (32-bit) floating-point elements in a and b using the lower 32-bit integer in c, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst. imm8 is used to set the required flags reporting.
With this, it is possible to implement a single-instruction version of signum that handles all of the edge cases as well. Each 32-bit lane of operand c encodes how each class of floating-point value should be substituted. The final 8-bit immediate value determines whether certain floating-point exception flags should be reported; this can be left at 0 for this use case.
Each 32-bit value has eight four-bit lanes for how each classification of floating point value should be substituted.
| 31..28 | 27..24 | 23..20 | 19..16 | 15..12 | 11..8 | 7..4 | 3..0 |
|---|---|---|---|---|---|---|---|
| +Finite | -Finite | +Inf | -Inf | +1.0 | ±0.0 | SNaN | QNaN |
Within these four-bit classification lanes, the possible substitutions are:
| Substitution | Value |
|---|---|
| Preserve the destination value | 0000 |
| Source value† | 0001 |
| QNaN with the sign of the source value† | 0010 |
| Negative QNaN with no payload | 0011 |
| -Infinity | 0100 |
| +Infinity | 0101 |
| Infinity with sign of source operand† | 0110 |
| -0.0 | 0111 |
| +0.0 | 1000 |
| -1.0 | 1001 |
| +1.0 | 1010 |
| +0.5 | 1011 |
| +90.0 | 1100 |
| PI/2 | 1101 |
| +{FLT_MAX,DBL_MAX} | 1110 |
| -{FLT_MAX,DBL_MAX} | 1111 |
† Denormal values may be interpreted as being +0.0 if the DAZ flag is enabled, which is often the case.
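Whether DAZ is currently active can be checked (or forced) through the MXCSR register using the SSE control-register intrinsics. A small sketch (`DazEnabled` is just an illustrative helper name):

```cpp
#include <xmmintrin.h>

// DAZ is bit 6 (0x0040) of MXCSR. When set, denormal *inputs* are treated
// as +0.0, which affects the classification lanes marked with † above.
bool DazEnabled()
{
    return (_mm_getcsr() & 0x0040) != 0;
}
```

Fast-math environments typically end up in this state by setting the bit with `_mm_setcsr(_mm_getcsr() | 0x0040)`.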
It might be worth making some utility code to handle this more easily:
#include <cstdint>

// Substitution codes for use with vfixupimm
enum class FpFixup : std::uint8_t
{
    Dest     = 0b0000, // Preserve destination
    Norm_Src = 0b0001, // Source operand (Denormal as positive-zero)
    QNaN_Src = 0b0010, // QNaN with sign of source (Denormal as positive-zero)
    IndefNaN = 0b0011, // Indefinite QNaN (Negative QNaN with no payload)
    NegInf   = 0b0100, // -Infinity
    PosInf   = 0b0101, // +Infinity
    Inf_Src  = 0b0110, // Infinity with sign of source (Denormal as positive-zero)
    NegZero  = 0b0111, // -0.0
    PosZero  = 0b1000, // +0.0
    NegOne   = 0b1001, // -1.0
    PosOne   = 0b1010, // +1.0
    Half     = 0b1011, // 0.5
    Ninety   = 0b1100, // 90.0
    HalfPi   = 0b1101, // PI/2
    PosMax   = 0b1110, // +{FLT_MAX,DBL_MAX}
    NegMax   = 0b1111, // -{FLT_MAX,DBL_MAX}
};

// Generates a 32-bit LUT element for the vfixupimm instruction
constexpr std::uint32_t FixupLUT(
    FpFixup src_qnan = FpFixup::Dest, FpFixup src_snan = FpFixup::Dest,
    FpFixup src_zero = FpFixup::Dest, FpFixup src_posone = FpFixup::Dest,
    FpFixup src_neginf = FpFixup::Dest, FpFixup src_posinf = FpFixup::Dest,
    FpFixup src_neg = FpFixup::Dest, FpFixup src_pos = FpFixup::Dest)
{
    std::uint32_t fixup_lut = 0;
    fixup_lut |= ((static_cast<std::uint32_t>(src_qnan  ) & 0b1111) <<  0);
    fixup_lut |= ((static_cast<std::uint32_t>(src_snan  ) & 0b1111) <<  4);
    fixup_lut |= ((static_cast<std::uint32_t>(src_zero  ) & 0b1111) <<  8);
    fixup_lut |= ((static_cast<std::uint32_t>(src_posone) & 0b1111) << 12);
    fixup_lut |= ((static_cast<std::uint32_t>(src_neginf) & 0b1111) << 16);
    fixup_lut |= ((static_cast<std::uint32_t>(src_posinf) & 0b1111) << 20);
    fixup_lut |= ((static_cast<std::uint32_t>(src_neg   ) & 0b1111) << 24);
    fixup_lut |= ((static_cast<std::uint32_t>(src_pos   ) & 0b1111) << 28);
    return fixup_lut;
}
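As a standalone sanity check of the packing order (this mirrors the nibble layout without depending on the enum; `PackNibbles` is just an illustrative name), the substitution codes used for signum should assemble into 0xA9A9A800:

```cpp
#include <cstdint>

// Pack the eight 4-bit substitution codes: QNaN lane lowest, +Finite highest.
constexpr std::uint32_t PackNibbles(
    std::uint32_t qnan, std::uint32_t snan, std::uint32_t zero,
    std::uint32_t posone, std::uint32_t neginf, std::uint32_t posinf,
    std::uint32_t neg, std::uint32_t pos)
{
    return (qnan << 0) | (snan << 4) | (zero << 8) | (posone << 12)
         | (neginf << 16) | (posinf << 20) | (neg << 24) | (pos << 28);
}

// Dest=0b0000, PosZero=0b1000, NegOne=0b1001, PosOne=0b1010:
static_assert(
    PackNibbles(0b0000, 0b0000, 0b1000, 0b1010, 0b1001, 0b1010, 0b1001, 0b1010)
        == 0xA9A9A800u,
    "Unexpected signum LUT encoding");
```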
To implement the signum function, the kind of substitutions to be made would be:
| +Finite | -Finite | +Inf | -Inf | +1.0 | ±0.0 | SNaN | QNaN |
|---|---|---|---|---|---|---|---|
| +1.0 | -1.0 | +1.0 | -1.0 | +1.0 | +0.0 | Preserve | Preserve |
Using the utility function, the 32-bit lookup table can be generated at compile time for the intrinsic to consume. FixupOp here resolves to the value 0xA9A9A800.
#include <immintrin.h>
float signum_avx512(float Value)
{
    const __m128 Value128 = _mm_set_ss(Value);
    constexpr std::uint32_t FixupOp = FixupLUT(
        FpFixup::Dest, FpFixup::Dest, FpFixup::PosZero, FpFixup::PosOne,
        FpFixup::NegOne, FpFixup::PosOne, FpFixup::NegOne, FpFixup::PosOne);
    const __m128 FixUp
        = _mm_fixupimm_ss(Value128, Value128, _mm_cvtsi32_si128(FixupOp), 0);
    return _mm_cvtss_f32(FixUp);
}
// gcc 12.2: -O2 -march=skylake-avx512
// signum_avx512(float):
// vinsertps xmm0, xmm0, xmm0, 0xe
// vfixupimmss xmm0, xmm0, DWORD PTR .LC0[rip], 0
// ret
// .LC0:
// .long -1448499200
// .long 0
// .long 0
// .long 0
// nan ==> nan ✅
// nan ==> nan ✅
// 1.2e-07 ==> 1 ✅
// inf ==> 1 ✅
// -inf ==> -1 ✅
// 1.2e-38 ==> 1 ✅
// 3.4e+38 ==> 1 ✅
// 1.4e-45 ==> 1 ✅
// -3.4e+38 ==> -1 ✅
// 0.5 ==> 1 ✅
// -0 ==> 0 ✅
// 0 ==> 0 ✅
Other than the zero-extension that occurs from placing the float value into the underlying __m128, this has reduced the core implementation of the algorithm to just a single vfixupimmss instruction.
The particular vfixupimmss xmm, xmm, m32, 0 form being utilized here has pretty good latency and throughput on Intel implementations of AVX512, compared to alternative implementations that require a more explicit sequence of instructions.

| | Latency | Throughput | Uops | Ports |
|---|---|---|---|---|
| Skylake-X | [4;≤11] | 0.50 / 0.55 | 2 / 2 | 1_p01+1_p23 |
| Ice Lake | [4;≤11] | 0.50 / 0.50 | 2 / 2 | 1_p01+1_p23 |
| Tiger Lake | [4;≤11] | 0.50 / 0.50 | 2 / 2 | 1_p01+1_p23 |
| Rocket Lake | [4;≤11] | 0.50 / 0.50 | 2 / 2 | 1_p01+1_p23 |
| Alder Lake-P | [4;≤11] | 0.50 / 0.50 | 2 / 2 | 1_p01+1_p23 |

Using measured data from uops.info. Hope to see Zen 4 measurements soon!
Benchmarks
The benchmarking code generates 1'000'000 random values and runs the various signum implementations through them. The average evaluation time across 100 trials is then output in nanoseconds:
// gcc 12.2: -O2 -march=skylake-avx512
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

template<typename T>
inline void DoNotOptimize(const T& Value)
{
    asm volatile("" : : "r,m"(Value) : "memory");
}

int main()
{
    const std::size_t SampleCount = 1'000'000;
    const std::size_t Trials = 100;
    static std::vector<float> BenchData(SampleCount, 0.0f);
    std::mt19937 MersenneTwister;
    for( float& CurFloat : BenchData )
    {
        CurFloat = (float)MersenneTwister();
    }
    auto Average = std::chrono::high_resolution_clock::duration::zero();
    for( std::size_t CurTrial = 0; CurTrial < Trials; ++CurTrial )
    {
        const auto Start = std::chrono::high_resolution_clock::now();
        for( const float& CurFloat : BenchData )
        {
            const float Sign = signum(CurFloat);
            // const float Sign = signum_safe(CurFloat);
            // const float Sign = signum_avx512(CurFloat);
            DoNotOptimize(Sign);
        }
        const auto End = std::chrono::high_resolution_clock::now();
        Average += (End - Start);
    }
    Average /= Trials;
    std::printf(
        "%zu\n",
        std::chrono::duration_cast<std::chrono::nanoseconds>(Average).count());
    return 0;
}
% inxi -IC
CPU:
Info: 10-core model: Intel Core i9-7900X bits: 64 type: MT MCP cache: L2: 10 MiB
Speed (MHz): avg: 4500 min/max: 1200/4500 cores: 1: 4500 2: 4500 3: 4500 4: 4500 5: 4500
6: 4500 7: 4500 8: 4500 9: 4500 10: 4500 11: 4500 12: 4500 13: 4500 14: 4500 15: 4500 16: 4500
17: 4500 18: 4500 19: 4500 20: 4500
| signum | signum_safe | signum_avx512 |
|---|---|---|
| 696126ns | 706325ns | 468256ns |
The vfixupimmss-accelerated signum_avx512 is about 1.5× as fast as both signum and signum_safe. This is even without utilizing the packed variants of these instructions (vfixupimmp{s,d}) for bulk array processing, which would have allowed up to 4, 8, or 16 elements to be processed at once! More often than not though, you are probably using a signum function within a sequence of other unrelated arithmetic.
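For reference, a bulk version built on the packed vfixupimmps form might look like the following. This is only a sketch: it assumes AVX512F is available at run time, assumes for brevity that Count is a multiple of 16, and `signum_packed` is an illustrative name (the target attribute is a GCC/Clang extension that lets the intrinsics compile without global -mavx512f):

```cpp
#include <cstddef>
#include <immintrin.h>

// Applies the same 0xA9A9A800 substitution table to 16 floats at a time.
// A real implementation would also need a masked or scalar tail for
// lengths that are not a multiple of 16.
__attribute__((target("avx512f")))
void signum_packed(float* Values, std::size_t Count)
{
    const __m512i FixupOp = _mm512_set1_epi32(static_cast<int>(0xA9A9A800u));
    for( std::size_t i = 0; i < Count; i += 16 )
    {
        const __m512 Input = _mm512_loadu_ps(Values + i);
        _mm512_storeu_ps(
            Values + i, _mm512_fixupimm_ps(Input, Input, FixupOp, 0));
    }
}
```

A caller would gate this behind a CPU feature check such as GCC's `__builtin_cpu_supports("avx512f")` before dispatching to it.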
I've used this acceleration in particular for a contribution to Jolt Physics in which I added initial support for AVX512. In a contribution to Dynarmic (the JIT backend for Yuzu, Citra, and Vita3k), I first made the FixupLUT utility function and the FpFixup enum to make handling vfixupimm operations easier. In that case though, I was using vfixupimm to assist in converting NaN values to zero.