RISC-V Bitmanip Extension, Document Version 0.90. Editor: Clifford Wolf, Symbiotic GmbH. June 10, 2019
This document is released under a Creative Commons Attribution 4.0 International License.
Chapter 1
Introduction
Any proposed changes to the ISA should be evaluated according to the following criteria.
• Architecture Consistency: Decisions must be consistent with RISC-V philosophy. ISA changes
should deviate as little as possible from existing RISC-V standards (such as instruction encod-
ings), and should not re-implement features that are already found in the base specification
or other extensions.
• Threshold Metric: The proposal should provide significant savings in terms of clocks or
instructions. As a heuristic, any proposal should replace at least three instructions. An
instruction that only replaces two may be considered, but only if the frequency of use is very
high and/or the implementation very cheap.
• Data-Driven Value: Usage in real world applications, and corresponding benchmarks showing
a performance increase, will contribute to the score of a proposal. A proposal will not be
accepted on the merits of its theoretical value alone, unless it is used in the real world.
• Hardware Simplicity: Though instructions saved is the primary benefit, proposals that dra-
matically increase the hardware complexity and area, or are difficult to implement, should
be penalized and given extra scrutiny. The final proposals should only be made if a test
implementation can be produced.
• Compiler Support: ISA changes that can be natively detected by the compiler, or are already
used as intrinsics, will score higher than instructions that do not fit those criteria.
The overall goal of this extension is pervasive adoption by minimizing potential barriers and
ensuring the instructions can be mapped to the largest number of ops, either direct or pseudo, that
are supported by the most popular processors and compilers. By adding generic instructions and
taking advantage of the RISC-V base instructions that already operate on bits, only a minimal set
of instructions needs to be added, while at the same time enabling a rich set of operations.
The instructions cover the four major categories of bit manipulation: Count, Extract, Insert, Swap.
The spec supports RV32, RV64, and RV128. "Clever", obscure, and/or overly specific instructions
are avoided in favor of more straightforward, fast, generic ones. Coordination with other emerging
RISC-V ISA extension groups is required to ensure our instruction sets are architecturally
consistent.
• Assign concrete instruction encodings so that we can start implementing the extension in
processor cores and compilers.
• Add support for this extension to processor cores and compilers so we can run quantitative
evaluations on the instructions.
• Create assembler snippets for common operations that do not map 1:1 to any instruction in
this spec, but can be implemented easily using clever combinations of the instructions. Add
support for those snippets to compilers.
Chapter 2
In the proposals provided in this chapter, the C code examples are for illustration purposes only.
They are not optimal implementations, but are intended to specify the desired functionality.
The final standard will likely define a range of Z-extensions for different bit manipulation instruc-
tions, with the “B” extension itself being a mix of instructions from those Z-extensions. It is unclear
as of yet what this will look like exactly, but it will probably look something like this:
[Table: proposed grouping of the bit manipulation Z-extensions (Zb*) under "B"]
The main open questions of course relate to what should and shouldn’t be included in “B”, and
what should or shouldn’t be included in “Zbb”. These decisions will be informed in big part by
evaluations of the cost and added value for the individual instructions.
• Which “Zbp” pseudo-ops should be included in “B”? Which in “Zbb”? Should “Zbp” be
included in “B” as a whole?
For the purpose of tool-chain development “B” is currently everything (excluding “Zbf”).
RV64 only:
clzw rd, rs
ctzw rd, rs
The clz operation counts the number of 0 bits at the MSB end of the argument. That is, the
number of 0 bits before the first 1 bit counting from the most significant bit. If the input is 0, the
output is XLEN. If the input is -1, the output is 0.
The ctz operation counts the number of 0 bits at the LSB end of the argument. If the input is 0,
the output is XLEN. If the input is -1, the output is 0.
uint_xlen_t clz(uint_xlen_t rs1)
{
for (int count = 0; count < XLEN; count++)
if ((rs1 << count) >> (XLEN - 1))
return count;
return XLEN;
}
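A sketch of the corresponding ctz reference code, counting 0 bits from the LSB end as described above:

uint_xlen_t ctz(uint_xlen_t rs1)
{
    for (int count = 0; count < XLEN; count++)
        if ((rs1 >> count) & 1)
            return count;
    return XLEN;
}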
The expression XLEN-1-clz(x) evaluates to the index of the most significant set bit, also known
as integer base-2 logarithm, or -1 if x is zero.
RV64 only:
pcntw rd, rs
This instruction counts the number of 1 bits in a register. This operation is known as population
count, popcount, sideways sum, bit summation, or Hamming weight. [20, 18]
uint_xlen_t pcnt(uint_xlen_t rs1)
{
int count = 0;
for (int index = 0; index < XLEN; index++)
count += (rs1 >> index) & 1;
return count;
}
These instructions implement AND, OR, and XOR with the second argument inverted.
uint_xlen_t andn(uint_xlen_t rs1, uint_xlen_t rs2)
{
return rs1 & ~rs2;
}
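Sketches of the orn and xnor counterparts, following the same pattern:

uint_xlen_t orn(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 | ~rs2;
}

uint_xlen_t xnor(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 ^ ~rs2;
}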
These instructions can reuse the inverter on rs2 that the ALU already provides for implementing subtract.
Among other things, those instructions allow implementing the “trailing bit manipulation” code
patterns in two instructions each. For example, (x - 1) & ~x produces a mask from trailing zero
bits in x.
RV64 only:
packw rd, rs1, rs2
This instruction packs the XLEN/2-bit lower halves of rs1 and rs2 into rd, with rs1 in the lower
half and rs2 in the upper half.
uint_xlen_t pack(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t lower = (rs1 << XLEN/2) >> XLEN/2;
uint_xlen_t upper = rs2 << XLEN/2;
return upper | lower;
}
Applications include XLEN/2-bit funnel shifts, zero-extending XLEN/2-bit values, duplicating the lower
XLEN/2 bits (e.g. for mask creation), and loading unsigned 32-bit constants on RV64.
; Load 0xffff0000ffff0000 on RV64
lui rd, 0xffff0
pack rd, rd, rd
; Same as FSLW on RV64
pack rd, rs1, rs3
rol rd, rd, rs2
addiw rd, rd, 0
; Clear the upper half of rd
pack rd, rd, zero
Paired with shfli/unshfli and the other bit permutation instructions, pack can interleave arbi-
trary power-of-two chunks of rs1 and rs2. For example, interleaving the bytes in the lower halves
of rs1 and rs2:
pack rd, rs1, rs2
zip8 rd, rd
pack is most commonly used to zero-extend words <XLEN. For this purpose we define the following
assembler pseudo-ops:
RV32:
zext.b rd, rs -> andi rd, rs, 255
zext.h rd, rs -> pack rd, rs, zero
RV64:
zext.b rd, rs -> andi rd, rs, 255
zext.h rd, rs -> packw rd, rs, zero
zext.w rd, rs -> pack rd, rs, zero
RV128:
zext.b rd, rs -> andi rd, rs, 255
zext.h rd, rs -> packw rd, rs, zero
zext.w rd, rs -> packd rd, rs, zero
zext.d rd, rs -> pack rd, rs, zero
We define 4 R-type instructions min, max, minu, maxu with the following semantics:
uint_xlen_t min(uint_xlen_t rs1, uint_xlen_t rs2)
{
return (int_xlen_t)rs1 < (int_xlen_t)rs2 ? rs1 : rs2;
}
uint_xlen_t max(uint_xlen_t rs1, uint_xlen_t rs2)
{
return (int_xlen_t)rs1 > (int_xlen_t)rs2 ? rs1 : rs2;
}
uint_xlen_t minu(uint_xlen_t rs1, uint_xlen_t rs2)
{
return rs1 < rs2 ? rs1 : rs2;
}
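A sketch of maxu, mirroring minu with the comparison reversed:

uint_xlen_t maxu(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 > rs2 ? rs1 : rs2;
}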
Code that performs saturated arithmetic on a word size < XLEN needs to perform min/max opera-
tions frequently. A simple way of performing those operations without branching can benefit those
programs.
SAT solvers spend a lot of time calculating the absolute value of a signed integer due to the way
CNF literals are commonly encoded [10]. With max (or minu) this is a two-instruction operation:
neg a1, a0
max a0, a0, a1
RV64 only:
sbsetw rd, rs1, rs2
sbclrw rd, rs1, rs2
sbinvw rd, rs1, rs2
sbextw rd, rs1, rs2
sbsetiw rd, rs1, imm
sbclriw rd, rs1, imm
sbinviw rd, rs1, imm
We define 4 single-bit instructions sbset (set), sbclr (clear), sbinv (invert), and sbext (extract),
and their immediate-variants, with the following semantics:
uint_xlen_t sbset(uint_xlen_t rs1, uint_xlen_t rs2)
{
int shamt = rs2 & (XLEN - 1);
return rs1 | (uint_xlen_t(1) << shamt);
}
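Sketches of the remaining three single-bit operations, consistent with the semantics named above:

uint_xlen_t sbclr(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN - 1);
    return rs1 & ~(uint_xlen_t(1) << shamt);
}

uint_xlen_t sbinv(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN - 1);
    return rs1 ^ (uint_xlen_t(1) << shamt);
}

uint_xlen_t sbext(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN - 1);
    return 1 & (rs1 >> shamt);
}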
RV64 only:
slow rd, rs1, rs2
srow rd, rs1, rs2
sloiw rd, rs1, imm
sroiw rd, rs1, imm
These instructions are similar to shift-logical operations from the base spec, except instead of
shifting in zeros, they shift in ones.
uint_xlen_t slo(uint_xlen_t rs1, uint_xlen_t rs2)
{
int shamt = rs2 & (XLEN - 1);
return ~(~rs1 << shamt);
}
uint_xlen_t sro(uint_xlen_t rs1, uint_xlen_t rs2)
{
int shamt = rs2 & (XLEN - 1);
return ~(~rs1 >> shamt);
}
ISAs with flag registers often have a "Shift in Carry" or "Rotate through Carry" instruction.
Arguably a "Shift Ones" instruction is the equivalent on an ISA like RISC-V that avoids such flag registers.
The main application for the Shift Ones instruction is mask generation.
When implementing this circuit, the only change in the ALU over a standard logical shift is that
the value shifted in is not zero, but is a 1-bit register value that has been forwarded from the high
bit of the instruction decode. This creates the desired behavior on both logical zero-shifts and
logical ones-shifts.
RV64 only:
rorw rd, rs1, rs2
rolw rd, rs1, rs2
roriw rd, rs1, imm
These instructions are similar to shift-logical operations from the base spec, except they shift in
the values from the opposite side of the register, in order. This is also called ‘circular shift’.
uint_xlen_t rol(uint_xlen_t rs1, uint_xlen_t rs2)
{
int shamt = rs2 & (XLEN - 1);
return (rs1 << shamt) | (rs1 >> ((XLEN - shamt) & (XLEN - 1)));
}
uint_xlen_t ror(uint_xlen_t rs1, uint_xlen_t rs2)
{
int shamt = rs2 & (XLEN - 1);
return (rs1 >> shamt) | (rs1 << ((XLEN - shamt) & (XLEN - 1)));
}
[Figure 2.1: ror permutation network; five stages controlled by shamt[4:0]]
RV64 only:
grevw rd, rs1, rs2
greviw rd, rs1, imm
This instruction provides a single hardware primitive that can implement byte-order swap, bitwise
reversal, short-order swap, word-order swap (RV64), nibble-order swap, bitwise reversal within each
byte, etc. It takes a single register value and an immediate that controls which function is performed,
by controlling the levels in the recursive tree at which reversals occur.
This operation iteratively checks each bit i in rs2 from i = 0 to log2(XLEN) − 1, and if the
corresponding bit is set, swaps each adjacent pair of 2^i-bit blocks.
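For illustration, a 32-bit reference sketch of this recursive swapping (function name illustrative; each mask selects one half of every 2^i-bit pair, and the RV64 version adds a sixth stage that swaps 32-bit halves):

uint32_t grev32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 31;
    if (shamt &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (shamt &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (shamt &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (shamt &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (shamt & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}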
[Figure 2.2: grev permutation network; five stages controlled by shamt[4:0]]
The pattern above extends to RV128 in the obvious manner.
The grev operation can easily be implemented using a permutation network with log2 (XLEN)
stages. Figure 2.1 shows the permutation network for ror for reference. Figure 2.2 shows the
permutation network for grev.
grev is encoded as a standard R-type opcode and grevi is encoded as a standard I-type opcode. grev
and grevi can use the instruction encoding for "arithmetic shift left".
Pseudo-instructions are provided for the most common GREVI use-cases. Their names consist of
a prefix and an optional suffix. Each prefix and suffix corresponds to a bit mask. The GREVI
control word is obtained by AND-ing the two masks together.
In other words, the prefix controls the number of zero bits at the LSB end of the control word, and
the suffix controls the number of zeros at the MSB end of the control word.
rev8 reverses the order of bytes in a word and thus performs endianness conversion.
       RV32                     RV64 (shamt 0-31)            RV64 (shamt 32-63)
shamt      Instruction     shamt       Instruction     shamt       Instruction
0: 00000 — 0: 000000 — 32: 100000 rev32
1: 00001 rev.p 1: 000001 rev.p 33: 100001 —
2: 00010 rev2.n 2: 000010 rev2.n 34: 100010 —
3: 00011 rev.n 3: 000011 rev.n 35: 100011 —
4: 00100 rev4.b 4: 000100 rev4.b 36: 100100 —
5: 00101 — 5: 000101 — 37: 100101 —
6: 00110 rev2.b 6: 000110 rev2.b 38: 100110 —
7: 00111 rev.b 7: 000111 rev.b 39: 100111 —
8: 01000 rev8.h 8: 001000 rev8.h 40: 101000 —
9: 01001 — 9: 001001 — 41: 101001 —
10: 01010 — 10: 001010 — 42: 101010 —
11: 01011 — 11: 001011 — 43: 101011 —
12: 01100 rev4.h 12: 001100 rev4.h 44: 101100 —
13: 01101 — 13: 001101 — 45: 101101 —
14: 01110 rev2.h 14: 001110 rev2.h 46: 101110 —
15: 01111 rev.h 15: 001111 rev.h 47: 101111 —
16: 10000 rev16 16: 010000 rev16.w 48: 110000 rev16
17: 10001 — 17: 010001 — 49: 110001 —
18: 10010 — 18: 010010 — 50: 110010 —
19: 10011 — 19: 010011 — 51: 110011 —
20: 10100 — 20: 010100 — 52: 110100 —
21: 10101 — 21: 010101 — 53: 110101 —
22: 10110 — 22: 010110 — 54: 110110 —
23: 10111 — 23: 010111 — 55: 110111 —
24: 11000 rev8 24: 011000 rev8.w 56: 111000 rev8
25: 11001 — 25: 011001 — 57: 111001 —
26: 11010 — 26: 011010 — 58: 111010 —
27: 11011 — 27: 011011 — 59: 111011 —
28: 11100 rev4 28: 011100 rev4.w 60: 111100 rev4
29: 11101 — 29: 011101 — 61: 111101 —
30: 11110 rev2 30: 011110 rev2.w 62: 111110 rev2
31: 11111 rev 31: 011111 rev.w 63: 111111 rev
RV64 only:
shflw rd, rs1, rs2
unshflw rd, rs1, rs2
Shuffle is the third bit permutation instruction in the RISC-V Bitmanip extension, after rotate shift
and generalized reverse. It implements a generalization of the operation commonly known as perfect
outer shuffle and its inverse (shuffle/unshuffle), also known as zip/unzip or interlace/uninterlace.
Bit permutations can be understood as reversible functions on bit indices (i.e. 5 bit functions on
RV32 and 6 bit functions on RV64).
A generalized (un)shuffle operation has log2 (XLEN) − 1 control bits, one for each pair of neigh-
bouring bits in a bit index. When the bit is set, generalized shuffle will swap the two index bits.
The shfl operation performs these swaps in MSB-to-LSB order (performing a rotate left shift on
contiguous regions of set control bits), and the unshfl operation performs the swaps in LSB-to-
MSB order (performing a rotate right shift on contiguous regions of set control bits). Combining
up to log2 (XLEN) of those shfl/unshfl operations can implement any bit permutation on the bit
indices.
The most common type of shuffle/unshuffle operation is one on an immediate control value that
only contains one contiguous region of set bits. We call those operations zip/unzip and provide
pseudo-instructions for them. The naming scheme for those pseudo-instructions is similar to the
naming scheme for the grevi pseudo-instructions.
Shuffle/unshuffle operations that only have individual bits set (not a contiguous region of two or
more bits) are their own inverse.
Like GREV and rotate shift, the (un)shuffle instruction can be implemented using a short sequence
of elementary permutations that are enabled or disabled by the shamt bits. But (un)shuffle has one
stage fewer than GREV. Thus shfli+unshfli together require the same amount of encoding space
as grevi.
uint32_t shuffle32_stage(uint32_t src, uint32_t maskL, uint32_t maskR, int N)
{
uint32_t x = src & ~(maskL | maskR);
x |= ((src << N) & maskL) | ((src >> N) & maskR);
return x;
}
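A sketch of shfl32 and unshfl32 built from this stage helper, with the stage masks chosen to swap the neighbouring index-bit pairs described above (shfl applies the stages in MSB-to-LSB order, unshfl in LSB-to-MSB order):

uint32_t shfl32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 15;

    if (shamt & 8) x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);
    if (shamt & 4) x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if (shamt & 2) x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if (shamt & 1) x = shuffle32_stage(x, 0x44444444, 0x22222222, 1);

    return x;
}

uint32_t unshfl32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = rs1;
    int shamt = rs2 & 15;

    if (shamt & 1) x = shuffle32_stage(x, 0x44444444, 0x22222222, 1);
    if (shamt & 2) x = shuffle32_stage(x, 0x30303030, 0x0c0c0c0c, 2);
    if (shamt & 4) x = shuffle32_stage(x, 0x0f000f00, 0x00f000f0, 4);
    if (shamt & 8) x = shuffle32_stage(x, 0x00ff0000, 0x0000ff00, 8);

    return x;
}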
Or for RV64:
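A sketch of the 64-bit variant, using an analogous 64-bit stage helper and one additional stage (unshfl64 would apply the stages in the opposite order; helper and function names follow the 32-bit versions):

uint64_t shuffle64_stage(uint64_t src, uint64_t maskL, uint64_t maskR, int N)
{
    uint64_t x = src & ~(maskL | maskR);
    x |= ((src << N) & maskL) | ((src >> N) & maskR);
    return x;
}

uint64_t shfl64(uint64_t rs1, uint64_t rs2)
{
    uint64_t x = rs1;
    int shamt = rs2 & 31;

    if (shamt & 16) x = shuffle64_stage(x, 0x0000ffff00000000LL, 0x00000000ffff0000LL, 16);
    if (shamt &  8) x = shuffle64_stage(x, 0x00ff000000ff0000LL, 0x0000ff000000ff00LL,  8);
    if (shamt &  4) x = shuffle64_stage(x, 0x0f000f000f000f00LL, 0x00f000f000f000f0LL,  4);
    if (shamt &  2) x = shuffle64_stage(x, 0x3030303030303030LL, 0x0c0c0c0c0c0c0c0cLL,  2);
    if (shamt &  1) x = shuffle64_stage(x, 0x4444444444444444LL, 0x2222222222222222LL,  1);

    return x;
}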
[Figure 2.3: (un)shuffle permutation network; the unshfl direction applies the mode[3:0] stage controls in reverse order]
Again, this pattern extends to RV128 in the obvious manner.
Alternatively, (un)shuffle can be implemented in a single network with one more stage than GREV,
with the additional first and last stages executing a permutation that effectively reverses the order
of the inner stages. However, since the inner stages only mux half of the bits in the word each, a
hardware implementation using these additional "flip" stages might actually be more expensive than
simply creating two networks.
uint32_t shuffle32_flip(uint32_t src)
{
uint32_t x = src & 0x88224411;
x |= ((src << 6) & 0x22001100) | ((src >> 6) & 0x00880044);
x |= ((src << 9) & 0x00440000) | ((src >> 9) & 0x00002200);
x |= ((src << 15) & 0x44110000) | ((src >> 15) & 0x00008822);
x |= ((src << 21) & 0x11000000) | ((src >> 21) & 0x00000088);
return x;
}
[Figure 2.4: (un)shuffle permutation network with the additional "flip" stages surrounding stages 3..0]
uint32_t unshfl32_via_flip(uint32_t rs1, uint32_t rs2)
{
    // control bits applied in reverse order (function name and control-bit
    // reversal reconstructed from the text and figure above)
    uint32_t shfl_mode = ((rs2 & 1) << 3) | ((rs2 & 2) << 1) |
                         ((rs2 & 4) >> 1) | ((rs2 & 8) >> 3);
    uint32_t x = rs1;
    x = shuffle32_flip(x);
    x = shfl32(x, shfl_mode);
    x = shuffle32_flip(x);
    return x;
}
Figure 2.4 shows the (un)shuffle permutation network with “flip” stages and Figure 2.3 shows the
(un)shuffle permutation network without “flip” stages.
The zip instruction with the upper half of its input cleared performs the commonly needed “fan-
out” operation. (Equivalent to bdep with a 0x55555555 mask.) The zip instruction applied twice
fans out the bits in the lower quarter of the input word by a spacing of 4 bits.
For example, the following code calculates the bitwise prefix sum of the bits in the lower byte of a word:
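; sketch: fan the byte out to nibbles, then accumulate with shift-add steps
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
slli a1, a0, 4
add a0, a0, a1
slli a1, a0, 8
add a0, a0, a1
slli a1, a0, 16
add a0, a0, a1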
Similarly, the following code stores the indices of the set bits in the LSB nibbles of the output word
(with the LSB bit having index 1), with the unused MSB nibbles in the output set to zero:
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
slli a1, a0, 1
or a0, a0, a1
slli a1, a0, 2
or a0, a0, a1
li a1, 0x87654321
and a1, a0, a1
bext a0, a1, a0
Other zip modes can be used to "fan-out" in blocks of 2, 4, 8, or 16 bits. zip can be combined
with grevi to perform inner shuffles. For example on RV64:
li a0, 0x0000000012345678
zip4 t0, a0 ; <- 0x0102030405060708
rev4.b t1, t0 ; <- 0x1020304050607080
zip8 t2, a0 ; <- 0x0012003400560078
rev8.h t3, t2 ; <- 0x1200340056007800
zip16 t4, a0 ; <- 0x0000123400005678
rev16.w t5, t4 ; <- 0x1234000056780000
Another application for the zip instruction is generating Morton code [21].
The x86 PUNPCK[LH]* MMX/SSE/AVX instructions perform similar operations as zip8 and zip16.
RV64 only:
bextw rd, rs1, rs2
bdepw rd, rs1, rs2
These instructions implement the generic bit extract and bit deposit functions. This operation is
also referred to as bit gather/scatter, bit pack/unpack, parallel extract/deposit, compress/expand,
or right compress/right expand.
bext collects LSB justified bits to rd from rs1 using extract mask in rs2.
bdep writes LSB justified bits from rs1 to rd using deposit mask in rs2.
uint_xlen_t bext(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t r = 0;
for (int i = 0, j = 0; i < XLEN; i++)
if ((rs2 >> i) & 1) {
if ((rs1 >> i) & 1)
r |= uint_xlen_t(1) << j;
j++;
}
return r;
}
uint_xlen_t bdep(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t r = 0;
for (int i = 0, j = 0; i < XLEN; i++)
if ((rs2 >> i) & 1) {
if ((rs1 >> j) & 1)
r |= uint_xlen_t(1) << i;
j++;
}
return r;
}
Implementations may choose to use smaller multi-cycle implementations of bext and bdep, or even
emulate the instructions in software.
Even though multi-cycle bext and bdep often are not fast enough to outperform algorithms that
use sequences of shifts and bit masks, dedicated instructions for those operations can still be of
great advantage in cases where the mask argument is not constant.
For example, the following code efficiently calculates the index of the tenth set bit in a0 using bdep:
li a1, 0x00000200
bdep a0, a1, a0
ctz a0, a0
For cases with a constant mask an optimizing compiler would decide when to use bext or bdep
based on the optimization profile for the concrete processor it is optimizing for. This is similar to
the decision whether to use MUL or DIV with a constant, or to perform the same operation using
a longer sequence of much simpler operations.
The bext and bdep instructions are equivalent to the x86 BMI2 instructions PEXT and PDEP. But
there is much older prior art. For example, the Soviet BESM-6 mainframe computer, designed and
built in the 1960s, had APX/AUX instructions with almost the same semantics. [1] (The BESM-6
APX/AUX instructions packed/unpacked at the MSB end instead of the LSB end. Otherwise they are
the same instructions.)
RV64 only:
clmulw rd, rs1, rs2
clmulhw rd, rs1, rs2
clmulrw rd, rs1, rs2
Calculate the carry-less product [19] of the two arguments. clmul produces the lower half of the
carry-less product and clmulh produces the upper half of the 2·XLEN carry-less product.
clmulr produces bits 2·XLEN−2:XLEN-1 of the 2·XLEN carry-less product. That means clmulh
is equivalent to clmulr followed by a 1-bit right shift. (The MSB of a clmulh result is always
zero.) Another equivalent definition of clmulr is clmulr(a,b) := rev(clmul(rev(a), rev(b))).
(The "r" in clmulr means reversed.)
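Reference sketches of the three operations, consistent with the definitions above (clmulh collects the product bits at position XLEN and above, clmulr the product bits starting at position XLEN-1):

uint_xlen_t clmul(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN; i++)
        if ((rs2 >> i) & 1)
            x ^= rs1 << i;
    return x;
}

uint_xlen_t clmulh(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t x = 0;
    for (int i = 1; i < XLEN; i++)
        if ((rs2 >> i) & 1)
            x ^= rs1 >> (XLEN - i);
    return x;
}

uint_xlen_t clmulr(uint_xlen_t rs1, uint_xlen_t rs2)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN; i++)
        if ((rs2 >> i) & 1)
            x ^= rs1 >> (XLEN - i - 1);
    return x;
}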
Unlike mulh[[s]u], we add a *W variant of clmulh. This is because we expect some code to use
32-bit clmul intrinsics even on 64-bit architectures, for example in cases where data is processed in
32-bit chunks.
The classic applications for clmul are CRC [11, 22] and GCM, but more applications exist, including
the following examples.
There are obvious applications in hashing and pseudo-random number generation. For example,
it has been reported that hashes based on carry-less multiplications can outperform Google's
CityHash [15].
clmul of a number with itself inserts zeroes between each input bit. This can be useful for generating
Morton code [21].
clmul of a number with -1 calculates the prefix XOR operation. This can be useful for decoding
Gray codes.
Another application of XOR prefix sums calculated with clmul is branchless tracking of quoted
strings in high-performance parsers. [14]
Carry-less multiply can also be used to implement erasure coding efficiently. [12]
RV64 only:
crc32.d rd, rs
crc32c.d rd, rs
Unary CRC instructions that interpret the bits of rs1 as a CRC32/CRC32C state and perform a
polynomial reduction of that state shifted left by 8, 16, 32, or 64 bits.
Payload data must be XOR’ed into the LSB end of the state before executing the CRC instruction.
The following code demonstrates the use of crc32.b:
uint32_t crc32_demo(const uint8_t *p, int len)
{
uint32_t x = 0xffffffff;
for (int i = 0; i < len; i++) {
x = x ^ p[i];
x = crc32_b(x);
}
return ~x;
}
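For illustration, the reduction itself can be modelled as repeated shift-and-conditional-XOR steps (function name illustrative; crc32.b corresponds to nbits=8, crc32.h to nbits=16, and so on, and the CRC32C variants use 0x82F63B78 instead of 0xEDB88320):

uint_xlen_t crc32_reduce(uint_xlen_t x, int nbits)
{
    for (int i = 0; i < nbits; i++)
        x = (x >> 1) ^ (0xEDB88320 & ~((x & 1) - 1));
    return x;
}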
In terms of binary polynomial arithmetic those instructions perform the operation

    rd′ = (rs1′ · x^N) mod {1, P′}

with N ∈ {8, 16, 32, 64}, P = 0xEDB8 8320 for CRC32 and P = 0x82F6 3B78 for CRC32C, a′
denoting the XLEN bit reversal of a, and {a, b} denoting bit concatenation. Note that for example
for CRC32 {1, P′} = 0x1 04C1 1DB7 on RV32 and {1, P′} = 0x1 04C1 1DB7 0000 0000 on RV64.
These dedicated CRC instructions are meant for RISC-V implementations without a fast multiplier
and therefore without fast clmul[h]. For implementations with fast clmul[h] it is recommended
to use the methods described in [11] and demonstrated in [22] that can process XLEN input bits
using just one carry-less multiply for arbitrary CRC polynomials.
In applications where those methods are not applicable it is possible to emulate the dedicated CRC
instructions using two carry-less multiplies that implement a Barrett reduction. The following
example implements a replacement for crc32.w (RV32).
crc32_w:
li t0, 0xF7011641
li t1, 0xEDB88320
clmul a0, a0, t0
clmulr a0, a0, t1
ret
These are 64-bit-only instructions that are not available on RV32. On RV128 they ignore the upper half of their operands.
bmatxor performs a matrix-matrix multiply with boolean AND as multiply operator and boolean
XOR as addition operator.
bmator performs a matrix-matrix multiply with boolean AND as multiply operator and boolean
OR as addition operator.
bmatflip is a unary operator that transposes the source matrix. It is equivalent to zip; zip; zip
on RV64.
Among other things, bmatxor/bmator can be used to perform arbitrary permutations of bits within
each byte (permutation matrix as 2nd operand) or perform arbitrary permutations of bytes within
a 64-bit word (permutation matrix as 1st operand).
There are similar instructions in Cray XMT [5]. The Cray X1 architecture even has a full 64x64
bit matrix multiply unit [4].
The MMIX architecture has MOR and MXOR instructions with the same semantics. [13, p. 182f]
uint64_t bmatxor(uint64_t rs1, uint64_t rs2)
{
    // transpose of rs2, so that rows of rs2t are the columns of rs2
    uint64_t rs2t = bmatflip(rs2);
    uint64_t x = 0;
    for (int i = 0; i < 64; i++) {
        uint8_t row = rs1 >> (8 * (i / 8));   // row i/8 of rs1
        uint8_t col = rs2t >> (8 * (i % 8));  // column i%8 of rs2
        if (pcnt(row & col) & 1)
            x |= 1LL << i;
    }
    return x;
}
uint64_t bmator(uint64_t rs1, uint64_t rs2)
{
    // transpose of rs2, so that rows of rs2t are the columns of rs2
    uint64_t rs2t = bmatflip(rs2);
    uint64_t x = 0;
    for (int i = 0; i < 64; i++) {
        uint8_t row = rs1 >> (8 * (i / 8));   // row i/8 of rs1
        uint8_t col = rs2t >> (8 * (i % 8));  // column i%8 of rs2
        if ((row & col) != 0)
            x |= 1LL << i;
    }
    return x;
}
(Note that the assembler syntax of cmix has the rs2 argument first to make assembler code more
readable. But the reference C code below uses the "architecturally correct" argument order
rs1, rs2, rs3.)
The cmix rd, rs2, rs1, rs3 instruction selects bits from rs1 and rs3 based on the bits in the
control word rs2.
uint_xlen_t cmix(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
return (rs1 & rs2) | (rs3 & ~rs2);
}
(Note that the assembler syntax of cmov has the rs2 argument first to make assembler code more
readable. But the reference C code below uses the "architecturally correct" argument order
rs1, rs2, rs3.)
The cmov rd, rs2, rs1, rs3 instruction selects rs1 if the control word rs2 is non-zero, and rs3
if the control word is zero.
uint_xlen_t cmov(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
return rs2 ? rs1 : rs3;
}
RV64 only:
fslw rd, rs1, rs3, rs2
fsrw rd, rs1, rs3, rs2
fsriw rd, rs1, rs3, imm
(Note that the assembler syntax for funnel shifts has the rs2 argument last to make assembler code
more readable. But the reference C code below uses the "architecturally correct" argument
order rs1, rs2, rs3.)
The fsl rd, rs1, rs3, rs2 instruction creates a 2 · XLEN word by concatenating rs1 and rs3
(with rs1 in the MSB half), rotate-left-shifts that word by the amount indicated in the log2 (XLEN)+
1 LSB bits in rs2, and then writes the MSB half of the result to rd.
The fsr rd, rs1, rs3, rs2 instruction creates a 2 · XLEN word by concatenating rs1 and
rs3 (with rs1 in the LSB half), rotate-right-shifts that word by the amount indicated in the
log2 (XLEN) + 1 LSB bits in rs2, and then writes the LSB half of the result to rd.
uint_xlen_t fsl(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
int shamt = rs2 & (2*XLEN - 1);
uint_xlen_t A = rs1, B = rs3;
if (shamt >= XLEN) {
shamt -= XLEN;
A = rs3;
B = rs1;
}
return shamt ? (A << shamt) | (B >> (XLEN-shamt)) : A;
}
uint_xlen_t fsr(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
int shamt = rs2 & (2*XLEN - 1);
uint_xlen_t A = rs1, B = rs3;
if (shamt >= XLEN) {
shamt -= XLEN;
A = rs3;
B = rs1;
}
return shamt ? (A >> shamt) | (B << (XLEN-shamt)) : A;
}
A shift unit capable of either fsl or fsr is capable of performing all the other shift functions,
including the other funnel shift, with only minimal additional logic.
Consider C code that’s using unsigned 32-bit ints as array indices. For example:
char addiwu_demo(char *p, unsigned int i) {
return p[i-1];
}
The instructions below ensure that no explicit zext.w instruction is needed in those cases, so that
code like the example above has no systematic performance penalty on RV64 compared to RV32.
These instructions are identical to addw, subw, addiw, except that bits XLEN-1:32 of the result
are cleared after the addition. I.e. these instructions zero-extend instead of sign-extend the 32-bit
result.
uint_xlen_t addwu(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t result = rs1 + rs2;
return (uint32_t)result;
}
uint_xlen_t subwu(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t result = rs1 - rs2;
return (uint32_t)result;
}
slliu.w is identical to slli, except that bits XLEN-1:32 of the rs1 argument are cleared before
the shift.
addu.w and subu.w are identical to add and sub, except that bits XLEN-1:32 of the rs2 argument
are cleared before the add/subtract.
uint_xlen_t slliuw(uint_xlen_t rs1, int imm)
{
uint_xlen_t rs1u = (uint32_t)rs1;
int shamt = imm & (XLEN - 1);
return rs1u << shamt;
}
uint_xlen_t adduw(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t rs2u = (uint32_t)rs2;
return rs1 + rs2u;
}
uint_xlen_t subuw(uint_xlen_t rs1, uint_xlen_t rs2)
{
uint_xlen_t rs2u = (uint32_t)rs2;
return rs1 - rs2u;
}
This chapter contains proposed encodings for most of the instructions described in this document.
DO NOT IMPLEMENT THESE OPCODES YET. We are trying to get official opcodes
assigned and will update this chapter soon with the official opcodes.
The andn, orn, and xnor instructions are encoded the same way as and, or, and xor, but with
op[30] set, mirroring the encoding scheme used for add and sub.
All shift instructions use funct3=001 for left shifts and funct3=101 for right shifts. GREV occupies
the spot that would decode as SLA (arithmetic left shift).
op[26]=1 selects funnel shifts. For funnel shifts op[30:29] is part of the 3rd operand and therefore
unused for encoding the operation. For all other shift operations op[26]=0.
fsri is also encoded with op[26]=1, leaving a 6-bit immediate. The 7th bit, which would be necessary
to perform a 128-bit funnel shift on RV64, can be emulated by swapping rs1 and rs3.
There is no shfliw instruction. The slliu.w instruction occupies the encoding slot that would be
occupied by shfliw.
On RV128 op[26] contains the MSB of the immediate for the shift instructions. Therefore there
is no FSRI instruction on RV128. (But there is FSRIW/FSRID.)
| SLL SRL SRA | GREV | SLO SRO | ROL ROR | FSL FSR
op[30] | 0 0 1 | 1 | 0 0 | 1 1 | - -
op[29] | 0 0 0 | 0 | 1 1 | 1 1 | - -
op[26] | 0 0 0 | 0 | 0 0 | 0 0 | 1 1
funct3 | 001 101 101 | 001 | 001 101 | 001 101 | 001 101
Only an encoding for RORI exists, as ROLI can be implemented with RORI by negating the
immediate. Unary functions are encoded in the spot that would correspond to ROLI, with the
function encoded in the 5 LSB bits of the immediate.
The CRC instructions are encoded as unary instructions with op[24] set. The polynomial is
selected via op[23], with op[23]=0 for CRC32 and op[23]=1 for CRC32C. The width is selected
with op[22:20], using the same encoding as is used in funct3 for load/store operations.
cmix and cmov are encoded using the two remaining ternary operator encodings in funct3=001
and funct3=101. (There are two ternary operator encodings per minor opcode using the op[26]=1
scheme for marking ternary OPs.)
The single-bit instructions are also encoded within the shift opcodes, with op[27] set, and using
op[30] and op[29] to select the operation:
| SBSET SBCLR SBINV | SBEXT
op[30] | 0 1 1 | 1
op[29] | 1 0 1 | 0
op[27] | 1 1 1 | 1
funct3 | 001 001 001 | 101
The encoding of clmul, clmulr, clmulh is identical to the encoding of mulh, mulhsu, mulhu,
except that op[27]=1.
The encoding of min[u]/max[u] uses funct3=100..111. The funct3 encoding matches op[31:29]
of the AMO min/max functions.
The remaining instructions are encoded within funct7=0000100. The shift-like shfl/unshfl in-
structions use the same funct3 values as the shift operations. bdep and bext are encoded in a
way so that funct3[2] selects the "direction", similar to shift operations.
addwu and subwu are encoded like addw and subw, except that op[25]=1 and op[27]=1.
addu.w and subu.w are encoded like addw and subw, except that op[27]=1.
Finally, RV64 has *W instructions for all bitmanip instructions, with the following exceptions:
andn, cmix, cmov, min[u], max[u] have no *W variants because they already behave in the way a
*W instruction would when presented with sign-extended 32-bit arguments.
bmatflip, bmatxor, bmator have no *W variants because they are 64-bit only instructions.
There is no [un]shfliw, as a perfect outer shuffle always preserves the MSB bit, thus [un]shfli
preserves proper sign extension when the upper bit in the control word is set. There’s still
[un]shflw that masks that upper control bit and sign-extends the output.
Relevant instruction encodings from the base ISA are included in the table below and are marked
with a *.
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|---------------------------------------------------------------|
| funct7 | rs2 | rs1 | f3 | rd | opcode | R-type
| rs3 | f2| rs2 | rs1 | f3 | rd | opcode | R4-type
| imm | rs1 | f3 | rd | opcode | I-type
|===============================================================|
| 0000000 | rs2 | rs1 | 111 | rd | 0110011 | AND*
| 0000000 | rs2 | rs1 | 110 | rd | 0110011 | OR*
| 0000000 | rs2 | rs1 | 100 | rd | 0110011 | XOR*
| 0100000 | rs2 | rs1 | 111 | rd | 0110011 | ANDN
| 0100000 | rs2 | rs1 | 110 | rd | 0110011 | ORN
| 0100000 | rs2 | rs1 | 100 | rd | 0110011 | XNOR
|---------------------------------------------------------------|
| 0000000 | rs2 | rs1 | 001 | rd | 0110011 | SLL*
| 0000000 | rs2 | rs1 | 101 | rd | 0110011 | SRL*
| 0100000 | rs2 | rs1 | 001 | rd | 0110011 | GREV
| 0100000 | rs2 | rs1 | 101 | rd | 0110011 | SRA*
| 0010000 | rs2 | rs1 | 001 | rd | 0110011 | SLO
| 0010000 | rs2 | rs1 | 101 | rd | 0110011 | SRO
| 0110000 | rs2 | rs1 | 001 | rd | 0110011 | ROL
| 0110000 | rs2 | rs1 | 101 | rd | 0110011 | ROR
|---------------------------------------------------------------|
| 0010100 | rs2 | rs1 | 001 | rd | 0110011 | SBSET
| 0100100 | rs2 | rs1 | 001 | rd | 0110011 | SBCLR
| 0110100 | rs2 | rs1 | 001 | rd | 0110011 | SBINV
| 0100100 | rs2 | rs1 | 101 | rd | 0110011 | SBEXT
|---------------------------------------------------------------|
| 00000 | imm | rs1 | 001 | rd | 0010011 | SLLI*
| 00000 | imm | rs1 | 101 | rd | 0010011 | SRLI*
| 01000 | imm | rs1 | 001 | rd | 0010011 | GREVI
| 01000 | imm | rs1 | 101 | rd | 0010011 | SRAI*
| 00100 | imm | rs1 | 001 | rd | 0010011 | SLOI
| 00100 | imm | rs1 | 101 | rd | 0010011 | SROI
| 01100 | imm | rs1 | 101 | rd | 0010011 | RORI
|---------------------------------------------------------------|
| 00101 | imm | rs1 | 001 | rd | 0010011 | SBSETI
| 01001 | imm | rs1 | 001 | rd | 0010011 | SBCLRI
| 01101 | imm | rs1 | 001 | rd | 0010011 | SBINVI
| 01001 | imm | rs1 | 101 | rd | 0010011 | SBEXTI
|---------------------------------------------------------------|
| rs3 | 11| rs2 | rs1 | 001 | rd | 0110011 | CMIX
| rs3 | 11| rs2 | rs1 | 101 | rd | 0110011 | CMOV
| rs3 | 10| rs2 | rs1 | 001 | rd | 0110011 | FSL
| rs3 | 10| rs2 | rs1 | 101 | rd | 0110011 | FSR
| rs3 |1| imm | rs1 | 101 | rd | 0010011 | FSRI
|---------------------------------------------------------------|
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|---------------------------------------------------------------|
| funct7 | rs2 | rs1 | f3 | rd | opcode | R-type
| rs3 | f2| rs2 | rs1 | f3 | rd | opcode | R4-type
| imm | rs1 | f3 | rd | opcode | I-type
|===============================================================|
| 0110000 | 00000 | rs1 | 001 | rd | 0010011 | CLZ
| 0110000 | 00001 | rs1 | 001 | rd | 0010011 | CTZ
| 0110000 | 00010 | rs1 | 001 | rd | 0010011 | PCNT
| 0110000 | 00011 | rs1 | 001 | rd | 0010011 | BMATFLIP
|---------------------------------------------------------------|
| 0110000 | 10000 | rs1 | 001 | rd | 0010011 | CRC32.B
| 0110000 | 10001 | rs1 | 001 | rd | 0010011 | CRC32.H
| 0110000 | 10010 | rs1 | 001 | rd | 0010011 | CRC32.W
| 0110000 | 10011 | rs1 | 001 | rd | 0010011 | CRC32.D
| 0110000 | 11000 | rs1 | 001 | rd | 0010011 | CRC32C.B
| 0110000 | 11001 | rs1 | 001 | rd | 0010011 | CRC32C.H
| 0110000 | 11010 | rs1 | 001 | rd | 0010011 | CRC32C.W
| 0110000 | 11011 | rs1 | 001 | rd | 0010011 | CRC32C.D
|---------------------------------------------------------------|
| 0000101 | rs2 | rs1 | 001 | rd | 0110011 | CLMUL
| 0000101 | rs2 | rs1 | 010 | rd | 0110011 | CLMULR
| 0000101 | rs2 | rs1 | 011 | rd | 0110011 | CLMULH
| 0000101 | rs2 | rs1 | 100 | rd | 0110011 | MIN
| 0000101 | rs2 | rs1 | 101 | rd | 0110011 | MAX
| 0000101 | rs2 | rs1 | 110 | rd | 0110011 | MINU
| 0000101 | rs2 | rs1 | 111 | rd | 0110011 | MAXU
|---------------------------------------------------------------|
| 0000100 | rs2 | rs1 | 001 | rd | 0110011 | SHFL
| 0000100 | rs2 | rs1 | 101 | rd | 0110011 | UNSHFL
| 0000100 | rs2 | rs1 | 010 | rd | 0110011 | BDEP
| 0000100 | rs2 | rs1 | 110 | rd | 0110011 | BEXT
| 0000100 | rs2 | rs1 | 100 | rd | 0110011 | PACK
| 0000100 | rs2 | rs1 | 011 | rd | 0110011 | BMATOR
| 0000100 | rs2 | rs1 | 111 | rd | 0110011 | BMATXOR
|---------------------------------------------------------------|
| 000010 | imm | rs1 | 001 | rd | 0010011 | SHFLI
| 000010 | imm | rs1 | 101 | rd | 0010011 | UNSHFLI
|===============================================================|
| immediate | rs1 | 000 | rd | 0011011 | ADDIW*
| immediate | rs1 | 100 | rd | 0011011 | ADDIWU
| 00001 | imm | rs1 | 001 | rd | 0011011 | SLLIU.W
|---------------------------------------------------------------|
| 0000000 | rs2 | rs1 | 000 | rd | 0111011 | ADDW*
| 0100000 | rs2 | rs1 | 000 | rd | 0111011 | SUBW*
| 0000101 | rs2 | rs1 | 000 | rd | 0111011 | ADDWU
| 0100101 | rs2 | rs1 | 000 | rd | 0111011 | SUBWU
| 0000100 | rs2 | rs1 | 000 | rd | 0111011 | ADDU.W
| 0100100 | rs2 | rs1 | 000 | rd | 0111011 | SUBU.W
|---------------------------------------------------------------|
| 3 2 1 |
|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
|---------------------------------------------------------------|
| funct7 | rs2 | rs1 | f3 | rd | opcode | R-type
| rs3 | f2| rs2 | rs1 | f3 | rd | opcode | R4-type
| imm | rs1 | f3 | rd | opcode | I-type
|===============================================================|
| 0000000 | rs2 | rs1 | 001 | rd | 0111011 | SLLW*
| 0000000 | rs2 | rs1 | 101 | rd | 0111011 | SRLW*
| 0100000 | rs2 | rs1 | 001 | rd | 0111011 | GREVW
| 0100000 | rs2 | rs1 | 101 | rd | 0111011 | SRAW*
| 0010000 | rs2 | rs1 | 001 | rd | 0111011 | SLOW
| 0010000 | rs2 | rs1 | 101 | rd | 0111011 | SROW
| 0110000 | rs2 | rs1 | 001 | rd | 0111011 | ROLW
| 0110000 | rs2 | rs1 | 101 | rd | 0111011 | RORW
|---------------------------------------------------------------|
| 0010100 | rs2 | rs1 | 001 | rd | 0111011 | SBSETW
| 0100100 | rs2 | rs1 | 001 | rd | 0111011 | SBCLRW
| 0110100 | rs2 | rs1 | 001 | rd | 0111011 | SBINVW
| 0100100 | rs2 | rs1 | 101 | rd | 0111011 | SBEXTW
|---------------------------------------------------------------|
| 0000000 | imm | rs1 | 001 | rd | 0011011 | SLLIW*
| 0000000 | imm | rs1 | 101 | rd | 0011011 | SRLIW*
| 0100000 | imm | rs1 | 001 | rd | 0011011 | GREVIW
| 0100000 | imm | rs1 | 101 | rd | 0011011 | SRAIW*
| 0010000 | imm | rs1 | 001 | rd | 0011011 | SLOIW
| 0010000 | imm | rs1 | 101 | rd | 0011011 | SROIW
| 0110000 | imm | rs1 | 101 | rd | 0011011 | RORIW
|---------------------------------------------------------------|
| 0010100 | imm | rs1 | 001 | rd | 0011011 | SBSETIW
| 0100100 | imm | rs1 | 001 | rd | 0011011 | SBCLRIW
| 0110100 | imm | rs1 | 001 | rd | 0011011 | SBINVIW
|---------------------------------------------------------------|
| rs3 | 10| rs2 | rs1 | 001 | rd | 0111011 | FSLW
| rs3 | 10| rs2 | rs1 | 101 | rd | 0111011 | FSRW
| rs3 | 10| imm | rs1 | 101 | rd | 0011011 | FSRIW
|---------------------------------------------------------------|
| 0110000 | 00000 | rs1 | 001 | rd | 0011011 | CLZW
| 0110000 | 00001 | rs1 | 001 | rd | 0011011 | CTZW
| 0110000 | 00010 | rs1 | 001 | rd | 0011011 | PCNTW
|---------------------------------------------------------------|
| 0000101 | rs2 | rs1 | 001 | rd | 0111011 | CLMULW
| 0000101 | rs2 | rs1 | 010 | rd | 0111011 | CLMULRW
| 0000101 | rs2 | rs1 | 011 | rd | 0111011 | CLMULHW
|---------------------------------------------------------------|
| 0000100 | rs2 | rs1 | 001 | rd | 0111011 | SHFLW
| 0000100 | rs2 | rs1 | 101 | rd | 0111011 | UNSHFLW
| 0000100 | rs2 | rs1 | 010 | rd | 0111011 | BDEPW
| 0000100 | rs2 | rs1 | 110 | rd | 0111011 | BEXTW
| 0000100 | rs2 | rs1 | 100 | rd | 0111011 | PACKW
|---------------------------------------------------------------|
The RISC-V ISA has no dedicated instructions for bitwise inverse (not). Instead not is implemented
as xori rd, rs, -1 and neg is implemented as sub rd, x0, rs.
In bitmanipulation code not is a very common operation. But there is no compressed encoding for
this operation because there is no c.xori instruction.
On RV64 (and RV128) zext.w and zext.d (pack and packd) are commonly used to zero-extend
unsigned values <XLEN.
It presumably would make sense for a future revision of the “C” extension to include compressed
opcodes for those instructions.
An encoding with the constraint rd = rs would fit nicely in the reserved space in
c.addi16sp/c.lui.
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
011 nzimm[9] 2 nzimm[4|6|8:7|5] 01 C.ADDI16SP (RES, nzimm=0)
011 nzimm[17] rd≠{0, 2} nzimm[16:12] 01 C.LUI (RES, nzimm=0; HINT, rd=0)
011 0 00 rs1′ /rd′ 0 01 C.NOT
011 0 01 rs1′ /rd′ 0 01 C.ZEXT.W (RV64/128)
011 0 11 rs1′ /rd′ 0 01 C.ZEXT.D (RV128)
The entire RVC encoding space is 15.585 bits wide, and the remaining reserved encoding space in RVC
is 11.155 bits wide, not including space that is only reserved on RV32/RV64. This means that the
above encoding would use 0.0065% of the RVC encoding space, or 1.4% of the remaining reserved
RVC encoding space. Preliminary experiments have shown that NOT instructions alone make up
approximately 1% of bitmanipulation code size. [23]
When instruction encodings for instructions wider than 32 bits are defined, a Zbf (bit-field) extension
should be considered that defines the following bit-field extract and place instructions.
RV64 only:
bfxpw rd, rs1, rs2, src_off, src_len, dst_off, dst_len
bfxpuw rd, rs1, rs2, src_off, src_len, dst_off, dst_len
These instructions extract src_len bits at offset src_off from rs1, and place them in the field of
dst_len bits at offset dst_off in the value from rs2. bfxp sign-extends if dst_len > src_len,
and bfxpu zero-extends. When src_len == 0 then bfxp sets bits in the output and bfxpu
clears bits. dst_len == 0 encodes for dst_len == XLEN. When src_len+src_off > XLEN or
dst_len+dst_off > XLEN then the bit-field wraps around.
uint_xlen_t bfxp(uint_xlen_t rs1, uint_xlen_t rs2,
        unsigned src_off, unsigned src_len, unsigned dst_off, unsigned dst_len)
{
    assert(src_off < XLEN && src_len < XLEN && dst_off < XLEN && dst_len < XLEN);
    // the extract and place steps below are a sketch reconstructed from the prose above
    uint_xlen_t value = ror(rs1, src_off);                 // extract (with wrap-around)
    value = sra(sll(value, XLEN-src_len), XLEN-src_len);   // sign-extend (bfxpu zero-extends)
    if (src_len == 0) value = ~(uint_xlen_t)0;             // bfxp sets bits, bfxpu clears them
    uint_xlen_t mask = dst_len ? rol(slo(0, dst_len), dst_off) : ~(uint_xlen_t)0;
    return (rol(value, dst_off) & mask) | (rs2 & ~mask);   // place into rs2
}
A lot of bit manipulation code depends on “multiply with magic number”-tricks. Often those
tricks need the upper half of the 2 · XLEN product. Therefore decent performance for the MUL and
especially MULH[[S]U] instructions is important for fast bit manipulation code.
Bit manipulation code, even more than other code, requires a lot of “magic numbers”, bitmasks,
and other (usually large) constants. On some microarchitectures those can easily be loaded from a
nearby data section using load instructions. On other microarchitectures however this comes at a
high cost, and it is more efficient to load immediates using a sequence of instructions.
In addition to that, a 64 bit core may consider fusing the following sequences as well:
lui rd, imm
addi rd, rd, imm
pack rd, rd, rs2
And finally, a 64-bit core should fuse sequences with addiwu as well as addi, for loading unsigned
32-bit numbers that have their MSB set. This is often the case with masks in bit manipulation
code.
Preliminary experiments have shown that NOT instructions make up approximately 1% of bitma-
nipulation code size, more when looking at dynamic instruction count. [23]
Therefore it makes sense to fuse NOT instructions with other ALU instructions, if possible.
Pairs of left and right shifts are common operations for extracting a bit field.
To extract the contiguous bit field starting at pos with length len from rs (with pos > 0, len > 0,
and pos + len ≤ XLEN):
slli rd, rs, (XLEN-len-pos)
srli rd, rd, (XLEN-len)
Using srai instead of srli will sign-extend the extracted bit-field.
Similarly, placing a bit field with length len at the position pos:
slli rd, rs, (XLEN-len)
srli rd, rd, (XLEN-len-pos)
If possible, an implementation should fuse the following macro ops:
alu_op rd, rs1, rs2
srli rd, rd, imm
For generating masks, i.e. constants with one contiguous run of 1 bits, a sequence like the following
can be used, utilizing postfix fusion of right shifts:
sroi rd, zero, len
c.srli rd, (XLEN-len-pos)
This can be a useful sequence on RV64, where loading an arbitrary 64-bit constant would usually
require at least 96 bits (using c.ld).
RISC-V has dedicated instructions for branching on equal/not-equal. But C code such as the
following would require set-equal and set-not-equal instructions, similar to slt.
int is_equal = (a == b);
int is_noteq = (c != d);
Those can be implemented using the following fuse-able sequences:
sub rd, rs1, rs2
sltui rd, rd, 1
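The not-equal case can use the base-ISA snez idiom instead of sltui:

sub rd, rs1, rs2
sltu rd, zero, rd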
Architectures with support for ternary operations may want to support fusing two ALU operations.
alu_op rd, ...
alu_op rd, rd, ...
This would be a postfix-fusion pattern, extending the postfix shift-right fusion described in the
previous section.
Candidates for this kind of postfix fusion would be simple ALU operations, specifically
AND/OR/XOR/ADD/SUB and ANDI/ORI/XORI/ADDI.
A C header file <rvintrin.h> is provided that contains assembler templates for directly creating
assembler instructions from C code.
The header defines _rv_*(...) functions that operate on the long data type, _rv32_*(...) func-
tions that operate on the int32_t data type, and _rv64_*(...) functions that operate on the
int64_t data type. The _rv64_*(...) functions are only available on RV64. See table 2.4 for a
complete list of intrinsics defined in <rvintrin.h>.
Usage example:
#include <rvintrin.h>
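/* illustrative only: find the index of the n-th set bit of a 32-bit value
   using the _rv32_bdep and _rv32_ctz intrinsics described above */
int find_nth_set_bit(unsigned int value, int n)
{
    return _rv32_ctz(_rv32_bdep(1u << n, value));
}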
                     RV32                        RV64
Instruction    _rv_*    _rv32_*         _rv_*    _rv32_*    _rv64_*
clz ✔ ✔ ✔ ✔ ✔
ctz ✔ ✔ ✔ ✔ ✔
pcnt ✔ ✔ ✔ ✔ ✔
pack ✔ ✔ ✔ ✔ ✔
min ✔ ✔ ✔ ✔ ✔
minu ✔ ✔ ✔ ✔ ✔
max ✔ ✔ ✔ ✔ ✔
maxu ✔ ✔ ✔ ✔ ✔
sbset ✔ ✔ ✔ ✔ ✔
sbclr ✔ ✔ ✔ ✔ ✔
sbinv ✔ ✔ ✔ ✔ ✔
sbext ✔ ✔ ✔ ✔ ✔
sll ✔ ✔ ✔ ✔ ✔
srl ✔ ✔ ✔ ✔ ✔
sra ✔ ✔ ✔ ✔ ✔
slo ✔ ✔ ✔ ✔ ✔
sro ✔ ✔ ✔ ✔ ✔
rol ✔ ✔ ✔ ✔ ✔
ror ✔ ✔ ✔ ✔ ✔
grev ✔ ✔ ✔ ✔ ✔
shfl ✔ ✔ ✔ ✔ ✔
unshfl ✔ ✔ ✔ ✔ ✔
bext ✔ ✔ ✔ ✔ ✔
bdep ✔ ✔ ✔ ✔ ✔
clmul ✔ ✔ ✔ ✔ ✔
clmulh ✔ ✔ ✔ ✔ ✔
clmulr ✔ ✔ ✔ ✔ ✔
bmatflip ✔ ✔
bmator ✔ ✔
bmatxor ✔ ✔
fsl ✔ ✔ ✔ ✔ ✔
fsr ✔ ✔ ✔ ✔ ✔
cmix ✔ ✔
cmov ✔ ✔
crc32.b ✔ ✔
crc32.h ✔ ✔
crc32.w ✔ ✔
crc32.d ✔
crc32c.b ✔ ✔
crc32c.h ✔ ✔
crc32c.w ✔ ✔
crc32c.d ✔
Chapter 3
Evaluation
This chapter contains a collection of short code snippets and algorithms using the Bitmanip exten-
sion for evaluation purposes. For the sake of simplicity we assume RV32 for most examples in this
chapter.
Extracting a bit field of length len at position pos can be done using two shift operations.
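A sketch using the shift pair discussed in the macro-op fusion section (len and pos as assembly-time constants):

slli a0, a0, (XLEN-len-pos)
srli a0, a0, (XLEN-len)   ; use srai instead of srli to sign-extend the field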
The parity of a word (xor of all bits) is the LSB of the population count.
pcnt a0, a0
andi a0, a0, 1
Rank and select are fundamental operations in succinct data structures [17].
select(a0, a1) returns the position of the a1th set bit in a0. It can be implemented efficiently
using bdep and ctz:
select:
li a2, 1
sll a1, a2, a1
bdep a0, a1, a0
ctz a0, a0
ret
rank(a0, a1) returns the number of set bits in a0 up to and including position a1.
rank:
not a1, a1
sll a0, a0, a1
pcnt a0, a0
ret
The following code packs the lower 8 bits from a0, a1, a2, a3 into a 32-bit word returned in a0,
ignoring other bits in the input values.
pack a0, a0, a1
pack a1, a2, a3
shfl a0, a0, 8
shfl a1, a1, 8
pack a0, a0, a1
This replaces either 4 store-byte instructions followed by one load-word instruction, or something
like the following sequence.
andi a0, a0, 255
andi a1, a1, 255
andi a2, a2, 255
pack a0, a0, a2
pack a1, a1, a3
slli a1, a1, 8
or a0, a0, a1
The “fill right” or “fold right” operation is a pattern commonly used in bit manipulation code. [8]
uint64_t rfill(uint64_t x)
{
x |= x >> 1; // SRLI, OR
x |= x >> 2; // SRLI, OR
x |= x >> 4; // SRLI, OR
x |= x >> 8; // SRLI, OR
x |= x >> 16; // SRLI, OR
x |= x >> 32; // SRLI, OR
return x;
}
With clz it can be implemented in only 4 instructions. Notice the handling of the case where x=0
using sltiu+addi.
uint64_t rfill_clz(uint64_t x)
{
uint64_t t;
t = clz(x); // CLZ
x = (!x)-1; // SLTIU, ADDI
x = x >> (t & 63); // SRL
return x;
}
Alternatively, a Trailing Bit Manipulation (TBM) code pattern can be used together with rev to
implement this function in 4 instructions:
uint64_t rfill_rev(uint64_t x)
{
x = rev(x); // GREVI
x = x | ~(x - 1); // ADDI, ORN
x = rev(x); // GREVI
return x;
}
Finally, there is another implementation in 4 instructions using BMATOR, if we do not count the
extra instructions for loading utility matrices.
uint64_t rfill_bmat(uint64_t x)
{
    uint64_t m0, m1, m2, t;
    m0 = 0xFF7F3F1F0F070301LL; // LD
    m1 = bmatflip(m0 << 8);    // SLLI, BMATFLIP
    m2 = -1LL;                 // ADDI
    // the four operations below reconstruct the computation described in the text
    t = bmator(x, m0);         // BMATOR: fill right within each byte
    x = bmator(m1, x);         // BMATOR: select all rows above each row
    x = bmator(x, m2);         // BMATOR: saturate those rows to all-ones
    x |= t;                    // OR
    return x;
}
A funnel shift takes two XLEN registers, concatenates them to a 2 × XLEN word, shifts that by a
certain amount, then returns the lower half of the result for a right shift and the upper half of the
result for a left shift.
For example, the following functions implement rotate-shift operations for bigints made from n
XLEN words.
void bigint_rol(uint_xlen_t data[], int n, int shamt)
{
    if (n <= 0)
        return;
    uint_xlen_t buffer = data[n-1]; // loop body reconstructed from the funnel-shift description
    for (int i = n-1; i > 0; i--)
        data[i] = fsl(data[i], shamt, data[i-1]);
    data[0] = fsl(data[0], shamt, buffer);
}
The following function parses n 27-bit words from a packed array of XLEN words:
void parse_27bit(uint_xlen_t *idata, uint_xlen_t *odata, int n)
{
    // signature and locals reconstructed to match the assembler version below
    uint_xlen_t lower = 0, upper = 0;
    int reserve = 0;

    while (n--) {
if (reserve < 27) {
uint_xlen_t buf = *(idata++);
lower |= sll(buf, reserve);
upper = reserve ? srl(buf, -reserve) : 0;
reserve += XLEN;
}
*(odata++) = lower & ((1 << 27)-1);
lower = fsr(lower, 27, upper);
upper = srl(upper, 27);
reserve -= 27;
}
}
And here the same thing in RISC-V assembler:
parse_27bit:
li t1, 0 ; lower
li t2, 0 ; upper
li t3, 0 ; reserve
li t4, 27 ; shamt
slo t5, zero, t4 ; mask
beqz a2, endloop ; while (n--)
loop:
addi a2, a2, -1
bge t3, t4, output ; if (reserve < 27)
lw t6, 0(a0) ; buf = *(idata++)
addi a0, a0, 4
sll a3, t6, t3 ; lower |= sll(buf, reserve)
or t1, t1, a3
sub a3, zero, t3 ; upper = reserve ? srl(buf, -reserve) : 0
srl a3, t6, a3
cmov t2, t3, a3, zero
addi t3, t3, 32 ; reserve += XLEN;
output:
and t6, t1, t5 ; *(odata++) = lower & ((1 << 27)-1)
sw t6, 0(a1)
addi a1, a1, 4
fsr t1, t1, t2, t4 ; lower = fsr(lower, 27, upper)
srl t2, t2, t4 ; upper = srl(upper, 27)
sub t3, t3, t4 ; reserve -= 27
bnez a2, loop ; while (n--)
endloop:
ret
A loop iteration without fetch is 9 instructions long, and a loop iteration with fetch is 17 instructions
long.
Without ternary operators that would be 13 instructions and 22 instructions, i.e. assuming one
cycle per instruction, that function would be about 30% slower without ternary instructions.
A fixed-point multiply is simply an integer multiply, followed by a right shift. If the entire dynamic
range of XLEN bits should be useable for the factors, then the product before shift must be 2*XLEN
wide. Therefore mul+mulh is needed for the multiplication, and funnel shift instructions can help
with the final right shift. For fixed-point numbers with N fraction bits:
mul_fracN:
mulh a2, a0, a1
mul a0, a0, a1
fsri a0, a0, a2, N
ret
This section lists code snippets for computing arbitrary bit permutations that are defined by data
(as opposed to bit permutations that are known at compile time and can likely be compiled into
shift-and-mask operations and/or a few instances of bext/bdep).
The following macro performs a stage-N butterfly operation on the word in a0 using the mask in
a1.
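A sketch of such a macro: grevi swaps all bits that are 2^N apart, and cmix keeps the swap only where the mask in a1 is set:

grevi a2, a0, (1 << N)
cmix a0, a1, a2, a0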
The bitmask in a1 must be preformatted correctly for the selected butterfly stage. A butterfly
operation only has a XLEN/2 wide control word. The following macros format the mask assuming
those XLEN/2 bits in the lower half of a1 on entry (preformatted mask in a1 on exit):
bfly_msk_0:
zip a1, a1
slli a2, a1, 1
or a1, a1, a2
bfly_msk_1:
zip2 a1, a1
slli a2, a1, 2
or a1, a1, a2
bfly_msk_2:
zip4 a1, a1
slli a2, a1, 4
or a1, a1, a2
...
A sequence of 2 · log2 (XLEN) − 1 butterfly operations can perform any arbitrary bit permutation
(Beneš network):
butterfly(LOG2_XLEN-1)
butterfly(LOG2_XLEN-2)
...
butterfly(0)
...
butterfly(LOG2_XLEN-2)
butterfly(LOG2_XLEN-1)
Many permutations arising from real-world applications can be implemented using shorter se-
quences. For example, any sheep-and-goats operation (SAG, see section 3.3.4) with either the
sheep or the goats bit reversed can be implemented in log2 (XLEN) butterfly operations.
Reversing a permutation implemented using butterfly operations is as simple as reversing the order
of butterfly operations.
The omega operation is a butterfly(0) operation preceded by a zip operation, and the flip operation
is a butterfly(0) operation followed by an unzip operation. For example, the omega operation:
zip a0, a0
grevi a2, a0, 1
cmix a0, a1, a2, a0
A sequence of log2 (XLEN) omega operations followed by log2 (XLEN) flip operations can implement
any arbitrary 32 bit permutation.
As for butterfly networks, permutations arising from real-world applications can often be imple-
mented using a shorter sequence.
Another way of implementing arbitrary 32 bit permutations is using a baseline network followed
by an inverse baseline network.
A baseline network is a sequence of log2 (XLEN) butterfly(0) operations interleaved with unzip
operations. For example, a 32-bit baseline network:
butterfly(0)
unzip
butterfly(0)
unzip.h
butterfly(0)
unzip.b
butterfly(0)
unzip.n
butterfly(0)
An inverse baseline network is a sequence of log2 (XLEN) butterfly(0) operations interleaved with
zip operations. The order is opposite to the order in a baseline network. For example, a 32-bit
inverse baseline network:
butterfly(0)
zip.n
butterfly(0)
zip.b
butterfly(0)
zip.h
butterfly(0)
zip
butterfly(0)
A baseline network followed by an inverse baseline network can implement any arbitrary bit per-
mutation.
The Sheep-and-goats (SAG) operation is a common operation for bit permutations. It moves all
the bits selected by a mask (goats) to the LSB end of the word and all the remaining bits (sheep)
to the MSB end of the word, without changing the order of sheep or goats.
The SAG operation can easily be performed using bext (data in a0 and mask in a1):
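A sketch: compact the selected bits, compact the remaining bits under the inverted mask, and rotate them up to the MSB end:

bext a2, a0, a1    ; selected bits (goats), LSB-justified
not a1, a1
bext a0, a0, a1    ; remaining bits (sheep), LSB-justified
pcnt a1, a1        ; number of sheep
ror a0, a0, a1     ; rotate sheep up to the MSB end
or a0, a0, a2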
Any arbitrary bit permutation can be implemented in log2 (XLEN) SAG operations.
Hacker's Delight describes an optimized standard C implementation of the SAG operation.
Their algorithm takes 254 instructions (for 32 bit) or 340 instructions (for 64 bit) on their reference
RISC instruction set. [9, p. 152f, 162f]
bmat[x]or performs a permutation of bits within each byte when used with a permutation matrix
in rs2, and performs a permutation of bytes when used with a permutation matrix in rs1.
Bitboards are 64-bit bitmasks that are used to represent part of the game state in chess engines
(and other board game AIs). The bits in the bitmask correspond to squares on a 8 × 8 chess board:
56 57 58 59 60 61 62 63
48 49 50 51 52 53 54 55
40 41 42 43 44 45 46 47
32 33 34 35 36 37 38 39
24 25 26 27 28 29 30 31
16 17 18 19 20 21 22 23
8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7
Many bitboard operations are simple, straightforward operations such as bitwise AND, but mirroring
and rotating bitboards can take up to 20 instructions on x86.
Flip horizontal:
63 62 61 60 59 58 57 56 RISC-V Bitmanip:
55 54 53 52 51 50 49 48 rev.b
47 46 45 44 43 42 41 40
39 38 37 36 35 34 33 32
31 30 29 28 27 26 25 24 x86:
23 22 21 20 19 18 17 16 13 operations
15 14 13 12 11 10 9 8
7 6 5 4 3 2 1 0
Flip vertical:
0 1 2 3 4 5 6 7 RISC-V Bitmanip:
8 9 10 11 12 13 14 15 rev8
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 x86:
40 41 42 43 44 45 46 47 bswap
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
Rotate 180:
7 6 5 4 3 2 1 0 RISC-V Bitmanip:
15 14 13 12 11 10 9 8 rev
23 22 21 20 19 18 17 16
31 30 29 28 27 26 25 24
39 38 37 36 35 34 33 32 x86:
47 46 45 44 43 42 41 40 14 operations
55 54 53 52 51 50 49 48
63 62 61 60 59 58 57 56
Transpose (RISC-V Bitmanip: zip, zip, zip; x86: 18 operations):
 7 15 23 31 39 47 55 63
 6 14 22 30 38 46 54 62
 5 13 21 29 37 45 53 61
 4 12 20 28 36 44 52 60
 3 11 19 27 35 43 51 59
 2 10 18 26 34 42 50 58
 1  9 17 25 33 41 49 57
 0  8 16 24 32 40 48 56
A rotation is simply the composition of a flip operation and a transpose operation. This takes 19 operations on x86 [2]. With Bitmanip, the rotation takes only 4 instructions:
rotate_bitboard:
  rev8 a0, a0
  zip a0, a0
  zip a0, a0
  zip a0, a0
3.4.3 Explanation
The bit indices for a 64-bit word are 6 bits wide. Let i[5:0] be the index of a bit in the input, and let i′[5:0] be the index of the same bit after the permutation.
As an example, a rotate left by N can be expressed in this notation as i′[5:0] = i[5:0] + N (mod 64).
A GZIP operation corresponds to a left rotate by one position of any contiguous region of i[5:0]. For example, zip is a left rotate of the entire bit index: i′[5:0] = {i[4:0], i[5]}.
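To make this index view concrete, here is a plain-C model of 32-bit zip built directly from the index rotation (on RV32 the index is only 5 bits wide, so the rotation is i′[4:0] = {i[3:0], i[4]}; the helper name is this sketch's own, and the same construction extends to 6 index bits on RV64):

#include <stdint.h>

uint32_t zip32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        r |= ((x >> i) & UINT32_C(1)) << (((i << 1) & 31) | (i >> 4)); // i -> {i[3:0], i[4]}
    return r;
}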
In a bitboard, i[2:0] corresponds to the X coordinate of a board position, and i[5:3] corresponds
to the Y coordinate.
Therefore flipping the board horizontally is the same as negating bits i[2:0], which is the operation
performed by grevi rd, rs, 7 (rev.b).
Likewise flipping the board vertically is done by grevi rd, rs, 56 (rev8).
Finally, transposing corresponds to swapping the lower and upper halves of i[5:0], i.e. rotating i[5:0] by 3 positions. This can easily be done by rotating the entire i[5:0] by one bit position (zip) three times.
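In index-space terms, grev with control c moves the bit at index i to index i XOR c, so the horizontal flip above is simply i XOR 7. A tiny C illustration (the helper name is this sketch's own):

#include <stdint.h>

uint64_t flip_horizontal(uint64_t board)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        r |= ((board >> i) & 1) << (i ^ 7); // grevi rd, rs, 7 (rev.b) in index space
    return r;
}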
Let's define a bitcube as a 4 × 4 × 4 cube with x = i[1:0], y = i[3:2], and z = i[5:4]. Using the same methods as described above we can easily rotate a bitcube by 90° around the X-, Y-, and Z-axes.
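For example, a 90° rotation around the Z-axis maps (x, y, z) to (3 − y, x, z), so the new bit index is {z, x, ~y}. A plain-C sketch of that index permutation (the helper name and the chosen rotation direction are assumptions of this example):

#include <stdint.h>

uint64_t bitcube_rotate_z(uint64_t cube)
{
    uint64_t out = 0;
    for (int i = 0; i < 64; i++) {
        int x = i & 3, y = (i >> 2) & 3, z = (i >> 4) & 3;
        int j = (3 - y) | (x << 2) | (z << 4); // destination index {z, x, ~y}
        out |= ((cube >> i) & 1) << j;
    }
    return out;
}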
Xorshift RNGs are a class of fast RNGs for different bit widths. There are 648 Xorshift RNGs for 32 bits; the following is the one that the author of the original Xorshift paper recommends. [16, p. 4]
uint32_t xorshift32(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}
This function has of course been designed and selected so that it is efficient even without special bit-manipulation instructions. So let's look at the inverse instead. First, the naïve way of inverting this function:
uint32_t xorshift32_inv(uint32_t x)
{
    uint32_t t;

    // Invert x ^= x << 5 (the inverse XORs in all multiples-of-5 shifts up to 30).
    t = x ^ (x << 5);
    t = x ^ (t << 5);
    t = x ^ (t << 5);
    t = x ^ (t << 5);
    t = x ^ (t << 5);
    x = x ^ (t << 5);

    // x ^= x >> 17 is its own inverse (because 17 >= XLEN/2).
    x = x ^ (x >> 17);

    // Invert x ^= x << 13.
    t = x ^ (x << 13);
    x = x ^ (t << 13);

    return x;
}
This translates to 18 RISC-V instructions, not including the function call overhead.
Obviously the C expression x ^ (x >> 17) is already its own inverse (because 17 ≥ XLEN/2) and therefore already has an efficient inverse. But the two other blocks can easily be implemented using a single clmul instruction each:
uint32_t xorshift32_inv(uint32_t x)
{
    x = clmul(x, 0x42108421);
    x = x ^ (x >> 17);
    x = clmul(x, 0x04002001);
    return x;
}
This amounts to 8 RISC-V instructions, including 4 instructions for loading the constants, but not including the function call overhead.
An optimizing compiler could easily generate the clmul instructions and the magic constants from the C code of the naïve implementation. (0x04002001 = (1 << 2*13) | (1 << 13) | 1 and 0x42108421 = (1 << 6*5) | (1 << 5*5) | ... | (1 << 5) | 1)
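As a quick sanity check of these constants (clmul32 below is a plain-C software model of the carry-less multiply truncated to 32 bits, and the test values are arbitrary):

#include <assert.h>
#include <stdint.h>

static uint32_t clmul32(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

int main(void)
{
    // Spot-check that the magic constants invert the two xorshift steps.
    for (uint32_t x = 1; x < 3000000; x += 977) {
        assert(clmul32(x ^ (x << 5), 0x42108421) == x);
        assert(clmul32(x ^ (x << 13), 0x04002001) == x);
    }
    return 0;
}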
The obvious remaining question is “if clmul(x, 0x42108421) is the inverse of x ^ (x << 5),
what’s the inverse of x ^ (x >> 5)?” It’s clmulr(x, 0x84210842), where 0x84210842 is the
bit-reversal of 0x42108421.
A special case of xorshift is x ^ (x >> 1), which is a Gray encoder. The corresponding Gray decoder is clmulr(x, 0xffffffff).
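This is easy to check against a plain-C model of clmulr consistent with its use above (the helper name is this sketch's own):

#include <assert.h>
#include <stdint.h>

// Model of clmulr: bit i of the result is the XOR over all j + k == i + 31
// of rs1[j] & rs2[k].
static uint32_t clmulr32(uint32_t rs1, uint32_t rs2)
{
    uint32_t x = 0;
    for (int i = 0; i < 32; i++)
        if ((rs2 >> i) & 1)
            x ^= rs1 >> (31 - i);
    return x;
}

int main(void)
{
    for (uint32_t x = 0; x < 1000000; x += 373) {
        uint32_t gray = x ^ (x >> 1);            // Gray encoder
        assert(clmulr32(gray, 0xffffffff) == x); // Gray decoder via clmulr
    }
    return 0;
}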
There are special instructions for performing CRCs using the two most widespread 32-bit CRC
polynomials, CRC-32 and CRC-32C.
CRCs with other polynomials can be computed efficiently using CLMUL. The following examples use CRC32Q.
The easiest way of implementing CRC32Q with clmul is using a Barrett reduction. On RV32:
uint32_t crc32q_simple(const uint32_t *data, int length)
{
    uint32_t P   = 0x814141AB; // CRC polynomial (implicit x^32)
    uint32_t mu  = 0xFEFF7F62; // x^64 divided by CRC polynomial
    uint32_t mu1 = 0xFF7FBFB1; // "mu" with leading 1, shifted right by 1 bit

    uint32_t crc = 0;

    for (int i = 0; i < length; i++) {
        crc ^= rev8(data[i]);   // fold in the next word, MSB-first (assumes little-endian words)
        crc = clmulr(crc, mu1); // Barrett step 1: q = upper 32 bits of crc * mu
        crc = clmul(crc, P);    // Barrett step 2: crc = lower 32 bits of q * P
    }

    return crc;
}
The following python code calculates the value of mu for a given CRC polynomial (polydiv is the quotient-returning counterpart of the polymod function shown further below):

P = 0x1814141AB
print("0x%X" % (polydiv(1 << 64, P)))  # prints 0x1FEFF7F62
A more efficient method would be the following, which processes 64 bits at a time (RV64):
t1 = clmulh(a0, k2);
t2 = clmul(a0, k2);
t2 = clmul(a0, k3);
a1 = a1 ^ t2;
// Barrett Reduction
return a0;
}
The main idea is to transform an array of arbitrary length into an array with the same CRC that is only two 64-bit elements long. (That is the "main loop" portion of the above code.)
Then we further reduce it to just 64 bits. And then we use a Barrett reduction to get the final 32-bit result.
The following python code can be used to calculate the “magic constants” k1, k2, and k3:
def polymod(dividend, divisor):
    quotient = 0
    while dividend.bit_length() >= divisor.bit_length():
        i = dividend.bit_length() - divisor.bit_length()
        dividend = dividend ^ (divisor << i)
        quotient |= 1 << i
    return dividend
The following code snippets decode and sign-extend the immediate from RISC-V S-type, B-type, J-type, and CJ-type instructions. They are nice "nothing up my sleeve" examples of real-world bit permutations.
S-type:  insn[31:25] = imm[11:5],  insn[11:7] = imm[4:0]
B-type:  insn[31:25] = imm[12|10:5],  insn[11:7] = imm[4:1|11]
J-type:  insn[31:12] = imm[20|10:1|11|19:12]
CJ-type: insn[12:2]  = imm[11|4|9:8|10|6|7|3:1|5]
decode_s:
  li t0, 0xfe000f80
  bext a0, a0, t0
  c.slli a0, 20
  c.srai a0, 20
  ret

decode_b:
  li t0, 0xeaa800aa
  rori a0, a0, 8
  grevi a0, a0, 8
  shfli a0, a0, 7
  bext a0, a0, t0
  c.slli a0, 20
  c.srai a0, 19
  ret
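For reference, plain-C decoders for the S-type and B-type immediates (the helper names are this sketch's own; they follow the standard encodings listed above and use the usual arithmetic-shift sign-extension idiom):

#include <stdint.h>

int32_t decode_s_imm(uint32_t insn)
{
    uint32_t imm = ((insn >> 25) << 5)          // imm[11:5] from insn[31:25]
                 | ((insn >> 7) & 0x1f);        // imm[4:0]  from insn[11:7]
    return (int32_t)(imm << 20) >> 20;          // sign-extend from 12 bits
}

int32_t decode_b_imm(uint32_t insn)
{
    uint32_t imm = (((insn >> 31) & 1) << 12)   // imm[12]   from insn[31]
                 | (((insn >> 7) & 1) << 11)    // imm[11]   from insn[7]
                 | (((insn >> 25) & 0x3f) << 5) // imm[10:5] from insn[30:25]
                 | (((insn >> 8) & 0xf) << 1);  // imm[4:1]  from insn[11:8]
    return (int32_t)(imm << 19) >> 19;          // sign-extend from 13 bits
}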
Bibliography
[3] MC88110 Second Generation RISC Microprocessor User’s Manual. Motorola Inc., 1991.
[4] Cray Assembly Language (CAL) for Cray X1 Systems Reference Manual. Cray Inc., 2003.
Version 1.1, S-2314-50.
[5] Cray XMT Principles of Operation. Cray Inc., 2009. Version 1.3, S-2473-13.
[6] SPARC T3 Supplement to the UltraSPARC Architecture 2007 Specification. Oracle, 2010.
[7] TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide (Rev. J). Texas Instruments, 2010.
[11] Vinodh Gopal, Erdinc Ozturk, Jim Guilford, Gil Wolrich, Wajdi Feghali, Martin Dixon, and Deniz Karakoyunlu. Fast CRC computation for generic polynomials using PCLMULQDQ instruction. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf, 2009. Intel White Paper, Accessed: 2018-10-23.
[12] James Hughes. Using carry-less multiplication (clmul) to implement erasure code. Patent
US13866453, 2013.
[13] Donald E. Knuth. The Art of Computer Programming, Volume 4A. Addison-Wesley, 2011.
[14] Geoff Langdale and Daniel Lemire. Parsing gigabytes of JSON per second. CoRR,
abs/1902.08318, 2019.
[15] Daniel Lemire and Owen Kaser. Faster 64-bit universal hashing using carry-less multiplications.
CoRR, abs/1503.03465, 2015.
[16] George Marsaglia. Xorshift RNGs. Journal of Statistical Software, Articles, 8(14):1–6, 2003.
[17] Prashant Pandey, Michael A. Bender, and Rob Johnson. A fast x86 implementation of select.
CoRR, abs/1706.00990, 2017.
[18] Henry S. Warren. Hacker’s Delight. Addison-Wesley Professional, 2nd edition, 2012.
[22] Clifford Wolf. Reference implementations of various CRCs using carry-less multiply. http://svn.clifford.at/handicraft/2018/clmulcrc/. Accessed: 2018-11-06.
[23] Clifford Wolf. A simple synthetic compiler benchmark for bit manipulation operations. http://svn.clifford.at/handicraft/2017/bitcode/. Accessed: 2017-04-30.