Support for bit packing and unpacking operations (PEXT, PDEP) #93

Open
mchristianl opened this issue Jan 11, 2022 · 2 comments

Comments

@mchristianl

mchristianl commented Jan 11, 2022

Hi,

I have stumbled upon the parallel bit deposit and extract operations pext and pdep of the x86 bit manipulation instruction set extensions (BMI2). They might not be directly related to SIMD, but since these instructions are quite useful as horizontal operations, they seem a good fit for SIMD.jl.

Is there (already) any way to use these assembly instructions from within Julia?

Example:

bitunpack(0b11001010,true)
# <8 x Bool>[1, 1, 0, 0, 1, 0, 1, 0]
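
For context, pdep scatters the low bits of its first argument to the positions selected by its mask argument, and pext gathers the masked bits back into the low bits; the non-reversed bitunpack below is essentially pdep with the mask 0x0101010101010101 (viewing the UInt64 result as 8 bytes). A portable scalar sketch of these semantics (reference implementations of my own, not the hardware instructions):

function pext_ref(x::T, mask::T) where {T<:Unsigned}
  r = zero(T); k = 0
  for i in 0:8*sizeof(T)-1
    if (mask >> i) & one(T) == one(T)  # bit i is selected by the mask
      r |= ((x >> i) & one(T)) << k    # gather it into result bit k
      k += 1
    end
  end
  return r
end

function pdep_ref(x::T, mask::T) where {T<:Unsigned}
  r = zero(T); k = 0
  for i in 0:8*sizeof(T)-1
    if (mask >> i) & one(T) == one(T)  # bit i is selected by the mask
      r |= ((x >> k) & one(T)) << i    # scatter source bit k to position i
      k += 1
    end
  end
  return r
end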

In the special case of Vec{N,Bool} this can be achieved with some bit-twiddling: a single integer multiplication, which acts as a batch of shift-and-adds that moves every bit into place at once, plus only scalar bit operations:

# a slow function to compute periodic bit patterns for the use as constants
interleavemask(::Type{T},period,amount,shift,start=1,stop=sizeof(T)*8) where T =
  reduce(|,T(((n + (shift > 0 ? period-shift : -shift)) % period) < amount) << n for n in start-1:stop-1)
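
The hex constants noted in the comments below can be reproduced with it, for example:

interleavemask(UInt64, 9, 1, 0)         # 0x8040201008040201
interleavemask(UInt64, 7, 1, 0, 2, 63)  # 0x0102040810204080
interleavemask(UInt64, 8, 1, 0)         # 0x0101010101010101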

# pack a vector of 8 bits into a single byte
@inline function bitpack(xs::Vec{8,Bool},reverse::Bool=false)::UInt8
  x = reinterpret(UInt64,xs)
  m = (reverse ? interleavemask(UInt64,8+1,1,0)         # 0x8040201008040201
               : interleavemask(UInt64,8-1,1,0,2,64-1)) # 0x0102040810204080
  return unsafe_trunc(UInt8,(x*m) >> (64-8))
end

# unpack a single byte into a vector of 8 bits
@inline function bitunpack(x::UInt8,reverse::Bool=false)::Vec{8,Bool}
  if reverse
    m1 = interleavemask(UInt64,9,1,0) # 0x8040201008040201
    m2 = interleavemask(UInt64,8,1,7) # 0x8080808080808080
    return reinterpret(Vec{8,Bool},((x*m1) & m2) >> (8-1))
  else
    m1 = interleavemask(UInt64,7,1,0) # 0x8102040810204081
    m2 = interleavemask(UInt64,8,1,0) # 0x0101010101010101
    x1 = x & (~0b1) # remove first bit
    return reinterpret(Vec{8,Bool},((x1*m1 | x) & m2))
  end
end

for x in typemin(UInt8):typemax(UInt8)
  @assert bitpack(bitunpack(x,false),false) == x
  @assert bitpack(bitunpack(x,true),true) == x
end

This...

function bitpack2(xs::Vec{8,Bool})
  x = reinterpret(UInt64,xs)
  m = interleavemask(UInt64,7,1,0,2,63) # 0x0102040810204080
  return unsafe_trunc(UInt8,(x*m) >> (64-8))
end
# bitpack2
#   movabsq $72624976668147840, %rax    # imm = 0x102040810204080
#   imulq   (%rdi), %rax
#   shrq    $56, %rax
#   retq

...compares favorably against the alternatives that use a hierarchical reduction; either with vector operations...

function bitpack3(x::Vec{8,Bool})
  s = Vec((UInt8(n) for n in 0:7)...)
  x0 = reinterpret(Vec{8,UInt8},x)
  x1 = x0 << s
  reduce(|,x1)
end
# bitpack3
#   vmovq   (%rdi), %xmm0               # xmm0 = mem[0],zero
#   vpmovzxbw       %xmm0, %ymm0        # ymm0 = .......
#   movabsq $.rodata.cst32, %rax
#   vpmullw (%rax), %ymm0, %ymm0
#   movabsq $139838074081344, %rax      # imm = 0x7F2E96BB5040
#   vpand   (%rax), %ymm0, %ymm0
#   vextracti128    $1, %ymm0, %xmm1
#   vpackuswb       %xmm1, %xmm0, %xmm0
#   vpshufd $85, %xmm0, %xmm1           # xmm1 = xmm0[1,1,1,1]
#   vpor    %xmm1, %xmm0, %xmm0
#   vpsrld  $16, %xmm0, %xmm1
#   vpor    %xmm1, %xmm0, %xmm0
#   vpsrlw  $8, %xmm0, %xmm1
#   vpor    %xmm1, %xmm0, %xmm0
#   vmovd   %xmm0, %eax
#   vzeroupper
#   retq

...or without...

# https://fgiesen.wordpress.com/2009/12/13/decoding-morton-codes/
function bitpack1(x::Vec{8,Bool})
  x0 = reinterpret(UInt64,x)
  x1 = (x0 | (x0 >> ( 8-1))) & interleavemask(UInt64,16,2,0) # 0x0003000300030003
  x2 = (x1 | (x1 >> (16-2))) & interleavemask(UInt64,32,4,0) # 0x0000000f0000000f
  x3 = (x2 | (x2 >> (32-4))) & interleavemask(UInt64,64,8,0) # 0x00000000000000ff
  unsafe_trunc(UInt8,x3)
end
# bitpack1
#   movq    (%rdi), %rax
#   movq    %rax, %rcx
#   shrq    $7, %rcx
#   orq     %rax, %rcx
#   movabsq $844437815230467, %rax      # imm = 0x3000300030003
#   andq    %rcx, %rax
#   movq    %rax, %rcx
#   shrq    $14, %rcx
#   orq     %rax, %rcx
#   movq    %rcx, %rax
#   shrq    $28, %rax
#   orl     %ecx, %eax
#   ret
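
All three variants compute the same non-reversed packing, which can be checked exhaustively against bitunpack:

for x in typemin(UInt8):typemax(UInt8)
  v = bitunpack(x)
  @assert bitpack1(v) == bitpack2(v) == bitpack3(v) == x
end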

Is there a Julia package that already provides such functionality?

kind regards,
Christian

PS: The above are several variants I came up with to illustrate this on Vec{8,Bool}.

@KristofferC
Collaborator

KristofferC commented Jan 11, 2022

> They might not be directly related to SIMD, but since these instructions are quite useful as horizontal operations, they seem a good fit for SIMD.jl.

Right now, SIMD.jl provides an interface to the LLVM vector intrinsics (in https://github.com/eschnett/SIMD.jl/blob/master/src/LLVM_intrinsics.jl) and a Vec type with standard Julia operations built on top of that (in https://github.com/eschnett/SIMD.jl/blob/master/src/simdvec.jl). There is no architecture-specific functionality implemented; it is up to LLVM to turn the IR into the best instructions for the target architecture.
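
For illustration, the documented interface stays at the level of generic vector operations, e.g.:

using SIMD
v = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0))
sum(v + v)  # 20.0 -- lowered to generic LLVM vector IR; instruction selection is LLVM's job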

The advantage of this is that it keeps the package at a reasonable size (there are many, many architecture-specific intrinsics: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html), it means the same functionality is available to everyone, and it avoids non-trivial things like runtime architecture detection.

Compilers are usually quite good at pattern matching operations to intrinsics if they are available. Does this not happen here?
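
One way to check is to look at the generated native code, e.g.:

using InteractiveUtils  # for @code_native
v = bitunpack(0xca)
@code_native bitpack2(v)  # look for pext/pdep in the output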

@mchristianl
Copy link
Author

mchristianl commented Jan 11, 2022

OK. According to llvm.org, the target-independent vector intrinsics relevant for SIMD.jl are:

llvm.vector.reduce.add.*
llvm.vector.reduce.fadd.*
llvm.vector.reduce.mul.*
llvm.vector.reduce.fmul.*
llvm.vector.reduce.and.*
llvm.vector.reduce.or.*
llvm.vector.reduce.xor.*
llvm.vector.reduce.smax.*
llvm.vector.reduce.smin.*
llvm.vector.reduce.umax.*
llvm.vector.reduce.umin.*
llvm.vector.reduce.fmax.*
llvm.vector.reduce.fmin.*
llvm.experimental.vector.insert
llvm.experimental.vector.extract
llvm.experimental.vector.reverse
llvm.experimental.vector.splice
llvm.experimental.stepvector

I found a Discourse discussion from 2017 describing how to get at these instructions:

pdep(x::UInt32, y::UInt32) = ccall("llvm.x86.bmi.pdep.32", llvmcall, UInt32, (UInt32, UInt32), x, y)
pdep(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pdep.64", llvmcall, UInt64, (UInt64, UInt64), x, y)
pext(x::UInt32, y::UInt32) = ccall("llvm.x86.bmi.pext.32", llvmcall, UInt32, (UInt32, UInt32), x, y)
pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)
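
These require a CPU with BMI2; assuming one, a quick check of the semantics:

pext(UInt32(0b11001010), UInt32(0b11110000))  # 0x0000000c -- gathers bits 4..7 into the low bits
pdep(UInt32(0b00001100), UInt32(0b11110000))  # 0x000000c0 -- scatters the low bits into bits 4..7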

but since these live in llvm.x86.*, they are not within the scope of this package.

There are generic bit manipulation intrinsics:

llvm.bitreverse.*
llvm.bswap.*
llvm.ctpop.*
llvm.ctlz.*
llvm.cttz.*
llvm.fshl.*
llvm.fshr.*

but none of them matches this kind of packing operation.
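
For what it's worth, those generic ones are already reachable from plain Julia without ccall:

count_ones(0b11001010)   # llvm.ctpop
leading_zeros(0x01)      # llvm.ctlz
trailing_zeros(0b1000)   # llvm.cttz
bswap(0x1234)            # llvm.bswap
bitreverse(0x01)         # llvm.bitreverse (Julia >= 1.5)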

Thank you for the clarification. Maybe this will come with future versions of LLVM.

PS: I was not able to write the code in a way that makes the compiler produce these architecture-specific instructions.
