Support for bit packing and unpacking operations (PEXT, PDEP) #93

Open
mchristianl opened this issue Jan 11, 2022 · 2 comments

Comments

@mchristianl

mchristianl commented Jan 11, 2022

Hi,

I have stumbled upon the parallel bit deposit and extract operations pext and pdep of the x86 bit manipulation instruction set extensions (BMI2). They might not be directly related to SIMD, but since these instructions are quite useful as horizontal operations, they seem a good fit for SIMD.jl.

Is there (already) any way to use these assembly instructions from within Julia?

Example:

bitunpack(0b11001010,true)
# <8 x Bool>[1, 1, 0, 0, 1, 0, 1, 0]
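
For context, pdep scatters the low bits of its first argument to the positions selected by its mask argument, and pext gathers the masked bits back into the low bits; the non-reversed bitunpack below is essentially pdep with the mask 0x0101010101010101 (viewing the UInt64 result as 8 bytes). A portable scalar sketch of these semantics (reference implementations of my own, not the hardware instructions):

function pext_ref(x::T, mask::T) where {T<:Unsigned}
  r = zero(T); k = 0
  for i in 0:8*sizeof(T)-1
    if (mask >> i) & one(T) == one(T)  # bit i is selected by the mask
      r |= ((x >> i) & one(T)) << k    # gather it into result bit k
      k += 1
    end
  end
  return r
end

function pdep_ref(x::T, mask::T) where {T<:Unsigned}
  r = zero(T); k = 0
  for i in 0:8*sizeof(T)-1
    if (mask >> i) & one(T) == one(T)  # bit i is selected by the mask
      r |= ((x >> k) & one(T)) << i    # scatter source bit k to position i
      k += 1
    end
  end
  return r
end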

In the special case of Vec{N,Bool} this can be achieved with some bit-twiddling: a single integer multiplication, which acts as a batch of shift-and-adds that moves every bit into place at once, plus only scalar bit operations:

# a slow function to compute periodic bit patterns for the use as constants
interleavemask(::Type{T},period,amount,shift,start=1,stop=sizeof(T)*8) where T =
  reduce(|,T(((n + (shift > 0 ? period-shift : -shift)) % period) < amount) << n for n in start-1:stop-1)
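
The hex constants noted in the comments below can be reproduced with it, for example:

interleavemask(UInt64, 9, 1, 0)         # 0x8040201008040201
interleavemask(UInt64, 7, 1, 0, 2, 63)  # 0x0102040810204080
interleavemask(UInt64, 8, 1, 0)         # 0x0101010101010101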

# pack a vector of 8 bits into a single byte
@inline function bitpack(xs::Vec{8,Bool},reverse::Bool=false)::UInt8
  x = reinterpret(UInt64,xs)
  m = (reverse ? interleavemask(UInt64,8+1,1,0)         # 0x8040201008040201
               : interleavemask(UInt64,8-1,1,0,2,64-1)) # 0x0102040810204080
  return unsafe_trunc(UInt8,(x*m) >> (64-8))
end

# unpack a single byte into a vector of 8 bits
@inline function bitunpack(x::UInt8,reverse::Bool=false)::Vec{8,Bool}
  if reverse
    m1 = interleavemask(UInt64,9,1,0) # 0x8040201008040201
    m2 = interleavemask(UInt64,8,1,7) # 0x8080808080808080
    return reinterpret(Vec{8,Bool},((x*m1) & m2) >> (8-1))
  else
    m1 = interleavemask(UInt64,7,1,0) # 0x8102040810204081
    m2 = interleavemask(UInt64,8,1,0) # 0x0101010101010101
    x1 = x & (~0b1) # remove first bit
    return reinterpret(Vec{8,Bool},((x1*m1 | x) & m2))
  end
end

for x in typemin(UInt8):typemax(UInt8)
  @assert bitpack(bitunpack(x,false),false) == x
  @assert bitpack(bitunpack(x,true),true) == x
end

This...

function bitpack2(xs::Vec{8,Bool})
  x = reinterpret(UInt64,xs)
  m = interleavemask(UInt64,7,1,0,2,63) # 0x0102040810204080
  return unsafe_trunc(UInt8,(x*m) >> (64-8))
end
# bitpack2
#   movabsq $72624976668147840, %rax    # imm = 0x102040810204080
#   imulq   (%rdi), %rax
#   shrq    $56, %rax
#   retq

...compares favorably against the alternatives that use a hierarchical reduction; either with vector operations...

function bitpack3(x::Vec{8,Bool})
  s = Vec((UInt8(n) for n in 0:7)...)
  x0 = reinterpret(Vec{8,UInt8},x)
  x1 = x0 << s
  reduce(|,x1)
end
# bitpack3
#   vmovq   (%rdi), %xmm0               # xmm0 = mem[0],zero
#   vpmovzxbw       %xmm0, %ymm0        # ymm0 = .......
#   movabsq $.rodata.cst32, %rax
#   vpmullw (%rax), %ymm0, %ymm0
#   movabsq $139838074081344, %rax      # imm = 0x7F2E96BB5040
#   vpand   (%rax), %ymm0, %ymm0
#   vextracti128    $1, %ymm0, %xmm1
#   vpackuswb       %xmm1, %xmm0, %xmm0
#   vpshufd $85, %xmm0, %xmm1           # xmm1 = xmm0[1,1,1,1]
#   vpor    %xmm1, %xmm0, %xmm0
#   vpsrld  $16, %xmm0, %xmm1
#   vpor    %xmm1, %xmm0, %xmm0
#   vpsrlw  $8, %xmm0, %xmm1
#   vpor    %xmm1, %xmm0, %xmm0
#   vmovd   %xmm0, %eax
#   vzeroupper
#   retq

...or without...

# https://fgiesen.wordpress.com/2009/12/13/decoding-morton-codes/
function bitpack1(x::Vec{8,Bool})
  x0 = reinterpret(UInt64,x)
  x1 = (x0 | (x0 >> ( 8-1))) & interleavemask(UInt64,16,2,0) # 0x0003000300030003
  x2 = (x1 | (x1 >> (16-2))) & interleavemask(UInt64,32,4,0) # 0x0000000f0000000f
  x3 = (x2 | (x2 >> (32-4))) & interleavemask(UInt64,64,8,0) # 0x00000000000000ff
  unsafe_trunc(UInt8,x3)
end
# bitpack1
#   movq    (%rdi), %rax
#   movq    %rax, %rcx
#   shrq    $7, %rcx
#   orq     %rax, %rcx
#   movabsq $844437815230467, %rax      # imm = 0x3000300030003
#   andq    %rcx, %rax
#   movq    %rax, %rcx
#   shrq    $14, %rcx
#   orq     %rax, %rcx
#   movq    %rcx, %rax
#   shrq    $28, %rax
#   orl     %ecx, %eax
#   ret
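
All three variants compute the same non-reversed packing, which can be checked exhaustively against bitunpack:

for x in typemin(UInt8):typemax(UInt8)
  v = bitunpack(x)
  @assert bitpack1(v) == bitpack2(v) == bitpack3(v) == x
end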

Is there a Julia package that already provides such functionality?

kind regards,
Christian

PS: The above are several variants I came up with to illustrate this on Vec{8,Bool}.

@KristofferC
Collaborator

KristofferC commented Jan 11, 2022

> They might not be directly related to SIMD, but since these instructions are quite useful as horizontal operations, they seem a good fit for SIMD.jl.

Right now, SIMD.jl provides an interface to the LLVM vector intrinsics (in https://github.com/eschnett/SIMD.jl/blob/master/src/LLVM_intrinsics.jl) and a Vec type with standard Julia operations built on top of that (in https://github.com/eschnett/SIMD.jl/blob/master/src/simdvec.jl). There is no architecture-specific functionality implemented; it is up to LLVM to turn the IR into the best instructions for the target architecture.
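
For illustration, the documented interface stays at the level of generic vector operations, e.g.:

using SIMD
v = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0))
sum(v + v)  # 20.0 -- lowered to generic LLVM vector IR; instruction selection is LLVM's job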

The advantage of this is that it keeps the package at a reasonable size (there are many, many architecture-specific intrinsics: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html), it means the same functionality is available to everyone, and it avoids non-trivial things like runtime architecture detection.

Compilers are usually quite good at pattern matching operations to intrinsics if they are available. Does this not happen here?
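
One way to check is to look at the generated native code, e.g.:

using InteractiveUtils  # for @code_native
v = bitunpack(0xca)
@code_native bitpack2(v)  # look for pext/pdep in the output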

@mchristianl
Copy link
Author

mchristianl commented Jan 11, 2022

OK. According to llvm.org, the target-independent vector intrinsics relevant for SIMD.jl are:

llvm.vector.reduce.add.*
llvm.vector.reduce.fadd.*
llvm.vector.reduce.mul.*
llvm.vector.reduce.fmul.*
llvm.vector.reduce.and.*
llvm.vector.reduce.or.*
llvm.vector.reduce.xor.*
llvm.vector.reduce.smax.*
llvm.vector.reduce.smin.*
llvm.vector.reduce.umax.*
llvm.vector.reduce.umin.*
llvm.vector.reduce.fmax.*
llvm.vector.reduce.fmin.*
llvm.experimental.vector.insert
llvm.experimental.vector.extract
llvm.experimental.vector.reverse
llvm.experimental.vector.splice
llvm.experimental.stepvector

I found a Discourse discussion from 2017 describing how to get at these instructions:

pdep(x::UInt32, y::UInt32) = ccall("llvm.x86.bmi.pdep.32", llvmcall, UInt32, (UInt32, UInt32), x, y)
pdep(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pdep.64", llvmcall, UInt64, (UInt64, UInt64), x, y)
pext(x::UInt32, y::UInt32) = ccall("llvm.x86.bmi.pext.32", llvmcall, UInt32, (UInt32, UInt32), x, y)
pext(x::UInt64, y::UInt64) = ccall("llvm.x86.bmi.pext.64", llvmcall, UInt64, (UInt64, UInt64), x, y)
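
These require a CPU with BMI2; assuming one, a quick check of the semantics:

pext(UInt32(0b11001010), UInt32(0b11110000))  # 0x0000000c -- gathers bits 4..7 into the low bits
pdep(UInt32(0b00001100), UInt32(0b11110000))  # 0x000000c0 -- scatters the low bits into bits 4..7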

but since these live in llvm.x86.*, they are not within the scope of this package.

There are generic bit manipulation intrinsics:

llvm.bitreverse.*
llvm.bswap.*
llvm.ctpop.*
llvm.ctlz.*
llvm.cttz.*
llvm.fshl.*
llvm.fshr.*

but none of them matches this kind of packing operation.
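
For what it's worth, those generic ones are already reachable from plain Julia without ccall:

count_ones(0b11001010)   # llvm.ctpop
leading_zeros(0x01)      # llvm.ctlz
trailing_zeros(0b1000)   # llvm.cttz
bswap(0x1234)            # llvm.bswap
bitreverse(0x01)         # llvm.bitreverse (Julia >= 1.5)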

Thank you for the clarification. Maybe this will come with future versions of LLVM.

PS: I was not able to write the code in a way that makes the compiler produce these architecture-specific instructions.
