Efficient bit-fiddling in a LFSR implementation

Question

Although I have a good LSFR C implementation I thought I'd try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower than the C implementation, which begs the question: How can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.

Here's the baseline Haskell code using lists and Data.Bits:

import           Control.Monad      (when)
import           Data.Bits          (Bits, shift, testBit, xor, (.&.), (.|.))
import           System.Environment (getArgs)
import           System.Exit        (exitFailure, exitSuccess)

tap :: [[Int]]
tap = [
    [],            [],            [],            [3, 2],
    [4, 3],        [5, 3],        [6, 5],        [7, 6],
    [8, 6, 5, 4],  [9, 5],        [10, 7],       [11, 9],
    [12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
    [16,15,13,4],  [17, 14],      [18, 11],      [19, 6, 2, 1],
    [20, 17],      [21, 19],      [22, 21],      [23, 18],
    [24,23,22,17], [25, 22],      [26, 6, 2, 1], [27, 5, 2, 1],
    [28, 25],      [29, 27],      [30, 6, 4, 1], [31, 28],
    [32,22,2,1],   [33,20],       [34,27,2,1],   [35,33],
    [36,25],       [37,5,4,3,2,1],[38,6,5,1],    [39,35],
    [40,38,21,19], [41,38],       [42,41,20,19], [43,42,38,37],
    [44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
    [48,47,21,20], [49,40],       [50,49,24,23], [51,50,36,35],
    [52,49],       [53,52,38,37], [54,53,18,17], [55,31],
    [56,55,35,34], [57,50],       [58,39],       [59,58,38,37],
    [60,59],       [61,60,46,45], [62,61,6,5],   [63,62]        ]

xor' :: [Bool] -> Bool
xor' = foldr xor False

mask ::  (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1

advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
    | d0        = shifted
    | otherwise = shifted .|. 1
    where
        shifted = shift lfsr 1 .&. mask len
        d0 = xor' $ map (testBit lfsr) tap'
        tap' = map (subtract 1) tap

main :: IO ()
main = do
    args <- getArgs
    when (null args) $ fail "Usage: lsfr <number-of-bits>"
    let len = read $ head args
    when (len < 8) $ fail "No need for LFSR"
    let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
    if out == 0 then do
        putStr "OK\n"
        exitSuccess
    else do
        putStr "FAIL\n"
        exitFailure

Basically it tests whether the LSFR defined in tap :: [[Int]] for any given bit-length is of maximum-length. (More precisely, it just checks whether the LSFR reaches the initial state (zero) after 2ⁿ iterations.)

According to the profiler the most costly line is the feedback bit d0 = xor' $ map (testBit lfsr) tap'.

What I've tried so far:

use Data.Array: Attempt abandoned because there's no foldl/r
use Data.Vector: Slightly faster than the baseline

The compiler options I use are: -O2, LTS Haskell 8.12 (GHC-8.0.2).

The reference C++ program can be found on gist.github.com.

The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much, there must be a better way to do the bit-fiddling.

Update: Results of applying the optimisations suggested in the answers

The reference C++ program with input 28, compiled with LLVM 8.0.0, runs in 0.67s on my machine (the same with clang 3.7 is marginally slower, 0.68s)
The baseline Haskell code runs about 100x slower (because of the space inefficiency don't try it with inputs larger than 25)
With the rewrite of @Thomas M. DuBuisson, still using the default GHC backend, the execution time goes down to 5.2s
With the rewrite of @Thomas M. DuBuisson, now using the LLVM backend (GHC option -O2 -fllvm), the execution time goes down to 1.7s
- Using GHC option -O2 -fllvm -optlc -mcpu=native brings this to 0.73s
Replacing iterate with iterate' of @cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used.

So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!

Note The LLVM backend to GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing this version with brew and then symlinking opt and llc. See 7.10. GHC Backends.

How are you compiling? What compiler? What version? How are you executing it? What arguments? Why are you using foldr here? Where is the C implementation for comparison? — Thomas M. DuBuisson, Apr 25 '17 at 05:10
Try replacing `shift 1 len` with the more straight-forward `2^len` and re-benchmark. — Thomas M. DuBuisson, Apr 25 '17 at 05:20
And what's with the branch in advance? It seems more in the form of the other code if you use `advance ... = shifted .|. fromEnum d0`. — Thomas M. DuBuisson, Apr 25 '17 at 05:28
1. replace the branch: Good point. But it just makes for a 5% improvement in speed. — mcmayer, Apr 25 '17 at 05:59
2. Replace `shift 1 len` by `2 ^ len`: 2.259s -> 0.178s - that is bizarre, 2^len is needed exactly once, why would this make a difference at all? — mcmayer, Apr 25 '17 at 06:02
3. `foldr` vs. `foldl`: It makes almost no difference. Empirically `foldr` was slightly faster! — mcmayer, Apr 25 '17 at 06:20

Thomas M. DuBuisson · Accepted Answer · 2017-04-26T01:20:41.417

Up Front Matters

For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.

First Draft: Oh No! The slows!

Your starting code with parameter 25 runs:

% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main             ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25  7.25s user 0.50s system 99% cpu 7.748 total

So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Lets dive in.

Issue 1: Shifty Code

I found a couple of oddities with the code. First was the use of shift in high performance code. Shift supports both left and right shift and to do so it requires a branch. Lets kill that with more readable powers of two and such (shift 1 x ~> 2^x and shift x 1 ~> 2*x):

% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main             ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25  0.64s user 0.00s system 99% cpu 0.637 total

(As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code resulted)

Issue 2: Lists Of Bits? Int operations save the day!

One change, one order of magnitude. Yay. What else? Well you have this awkward list of bit locations you're tapping that just seems like its begging for inefficiency and/or leans on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lsfr 24 `xor` testBit lsfr 21) but we want a more general fast solution.

I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single Int passed in to advance instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably -optlc-mcpu=native or another instruction set selection that is non-pessimistic.

This step gives us pc below. I've folded in the guard-removal of advance that was mentioned in the comments:

let tp = sum $ map ((2^) . subtract 1) (tap !! len)
    pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
    mask = 2^len - 1
    advance' :: Int -> Int
    advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr 
    out :: Int
    out = last $ take (2^len) $ iterate advance' 0

Our resulting performance is:

% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25      
[1 of 1] Compiling Main             ( so.hs, so.o )
Linking so ...
OK
./so 25  0.06s user 0.00s system 96% cpu 0.067 total

That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings but this is often an educational exercise so I hope you had fun.

Edit: The now-available C++ code takes my system 0.10 (g++ -O3) and 0.12 (clang++ -O3 -march=native) seconds, so it seems we've beat our mark by a fair bit.

If your compiler options aren't making the `iterate advance'` call strict, your answer may be further improved by using the [strict `iterate'` from my answer](http://stackoverflow.com/a/43606802/414413). — Cirdec, Apr 25 '17 at 09:34
It is, and it actually beats the C++ code, which is nice but not entirely unexpected since the algorithm has seen a small improvement. I couldn't get the unboxed vectors to perform as well but I also didn't bother disabling the bounds checks, which possibly accounts for a bit of the difference. — Thomas M. DuBuisson, Apr 26 '17 at 01:26

score 6 · Answer 2 · edited May 23 '17 at 11:54

I suspect that the following line is building a large list-like thunk in memory before evaluating it.

let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0` is

Let's find out if I'm right, and if I am, we'll fix it. The first debugging step is to get an idea of the memory used by the program. To do this we're going to compile with the options -rtsopts in addition to -O2. This enables running the program with RTS options, inclusing +RTS -s which outputs a small memory summary.

Initial Performance

Running your program as lfsr 25 +RTS -s I get the following output

OK
   5,420,148,768 bytes allocated in the heap
   6,705,977,216 bytes copied during GC
   1,567,511,384 bytes maximum residency (20 sample(s))
     357,862,432 bytes maximum slop
            3025 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10343 colls,     0 par    2.453s   2.522s     0.0002s    0.0009s
  Gen  1        20 colls,     0 par    2.281s   3.065s     0.1533s    0.7128s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.438s  (  1.162s elapsed)
  GC      time    4.734s  (  5.587s elapsed)
  EXIT    time    0.016s  (  0.218s elapsed)
  Total   time    6.188s  (  6.967s elapsed)

  %GC     time      76.5%  (80.2% elapsed)

  Alloc rate    3,770,538,273 bytes per MUT second

  Productivity  23.5% of total user, 19.8% of total elapsed

That's a lot of memory used at once. It's very likely there's a big thunk building up in there somewhere.

Trying to reduce the thunk size

I hypothesized that the thunk is being built in iterate (advance ...). If this is the case, we can try to reduce the thunk size by making advance more strict in its lsfr argument. This won't remove the spine of the thunk (the successive iterations), but it might reduce the size of the state that's built up as the spine's evaluated.

BangPatterns is an easy way to make a function strict in an argument. f !x = .. is shorthand for f x = seq x $ ...

{-# LANGUAGE BangPatterns #-}

advance :: Int -> [Int] -> Int -> Int
advance len tap = go
  where
    go !lfsr
      | d0        = shifted
      | otherwise = shifted .|. 1
      where
        shifted = shift lfsr 1 .&. mask len
        d0 = xor' $ map (testBit lfsr) tap'
    tap' = map (subtract 1) tap

Let's see what difference this makes ...

>lfsr 25 +RTS -s
OK
   5,420,149,072 bytes allocated in the heap
   6,705,979,368 bytes copied during GC
   1,567,511,448 bytes maximum residency (20 sample(s))
     357,862,448 bytes maximum slop
            3025 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10343 colls,     0 par    2.688s   2.711s     0.0003s    0.0059s
  Gen  1        20 colls,     0 par    2.438s   3.252s     0.1626s    0.8013s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.328s  (  1.146s elapsed)
  GC      time    5.125s  (  5.963s elapsed)
  EXIT    time    0.000s  (  0.226s elapsed)
  Total   time    6.484s  (  7.335s elapsed)

  %GC     time      79.0%  (81.3% elapsed)

  Alloc rate    4,081,053,418 bytes per MUT second

  Productivity  21.0% of total user, 18.7% of total elapsed

None that's noticeable.

Eliminating the Spine

I guess it's the spine of that iterate (advance ...) that's being built. After all, for the command I'm running the list would be 2^25, or a little over 33 million items long. The list itself is probably being removed by list fusion, but the thunk for the last item of the list is over 33 million applications of advance ...

To solve this problem we need a strict version of iterate so that the value is forced to an Int before applying the advance function again. This should keep the memory down to only a single lfsr value at a time, along with the currently computed application of advance.

Unfortunately, there isn't a strict iterate in Data.List. Here's one that doesn't give up on the list fusion that's providing other important (I think) performance optimizations to this problem.

{-# LANGUAGE BangPatterns #-}

import GHC.Base (build)

{-# NOINLINE [1] iterate' #-}
iterate' :: (a -> a) -> a -> [a]
iterate' f = go
  where go !x = x : go (f x)

{-# NOINLINE [0] iterateFB' #-}
iterateFB' :: (a -> b -> b) -> (a -> a) -> a -> b
iterateFB' c f = go
  where go !x = x `c` go (f x)

{-# RULES
"iterate'"    [~1] forall f x. iterate' f x = build (\c _n -> iterateFB' c f x)
"iterateFB'"  [1]              iterateFB' (:) = iterate'
 #-}

This is just iterate from GHC.List (along with all its rewrite rules), but made strict in the accumulated argument.

Equipped with a strict iterate, iterate', we can change the troublesome line to

let out = last $ take (shift 1 len) $ iterate' (advance len (tap!!len)) 0

I expect that this will perform much better. Let's see ...

>lfsr 25 +RTS -s
OK
   3,758,156,184 bytes allocated in the heap
         297,976 bytes copied during GC
          43,800 bytes maximum residency (1 sample(s))
          21,736 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7281 colls,     0 par    0.047s   0.008s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.750s  (  0.783s elapsed)
  GC      time    0.047s  (  0.008s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.797s  (  0.792s elapsed)

  %GC     time       5.9%  (1.0% elapsed)

  Alloc rate    5,010,874,912 bytes per MUT second

  Productivity  94.1% of total user, 99.0% of total elapsed

This used 0.00002 times as much memory and ran 10 times as fast.

I don't know if this will improve on Thomas DeBuisson's answer that improves advance but still leaves a lazy iterate advance' in place. It would be easy to check; add the iterate' code to that answer and use iterate' in place of iterate in that answer.

I'm wrong about the list fusion; the rules don't work when the argument to `take` is `shift 1 len`, but they do work if it's changed to `2^len`. — Cirdec, Apr 25 '17 at 14:47
The fusion rules working with `2^len` gets another 20% improvement. — Cirdec, Apr 25 '17 at 14:54

Alexey Romanov · Answer 3 · 2017-04-25T08:11:50.853

Does the compiler lift tap !! len out of the loop? I suspect it does, but moving it out to guarantee this can't hurt:
```
let tap1 = tap !! len
let out = last $ take (shift 1 len) $ iterate (advance len tap1) 0    
```

In the comments you say "2^len is needed exactly once", but this is wrong. You do it each time in advance. So you could try

advance len tap mask lfsr
    | d0        = shifted
    | otherwise = shifted .|. 1
    where
        shifted = shift lfsr 1 .&. mask
        d0 = xor' $ map (testBit lfsr) tap'
        tap' = map (subtract 1) tap

-- in main
let tap1 = tap !! len
let numIterations = 2^len
let mask = numIterations - 1
let out = iterate (advance len tap1 mask) 0 !! (numIterations - 1)

(the compiler can't optimize last $ take ... to !! in general, because they are different for finite lists, but iterate always returns an infinite one.)

You compared foldr with foldl, but foldl is almost never what you need; since xor always needs both arguments and is associative, foldl' is very likely to be the right choice (the compiler can optimize it, but if there is any real difference between foldl and foldr and not just random variation, it might have failed in this case).

Efficient bit-fiddling in a LFSR implementation

3 Answers3

Initial Performance

Trying to reduce the thunk size

Eliminating the Spine

Linked