Haskell Performance by Example

Question

I have these basic types in my code:

newtype Foo (m :: Factored) = Foo Int64 deriving (NFData)

foo :: forall m . (Fact m) => Foo m -> Foo m

class T t where t :: (Fact m ) => t m -> t m

instance T Foo where t = foo

newtype Bar t (m :: Factored) = Bar (t m) deriving (NFData)

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v

(ignore Fact and Factored for the moment). I'm benchmarking the code at different levels, comparing the performance of foo, t, and bar. In the benchmarks, t = foo, and bar just applies t through a newtype. Thus their runtime should be essentially identical, but criterion reports that foo takes 9.2ns, t takes about twice that at 17.45ns, and bar takes a whopping 268.1ns.

I've experimented with adding INLINABLE and even a SPECIALIZE pragma, but they aren't helping. I want to believe that GHC has some magic syntax/optimization/etc that can be consistently applied to solve these types of performance issues. For example, I've seen cases where writing code in point-free style has dramatic performance implications.

The complete code can be found here. I promise it's not intimidating. The modules are:

Foo: defines Foo, foo, and T
Bar: defines Bar and bar
FooBench: defines a benchmark for foo
TBench: defines a benchmark for t
BarBench: defines a benchmark for bar
Main: runs the three benchmarks
Factored: defines Fact and Factored using singletons

Most of the modules are tiny; I defined the three benchmarks in separate files so that I could examine their core. I generated the core for the three *Bench modules and aligned them the best I could. They're only ~250 lines each, and the first ~200 lines are identical, up to renaming. The problem is that I don't know what to make of the last 50 or so lines. The (core) diff for FooBench vs TBench is here, the diff for TBench vs BarBench is here, and the diff for FooBench vs BarBench is here.

Just a few of the questions I have:

At a high level, what is the essential difference between the core files? I'm looking for something like "Here you can see that GHC isn't inlining the call to x." What should I be looking for?
What can be done to make the three benchmarks all run in about 9.2ns? GHC optimizations? INLINE/INLINABLE pragmas? SPECIALIZE pragmas that I missed? (You aren't allowed to specialized for F128::Factored; in my real library this value may be reified at runtime.) Restricting/delaying inlining to a particular phase?

Although I'm looking for an actual solution to make the benchmarks all fast, it's possible that tricks that work for this example won't scale to my real library. As a result, I'm also looking for the "high-level" explanation of why a particular technique should work.

This may be helpful: http://stackoverflow.com/questions/6121146/reading-ghc-core — James Brock, Jul 13 '16 at 12:24
My _guess_ is that the difference you're observing is the cost of passing the `Fact` and `T` dictionaries around — Benjamin Hodgson, Jul 13 '16 at 12:48
@BenjaminHodgson I'd really like to get specialization all the way down the stack so that there aren't any dictionaries involved once you monomorphize at the top level. Any idea how to make that happen? — crockeea, Jul 13 '16 at 17:21
@Michael How do you suggest I benchmark code? Every question I've seen on SO involving people *not* using criterion ends up with everyone talking about how it's the only way to get reliable data. — crockeea, Jul 13 '16 at 22:33

score 6 · Accepted Answer · answered Jul 13 '16 at 13:18

6

First, looking at bar:

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v

we can write this without needing the argument using coerce:

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar = (coerce :: (t m -> t m) -> Bar t m -> Bar t m) t

this (as we'd hope) gets bar performing the same as t. (In fact the core for TBench and BarBench is exactly the same, excluding type signatures).

I'm not entirely sure why, but using INLINE instead of INLINEABLE makes t and bar perform the same as foo. I'm not an expert but it's normally better to use INLINE for small functions that you're sure you want to inline.

That said I think some of these issues are from how criterion does benchmarks to stop ghc cheating. For instance, writing bench "Bar" $ nf (GHC.Magic.inline bar) x on your original code has bar performing the same as foo. I suspect a "real" program wouldn't be so delicate.

answered Jul 13 '16 at 13:18

cchalmers

2,896
1
11
11

1

How is `bar` not violating the [newtypes are free](https://ghc.haskell.org/trac/ghc/wiki/NewtypeWrappers) model? I changed my GADTs to newtypes for the example precisely so that I could rule out data constructors being the issue. Since I don't have newtypes in my real code, I can't use `coerce`. – crockeea Jul 13 '16 at 17:16
As for `INLINE`/`INLINABLE`, neither seem to help in my real code. It's very difficult to make an example small enough to get input on, but large enough that solutions still apply. Similarly with `GHC.Magic.inline` (which sounds great!), I only managed to slow my program down. I really do appreciate your effort though, and I'll continue to play around with the ideas. – crockeea Jul 13 '16 at 17:18
I'm also interested if you looked at the core for a diagnosis, or if you only looked at it as a confirmation of the solution. I really just want to understand how you approached the problem. – crockeea Jul 13 '16 at 18:39
1

Honestly I just tried out different things. I tried `GHC.Magic.inline` first so I knew it was in inlining problem. I did try to look at the core but I only understand the basics. I think by writing `bar (Bar v) = Bar $ t v` you change `bar`s airity so `bar` won't willingly inline unless it's directly applied to a `Bar` (which it isn't in the benchmark). – cchalmers Jul 13 '16 at 19:59

Haskell Performance by Example

1 Answers1