I have these basic types in my code:
```haskell
newtype Foo (m :: Factored) = Foo Int64 deriving (NFData)

foo :: forall m . (Fact m) => Foo m -> Foo m

class T t where t :: (Fact m) => t m -> t m

instance T Foo where t = foo

newtype Bar t (m :: Factored) = Bar (t m) deriving (NFData)

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v
```
(Ignore `Fact` and `Factored` for the moment.) I'm benchmarking the code at different levels, comparing the performance of `foo`, `t`, and `bar`. In the benchmarks, `t = foo`, and `bar` just applies `t` through a newtype, so their runtimes should be essentially identical. But criterion reports that `foo` takes 9.2 ns, `t` takes about twice that at 17.45 ns, and `bar` takes a whopping 268.1 ns.
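For concreteness, the three benchmarks all have roughly this shape (a sketch collapsed into one module; the real FooBench/TBench/BarBench modules are linked below, the module exports and the `Fact F128` instance are assumed to come from the linked code, and the input value here is just illustrative):

```haskell
{-# LANGUAGE DataKinds #-}
module Main (main) where

import Criterion.Main (bench, defaultMain, nf)

import Bar      (Bar (..), bar)
import Factored (Factored (F128))   -- assumed export; F128 :: Factored
import Foo      (Foo (..), T (..), foo)

main :: IO ()
main = do
  let x = Foo 42 :: Foo F128         -- illustrative input at type Foo F128
  defaultMain
    [ bench "foo" $ nf foo x         -- ~9.2 ns
    , bench "t"   $ nf t   x         -- ~17.45 ns
    , bench "bar" $ nf bar (Bar x)   -- ~268.1 ns
    ]
```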
I've experimented with adding `INLINABLE` and even a `SPECIALIZE` pragma, but they aren't helping. I want to believe that GHC has some magic syntax/optimization/etc. that can be consistently applied to solve these kinds of performance issues. For example, I've seen cases where writing code in point-free style has dramatic performance implications.
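Roughly where I put those pragmas (a sketch of Bar.hs following the module layout below; the exact extensions and placement in my real code may differ slightly):

```haskell
{-# LANGUAGE DataKinds, KindSignatures, GeneralizedNewtypeDeriving #-}
-- (the real modules may enable a couple more extensions)
module Bar (Bar (..), bar) where

import Control.DeepSeq (NFData)

import Factored (Fact, Factored)
import Foo      (Foo, T (..))

newtype Bar t (m :: Factored) = Bar (t m) deriving (NFData)

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v
{-# INLINABLE bar #-}
-- Specialise away only the T dictionary; Fact m has to stay polymorphic
-- because F128 may only be reified at runtime:
{-# SPECIALIZE bar :: Fact m => Bar Foo m -> Bar Foo m #-}
```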
The complete code can be found here. I promise it's not intimidating. The modules are:
- Foo: defines `Foo`, `foo`, and `T`
- Bar: defines `Bar` and `bar`
- FooBench: defines a benchmark for `foo`
- TBench: defines a benchmark for `t`
- BarBench: defines a benchmark for `bar`
- Main: runs the three benchmarks
- Factored: defines `Fact` and `Factored` using singletons
Most of the modules are tiny; I defined the three benchmarks in separate files so that I could examine their core. I generated the core for the three *Bench modules and aligned them as best I could. They're only ~250 lines each, and the first ~200 lines are identical up to renaming. The problem is that I don't know what to make of the last 50 or so lines. The (core) diff for FooBench vs TBench is here, the diff for TBench vs BarBench is here, and the diff for FooBench vs BarBench is here.
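(For reference, each *Bench module was compiled with roughly the following dump flags to produce the core; the exact suppression flags may differ from what you'd use:)

```haskell
-- At the top of FooBench.hs / TBench.hs / BarBench.hs:
{-# OPTIONS_GHC -O2 -ddump-simpl -ddump-to-file -dsuppress-all #-}
```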
Just a few of the questions I have:
- At a high level, what is the essential difference between the core files? I'm looking for something like "Here you can see that GHC isn't inlining the call to `x`." What should I be looking for?
- What can be done to make the three benchmarks all run in about 9.2 ns? GHC optimizations? `INLINE`/`INLINABLE` pragmas? `SPECIALIZE` pragmas that I missed? (You aren't allowed to specialize for `F128 :: Factored`; in my real library this value may be reified at runtime.) Restricting/delaying inlining to a particular phase (see the sketch after this list for the kind of annotation I mean)?
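By phase restriction I mean annotations like the following (a sketch of where they would go on the existing definitions, not a standalone module; I haven't found phases that help):

```haskell
-- In Foo.hs, on the instance method:
instance T Foo where
  t = foo
  {-# INLINE [1] t #-}     -- don't inline t before simplifier phase 1

-- In Bar.hs, on the wrapper:
bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v
{-# INLINE [0] bar #-}     -- inline bar only in the final phase
```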
Although I'm looking for an actual solution to make the benchmarks all fast, it's possible that tricks that work for this example won't scale to my real library. As a result, I'm also looking for the "high-level" explanation of why a particular technique should work.