Efficient handling of list of small vectors with Compile

Question

Often we need to process data consisting of a list of coordinates: data = {{x1,y1}, {x2,y2}, ..., {xn,yn}}. It could be 2D or 3D coordinates, or any other arbitrary length list of fixed length small vectors.

Let me illustrate how to use Compile for such problems using the simple example of summing up a list of 2D vectors:

data = RandomReal[1, {1000000, 2}];

First, obvious version:

fun1 = Compile[{{vec, _Real, 2}},
  Module[{sum = vec[[1]]},
   Do[sum += vec[[i]], {i, 2, Length[vec]}];
   sum
   ]
  ]

How fast is it?

In[13]:= Do[fun1[data], {10}] // Timing
Out[13]= {4.812, Null}

Second, less obvious version:

fun2 = Compile[{{vec, _Real, 1}},
  Module[{sum = vec[[1]]},
   Do[sum += vec[[i]], {i, 2, Length[vec]}];
   sum
   ]
  ]

In[18]:= Do[
  fun2 /@ Transpose[data],
  {10}
  ] // Timing

Out[18]= {1.078, Null}

As you can see, the second version is much faster. Why? Because the crucial operation, sum += ... is an addition of numbers in fun2 while it's an addition of arbitrary length vectors in fun1.

You can see a practical application of the same "optimization" in this asnwer of mine, but many other examples could be given where this is relevant.

Now in this simple example the code using fun2 is not longer or much more complex than fun1, but in the general case it very well might be.

How can I tell Compile that one of its arguments is not an arbitrary n*m matrix, but a special n*2 or n*3 one, so it can do these optimization automatically rather than using a generic vector addition function to add tiny length-2 or length-3 vectors?

Addendum

To make it more clear what's happening, we can use CompilePrint:

CompilePrint[fun1] gives

        1 argument
        5 Integer registers
        5 Tensor registers
        Underflow checking off
        Overflow checking off
        Integer overflow checking on
        RuntimeAttributes -> {}

        T(R2)0 = A1
        I1 = 2
        I0 = 1
        Result = T(R1)3

1   T(R1)3 = Part[ T(R2)0, I0]
2   I3 = Length[ T(R2)0]
3   I4 = I0
4   goto 8
5   T(R1)2 = Part[ T(R2)0, I4]
6   T(R1)4 = T(R1)3 + T(R1)2
7   T(R1)3 = CopyTensor[ T(R1)4]]
8   if[ ++ I4 < I3] goto 5
9   Return

CompilePrint[fun2] gives

        1 argument
        5 Integer registers
        4 Real registers
        1 Tensor register
        Underflow checking off
        Overflow checking off
        Integer overflow checking on
        RuntimeAttributes -> {}

        T(R1)0 = A1
        I1 = 2
        I0 = 1
        Result = R2

1   R2 = Part[ T(R1)0, I0]
2   I3 = Length[ T(R1)0]
3   I4 = I0
4   goto 8
5   R1 = Part[ T(R1)0, I4]
6   R3 = R2 + R1
7   R2 = R3
8   if[ ++ I4 < I3] goto 5
9   Return

I chose to include this rather than the considerably lengthier C version, where the timing difference is even more pronounced.

Perhaps this is an XY problem and the question should really be "how do I perform numerical tasks like this quickly"? If so, the correct answer is certainly "Mathematica is the wrong tool for this job". — J D, Nov 21 '11 at 22:01

Leonid Shifrin · Answer 1 · 2011-11-18T18:32:33.177

Your addendum is actually almost enough to see what the problem is. For the first version, you invoke CopyTensor in an inner loop, and this is the main reason for inefficiency, since lots of small buffers must be allocated on the heap and then released. To illustrate, here is a version which does not copy:

fun3 =
 Compile[{{vec, _Real, 2}},
   Module[{sum = vec[[1]], len = Length[vec[[1]]]},   
     Do[sum[[j]] += vec[[i, j]], {j, 1, len}, {i, 2, Length[vec]}];
     sum], CompilationTarget -> "C"]

(by the way, I think that the speed comparison is more fair when compiled to C, since the Mathematica virtual machine does, for example, much more heavily discourage nested loops). This function is still slower than yours, but about 3 times faster than fun1, for such small vectors.

The rest of the inefficiency is, I believe, inherent to this approach. The fact that you can decompose the problem into solving for sums of individual components is what makes your second function efficient, because you use structural operations like Transpose, and, most importantly, this allows you to squeeze more instructions out of the inner loop. Because this is what matters the most - you must have as few instructions in an inner loop as possible. You can see from CompilePrint that this is indeed the case for fun1 vs fun3. In a way, you found (for this problem) an efficient high-level way to manually unroll the outer loop (the one over the coordinate index). An alternative you suggest would ask the compiler to unroll the outer loop automatically, based on the extra information on vector dimensionality. This sounds like a plausible optimization, but has not probably been implemented for the Mathematica virtual machine yet.

Note also that for larger lengths of vectors (say 20), the difference between fun1 and fun2 disappears, because the cost of memory allocation / deallocation in tensor copying becomes insignificant compared to the cost of massive assignment (which is still implemented more efficiently when you assign vector to vector - perhaps because you can use things like memcpy in that case).

To conclude, I think that while it would be nice to have this optimization automatic, at least in this particular case, this is a kind of low-level optimization that is hard to expect to be fully automatic - even optimizing C compilers do not always perform it. One thing you could try is to hard-code the vector length into compiled function, then use SymbolicCGenerate (from CCodeGenerator` package) to generate symbolic C, then use ToCCodeString to generate the C code (or, whatever other way you use to get a C Code for the compiled function), and then try to create and load the library manually, enabling all optimizations for the C compiler via options to CreateLibrary. Whether or not this would work I don't know. EDIT I actually doubt that this will help at all, since the loops are already implemented with goto-s for speed when C code is generated, and this will likely prevent the compiler from attempting the loop unrolling.

score 5 · Answer 2 · edited Nov 18 '11 at 17:13

5

It is always a good option to look for a function that does exactly what you want to do.

In[50]:= fun3=Compile[{{vec,_Real,2}},Total[vec]]

Out[50]= CompiledFunction[{vec},Total[vec],-CompiledCode-]

In[51]:= Do[fun3[data],{10}]//Timing

Out[51]= {0.121982,Null}

In[52]:= fun3[data]===fun1[data]

Out[52]= True

Another option, less efficient (*due to the transpose *) is to use Listable

fun4 = Compile[{{vec, _Real, 1}}, Total[vec], 
  RuntimeAttributes -> {Listable}]

In[63]:= Do[fun4[Transpose[data]],{10}]//Timing

Out[63]= {0.235964,Null}

In[64]:= Do[Transpose[data],{10}]//Timing

Out[64]= {0.133979,Null}

In[65]:= fun4[Transpose[data]]===fun1[data]

Out[65]= True

edited Nov 18 '11 at 17:13

acl

6,490
1
27
33

answered Nov 18 '11 at 16:21

1

He probably used this simple example to illustrate the point, rather than because he was not aware of `Total`. – acl Nov 18 '11 at 17:15
Hi Oliver! *Finally*, you are here! Welcome to SO, and of course +1. – Leonid Shifrin Nov 18 '11 at 17:29
Yes, that was just an example. My point was that if I write C code by hand, it's very easy to handle such a situation (2*n matrix) efficiently, and I was wondering if there's a way to make Compile handle it efficiently too. But I'll add some better examples later! – Szabolcs Nov 19 '11 at 10:04
Szbolcs, in order to make efficient use of M- you have to let go of C-type thinking. Think vectors, then you will get the most of M-. So it were good if you could provide a better example. – Nov 19 '11 at 14:13
1) There was no explicit request of that form. I think it is very relevant to have an example that shows the problem. 2) The option to use Listable in does what the OP wanted. – Nov 20 '11 at 11:39
1

@Jon Harrop Congratulations! So far, you are doing everything right, provided that your goal is to make the SO Mathematica community maximally hostile to you. I find Oliver's post very relevant (if only to add more background on Compile), as his other posts here and on MathGroup, apart from the fact that he is one of the few people in the world with a genuine knowledge of internals of `Compile` and many other Mathematica features. – Leonid Shifrin Nov 20 '11 at 19:33
@Leonid If you look at Jon's voting record you'll see that 1/3 of his votes are downvotes. His way of having fun I presume... – Sjoerd C. de Vries Nov 20 '11 at 20:39
@Sjoerd Well, perhaps. I've seen some users with a similar voting habits who still manage to be on-topic most of the time (Tomalak Geretkal for instance). That said, I personally don't think downvotes are that helpful, although that also perhaps depends on the tag (community) in question. This particular downvote I find very unjustified. – Leonid Shifrin Nov 20 '11 at 20:46
@LeonidShifrin: "This particular downvote I find very unjustified". I downvoted because I see no merit in pointing out that Mathematica already includes a function (`Total`) that solves this specific example efficiently when Szabolcs is interested in the general case. – J D Nov 21 '11 at 22:07
@Jon Harrop We obviously have different voting habits. I actually don't think downvotes are generally useful for anything, except in extreme cases, as a protective measure against obvious spam (but then, you can flag instead). – Leonid Shifrin Nov 21 '11 at 22:17
@LeonidShifrin: I most commonly downvote when the answer is wrong. – J D Nov 23 '11 at 15:32
2

@Jon the answer is not wrong, it is just not what you want it to be. – Nov 23 '11 at 15:49
@ruebenko: I didn't say your answer was wrong. I said it isn't what the OP wanted. If you discussed the compilation of something more general than just a single call to the built-in `Total` function it would be much more useful. – J D Nov 24 '11 at 09:50
@Jon, so it is not wrong, and is not irrelevant. I still think for this particular problem using Total is a good solution. The OP agreed to provide a different example a while back already. – Nov 24 '11 at 10:12

Efficient handling of list of small vectors with Compile

Addendum

2 Answers2