Often we need to process data consisting of a list of coordinates: data = {{x1,y1}, {x2,y2}, ..., {xn,yn}}.  It could be 2D or 3D coordinates, or any other arbitrary length list of fixed length small vectors.
Let me illustrate how to use Compile for such problems using the simple example of summing up a list of 2D vectors:
data = RandomReal[1, {1000000, 2}];
First, obvious version:
fun1 = Compile[{{vec, _Real, 2}},
  Module[{sum = vec[[1]]},
   Do[sum += vec[[i]], {i, 2, Length[vec]}];
   sum
   ]
  ]
How fast is it?
In[13]:= Do[fun1[data], {10}] // Timing
Out[13]= {4.812, Null}
Second, less obvious version:
fun2 = Compile[{{vec, _Real, 1}},
  Module[{sum = vec[[1]]},
   Do[sum += vec[[i]], {i, 2, Length[vec]}];
   sum
   ]
  ]
In[18]:= Do[
  fun2 /@ Transpose[data],
  {10}
  ] // Timing
Out[18]= {1.078, Null}
As you can see, the second version is much faster.  Why?  Because the crucial operation, sum += ... is an addition of numbers in fun2 while it's an addition of arbitrary length vectors in fun1.
You can see a practical application of the same "optimization" in this asnwer of mine, but many other examples could be given where this is relevant.
Now in this simple example the code using fun2 is not longer or much more complex than fun1, but in the general case it very well might be.
How can I tell Compile that one of its arguments is not an arbitrary n*m matrix, but a special n*2 or n*3 one, so it can do these optimization automatically rather than using a generic vector addition function to add tiny length-2 or length-3 vectors?
Addendum
To make it more clear what's happening, we can use CompilePrint:
CompilePrint[fun1] gives
        1 argument
        5 Integer registers
        5 Tensor registers
        Underflow checking off
        Overflow checking off
        Integer overflow checking on
        RuntimeAttributes -> {}
        T(R2)0 = A1
        I1 = 2
        I0 = 1
        Result = T(R1)3
1   T(R1)3 = Part[ T(R2)0, I0]
2   I3 = Length[ T(R2)0]
3   I4 = I0
4   goto 8
5   T(R1)2 = Part[ T(R2)0, I4]
6   T(R1)4 = T(R1)3 + T(R1)2
7   T(R1)3 = CopyTensor[ T(R1)4]]
8   if[ ++ I4 < I3] goto 5
9   Return
CompilePrint[fun2] gives
        1 argument
        5 Integer registers
        4 Real registers
        1 Tensor register
        Underflow checking off
        Overflow checking off
        Integer overflow checking on
        RuntimeAttributes -> {}
        T(R1)0 = A1
        I1 = 2
        I0 = 1
        Result = R2
1   R2 = Part[ T(R1)0, I0]
2   I3 = Length[ T(R1)0]
3   I4 = I0
4   goto 8
5   R1 = Part[ T(R1)0, I4]
6   R3 = R2 + R1
7   R2 = R3
8   if[ ++ I4 < I3] goto 5
9   Return
I chose to include this rather than the considerably lengthier C version, where the timing difference is even more pronounced.