I've tested an example demonstrated in this talk [pytables] using numpy (page 20/57).
It is stated that a[:,1].sum() takes 9.3 ms, whereas a[1,:].sum() takes only 72 us.
I tried to reproduce it, but failed to do so. Am I measuring wrongly, or have things changed in NumPy since 2010?
$ python2 -m timeit -n1000 --setup \ 
  'import numpy as np; a = np.random.randn(4000,4000);' 'a[:,1].sum()' 
1000 loops, best of 3: 16.5 usec per loop
$ python2 -m timeit -n1000 --setup \ 
  'import numpy as np; a = np.random.randn(4000,4000);' 'a[1,:].sum()' 
1000 loops, best of 3: 13.8 usec per loop
$ python2 --version
Python 2.7.7
$ python2 -c 'import numpy; print numpy.version.version'
1.8.1
While I can measure a benefit for the second version (supposedly fewer cache misses because numpy uses C-style row ordering), I don't see the drastic difference stated by the pytables contributor.
Also, I cannot see more cache misses when using column vs. row summation.
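For context on why one would expect an asymmetry at all: NumPy arrays are C-ordered (row-major) by default, so a row slice is contiguous in memory while a column slice strides one full row between elements. A minimal sketch (shape and dtype match the timeit setup above):

```python
import numpy as np

a = np.random.randn(4000, 4000)  # float64, C-contiguous by default

# strides = (bytes to the next row, bytes to the next element in a row)
print(a.strides)                       # (32000, 8)
print(a[1, :].flags['C_CONTIGUOUS'])   # row slice: contiguous -> True
print(a[:, 1].flags['C_CONTIGUOUS'])   # column slice: strided  -> False
```

So each element of a column sum sits 32000 bytes from the previous one, which is what should defeat spatial locality in the cache.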
EDIT
So far the insight for me was that I was using the timeit module in the wrong way: repeated runs with the same array (or row/column of an array) will almost certainly be served from the cache. np.random.randn yields float64, so one line is 4000 * 8 byte = 31.25 KiB, which still fits into my 32 KiB of L1 data cache.
Using the script in the answer of @alim with a single loop (nloop=1) and ten trials (nrep=10), and varying the size of the random array (n x n), I am measuring:

    n      row/us   col/us   penalty col
    1k         90      100             1
    4k        100      210             2
    10k*      110      350           3.5
    20k*      120     1200            10

* n=10k and higher doesn't fit into the L1d cache anymore.
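@alim's script is not reproduced here; a rough equivalent of the single-loop measurement, allocating a fresh array per trial so earlier runs cannot have warmed the cache for its data (the nloop/nrep naming and trial count are borrowed from the table above), could look like:

```python
import numpy as np
from timeit import default_timer as timer

def bench(n, trials=10):
    # One row sum and one column sum per trial, each on a freshly
    # allocated n x n array; report the best time over all trials.
    row_us, col_us = [], []
    for _ in range(trials):
        a = np.random.randn(n, n)
        t0 = timer(); a[1, :].sum(); row_us.append((timer() - t0) * 1e6)
        t0 = timer(); a[:, 1].sum(); col_us.append((timer() - t0) * 1e6)
    return min(row_us), min(col_us)

for n in (1000, 4000):
    row, col = bench(n)
    print("n=%d: row %.0f us, col %.0f us" % (n, row, col))
```

Absolute numbers will differ per machine; the point is only that a single cold pass per array shows the row/column gap, whereas timeit's repeated loop over one cached array hides it.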
I'm still not sure how to track down the cause of this, as perf shows about the same rate of cache misses (sometimes even a higher rate) for the faster row sum.
Perf data:
nloop = 2 and nrep = 2, so I expect some of the data to still be in the cache for the second run.
Row sum n=10k
 perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/    /g'
row sum:    103.593 us
 Performance counter stats for './answer1.py':
          25850670      cache-references                                             [30.04%]
           1321945      cache-misses              #    5.114 % of all cache refs     [20.04%]
        5706371393      L1-dcache-loads                                              [20.00%]
          11733777      L1-dcache-load-misses     #    0.21% of all L1-dcache hits   [19.97%]
        2401264190      L1-dcache-stores                                             [20.04%]
         131964213      L1-dcache-store-misses                                       [20.03%]
           2007640      L1-dcache-prefetches                                         [20.04%]
       21894150686      cycles                    [20.02%]
       24582770606      instructions              #    1.12  insns per cycle         [30.06%]
        3534308182      branches                                                     [30.01%]
              3767      faults
                 6      migrations
       7.331092823 seconds time elapsed
Column sum n=10k
 perf stat -B -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,cycles,instructions,branches,faults,migrations ./answer1.py 2>&1 | sed 's/^/    /g'
column sum: 377.059 us
 Performance counter stats for './answer1.py':
          26673628      cache-references                                             [30.02%]
           1409989      cache-misses              #    5.286 % of all cache refs     [20.07%]
        5676222625      L1-dcache-loads                                              [20.06%]
          11050999      L1-dcache-load-misses     #    0.19% of all L1-dcache hits   [19.99%]
        2405281776      L1-dcache-stores                                             [20.01%]
         126425747      L1-dcache-store-misses                                       [20.02%]
           2128076      L1-dcache-prefetches                                         [20.04%]
       21876671763      cycles                    [20.00%]
       24607897857      instructions              #    1.12  insns per cycle         [30.00%]
        3536753654      branches                                                     [29.98%]
              3763      faults
                 9      migrations
       7.327833360 seconds time elapsed
EDIT2
I think I have understood some aspects, but I don't think the question has been answered yet. At the moment I believe this summation example doesn't reveal anything about CPU caches at all. In order to eliminate uncertainty introduced by numpy/python, I tried to use perf on the summation done in C, and the results are in an answer below.