Here's one vectorized solution -
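import numpy as np
m,n = a.shape
# Row i reads from column (j - i) mod n, i.e. row i is rolled right by i positions
idx = np.mod((n-1)*np.arange(m)[:,None] + np.arange(n), n)
out = a[np.arange(m)[:,None], idx]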
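To see why the index formula works: since `(n-1)*i + j` equals `n*i + (j - i)`, taking it modulo `n` gives `(j - i) mod n`, so row `i` ends up rolled to the right by `i` positions. The following is a minimal sketch that checks this against a plain `np.roll` loop; the input array and the names `rows`, `cols`, `idx_alt` and `expected` are illustrative assumptions, not part of the original code.

```python
import numpy as np

# Illustrative data, not taken from the question
a = np.arange(20).reshape(4, 5)
m, n = a.shape

rows = np.arange(m)[:, None]   # row numbers as a column vector, shape (m, 1)
cols = np.arange(n)            # column numbers, shape (n,)

idx = np.mod((n - 1) * rows + cols, n)   # formula used in the answer
idx_alt = (cols - rows) % n              # equivalent form: (j - i) mod n

out = a[rows, idx]

# Reference result: roll row i to the right by i positions in a Python loop
expected = np.array([np.roll(a[i], i) for i in range(m)])

print(np.array_equal(idx, idx_alt))
print(np.array_equal(out, expected))
```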
Sample input, output -
In [256]: a
Out[256]:
array([[73, 55, 79, 52, 15],
       [45, 11, 19, 93, 12],
       [78, 50, 30, 88, 53],
       [98, 13, 58, 34, 35]])
In [257]: out
Out[257]:
array([[73, 55, 79, 52, 15],
       [12, 45, 11, 19, 93],
       [88, 53, 78, 50, 30],
       [58, 34, 35, 98, 13]])
Since you mentioned that you call this rolling routine multiple times, create the indexing array idx once and reuse it for every array of the same shape.
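A rough sketch of that reuse pattern, assuming all inputs share the same shape. The helper name `make_roll_index` and the random test arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def make_roll_index(m, n):
    """Hypothetical helper: build the column-index array for (m, n) inputs once."""
    return np.mod((n - 1) * np.arange(m)[:, None] + np.arange(n), n)

m, n = 4, 5
rows = np.arange(m)[:, None]
idx = make_roll_index(m, n)   # computed a single time

# Reuse the same idx for several arrays of shape (m, n) (illustrative data)
rng = np.random.default_rng(0)
arrays = [rng.integers(11, 99, size=(m, n)) for _ in range(3)]
rolled = [arr[rows, idx] for arr in arrays]
```

Note that `idx` depends only on the shape, so it has to be rebuilt if `m` or `n` changes.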
Further improvement
For repeated use, you are better off precomputing the full linear indices into the flattened array (row offset i*n plus the column index) and then using np.take to extract the rolled elements, like so -
full_idx = idx + n*np.arange(m)[:,None]
out = np.take(a,full_idx)
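As a sanity check, the sketch below confirms that the linear-index version returns the same result as the two-array indexing. It relies on `np.take` without an `axis` argument indexing into the flattened array; the input data and the names `out_fancy`, `out_take` and `out_ravel` are illustrative assumptions.

```python
import numpy as np

# Illustrative data, not taken from the question
a = np.arange(20).reshape(4, 5)
m, n = a.shape
rows = np.arange(m)[:, None]

idx = np.mod((n - 1) * rows + np.arange(n), n)
full_idx = idx + n * rows     # linear index = row * n + column

out_fancy = a[rows, idx]          # first approach
out_take = np.take(a, full_idx)   # np.take on the flattened array
out_ravel = a.ravel()[full_idx]   # same lookup written out explicitly

print(np.array_equal(out_fancy, out_take))
print(np.array_equal(out_take, out_ravel))
```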
Let's see how much of an improvement that gives -
In [330]: a = np.random.randint(11,99,(600,600))
In [331]: m,n = a.shape
...: idx = np.mod((n-1)*np.arange(m)[:,None] + np.arange(n), n)
...:
In [332]: full_idx = idx + n*np.arange(m)[:,None]
In [333]: %timeit a[np.arange(m)[:,None], idx] # Approach #1
1000 loops, best of 3: 1.42 ms per loop
In [334]: %timeit np.take(a,full_idx) # Improvement
1000 loops, best of 3: 486 µs per loop
That's roughly a 3x speedup (1.42 ms down to 486 µs)!