I'm trying to understand the execution complexity of the iloc function in pandas.
I read the following in a Stack Exchange thread (Pandas DataFrame search is linear time or constant time?):
"accessing single row by index (index is sorted and unique) should have runtime O(m) where m << n_rows"
which suggests that iloc runs in O(m) time. But what is m (linear, logarithmic, constant, ...)?
Some experiments I ran:
>>> import pandas as pd
>>> a = pd.DataFrame([[1,3,4],[2,2,3],[1,4,5],[2,3,4]], columns=['a','b','c'])
>>> a = a.set_index('a').sort_index()
>>> a
   b  c
a
1  3  4
1  4  5
2  2  3
2  3  4
>>> a.iloc[[0,1,2,3]]
   b  c
a
1  3  4
1  4  5
2  2  3
2  3  4
So iloc clearly works with positional offsets and not with the integer-based index (column a). Even if we delete a few rows at the top, the offset-based lookup still works correctly (see also the loc vs. iloc sketch below):
>>> a.drop([1]).iloc[[0,1]]
   b  c
a
2  2  3
2  3  4
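For completeness, here is a small sketch contrasting label-based loc with positional iloc on the dropped frame (b is just a throwaway name for the copy):

# label-based vs. position-based lookup
b = a.drop([1])
print(b.loc[2])    # selects every row whose index *label* is 2 (both remaining rows)
print(b.iloc[0])   # selects the first *physical* row, whatever its label is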
So why doesn't iloc's offset-based lookup run in time comparable to a NumPy array lookup, given that each column is simply a NumPy array that can be accessed in constant time (a few operations)? And what is its complexity?
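One way to quantify the per-call overhead is to time a single scalar access by position; the snippet below is only a rough sketch (the array size and repeat count are arbitrary, and the numbers will vary by machine):

import timeit
import numpy as np
import pandas as pd

arr = np.ones((1000000, 2), dtype=np.uint32)
df = pd.DataFrame(arr)

# one positional scalar read, repeated many times
print(timeit.timeit(lambda: arr[123, 1], number=100000))      # raw NumPy indexing
print(timeit.timeit(lambda: df.iloc[123, 1], number=100000))  # pandas iloc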
UPDATE:
I compared the efficiency of pandas vs. NumPy on a 10,000,000x2 matrix, timing a value increment per row on a DataFrame df and on an array arr, with and without a for loop:
# Initialization
import numpy as np
import pandas as pd

SIZE = 10000000
arr = np.ones((SIZE,2), dtype=np.uint32)
df = pd.DataFrame(arr)
# numpy, no for-loop
arr[range(SIZE),1] += 1
# pandas, no for-loop
df.iloc[range(SIZE),1] += 1
# numpy, for-loop
for i in range(SIZE):
    arr[i,1] += 1
# pandas, for-loop
for i in range(SIZE):
    df.iloc[i,1] += 1
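A minimal wall-clock harness to reproduce the comparison could look roughly like this (time.perf_counter is just one possible timer; df and SIZE come from the initialization above):

import time

start = time.perf_counter()
df.iloc[range(SIZE), 1] += 1   # swap in any of the four variants above
print(time.perf_counter() - start, "seconds")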
| Method | Execution time |
|---|---|
| numpy, no for-loop | 7 seconds |
| pandas, no for-loop | 24 seconds |
| numpy, with for-loop | 27 seconds |
| pandas, with for-loop | > 2 hours |