There are many questions (1, 2, 3) dealing with counting values in a single series.
However, there are fewer questions looking at the best way to count combinations of two or more series. Solutions are presented (1, 2), but when and why one should use each is not discussed.
Below is some benchmarking for three potential methods. I have two specific questions:
- Why is groupermore efficient thancount? I expectedcountto be the more efficient, as it is implemented in C. The superior performance ofgrouperpersists even if number of columns is increased from 2 to 4.
- Why does value_counterunderperformgrouperby so much? Is this due to the cost of constructing a list, or series from list?
I understand the outputs are different, and this should also inform choice. For example, filtering by count is more efficient with contiguous numpy arrays versus a dictionary comprehension:
x, z = grouper(df), count(df)
%timeit x[x.values > 10]                        # 749µs
%timeit {k: v for k, v in z.items() if v > 10}  # 9.37ms
However, the focus of my question is on performance of building comparable results in a series versus dictionary. My C knowledge is limited, yet I would appreciate any answer which can point to the logic underlying these methods.
Benchmarking code
import pandas as pd
import numpy as np
from collections import Counter
np.random.seed(0)
m, n = 1000, 100000
df = pd.DataFrame({'A': np.random.randint(0, m, n),
                   'B': np.random.randint(0, m, n)})
def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()
def value_counter(df):
    return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)
def count(df):
    return Counter(zip(df.A.values, df.B.values))
x = value_counter(df).to_dict()
y = grouper(df).to_dict()
z = count(df)
assert (x == y) & (y == z), "Dictionary mismatch!"
for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:
    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})
    print(m, n)
    %timeit grouper(df)
    %timeit value_counter(df)
    %timeit count(df)
Benchmarking results
Run on python 3.6.2, pandas 0.20.3, numpy 1.13.1
Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4GB RAM.
Key: g = grouper, v = value_counter, c = count.
m           n        g        v       c
100     10000     2.91    18.30    8.41
1000    10000     4.10    27.20    6.98[1]
100    100000    17.90   130.00   84.50
1000   100000    43.90   309.00   93.50
1 This is not a typo.
 
    