How to efficiently compute a rolling unique count in a pandas time series?

Question

I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people visiting the building in the last 365 days (i.e. a rolling unique count with a window of 365 days).

pandas does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there are a large number of unique visitors and/or a large window. (The actual data is larger than this example.)

Is there a better way to calculate than what I've done below? I'm not sure why the fast method I made, windowed_nunique (under "Speed test 3"), is off by 1.

Thanks for any help!

Initialization

In [1]:

# Import libraries.
import pandas as pd
import numba
import numpy as np

In [2]:

# Create data of people visiting a building.

np.random.seed(seed=0)
dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')
window = 365 # days
num_pids = 100
probs = np.linspace(start=0.001, stop=0.1, num=num_pids)

df = pd\
    .DataFrame(
        data=[(date, pid)
              for (pid, prob) in zip(range(num_pids), probs)
              for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],
        columns=['Date', 'PersonId'])\
    .sort_values(by='Date')\
    .reset_index(drop=True)

print("Created data of people visiting a building:")
df.head() # 9181 rows × 2 columns

Out[2]:

Created data of people visiting a building:

|   | Date       | PersonId | 
|---|------------|----------| 
| 0 | 2010-01-01 | 76       | 
| 1 | 2010-01-01 | 63       | 
| 2 | 2010-01-01 | 89       | 
| 3 | 2010-01-01 | 81       | 
| 4 | 2010-01-01 | 7        |

Speed reference

In [3]:

%%timeit
# This counts the number of people visiting the building, not the number of unique people.
# Provided as a speed reference.
df.rolling(window='{:d}D'.format(window), on='Date').count()

3.32 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Speed test 1

In [4]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())

2.42 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]:

# Save results as a reference to check calculation accuracy.
ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values

Speed test 2

In [6]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def nunique(arr):
    return len(set(arr))

In [7]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)

430 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]:

# Check accuracy of results.
test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values
assert all(ref == test)

Speed test 3

In [9]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
        window (int): Width of window in units of difference of `dates`.

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if `len(dates) != len(pids)`

    Notes:
        * May be off by 1 compared to `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert dates.shape == pids.shape

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids)
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # If person count went from 0 to 1, increment unique person count.
        date = dates[idx]
        pid = pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        while (date - date_min) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts

In [10]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)

In [11]:

%%timeit
windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

107 µs ± 63.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:

# Check accuracy of results.
test = windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
# Note: Method may be off by 1.
assert all(np.isclose(ref, np.asarray(test), atol=1))

In [13]:

# Show where the calculation doesn't match.
print("Where reference ('ref') calculation of number of unique people doesn't match 'test':")
df['ref'] = ref
df['test'] = test
df.loc[df['ref'] != df['test']].head() # 9044 rows × 5 columns

Out[13]:

Where reference ('ref') calculation of number of unique people doesn't match 'test':

|    | Date       | PersonId | DateEpoch | ref  | test | 
|----|------------|----------|-----------|------|------| 
| 78 | 2010-01-19 | 99       | 14628     | 56.0 | 55   | 
| 79 | 2010-01-19 | 96       | 14628     | 56.0 | 55   | 
| 80 | 2010-01-19 | 88       | 14628     | 56.0 | 55   | 
| 81 | 2010-01-20 | 94       | 14629     | 56.0 | 55   | 
| 82 | 2010-01-20 | 48       | 14629     | 57.0 | 56   |

Sorry if this is a stupid comment, but wouldnt a 365 rolling count of unique Ids be as simple as: `df.rolling(365)['PersonId'].apply(lambda x: len(set(x)))` ??? — Woody Pride, Sep 28 '17 at 14:10
@WoodyPride Thanks, that's what I did under "Speed test 2" but with a just-in-time compiler (see function `nunique`). The calculation is correct but inefficient since `set` operates on every element in the window every time the window calculation executes. It's more efficient to keep a running tally of each element, as in "Speed test 3" (more efficient on the example data by ~4000x comparing "Speed test 2" and "Speed test 3"). However, my implementation `windowed_nunique` is off by 1, and I wonder if anyone can help find the problem. — Samuel Harrold, Sep 28 '17 at 14:34
Awesome work! I try applying your speed test 3, but keep getting the following error below, any idea what gives? TypingError: Failed in nopython mode pipeline (step: nopython frontend) non-precise type array(pyobject, 1d, C) During: typing of argument at (24) File "", line 24: def windowed_nunique(dates, pids, window): # Check arguments. assert dates.shape == pids.shape ^ — Rajko Radovanovic, Jan 02 '22 at 12:14
Here, `personId` is a number. But if we try to apply to a non-numeric type we get `DataError: No numeric types to aggregate`. Any way to make it work? — David Davó, Dec 12 '22 at 11:34

score 5 · Accepted Answer · answered Oct 03 '17 at 19:20

I had 2 errors in the fast method windowed_nunique, now corrected in windowed_nunique_corrected below:

The size of the array for memoizing the number of unique counts for each person ID within the window, pid_cts, was too small.
Because the leading and trailing edges of the window include integer days, date_min should be updated when (date - date_min + 1) > window.

Related links:

Source Jupyter Notebook updated with solution: https://gist.github.com/stharrold/17589e6809d249942debe3a5c43d38cc

In [14]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # Lookup date, person.
        date = dates[idx]
        pid = pids[idx]

        # If person count went from 0 to 1, increment unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        # Note: If window=3, it includes day0,day1,day2.
        while (date - date_min + 1) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts

In [15]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)

In [16]:

%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]:

# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)

score 1 · Answer 2 · answered Sep 28 '17 at 16:25

1

Very close to your time in seed test two, but as a one liner, re sampled over a year.

 df.resample('AS',on='Date')['PersonId'].expanding(0).apply(lambda x: np.unique(x).shape[0])

Time results

1 loop, best of 3: 483 ms per loop

answered Sep 28 '17 at 16:25

DJK

8,924
4
24
40

That's close to "Speed test 2" for speed, but `np.unique` is operating on every element in the window. It's more efficient to keep a running tally of each element as in "Speed test 3". (See my comment with Woody Pride.) My implementation of a running tally, `windowed_nunique`, is off by 1, though. Any other thoughts? Thanks – Samuel Harrold Sep 28 '17 at 16:33

score -1 · Answer 3 · answered Sep 28 '17 at 14:47

-1

If you only want the number of unique person that went in the building in the last 365 days you could first limit your dataset at the last 365 days with .loc :

df = df.loc[df['date'] > '2016-09-28',:]

and the with a groupby you get as many rows as unique people that came in and if you do it by count you also get the nubmer of times they came in :

df = df.groupby('PersonID').count()

that seem to work for your question, but maybe i got it wrong. have a good day

answered Sep 28 '17 at 14:47

Forrains_459

9
4

Thanks, but I'm looking for an efficient rolling unique count. The output must have the same `len` as the input (from the example, `len(df) == len(ref) == 9181`) and be faster than "Speed test 2". – Samuel Harrold Sep 28 '17 at 14:59
@SamuelHarrold, what do you mean rolling unique count? what period are you rolling over in a year? – DJK Sep 28 '17 at 16:11
@djk47463 Example rolling unique count (similar to function `nunique` defined under "Speed test 2" above): `df.rolling(window='365D', on='Date').apply(lambda arr: len(set(arr)))`. The challenge is to make this more efficient (compare "Speed test 2" and "Speed test 3"). I almost succeeded but my solution `windowed_nunique` is off by 1, and I wondered if someone could find my error. – Samuel Harrold Sep 28 '17 at 16:20
@djk47463 `windowed_nunique` becomes off by 1 at record 78 (from `Out[13]` in the example), which has a corresponding date of '2010-01-19', which is before any extra leap day. – Samuel Harrold Sep 28 '17 at 16:26