I have a pandas column of datetime objects, and I want to extract a list of the unique years from it. What would be the most efficient way to do this?
edit: I wasn't sure how pandas would treat the order of the operations.
 
    
You can use pd.Series.dt.year followed by pd.Series.unique.
Timings below on Python 3.6 / Pandas 0.19, using the data from @Engineero's answer:
%timeit df['time'].dt.year.unique().tolist()                  # 739 µs per loop
%timeit df['time'].apply(lambda x: x.year).unique().tolist()  # 5.9 ms per loop
%timeit list(set(df['time'].dt.year.values))                  # 823 µs per loop
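A minimal, self-contained sketch of the recommended approach (the column name 'time' is an assumption matching the timings above):

```python
import pandas as pd

# Small sample frame with a datetime column.
df = pd.DataFrame({'time': pd.to_datetime(['2015-03-01', '2015-07-15',
                                           '2016-01-02', '2017-11-30'])})

# .dt.year extracts the year as an integer Series; .unique() returns
# the distinct values in order of first appearance.
years = df['time'].dt.year.unique().tolist()
print(years)  # [2015, 2016, 2017]
```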
 
    
If speed is your main concern, then using @Engineero's data:
list(set(df['time'].dt.year.values))
 
    
    I made an example that gives us 1000 rows covering 165 years:
import pandas as pd
from datetime import datetime, timedelta

base = datetime.today()
date_list = [base + x * timedelta(days=60) for x in range(1000)]
df = pd.DataFrame(data={'times': date_list})
Trying a bunch of different options and listing them in order from fastest to slowest(ish):
Using dt.year.unique() directly on the times column:
df.times.dt.year.unique()
398 µs ± 5.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using set with dt.year.values we get about the same performance as dt.year.unique(), but with greater variability:
set(df.times.dt.year.values)
422 µs ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using apply:
df.times.apply(lambda x: x.year).unique()
5.51 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using set and a list comprehension:
years = set([time.year for time in df.times])
3.48 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Interestingly, using the almost syntactically identical set and a generator, we get something closer to df.apply:
years = set((time.year for time in df.times))
5.85 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And using np.unique with a list comprehension:
np.unique([time.year for time in df.times])
6.09 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It looks like, at least on my machine, df.times.dt.year.unique() is the fastest so far. If I think of others I will add them to the list.
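As a sanity check, here is a self-contained sketch (same 1000-row / 60-day-step setup as above) confirming that the faster options all produce the same set of years:

```python
import pandas as pd
from datetime import datetime, timedelta

# Rebuild the example frame: 1000 rows spaced 60 days apart.
base = datetime.today()
date_list = [base + x * timedelta(days=60) for x in range(1000)]
df = pd.DataFrame(data={'times': date_list})

# Three of the approaches benchmarked above.
a = set(df.times.dt.year.unique())
b = set(df.times.dt.year.values)
c = {time.year for time in df.times}

assert a == b == c
print(len(a))  # number of distinct years, roughly 165
```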
