I'm trying to plot the average number of clicks on a given day over a one-week period. The data I'm working with can be found here if you want to take a look. Briefly, it is structured like so (the output of df.info()):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400165 entries, 0 to 400164
Data columns (total 9 columns):
uuid 400165 non-null object
timestamp 400165 non-null object
session_id 400165 non-null object
group 400165 non-null object
action 400165 non-null object
checkin 223824 non-null float64
page_id 400165 non-null object
n_results 136234 non-null float64
result_position 230482 non-null float64
dtypes: float64(3), object(6)
memory usage: 27.5+ MB
And the output of df.head() looks like:
uuid timestamp session_id group action checkin page_id n_results result_position
0 00000736167c507e8ec225bd9e71f9e5 20160301103842 78245c2c3fba013a b searchResultPage NaN cbeb66d1bc1f1bc2 5.0 NaN
1 00000c69fe345268935463abbfa5d5b3 20160307005226 c559c3be98dca8a4 a searchResultPage NaN eb658e8722aad674 10.0 NaN
2 00003bfdab715ee59077a3670331b787 20160302145305 760bf89817ce4b08 a checkin 30.0 f99a9fc1f7fdd21e NaN NaN
3 0000465cd7c35ad2bdeafec953e08c1a 20160302222701 fb905603d31b2071 a checkin 60.0 e5626962a6939a75 NaN 10.0
4 000050cbb4ef5b42b16c4d2cf69e6358 20160302095023 c2bf5e5172a892dc a checkin 30.0 787dd6a4c371cbf9 NaN NaN
There are 8 unique values of df['date'], and I want to make a new variable that equals (the count of all values of session_id on a given date) divided by (the count of unique values of session_id on that same date).
I'd love to do this in one step, but I think I've managed to get part of the way there with this code:
df['sessions_bydate'] = df.groupby('date')['session_id'].nunique().sum()
pd.crosstab(df['date'], df['sessions_bydate'])
# Which produces this output:
sessions_bydate 68110
date
2016-03-01 59324
2016-03-02 60322
2016-03-03 60643
2016-03-04 45745
2016-03-05 36242
2016-03-06 41199
2016-03-07 52557
2016-03-08 44129
Interestingly though, this looks like the count of all values of session_id by date, not the count of unique values, even though I'm using nunique(). Not sure what I'm doing wrong there.
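Poking at it, I suspect the trailing .sum() collapses the per-date unique counts into a single scalar (the 68110 above), which gets broadcast to every row of sessions_bydate; crosstabbing date against that constant would then just tabulate rows per date. If that's right, maybe what I want is something like this sketch (untested on the full data; it assumes a date column already derived from timestamp):

# Assumed setup: derive a calendar date from the raw timestamp string
# df['date'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H%M%S').dt.date

g = df.groupby('date')['session_id']
clicks_per_session = g.size() / g.nunique()  # rows per date / unique sessions per date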
Ultimately, my objective is to produce a line plot with lowess estimation that shows the average number of clicks per session per date. If I'm approaching this in a strange way, let me know; I'm very open to alternatives.
If helpful, here is what I'd do in R:
# Data transformation
CTR <- df %>%
  group_by(date) %>%
  summarise(click_through_rate = n() / n_distinct(session_id))

# Graphical representation
ggplot(CTR, aes(x = date, y = click_through_rate)) +
  geom_step() +
  ylim(c(0, NA)) +
  geom_smooth() +
  labs(title = "Daily click-through rate")
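In pandas terms, I imagine the equivalent is something like the sketch below (statsmodels' lowess standing in for geom_smooth(); I smooth over day indices because lowess wants numeric x):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Data transformation: rows per date / distinct sessions per date
g = df.groupby('date')['session_id']
ctr = g.size() / g.nunique()

# Graphical representation: step line plus a lowess curve
days = list(range(len(ctr)))
smooth = sm.nonparametric.lowess(ctr.values, days)  # returns (x, fitted) pairs

plt.step(days, ctr.values, where='mid', label='daily CTR')
plt.plot(smooth[:, 0], smooth[:, 1], label='lowess')
plt.xticks(days, ctr.index, rotation=45)
plt.ylim(bottom=0)
plt.title('Daily click-through rate')
plt.legend()
plt.tight_layout()
plt.show()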
EDIT:
At the suggestion of one of the comments, I include here a small snippet of data that should have the structure needed to replicate my problem (obtained using df1.head(18).to_dict()). I've manually made this snippet in Excel, and the variables all_values and unique_values are not in my actual data; I think I want to create them in order to compute what I actually want: all_values / unique_values. But if there's a way to skip straight to the finish line, I'm all ears.
df = pd.DataFrame({
'all_values': {0: 3, 1: 3, 2: 3, 3: 4, 4: 4, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 3, 11: 3, 12: 3, 13: 5, 14: 5, 15: 5, 16: 5, 17: 5},
'sessionid': {0: '000936ae06d62383', 1: '000936ae06d62383', 2: '0024f4f005f34c9d', 3: '002601319d1a02e1', 4: '0029420a5f8c7d90', 5: '002601319d1a02e1', 6: '002601319d1a02e1', 7: '002ad4609eedc350', 8: '002ad4609eedc350', 9: '002ad4609eedc350', 10: '002b97995ca9ce77', 11: '002b97995ca9ce77', 12: '002ad4609eedc350', 13: '003f93d4791463c9', 14: '003f93d4791463c9', 15: '00429c6a475ab250', 16: '002b97995ca9ce77', 17: '00429c6a475ab250'},
'unique_values': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 3, 14: 3, 15: 3, 16: 3, 17: 3},
'date': {0: '02/01/2016', 1: '02/01/2016', 2: '02/01/2016', 3: '03/01/2016', 4: '03/01/2016', 5: '03/01/2016', 6: '03/01/2016', 7: '04/01/2016', 8: '04/01/2016', 9: '04/01/2016', 10: '05/01/2016', 11: '05/01/2016', 12: '05/01/2016', 13: '06/01/2016', 14: '06/01/2016', 15: '06/01/2016', 16: '06/01/2016', 17: '06/01/2016'}
})
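For concreteness, this groupby/transform sketch reproduces all_values and unique_values on the snippet above, and the last line skips straight to the ratio (I haven't tried it on the full data):

grouped = df.groupby('date')['sessionid']

# Rebuild the two helper columns: rows per date and distinct sessions per date
df['all_values'] = grouped.transform('size')
df['unique_values'] = grouped.transform('nunique')

# Or skip straight to the per-date ratio
ctr = grouped.size() / grouped.nunique()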