I have been working through some pandas problems to get better at the fundamentals, and there is this one issue I consistently run into while trying to solve problems. Whenever I perform a groupby operation on a DataFrame followed by a series of other operations to get what I need, I have lost important information (other descriptive columns) about my output, and going back to retrieve the same is a new, typically more complicated problem to solve.
Take a look at this problem for instance:
Given a phone log table that has information about callers' call history, find out the callers whose first and last calls were to the same person on a given day. Output the caller ID, recipient ID, and the date called.
Input: caller_history:
caller_id:int64recipient_id:int64date_called:datetime64[ns]
Here is the code that I wrote for it:
import pandas as pd
caller_history['date_called'] = pd.to_datetime(caller_history['date_called'], format='%Y-%m-%d %H:%M:%S')
caller_history = caller_history.sort_values(by=['caller_id', 'date_called'])
grouping = caller_history.groupby(['caller_id', caller_history['date_called'].apply(lambda x: x.day)])
grouping.apply(lambda x: x['recipient_id'].iloc[0] == x['recipient_id'].iloc[-1]).reset_index()
My thought process was:
- Convert the
date_calledcolumn to a format python can understand - Sort
caller_history, first by thecaller_id(bunching up together all the calls by a particular user) and then by thedate_called - Create groups within
caller_history, first bycaller_idto isolate users and then by the day (which is what the call toapplyis doing). - Now finally check if, within the groups, the
recipient_idmatches on the first and last calls of every day.
This gives me 'fundamentally' the correct solution. But now, I have no way of retrieving the recipient_id and date_called. The call to reset_index() is a last-ditch effort to recover whatever possible, in this case, I can recover the caller_id and the day portion of date_called.