I want to create a new dataframe using information from a given dataset. What I'm doing right now uses .iterrows(), and it's frustratingly slow.
The original dataset (data) has two columns: a user ID and a timestamp. I'm creating a new dataframe (session_data) with three columns: user ID, session_start, and session_duration. This is what I've got so far:
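import pandas as pd

# create empty dataframe
session_data = pd.DataFrame(columns=['ID', 'session_start', 'session_duration'])
for index, row in data.iterrows():
    # `in session_data.ID` tests the index labels, so check the column values instead
    if row['ID'] in session_data['ID'].values:
        # update the session duration
        pass  # placeholder so the loop parses; the update logic is still missing
    else:
        session = pd.DataFrame([[row['ID'], row['timestamp'], 0]], columns=['ID', 'session_start', 'session_duration'])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        session_data = pd.concat([session_data, session], ignore_index=True)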
I'm thinking that instead of using a dataframe for session_data, I should collect the results in some other object and build the dataframe from it after I've iterated through the data. However, as a noob, I'm really struggling with what data type to use instead of the session_data dataframe, and whether I need to be using .iterrows() at all.
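To make that idea concrete, here is a rough sketch of what I have in mind: collect the first and last timestamp per user in a plain dict, then build the dataframe once at the end. It assumes one session per user and that the duration is simply the last timestamp minus the first; the `sessions` dict and the timestamp format string are my own guesses, not something I've validated on the full dataset.

```python
import pandas as pd

# Sample rows mirroring the two columns described above (ID, timestamp).
data = pd.DataFrame({
    'ID': ['1234', '5678', '5678', '1234'],
    'timestamp': ['12/23/14 16:53', '12/23/14 16:50', '12/23/14 16:52', '12/23/14 17:20'],
})
# Parse once up front so timestamps compare as datetimes, not strings.
# The format string is an assumption based on the sample values.
data['timestamp'] = pd.to_datetime(data['timestamp'], format='%m/%d/%y %H:%M')

# Hypothetical accumulator: {ID: [first_seen, last_seen]}.
sessions = {}
for user_id, ts in zip(data['ID'], data['timestamp']):
    if user_id in sessions:
        first, last = sessions[user_id]
        sessions[user_id] = [min(first, ts), max(last, ts)]
    else:
        sessions[user_id] = [ts, ts]

# Build the dataframe once, after the loop, instead of growing it row by row.
session_data = pd.DataFrame(
    [(uid, first, last - first) for uid, (first, last) in sessions.items()],
    columns=['ID', 'session_start', 'session_duration'],
)
print(session_data)
```

Dict lookups are constant time and nothing gets copied on each iteration, so this should avoid the repeated dataframe rebuilds that I suspect are the slow part.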
Any help is appreciated! Please let me know if I need to add more information.
EDIT: Here's some more information to create a reproducible example.
To get data, I'm reading an external .csv with 100,000 rows. For convenience, here's a sample dataframe:
data = pd.DataFrame({'ID': ['1234', '5678', '5678', '1234'], 
                   'timestamp': ['12/23/14 16:53', '12/23/14 16:50', '12/23/14 16:52', '12/23/14 17:20']})
I've created session_data in the above snippet like so:
#create empty dataframe
session_data = pd.DataFrame(columns=['ID', 'session_start', 'session_duration'])
In the end, I want session_data to look something like this:
   user_id   session_start  session_duration
0     1234  12/23/14 16:53        27 minutes
1     5678  12/23/14 16:50         2 minutes
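For reference, the target above can probably be reached without any explicit loop. The sketch below is only my guess at a vectorized version: it assumes a single session per user, takes the earliest timestamp as session_start and the gap to the latest timestamp as session_duration, and keeps the column name 'ID' rather than user_id. The `bounds` variable is just a helper name for illustration.

```python
import pandas as pd

# Same sample rows as in the reproducible example.
data = pd.DataFrame({
    'ID': ['1234', '5678', '5678', '1234'],
    'timestamp': ['12/23/14 16:53', '12/23/14 16:50', '12/23/14 16:52', '12/23/14 17:20'],
})
# Assumed format, inferred from the sample values.
data['timestamp'] = pd.to_datetime(data['timestamp'], format='%m/%d/%y %H:%M')

# Earliest and latest timestamp per user, computed in one grouped pass.
bounds = data.groupby('ID')['timestamp'].agg(['min', 'max']).reset_index()

session_data = pd.DataFrame({
    'ID': bounds['ID'],
    'session_start': bounds['min'],
    # Subtracting datetimes gives a Timedelta column.
    'session_duration': bounds['max'] - bounds['min'],
})
print(session_data)
```

The duration comes out as a Timedelta rather than the text "27 minutes"; dividing by pd.Timedelta(minutes=1) would turn it into a plain number of minutes if that is easier to work with.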
 
    