I'm working on a script that calculate time unavailability of some equipment I maintain.
I have as an input a csv file of our supervising tool (around 2M lines), containing the alarms for a month.
Problem is, it takes a huge time to process it!
Once converted as a Pandas DF, I have a df with these columns:
['date','alarm_key','pcause_id_hex','activity','model_name']
- Date : timestamp of the alarm
- Alarm_key : id of the alarm
- Pcause_id_hex : description of the alarm
- Activity : Generated / Cleared (Generated means the alarm started, and Cleared means it ended)
- Model_name : name of the equipment
The alarm_key is the same when it's generated and when it's cleared.
I want as an output a new dataframe which contains :
['station','name','start_date','end_date','duration']
- Station and Name : I get it from the model_name
- Start_date : date of the "Generated" alarm
- End_date : date of the "Cleared" alarm
- Duration : I have a function that calculates it
Below is my code:
df = pd.DataFrame([
        ['01/03/2022 00:01','5693392','CONNECTION KO','Generated','Equip1_Station1'],
        ['01/03/2022 00:02','5693334','CONNECTION KO','Cleared','Equip2_Station2'],
        ['01/03/2022 00:02','5693352','CONNECTION KO','Generated','Equip3_Station3'],
        ['01/03/2022 02:02','5693392','CONNECTION KO','Cleared','Equip1_Station1']
    ],
        columns=['date','alarm_key','pcause_id_hex','activity','model_name']
    )
list_alarms = [{}]
for i, row in df.iterrows():
    # Process row information
    row_info = {
        'date': row['date'],
        'alarm_key': row['alarm_key'],
        'pcause_id_hex': row['pcause_id_hex'],
        'activity': row['activity'],
        'model_name': row['model_name'],
    }
    # Check if it's a generated alarm
    if row_info['activity'] == 'Generated':
        alarm_info = {
            'station': '',
            'name': '',
            'start_date': '',
            'end_date': '',
            'duration': 0
        }
        # Fill name / station info
        if re.search('_', row_info['model_name']):
            alarm_info['name'] = row_info['model_name'].split('_', 1)[
                0]
            alarm_info['station'] = row_info['model_name'].split('_', 1)[
                1]
        else:
            alarm_info['name'] = ''
            alarm_info['station'] = row_info['model_name']
        # Fill start date
        alarm_info['start_date'] = row_info['date']
        start_datetime = datetime.strptime(
            row_info['date'], '%d/%m/%Y %H:%M')
        # Search for next iteration of the alarm key
        row_cleared = df.loc[(df['alarm_key'] == row_info['alarm_key']) & (
            df['date'] > row_info['date'])]
        if not row_cleared.empty:
            # If found, get end date
            end_date = row_cleared.iloc[0, 0]
            alarm_info['end_date'] = end_date
            end_datetime = datetime.strptime(
                end_date, '%d/%m/%Y %H:%M')
        else:
            # If not found, set end date to last day of the month
            end_datetime = start_datetime.replace(day=monthrange(
                start_datetime.year, start_datetime.month)[1])
            alarm_info['end_date'] = end_datetime.strftime(
                '%d/%m/%Y %H:%M')
        # Calculate duration of the alarm
        alarm_info['duration'] = _get_unavailability_time(
            start_datetime, end_datetime)
        list_alarms.append(alarm_info)
list_alarms.pop(0)
df_output = pd.DataFrame(list_alarms)
For the example set in the code above, I would like a result like this one :
    station    name        start_date          end_date    duration
0  Station1  Equip1  01/03/2022 00:01  01/03/2022 02:02    0.983333
1  Station3  Equip3  01/03/2022 00:02  31/03/2022 00:02  600.000000
I iterate through the dataframe, getting the row info. If it's a Generated one, I look for the next iteration of the alarm key with a Cleared activity. Once done, I store the end date in a list containing the information related to the alarm. (If an alarm isn't cleared, I set the end date as the last day of the month)
I don't know how to speed it up way more. (as you may see, I'm absolutely not an expert in this)
If you have some suggestions to improve the process, please let me know!
 
    