import numpy as np
import pandas as pd
results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
'President':[1,0,0,0,1,1],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
reference = pd.DataFrame({'Position':['Contractor','President'],
'Item(s)':[(1,), (1,2)]})
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
for index, row in reference.iterrows()
for item in row['Item(s)']], columns=['Item', 'Position'])
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
melted = melted.loc[melted['value']==1]
melted = pd.melt(melted, id_vars=['Position'],
value_vars=['Item 1','Item 2'], var_name='Item')
merged = pd.merge(longref, melted, how='left')
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
print(result)
yields
Position Overall%
0 Contractor 100.0
1 President 80.0
Explanation: There is an article by Hadley Wickham
(PDF) propounding the advantages
of making data "tidy". The main tenet is that each row should
represent an "observation" and each column represent some factor or variable.
It frequently turns out that the tools you will need to express your calculation
will fall into place quite naturally once the data is tidy.
The difficulty of this problem largely comes from the data not being tidy.
Consider results:
In [405]: results
Out[405]:
Contractor Item 1 Item 2 President
0 1 1.0 1 1
1 1 1.0 0 0
2 0 0.0 0 0
3 0 0.0 1 0
4 0 1.0 0 1
5 1 NaN 1 1
Instead of having separate columns for Contractor and President, it would be nicer to have one column called Position, since Position is the variable, and each observation or row can have one value for Position -- either a Contractor or a President.
Similarly, Item 1 and Item 2 should be coalesced into a single column Item:
In [416]: melted
Out[416]:
Position Item value
0 Contractor Item 1 1.0
1 Contractor Item 1 1.0
2 Contractor Item 1 NaN
3 President Item 1 1.0
4 President Item 1 1.0
5 President Item 1 NaN
6 Contractor Item 2 1.0
7 Contractor Item 2 0.0
8 Contractor Item 2 1.0
9 President Item 2 1.0
10 President Item 2 0.0
11 President Item 2 1.0
melted contains the same information as results, but in a tidy format. The value column contains the values in results[['Item 1', 'Item 2']]. Each row corresponds to an "observation" where either results['Contractor'] or result['President']` equals 1, since the calculation's logic only requires these values.
Similarly, instead of
In [407]: reference
Out[407]:
Item(s) Position
0 (1,) Contractor
1 (1, 2) President
it would be tidier to have a DataFrame whose columns are Item and Position:
In [408]: longref
Out[408]:
Item Position
0 Item 1 Contractor
1 Item 1 President
2 Item 2 President
Once you have the tidy version of your data in the form of melted and longref,
calculating the desired result is fairly straight-forward:
merged = pd.merge(longref, melted, how='left')
# Item Position value
# 0 Item 1 Contractor 1.0
# 1 Item 1 Contractor 1.0
# 2 Item 1 Contractor NaN
# 3 Item 1 President 1.0
# 4 Item 1 President 1.0
# 5 Item 1 President NaN
# 6 Item 2 President 1.0
# 7 Item 2 President 0.0
# 8 Item 2 President 1.0
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum() / grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
How to tidy-up reference to make longref:
Just iterate through the rows of reference and for each row iterate through the tuple of items to build the new DataFrame, longref:
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
for index, row in reference.iterrows()
for item in row['Item(s)']], columns=['Item', 'Position'])
How to tidy-up results to make melted:
It can be done with two calls to pd.melt. pd.melt converts "wide" format to "long" format DataFrames. It can coalesce multiple columns into a single column. For example, to coalesce the Contractor and President columns into a single Position column you could use:
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
# we only care about rows where Contractor or President value was 1. So use .loc to select those rows.
melted = melted.loc[melted['value']==1]
# Item 1 Item 2 Position value
# 0 1.0 1 Contractor 1
# 1 1.0 0 Contractor 1
# 5 NaN 1 Contractor 1
# 6 1.0 1 President 1
# 10 1.0 0 President 1
# 11 NaN 1 President 1
and similarly, to coalesce the Item 1 and Item 2 columns into a single Item column, use:
melted = pd.melt(melted, id_vars=['Position'],
value_vars=['Item 1','Item 2'], var_name='Item')
# Position Item value
# 0 Contractor Item 1 1.0
# 1 Contractor Item 1 1.0
# 2 Contractor Item 1 NaN
# 3 President Item 1 1.0
# 4 President Item 1 1.0
# 5 President Item 1 NaN
# 6 Contractor Item 2 1.0
# 7 Contractor Item 2 0.0
# 8 Contractor Item 2 1.0
# 9 President Item 2 1.0
# 10 President Item 2 0.0
# 11 President Item 2 1.0