Hei,
I want to combine data frames (incidentally with a multi-index), into a larger data frame. Sometimes the data needs to be appended (new rows or columns are being added), sometime existing data needs to be updated. Somehow, I cant find a way to do both. It is either append (using .append()) or some sort of updating (.merge(), .update()) I tried to search for this and read the documentation, but cant really figure it out.
This is a test code
import pandas as pd
import numpy as np
zones = ['A', 'B', 'C']
# input data frames
dates0 = pd.date_range('20180101', '20180102', freq='D')
dates1 = pd.date_range('20180103', '20180104', freq='D')
idx00 = pd.MultiIndex.from_product(iterables=[dates0, [zones[0]]], names=    ['UTC', 'zone'])
df00 = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[1, 2], [3, 4]])
idx01 = pd.MultiIndex.from_product(iterables=[dates1, [zones[0]]], names=['UTC', 'zone'])
df01 = pd.DataFrame(index=idx01, columns=['a', 'b'], data=[[5, 6], [7, 8]])
idx10 = pd.MultiIndex.from_product(iterables=[dates0, [zones[1]]], names=['UTC', 'zone'])
df10 = pd.DataFrame(index=idx10, columns=['b', 'c'], data=np.random.rand(2, 2))
idx11 = pd.MultiIndex.from_product(iterables=[dates1, [zones[1]]], names=['UTC', 'zone'])
df11 = pd.DataFrame(index=idx11, columns=['b', 'c'], data=np.random.rand(2, 2))
# append - works, but only if the data is not yet there
df_append = df00.append(df01)
df_append = df_append.append(df10)
df_append = df_append.append(df11)
df_append.sort_index(inplace=True)
df_append
# append adds a second data point, where there should only be one
df00b = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[10, 20], [30, 40]])
df_append2 = df_append.append(df00b)
df_append2.sort_index(inplace=True)
df_append2.loc[('2018-01-01', 'A'), :]
# merge - does not what I want, changes column names
df_merge = df00.merge(df01, how='outer', left_index=True, right_index=True)
df_merge
# update - does not what I want, does not add new columns
df_update = df00
df_update.update(df01)
df_update
# join - gives an error, as no suffix defined and join wants to create a new column
df_join = df00
df00.join(df01)
** My issues ** .append() only works if the area (index+column) in the right data frame is not yet in the left data frame. Otherwise it just adds a second data point to the same index/column
.merge() changes the column names, if they exist in both the left and right data frame. But I want the column names to stay the same, and the data to be updated if it already exists
.update() does not append data, if the column / row does not yet exist
.join() gives an error..
What I need is a "update + append if not exist". Any idea how to go about it?
Thanks in advance, Theo
pS: Output from above
df_append
                   a         b         c
UTC        zone                         
2018-01-01 A     1.0  2.000000       NaN
           B     NaN  0.100551  0.271616
2018-01-02 A     3.0  4.000000       NaN
           B     NaN  0.489322  0.606215
2018-01-03 A     5.0  6.000000       NaN
           B     NaN  0.245451  0.242021
2018-01-04 A     7.0  8.000000       NaN
           B     NaN  0.047900  0.642140
df_append2.loc[('2018-01-01', 'A'), :]
                    a     b   c
UTC        zone                
2018-01-01 A      1.0   2.0 NaN
           A     10.0  20.0 NaN
df_merge
Out[4]: 
                 a_x  b_x  a_y  b_y
UTC        zone    
2018-01-01 A     1.0  2.0  NaN  NaN
2018-01-02 A     3.0  4.0  NaN  NaN
2018-01-03 A     NaN  NaN  5.0  6.0
2018-01-04 A     NaN  NaN  7.0  8.0
 
    
