Hello Stack Overflow community, I am having an issue where Pandas is not understanding my merge conditions. It works with the other 'keys', but breaks as soon as I include the "Date" column as a key. The "Date" columns are string objects in both dataframes (not timestamps).
In other words, I want all 4 'keys' to be identical before "left merging" the columns from df2 to df without losing any data in df. Also, when I open the csv files in Excel, the dates format look exactly the same (ex: 5/10/2015).
But, Pandas reads the date column in "csv_file1", [df], as "5-6-2015" :
In [1]: df['Date']
Out[1]: 
         Date 
0   2015-5-11    
1   2015-5-11    
2   2015-5-10   
3   2015-5-12  
Pandas reads the date column in "csv_file2", [df2], as "5/6/2015" :
In [2]: df2['Date']
Out[2]: 
         Date 
0   5/11/2015    
1   5/11/2015    
2   5/12/2015 
3   5/13/2015
4   5/17/2015 
The dtypes for both are "obj"; I do not understand why Pandas would read the format of the 'Date' columns differently.
Here is what the dataframes look like before the left-merge:
In [3]: df
Out[3]: 
         Date Hour    Make   Model  Gas Rating  Safety Rating
0   2015-5-11    1   Honda   Accord         9             8
1   2015-5-11    0   Toyota  Camry          9            10
2   2015-5-10   23   Chevy   Sonic          7             6
3   2015-5-12   13   Honda   Civic          8             7
In [4]: df2
Out[4]: 
         Date Hour    Make   Model  Mileage  Rating  Speed Rating
0   5/11/2015    1   Honda   Accord             10            7
1   5/11/2015    0   Toyota   Camry             10            7
2   5/12/2015   23   Honda    Civic              9            6
3   5/13/2015   23   Honda    Civic              9            6
4   5/17/2015    7   Chevy   Impala                
This is what happens when I try to left-merge:
In [5]: final = pd.merge(left=df, right=df2, how='left', on=['Date', 'Hour', 'Make', 'Model'])
In [6]: final
Out[6]: 
            Date Hour   Make   Model  Gas Rating  Safety Rating  Mileage Rating \
   0   2015-5-11    1  Honda   Accord         9             8           NaN   
   1   2015-5-11    0  Toyota  Camry          9            10           NaN     
   2   2015-5-10   23  Chevy   Sonic          7             6           NaN   
   3   2015-5-12   13  Honda   Civic          8             7           NaN   
     Speed Rating  
   0          NaN  
   1          NaN  
   2          NaN  
   3          NaN    
If I on try merging without the 'Date' key, The data transfers correctly for the most part, but this is an excess of data due to duplicates in both and will not be accurate because I only need data where all four keys ('Date', 'Hour', 'Make', 'Model') match and anything from df before left merging the data.
There will always be many more duplicates of Make/Model & Hour in df2 so I only want to left merge matches to df, no matter how many duplicate instances within df. I also do not wish to lose any data in df so any dates from df that is not found in df2, should remain.
If the 'Date' merge condition worked, this is the output I am trying to achieve:
In [7]: final
Out[7]: 
                Date Hour   Make   Model  Gas Rating  Safety Rating  Mileage Rating \
       0   5/11/2015    1  Honda   Accord         9             8            10   
       1   5/11/2015    0  Toyota  Camry          9            10            10     
       2   5/10/2015   23  Chevy   Sonic          7             6           NaN   
       3   5/12/2015   13  Honda   Civic          8             7             8   
          Speed Rating  
       0            7  
       1            7  
       2          NaN  
       3            7 
Does anyone have an idea why this is happening? I have tried even splicing the 'Date' column into 3 columns ('Month', 'Day', 'Year') and changing the dtype to int64, bool, obj and no success there either. So I assume it has something to do with the format.
Thanks ahead of time Stack Overflow community!
 
     
    