I have two XLS sheets that I've read into two dataframes. I would like to find the intersection between the two, where the contents of one df column is contained in a specific column in the other df. And from that intersection, I'd like to create a new df that contains the intersection AND the values from a particular column in one of those dfs.
I cannot figure out what function in Pandas to get that intersection to work, and to include the data from the additional column.
This is what I tried to get the intersection:
import pandas as pd
test_file = '/Users/Bill/Box Sync/Documents/Jupyter/test_data.xlsx'
# read in the xls sheets
prj_df = pd.read_excel(test_file, sheet_name = 'Sheet1')
opp_df = pd.read_excel(test_file, sheet_name = 'Sheet2')
# intersect the sheets
report_df = opp_df[opp_df['opp'].isin(prj_df['opp numbers'])]
...but as you'll see below, report_df only includes the rows where the contents of "opp" and "opp numbers" match exactly. I need something that is more like the Python "in" keyword when looking for a string in a string.
Here is some test data:
prj_df:
   Project            opp numbers          URL
0  Project 1 title    ab-cdefg;12-34567    http://1.2.3.4/abc
1  project 2 title    MJ-98733             http://1.2.3.4/UJUJUJUJ
2  Project 3 title    No code              http://1.2.3.4/99a
3  Project A title    01-PKL23             http://1.2.3.4/azzz
4  project B title    44-0098876           http://test.133
5  project c title    342-0981;98-09913    http://2.3.4.5/iiihh
6  project 99 title   25-AAAJJ12;99-49494  http://1.2.3.4/ghhi
7  project 303 title  77-AUDIJJ            http://1.2.3.4/def
opp_df:
0  opp        product     price
1  98-09913   widget1     123.55
2  66-99999   widget2     44.99
3  AB-DEFFF1  widget3     2345.5
4  01-PKL23   item a      9990
5  77-AUDIJJ  item b      84.56
6  KK-KIEW89  product 99  22.99
7  WE-24422   name 123    1.99
8  12-34567   stuff 1     395.5
report_df:
   opp        product     price
3  01-PKL23   item a      9990.00
4  77-AUDIJJ  item b      84.56
....but that data in report_df isn't complete: I also need row 1 and row 8 to show up because the 'opp' values from those rows are also contained in prj_df.
Now, in addition to getting the intersection correct, what I'd like to do is to add the values from the URL and Project columns of prj_df onto the end of report_df. Since I've been using the openpyxl library, my brain is only tuned to iterating through the sheet/dataframe and searching prj_df for the values instead of using the isin intersection I already did.
Is there a way to a) get that intersection to work, and b) pull in the Project and URL values into that intersected dataframe?
 
    