I am trying to scrape data from a website: https://www.oddsportal.com/american-football/usa/nfl/
This link displays the upcoming games.
Currently, I have tried using pandas to read the html data retrieved by Selenium, but the dataframe is a multilevelindex and I'm not sure how to unpack the dataframe into a more readable format.
import undetected_chromedriver.v2 as uc
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
driver = uc.Chrome()
driver.get("https://www.oddsportal.com/american-football/usa/nfl/")
table = driver.find_element(By.XPATH, '//*[@id="tournamentTable"]').get_attribute('outerHTML')
df = pd.read_html(table)[0]
Out:
   American football» USA»NFL  ...     
                Today, 17 Oct  ...  B's
0                       13:30  ...   11
1                       17:00  ...   11
2                       17:00  ...   11
3                       17:00  ...   11
4                       17:00  ...   11
5                       17:00  ...   11
6                       17:00  ...   11
7                       17:00  ...   11
8                       2Q 6'  ...   11
9                       2Q 4'  ...   11
10                      2Q 3'  ...   11
11           Tomorrow, 18 Oct  ...  B's
12                        NaN  ...  NaN
13                      00:20  ...   11
14                19 Oct 2021  ...  B's
15                        NaN  ...  NaN
16                      00:15  ...   11
17                15 Nov 2021  ...  B's
18                        NaN  ...  NaN
19                      00:20  ...    1
[20 rows x 7 columns]
I can unpack the HTML data using this:
all_matches = [i.text for i in driver.find_elements(By.XPATH, '//*[@id="tournamentTable"]/tbody/tr') if "American football" not in i.text]
Out:
["Today, 17 Oct 1 2 B's",
 '',
 '13:30 Jacksonville Jaguars - Miami Dolphins\n  23:20\n+110\n-128\n11',
 '17:00 Baltimore Ravens - Los Angeles Chargers 34:6\n-159\n+138\n11',
 '17:00 Carolina Panthers - Minnesota Vikings 28:34 OT\n+114\n-133\n11',
 '17:00 Chicago Bears - Green Bay Packers 14:24\n+195\n-233\n11',
 '17:00 Detroit Lions - Cincinnati Bengals 11:34\n+165\n-192\n11',
 '17:00 Indianapolis Colts - Houston Texans 31:3\n-556\n+428\n11',
 '17:00 New York Giants - Los Angeles Rams 11:38\n+286\n-345\n11',
 '17:00 Washington Football Team - Kansas City Chiefs 13:31\n+241\n-294\n11',
 "2Q 15'\nCleveland Browns - Arizona Cardinals 14:23\n-149\n+127\n11",
 "2Q 14'\nDenver Broncos - Las Vegas Raiders 7:10\n-213\n+181\n11",
 "2Q 10'\nNew England Patriots - Dallas Cowboys 14:10\n+163\n-189\n11",
 "Tomorrow, 18 Oct 1 2 B's",
 '',
 '00:20 Pittsburgh Steelers - Seattle Seahawks\n-217\n+185\n11',
 "19 Oct 2021 1 2 B's",
 '',
 '00:15 Tennessee Titans - Buffalo Bills\n+206\n-250\n11',
 '',
 '',
 '']
But this would require me to parse the data using dictionaries and would be messy to format correctly.
My expected output is a DF formatted:
    date         game_time          Team1              Team2         Score     1     2   
0   2021-10-17    13:30      Jacksonville Jaguars   Miami Dolphins  23:20    +110  -128
1   2021-10-17    17:00      Baltimore Ravens       Los Angeles     34:6     -159  +138
2   2021-10-17    17:00      Carolina Panthers      Minnesota Vik   28:34    +114  -133
3   2021-10-17    17:00      Chicago Bears          Green Bay Pack  14:24    +195  -233
4   2021-10-17    17:00      Detroit Lions          Cincinnati Ben  11:34    +165  -192
I'm hoping there is an easier way to pass the data to the pandas read_HTML function that drops the multilevelindex and gets me even close to the format. If I can get close I can format the rest, but I'd like to avoid using dictionaries, but I understand if that is not possible.
 
    