I am scraping files from a website and want to rename those files based on existing directory names on my computer (or, if simpler, a list containing those directory names). This is to maintain a consistent naming convention.
For example, I already have directories named:
Barone Capital Management, Gabagool Alternative Investments, Aprile Asset Management, Webistics Investments
The scraped data consists of some exact matches, some "fuzzy" matches, and some new values:
Barone, Gabagool LLC, Aprile Asset Management, New Name, Webistics Investments
I want the scraped files to adopt the naming convention of the existing directories. For example, Barone would become Barone Capital Management, and Gabagool LLC would be renamed Gabagool Alternative Investments.
So what's the best way to accomplish this? I looked at fuzzywuzzy and some other libraries, but I'm not sure which is the right path.
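For reference, one thing I experimented with is `difflib.get_close_matches` from the standard library, which seems to do roughly what I want (the names are hard-coded here just for the example, and the 0.3 cutoff is just a guess on my part):

```python
import difflib

# Existing directory names (hard-coded for the example)
existing = [
    'Barone Capital Management',
    'Gabagool Alternative Investments',
    'Aprile Asset Management',
    'Webistics Investments',
]

# cutoff is a 0-1 similarity threshold; 0.3 is a guess that would need tuning
print(difflib.get_close_matches('Gabagool LLC', existing, n=1, cutoff=0.3))
# prints ['Gabagool Alternative Investments']
```

But I don't know how difflib's character-based ratio compares to fuzzywuzzy for names like these.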
This is my existing code, which just names each file based on its anchor text:
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.error

url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

for anchor in table.find_all('a'):
    fund_name = anchor.text
    letter_link = anchor['href']
    try:
        urllib.request.urlretrieve(letter_link, '2018 Q4 ' + fund_name + '.pdf')
    except urllib.error.URLError:
        continue  # skip links that fail to download
Note that the directories are already created, and look something like this:
- /Users/user/Dropbox/Letters/Barone Capital Management
- /Users/user/Dropbox/Letters/Aprile Asset Management
- /Users/user/Dropbox/Letters/Webistics Investments
- /Users/user/Dropbox/Letters/Gabagool Alternative Investments
- /Users/user/Dropbox/Letters/Ro Capital
- /Users/user/Dropbox/Letters/Vitoon Capital
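In case it helps, this is the kind of helper I had in mind, using only the standard library (the 0.3 cutoff is a placeholder I'd expect to tune, and the base path is the one from the listing above):

```python
import difflib
from pathlib import Path

letters_root = Path('/Users/user/Dropbox/Letters')  # base path from the listing above

def match_fund(fund_name, directory_names, cutoff=0.3):
    """Return the closest existing directory name, or the scraped name
    unchanged if nothing clears the similarity cutoff (i.e. a new fund)."""
    hits = difflib.get_close_matches(fund_name, directory_names, n=1, cutoff=cutoff)
    return hits[0] if hits else fund_name

# Build the list of choices from the directories that already exist:
# directory_names = [p.name for p in letters_root.iterdir() if p.is_dir()]
```

Then in the scraping loop the filename would become `'2018 Q4 ' + match_fund(fund_name, directory_names) + '.pdf'`. Is this a reasonable approach, or is fuzzywuzzy's token-based matching a better fit for names like these?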