Are there anyone experienced with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realised share repurchases from these filings. In specific, I would like to get the following information: 1. Period; 2. Total Number of Shares Purchased; 3. Average Price Paid per Share; 4. Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs; 5. Maximum Number (or Approximate Dollar Value) of Shares that May Yet Be Purchased Under the Plans or Programs for each month from 2004 to 2014. I have in total 90,000+ forms to parse, so it won't be feasible to do it manually.
This information is usually reported under "Part 2 Item 5 Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" in 10-Ks and "Part 2 Item 2 Unregistered Sales of Equity Securities and Use of Proceeds".
Here is one example of the 10-Q filings that I need to parse: https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm
If a firm have no share repurchase, this table can be missing from the quarterly report.
I have tried to parse the html files with Python BeautifulSoup, but the results are not satisfactory, mainly because these files are not written in a consistent format.
For example, the only way I can think of to parse these forms is
from bs4 import BeautifulSoup
import requests
import unicodedata
import re
url='https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'
def parse_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tables = soup.find_all('table') 
    identifier = re.compile(r'Total.*Number.*of.*Shares.*\w*Purchased.*', re.UNICODE|re.IGNORECASE|re.DOTALL)
    n = len(tables) -1
    rep_tables = []
    while n >= 0:
        table = tables[n]
        remove_invalid_tags(table)
        table_text = unicodedata.normalize('NFKD', table.text).encode('ascii','ignore')
        if re.search(identifier, table_text):
            rep_tables += [table]
            n -= 1
        else:
            n -= 1
    return rep_tables
def remove_invalid_tags(soup, invalid_tags=['sup', 'br']):
    for tag in invalid_tags:
        tags = soup.find_all(tag)
        if tags:
            [x.replaceWith(' ') for x in tags]
The above code only returns the messy that may contain the repurchase information. However, 1) it is not reliable; 2) it is very slow; 3) the following steps to scrape date/month, share price, and number of shares etc. are much more painful to do. I am wondering if there are more feasible languages/approaches/applications/databases to get such information? Thanks a million!