I have a large number of PDFs with different structures, and I need to extract the text from them and find some key indicators.
I am using the pyPdf module and, when it does not retrieve any text from a PDF, I fall back to PDF Miner.
The problem is that for some of the files neither module works: no text is extracted from the PDF at all. Some of these are scanned or image-only PDFs, but others appear to have the same structure as the ones that can be parsed.
Here are the two functions I use; maybe I am missing something:
Using pyPdf
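import pyPdf

def getPDFContent(path):
    content = ""
    # Open in binary mode; the with-block closes the file when done
    with open(path, "rb") as fp:
        pdf = pyPdf.PdfFileReader(fp)
        for i in range(pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + " "
    # Replace non-breaking spaces (u"\xa0") and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

mt = getPDFContent(filename).encode("ascii", "xmlcharrefreplace")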
Using PDF Miner
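from cStringIO import StringIO  # assumed: Python 2 StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def getPDFContent(path):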
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
        retstr.write("nextpage")
    text = retstr.getvalue() 
    fp.close()
    device.close()
    retstr.close()
    return text
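
For completeness, this is roughly how I chain the two extractors. It is only a minimal sketch: `extract_with_fallback` and the `extractors` list are hypothetical names used for illustration, not part of pyPdf or PDF Miner, and it assumes the two functions above have been given distinct names. Each extractor is tried in order, and the first one that returns non-whitespace text wins.

```python
def extract_with_fallback(path, extractors):
    """Try each (name, function) pair in order.

    Returns (name, text) for the first extractor that yields
    non-whitespace text, or (None, "") if none of them do.
    """
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            # A parser that fails on this file counts as "no text";
            # move on to the next extractor.
            continue
        if text and text.strip():
            return name, text
    return None, ""
```

A result of `(None, "")` marks the files I am asking about: every extractor either failed or returned nothing, so they are the ones to check by hand for a missing text layer.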
 
    