PDF text extract with Python3.4

Question

The text in the PDF files is real text, not scanned images. PDFMiner does not support Python 3; are there any other solutions?

There is a 3k version of the PDFMiner library: https://pypi.python.org/pypi/pdfminer3k — Christian O'Reilly, Nov 12 '15 at 14:29

score 3 · Answer 1 · edited May 23 '17 at 12:34

There is also the pdfminer2 fork, which supports Python 3.4 and is available through pip3: https://github.com/metachris/pdfminer

This thread helped me patch something together.

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

if __name__ == "__main__":
    #scrape = open("../warandpeace/chapter1.pdf", 'rb') # for local files
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf") # for external files
    pdfFile = BytesIO(scrape.read())
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()

score 1 · Answer 2 · edited Oct 09 '18 at 10:57


For Python 3, you can install pdfminer (as the pdfminer.six package) with:

python -m pip install pdfminer.six
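
Once pdfminer.six is installed, extraction can be much shorter than wiring up the resource manager and interpreter by hand. The following is only a minimal sketch, assuming a recent pdfminer.six release that ships pdfminer.high_level.extract_text; the helper names load_pdf_bytes and pdf_to_text are made up for illustration and are not part of the library.

```python
from io import BytesIO
from urllib.request import urlopen


def load_pdf_bytes(source):
    """Read a local path or an http(s) URL into an in-memory binary buffer."""
    if source.startswith(("http://", "https://")):
        with urlopen(source) as response:
            return BytesIO(response.read())
    with open(source, "rb") as handle:
        return BytesIO(handle.read())


def pdf_to_text(source):
    """Return the text of a PDF as one string (hypothetical helper)."""
    # Imported here so the loader above still works where pdfminer.six
    # is not installed; assumes a release that provides extract_text.
    from pdfminer.high_level import extract_text

    # extract_text accepts a path or a binary file-like object.
    return extract_text(load_pdf_bytes(source))
```

Calling pdf_to_text("chapter1.pdf") for a local file, or passing the URL of a PDF, should return the document text as a single string that you can print or split into lines.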

answered Oct 09 '18 at 08:53 by Shruti Agrawal · edited Oct 09 '18 at 10:57 by Durdu

score 0 · Answer 3 · answered Jun 20 '19 at 08:07

tika worked best for me. In my experience it does better than PyPDF2 and pdfminer, and it made it really easy to extract each line of the PDF into a list. You can install it with pip install tika and then use the code below:

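from tika import parser

path_to_pdf = "example.pdf"  # placeholder: path to your own PDF file
rawText = parser.from_file(path_to_pdf)  # dict; the extracted text is under 'content'
rawList = rawText['content'].splitlines()  # one list entry per line of text
print(rawList)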
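
One thing to watch for with the snippet above: as far as I know, the 'content' entry can come back as None when Tika finds no text (for example in a scanned PDF), and then .splitlines() raises an error. Here is a rough sketch of a more defensive version. The helper names content_lines and pdf_lines are my own, not part of tika, and the tika package needs a Java runtime available to run its server.

```python
def content_lines(parsed):
    """Return the stripped, non-empty lines from a Tika result dict."""
    # Fall back to an empty string if 'content' is missing or None.
    content = parsed.get("content") or ""
    return [line.strip() for line in content.splitlines() if line.strip()]


def pdf_lines(path_to_pdf):
    """Parse a PDF with Tika and return its text lines (hypothetical helper)."""
    # Imported here so content_lines can be used without tika installed.
    from tika import parser

    return content_lines(parser.from_file(path_to_pdf))
```

This also drops the blank lines that otherwise end up as empty strings in the list.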
