The text in the PDF files is actual text, not scanned images. PDFMiner does not support Python 3; are there any other solutions?
- https://github.com/mstamy2/PyPDF2 ? – Eric Levieil Jun 24 '15 at 11:17
- There is a 3k version of the PDFMiner library: https://pypi.python.org/pypi/pdfminer3k – Christian O'Reilly Nov 12 '15 at 14:29
3 Answers
There is also the pdfminer2 fork, which supports Python 3.4 and is available through pip3: https://github.com/metachris/pdfminer
This thread helped me patch something together.
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO
def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""    # no password
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages
    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr
if __name__ == "__main__":
    #scrape = open("../warandpeace/chapter1.pdf", 'rb') # for local files
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf") # for external files
    pdfFile = BytesIO(scrape.read())
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()    
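The commented-out line above hints that the same function works for local files too. Here is a minimal sketch of a helper that picks between the two. It assumes the source is remote whenever it has an http or https scheme, and `load_pdf_bytes` is a hypothetical name, not part of pdfminer:

```python
from io import BytesIO
from urllib.parse import urlparse
from urllib.request import urlopen


def load_pdf_bytes(source):
    """Return the PDF at `source` (local path or http(s) URL) as a BytesIO."""
    # Assumption: an http/https scheme means a remote file;
    # anything else is treated as a local path.
    if urlparse(source).scheme in ("http", "https"):
        with urlopen(source) as response:
            return BytesIO(response.read())
    with open(source, "rb") as handle:
        return BytesIO(handle.read())
```

With that in place, something like `readPDF(load_pdf_bytes("../warandpeace/chapter1.pdf"))` should work for either kind of source.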
For Python 3, you can install pdfminer.six with:
python -m pip install pdfminer.six
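Once pdfminer.six is installed, its `pdfminer.high_level` module has an `extract_text` function that covers the common case in one call. A rough sketch follows. `extract_pdf_text` and `split_pages` are hypothetical helper names, and splitting on form feeds assumes the text converter ends each page with `\f`, which is worth checking against your own output:

```python
def extract_pdf_text(path, page_numbers=None):
    """Extract text from a PDF file with pdfminer.six's high-level API."""
    # Imported here so this module still loads when pdfminer.six is missing.
    from pdfminer.high_level import extract_text

    # page_numbers is a zero-based collection of pages, or None for all pages.
    return extract_text(path, page_numbers=page_numbers)


def split_pages(text):
    """Split extracted text into pages on form feeds, dropping empty pages."""
    # Assumption: pages are separated by "\f" in the extracted text.
    return [page.strip() for page in text.split("\f") if page.strip()]
```

For example, `split_pages(extract_pdf_text("chapter1.pdf"))` would give one string per page if the form-feed assumption holds.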
 
    
    
        Durdu
        
 
    
    
        Shruti Agrawal
        
tika worked best for me. I would even say it is better than PyPDF2 and pdfminer; it made it really easy to extract each line of the PDF into a list. You can install it with pip install tika
Then use the code below:
from tika import parser
rawText = parser.from_file(path_to_pdf)
rawList = rawText['content'].splitlines()
print(rawList)
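One caveat with the snippet above: `rawText['content']` can be `None` when Tika finds no text, and the resulting list often contains many blank entries. A small sketch of a guard (`content_to_lines` is a hypothetical helper, not part of tika):

```python
def content_to_lines(parsed):
    """Turn the dict returned by tika's parser.from_file into non-empty lines."""
    # Assumption: 'content' may be missing or None, e.g. for image-only PDFs.
    content = parsed.get("content") or ""
    # Strip surrounding whitespace and drop lines that end up empty.
    return [line.strip() for line in content.splitlines() if line.strip()]
```

It would be used as `rawList = content_to_lines(parser.from_file(path_to_pdf))`.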
 
    
    
        Siddharth Das
        
 
     
    