I have successfully read a PDF file containing tabular data and text. But I also have an image in PDF format that contains some text, which needs to be fetched for record-keeping purposes. All the PDFs are in a specific folder. I know only the basics of Python. Could anyone help me with this?
        - 
                    This is a duplicate. Check out this post: https://stackoverflow.com/questions/17630650/simple-python-library-for-recognition-text-from-image – Floam Nov 20 '19 at 04:23
- 
                    https://tabula.technology/ could probably solve your problem, using the coordinates of the particular field you are extracting – aayush_malik Nov 20 '19 at 05:28
- 
                    Try this one: https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file – Lê Tư Thành Nov 20 '19 at 08:01
- 
                    I have tried PyPDF2. It recognizes tabular data and text in PDFs that were converted from MS Word, but I need to read an image that contains some random text. Can anyone help with that? – Prithivi Raj Nov 20 '19 at 11:53
1 Answer
You can extract both images (inline and XObject) and text (plain, or including PDF operators) from a PDF document using pdfreader.
Here is sample code that extracts all of the above from every page of a document:
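from pdfreader import SimplePDFViewer, PageDoesNotExist

your_pdf_file_name = "example.pdf"  # placeholder: path to your PDF

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""    # decoded strings only
pdf_markdown = ""  # text content including PDF operators
images = []        # inline images and image XObjects
try:
    while True:
        viewer.render()  # parse the current page into viewer.canvas
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        images.extend(viewer.canvas.inline_images)
        images.extend(viewer.canvas.images.values())
        viewer.next()  # raises PageDoesNotExist after the last page
except PageDoesNotExist:
    pass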
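Since the question mentions that all the PDFs sit in one folder, the loop above could be wrapped in a function and applied to each file in turn. This is only a minimal sketch under that assumption: the helper names `find_pdfs`, `extract_from_pdf`, and `process_folder` are hypothetical and not part of pdfreader, and only the pdfreader calls already shown above are used.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def extract_from_pdf(path):
    """Collect plain text and Pillow images from every page of one PDF."""
    # Imported lazily so find_pdfs() works even without pdfreader installed.
    from pdfreader import SimplePDFViewer, PageDoesNotExist

    plain_text, pil_images = "", []
    with open(path, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()
                plain_text += "".join(viewer.canvas.strings)
                page_images = list(viewer.canvas.inline_images)
                page_images.extend(viewer.canvas.images.values())
                # Convert while the file is still open.
                pil_images.extend(img.to_Pillow() for img in page_images)
                viewer.next()
        except PageDoesNotExist:
            pass
    return plain_text, pil_images


def process_folder(folder, extract=extract_from_pdf):
    """Map each PDF file name in `folder` to the result of `extract`."""
    return {p.name: extract(p) for p in find_pdfs(folder)}
```

Calling `process_folder("my_pdf_folder")` (a made-up folder name) would then give a dictionary keyed by file name, each value holding the text and images for that PDF.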
You can also convert the images to PIL/Pillow objects and save them:
for i, img in enumerate(images):
    img.to_Pillow().save("{}.png".format(i))
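For a scanned page the text lives inside the picture, so pdfreader can only hand back the image; reading the words needs a separate OCR step. A rough sketch of that step, assuming the `pytesseract` package and the Tesseract engine are installed (neither is mentioned in this answer); `ocr_images` is a hypothetical helper name:

```python
def ocr_images(pil_images, ocr=None):
    """Run OCR over Pillow images and return one string per image.

    `ocr` is any callable that takes a Pillow image and returns text.
    By default pytesseract.image_to_string is used, which is an
    assumption: it needs the pytesseract package and the Tesseract binary.
    """
    if ocr is None:
        import pytesseract  # optional dependency, imported only when needed
        ocr = pytesseract.image_to_string
    return [ocr(img).strip() for img in pil_images]
```

With the `images` list collected above, this could be called as `ocr_images(img.to_Pillow() for img in images)`. OCR quality depends heavily on the scan resolution, so the results should be checked by hand.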
 
    
    
        Maksym Polshcha
        
