I have many folders, each containing a few PDF files (along with other file types such as .xlsx or .doc). My goal is to extract the text of the PDFs in each folder and build a data frame with one row per folder: the first column holds the folder name, and each remaining column holds the text content of one PDF in that folder as a string.
I managed to extract text from a single PDF file with the tika package (code below), but I cannot work out how to loop over the other PDFs in the folder, or over the other folders, to build a structured dataframe.
# import parser object from tika
from tika import parser   
  
# opening pdf file 
parsed_pdf = parser.from_file("ducument_1.pdf") 
  
# saving content of pdf 
# parsed_pdf['metadata'] holds the document metadata
# parsed_pdf['content'] returns the extracted text as a string (or None if no text was found)
data = parsed_pdf['content']  
  
# Printing of content  
print(data) 
  
# <class 'str'> 
print(type(data))
The desired output should look like this:
| Folder_Name | pdf1 | pdf2 | 
|---|---|---|
| 17534 | text of the pdf1 | text of the pdf2 |
| 63546 | text of the pdf1 | text of the pdf2 |
| 26374 | text of the pdf1 | - |
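
For reference, here is a minimal sketch of the kind of loop I am trying to write. It assumes every folder sits directly under one root directory and that PDFs can simply be numbered in sorted filename order. The names `collect_pdf_rows` and `extract_with_tika` are my own placeholders, not part of tika. The extractor is passed in as a parameter so the folder-walking logic can be tried without tika running.

```python
from pathlib import Path


def extract_with_tika(pdf_path):
    """Return the text of one PDF via tika (None if tika found no text)."""
    # Imported lazily so the folder-walking logic below works without tika.
    from tika import parser
    return parser.from_file(str(pdf_path))["content"]


def collect_pdf_rows(root, extract=extract_with_tika):
    """Build one dict per sub-folder of `root`: Folder_Name, pdf1, pdf2, ..."""
    rows = []
    # Assumption: each folder of interest is a direct child of `root`.
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        row = {"Folder_Name": folder.name}
        # Only PDFs are read; .xlsx, .doc, etc. fail the suffix check.
        pdfs = sorted(
            f for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() == ".pdf"
        )
        # Assumption: columns are numbered by sorted filename order.
        for i, pdf in enumerate(pdfs, start=1):
            row[f"pdf{i}"] = extract(pdf)
        rows.append(row)
    return rows
```

The rows could then be turned into the table above with `pd.DataFrame(collect_pdf_rows("path/to/root"))`, where `"path/to/root"` is a placeholder. pandas fills the missing cells with NaN for folders that have fewer PDFs than the others.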