I need to parse a PDF document. I already implemented the parser using the iText library, and until now it worked without any problems.
But now I need to parse another document, which ends up with very strange whitespace in the middle of words. For example, I get:
Vo rber eitung auf die Motorr adsaison. Viele Motorr adf ahr er
All the split words should be connected (for example, "Vo rber eitung" should read "Vorbereitung"), but somehow the PDF parser inserts whitespace into the words. When I copy and paste the content from the PDF into a text file, however, I don't get these spaces.
At first I thought the PDF parsing library I'm using was to blame, but I get exactly the same issue with another library.
I had a look at the singleSpaceWidth of the parsed words and noticed that it varies exactly at the places where a whitespace gets added. I tried to join the fragments manually, but since there isn't really a pattern for recombining the words, that is almost impossible.
Has anyone else had a similar issue, or even found a solution to this problem?
As requested, here is some more information:
- iText Version 5.2.1
- http://prine.ch/whitespacesProblem.pdf (link to the PDF)
Parsing with SemTextExtractionStrategy:
PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);
SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Set the page number on the strategy; it is used by the parsing strategies.
    semTextExtractionStrategy.pageNumber = i;
    // Parse text from page
    PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy);
}
Here is the SemTextExtractionStrategy method that actually parses the text. In it I manually append a whitespace after every parsed word, but somehow the words are already split during detection:
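@Override
public void parseText(TextRenderInfo renderInfo, int pageNumber) {
    this.pageNumber = pageNumber;
    // Text of the chunk that was just rendered
    String text = renderInfo.getText();
    // Append the chunk followed by a separating space
    currTextBlock.getText().append(text).append(" ");
    ....
}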
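As a point of comparison, one common way to avoid appending a space after every chunk is to look at the horizontal gap between consecutive chunks and insert a space only when that gap is large relative to the width of a space character. The following is a minimal sketch of that idea, not the parser's actual code: the ChunkJoiner and Chunk classes, their fields, the coordinates in main, and the half-a-space threshold are all assumptions made for illustration, and it only handles chunks on one line in left-to-right order. In iText 5 the coordinates could presumably be taken from renderInfo.getBaseline().getStartPoint() and getEndPoint(), and the space width from renderInfo.getSingleSpaceWidth().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of iText.
final class ChunkJoiner {

    // One rendered text chunk on a single line (all names are assumptions).
    static final class Chunk {
        final String text;
        final float startX;     // x-coordinate where the chunk starts
        final float endX;       // x-coordinate where the chunk ends
        final float spaceWidth; // width of a space character in the chunk's font

        Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    // Joins chunks, inserting a space only where the gap suggests a word break.
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk curr : chunks) {
            if (prev != null) {
                float gap = curr.startX - prev.endX;
                boolean alreadySpaced = prev.text.endsWith(" ") || curr.text.startsWith(" ");
                // Assumed threshold: half a space width; it may need tuning per document.
                if (gap > prev.spaceWidth / 2f && !alreadySpaced) {
                    sb.append(' ');
                }
            }
            sb.append(curr.text);
            prev = curr;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates that mimic the split words from the example above.
        List<Chunk> chunks = new ArrayList<Chunk>();
        chunks.add(new Chunk("Vo", 0f, 10f, 3f));
        chunks.add(new Chunk("rber", 10.2f, 30f, 3f));
        chunks.add(new Chunk("eitung", 30.1f, 60f, 3f));
        chunks.add(new Chunk("auf", 63f, 75f, 3f));
        System.out.println(join(chunks)); // prints "Vorbereitung auf"
    }
}
```

With this approach the fragments "Vo", "rber" and "eitung" are glued together because the gaps between them are much smaller than a space, while "auf" is separated because its gap is a full space width.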
Here is the whole SemTextExtractionStrategy class, but it only calls the method from above (parseText):
public class SemTextExtractionStrategy implements TextExtractionStrategy {
    // Text Extraction Strategies
    public ColumnDetecter columnDetecter = new ColumnDetecter();
    // Image Extraction Strategies
    public ImageRetriever imageRetriever = new ImageRetriever();
    public int pageNumber = -1;
    public ArrayList<TextParsingStrategy> textParsingStrategies = new ArrayList<TextParsingStrategy>();
    public ArrayList<ImageParsingStrategy> imageParsingStrategies = new ArrayList<ImageParsingStrategy>();
    public SemTextExtractionStrategy() {
        // Add all text parsing strategies that are later applied to the extracted text
        // textParsingStrategies.add(fontSizeMatcher);
        textParsingStrategies.add(columnDetecter);
        // Add all image parsing strategies that are later applied to the extracted images
        imageParsingStrategies.add(imageRetriever);
    }
    @Override
    public void beginTextBlock() {
    }
    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // TEXT PARSING
        for(TextParsingStrategy strategy : textParsingStrategies) {
            strategy.parseText(renderInfo, pageNumber);
        }
    }
    @Override
    public void endTextBlock() {
    }
    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        for(ImageParsingStrategy strategy : imageParsingStrategies) {
            strategy.parseImage(renderInfo);
        }
    }
}