
Usually, scanned pages need to be deskewed before applying an OCR tool. Here, my input is a straight scanned page, yet the OCR output is sometimes skewed, either clockwise or counter-clockwise. In my use case, a 260-page English book, this happens on 14 pages.

Example input: input.pdf

Output: output.pdf

Command:

convert -density 300 -quality 100 input.pdf -level 0%,100%,4.0 -black-threshold 75% convert.pdf && pdfsandwich -noimage -coo "-normalize  -density 300 -black-threshold 75%" convert.pdf -o output.pdf

How can I avoid this output skew?

Alternatively, how can I deskew the output without losing the OCR text? All the methods and tools I have found first convert to an image format, which discards the OCR layer and so is useless here.

lalebarde

2 Answers


I was having this problem too. It comes from one of the commands pdfsandwich runs: unpaper. The deskewing algorithm in unpaper is broken. You can pass parameters to unpaper via the -unpo switch, for example -unpo "-dv 0", which should disable the deskew. If your pages are already crisp and OCR-ready, you can disable all preprocessing with -noprepro.
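As a rough sketch, the two options above could be wrapped as shell functions. The function names are made up for illustration; the flags are the ones quoted in this answer, and whether "-dv 0" fully disables deskewing may depend on the unpaper version installed.

```shell
#!/bin/bash
# Hypothetical wrappers around the two pdfsandwich options mentioned above.

# OCR with unpaper still enabled, passing "-dv 0" through to it via -unpo.
ocr_without_deskew() {
    pdfsandwich -unpo "-dv 0" "$1" -o "$2"
}

# OCR with all preprocessing skipped, for pages that are already clean.
ocr_without_preprocessing() {
    pdfsandwich -noprepro "$1" -o "$2"
}

# Example call (uncomment to run):
# ocr_without_deskew convert.pdf output.pdf
```

Trying the narrower option first keeps the rest of unpaper's cleanup; -noprepro is the fallback when the scan needs no cleanup at all.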

EkriirkE

Thanks to Remy F, I could write this solution with the help of the following LaTeX file, which imports output.pdf and rotates it:

\documentclass{article} 
\usepackage[paperwidth=6.38in,paperheight=10.32in,bindingoffset=0in,top=-0.39in,bottom=0in,left=-0.29in,right=0in,footskip=0in]{geometry}
\usepackage{graphicx}

\begin{document}

\pagestyle{empty}
\begin{figure}[t]
    \includegraphics[scale=0.233,angle=-4]{output.pdf} 
\end{figure}

\end{document}

Then:

pdflatex output_tex.tex

This creates output_tex.pdf.

It would be nice to tune the scale and margins automatically, so that the whole process can be automated.

EDIT: I have made some progress on obtaining the deskew angle automatically:

angle=`convert output.pdf -deskew 40 -format "%[deskew:angle]" info:`
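The full script further down flips the sign of this value before handing it to \includegraphics. Here is a minimal sketch of that step without bc, assuming the angle comes back as a plain decimal number. The function name negate_angle is hypothetical, and treating an empty result as 0 is my own choice, not something the tools guarantee.

```shell
#!/bin/bash
# Hypothetical helper: print the negated angle; empty input counts as 0.
# Note that %g keeps at most 6 significant digits.
negate_angle() {
    # "0 - a" rather than "-a" so that a zero angle prints as 0, not -0.
    awk -v a="$1" 'BEGIN { printf "%g\n", 0 - a }'
}

# Example call (uncomment to run):
# angle=$(negate_angle "$(convert output.pdf -deskew 40 -format "%[deskew:angle]" info:)")
```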

Automating the whole process gives:

#!/bin/bash
# Split the argument into base name and extension.
name=${1%.*}
ext=${1##*.}
# Clean up the scan, then run the OCR.
convert -density 300 -quality 100 "${name}.$ext" -level 0%,100%,4.0 -black-threshold 75% "${name}_convert.$ext"
pdfsandwich -noimage -coo "-normalize -density 300 -black-threshold 75%" "${name}_convert.$ext" -o "${name}_ocr.$ext"
# Measure the skew of the OCR output and flip its sign for LaTeX.
angle=$(convert "${name}_ocr.$ext" -deskew 40 -format "%[deskew:angle]" info:)
angle=$(echo "${angle}*-1" | bc)
echo "  angle = $angle"
# Fill in the LaTeX template, compile it, and remove the intermediate files.
sed -e "s/ANGLE/$angle/" -e "s/FILE/${name}_ocr.$ext/" /var/ocr/pdfrotate.tex > "${name}_ocr_straight.tex"
pdflatex "${name}_ocr_straight.tex"
rm "${name}_convert.$ext" "${name}_ocr_straight.tex" "${name}_ocr_straight.aux" "${name}_ocr_straight.log"

With /var/ocr/pdfrotate.tex:

\documentclass{article}
\usepackage[paperwidth=6.38in,paperheight=10.32in,bindingoffset=0in,top=-0.39in,bottom=0in,left=-0.29in,right=0in,footskip=0in]{geometry}
\usepackage{graphicx}
\begin{document}
\pagestyle{empty}
\begin{figure}[t]
    \includegraphics[scale=0.233,angle=ANGLE]{FILE}
\end{figure}
\end{document}

The scale looks right, but it is document dependent. Unfortunately, the geometry parameters top and left that I tuned for my trial page do not suit other pages, and I don't know how to set them automatically. One possibility would be to blur both the original page and the resulting one, then optimise a correlation between them with top and left as parameters.
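I have not implemented that idea, but the search part could be organised as a simple grid search, sketched below. Everything here is an assumption for illustration: the function names are hypothetical, and the scoring command, which would have to render a page with the given top and left and compare it with the original, is left to the caller. It must print a single number where lower means a better match.

```shell
#!/bin/bash
# Sketch of a grid search over (top, left) offsets, in inches.
# All function names are hypothetical; the scorer is supplied by the caller.

# Print every "top left" pair from two space-separated lists.
candidate_offsets() {
    local tops="$1" lefts="$2" top left
    for top in $tops; do
        for left in $lefts; do
            printf '%s %s\n' "$top" "$left"
        done
    done
}

# Read "top left score" lines on stdin; print the pair with the lowest score.
pick_best_offset() {
    awk 'NR == 1 || $3 + 0 < min { min = $3 + 0; best = $1 " " $2 }
         END { print best }'
}

# Score each candidate with the caller's command and keep the best pair.
search_offsets() {
    local scorer="$1" tops="$2" lefts="$3" top left
    candidate_offsets "$tops" "$lefts" | while read -r top left; do
        printf '%s %s %s\n' "$top" "$left" "$("$scorer" "$top" "$left")"
    done | pick_best_offset
}

# Example call, with my_score being a user-written function (uncomment to run):
# search_offsets my_score "-0.45 -0.39 -0.33" "-0.35 -0.29 -0.23"
```

A grid is slow because each candidate needs its own pdflatex run, but it is easy to reason about and can be refined by rerunning with a finer grid around the best pair.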

lalebarde