Tuesday, May 26, 2015

Tesseract

So you have a big PDF. Or a bunch of PDFs. And they're the kind where the text is stored as images, so it isn't searchable. Want to OCR them from the command line? That's what tesseract is for. Awesome, right? Yeahhhh...

So, follow the steps on either of these sites to get it installed:

But the site with the most comprehensive information on the entire OCR process is:

The first two sites suggest using imagemagick to convert PDF pages to image files, but the third one recommends ghostscript with a specific set of parameters for best results, and justifies the choice. Ghostscript worked better for me.

The PDF->image command they recommend is:
  • gs -dSAFER -dQUIET -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -dFirstPage=STARTPAGE -dLastPage=ENDPAGE -r300 -o image%03d.png -c 30000000 setvmthreshold -f mypdf.pdf
This tells ghostscript to convert pages STARTPAGE through ENDPAGE of mypdf.pdf into sequential 300dpi png files called imageXXX.png (where XXX is a 3-digit 0-padded number), using 8 simultaneous rendering threads. The "30000000 setvmthreshold" part raises ghostscript's memory threshold to about 30 MB, so it can use more RAM and spend less time on garbage collection (to process faster). The resulting images, they find, are of good enough quality to feed to the tesseract OCR.
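If you run this on more than one PDF, it gets tedious to retype. Here's a minimal sketch of a shell function wrapping the exact command above. The function name pdf_to_png and its three arguments are my own invention for illustration, not anything from the sites mentioned, and it assumes gs is on your PATH.

```shell
# Hypothetical helper (the name and argument order are assumptions):
# render pages FIRST..LAST of a PDF to 300 dpi PNGs named image001.png, ...
# Usage: pdf_to_png mypdf.pdf 1 10
pdf_to_png() {
    if [ "$#" -ne 3 ]; then
        echo "usage: pdf_to_png FILE.pdf FIRSTPAGE LASTPAGE" >&2
        return 1
    fi
    pdf=$1
    first=$2
    last=$3
    # Same flags as the command quoted above, only the page range and
    # input file are filled in from the arguments.
    gs -dSAFER -dQUIET -sDEVICE=png16m -dINTERPOLATE \
       -dNumRenderingThreads=8 \
       -dFirstPage="$first" -dLastPage="$last" \
       -r300 -o image%03d.png \
       -c 30000000 setvmthreshold -f "$pdf"
}
```

Put it in your shell startup file (or just paste it into a terminal), then call it from the directory where you want the images to land.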

The actual OCR command is much simpler:
  • tesseract image001.png image001
This tells tesseract to recognize all text in image001.png, and to save the textual result into image001.txt (you give only the base name; tesseract appends the .txt extension itself). Tesseract accepts many other options, including recognition parameters and language, but I didn't really get into the details of those, so I'll omit them here.
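Since ghostscript produces one image per page, you'll usually want to run that command on every image in turn. Here's a rough sketch of a loop that does it. The function name ocr_all is hypothetical, and it assumes the images follow the imageXXX.png naming from the ghostscript step and that tesseract is on your PATH.

```shell
# Hypothetical helper (name is an assumption): OCR every image*.png in a
# directory (default: the current one). Each imageNNN.png produces an
# imageNNN.txt next to it.
# Usage: ocr_all [DIRECTORY]
ocr_all() {
    dir=${1:-.}
    for img in "$dir"/image*.png; do
        # If the glob matched nothing, the pattern is left as-is; skip it.
        [ -e "$img" ] || continue
        # Second argument is the output base name, without the extension.
        tesseract "$img" "${img%.png}"
    done
}
```

The shell expands the glob in sorted order, so with the 0-padded names the pages get processed in sequence.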

So, follow these two steps, and you can OCR your own PDFs! Follow only the second one if your input is already in image (jpg/png/bmp) format.

Another link I found with more info on the subject:
And a list of available language models for the tesseract algorithm to use:
