Tuesday, May 26, 2015

Tesseract

So you have a big PDF. Or a bunch of PDFs. And they're the kind where the text is stored as images, so it isn't searchable. Want to OCR them from the command line? That's what tesseract allows. Awesome, right? Yeahhhh...

So, follow the steps on either of these sites to get it installed:

But the site with the most comprehensive information on the entire OCR process is:

The first two sites suggest using imagemagick to convert PDF pages to image files, but the third one suggests using ghostscript with a specific set of parameters for best performance, and explains why. Ghostscript worked better for me.

The PDF->image command they recommend is:
  • gs -dSAFER -dQUIET -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -dFirstPage=STARTPAGE -dLastPage=ENDPAGE -r300 -o image%03d.png -c 30000000 setvmthreshold -f mypdf.pdf
This tells ghostscript to convert pages STARTPAGE through ENDPAGE of mypdf.pdf into sequential 300 dpi PNG files named imageXXX.png (where XXX is a 3-digit, zero-padded number), using 8 simultaneous rendering threads. The "-c 30000000 setvmthreshold" part raises ghostscript's garbage-collection threshold to roughly 30 MB, so it spends more RAM and less time collecting garbage (to process faster). The resulting images, they find, are of good enough quality to feed to tesseract.

The actual OCR command is much simpler:
  • tesseract image001.png image001
This tells tesseract to recognize all text in image001.png, and to save the textual result into image001.txt. tesseract allows many other options, including recognition parameters and language, but I didn't really get into the details of those, so I'll omit them here.

So, follow these two steps, and you can OCR your own PDFs! Follow only the second one if your input is already in image (jpg/png/bmp) format.
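
For what it's worth, here's a rough sketch of how the two steps could be chained into one shell function. This is my own illustration, not something from the sites above: the name ocr_pdf and its argument order are made up, and the gs call is trimmed down to the core flags from the command above (add the threading and memory flags back if you want them). It assumes gs and tesseract are both on your PATH.

```shell
#!/bin/sh
# Sketch only: OCR a page range of a PDF into a single text file.
# ocr_pdf is a hypothetical helper name, not part of gs or tesseract.
# Arguments: PDF file, first page, last page, output text file.
ocr_pdf() {
    pdf=$1; first=$2; last=$3; out=$4
    workdir=$(mktemp -d) || return 1

    # Step 1: render the requested pages to 300 dpi PNG files.
    gs -dSAFER -dQUIET -sDEVICE=png16m -r300 \
       -dFirstPage="$first" -dLastPage="$last" \
       -o "$workdir/page%03d.png" "$pdf" || { rm -rf "$workdir"; return 1; }

    # Step 2: OCR each image and append its text to the output file.
    # tesseract adds the .txt extension to the base name it is given.
    : > "$out"
    for img in "$workdir"/page*.png; do
        [ -e "$img" ] || continue
        base=${img%.png}
        tesseract "$img" "$base" || { rm -rf "$workdir"; return 1; }
        cat "$base.txt" >> "$out"
    done

    # Clean up the intermediate images and per-page text files.
    rm -rf "$workdir"
}

# Example call (hypothetical file names):
#   ocr_pdf mypdf.pdf 1 10 mypdf.txt
```

The intermediate images go into a temporary directory so they don't clutter the folder you're working in. If your input is already in image format, only the loop in step 2 applies.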

Another link I found with more info on the subject:
And a list of available language models for the tesseract algorithm to use:

Ghostscript

Ghostscript is awesome. It's available on the command line, it was already on my Mac OS X machine, and it allows a nice range of PDF manipulations. I'd already written a post where PDF documents are joined together into a single one (See PDF Merge). Now for a slight upgrade, here's a way to grab only certain pages from a document:

gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage=22 -dLastPage=36 \
       -sOutputFile=outfile.pdf inputfile.pdf

This command takes pages 22-36 from inputfile.pdf and saves them into the new file outfile.pdf. Together with PDF merging, this little technique allows us to mix and match PDF pages into new documents as we see fit.
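
To make the mix-and-match idea concrete, here's a small sketch that wraps the extraction command above and a merge command in two shell functions. The names pdf_pages and pdf_join are hypothetical, and the merge relies on the fact that ghostscript's pdfwrite device concatenates whatever input files it's given, in order. I've added -dQUIET to keep the output clean; drop it if you want to see what gs is doing.

```shell
#!/bin/sh
# Hypothetical helper: copy pages $2..$3 of the PDF $1 into the new file $4.
pdf_pages() {
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dQUIET \
       -dFirstPage="$2" -dLastPage="$3" \
       -sOutputFile="$4" "$1"
}

# Hypothetical helper: join the PDFs given after $1 into $1, in that order.
pdf_join() {
    out=$1; shift
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dQUIET \
       -sOutputFile="$out" "$@"
}

# Example (hypothetical file names): pull two page ranges and swap their order.
#   pdf_pages inputfile.pdf 22 36 part1.pdf
#   pdf_pages inputfile.pdf 50 52 part2.pdf
#   pdf_join mixed.pdf part2.pdf part1.pdf
```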

My thanks to this site for the great tip!