Thursday, June 11, 2015

Command-line WiFi manipulation!

I found a blog post by Matt Crampton on manipulating WiFi from the command line. It's self-explanatory, so I'll copy it here and link to it. Many thanks, Matt Crampton!!

ORIGINAL LINK

Turn off wifi on your macbook from the Mac OSX terminal command line:
networksetup -setairportpower en0 off
Turn on wifi on your macbook from the Mac OSX terminal command line:
networksetup -setairportpower en0 on
List available wifi networks from the Mac OSX terminal command line:
/System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport scan
Join a wifi network from the Mac OSX terminal command line:
networksetup -setairportnetwork en0 WIFI_SSID_I_WANT_TO_JOIN WIFI_PASSWORD
Find your network interface name (the commands above assume en0, which may be different on your machine):
networksetup -listallhardwareports
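
If you'd rather not hard-code en0, the lookup can be scripted. Here's a rough sketch, not part of the original post. It assumes that networksetup -listallhardwareports prints a "Hardware Port: Wi-Fi" line (or "AirPort" on older systems) directly followed by a "Device: enX" line. The helper name wifi_device is made up for illustration.

```shell
# wifi_device: read `networksetup -listallhardwareports` output on stdin and
# print the device name (e.g. en0) of the first Wi-Fi port found.
# Assumption: the "Device:" line directly follows the "Hardware Port:" line.
wifi_device() {
  awk '!found && /^Hardware Port: (Wi-Fi|AirPort)$/ { getline; print $2; found = 1 }'
}

# Usage on a Mac (left commented out, since it changes your network state):
#   dev=$(networksetup -listallhardwareports | wifi_device)
#   networksetup -setairportpower "$dev" off
```

If the function prints nothing, no port with that label was found, so check the raw listing by hand before using the result.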

Tuesday, May 26, 2015

Tesseract

So you have a big PDF. Or a bunch of PDFs. And they're the kind that have text in image format, so the text is not searchable. Want to command-line OCR it? That's what tesseract allows. Awesome, right? Yeahhhh...

So, follow the steps on either of these sites to get it installed:

But the site with the most comprehensive information on the entire OCR process is:

The first two sites suggest using imagemagick to convert PDF pages to image files, but the third one suggests using ghostscript with special parameterization for optimal performance, duly justified. Ghostscript worked better for me.

The PDF->image command they recommend is:
  • gs -dSAFER -dQUIET -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -dFirstPage=STARTPAGE -dLastPage=ENDPAGE -r300 -o image%03d.png -c 30000000 setvmthreshold -f mypdf.pdf
This tells ghostscript to convert pages STARTPAGE through ENDPAGE of mypdf.pdf into sequential 300dpi png files called imageXXX.png (where XXX is a 3-digit 0-padded number), using 8 simultaneous rendering threads. The trailing -c 30000000 setvmthreshold raises ghostscript's memory threshold to roughly 30 MB, so it uses more RAM but spends less time on garbage collection (to process faster). The resulting images, they find, are of good enough quality to submit to the tesseract OCR.

The actual OCR command is much simpler:
  • tesseract image001.png image001
This tells tesseract to recognize all text in image001.png, and to save the textual result into image001.txt. tesseract allows many other options, including recognition parameters and language, but I didn't really get into the details of those, so I'll omit them here.

So, follow these two steps, and you can OCR your own PDFs! Follow only the second one if your input is already in image (jpg/png/bmp) format.
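
The two steps can be chained for a whole document. Below is a minimal sketch under a few assumptions: gs and tesseract are on your PATH, the PDF has fewer than 1000 pages (so the 3-digit numbering holds), and you're fine with a trimmed-down version of the gs flags shown above. The function name ocr_pdf, the default prefix "page" and the combined output name PREFIX_all.txt are my own inventions. The -l option is tesseract's language flag, and I default it to eng here.

```shell
# ocr_pdf PDF [PREFIX] [LANG]
# Render every page of PDF to PREFIX001.png, PREFIX002.png, ... at 300dpi,
# OCR each image into a matching .txt file, then join them in page order.
ocr_pdf() {
  ocr_pdf_file=$1
  ocr_prefix=${2:-page}
  ocr_lang=${3:-eng}

  # No -dFirstPage/-dLastPage here, so all pages are rendered.
  gs -dSAFER -dQUIET -sDEVICE=png16m -r300 \
     -o "${ocr_prefix}%03d.png" "$ocr_pdf_file" || return 1

  for ocr_img in "${ocr_prefix}"[0-9][0-9][0-9].png; do
    [ -e "$ocr_img" ] || continue          # glob matched nothing
    # tesseract appends .txt to the output base name itself.
    tesseract "$ocr_img" "${ocr_img%.png}" -l "$ocr_lang" || return 1
  done

  # Shell globs sort lexically, which matches page order with 0-padding.
  cat "${ocr_prefix}"[0-9][0-9][0-9].txt > "${ocr_prefix}_all.txt"
}
```

Something like ocr_pdf mypdf.pdf scan would leave scan001.png, scan001.txt and so on in the current directory, plus scan_all.txt with the text of all pages.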

Another link I found with more info on the subject:
And a list of available language models for the tesseract algorithm to use:

Ghostscript

Ghostscript is awesome. It's available on command line, it's easy to get on Mac OS X (if gs isn't already on your machine, a package manager such as Homebrew or MacPorts can install it), and it allows a nice range of PDF manipulations. I'd already written a post where PDF documents are joined together into a single one (see PDF Merge). Now for a slight upgrade, here's a way to grab only certain pages from a document:

gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage=22 -dLastPage=36 \
       -sOutputFile=outfile.pdf inputfile.pdf

This command takes pages 22-36 from inputfile.pdf and saves them into the new file outfile.pdf. Together with PDF merging, this little technique allows us to mix and match PDF pages into new documents as we see fit.
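
To avoid retyping the flags each time, the command can be wrapped in a small shell function. This is only a sketch: the name pdf_pages and its argument order are hypothetical, and the extra -dQUIET flag is my addition to keep the output short.

```shell
# pdf_pages INPUT FIRST LAST OUTPUT
# Copy pages FIRST..LAST (inclusive) of INPUT into a new PDF named OUTPUT.
pdf_pages() {
  if [ "$#" -ne 4 ]; then
    echo "usage: pdf_pages INPUT FIRST LAST OUTPUT" >&2
    return 2
  fi
  pp_in=$1 pp_first=$2 pp_last=$3 pp_out=$4

  # Reject empty or non-numeric page numbers before calling gs.
  for pp_n in "$pp_first" "$pp_last"; do
    case $pp_n in
      ''|*[!0-9]*) echo "page numbers must be whole numbers" >&2; return 2 ;;
    esac
  done
  if [ "$pp_first" -gt "$pp_last" ]; then
    echo "FIRST must not be greater than LAST" >&2
    return 2
  fi

  gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dQUIET \
     -dFirstPage="$pp_first" -dLastPage="$pp_last" \
     -sOutputFile="$pp_out" "$pp_in"
}
```

With that in your shell profile, pdf_pages inputfile.pdf 22 36 outfile.pdf does the same thing as the command above.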

My thanks to this site for the great tip!