Workflow Pt 1 – Book Scanning
Now that another round of semester-end work has passed, I thought I would post a little bit about how I accomplish it all. I had a few conversations with colleagues about my reading and writing process, and I found that my ten-second explanations left them more confused about what I was doing than when I had started. Pictures are definitely needed, and it would be nice just to give someone a link. In addition, my current approach has been cobbled together from tips by various people around the web, so I thought I would compile all my tips and tricks in one place so that others might benefit. First, I’ll do a post on my book-scanning process, but I hope to also write a few posts about how I read PDFs, write in Markdown, and compile in LaTeX. (As a bonus, I might also do a post on Sublime Text 2 with Vim keybindings.)
Building the Scanner
So here is my infamous book-scanner:
I learned nearly everything I needed to know to construct this thing over at the Do It Yourself Book Scanner community. I borrowed most of the design for my scanner from Daniel Reetz’s “new standard” build. You can see my own build thread there as well.
The way this works is simple: I place a book under the glass platen, which presses the pages flat at precisely 90 degrees. I have two Canon A480s (a red and a blue so I can keep odds and evens straight), which I bought refurbished for $60 apiece. The model is important because I had to pick one that could run a hacked firmware called CHDK. With this installed on the cameras’ memory cards, I wrote a little script that takes a picture every six seconds (with a beep every second so I know where I am in the cycle):
rem Timer shutter control for DIY Bookscanners
rem tested on Canon Powershot S5 IS
@title DIY Timer Shutter V1
@param s Seconds
@default s 6
@param a Sounds (0=No 1=Yes)
@default a 1
if s<2 then s=2
t=s-1
:wait_interval
i=t
do
  if a<>0 then playsound 4
  sleep 1000
  i=i-1
until i<1
if a<>0 then playsound 0
click "shoot_full"
do
  get_prop 206 p
until p<>1
sleep 1000
goto "wait_interval"
rem Get_prop 206 tests when DIGIC III camera is ready to shoot again
rem Use get_prop 205 instead for DIGIC II cameras
rem Don't activate AE lock, this loop is endless if AE lock is activated
end
With this script running on the cameras, all I have to do is lift the glass and turn the pages. With two cameras firing every six seconds, this turns out 1,200 pages per hour. As I’ve gotten better at it, I have upped the cycle speed to five seconds, for about 1,440 pages an hour.
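As a quick check on that arithmetic, here is a minimal sketch in Python. The function name `pages_per_hour` is made up for illustration, and it assumes every cycle yields exactly one usable page per camera, so it gives an upper bound that ignores pauses, retakes, and swapping books.

```python
def pages_per_hour(cycle_seconds, cameras=2):
    """Upper bound on scanning throughput.

    Assumes each camera captures exactly one page per cycle and that
    no cycles are lost to retakes or pauses (an idealization).
    """
    if cycle_seconds <= 0:
        raise ValueError("cycle_seconds must be positive")
    cycles_per_hour = 3600 / cycle_seconds
    return cameras * cycles_per_hour


if __name__ == "__main__":
    for seconds in (6, 5):
        print(f"{seconds}s cycle: {pages_per_hour(seconds):.0f} pages/hour")
```

At that rate a 300-page book needs 150 cycles, or 15 minutes of page turning on the six-second timer.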
Now, once we have run through a book, the pictures come off the cameras’ memory cards looking like this:
I then collate the pages from the two cameras into one folder. All the pictures coming off the red camera are evens and all the pictures coming off the blue camera are odds, so once they are interleaved in a single directory the images are in page order. This could easily be accomplished with a shell script, but I like to check that I didn’t miss any pages, so I use a program called File Wrangler.
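For anyone who would rather script this step, here is a rough sketch in Python rather than shell. It is not what I actually run, and several things in it are assumptions: the function names, the `page_0001.jpg` naming scheme, the `*.JPG` extension, and the idea that each shot's even (left-hand) page comes before its odd (right-hand) page. It refuses to run if the two cameras produced different numbers of images, which is the most common sign of a missed page.

```python
import shutil
from pathlib import Path


def interleave(evens, odds):
    """Alternate items from two equal-length lists, even-page item first."""
    if len(evens) != len(odds):
        raise ValueError(
            f"page count mismatch: {len(evens)} evens vs {len(odds)} odds"
        )
    merged = []
    for even, odd in zip(evens, odds):
        merged.extend([even, odd])
    return merged


def collate(even_dir, odd_dir, out_dir):
    """Copy both cameras' images into out_dir under sequential names.

    Files are taken in sorted filename order from each folder, which
    assumes the cameras number their shots in the order they were taken.
    """
    evens = sorted(Path(even_dir).glob("*.JPG"))
    odds = sorted(Path(odd_dir).glob("*.JPG"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for number, src in enumerate(interleave(evens, odds), start=1):
        dest = out / f"page_{number:04d}{src.suffix.lower()}"
        shutil.copy2(src, dest)  # copy, never move, so the originals survive
        copied.append(dest)
    return copied
```

Calling `collate("red", "blue", "collated")` with hypothetical folder names would leave the originals untouched and fill `collated` with files that sort in page order.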
I then point an amazing utility called Scan Tailor at the directory and it does most of the heavy post-processing work. When Scan Tailor gets done with the directory (usually about ten minutes for a 300-page book), I have another directory with cleaned-up black-and-white TIFFs. They look like this:
Now I string all these together into a single PDF using an available Automator action. In Automator I set up a system service that takes the selected files from Finder and combines them into a single PDF, saving it to my desktop.
This is already a readable PDF with clean text, but it is somewhat large and there is no OCR (optical character recognition), so I can’t highlight or copy and paste. There are some free OCR programs out there such as Tesseract, but I really like Adobe’s ClearScan OCR, which comes with Acrobat Pro. It took me a while to find this since it isn’t the default option in Acrobat and there is not much documentation on it, but I like it because it does a really nice job of extrapolating a vectorized font from the image. This means that the text looks crisp and clear on different screens and resolutions, from my Air to my Kindle to my iPad. When I am done, the page looks like this: