Sunday, August 26, 2007

Scanning Books (first things first)



DIFFICULTY AT THE BEGINNING works supreme success,
Furthering through perseverance.
Nothing should be undertaken.
It furthers one to appoint helpers.
- The I Ching, China, around 2850 BC

Physical book scanning is a challenging topic on its own. I needed to ask around, do my homework, and stay focused on the time-consuming task at hand. Scanning technology has come a LONG way since I was in college and first had access to scanners. Back then a full-color 300 dpi scan seemed to take all afternoon to complete (actually it was a few minutes, but...). The lamps in the machines had inconsistent color, and their dynamic range was paltry. On top of that, the machines cost $5,000 or more. Although my life became deeply involved with computing and media, I didn't really touch a scanner again until years later.

About 2 years ago I purchased a cheap desktop scanner for miscellaneous office tasks, and was amazed by its performance. Scanners had come a long way in a short time!

Enough with my spiel. The point is that it's totally within technical and financial reach to attempt to scan my books. The scanners are fast, the color is consistent, and, although I haven't done any real measurements or experiments, the dynamic range of the cheap stuff seems way better than most reflective materials would actually need.

About Light (the dynamic range thing, skip ahead if this sounds boring)
You see, the range thing is important in any digital imaging, which is really what scanning is about. I can't think about this just in scanner vocabulary if I'm going to feel comfortable sinking years into this; the technology of scanning changes too fast.
The world we see with our eyes has a gigantic dynamic range of brightness, which photographers call luminosity. Everything that 'reads' light has limits on how it can process that light. The pupils of our eyes adjust to changing light, sunlight vs. nighttime for example. Film and CCDs have their own limits in interpreting luminosity values from the visible (and invisible) light spectrum. Photographers are really concerned with this, but scanners are doing precisely the same thing. Film has levels of 'opacity', which is the dynamic range photographers are again interested in. Digital photography has its own emerging vocabulary, which I know little about, so I'll skip commenting.

However, here is the important part: the 'reflectance' of printed matter. Film, as well as the CCDs in a camera or scanner, has limits to the usable dynamic range it can carry, so a reduced slice of the full range of light is the best we get. Reflective media can only deal with an even smaller slice of that range. Glossy surfaces have a greater possible range than matte surfaces, but that distinction is relatively useless to think about here.

So, in the end, using 'cheap' scanners for books is more than adequate with regard to digitized quality, because the scanners can easily trump the dynamic range available in most (if not all) reflective media- like pages of books, and other printed matter.
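
To put rough numbers on that comparison, here is a minimal sketch in Python. It relies on the standard definition of optical density, D = -log10(reflectance). The ink and paper reflectance values are made-up illustrations, not measurements, and the sensor figure is only the theoretical ceiling for an ideal linear sensor of a given bit depth; real scanners are limited by noise and land well below it.

```python
import math

def density_range(reflectance_min, reflectance_max=1.0):
    """Optical density range between the darkest and brightest tone.

    Density is D = -log10(reflectance), so the range between two tones
    is log10(reflectance_max / reflectance_min).
    """
    return math.log10(reflectance_max / reflectance_min)

def ideal_sensor_density_range(bits_per_channel):
    """Theoretical ceiling for an ideal linear sensor with this bit depth."""
    return math.log10(2 ** bits_per_channel)

# Hypothetical page (assumed values): black ink reflects 2% of the light,
# blank paper reflects 85%.
page = density_range(0.02, 0.85)
print(f"printed page: about {page:.2f} D")

# 24-bit color is 8 bits per channel; 48-bit color is 16 bits per channel.
for bits in (8, 16):
    ceiling = ideal_sensor_density_range(bits)
    print(f"{bits}-bit channel: at most {ceiling:.2f} D")
```

With these assumed values the page spans roughly 1.6 D, while even an 8-bit channel could in principle represent about 2.4 D, which is the sense in which the page, not the scanner, is the limiting factor.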

Trumping even the quality issues, many of the gory details of this discussion are moot. With most printed matter, the main goal is to get at the TEXT CONTENT. We mostly have to make sure the OCR software can read the digitized pages; who cares if we can see the paper fibers in most pages?!?

My desired scanner features for book-scanning:

+ good dynamic range
(above discussion makes this a pleasant non-issue)

+ Fast scans at 300 DPI, full color
(future posts on OCR software will explain why)

+ Non-damaging to pages and bindings
This is my personal collection, man!!!
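
For a feel of what "300 DPI, full color" means in raw data, here is a small sketch. The page size (8.5 x 11 inches) and the 24-bit color depth are assumptions chosen for illustration, and the figure is for uncompressed pixels only; saved files in a compressed format will be smaller.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi=300, bits_per_pixel=24):
    """Raw pixel data size for a scan of the given area, before compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Assumed scan area: a US letter page, 8.5 x 11 inches.
size = uncompressed_scan_bytes(8.5, 11)
print(f"{size / 1_000_000:.1f} MB of raw pixels per scan")
```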

Here's what I started with:

What I currently have is el-cheapo and isn't ideal, but it works far better than I'd thought for this task:
It's a Canon CanoScan LiDE 500F, which I believe is no longer in production. Its replacement seems to be:
Canon LiDE70 2400 x 4800dpi 48bit USB 2.0 Hi-Speed Interface Flatbed Scanner ($68 from NewEgg! Wow.)

I started scanning the first books, and here are my notes on this device, including what I found lacking (applicable to what I'd want in other scanners):
- Full flatbed scans take around 12 seconds; however, a book that covered approximately 60% of the flatbed area took about 8 seconds per scan. That's sort of a long time for a 300-page book.

- The 'scan' button on the scanner is awesome for me (the task gets fairly mind-numbing). Paired with the nice Canon scanner software, it's easy to use and set up.

- The Canon scan software is good (used with Mac OS X for now). I think it's fine for what it is; people around the web say it's some of the best stuff (read: other stuff must really suck, the Canon stuff simply works).

- I tried using the stock 'Image Capture' software bundled with OS X, which is nice too, but it is noticeably slower with the scans. It takes an extra 2-3 seconds before the scanner starts moving; likely it's loading drivers for each scan. That's too bad, since in the long run I'd rather install as little software as possible.

- The 'lip' around the glass area is tall and a wee bit bulky, making it difficult to get the book spine cleanly over the edge.

- !!! It's a flatbed. This would DESTROY some fairly rare paperbacks I own, so those books will never touch this scanner.

- The lid. Notice I detached it? It simply got in the way however I put it down on the table. When I'm not scanning, I simply set the lid back on top; the foam backing will protect the glass, etc.
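
The per-scan times in the notes above add up quickly over a whole book. Here is a rough estimate as a small Python sketch. The function name and its parameters are made up for illustration, and the pages-per-scan and handling-time values are assumptions, since the notes only give the scan time itself.

```python
def scanning_time_minutes(pages, seconds_per_scan,
                          pages_per_scan=1, handling_seconds=0.0):
    """Estimate total scanning time for a book, in minutes.

    pages_per_scan: 1 for single pages, 2 if a full spread fits on the glass.
    handling_seconds: assumed extra time per scan (page turning,
    repositioning, software startup delay).
    """
    scans = -(-pages // pages_per_scan)  # ceiling division
    return scans * (seconds_per_scan + handling_seconds) / 60

# 300-page book at about 8 seconds per scan, one page per scan.
print(scanning_time_minutes(300, 8))                        # scanner time only
# Same book if two facing pages fit in one scan (assumption).
print(scanning_time_minutes(300, 8, pages_per_scan=2))
# Same book with an extra 2.5 seconds of delay per scan (assumption,
# in the range of the 2-3 second startup lag noted above).
print(scanning_time_minutes(300, 8, handling_seconds=2.5))
```

At one page per scan, the scanner alone accounts for 40 minutes on a 300-page book, before any time spent turning pages.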

In future posts, I'll go through my URLs and research for more appropriate scanners, as well as some other devices of definite relevance.

Bibliography (CLASSIC stuff about light):

Photographic Sensitometry: The Study of Tone Reproduction
Hollis N. Todd, Richard D. Zakia, 1969
(wow, it's $5 used!)

The Print (Book 3 in the classic photographic series)
Ansel Adams