Monday, August 27, 2007

Scanning Books, (quell technolust)

NOTE:

This blog has moved permanently to a new home! You will be redirected there in 30 seconds, or just go to:

http://BOOKS.DOTIKE.NET/

Please update your bookmarks!

--


(manufacturer demo image [Remember, all demos are rigged])

THE IMAGE
Clouds and thunder:
The image of DIFFICULTY AT THE BEGINNING.
Thus the superior man
Brings order out of confusion.
- again, The I Ching, China, around 2850 BC

This is a followup to my previous post on Book Scanning, where I started with a cheap little flatbed scanner. Now I'm going to focus on Speed, and go over what I've found out about Fully-Automated scanning solutions.

Don't get too excited about this post: none of these results are practical, but all of it is interesting and worth a mention. I'd simply be a fool not to at least explore every possibility at this stage... In the end, I found that most mechanical answers just don't seem to be all they're cracked up to be. Robots will not make my problems go away, and it seems they could distract me from my real goal- scanning my book collection. I could also be very wrong with any observations below. I'm weighing my practical experiences against whatever information I can find online, and I HAVE NOT TRIED ANY OF THE SYSTEMS DISCUSSED BELOW! I'm just trying to apply common sense. Let's get on with it...

--
My little engine that could, a Canon CanoScan LiDE 500F:



To recap, good things about my current cheap flatbed setup are:
  • Simplicity. The scanner is powered completely by USB even, so only 1 cable!

  • More than adequate quality for scanning reflective materials

  • Canon Software is easy to set up for preset output (300 DPI, full color)

  • It's a great start, for almost no cost.


Scanning problems I need to solve as I continue are:
  • Speed: currently 8-14 seconds per scan, which could be much better.

  • A typical flatbed scanner is not ideal for fragile or rare materials (small-ish hardcover books are great on flatbeds btw, very durable!)

  • Large volumes have 'spine shadow' problems, and also focus problems if content lifts off the flatbed near the spine.

--
The Cut-Spine, Sheetfed Scanner Method

To get it out of the way: one solution to take seriously is to cut the spines off the books and use auto-page-feed scanners.

This approach is out of the question for me, as I plan to keep my books intact. However, it's the first approach people have brought up with me, based on Google's widely publicized successes implementing this technique for their Google Books project.
For smaller-scale projects, any Kinkos can cut off the spine for $1 per cut, using big paper-ream cutters.
Once the pages are loose, there are TONS of 'Sheetfed' or 'Document' scanners on the market, which take reams of paper to scan. Prices range from $150 to $10k+, but from what I've seen, models in the $300 range seem to be more than adequate.

But the benefits in speed here are trumped by the value of my actual books. It made sense for Google: the vast destruction was an investment in making their search engine indexing better (taxonomy and trust, right...). But actually reading, or loaning out, my book collection is far more valuable than destroying it. And then there are my autographed copies, stuff written by friends and acquaintances... all priceless to me. I'd thought of re-packing the loose pages into some binding which I could shelve, but that's just silly- it's messy enough having a lot of books... Add to this the hassle of carting books back and forth to Kinkos, a good workout for any hacker nerd like me, but I'd rather be skateboarding :)

I'm not much for the 'rare book' scene, nor am I an archivist. I mean, I really go after some relatively rare books, but I happily prefer 'reference copies' (rare stuff that's banged up with use or has notation). So this issue isn't about monetary value, it's about the fact that I simply want to keep using the books themselves- and preserve them within reason as I proceed. I must balance speed gains against a less violent solution...

--
Mechanical Page Turner Systems
So my thinking next is, if I somehow can go full-auto with the scanning, I can spend time on the other aspects of this project... It seems my thought is not a new one, people have been trying to make good automated page-turners for A LONG TIME.

For example, take this late-20th-century design from engineering at MIT (which cites some Da Vinci invention I can't find anywhere):
"Automatic page turner could help musicians and the disabled"
http://web.mit.edu/newsoffice/1999/pageturner.html
The article is about mechanical engineering professor Ernesto E. Blanco, and his fantastic page turner- (which seems not to exist as a product, and designs are not available). So it either is very special purpose, as it was created for turning sheet music, *or* it outright sucks.

--
Lego my ego, (is that a Beach Boys song?)

Some people's perseverance and ingenuity blows my mind. A man made a fully auto, page-turning book scanner- out of Legos. Yes, Legos. Reading the webpage about it nearly made me weep in awe.



Seriously folks, please go look at the webpage of MURANUSHI Takayuki, featuring his book scanner. I almost want to go to Japan, just to buy this guy a drink.

Now for me, I'm not about to dive into any Lego engineering, so let's move on to some commercial automated offerings.

--
With page-turning on the brain, this looks pretty darned cool- and has loads of buzz about it on the net:

Atiz BookDrive, a portable automatic book scanner
http://www.atiz.com/bookdrive.php


OK, let's look at the specs from their site:


For my project, it seems their machine meets my needs about as well (or poorly) as my cheap-o flatbed, except this thing is fully automated. Just imagine, put in a book, and press 'Go!'!
  • It can't handle my many fragile rare books.

  • It requires Windows (I'll accept Mac, but would love UNIX/X11/Shell compatibility)

  • Cruddy advertising rambles regarding Speed, (look at the PDF for full specs!)

    • 1 page at 300 DPI (color?) takes 32.5 sec., according to their docs.

    • I can't tell if this is for both open pages at once, (16 sec. per page then?)


  • Can't find many user reports regarding page jams, practical experiences

  • Internet Rumored Pricing, (have to email for quote) $35,000

  • Suffers from software/hardware integration problems. (I'll explain more about why they should be as separate as possible, once I start posting about Book Processing- the software part...)
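
To put those speed numbers side by side, here's a rough back-of-the-envelope sketch in Python. It uses only the timings quoted in this post (32.5 seconds from the Atiz docs, 8-14 seconds per scan on my flatbed). The function name and the pages_per_scan parameter are my own inventions for illustration, and it ignores page-turning time on the flatbed and any jams on the machine, so treat it as a sanity check, not a benchmark.

```python
# Rough throughput comparison using only the timings quoted in this post.
# ASSUMPTION: pages_per_scan is hypothetical. 1 = one page per pass,
# 2 = both open pages captured in a single pass.

def pages_per_hour(seconds_per_scan: float, pages_per_scan: int = 1) -> float:
    """Pages captured in one hour of uninterrupted scanning."""
    return 3600.0 / seconds_per_scan * pages_per_scan

scenarios = [
    ("Atiz, 32.5 s covers one page", pages_per_hour(32.5, 1)),
    ("Atiz, 32.5 s covers both open pages", pages_per_hour(32.5, 2)),
    ("Flatbed, 14 s per scan (worst case)", pages_per_hour(14.0, 1)),
    ("Flatbed, 8 s per scan (best case)", pages_per_hour(8.0, 1)),
]

for label, rate in scenarios:
    print(f"{label}: {rate:.0f} pages/hour")
```

Even under the generous reading (32.5 seconds for both open pages), the quoted figure doesn't obviously beat the raw scan time of the cheap flatbed. The real difference is that a machine doesn't need me sitting there turning pages.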



I don't think this is a good solution for me at all. Perhaps it's great for a research library, with massive volumes of similar-sized textbooks to scan, and appropriate budgets for this kind of thing- but I still consider it an 'experimental' technology, until some independent source shows some real progress with it.
Someone said to me I'm just whining because I can't afford one. While that truly isn't my attitude, the price to value to risk ratio here is just outside the scope of sanity for me. (If this product really worked well, I may as well spend the next year or two working my knuckles to the bone, eating rice every day- and buy one, it will take me years to scan all my books anyhow).

--
Mucho Macho Product, the Orgasmatron of book scanning:

The Kirtas-Tech "APT 1200" and "APT 2400" models.
http://www.kirtas-tech.com/ (Just like Atiz, awesome animations on their site)


From their website literature:
"The APT BookScan 1200™ is based on a disruptive digital imaging technology initially developed at Xerox PARC and protected by several patents which have already been issued and some pending."
The APT BookScan 1200 and 2400 cost $75,000 and $189,000, respectively (again, web-rumors, you have to email their sales reps for a quote).
Disruptive digital imaging technology?!? Sounds kind of awesome- but does this mean it will eat my books, like paper jamming in a printer? What if my cat jumps in there while it's on?
I'm scared. However, this machine WOULD LOOK AWESOME in my living room. I love sexy industrial design. But seriously, I simply can't take this thing seriously.

Let's step back from the mechanical page-turning idea:

On the topic of mechanical page turning, a good friend said something like:
"I don't think it's really possible- at least not in a way which helps people scan books faster than a flatbed. Pages stick together and are inconsistent, different thicknesses, etc... Heck, the design of books is a somewhat poor user interface to begin with."

After handling a lot of books, and thinking a lot about handling them, I tend to think he's right- ESPECIALLY after watching one of the big-machines in action on YouTube, (important: note the human hand required to keep pages down for this particular book, what's the point?):

(I don't have $70k for this, but run with me here...) For the $70-189k price tag, I could just hire a small army of high-school kids in my Brooklyn neighborhood to do the flatbed scanning as an after-school job, and likely get WAY more speed and value for my scanned dollar, while investing in my neighborhood economy. I'm certain the high-school kids on my block would do a far better job than any state-of-the-art mechanical device.

Dontcha' love rigged demos? I couldn't resist this one :)

--
Copy-Stand-ish solutions:

There are tons of other high-end solutions for book scanning, but right off the bat they don't look like they solve my problems, as they are aimed at high-end conservation/archivist applications. Many of them look like fancy photo copy-stands, and look good for massive reprography applications as well. They also seem to go by the moniker "Face-Up Publication Scanners". One has to manually turn the pages (good), but these all-in-one units are again prohibitively expensive (think $12,000 range).

I won't get too deep into this tangent, but here's a look at some well-received models, for the record:

"Bookeye Color Planetary Scanner"
http://www.dlsg.net/bookeye.htm


Minolta PS 7000 Digital Publication Scanner
http://www.mid-america.com/scanning/minolta_book_scanners.html


http://www.microfilmshop.com/index.asp?PageAction=VIEWPROD&ProdID=231


--
Time to move beyond all this technolust and focus on solving the problem- finding better tools for Non-Destructive Home Book Scanning.

I'll focus in on more practical solutions, in my next posts regarding Scanning Books!

--
P.S. Regarding the Kraftwerk images in this post, I dearly love early Kraftwerk (pre Autobahn album, 1975). Their work after 1975 ties directly into fueling what I feel is a totally annoying ideology for the Western world, schizophrenic "Techno-Worship/Nihilism"- the ancestral ideals behind today's "Cult of the Algorithm". But this is a personal rant, and not really relevant in this blog about my book scanning... (or is it?) I'm merely stating, I don't believe in technology.


Sunday, August 26, 2007

NY Times Says... "Scan This Book!"


This article really interests me, plainly because it shows that other people REALLY want the same thing I want. (I feel less crazy when people ask me why I'm scanning, now that the NY Times says it's OK :) It's a year old, but WAY relevant today- they explore the massive idea of the Internet becoming "the universal library".

Seriously though, this article is really interesting in a few areas, so I'll call it 'The Pandora's Box Article'. The author blows the lid open on several issues, and follows what I see as some noteworthy academic trends in thinking (however flawed or naive the ideas are, they represent ideas that have major backing- it's the current discourse). I love some of the thoughts presented, but I hate much of the tone and the ideologically revealing vocabulary:

New York Times, May 14, 2006
"Scan This Book!"
By KEVIN KELLY


"The universal library should include a copy of every painting, photograph, film and piece of music produced by all artists, present and past. Still more, it should include all radio and television broadcasts. Commercials too. And how can we forget the Web? The grand library naturally needs a copy of the billions of dead Web pages no longer online and the tens of millions of blog posts now gone — the ephemeral literature of our time. In short, the entire works of humankind, from the beginning of recorded history, in all languages, available to all people, all the time."

Wow.

One of my favorite stories is the one about the Tower of Babel.

--
Regardless of the usual techno-hubris oozing out of this article, the author does hit on most of the big current thoughts in book/library digitization. Let me rip through them (though each topic deserves deeper discussion):

1. Scanning the Library of Libraries
This is the Tower of Babel thing. The aim is admirable, but so much of the vocabulary used in this kind of description leads me to believe many people truly aren't comprehending the consequences, or the scale. Many of them feel there will be 'an end' to it all, but media breeds media- exponentially- and it seems to have been that way forever. Additionally, every act of cataloguing leads to some sort of revision, within the scope of understanding of its creators. Therefore, once something like 'the universal library' exists, it will make itself obsolete by revealing its own inadequacy.
Oh btw- we gotta' hire a LOT of people in China to scan everything at a cut rate so we can accomplish this humanitarian goal. Let's go, chop chop. :|

2. What Happens When Books Connect
Yeah, hyperlinks yadda yadda. But machine thinking sucks for deciding the relationships... (AI?)

3. Books: The Liquid Version
Search is good, byte streams beat paper in the usefulness category- hands down. I find this discussion really interesting, insomuch as it touches on the borders around the frame of a work, which have been blurred by the internet. This is right in line with thoughts I've had about the beginning of the conceptual breakdown of the Filesystem metaphor, whose hierarchy is now being challenged by various technologies (relational database design, the database storage of the BeOS Filesystem, the MacOSX 'Smart Folders' which are merely a relational collection, etc...). This stuff isn't new, expect a blog post in the future on this topic...

4. The Triumph of the Copy
Gutenberg, yeah! Then came copyright. These stats are really cool, and make me feel less crazy with my esoteric tastes:
"In the world of books, the indefinite extension of copyright has had a perverse effect. It has created a vast collection of works that have been abandoned by publishers, a continent of books left permanently in the dark. In most cases, the original publisher simply doesn't find it profitable to keep these books in print. In other cases, the publishing company doesn't know whether it even owns the work, since author contracts in the past were not as explicit as they are now. The size of this abandoned library is shocking: about 75 percent of all books in the world's libraries are orphaned. Only about 15 percent of all books are in the public domain. A luckier 10 percent are still in print. The rest, the bulk of our universal library, is dark."

5. The Moral Imperative to Scanning
Google's scan plan,

6. The Case Against Google
Oy vey, another case? The skinny is interesting. Google needs 'Good' (trusted) information to make its search engine (and AdSense) better. So, when they scanned some 70% of the world's copyright-protected books for themselves, the world was too focused on how they accomplished this MASSIVE feat using nitty-gritty methodology (and with astonishing speed- most of it done in 9 months or so). Then when they made the book search another 'Public Beta', everyone freaked out, and the mainstream (and market) focused on copyright/humanitarian issues with content production.
What most people missed, is that this massive archive was created TO MAKE THEIR SEARCH INDEXES BETTER. It's a taxonomy and statistics algorithms thing. Interestingly enough, they found the copyright loophole- nobody said they couldn't use copyrighted materials privately to make their information business make money! Very interesting trick, that most people still don't understand.

7. When Business Models Collide
Oy vey- this is a HUGE topic, this is DRM, Copyright, DMCA, RIAA, MPAA, Hollywood, and the Publishers, (he leaves out the Telcos- who vie for a massive stake in content).

8. Search Changes Everything
This is basically the discussion of an idea I think is really dirty, the idea that the search engine really helps us 'Find What we are looking for'. Kind of, but this again is where I'll cite Babel and promise to elaborate in another post- with one of my favorite stories from my misadventures in library science...

--
Look, I may sound negative, but I REALLY LIKE this author- this article puts him right in the middle of the things I think and care about in media, software, networks, and in general, information culture.

I've got a fistful of blog posts spawned from thoughts in this one- like I said at the start, this is 'The Pandora's Box Article' for me...


Scanning Books, (first things first)


DIFFICULTY AT THE BEGINNING works supreme success,
Furthering through perseverance.
Nothing should be undertaken.
It furthers one to appoint helpers.
- The I Ching, China, around 2850 BC

Physical book scanning is a challenging topic on its own. I needed to ask around and do my homework, and stay focused on the time-consuming task at hand. Scanning technology has come a LONG way since I was in college and first had access to scanners. Back then a full-color 300dpi scan seemed to take all afternoon to complete (actually it was a few minutes, but...). The lamps in the machines had inconsistent color, and their dynamic range was paltry. On top of it, the machines cost $5,000 or more. Although my life became deeply involved with computing and media, I didn't really touch any scanners until years later.

About 2 years ago I purchased a cheap desktop scanner for miscellaneous office tasks, and was amazed with its performance- scanners had come a long way in a short time!

Enough with my spiel, the point is- it's totally within technical and financial reach to attempt to scan my books. The scanners are fast, color is consistent, and without doing any real measurements or experiments, the dynamic range of the cheap stuff seems way better than most reflective materials would actually need.

--
About Light (the dynamic range thing, skip ahead if this sounds boring)
You see, the range thing is important in any digital imaging, which is really what scanning is about. I can't think about this just in scanner vocabulary if I'm going to feel comfortable sinking years into this, because the technology of scanning changes too fast.
The world we see with our eyes has a gigantic dynamic range, photographers call it luminosity. Everything which 'reads' light has limits to how the light is processed. Our eyes' pupils adjust to changing light, sunlight vs. nighttime for example. Film and CCDs have their own limits interpreting luminosity values from the visible (and invisible) light spectrum. Photographers are really concerned with this, but scanner devices are doing precisely the same thing. Film has levels of 'opacity', which is the dynamic range photographers are again interested in. Digital photography has its own emerging vocabulary, which I know little about- so I'll skip commenting.

However, the important part is here- the 'reflectance' of printed matter. Film, as well as the CCDs in a camera or scanner, all have limits to the usable dynamic range they can carry, so a reduced slice of the light spectrum is the best we get. Reflective media can only deal with an even smaller slice of that range. Glossy things have greater range possibilities than matte surfaces, but it's relatively useless to think about here.

So, in the end, using 'cheap' scanners for books is more than adequate with regard to digitized quality, because the scanners can easily trump the dynamic range available in most (if not all) reflective media- like pages of books, and other printed matter.
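
Since I said I haven't done any real measurements, here's at least a rough sketch of the arithmetic in Python. It uses the standard definition of optical density (the negative base-10 log of reflectance). The reflectance numbers for paper white and ink black are illustrative guesses of mine, NOT measurements, and the bit-depth figure is only a theoretical ceiling for an ideal sensor- real scanners lose range to noise. All function names are made up for this example.

```python
import math

# ASSUMPTION: illustrative reflectance values, not measured from any book.
PAPER_WHITE = 0.85   # fraction of light reflected by blank paper
INK_BLACK = 0.03     # fraction of light reflected by solid black ink

def optical_density(reflectance: float) -> float:
    """Standard optical density: D = -log10(reflectance)."""
    return -math.log10(reflectance)

def density_range(r_white: float, r_black: float) -> float:
    """Density span between the brightest and darkest parts of a page."""
    return optical_density(r_black) - optical_density(r_white)

def bit_depth_ceiling(bits_per_channel: int) -> float:
    """Theoretical maximum density range an ideal, noise-free sensor
    could encode with the given bits per channel."""
    return math.log10(2 ** bits_per_channel)

page = density_range(PAPER_WHITE, INK_BLACK)
print(f"Printed page density range (assumed values): {page:.2f}")
print(f"8 bits/channel ceiling:  {bit_depth_ceiling(8):.2f}")
print(f"16 bits/channel ceiling: {bit_depth_ceiling(16):.2f}")
```

With those assumed values the page needs far less range than even an 8-bit channel can encode on paper, which is the whole point: for reflective material, the scanner is very unlikely to be the bottleneck.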

Trumping even the quality issues, many of the gory details of this discussion are moot. With most printed matter, the main goal is to get at the TEXT CONTENT. We mostly have to make sure the OCR software can read the digitized pages; who cares if we see the fibers of paper in most pages?!?


--
My desired scanner features for book-scanning:

+ good dynamic range
(above discussion makes this a pleasant non-issue)

+ Fast scans at 300 DPI, full color
(future posts on OCR software will explain why)

+ Non-damaging to pages and bindings
This is my personal collection, man!!!


--
Here's what I started with:

What I currently have is el-cheapo, and isn't ideal, but works far better than I'd thought for this task:
It's a Canon CanoScan LiDE 500F, which I believe is no longer in production- its replacement seems to be:
Canon LiDE70 2400 x 4800dpi 48bit USB 2.0 Hi-Speed Interface Flatbed Scanner ($68 from NewEgg! Wow.)


I started scanning the first books, and here are notes on what I found lacking in this device (applicable to what I'd want in other scanners):
- Full flatbed scans take around 12 seconds, however, a book which took approximately 60% of the flatbed area took about 8 seconds per scan. That's sort of a long time, for a 300 page book.

- The 'scan' button on the scanner is awesome for me (the task gets fairly mind-numbing). Paired with the nice Canon scanner software, it's easy to use and set up.

- The Canon scan software is good (used with Mac OSX for now). I think it's fine for what it is, people around the web say it's some of the best stuff- (read: other stuff must really suck, the Canon stuff simply works).

- I tried using the stock 'Image Capture' software bundled with OSX, which is nice too- but it is noticeably slower with the scans. It takes an extra 2-3 seconds before the scanner starts moving, likely it's loading drivers for each scan. That's too bad, I'd rather aim at installing as little software as possible, long-run.

- The 'lip' around the glass area is tall, and a wee bit bulky- making it difficult to get the book spine clean over the edge.

- !!! It's a flatbed, this will DESTROY some fairly rare paperbacks I own, so those books will never touch this scanner.

- The lid. Notice, I detached it? It simply got in the way however I put it down on the table. When I'm not scanning, I simply set the lid back on top- the foam backing will protect the glass etc...
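
To make "sort of a long time" concrete, here's a minimal Python sketch of the math for a whole book, using the 8 and 12 second timings from my notes above. The pages_per_scan and handling_seconds parameters are assumptions I'm adding for illustration (I haven't timed how long it takes me to flip a page and reposition the book), and the function name is made up.

```python
import math

def book_scan_minutes(pages: int,
                      seconds_per_scan: float,
                      pages_per_scan: int = 1,
                      handling_seconds: float = 0.0) -> float:
    """Estimate total minutes to scan a book.

    pages_per_scan   -- ASSUMPTION: 1 page per pass, or 2 if a whole
                        spread fits on the glass.
    handling_seconds -- ASSUMPTION: time spent turning the page and
                        repositioning the book between scans.
    """
    scans = math.ceil(pages / pages_per_scan)
    return scans * (seconds_per_scan + handling_seconds) / 60.0

# A 300 page book, scanner time only:
print(book_scan_minutes(300, 8))    # partial-bed scans
print(book_scan_minutes(300, 12))   # full-bed scans

# Same book with a hypothetical 4 seconds of page handling per scan:
print(book_scan_minutes(300, 8, handling_seconds=4))
```

So a single 300 page book is somewhere in the range of most of an hour of pure scanner time at one page per pass, before counting any handling. Multiply that by a few shelves and the speed problem is obvious.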


In future posts, I'll go through my URLs and research for more appropriate scanners, as well as some other devices of definite relevance.

--
Bibliography, (CLASSIC stuff about light):

Photographic Sensitometry: The Study of Tone Reproduction
Hollis N. Todd, Richard D. Zakia, 1969
http://www.amazon.com/Photographic-Sensitometry-Study-Tone-Reproduction/dp/0871001802
(wow, it's $5 used!)

The Print, (Book 3 in the classic photographic series)
Ansel Adams
http://www.amazon.com/Print-Ansel-Adams-Photography-Book/dp/0821221876/


About This Blog


I spent a few years with the majority of my books in boxes, and lived off my hard drives. Not only did I get very comfortable reading works on-screen, but I became addicted to how malleable the digitized content is. A simple text search is an astoundingly powerful thing.

Now I want MY BOOKSHELF searchable, hyper-linked, hackable.

--
This blog seemed like the most appropriate place to collect my notes, and document my progress, as I begin a very long-term project- digitizing my book collection. Eventually, I would like to ideally extend the project so that every piece of media I own is catalogued, and the content searchable in useful ways. I've been told by many people this is a fairly ambitious undertaking. After hacking around the idea a bit, I get it.

--
This project has 3 main components (which I'll treat separately):

1) Book Scanning
The act of physically scanning the books is itself a challenge. My aim here is completeness, while balancing various losses in the translation from physical to digitized form. Additionally, minimally damaging processes are my focus, as this is my personal book collection.
There are a great many tools and sources of information, of all shapes and sizes, worth exploring- but at the end of the day, the job simply has to get done...

2) Book Processing
This encompasses storage, OCR processing, data storage formatting, and presentation formatting. My aim here is to siphon content value mechanically as much as possible, (e.g. I don't plan to proofread it all in my lifetime), however I do wish to treat some texts, even some fragments of texts, with incredible care.
Cumulative 'polishing' is my gameplan here- to constantly make it simple to update and clean the data as I use it.

3) Media Indexing
This is what I like to call 'The Amazon Challenge': to create a usable catalog of the book media, which naturally leads to cataloging *all* my media (just as it led Amazon to catalog nearly every product imaginable). I've got between 3 and 4TB of data CDs and DVDs stretching back through 15 years of my life, with all sorts of things I'd love to be able to find in there...
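
As a taste of where components 2 and 3 are headed, here's a minimal Python sketch of the "simple text search" idea. It assumes the OCR step has already produced one plain-text file per book (or per page) somewhere under a single folder. That layout, and the function name, are hypothetical; I haven't settled on any storage format yet.

```python
from pathlib import Path

def search_bookshelf(root, query):
    """Yield (file, line_number, line) for every line under `root`
    that contains `query`, ignoring case.

    ASSUMPTION: OCR output lives as UTF-8 .txt files somewhere
    below `root`; this layout is hypothetical.
    """
    needle = query.lower()
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the search going past OCR junk bytes
        with path.open(encoding="utf-8", errors="replace") as fh:
            for number, line in enumerate(fh, start=1):
                if needle in line.lower():
                    yield path, number, line.rstrip("\n")
```

Looping over search_bookshelf("ocr_text", "dynamic range") and printing each hit would give a grep-like listing across the whole shelf. It's a brute-force scan with no index at all, which is fine for a first pass and easy to replace later.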

Each aspect of this project has its own particularities- and after hacking around for a month or so on this, and spending a great deal of time researching various aspects of this project, I'm excited to start getting somewhere.

--
People have already repeatedly asked me, why the heck are you doing this?
First off, because it's fun.
I spend a lot of time in my books, I'm pretty nerdy and a media junkie. My tastes are more and more esoteric as I get older, and I find mainstream internet companies and services understandably disappointing. For example, Amazon's suggestions can't figure me out at all- no matter how involved I get in their site (or how many books I buy).

At this point in my life, I find myself referencing a wide array of my own materials all the time- and I constantly am frustrated by how much I miss. I believe I won't have time to re-read much of what's on my bookshelves in the rest of my life, yet so many works have relevant components I re-visit all the time. The more I reference works, the more I find other things in the materials which I really wanted perhaps a month earlier...

So with that, I see this as a way to make computing machines serve me better. I hope the notes presented here can help others who have similar aims and projects!

If you wish to contact me, please leave a comment on a particular post, and I'll try to get back to you!
