Tuesday, July 26, 2011
Scanners Work In Vain
One of the arguments you sometimes hear in the brouhaha over "intellectual property"
is "well you can just scan it in", referring to copying a book or a story.
My current response to the "you can just scan it in" argument is "like Hell you can!"
For the past couple of weeks I have been trying to scan in a novella I wrote with Ernest Hogan, called "Obsidian Harvest," that we intend to publish as an ebook. On the basis of my experiences I'd say scanning digest-sized (paperback book size) pages is in fact extremely difficult. The real reason we're not overrun with scanned pirated versions of books is it's damned hard to do.
Part of the reason it took so long was that I was also in the final stages of publishing my new book titled "Shift Happens: The New E-Publishing Paradigm And What It Means For Writers." That was another "interesting" experience, albeit much less frustrating than the scanner.
My first attempt at scanning was with my quirky HP 8500 all-in-one printer-scanner-fax machine. It took me over an hour to scan in the 28 magazine pages. Then it went through OCR and the fun really began.
First off, the quality of the scan and resulting OCR was lousy. There was a mistake or two on nearly every line. Worse, whole sections of the story had simply not been picked up. At several places in the copy I was missing half a page or more. In short the scan was unusable.
Now I regularly use the scanner for contracts and such without the OCR and it has performed satisfactorily. So I assumed the OCR software that came with the machine (OEMed from IRIS) wasn't up to the job.
My next step was to go to Nuance and order OmniPage 18, a highly recommended scanning and OCR package. After some hassles getting it installed, I tried it on the PDF file the IRIS software had created. This is a file of graphic images and most OCR packages will accept it.
The results were definitely better, but there were still a lot of mistakes. And of course the gaps in the document were still there.
Okay, I recalled the famous dictum of John W. Campbell Jr.: "Always use the proper tool for the job. The proper tool to fix a television is a television repairman." I decided to take doczilla to a scanning service and have it scanned professionally.
The first place I tried was a regular commercial service. After some back and forth on the phone the guy at the service told me that basically they couldn't do it. Not only was my project too small physically for their scanners to feed, it was also too small a job for them. "Now if you had 2800 pages instead of 28 . . ." my informant told me.
After calling a couple more services I got the same response. They all dealt in letter- or legal-sized pages printed on a dead-white background and in quantities in the thousands.
Then I decided to try one of my local quick print places. They did indeed have a scanner for small quantities, but when I took it down there the answer was the same: They couldn't do it. Their problem was the paper size. It was too small to feed reliably through their sheet feeder.
In talking to the very helpful guy at the copy center, I found out why I was getting gaps in the scans. My all-in-one simply didn't have enough RAM to handle the job. When it ran out of RAM it quit OCR until it caught up -- with no warning, naturally.
Okay, I've got one final shot. I dug the original manuscript out of my files and today I'll take it back to the copy center and see if they can do that. It's a little dog-eared, but it is a clearly printed original. If that doesn't work, it's time to hire a typist.
The point of this long, rambling tale is that "just scanning it in" isn't easy, especially when you're dealing with digest-size or paperback book-size pages. While it's theoretically easy, the practice for some kinds of documents is a lot harder. It doesn't help that you've got to take the pages out of the original to get a clean scan.
There are a lot of things like this in our high-tech world where the gap between "we can do it" and "we can do it easily and routinely" is broad enough to defeat even semi-serious efforts to make it work. Just because we can do something doesn't mean it has been reduced to everyday practice, and just because something is reduced to everyday practice in one field doesn't mean it will transfer easily to another, even closely related, field.