0

I would like to scan some old text documents. My purpose is twofold: disaster recovery (e.g. fire), and to save space on bulky documents I rarely refer to (e.g. old phone bills).

After scanning I intend to destroy some of the originals, where I rarely refer to them and they are bulky. The rest I will keep and continue referring to. I do not intend to OCR the documents.

I estimate there are a few thousand sides of A4 to scan, and I am aiming for only a few failures (missed or illegible sides) per 1000 sides scanned. By illegible I mean text that a human cannot read reliably.

I would like to do this myself rather than using a commercial service.

I believe the documents are fairly typical of what home users will have collected in their filing cabinets over the past, say, 10 or 20 years:

  • Mostly (perhaps 80%) standard paper size or close to standard size (A4, would be US letter elsewhere presumably)
  • Some bills that are longer than A4 (less than 10%)
  • A small number of "very miscellaneous" pages (less than 10%)
  • Mostly relatively flat good quality paper
  • The documents are printed on various papers since they include bills, receipts, letters, etc.
  • Many but not all documents are printed on both sides
  • A mixture of colour and black-and-white documents. Most of the documents do not use colour in an important way
  • A minority of pages with some graphics and pictures, etc. (perhaps 5 or 10%)
  • A minority of yellowed pages (less than 5%)

I would like to scan in colour because I do not want to verify that all of the colour information is unimportant. I will exclude large format documents (e.g. A3), but I would ideally like to scan bills that are longer than A4.

I don't mind scanning the "awkward cases" sheet-by-sheet but would like to save time using a sheet feeder where possible. However I anticipate that a high-end professional scanner isn't really called for. Also, as long as documents are still human-legible, damage to the paper is not very important.

Aside from dpi, what features in a scanner and sheet feeder are important for a job like this? By "features" I mean specific technical features (or performance characteristics) of the design, rather than broad categories like "reliability".

I am not looking for product recommendations. I would like to know what features are relevant for this scale of application.

Croad Langshan

3 Answers

1

If your pages (or some of them) were folded or are wrinkled (e.g. paper that dried after exposure to water or high humidity), choose a scanner with a CCD sensor rather than a CIS sensor. CCD sensors have a much greater depth of field than CIS sensors, so scanning such paper with a CIS scanner will leave blurred areas wherever the paper does not lie flat. OCR often fails in blurred areas. You might sharpen such areas with driver settings or with software, but this may still not be enough for reliable OCR. With a CCD scanner you avoid the problem in the first place.

Regarding pages longer than A4: probably all sheet-fed scanners at your price point support that. It's usually a setting in the scanner driver that switches off multi-page feed detection by length.

Comparing scanners by advertised speed (pages/images per minute) can be very misleading. Some manufacturers state it at 150 dpi, others at 200 or 300 dpi. Speed also depends heavily on the scanner driver settings you choose. Example: if you scan a newspaper/magazine article with (screen-printed) pictures/graphics at 300 dpi and aim for a small file size, you need to enable the descreen function in the driver. This will slow your scanner down considerably. Although you set 300 dpi for such a scan, the speed will be comparable to a scan at about 600 dpi (remember that we are talking about fairly inexpensive document scanners, around 500 GBP).
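To see why speeds quoted at different resolutions are not directly comparable, here is a rough sketch that counts the pixels in one A4 side at several resolutions. It assumes pixel count is a reasonable proxy for the amount of data the scanner has to capture and process; real throughput also depends on the hardware and driver, so treat the ratios as an illustration only. The function names are made up for this example.

```python
# A4 dimensions per ISO 216, in millimetres.
A4_MM = (210.0, 297.0)
MM_PER_INCH = 25.4


def pixels_per_side(dpi, size_mm=A4_MM):
    """Approximate number of pixels in one scanned side at the given dpi."""
    width_px = round(size_mm[0] / MM_PER_INCH * dpi)
    height_px = round(size_mm[1] / MM_PER_INCH * dpi)
    return width_px * height_px


def relative_load(dpi, baseline_dpi):
    """How many times more pixels a side has at dpi than at baseline_dpi."""
    return pixels_per_side(dpi) / pixels_per_side(baseline_dpi)


for dpi in (150, 200, 300, 600):
    megapixels = pixels_per_side(dpi) / 1e6
    ratio = relative_load(dpi, 150)
    print(f"{dpi} dpi: {megapixels:.1f} megapixels per side, "
          f"{ratio:.1f}x the data of 150 dpi")
```

Doubling the resolution roughly quadruples the pixel count, so a pages-per-minute figure quoted at 150 dpi says little about how the same scanner behaves at 300 dpi.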

Choose a scanner with LEDs as its light source instead of cold cathode discharge lamps, which are an older kind of lighting. LEDs have a longer lifespan and do not need any warm-up time.

0

As with any job of this importance, I would say that the reliability of the product and the company matters. (The specs don't matter if the scan quality is low or the feeder breaks.) Also, I assume (although I might be wrong, of course) that all scanners today have a high enough dpi and can output to the usual file types (JPEG for lower file size, PNG for higher quality, etc.).

However, I'd recommend taking a moment to consider whether digital preservation is reliable enough. E.g.

  • Are we sure that a DVD, HDD, or flash drive will retain its data for many years (assuming you want to keep the scans for many years)?
  • Are we sure that we'll be able to read the files a decade from now? (Think file type and hardware type: how would you read information from a floppy disk today?)

See Digital Preservation on Wikipedia. And this answer on this site.

ispiro
-1

Assuming that you intend to continue scanning incoming documents on a regular basis (if you only plan to scan old ones, you are better off having it done by a scanning service anyway):

Scan profiles, which some scanner manufacturers call scan presets, will make your work much easier and faster. With a profile/preset you save a combination of scanner driver settings for later reuse. Example: Profile A for plain black print on standard white paper, B for coloured magazine articles, C for sales slips of different sizes (e.g. auto-crop to the original size instead of scanning small slips at a standardized page size), D for thin paper with print on both sides (driver settings such as see-through or bleed-through prevention), E for documents with extra length, etc.
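As a rough illustration of what a profile amounts to, the sketch below models the five example profiles as named bundles of settings. The class, its field names and the default values are hypothetical and chosen for this example; they do not correspond to any particular scanner driver or utility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanProfile:
    """Hypothetical bundle of driver settings saved under one name."""
    name: str
    dpi: int = 300                          # assumed default, not from any driver
    auto_crop: bool = False                 # crop to the original's size
    bleed_through_reduction: bool = False   # for thin, double-sided paper
    descreen: bool = False                  # for screen-printed pictures
    long_page: bool = False                 # allow sides longer than A4


# Profiles A to E as described above; only the named setting differs.
PROFILES = {
    "A": ScanProfile("plain black print on standard white paper"),
    "B": ScanProfile("coloured magazine articles", descreen=True),
    "C": ScanProfile("sales slips of different sizes", auto_crop=True),
    "D": ScanProfile("thin paper printed on both sides",
                     bleed_through_reduction=True),
    "E": ScanProfile("documents with extra length", long_page=True),
}


def describe(key):
    """Return a readable label, the kind a text display could show."""
    return f"{key}: {PROFILES[key].name}"


for key in PROFILES:
    print(describe(key))
```

The point of the descriptive `name` field is the same as that of a text display on the scanner: a bare profile letter or number tells you nothing a few months later.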

Considering the documents you mentioned, you will probably get to the point where you need more than 9 scan profiles. Many ADF scanners offer just 9 profiles, some even fewer. Some manufacturers implement scan profiles in the driver, others in "scan utility" software. Some offer hardware buttons to choose among profiles. Many models with hardware buttons and a display just show the profile number without additional text. Will you later remember what profile 3 does? A few scanners have a display that shows text as well, so you can give your profiles descriptive names. And more than 9 profiles? That is often implemented in software, but such demands quickly take you beyond consumer-grade hardware and software.

I recommend buying a scanner whose driver already supports auto-crop. If you have to crop your scans with additional software, you have to live with a lot of compromises, so do not count on adding this feature with additional software at a later stage. Reliable auto-crop is very hard to implement at the software level alone (and requires quite some CPU power). Even if consumer-level third-party software claims to support auto-crop, you will get a lot of wrong results (from cropped too little, to cropped too much, to cropped completely at random; there is consumer and semi-professional software for around 200 USD that cropped completely at random in my tests).

Why did I not limit my answer to hardware? Because buying a scanner is not like buying a printer, as those who have not used a document scanner before might think. The print dialogue is more or less standardized, and variations are quite limited across the many printer manufacturers and models we use for our general printing needs. WIA drivers (Windows) for scanners are similarly standardized, but they give you only a fraction of your scanner's capabilities. TWAIN drivers are a completely different story. If you have no prior experience with scanner drivers and image processing, the time needed to understand and use your scanner's driver and scan utility software to its full potential can vary a lot depending on the manufacturer and even the specific model. And even after you have understood one model, you might be so lost with another that you want to throw it out of the window.

Once you have bought your scanner, you are stuck with its driver(s) and scan utility software, unless you are prepared to go beyond your budget with additional third-party software, or you are willing and able to patch your workflow with scripts or to go through the process steps manually with various free or open source tools. If you are willing to pay extra for additional image processing capabilities, more scan profiles, or more automation (file naming, distributing files to specific folders, etc.), it gets expensive quickly, because you enter a market focused on larger companies that is only slowly moving towards small companies with limited IT resources. Your scanning needs overlap with the needs of many small companies or SOHOs.
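For anyone who takes the scripting route, file naming and sorting into folders is the easiest part to automate. The following is a minimal sketch using only the Python standard library; the folder layout, the naming scheme and the function names are assumptions made for this example, not something any scanner utility prescribes.

```python
import shutil
from datetime import date
from pathlib import Path


def target_path(root, category, scan_date, sequence, suffix=".pdf"):
    """Build e.g. <root>/phone-bills/2015/2015-03-01_phone-bills_007.pdf."""
    name = f"{scan_date.isoformat()}_{category}_{sequence:03d}{suffix}"
    return Path(root) / category / str(scan_date.year) / name


def file_scan(src, root, category, scan_date, sequence):
    """Move one scanned file into the archive and return its new path."""
    src = Path(src)
    dest = target_path(root, category, scan_date, sequence, src.suffix)
    if dest.exists():
        # Refuse to overwrite an earlier scan rather than lose it silently.
        raise FileExistsError(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest


# The name a scan of a phone bill dated 1 March 2015 would receive.
print(target_path("archive", "phone-bills", date(2015, 3, 1), 7))
```

Putting the date first in ISO format means that an alphabetical listing of a folder is also a chronological one.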