When you scan a document onto your computer, the computer reads it as an image file. To the computer, it’s a meaningless pattern of pixels. Optical Character Recognition (OCR) is the process of turning a picture of text into an actual text file. In other words, it produces something like a TXT or DOC file from a scanned JPG of a printed or handwritten page.
Once a printed page is in machine-readable text form, you can do much more with the document:
- Search through it by keyword
- Edit it with a word processor
- Incorporate it into a web page
- Compress it into a ZIP file
- Send it by email
Most people don’t need to use OCR on an industrial scale. It’s more likely you’ll want to use OCR to convert printed articles into an editable format, or to scan something to be republished as a web page.
In Practice, This Is What Everyday OCR Actually Involves:
- Printout: The quality of the original printout makes a huge difference in the accuracy of the OCR process. Dirty marks, folds, coffee stains, ink blots, and any other stray marks will all reduce the likelihood of correct letter and word recognition.
- Scanning: You run the printout through your optical scanner. Sheet-feed scanners are better for OCR than flatbed scanners because you can scan pages one after another. Most modern OCR programs will scan each page, recognize the text on it, and then scan the next page automatically. If you’re using a flatbed scanner, you’ll have to insert the pages one at a time by hand.
- Two-color: First, OCR generates a black-and-white (two-color/one-bit) version of the color or grayscale scanned page, similar to what you’d see coming out of a fax machine. OCR is essentially a binary process: it recognizes things that are either there or not. If the original scanned image is perfect, any black it contains will be part of a character that needs to be recognized, while any white will be part of the background. Reducing the image to black and white is the first stage in figuring out the text that needs processing. This step can also cause problems. If you have a color scan of a newspaper with a large brown coffee stain over the words, it’s easy to tell the text from the stain by color. Once you reduce the scan to a black-and-white image, the stain turns to black and white too and may confuse the OCR process.
- OCR: All OCR programs are slightly different. Generally, they process the image of each page by recognizing the text character by character, word by word, and line by line. In the mid-1990s, OCR programs were so slow that you could literally watch them “reading” and processing the text while you waited. Computers are much faster now and OCR is pretty much instantaneous.
- Basic error correction: Some programs give you the opportunity to review and correct each page in turn. They process the entire page, then use a built-in spellchecker to highlight any apparently misspelled words, which may indicate a misrecognition. You can then correct each flagged mistake.
- Layout analysis: Good OCR programs automatically detect complex page layouts. Examples include multiple columns of text, tables, images, and so on. Images are automatically turned into graphics, tables are turned into tables, and columns are split up correctly.
- Proofreading: Even the best OCR programs aren’t perfect, especially when they’re working from very old documents or poor-quality printed text. Therefore, the final stage in OCR should always be a good, old-fashioned human proofread.
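To make the two-color step above more concrete, here is a minimal Python sketch of reducing a grayscale scan to a one-bit image with a fixed cutoff. It is only an illustration, not how any particular OCR program works: the tiny `page` grid, the pixel values for text, stain, and paper, and the helper names `binarize` and `render` are all made up for this example.

```python
# Grayscale values: 0 = black, 255 = white.
# These three values are invented for illustration only.
TEXT, STAIN, PAPER = 20, 110, 240

# A tiny hypothetical "scan": a vertical stroke of text on the left,
# and a coffee stain on the right with one text pixel inside it.
page = [
    [PAPER, TEXT, PAPER, STAIN, STAIN],
    [PAPER, TEXT, PAPER, STAIN, TEXT],
    [PAPER, TEXT, PAPER, STAIN, STAIN],
]


def binarize(gray_rows, threshold):
    """Return a one-bit image: 1 (black) where a pixel is darker than
    the threshold, otherwise 0 (white)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]


def render(bits):
    """Draw the one-bit image as text: '#' for black, '.' for white."""
    return "\n".join("".join("#" if b else "." for b in row) for row in bits)


# A mid-range cutoff treats the stain as black, so it merges with the text.
naive = binarize(page, threshold=128)

# A lower cutoff keeps only the darkest pixels, so the stain drops out.
careful = binarize(page, threshold=64)

print(render(naive))
print()
print(render(careful))
```

With the mid-range cutoff, the stain becomes a solid black block and swallows the text pixel inside it. With the lower cutoff, the stain disappears and that pixel is visible again. Real scans are rarely this clean, which is part of why the quality of the original printout matters so much.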
Get Customized Document OCR Scanning Software For Your Business
Our network of scanning service professionals has extensive experience in helping businesses of all sizes migrate to a paperless office or digital filing system. We use proven methods combined with the latest scanning software and equipment. This helps create a useful document management system that will change the way you do business.
To get started, click the button below, fill out the form, or give us a call at (866) 385-3706, and we will send you FREE no-hassle quotes for your scanning job.