Optical Character Recognition (OCR): An Introduction

What is Optical Character Recognition?

Optical Character Recognition (OCR) is the electronic conversion of images of text into digitally encoded text using specialized software. OCR software enables a computer to convert a scanned document, a digital photo of text, or any other digital image of text into machine-readable, editable data. OCR typically involves three steps: opening and/or scanning a document in the OCR software, recognizing the text in the document, and then saving the OCR-produced document in a format of your choosing.

OCR is particularly useful in text and/or data mining projects and textual comparisons, and it is an important resource for creating accessible documents for blind and visually impaired persons.
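
The three steps described above (open, recognize, save) can be sketched in a few lines of Python. This is only an illustration, not any particular product's workflow: the function name run_ocr and the recognize parameter are hypothetical, and the recognition step is passed in by the caller because this guide does not assume a specific OCR engine.

```python
from pathlib import Path
from typing import Callable


def run_ocr(image_path: str, recognize: Callable[[bytes], str], output_path: str) -> str:
    """Open an image file, recognize its text, and save the result as plain text.

    `recognize` is a stand-in for whatever OCR engine you use (an assumption
    of this sketch); it receives the raw image bytes and returns text.
    """
    # Step 1: open the scanned document (here, read its raw bytes).
    image_bytes = Path(image_path).read_bytes()
    # Step 2: recognize the text using the engine supplied by the caller.
    text = recognize(image_bytes)
    # Step 3: save the result in the chosen format (UTF-8 plain text here).
    Path(output_path).write_text(text, encoding="utf-8")
    return text
```

In practice the recognize argument would wrap a call into your OCR software, and the output step could write a searchable PDF or a word-processor file instead of plain text.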

What do I need to know about OCR?

No special skills are required to use OCR software. However:

  • OCR software's ability to accurately analyze your document depends on the condition of the original and/or the quality of the digital file.

  • If you do not have a digital document, or if the one you have is of poor quality, you can scan the original document using your OCR program as your scanning software.

  • Very few OCR outputs are 100% accurate, and the software cannot check its own results. Editing and correcting the output may take a considerable amount of time for large amounts of text and/or poor-quality originals.

Some textual considerations to keep in mind:

  • Materials printed and published before 1850 often do not produce good results with OCR software.

  • Handwriting is typically very difficult for OCR to capture.

  • Documents with low contrast can result in poor OCR quality, and typescript generally results in poorer OCR than printed type.

  • Inconsistent use of font faces and sizes can lower OCR accuracy. 

  • Very few computational processes are 100% accurate, and some loss or error is to be expected! You will need to check and correct the text after the initial recognition pass. This may be tedious for large projects.
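
Before committing to correcting a large project by hand, it can help to estimate how much error to expect. One common way to do this (not the only one) is to correct a short sample by hand and compute the character error rate: the number of single-character edits needed to turn the OCR output into the corrected text, divided by the length of the corrected text. The sketch below uses only the Python standard library; the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Edits per character of the hand-corrected reference text."""
    if not corrected_text:
        raise ValueError("corrected_text must not be empty")
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)


# Two misread letters in a 19-character line.
rate = character_error_rate("Tbe quick hrown fox", "The quick brown fox")
print(f"{rate:.3f}")  # prints 0.105
```

A rate measured on a few representative pages gives a rough sense of how much correction the full text will need.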

Best practices for successful OCR:

  • The recommended scanning resolution for OCR accuracy is 300 dots per inch (dpi).

  • Brightness settings that are too high or too low can reduce OCR accuracy. A brightness of 50% is recommended.

  • The straightness of the initial scan can affect OCR quality. Skewed pages can lead to inaccurate recognition.

  • Older and discolored documents must be scanned in RGB mode in order to capture all of the image data.
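
The recommendations above can be turned into a simple pre-scan checklist. The sketch below is illustrative only: the ScanSettings fields and the preflight function are hypothetical names, and the tolerances (10 percentage points of brightness, 1 degree of skew) are assumptions chosen for the example, not values from this guide. Resolution is derived from the image's pixel width and the page's physical width.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScanSettings:
    """Hypothetical description of one scan; field names are illustrative."""
    width_px: int          # pixel width of the scanned image
    page_width_in: float   # physical width of the page, in inches
    brightness_pct: float  # scanner brightness setting, 0 to 100
    color_mode: str        # e.g. "RGB" or "grayscale"
    skew_degrees: float    # measured rotation of the page; 0 means straight


def preflight(scan: ScanSettings, discolored_original: bool = False,
              brightness_tolerance: float = 10.0,  # assumed tolerance
              max_skew_degrees: float = 1.0        # assumed tolerance
              ) -> List[str]:
    """Return a list of warnings; an empty list means no issues were found."""
    warnings: List[str] = []
    dpi = scan.width_px / scan.page_width_in
    if dpi < 300:
        warnings.append(f"resolution is about {dpi:.0f} dpi; 300 dpi is recommended")
    if abs(scan.brightness_pct - 50) > brightness_tolerance:
        warnings.append("brightness is far from the recommended 50%")
    if abs(scan.skew_degrees) > max_skew_degrees:
        warnings.append("page is skewed; straighten it and rescan")
    if discolored_original and scan.color_mode.upper() != "RGB":
        warnings.append("older or discolored documents should be scanned in RGB mode")
    return warnings


# A letter-size page (8.5 in wide) scanned 2550 pixels wide is exactly 300 dpi.
print(preflight(ScanSettings(2550, 8.5, 50, "RGB", 0.0)))  # prints []
```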