Here are some things to consider when planning your project:
An OCR project can take a considerable amount of time and effort, depending on the number of documents you need to OCR, whether or not you'll be digitizing the documents, the quality of the documents and image preprocessing needs, the quality of the OCR results, and whether you'll need to edit/correct the initial OCR results.
Consider the best file format for your project, based on your research needs, who you want to have access to your text, and/or how you want to make it accessible to your intended audience. For example, if you or your audience need to view the digitized document as it appears in its original form as well as search, copy, and paste text, then a searchable PDF may be suitable for your project. If you or your audience will use the OCR'd text for text mining/analysis, then a plain text file (.txt) may be best.
If you are collaborating with others on the project, it's also important that everyone is on the same page in terms of requirements and expectations for the final product. Coming up with a data management plan to ensure everyone has access to the materials after they are created is likely a good idea, too.
If you have a fairly large OCR project on your hands, drafting a project plan can help you determine and document many of the considerations above, break the project into smaller segments, explicitly account for all required pieces/resources, and stay on track.
Provided you are not breaking any copyright laws, intellectual property policies, or licensing conditions, OCR is legal. This means that you are not making complete copies of something covered under copyright and you are not publicly redistributing the original work for free. The question of 'sharing' is the biggest issue here: you cannot distribute the full text, but you can produce OCR'd materials for your research. Derived and other kinds of transformed data (such as extracted features or n-gram data) fall under the guidelines of fair use and can be shared. If your files contain sensitive or otherwise confidential information that may breach the privacy of others, you may also need to redact some of it before performing analyses.
Working on contemporary, in-copyright materials may require negotiating and acquiring permissions to use content from publishers, which can take some time.
Learn more about what is covered by copyright at our Copyright Basics page.
Accuracy in text recognition is hugely dependent on the condition and/or quality of the digital scans or the original documents you are working with. It also helps to have realistic expectations of the quality of the OCR and to plan accordingly, such as allotting time for correcting the initial OCR results and remembering that 100% accuracy is basically impossible. You can get high accuracy, but even the cleanest, clearest sheet might not get rendered correctly. Depending on the typeface, some letters can look similar, producing common errors such as mix-ups between e, a, o, and u.
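One way to put a number on accuracy for a test page is to hand-transcribe a short passage and compare it with the OCR output. The sketch below uses character error rate (edit distance divided by the length of the correct text), which is a common metric but is an assumption here rather than something this guide prescribes. The sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance as a fraction of the reference (hand-checked) text length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Hypothetical sample: a hand-transcribed line and what the OCR produced.
reference = "The quick brown fox"
ocr_output = "The qnick brovvn fox"

cer = character_error_rate(reference, ocr_output)
print(f"Character error rate: {cer:.1%}")
```

Running this on a handful of representative pages before committing to a full batch can help you estimate how much correction time to budget.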
Assess the quality of your documents before attempting to OCR them. This will give a good idea of what you'll need to do to help the OCR software produce the optimal results, such as editing your scanned images in Photoshop before OCRing.
Questions to ask of your documents before starting:
What are the structural elements of your document (e.g., headings, images, tables, captions)? These may complicate the OCR process. You may need to de-skew or crop your image before OCRing. Some OCR software does this automatically or can be configured by the user to do so, so be prepared to run some tests on especially complex pages to see what the output looks like.
Do I have any special features I am trying to capture? OCR software struggles with handwritten text, special fonts, very small fonts (e.g., 6pt), and low-contrast text. For these, it might be easier to hand-transcribe than to hand-correct the OCR.
You may be able to scan the original document using your OCR program as your scanning software, which should have the best scanning settings for its OCR processing included.
If you're scanning your document, follow these best practices:
The recommended resolution for scanning documents for optimal OCR accuracy is 300 dots per inch (dpi). However, if the text font size is particularly small (less than 10pt), a dpi of 400-600 may be best. If the resolution is too high, loading and processing the image will take more time, without improving the quality of the recognition.
A brightness of 50% is usually best. Brightness settings that are too high or too low can cause defects in the text (overexposure or obscuring, respectively) and reduce the accuracy of the OCR.
The straighter the initial scan, the better the OCR quality.
Older and discolored documents must be scanned in RGB mode in order to capture all of the image data. Grayscale mode and, especially, black and white mode can cause a loss of detail and contrast.
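The resolution guidance above can be turned into a quick calculation. The sketch below picks a starting dpi from the font size (300 by default, 400 as the low end of the 400-600 range for text under 10pt) and shows how pixel dimensions and uncompressed file size grow with dpi. The function names and the US Letter page size are illustrative assumptions, and the size estimate assumes 8-bit RGB (3 bytes per pixel) with no compression.

```python
def recommended_dpi(font_size_pt: float) -> int:
    """Starting point only: 300 dpi normally, 400 dpi for text under 10pt.
    Very small text may need more, up to about 600 dpi."""
    return 400 if font_size_pt < 10 else 300


def pixel_dimensions(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel width and height of a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_rgb_megabytes(width_in: float, height_in: float, dpi: int) -> float:
    """Approximate size of an uncompressed 8-bit RGB scan, in megabytes."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    return width_px * height_px * 3 / 1_000_000


# Example page: US Letter (8.5 x 11 inches), chosen only for illustration.
for dpi in (300, 400, 600):
    w, h = pixel_dimensions(8.5, 11, dpi)
    size = uncompressed_rgb_megabytes(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} px, about {size:.1f} MB uncompressed")
```

Doubling the dpi quadruples the pixel count, which is why scanning at a higher resolution than the text needs mostly costs time and storage.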
If you're photographing your documents, apply the principles above and follow these best practices:
If possible, use a tripod to avoid shaking.
Shoot in color and save your photos in an uncompressed format.
Shoot in ambient light (preferably daylight) that maximizes the contrast between the text and the background, and turn off the flash.
Shoot your document against a white board or piece of paper to set the white balance.
Use a white balance setting that picks up on areas that the camera thinks are white, then adjusts the color balance.
To prevent graininess, avoid using any sharpening or other contrast/clarity-boosting filters.
Position the lens parallel to the plane of the document and point it toward the center of the text. The distance between the camera and the document should usually be 50-60 cm.
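A photograph has no dpi setting of its own, so it can help to estimate the effective resolution and compare it with the 300 dpi guidance for scans. The sketch below is plain arithmetic: pixels across the document divided by the document's width in inches. The function name, the `frame_fill` parameter, and the example numbers are assumptions for illustration.

```python
CM_PER_INCH = 2.54


def effective_dpi(image_width_px: int, document_width_cm: float,
                  frame_fill: float = 1.0) -> float:
    """Estimate the dots per inch a photo achieves across a document.

    frame_fill is the fraction of the photo's width that the document
    occupies (1.0 means the document spans the full width of the frame).
    """
    if not 0 < frame_fill <= 1:
        raise ValueError("frame_fill must be between 0 and 1")
    document_width_in = document_width_cm / CM_PER_INCH
    return image_width_px * frame_fill / document_width_in


# Hypothetical example: a 4000-pixel-wide photo of a 21 cm wide page
# that fills 80% of the frame.
dpi = effective_dpi(4000, 21.0, frame_fill=0.8)
print(f"Effective resolution: about {dpi:.0f} dpi")
```

If the estimate comes out well under 300 dpi, moving the camera closer or framing the page more tightly raises the effective resolution.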
Check that your choice of OCR software adequately supports the needs of your project. This includes features like appropriate language(s), functionality, file formats, etc. Some programs may allow you to use additional language packages to supplement their language sets. (See the OCR Software page for additional details on what is available to you.)
It may be worth using specialized training data, models/patterns, and/or dictionaries to increase recognition of text in your particular set of documents. Some OCR software enables you to modify or disable default patterns and dictionaries, if they're not appropriate for your documents, and to create/import your own.
If there are a significant number of errors to correct, you'll want to take note of any patterns in errors so that you can correct them efficiently/consistently and document your process. This proofreading process can be done either in the OCR program you're using or in a text editor, preferably one with spelling and grammar checking.
Automate the checking and correction process by using text preprocessing tools for routine tasks such as spelling correction and removing unwanted characters and extra white space.
When hand-correcting the text, it's best to save the corrected text as a separate file. Always keep the original output, in case something goes awry or otherwise comes up during the correction process. Some people like to use versioning software such as GitHub to track this process.
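The automation and keep-the-original advice above can be sketched together using only Python's standard library. The cleanup rules below are hypothetical examples (stray symbols, extra spaces, one recurring misreading); you would replace them with the error patterns you actually observe in your own output. The file name and the `.corrected` naming scheme are also just assumptions.

```python
import difflib
import re
import tempfile
from pathlib import Path

# Hypothetical rules, applied in order. Replace with patterns from your own documents.
RULES = [
    (re.compile(r"[|~^]"), ""),                   # stray symbols from specks or rules
    (re.compile(r"[ \t]+"), " "),                 # collapse runs of spaces and tabs
    (re.compile(r"[ \t]+$", re.MULTILINE), ""),   # strip trailing white space
    (re.compile(r"\btbe\b"), "the"),              # a recurring misreading, as an example
]


def clean_text(text: str) -> str:
    """Apply every cleanup rule to the text and return the result."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def write_corrected_copy(original: Path) -> Path:
    """Write the cleaned text to a new file next to the original.
    The original OCR output is read but never modified."""
    raw = original.read_text(encoding="utf-8")
    corrected = original.with_name(original.stem + ".corrected" + original.suffix)
    corrected.write_text(clean_text(raw), encoding="utf-8")
    return corrected


def diff_report(original: Path, corrected: Path) -> str:
    """Unified diff of the two files, useful for documenting what changed."""
    diff = difflib.unified_diff(
        original.read_text(encoding="utf-8").splitlines(),
        corrected.read_text(encoding="utf-8").splitlines(),
        fromfile=original.name,
        tofile=corrected.name,
        lineterm="",
    )
    return "\n".join(diff)


# Demonstration on a throwaway file so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "page_001.txt"
    original.write_text("It is tbe  first | page.  \n", encoding="utf-8")
    corrected = write_corrected_copy(original)
    untouched = original.read_text(encoding="utf-8")
    cleaned = corrected.read_text(encoding="utf-8")
    report = diff_report(original, corrected)

print(report)
```

Saving the diff alongside the corrected file gives you a simple record of each correction pass, whether or not you also use version control.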