Optical Character Recognition (OCR): An Introduction

What are my OCR software options?

The following recommended tools vary by type (e.g., command-line program, desktop application, mobile application, web-based application, web browser extension) and may or may not support batch processing (i.e., OCRing multiple documents in a single processing job). Most of these resources are freely available; those that are not are marked with a dollar sign ($).

ABBYY FineReader PDF 15 ($)

ABBYY FineReader PDF 15 (previously ABBYY FineReader 15) is a commercial OCR application that uses AI-based recognition to convert image documents (e.g., photos, scans, PDF files) into editable electronic formats (e.g., Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Rich Text Format, HTML, PDF/A, searchable PDF, CSV, and plain text files). It supports text recognition in 192 languages, with a built-in spell check for 48 of them.

The Preservation, Conservation & Digitization department at Penn State University Libraries provides an on-demand OCR processing service via ABBYY. Requests are processed by the Digitization Team, and output files are delivered electronically. For questions and requests, please contact digiconversion@psu.edu.

Adobe Acrobat Pro DC ($)

Adobe Acrobat Pro DC works as a text converter, automatically extracting text from any scanned paper document or image file and converting it to editable text in a PDF. Acrobat can recognize text and its formatting. Your new PDF will match your original printout thanks to automatic custom font generation. You can work with converted PDF files in other applications, preserve the exact look and feel of your documents, and restrict editing capabilities by saving them as smart PDFs that include text you can search and copy. 

Penn State provides access to Adobe Acrobat Pro through our Adobe subscription. You can download the software directly to your computer for your own OCR project.

Tesseract

Tesseract is an open-source OCR engine that extracts printed text from images. It can be used directly from the command line or, for programmers, through an API. Tesseract does not have a built-in graphical user interface (GUI), but several are available from the 3rdParty page. You can download this program to your computer from the web.

Tesseract includes a neural net (LSTM) based OCR engine, which focuses on line recognition, as well as an engine that works by recognizing character patterns. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box." It supports several output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.
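For readers comfortable with a little scripting, the command-line usage described above can be sketched in Python. This is a minimal illustration, not an official recipe: it assumes the tesseract program is installed and on your PATH, and the function names (build_tesseract_command, run_tesseract) and file names are hypothetical. It covers only the text, hOCR, PDF, and TSV outputs.

```python
import shutil
import subprocess

# Output formats from the list above, mapped to the config name passed to
# the tesseract command. Plain text is the default, so it needs no config.
OUTPUT_CONFIGS = {"text": None, "hocr": "hocr", "pdf": "pdf", "tsv": "tsv"}


def build_tesseract_command(image_path, output_base, lang="eng", output_format="text"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG [CONFIG]."""
    if output_format not in OUTPUT_CONFIGS:
        raise ValueError(f"unsupported output format: {output_format}")
    command = ["tesseract", str(image_path), str(output_base), "-l", lang]
    config = OUTPUT_CONFIGS[output_format]
    if config is not None:
        command.append(config)
    return command


def run_tesseract(image_path, output_base, lang="eng", output_format="text"):
    """Run Tesseract on one image; raises if the program is not installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang, output_format)
    subprocess.run(command, check=True)
    return command
```

For example, run_tesseract("scan.png", "scan", lang="fra", output_format="pdf") would ask Tesseract to read a French-language scan and write a searchable PDF; Tesseract adds the file extension to the output base name itself.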

Free Online OCR

Free Online OCR (newOCR.com) is a free online OCR service based on the Tesseract OCR engine. It analyzes the text in any image file that you upload and converts it into text that you can easily edit on your computer. Free Online OCR allows unlimited uploads and accepts the following input files: image files (JPEG, JFIF, PNG, GIF, BMP, PBM, PGM, PPM, PCX); multipage documents (TIFF, PDF, DjVu); compressed files (Unix compress, bzip2, bzip, gzip), including multiple images in a ZIP archive; and DOCX and ODT files with images. It supports 122 recognition languages and fonts, multi-language recognition, mathematical equation recognition, page layout analysis (multi-column text recognition), selection of an area on the page for OCR, and page rotation. It can also handle poorly scanned and photographed pages and low-resolution images.

  • Type: Web application
  • Batch Processing: No
  • Helpful Resource(s): N/A 

tesserocr

tesserocr is a Python wrapper for the Tesseract API. You can download this program to your computer from the web.

  • Type: Python wrapper
  • Tested/Compatible Platform(s): *BSD, Debian, Linux, macOS, Ubuntu, Windows
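As a rough sketch of how a wrapper like tesserocr can be used for batch processing (OCRing several files in one job), the example below loops over the image files in a folder. It assumes tesserocr and Tesseract are installed; the helper names (default_ocr, ocr_folder) and the list of file extensions are illustrative assumptions, not part of tesserocr itself. The only tesserocr call used is file_to_text.

```python
from pathlib import Path

# File extensions treated as images here (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def default_ocr(image_path):
    """OCR one file with tesserocr, imported lazily so this module still
    loads on machines where tesserocr is not installed."""
    import tesserocr
    return tesserocr.file_to_text(str(image_path))


def ocr_folder(folder, ocr_func=default_ocr):
    """Return a {file name: recognized text} dict for every image in folder.

    ocr_func can be swapped for any function that takes a path and returns
    text, which makes it easy to try a different OCR tool.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = ocr_func(path)
    return results
```

Calling ocr_folder("scans") would process every image in a folder named scans and skip any other files it finds there.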

Neural Network OCR

Neural Network OCR trains a multi-layer perceptron (MLP) neural network to perform OCR. The training set is automatically generated using a heavily modified version of the captcha generator node-captcha. It also supports the MNIST handwritten digit database. You can download this program to your computer from the web.

  • Type: JavaScript scripts
  • Tested/Compatible Platform(s): macOS

doc2text

doc2text was created to help researchers fix common errors in poorly scanned PDFs and extract the highest-quality text possible from them. It can detect text blocks and OCR poorly scanned PDFs in bulk. You can download this program to your computer from the web.

  • Type: Python module
  • Tested/Compatible Platform(s): Ubuntu

PyOCR

PyOCR is a Python wrapper for Tesseract and Cuneiform, which simplifies the use of these OCR tools. You can download this program to your computer from the web.

  • Type: Python wrapper
  • Tested/Compatible Platform(s): GNU/Linux, macOS (probably)

Online OCR

Online OCR is a free online OCR service that extracts text from scanned PDF and image (JPG, BMP, TIFF, GIF) files no larger than 15 MB and converts it into editable Word, Excel, and Text output formats. In "Guest mode" (without registration), the service allows you to convert 15 files per hour (and up to 15 pages of a multipage file). Registration gives you the ability to convert multipage PDF documents, along with other features. Online OCR supports 46 languages, including Chinese, Japanese, and Korean. Converted documents keep the look of the original, including tables, columns, and graphics.

  • Type: Web application
  • Batch Processing: No
  • Helpful Resource(s): N/A

Copyfish

Copyfish is a free OCR tool that allows you to copy, paste, and translate text from image, video, and PDF files. The web browser extension (Chrome, Firefox, Microsoft Edge) works with every website, including videos and PDF documents. The desktop capture OCR feature, which you can install in addition to the browser extension, allows you to extract text from open documents (e.g., text and tables from brochures and leaflets that are only available as graphics), file menus, browser extensions, web pages, presentations, games, and PDF files.

This sounds complicated. Are there easier platforms to work with?

Google offers two ways to extract text using OCR:

Drive

Google Drive is a file storage and synchronization service that allows users to extract text from PDF (multipage documents) and image (JPEG, PNG, GIF) files no larger than 2 MB, as well as store files in the cloud, synchronize files across devices, and share files. 

Google Lens

Google Lens is an image recognition technology that uses visual analysis based on a neural network to extract text from images and bring up relevant information related to objects it identifies. Users can copy text once it has been recognized. Google Lens can be used as a standalone app or as an integrated feature in the Google Photos, Google Assistant, Google Image Search, and Chrome mobile apps. The mobile apps also enable translation of recognized text using Google Translate.

What about handwriting?

OCR software is typically not very good at handwriting. However, there have been recent advances in this domain! 

Transkribus users can teach the program to read a specific hand by doing some transcriptions and then cleaning up the program's initial attempts. The more work you put in, the more accurate Transkribus becomes; however, this can be laborious, and the process must be repeated with each new hand. 

OneNote (part of the Office365 suite we subscribe to) offers some handwriting recognition, though its results may not be perfect.

Evernote offers the most capable handwriting-to-text recognition engine of these options, but it excels at modern handwriting rather than historical documents.

If you have a lot of handwritten documents that you want to make machine readable, you might want to consider crowdsourcing them as a transcription project.