What is optical character recognition (OCR) and how does BHL use it?

Optical Character Recognition (OCR) is the process of converting images of text into machine-readable text characters. This process is performed by specialized software such as ABBYY FineReader or Tesseract Open Source OCR. Note that OCR text produced by this automated process is uncorrected and varies in quality.

BHL uses OCR to process all the page images in our collection so that the text they contain can be indexed and made searchable. This supports full-text search and the taxonomic name-finding algorithm.
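
As a rough illustration of how OCR text can be indexed for full-text search, the sketch below builds a small inverted index in Python. It is not BHL's actual implementation; the page IDs, the sample text, and the helper names (tokenize, build_index, search) are all hypothetical and chosen only for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each word to the set of page IDs whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, ocr_text in pages.items():
        for word in tokenize(ocr_text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page IDs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical OCR output for three page images (not real BHL data).
pages = {
    "page-1": "Quercus alba is a species of white oak.",
    "page-2": "The white oak grows in eastern North America.",
    "page-3": "Qucrcus alba",  # an uncorrected OCR misreading of "Quercus"
}

index = build_index(pages)
print(search(index, "white oak"))
print(search(index, "quercus alba"))
```

Because the index is built from uncorrected OCR text, the misread word on page-3 keeps that page out of the results for "quercus alba". This is one reason OCR quality matters for search.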

Since 2020, BHL’s OCR has been generated by its digitization partner, the Internet Archive, using Tesseract Open Source OCR. OCR for items uploaded to BHL before 2020 was generated with ABBYY FineReader. To learn more about the quality of Tesseract-generated OCR, please see the BHL blog post on OCR Improvements (July 2022).

Tags: search, text mining, data mining, text recognition