Does BHL use crowdsourced transcriptions?

We have implemented functionality to allow BHL Staff to upload transcriptions in place of the automatically-generated OCR (Optical Character Recognition) text for digitized materials in BHL’s collection. This functionality supports transcriptions generated as part of in-house or crowdsourced transcription projects hosted by BHL Partners. The Show Text tab now indicates whether the text has been:

  1. automatically generated and uncorrected;
  2. automatically generated and error corrected, by machines, which may still include inconsistencies;
  3. or manually transcribed by humans.

Please note that BHL’s OCR is generated by its Internet Archive digitization partner using Tesseract Open Source OCR (as of 2020) or ABBYYFineReader.

Web-based crowdsource transcription projects are largely managed through the following providers, DigiVol, FromThePage, and Smithsonian Transcription Center.

Example of correspondence in BHL for which a crowdsourced transcription has been uploaded in place of the automatically-generated OCR, making the document searchable and easily-readable.

Especially for archival materials, like field notes and correspondence with handwritten text, transcriptions make these items full-text searchable and enable our taxonomic name recognition software to index scientific names within their pages. Since the transcribed text can be viewed alongside the digitized page image, users can also more easily read materials with difficult-to-decipher handwriting. Thus, this new functionality makes it easier for researchers and the public to explore these valuable primary source materials and access specific information from their pages.

Interested in transcribing materials? Several BHL Partners have transcription projects on various crowdsourcing platforms. Follow the links below to explore the opportunities and get involved:

Tags: crowdsourcing, citizen science, transcription, OCR, full text search, archives