
Clean optically recognized texts

General information

This manual assumes that you already have an edited text of a source whose characters have been optically recognized (OCR-ed) in specialised software.

Format and tools for cleaning

How you clean the text for research use depends on the form in which you receive it. We generally recommend editing the .docx (Microsoft Word) output of optical character recognition software (such as ABBYY FineReader) in Microsoft Word, because its search and replace function combines a form of regular expressions (called wildcards in the software) with search by formatting (e.g., italics or superscript).

This does not prevent you from subsequently using AI-powered tools for text correction. Just be mindful that if they do not work from the original images, they are only guessing the most probable word, which might not be the one actually printed in the edition.
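If you prefer scripting to the Word dialog, the search by formatting described above can be approximated in Python. The following is a minimal sketch, not a recommended replacement for the Word workflow. It assumes that the OCR software stores italics and superscript as ordinary run properties in the .docx file, that every superscript run is a footnote mark, and that italics mark quotations. The function names and the <quote> element are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of WordprocessingML, the XML vocabulary inside a .docx file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(docx_path):
    """Return the main document part of a .docx file (which is a ZIP archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def _is_on(rpr, tag):
    """True if a toggle property such as w:i is present and not switched off."""
    el = rpr.find(f"w:{tag}", NS) if rpr is not None else None
    if el is None:
        return False
    return el.get(f"{{{W}}}val", "true") not in ("0", "false")


def paragraphs_to_marked_text(document_xml):
    """Convert paragraphs to plain text, one line per paragraph.

    Assumptions of this sketch: superscript runs are footnote marks and are
    dropped; italic runs are wrapped in a hypothetical <quote> element.
    """
    root = ET.fromstring(document_xml)
    lines = []
    for p in root.iter(f"{{{W}}}p"):
        parts = []
        for r in p.iter(f"{{{W}}}r"):
            rpr = r.find("w:rPr", NS)
            text = "".join(t.text or "" for t in r.findall("w:t", NS))
            if not text:
                continue
            valign = rpr.find("w:vertAlign", NS) if rpr is not None else None
            if valign is not None and valign.get(f"{{{W}}}val") == "superscript":
                continue  # skip the footnote mark
            if _is_on(rpr, "i"):
                text = f"<quote>{text}</quote>"
            parts.append(text)
        # Merge neighbouring italic runs into a single element.
        lines.append("".join(parts).replace("</quote><quote>", ""))
    return "\n".join(lines)


# A tiny hand-written document part, used instead of a real file.
sample = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Dixit </w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>fiat lux</w:t></w:r>'
    '<w:r><w:rPr><w:vertAlign w:val="superscript"/></w:rPr><w:t>1</w:t></w:r>'
    '<w:r><w:t>.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
print(paragraphs_to_marked_text(sample))  # Dixit <quote>fiat lux</quote>.
```

For a real file, you would pass the result of read_document_xml("edition.docx") to paragraphs_to_marked_text, where the file name is a placeholder. Check the output against the Word document, since superscript and italics may serve other purposes in your edition.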

General workflow

You generally need to:

  1. Inspect the general OCR quality, and redo the scanning and/or OCR if it is below your needs and expectations. For older or low-quality prints, a manual transcription (e.g., by a commercial company) might still be the best choice.
  2. Remove CAFE, which stands for Critical Apparatus, Footnotes and other Editorial matter.
  3. Restore the flow of text broken by page breaks and footnotes / critical apparatus.
  4. Remove footnote marks in the main text (usually numbers or letters in superscript, which your OCR software should preserve).
  5. Remove unwanted hyphens which break words into two.
  6. Remove unwanted brackets used e.g. for restoring text from another manuscript.
  7. Correct OCR errors, especially those which are widespread, such as lowercase l misrecognized as the digit 1 or uppercase I, uppercase I misrecognized as lowercase l, etc. The letters e and c are also often confused with each other.
  8. Transform useful formatting into XML-like markup. For example, if the edition marks biblical quotations in italics, this is a feature you do not want to lose in the plain text.
  9. Produce a plain text in Unicode UTF-8 encoding from the docx.
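Once the text is available as a plain string, some of the steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a complete cleaning pipeline: it assumes that paragraphs are separated by blank lines, that a hyphen at the end of a line always marks a broken word, and that only the bracket characters should go while the enclosed text stays. All function names are hypothetical. Suspect tokens are only listed, not corrected, because the right reading has to be checked against the edition.

```python
import re
from pathlib import Path


def join_hyphenated_words(text):
    """Step 5: rejoin words split by a hyphen at the end of a line.

    Assumption: every letter-hyphen-newline-letter sequence is a broken
    word. Genuine compounds split at a line end would lose their hyphen.
    """
    return re.sub(r"([^\W\d_])-[ \t]*\n[ \t]*([^\W\d_])", r"\1\2", text)


def unwrap_lines(text):
    """Step 3: turn single line breaks into spaces, keep blank lines."""
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def remove_editorial_brackets(text, brackets="[]"):
    """Step 6: delete the bracket characters but keep the enclosed text."""
    return text.translate({ord(ch): None for ch in brackets})


def flag_suspect_tokens(text):
    """Step 7: list tokens mixing digits and letters, a common OCR symptom."""
    return re.findall(r"\b(?=\w*\d)(?=\w*[^\W\d_])\w+\b", text)


def save_utf8(text, path):
    """Step 9: write the result as plain text in UTF-8."""
    Path(path).write_text(text, encoding="utf-8")


raw = (
    "In prin-\ncipio erat [verbum],\net verbum erat apud Deum.\n\n"
    "Hoc erat in principio c1amor."
)
clean = remove_editorial_brackets(unwrap_lines(join_hyphenated_words(raw)))
print(clean)
print(flag_suspect_tokens(clean))  # ['c1amor']
```

The order matters in this sketch: hyphens are joined before lines are unwrapped, because afterwards the line break that identifies a broken word is gone.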

During this process, save versions often to your repository with good descriptions, and definitely after finishing each step.

More info

If you want more information, request access to DISSINET's guidelines for text cleaning.