Represent textual versions

Basics

The full text you plan to annotate often has a quite complex history of versions.

  1. At the origin there might be a handwritten document. Represent this phase in CASTEMO as a physical Object (which may or may not survive to the present day, and whose physical properties such as size, number of pages, writing material etc. might be known).
  2. Perhaps a photographic reproduction was made of this object, which should be represented as a Resource; different representations can exist (e.g. black and white vs. in colour; different resolutions; different times and technological means of production).
  3. Then, typically, somebody made a transcription (e.g., a critical edition), either directly from the physical handwritten Object or from a reproduction. This might be a typescript, i.e. a physical Object, or a computer file, i.e. another Resource.
  4. This would then lead to a published edition (generally a Resource, while a specific exemplar is an Object).
  5. In a project planning to do corpus research, you would then reproduce this edition (usually with a scanner). These scanned images are another Resource.
  6. Then you might remove some editorial matter, correct OCR errors, and create a plain-text or XML document as a derivative (usually for private research purposes only, because of copyright limitations). This is yet another Resource, different from the original OCR.
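
The phases above can be sketched as a small data structure. This is only an illustration in Python, not the CASTEMO or InkVisitor API: the class names `EntityClass` and `Version` and all labels are hypothetical, chosen to show that each phase is a separate entity with its own class (Object or Resource).

```python
from dataclasses import dataclass
from enum import Enum


class EntityClass(Enum):
    """Hypothetical stand-in for the two entity classes discussed above."""
    OBJECT = "Object"      # a physical thing (manuscript, typescript, exemplar)
    RESOURCE = "Resource"  # a reproduction, file, or published edition


@dataclass(frozen=True)
class Version:
    """One phase in the history of a text, kept as its own entity."""
    label: str
    entity_class: EntityClass


# One possible history, following the six phases listed above.
# The labels are illustrative; a real text may have fewer or more phases.
history = [
    Version("handwritten document", EntityClass.OBJECT),
    Version("photographic reproduction", EntityClass.RESOURCE),
    Version("transcription (computer file)", EntityClass.RESOURCE),
    Version("published edition", EntityClass.RESOURCE),
    Version("scanned images of the edition", EntityClass.RESOURCE),
    Version("project-curated digital text", EntityClass.RESOURCE),
]

# Every phase is a distinct entity; none of them is merged with another.
labels = [version.label for version in history]
resources = [v for v in history if v.entity_class is EntityClass.RESOURCE]
```

In practice you would only create entities for the phases that matter to your research.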

For conceptual clarity, represent separately each of these textual phases or versions that is relevant to your research.

There is also a more abstract entity, the Territory, which typically stands for a written work. It is distinct from its various representations (physical Objects or Resources).

In CASTEMO, it is good practice to build proper relations between these entities and not mix them.

Relate Territory, physical medium, and its representations

[To be written. Until then, this placeholder screenshot of a Territory entity, Bernard Gui's Book of Sentences, can explain a lot.]

[Screenshot: Territory entity for Bernard Gui's Book of Sentences]

Difference between OCR and project-curated digital text

Do not mix the original OCR with its project-curated derivative. Represent them with two different Resource entities.

  1. There is a physical and conceptual difference between the original OCR and the project-curated digital text (with OCR errors corrected, editorial matter removed, etc.). Ontologically, they are different Rs, and also physically, they are different files. They should therefore not be represented by the same R.
  2. The full-text imported in InkVisitor should be linked to the R representing the project-curated text, not the original OCR.
  3. The two Rs are related as follows: R OCR - has - C derived version - R project-curated digital text.
  4. For project-curated digital texts, check whether your project already has a template, and if so, use it when creating a new project-curated digital text. It will speed up your work and make the data more standardized.
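
The rules above can be illustrated with a minimal Python sketch. It is not InkVisitor code: the `Resource` and `Relation` classes, the file name `register.txt`, and the `full_text_links` mapping are hypothetical names used only to show the shape of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for an R entity."""
    label: str


@dataclass(frozen=True)
class Relation:
    """A 'subject - has - concept - target' statement between two Rs."""
    subject: Resource
    concept: str
    target: Resource


# Two separate Rs: the original OCR and the project-curated text.
ocr = Resource("R OCR")
curated = Resource("R project-curated digital text")

# R OCR - has - C derived version - R project-curated digital text
derived = Relation(subject=ocr, concept="derived version", target=curated)

# The imported full text is linked to the curated R, never to the OCR R.
# "register.txt" is an invented file name for illustration.
full_text_links = {"register.txt": curated}
```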

Some complex relations between Rs and full-text documents

In most situations, we use only one project-curated digital text for one Territory (Text). However, there are some more complex situations. The well-developed CASTEMO data model should be able to handle all of them. The following text outlines such situations and gives guidance on how to solve them ontologically in the database.

The project-curated R is a division of one edition into two or more different Rs. This happens if, for example, an edition contains 4 texts in appendices which we separate and name individually, while also keeping an R that represents the OCR of the original resource containing all of them. There will then be one R for the OCR of all 4 texts together, and each text split off into its own full-text file will also have its own R. Then:

  • The R OCR will have those e.g. 4 Rs of project-curated digital texts as its derived versions (R OCR - has - C derived version - R project-curated digital text).
  • They will not have a SOE relation, as the OCR is a different textual version than the project-curated R.
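
As a rough sketch of this division case, assuming hypothetical labels and a plain tuple representation of relations (neither is part of CASTEMO or InkVisitor):

```python
# Relations are modelled here as (subject, concept, target) tuples of labels.
# All labels are invented for illustration.
ocr = "R OCR of the whole edition"
parts = [f"R project-curated digital text, appendix {n}" for n in range(1, 5)]

# The single OCR R has each of the 4 curated Rs as a derived version.
derived_relations = [(ocr, "derived version", part) for part in parts]

# No SOE relation is created between the OCR R and the curated Rs,
# because the OCR is a different textual version.
soe_relations = []
```

Each of the 4 curated Rs would then be linked to its own full-text file.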

The project-curated R is a join of two editions. E.g., editor B edited an inquisition register except for a section previously edited by editor A, so we join the two editions. In such a situation:

  • Both editors are cited in the label of the R entity, and in the folder and file names in the repository.
  • A separate R is created for edition A and for edition B of this specific part of the text. Both are by definition covered by the project-curated R, and they therefore have the project-curated R as their SOE.
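
A comparable sketch for the join case, again with invented labels and a plain tuple representation that is not taken from CASTEMO or InkVisitor. The relation label "SOE" is copied from the text above as an opaque name.

```python
# Relations are modelled here as (subject, relation, target) tuples of labels.
# All labels are invented for illustration.
edition_a = "R edition A"
edition_b = "R edition B"

# The label of the joined R cites both editors.
curated = "R project-curated digital text (ed. A and ed. B)"

# Each edition R has the project-curated R as its SOE.
soe_relations = [
    (edition_a, "SOE", curated),
    (edition_b, "SOE", curated),
]
```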