Skip to content

Document Layout Analysis

The images of the documents are segmented (cf. fig. 9) and transcribed with Kraken via eScriptorium thanks to the CREMMA infrastructure in Paris and FoNDUE in Geneva. The data are distributed with tools developed by the HTR united project.

Ms btv1b84259980, f. 29
Fig. 1 - BNF Fr. 412 Gallica
Print btv1b86070385, f. 140
Fig. 2 - Caractères Gallica


Regarding segmentation, the SegmOnto controlled vocabulary is based on the assumption that most of the textual sources can be described in the same way if we use a codicological perspective focused on material aspects (the running title in orange, the pagination/foliation in red, headings/rubrics in green, drop capitals in pink, a main textual zone in blue and potentially notes in grey). This succeeds whether the sources are historical prints (cf. fig. 2) or manuscripts (cf. fig. 1). The SegmOnto guidelines establish basic types of zones and lines. This standardisation tackles two important problems regarding the sharing of data:

  • Upstream: research teams need to share annotated documents to improve the results of HTR by increasing the amount of training data;
  • Downstream: Research teams also need to share post-processing means for corpus exploration and automated document production/transformation (TEI, RDF, IIIF...).

SegmOnto uses a three-tier syntax to cope with a maximal number of cases in a controlled manner:

  1. Types are mandatory and offer only controlled values;
  2. Subtypes are optional, with only a suggested open list of possible values;
  3. Numbers are optional. The second column of a page would therefore be annotated as MainZone:column#1.