The Gallic(orpor)a pipeline
Digitising documents, especially historical ones, remains an important challenge. Philologists, as well as historians and art historians, need digitisation tools in order to offer new ways of “reading” the sources they study. This paper presents a complete pipeline for mass digitisation, starting from digital facsimiles and leading to highly enriched data. In addition to the acquisition task (handwritten text recognition, HTR), we describe a procedure which transforms the text into mineable information, including lemmas, parts of speech, full morphology, named entities and linguistic normalisation. The result is distributed in various formats used by digital humanists (TEI, RDF, IIIF…), following available standards and best practices. This pipeline is primarily designed for French textual sources, whether manuscripts or early printed books, from the Middle Ages up to the Revolution, but it could also be implemented for texts in any language and of any period. Alongside the technical aspects, we also address key questions concerning the cooperation between research teams and curatorial institutions: the former produce new information while the latter preserve the original documents.