OLA-HD – An OCR-D Long-Term Archive for Historical Prints - Project details (OLA-HD)
OLA-HD is a cooperative project of the Göttingen State and University Library and the Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen. It is assigned as a module to the umbrella project 'Coordination Project for the Further Development of Optical Character Recognition (OCR) Methods' of the German Research Foundation (DFG).
High-quality, extensive research in the historical sciences requires unrestricted access to historical sources. Several cataloguing and digitisation projects have made numerous digital copies of historical prints from the 16th to the 19th century available. Particularly in the context of the 'Verzeichnisse Deutscher Drucke', not only serial indexing but also the mass digitisation of titles has been promoted. These works have been catalogued according to national bibliographic standards, and most of them have already been digitised. The bibliographic metadata of these digital copies already meet scientific requirements. The crucial next step is to make the full texts of the digitised works searchable in a targeted manner and available for further use.
Optical Character Recognition (OCR) techniques allow the mass production of full texts. However, the methods used so far have not been suitable for direct use in libraries, archives and other institutions, as the spelling differences between the texts are too great. Intensive work is being carried out on easily transferable applications that will enable high-quality mass full-text indexing of all historical prints from the above-mentioned period. As a result, the number of OCR texts is increasing rapidly. Further use requires sustainable archiving and identification of the digital copies, the bibliographic metadata, and the indexed full texts and their versions. To guarantee this, a standardised concept must be created. In addition, the availability and citability of the OCR texts is an important prerequisite for the verifiability of scientific results. This means that the existing archiving of an object with its structure, metadata and images must be supplemented by OCR texts.
Intellectual indexing, refinements and improvements to the OCR procedure, and the use of different OCR techniques all result in different versions of the same source material, which pose a new challenge for persistent identification and long-term archiving. This problem touches on research data management and requires an examination of methods and strategies for handling research data.
The above-mentioned requirements were set out in a Concept about Long-Term Preservation and Persistent Identification of OCR-Objects and implemented as an OLA-HD Prototype in order to meet the requirements of both the data holders and the users. The OLA-HD Source Code and Documentation are available on GitHub.
- Berlin-Brandenburg Academy of Sciences and Humanities
- Center for Information and Language Processing
- Friedrich-Alexander-Universität Erlangen-Nürnberg / Department of Computer Science
- German Research Center for Artificial Intelligence
- Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
- Gutenberg-Institut für Weltliteratur und schriftorientierte Medien / Abteilung Buchwissenschaft
- Julius-Maximilians-Universität Würzburg / Chair of Computer Science VI - Artificial Intelligence and Applied Computer Science
- Karlsruhe Institute of Technology
- Mannheim University Library
- Staatsbibliothek zu Berlin - Preußischer Kulturbesitz
- University of Leipzig / Department of Computer Science / NLP Group
- University of Leipzig / Institute of Computer Science / Humboldt Chair of Digital Humanities