Preserving and conserving culture: first steps towards a knowledge extractor and cataloguer for multilingual and multi-alphabetic heritages

AUTHORS: S. Bergamaschi, R. Martoglia, F. Ruozzi, R. A. Vigliermo, L. Sala, M. Vanzini


Managing and sharing cultural heritage in supranational and multi-literate contexts is an active research topic. In this paper we discuss the research we are conducting in the DigitalMaktaba project, presenting the first steps in designing an innovative workflow and tool for the automatic extraction of knowledge from documents written in multiple non-Latin languages (Arabic, Persian and Azerbaijani). The tool leverages different OCR engines, text processing techniques and linguistic corpora in order to provide both highly accurate extracted text and rich metadata, overcoming typical limitations of current state-of-the-art systems. In the near future this will enable the development of an automatic cataloguer, which we hope will ultimately help to better preserve and conserve culture in such a demanding scenario.

