Θεὸν ἐκ θεοῦ: a case study for semantic retrieval in Ancient Greek

AUTHORS: E. Scapini, F. Iezzi

WORK PACKAGE: WP4 DaMSym

URL:https://formulaiclanguagehistorical.blogspot.com/p/abstracts.html#:~:text=Salemenou%2C%20Maroula:%20Diplomatic%20correspondence%20in,formulaic%20language%20in%20geographical%20books

Abstract

In this joint paper, we present a search tool for stereotyped formulations in Ancient Greek that tolerates variation in wording as long as the meaning is preserved. WP4 DaMSym (Data Mining applied to the Nicene-Constantinopolitan Symbol) is part of ITSERR (Italian Strengthening of the ESFRI RI RESILIENCE), an infrastructure devoted to the research and development of digital tools for the Digital Humanities, particularly Religious Studies. The work package uses the creed of Nicaea and Constantinople as a case study and examines it in its various languages of ancient translation (Ancient Greek, Latin, Coptic, Arabic, Sanskrit, Church Slavonic). Our research starts from the fact that the expressions God from God, Light from Light, true God from true God are stereotypical formulations describing an x-from-x causality, where the cause reproduces itself (Barnes 2001). Although these formulations run through the 4th century in various forms, rule-based tools for verbatim retrieval such as the TLG do not allow us to collect all the possible x’s that enter x-from-x formulations. Indeed, in addition to θεὸν ἐκ θεοῦ, φῶς ἐκ φωτός, θεὸν ἀληθινὸν ἐκ θεοῦ ἀληθινοῦ, the synodical documents of the 4th century and the writings of many church authors of this period also contain expressions such as ζωὴν ἐκ ζωῆς, ὅλον ἐξ ὅλου, μόνον ἐκ μόνου, τέλειον ἐκ τελείου, βασιλέα ἐκ βασιλέως, κύριον ἀπὸ κυρίου, etc., which rule-based search tools cannot return. To address this gap in the state of the art, we have built and will make public a machine-learning-based semantic retrieval tool for Ancient Greek that ranks the phrases in a corpus by their vector similarity to the query sentences given as input. More than one query phrase can be supplied: each is embedded as a point in a multi-dimensional space and related to the corpus expressions closest to it.
We then present the first benchmarks of our work, discussing which encoder proves best suited for the purpose; introduce the sister project for Latin and our intention to combine the two systems into one; suggest the best strategies for colleagues who wish to use the tool; and list the improvements we plan to make in the future.
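
The retrieval step described in the abstract (embed several query phrases, embed the corpus phrases, rank the corpus by vector similarity) can be illustrated with a minimal Python sketch. It is not the project's implementation: the encoder here is a toy bag of character trigrams standing in for whichever learned encoder the benchmarks favour, so it captures surface overlap rather than meaning. The function names, the tiny corpus, and the choice to score each phrase by its best match against any query are all assumptions made for illustration.

```python
import math
import unicodedata
from collections import Counter


def normalize(text):
    """Lowercase and strip diacritics so accents and breathings are ignored."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_embed(text, n=3):
    """Hypothetical stand-in encoder: sparse counts of character n-grams.

    A real system would call a learned sentence encoder here instead.
    """
    padded = f" {normalize(text)} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(queries, corpus, embed=toy_embed):
    """Sort corpus phrases by their best similarity to any query (assumed rule)."""
    query_vecs = [embed(q) for q in queries]
    scored = [(max(cosine(embed(p), q) for q in query_vecs), p) for p in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative data only: two query formulations and four candidate phrases.
queries = ["θεὸν ἐκ θεοῦ", "φῶς ἐκ φωτός"]
corpus = [
    "ζωὴν ἐκ ζωῆς",
    "ὅλον ἐξ ὅλου",
    "ἐν ἀρχῇ ἦν ὁ λόγος",
    "θεὸν ἐκ θεοῦ",
]

ranking = rank(queries, corpus)
for score, phrase in ranking:
    print(f"{score:.3f}  {phrase}")
```

Because `rank` takes the encoder as a parameter, different encoders can in principle be compared on the same queries and corpus by passing another function as `embed`, provided it returns vectors in the same sparse-dictionary form that `cosine` expects.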
