Challenging the Abilities of Large Language Models in Italian: a Community Initiative

AUTHORS: Maria Cassese, Giovanni Puccetti, Andrea Esuli

WORK PACKAGE:

URL: https://arxiv.org/abs/2512.04759

Keywords:

Abstract

The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. “Challenging the Abilities of LAnguage Models in ITAlian” (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource — the most comprehensive and diverse benchmark for Italian to date — and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.

Tîrôš nel Talmud babilonese: ‘vino’ o ‘mosto’?

AUTHORS: Andrea Ravasco

WORK PACKAGE: WP 3

URL: https://www.analisilinguisticaeletteraria.eu/index.php/ojs/article/view/716

Keywords: Tirosh, Wine, Must, Babylonian Talmud, Alcoholic Beverages

Abstract

Among the various terms for alcoholic beverages in Biblical Hebrew is tîrôš, whose etymology and meaning are still uncertain: it is normally translated as ‘must’, but it sometimes seems to be used as a synonym of Biblical Hebrew yayin or Biblical Aramaic ḥamrā’, both meaning ‘wine’. This article analyzes the attestations of tîrôš in the Babylonian Talmud in order to determine whether there it denotes a drink other than wine or is a synonym of it. The attestations of tîrôš, and the Rabbinical exegesis of the biblical quotations in which the term appears, clearly show that in the Babylonian Talmud the original meaning of tîrôš has been lost and that the term is used as a synonym of ‘wine’.

LEAP: Linear Equations for Classifier Accuracy Prediction under Prior Probability Shift

AUTHORS: Lorenzo Volpi, Alejandro Moreo, Fabrizio Sebastiani

URL: https://link.springer.com/article/10.1007/s10994-025-06878-y#auth-Lorenzo-Volpi-Aff1

WORK PACKAGE: WP8

Keywords: Classifier accuracy prediction, Prior probability shift, Label shift, Quantification

Abstract

The standard technique for predicting the accuracy that a classifier will have on unseen data (classifier accuracy prediction, CAP) is cross-validation (CV). However, CV relies on the assumption that the training data and the test data are sampled from the same distribution, an assumption that is often violated in real-world scenarios. When such violations occur (i.e., in the presence of dataset shift), the estimates returned by CV are unreliable. The contribution of this paper is three-fold. First, we propose a CAP method specifically designed to work under prior probability shift (PPS), an instance of dataset shift in which the training and test distributions are characterized by different class priors. This method estimates the entries of the contingency table of the test data (thus allowing one to estimate the value of any specific evaluation measure) by solving a system of $n^2$ independent linear equations, with $n$ the number of classes. Second, we show that the equations that the cells of the contingency table must satisfy are actually more than $n^2$, which gives rise to an overconstrained problem, and present a family of methods each based on a different selection of such equations. Third, we observe that, since a key step of the above methods involves predicting the class priors of the test data, one can exploit intuitions from the field of class prior estimation (a.k.a. “quantification”). Our experiments show that, when combined with state-of-the-art quantification techniques, under PPS our methods tend to outperform existing CAP methods.
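The linear-system idea can be illustrated in the binary case (n = 2, hence four cells). The sketch below is a simplified illustration and not the formulation used in the paper. It assumes that the classifier's true-positive and false-positive rates, measured on held-out training data, remain valid on the test data (which is what PPS implies), and that the fraction of test items predicted positive can be observed. The function names, the choice of the four equations, and the numeric values are all assumptions made for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("the equations are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_contingency_table(tpr, fpr, predicted_positive_rate):
    """Estimate (TP, FP, FN, TN) as fractions of the test set.

    Hypothetical helper: one possible choice of four equations, assumed
    for illustration only. Unknowns are ordered TP, FP, FN, TN.
    """
    a = [
        [1, 1, 1, 1],           # the four cells sum to 1
        [1, 1, 0, 0],           # TP + FP = observed rate of positive predictions
        [1 - tpr, 0, -tpr, 0],  # TP / (TP + FN) = tpr, rewritten linearly
        [0, 1 - fpr, 0, -fpr],  # FP / (FP + TN) = fpr, rewritten linearly
    ]
    b = [1, predicted_positive_rate, 0, 0]
    return solve_linear_system(a, b)


# Illustrative numbers only: rates from validation data, 52% predicted positive.
tp, fp, fn, tn = estimate_contingency_table(
    tpr=0.8, fpr=0.1, predicted_positive_rate=0.52
)
# Any evaluation measure can then be computed from the estimated cells.
f1 = 2 * tp / (2 * tp + fp + fn)
```

Once the cells are estimated, any measure defined on the contingency table (here F1) follows directly. Adding further valid equations, such as a quantifier's estimate of TP + FN, would make the system overconstrained, which is the situation the abstract describes.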

La diversità di giudizio sui Romani in alcune fonti giudaiche

AUTHORS: Andrea Ravasco

WORK PACKAGE: WP 3 – T-ReS

URL: https://biblio.ebaf.edu/bib/555079

Keywords: Romans, Flavius Josephus, Philo of Alexandria, Psalms of Solomon, New Testament, Dead Sea Scrolls, Babylonian Talmud

Abstract

This article offers an analysis of some Jewish sources from the turn of the first century that present the Romans. It compares them with the most representative work of later Judaism, the Babylonian Talmud. This comparison will highlight a diversity of judgment towards the Romans throughout the first centuries of the common era.

DIACU: A dataset for the DIAchronic analysis of Church Slavonic

AUTHORS: Maria Cassese, Giovanni Puccetti, Marianna Napolitano, Andrea Esuli

WORK PACKAGE: WP 4

URL: https://aclanthology.org/2025.bsnlp-1.12.pdf

Keywords:

Abstract

The Church Slavonic language has evolved over time without being formalized into a precise grammar. As a result, there is currently no clearly outlined history tracing its evolution. In recent years, however, there has been a greater effort to digitize Church Slavonic resources, partly motivated by a growing awareness of the need to preserve multilingual knowledge. To exploit these resources, we propose DIACU (DIAchronic Analysis of Church Slavonic), a comprehensive collection of several existing corpora in Church Slavonic. In this work, we thoroughly describe the collection of this novel dataset and test its effectiveness as a training set for attributing Slavonic texts to specific periods.

Computer Vision for the Reconstruction of Dismembered Coptic Codices

AUTHORS: Lorenzo Bianchi, Fabrizio Falchi, Alejandro Moreo, Fabrizio Sebastiani, Costanza Bianchi

WORK PACKAGE: WP 4

URL:

Keywords:

Abstract

In the course of history, many ancient codices (i.e., bound volumes of manuscripts) written in the Coptic language have been dismembered, often at the hands of antique dealers, into individual sheets, which have ended up scattered across the planet. Reconstructing these manuscripts in their original form would be extremely important for a better understanding of the culture of Coptic-speaking communities, and is a long-standing goal of paleographers and Egyptologists alike. In this paper we present ReCoptic, a probabilistic, “contrastive” image classification system based on computer vision techniques, whose goal is to aid scholars in reconstructing dismembered ancient Coptic manuscripts. Given a collection of scans of individual pages of ancient Coptic manuscripts, the system evaluates, for each pair of such scans, the (“posterior”) probability that the two pages originate from the same manuscript, and ranks all such pairs in descending order of their associated posterior probability. The scholar can thus discover as yet unknown pairs of pages originating from the same manuscript by examining, starting from the top of the list, the pairs proposed by ReCoptic. In experiments that we have run on a collection of 6,000+ pages of Coptic manuscripts, ReCoptic displays extremely high accuracy.
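The ranking step described above can be sketched as follows. This is only an illustration of scoring every pair of scans and sorting in descending order. The scoring function is a toy placeholder (a decreasing function of the distance between made-up feature vectors), not ReCoptic's model, and all names and values are assumptions.

```python
from itertools import combinations
from math import dist, exp


def toy_same_codex_probability(features_a, features_b):
    """Placeholder score in (0, 1]: closer feature vectors score higher.

    This stands in for a learned posterior; it is NOT the ReCoptic model.
    """
    return exp(-dist(features_a, features_b))


def rank_pairs(pages, score):
    """Score every unordered pair of pages and sort by descending score."""
    scored = [
        (score(pages[a], pages[b]), a, b)
        for a, b in combinations(sorted(pages), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical scans, each represented by an invented 2-D feature vector.
pages = {
    "scan_01": (0.0, 0.0),
    "scan_02": (0.1, 0.0),
    "scan_03": (3.0, 4.0),
}
ranking = rank_pairs(pages, toy_same_codex_probability)
for probability, page_a, page_b in ranking:
    print(f"{page_a} / {page_b}: {probability:.3f}")
```

A scholar would read such a list from the top. Since the number of pairs grows quadratically with the number of scans, in practice only the highest-ranked pairs would be inspected.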

Ante Vultum. Apports à l’étude de la fonction liturgique de la Sainte Face de Lucques en lien avec l’iconoclasme

AUTHORS: Ilaria Sabbatini

WORK PACKAGE: WP 7

URL: https://iris.unipa.it/handle/10447/695266

Keywords: Holy Blood, Holy Face, iconoclasm, Liturgy, Pilgrimage, Relics

Abstract

The acheiropoietic statue of the Holy Face of Lucca became the object of an intense cult that has survived to the present day. A list of altars (cod. 124, Biblioteca Capitolare), datable between 1071 and 1109, mentions an altar “ante vultum” and an altar “ante crucem veterem”, the latter interpreted as an older cross (Caleca et al.) whose cult had to be supplanted by that of the Holy Face. From this suggestion, in the early 2000s, a lively dispute arose around the hypothesis that the Holy Face preserved in the Cathedral of Sansepolcro was the ancient crucifix venerated by the people of Lucca, who had sold it to the Biturgensi friars and replaced it with a copy. The recent carbon-14 analyses, presented in June 2020, have called everything into question. While stylistic analysis dates the crucifix to the Ottonian age (eleventh century), the carbon-14 examination dates it to the Carolingian age (eighth to ninth century). This makes it possible to attest to the originality of the statue, which would therefore not be a copy of a previous copy. Among all the evaluations carried out, however, one important factor remains to be considered: the transformation of the liturgical use (Bacci) of which the statue was the protagonist, which can help to untangle the vexata quaestio. The aim of this contribution is to analyze the new dating not only in relation to the stylistic and scientific evidence but also in light of the processes that gave the statue real autonomy with respect to the ordinary liturgy, allowing the development of a specific cult that made the cathedral a true sanctuary, a hub of pilgrimage and power.