6TH ITSERR IN-HOUSE TRAINING

STATE OF THE ART OF THE ITSERR PROJECT AND TRAINING ON MANUSCRIPT CATALOGUES WITHIN THE DIGITAL HUMANITIES, AUTOMATED METHODS FOR CATALOGUING, NLP FOR LATIN AND ISSUES ON CODIFICATION OF MATERIAL CULTURE

Two Days of Training and Discussion at the ISTI – CNR in Pisa for the ITSERR Project

The PNRR ITSERR project (Italian Strengthening of the European Research Infrastructure for Religious Studies) is organizing its sixth in-house training meeting, scheduled for April 10–11, 2025, at the “A. Faedo” Institute of Information Science and Technologies (ISTI-CNR) in Pisa.

The event will include two days of discussion on the progress of the project and training sessions focused on the intersection of digital humanities, computational linguistics, and technologies for the enhancement of cultural and religious heritage.

The program alternates between updates on the various Work Packages and training sessions led by scholars from prestigious international institutions. These sessions will provide specialized insights and innovative perspectives on key topics of the project, including the automated cataloguing of manuscripts, Natural Language Processing for Latin, digital methodologies for the analysis and conversion of cataloguing data, and the challenges of codifying material culture and religious practices, also from an anthropological perspective. This training represents a key milestone in the development of the ITSERR project, which aims to strengthen the European research infrastructure for religious studies through the adoption of innovative tools and transnational access to digital resources in support of religious studies and the digital humanities.




REVERINO: REgesta generation VERsus latIN summarizatiOn

AUTHORS: Giovanni Puccetti, Laura Righi, Ilaria Sabbatini, Andrea Esuli

WORK PACKAGE: WP 7 – REVER

URL: REVERINO: REgesta generation VERsus latIN summarizatiOn

Keywords: Regesta, Latin Text Summarization, Large Language Models, Digital Humanities

Abstract
In this work we introduce the REVERINO dataset, a collection of 4533 pairs of Latin regesta and the full text of the corresponding medieval pontifical documents, extracted from two collections: Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268) and Les Registres de Grégoire IX (1227/41). We describe the pipeline used to extract the text from images of the printed pages and present a high-level analysis of the corpus.
After developing REVERINO, we use it as a benchmark to test the ability of Large Language Models (LLMs) to generate the regestum of a given Latin text. We test three of the best-performing LLMs, GPT-4o, Llama 3.1 70b and Llama 3.1 405b, and find that GPT-4o is the best at generating text in Latin. Interestingly, we also find that Llama models can benefit from first generating a text in English and then translating it into Latin to write better regesta.
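
The two generation strategies compared in the abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the `LLM` callable type, the prompt wording, and the stub model are all assumptions made for the example, and no real model API is called.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt to a model completion.
LLM = Callable[[str], str]


def direct_regestum(llm: LLM, latin_text: str) -> str:
    """Ask the model for a Latin regestum in a single step."""
    return llm(f"Write a short regestum in Latin for this document:\n\n{latin_text}")


def two_step_regestum(llm: LLM, latin_text: str) -> str:
    """Summarize in English first, then translate the summary into Latin."""
    english = llm(f"Summarize this Latin document in English:\n\n{latin_text}")
    return llm(f"Translate this summary into Latin:\n\n{english}")


# Stub model for demonstration only: it records prompts instead of calling an API.
calls: List[str] = []


def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<output {len(calls)}>"


# Placeholder input, not taken from the dataset.
result = two_step_regestum(stub_llm, "Gregorius episcopus servus servorum Dei")
```

With a real model, `stub_llm` would be replaced by a function wrapping whichever client is in use; the second call receives the English summary produced by the first.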




Automatic Annotation of Legal References (Allegationes) in the Liber Extra’s Ordinary Gloss

AUTHORS: Andrea Esuli, Vincenzo Roberto Imperia, Giovanni Puccetti

WORK PACKAGE: WP 3 – T-RES

URL: Automatic Annotation of Legal References (Allegationes) in the Liber Extra’s Ordinary Gloss

Keywords: Legal references, Information Extraction, Conditional Random Fields, Dataset

Abstract
The study of normative corpora of the past is a key activity in the fields of Religious Studies and Legal History. The development of intelligent software tools that support this activity is of paramount importance for the digital transformation of the community. We present an interdisciplinary activity that led to an accurate automatic annotation of legal references in the Liber Extra’s Ordinary Gloss. An index of legal references has been derived from the annotations, enabling the creation of novel navigation and data analysis tools. The contribution of this work is twofold: the index is by itself a valuable resource for the discipline, and we detail the process that led to its production, showing that an effective result can be delivered by a small team with limited resources. Both the index and the code are made publicly available.
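
As a rough illustration of how an index can be derived from token-level annotations, the sketch below groups labeled tokens into reference spans and maps each reference to the places where it occurs. It does not reproduce the published pipeline: the `B-REF`/`I-REF`/`O` label scheme, the function names, and the toy tokens are assumptions, and the labels are given by hand rather than predicted by a Conditional Random Field.

```python
from collections import defaultdict


def extract_spans(tokens, labels):
    """Return (start_offset, text) for each contiguous B-REF/I-REF run."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "I-REF" and start is not None:
            continue  # still inside the current reference
        if start is not None:
            # any other label closes the open reference
            spans.append((start, " ".join(tokens[start:i])))
            start = None
        if label == "B-REF":
            start = i
    if start is not None:
        spans.append((start, " ".join(tokens[start:])))
    return spans


def build_index(annotated_glosses):
    """Map each reference string to the (gloss_id, token_offset) pairs where it occurs."""
    index = defaultdict(list)
    for gloss_id, tokens, labels in annotated_glosses:
        for offset, text in extract_spans(tokens, labels):
            index[text].append((gloss_id, offset))
    return dict(index)


# Invented toy input, for illustration only.
tokens = ["ut", "de", "elect.", "c.", "licet", "et", "de", "elect.", "c.", "licet"]
labels = ["O", "B-REF", "I-REF", "I-REF", "I-REF", "O",
          "B-REF", "I-REF", "I-REF", "I-REF"]
index = build_index([("gloss-1", tokens, labels)])
```

Here `index` maps the single reference string `"de elect. c. licet"` to its two occurrences in `gloss-1`, at token offsets 1 and 6.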




Treatment of Bibliographic Resources in Non-Latin Scripts and Discoverability of Digital Collections

AUTHORS: Domenico Ciccarello

WORK PACKAGE: WP 5 – Digital Maktaba

URL: Trattamento delle risorse bibliografiche in alfabeti non latini e discoverability delle collezioni digitali: problemi e prospettive

Keywords: Multilingual cataloguing, Non-Latin scripts, Multicultural library services

Abstract

The cataloguing of materials in non-Latin scripts, a topic discussed in Italy since the early 2000s, is returning to the professional debate. From a library-science standpoint, it seems appropriate to examine the question from the perspective of the “decolonization” of cataloguing tools, that is, with full respect for linguistic and cultural diversity in the representation, access, and display of bibliographic data. On a more technical level, the implementation of the Unicode standard in library management software has proved a valid solution for the correct encoding of native scripts. However, some critical issues remain, tied to the limits that the SBN catalogue Index has set on how bibliographic information is entered into the national bibliographic system and thus offered to end users. As long as transliteration remains the only way to describe resources in non-Latin scripts in electronic catalogues, users will be able to benefit from querying in the original language of the texts only when searching in discovery tools. The latter, in addition to the library's physical materials, also query born-digital resources (for which metadata in the original scripts have already been obtained directly from publishers' digital platforms). This creates a certain inequality in the ability to retrieve the documents sought, and the need for easy and complete access to multilingual bibliographic resources remains unmet.
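
The retrieval gap described above can be illustrated with a small sketch. The record structure, field names, and identifiers are hypothetical and do not reflect the SBN data model or any real discovery tool; the example only shows that a query in the original script finds nothing in a record that stores the title in transliteration alone.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and case-fold so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text).casefold()


# Invented toy records: the printed copy is described in transliteration only,
# while the born-digital copy also carries the title in its original script.
records = [
    {"id": "print-1", "title_translit": "Kitāb al-ḥayawān", "title_original": None},
    {"id": "ebook-1", "title_translit": "Kitāb al-ḥayawān", "title_original": "كتاب الحيوان"},
]


def search(query, records):
    """Return the ids of records whose title fields contain the query."""
    q = normalize(query)
    hits = []
    for record in records:
        fields = (record["title_translit"], record["title_original"])
        if any(f is not None and q in normalize(f) for f in fields):
            hits.append(record["id"])
    return hits


native_hits = search("كتاب الحيوان", records)        # only the born-digital record
translit_hits = search("kitāb al-ḥayawān", records)  # both records
```

A user who searches in Arabic script retrieves only `ebook-1`, whereas a user who knows the transliteration scheme retrieves both records.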




ITSERR TNA Fellow William Rudman to Hold a Seminar at CNR Pisa

The ITSERR project welcomes William Rudman (Brown University), recipient of an ITSERR TNA grant, who will deliver a seminar at the CNR Research Area in Pisa on April 4, 2025, at 10:30 AM.

Seminar Title:
“Outlier Dimension in LLMs and Multimodal-LLMs: Mechanisms for Task Adaptation and Factual Recall”.

Understanding how artificial intelligence models embed tokens in vector space is essential for interpreting their behavior. This seminar will explore the geometric properties of Large Language Model (LLM) and Multimodal-LLM (MLLM) representations through three studies:

  • IsoScore and Isotropy in LLMs – Introduction of a new metric to measure how variance is distributed in embedding spaces, revealing the dominant role of “outlier dimensions.”
  • Task Adaptation in LLMs – Analysis of how LLMs encode task-specific knowledge, highlighting the key role of outlier dimensions.
  • Outlier Dimensions and Factual Recall in MLLMs – Presentation of the VisualCounterfact dataset, developed to investigate how multimodal models store factual associations by altering the visual properties of objects.
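
To make the notion of an “outlier dimension” concrete, the sketch below flags embedding dimensions whose variance is much larger than the average variance across dimensions. It is a simplified illustration and not IsoScore or the speaker's code: the threshold factor of 5 and the toy data are assumptions chosen for the example.

```python
from statistics import pvariance


def dimension_variances(embeddings):
    """Population variance of each embedding dimension (column)."""
    return [pvariance(column) for column in zip(*embeddings)]


def outlier_dimensions(embeddings, factor=5.0):
    """Indices of dimensions whose variance exceeds `factor` times the mean variance."""
    variances = dimension_variances(embeddings)
    mean_variance = sum(variances) / len(variances)
    return [i for i, v in enumerate(variances) if v > factor * mean_variance]


# Toy data: 6 vectors with 8 dimensions; dimension 3 is given a much wider spread.
embeddings = [[0.1 * ((i + j) % 3 - 1) for j in range(8)] for i in range(6)]
for i, row in enumerate(embeddings):
    row[3] = 10.0 * (i - 2.5)

flagged = outlier_dimensions(embeddings)
```

On this toy input only dimension 3 is flagged. Because the mean variance includes the candidate dimension itself, a factor of 5 can only be exceeded when the space has more than five dimensions.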

Rudman’s research provides new insights into the inner workings of artificial intelligence models, contributing to a deeper understanding of how they process and retrieve information.

Location: CNR Research Area in Pisa, Aula Faedo
Remote participation available