Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval

AUTHORS: Davide Caffagni, Federico Cocchi, Anna Mambelli, Fabio Tutrone, Marco Zanella, Marcella Cornia, Rita Cucchiara

WORK PACKAGE: WP8

URL: https://link.springer.com/chapter/10.1007/978-3-032-05409-8_4#auth-Federico-Cocchi

Keywords: Large Language Models · Sentence Embeddings · Sentence Similarity Search · Digital Humanities

Abstract
Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available.




La dottrina dell’anima di ʾAbū Sulaymān al-Siǧistānī negli scritti di ʾAbū Ḥayyān al-Tawḥīdī. Le Notti 13 e 35 del Kitāb al-ʾimtāʿ wa-l-muʾānasah

AUTHORS: Sara Abram

WORK PACKAGE: WP8

URL: https://iris.unipa.it/handle/10447/688604

Keywords:

Abstract
This book offers an edition of the Arabic text, an Italian translation, and a line-by-line commentary on the most significant writings for reconstructing the doctrine of the soul of the Muslim philosopher ʾAbū Sulaymān al-Siǧistānī al-Manṭiqī (ca. 912-985). These are Night 13 and Night 35 of the Kitāb al-ʾimtāʿ wa-l-muʾānasah (The Book of Pleasure and Conviviality), a literary work composed by the man of letters ʾAbū Ḥayyān al-Tawḥīdī (922/932-1023), the best-known disciple of al-Siǧistānī. The edition of the Arabic texts is based on a new collation of the two manuscript witnesses of the book, preserved today in Milan and Istanbul, and the texts are accompanied by their first translation into a Western language. The commentary highlights the specific features of al-Siǧistānī’s thought and its connections with both Greek and Arabic philosophy of the time. It includes translations of, and references to, other works by al-Tawḥīdī dealing with related subjects, especially texts from the Muqābasāt (Borrowings), a collection of notes he compiled during the debates held in al-Siǧistānī’s maǧlis. It also presents numerous translations of passages from philosophical works in Arabic by authors contemporary with or prior to al-Siǧistānī, with the aim of reconstructing the cultural environment in which his ideas took shape.