Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval

AUTHORS: Davide Caffagni, Federico Cocchi, Anna Mambelli, Fabio Tutrone, Marco Zanella, Marcella Cornia, Rita Cucchiara

WORK PACKAGE: WP8

URL: https://link.springer.com/chapter/10.1007/978-3-032-05409-8_4#auth-Federico-Cocchi

Keywords: Large Language Models · Sentence Embeddings · Sentence Similarity Search · Digital Humanities

Abstract
Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available.
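The retrieval step the abstract describes — ranking corpus sentences by their similarity to a query in a learned embedding space — can be sketched in plain Python with cosine similarity over toy vectors. This is purely illustrative: the sentence ids and 3-d embeddings below are made up, standing in for the high-dimensional vectors a fine-tuned sentence embedding model would produce.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, corpus, top_k=2):
    # Rank corpus sentences by cosine similarity to the query embedding
    # and return the ids of the top_k closest ones.
    ranked = sorted(corpus.items(),
                    key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [sid for sid, _ in ranked[:top_k]]

# Toy 3-d embeddings standing in for fine-tuned sentence vectors
# (hypothetical ids; a real corpus would hold thousands of sentences).
corpus = {
    "latin_sent_1": [0.9, 0.1, 0.0],
    "latin_sent_2": [0.0, 1.0, 0.2],
    "latin_sent_3": [0.8, 0.2, 0.1],
}
print(retrieve([1.0, 0.0, 0.0], corpus, top_k=2))
# → ['latin_sent_1', 'latin_sent_3']
```

The paper's contribution sits upstream of this step: the embeddings themselves are shaped by fine-tuning on LLM-generated sentence pairs, so that semantically related Latin sentences land close together under exactly this kind of similarity search.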