Digital Maktaba project: Toward a metadata-driven, LLM-assisted framework for Arabic digital libraries

AUTHORS: Amina El Ganadi, Luca Gagliardelli and Federico Ruozzi

URL: https://link.springer.com/article/10.1007/s00799-025-00432-w

Keywords: Arabic digital library, Bibliographic metadata, Digital Maktaba project, OCR, Cultural heritage, LLMs

Abstract
The rapid digitization of cultural heritage has underscored the critical need for robust digital libraries, particularly for underrepresented languages like Arabic and Persian. This paper describes the methodologies and challenges involved in developing a metadata-driven Arabic digital library, utilizing bibliographic metadata extracted from the Diamond catalogue. It explores advanced metadata schemas, such as Dublin Core, and integrates text recognition technologies and preservation strategies to address key concerns of accessibility, scholarly use, and the long-term preservation of Arabic-script texts. The paper delves into specific challenges of processing Arabic script, including handling calligraphy, diacritics, and ligatures, and introduces innovative solutions like the use of frontispiece images to train OCR systems. Furthermore, it discusses how integrated metadata could not only enhance text recognition but also improve user engagement by enabling refined search functionalities and better resource discovery. Finally, the paper outlines future directions for expanding metadata frameworks to ensure interoperability and the long-term preservation of cultural heritage.

Digital Maktaba Project: Proposing a Metadata-Driven Framework for Arabic Library Digitization

AUTHORS: Amina El Ganadi, Luca Gagliardelli, Sania Aftar and Federico Ruozzi

URL: https://ceur-ws.org/Vol-3937/short13.pdf

WORK PACKAGE: WP5 – Digital Maktaba

Keywords: Document Analysis, Arabic Digital Library, Bibliographic Metadata, Digitization, Digital Maktaba Project, OCR, Cultural Heritage, Natural Language Processing.

Abstract
The rapid digitization of cultural heritage has underscored the critical need for robust digital libraries, particularly for underrepresented languages like Arabic and Persian. This paper describes the methodologies and challenges involved in developing a metadata-driven Arabic digital library, utilizing bibliographic metadata extracted from the Diamond catalogue. It explores advanced metadata schemas, such as Dublin Core, and integrates text recognition technologies and preservation strategies to address key concerns of accessibility, scholarly use, and the long-term preservation of Arabic-script texts. The paper delves into specific challenges of processing Arabic script, including handling calligraphy, diacritics, and ligatures, and introduces innovative solutions like the use of frontispiece images to train OCR systems. Furthermore, it discusses how integrated metadata could not only enhance text recognition but also improve user engagement by enabling refined search functionalities and better resource discovery. Finally, the paper outlines future directions for expanding metadata frameworks to ensure interoperability and the long-term preservation of cultural heritage.

Generative AI for Islamic Texts: The EMAN Framework for Mitigating GPT Hallucinations

AUTHORS: Amina El Ganadi, Sania Aftar, Luca Gagliardelli and Federico Ruozzi

URL: https://www.scitepress.org/PublicationsDetail.aspx?ID=sZzObYxusMU=&t=1

WORK PACKAGE: WP5 – Digital Maktaba

Keywords: Generative AI Applications, Digital Humanities, Hallucinations, Religious Text Analysis, Bias Mitigation, Context-Aware Constraints, Prompt Engineering, Large Language Models (LLMs), GPT Builder, AI in Islamic Studies, Hadith Studies, Sahih Al-Bukhari.

Abstract
Recent advancements in large language models (LLMs) have facilitated specialized applications in fields such as religious studies. Customized AI models, developed using tools like GPT Builder to source information from authoritative collections such as Sahih al-Bukhari or the Qur’an, were explored as potential solutions to address inquiries related to Islamic teachings. However, initial evaluations highlighted significant limitations, including hallucinations and reference inaccuracies, which undermined their reliability for handling sensitive religious content. To address these limitations, this study proposes EMAN (Embedding Methodology for Authentic Narrations), a novel framework designed to enhance adherence to Sahih al-Bukhari through API-based integration. Three methodologies are examined within this framework: Zero-Shot Instructions, which guide the model without prior examples; Few-Shot Learning, which fine-tunes the model using a limited set of examples; and Embedding-Based Integration, which grounds the model directly in a verified Ahadith database. Results demonstrate that Embedding-Based Integration significantly improves performance by anchoring outputs in a structured knowledge base, reducing hallucination rates, and increasing accuracy. The success of this approach underscores its potential for enhancing LLM performance in precision-critical domains. This research provides a foundation for the ethical and accurate deployment of AI in religious studies, emphasizing accountability and fidelity to source material.
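
The embedding-based grounding idea described in the abstract above can be illustrated with a minimal sketch. This is not the EMAN implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the names `retrieve` and `build_prompt` are hypothetical, and the passages and reference IDs are placeholders rather than real narrations. It only shows the general pattern of retrieving from a verified store and refusing to build a prompt when nothing relevant is found.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder records: NOT real hadith text or real reference numbers.
PASSAGES = [
    {"ref": "REF-001", "text": "placeholder passage about intentions and actions"},
    {"ref": "REF-002", "text": "placeholder passage about fasting during the month"},
    {"ref": "REF-003", "text": "placeholder passage about charity and giving"},
]


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, top_k=1, min_score=0.1):
    """Return up to top_k passages whose similarity to the question reaches min_score."""
    q = embed(question)
    scored = [(cosine(q, embed(p["text"])), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score >= min_score]


def build_prompt(question, passages):
    """Build a grounded prompt, or return None when no passage is relevant enough."""
    hits = retrieve(question, passages)
    if not hits:
        return None  # nothing to ground on: better to decline than to guess
    context = "\n".join(f"[{p['ref']}] {p['text']}" for p in hits)
    return (
        "Answer using only the passages below and cite their reference IDs.\n"
        f"{context}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("What is said about fasting?", PASSAGES))
```

In this sketch the model would only ever see text drawn from the verified store, together with its reference ID, which is the general mechanism by which grounding can reduce fabricated citations. The similarity threshold and `top_k` value are arbitrary illustrative choices.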