AUTHORS: Maria Cassese, Alessandro Bondielli, Alessandro Lenci
WORK PACKAGE:
TITLE: Evaluation of event plausibility recognition in Large (Vision)-Language Models
URL:
Keywords:
Abstract
Transformer-based Language Models (LMs) achieve outstanding performance on various tasks but still exhibit limitations in recognizing common world events (Generalized Event Knowledge, GEK), particularly when doing so requires referential information or real-world experience. Assuming that the visual knowledge encoded in Vision-Language Models (VLMs) provides additional referential information, this paper tests their ability to leverage implicit event knowledge to acquire robust and generalizable representations of agent-patient interactions, assessing their capacity to distinguish between plausible and implausible events. The analysis was conducted on models of varying sizes and architectures.
In the evaluation, the performance of unimodal and multimodal models of various sizes was compared on the task of recognizing the plausibility of minimal sentence pairs. Our analysis suggests several findings: 1) decoder-only models tend to outperform encoder-only ones; 2) model size has a minor impact: although larger models perform better in absolute terms, the differences between 7B- and 13B-parameter models are not significant for this task; 3) while smaller encoder-only VLMs consistently fall short of their LLM counterparts, larger ones perform similarly or slightly better; 4) all models perform worse on the more challenging sentences; 5) adding images corresponding to the textual stimuli affects the accuracy of some models. These findings open avenues for further analyses of the inner workings of VLMs and their ability to model event knowledge with and without visual input.
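As an illustration of how a minimal-pair plausibility evaluation of this kind can be operationalized, the sketch below scores both members of a pair with a decoder-only LM and counts the pair as correctly classified when the plausible sentence receives the higher log-likelihood. This is a minimal sketch under assumed choices, not the paper's actual setup: the model name ("gpt2"), the example sentence pair, and the scoring function are illustrative placeholders.

    # Illustrative sketch (not the paper's code): minimal-pair plausibility
    # scoring with a decoder-only LM. "gpt2" and the sentences are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def sentence_log_likelihood(model, tokenizer, sentence: str) -> float:
        """Total log-probability the model assigns to the sentence."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
        # outputs.loss is the mean negative log-likelihood over the predicted
        # tokens (all tokens except the first); rescale to a total log-prob.
        n_scored = inputs["input_ids"].size(1) - 1
        return -outputs.loss.item() * n_scored

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    plausible = "The cook is slicing the onion."
    implausible = "The onion is slicing the cook."

    # The pair counts as correct if the plausible variant is preferred.
    correct = (sentence_log_likelihood(model, tokenizer, plausible)
               > sentence_log_likelihood(model, tokenizer, implausible))
    print("plausible sentence preferred:", correct)

For encoder-only models, the same comparison would typically rely on a pseudo-log-likelihood obtained by masking each token in turn, and for VLMs the corresponding image would be passed alongside the text; those variants are omitted here.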