Publication

SARSum: A Relevance and Comprehensiveness-Aware Abstractive Summarization Dataset for Suspicious Activity Reports

Jean V. Alves, Javier Liébana, Hugo Ferreira, Pedro Bizarro

Published at the 28th European Conference on Artificial Intelligence (ECAI 2025)

AI Research

Abstract

Existing benchmarks that evaluate the ability of Large Language Models (LLMs) to summarize rely primarily on measuring a summary’s lexical similarity to a reference or on assessing whether its claims are factually consistent with the source document. These approaches fail to account for a summary’s comprehensiveness (the extent to which it captures important information) and relevance (the extent to which unessential elements are omitted). To bolster comprehensiveness and relevance evaluation in high-stakes domains, we propose SARSum, a dataset tailored to evaluate the summarization of notes taken by anti-money laundering (AML) analysts during the process of preparing a Suspicious Activity Report (SAR), a document filed by financial institutions to alert law enforcement about suspicious transactions or activities, where omission of key details can be extremely costly. To the best of our knowledge, SARSum is the first comprehensiveness and relevance-aware summarization dataset: each of the 2,000 sets of notes is accompanied by the key facts that must be retained in an ideal summary, along with 30 different summaries spanning six levels of information selection quality, created by either omitting key facts or introducing irrelevant information. These resources allow practitioners to evaluate not only a summary’s relevance and comprehensiveness, but also the ability of automatic metrics to assess them. These instances are generated using a variety of LLMs to rephrase templates approved by an AML expert, and we empirically verify that the resulting instances are highly abstractive and varied. While SARSum addresses a specific domain, the novel inclusion of key facts and a reference set with known levels of quality represents a crucial step with potential for broader application across high-stakes scenarios. These elements enable the use of techniques such as natural language inference and question-generation/question-answering to evaluate relevance and comprehensiveness.
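As a rough illustration of how key facts could support the evaluation described above, the following minimal sketch scores a summary against a list of key facts. It is not the method from the paper: the function names, the 0.8 threshold, and the toy texts are all assumptions made for this example, and the token-overlap check is only a crude placeholder for a real natural language inference model.

```python
import re
from typing import Callable, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_supports(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Placeholder for an NLI entailment check (assumption, not a real model):
    True when at least `threshold` of the hypothesis tokens occur in the premise."""
    hyp = _tokens(hypothesis)
    if not hyp:
        return False
    return len(hyp & _tokens(premise)) / len(hyp) >= threshold


def comprehensiveness(
    summary: str,
    key_facts: List[str],
    supports: Callable[[str, str], bool] = overlap_supports,
) -> float:
    """Fraction of key facts that the summary supports."""
    if not key_facts:
        return 0.0
    return sum(supports(summary, fact) for fact in key_facts) / len(key_facts)


def relevance(
    summary_sentences: List[str],
    key_facts: List[str],
    supports: Callable[[str, str], bool] = overlap_supports,
) -> float:
    """Fraction of summary sentences that are backed by at least one key fact."""
    if not summary_sentences:
        return 0.0
    hits = sum(
        any(supports(fact, sentence) for fact in key_facts)
        for sentence in summary_sentences
    )
    return hits / len(summary_sentences)


# Toy, made-up texts for illustration only; they are not taken from SARSum.
key_facts = [
    "The customer deposited 9,500 USD in cash on 3 March.",
    "The funds were wired to an offshore account.",
]
summary_sentences = [
    "The customer deposited 9,500 USD in cash on 3 March.",  # retains a key fact
    "The branch was repainted last year.",                   # irrelevant detail
]

comp = comprehensiveness(" ".join(summary_sentences), key_facts)
rel = relevance(summary_sentences, key_facts)
print(f"comprehensiveness={comp:.2f}, relevance={rel:.2f}")
```

The `supports` parameter is deliberately pluggable, so a real entailment model or a question-answering check could replace the token-overlap placeholder without changing the two scoring functions.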

Materials
PDF Paper Dataset

Page printed on 19 Jan 2026. Please see https://research.feedzai.com/publication/sarsum-a-relevance-and-comprehensiveness-aware-abstractive-summarization-dataset-for-suspicious-activity-reports for the latest version.