Fine-Grained Factual Inconsistency Detection: Evaluating the Capabilities of Large Language Models
Thu 29.05, 11:00 - 12:00
Graduate Student Seminar
Bloomfield 526
Abstract:
Factual-consistency evaluation has progressed from response-level scoring toward finer-grained analysis, yet prevailing methods still do not pinpoint where a summary goes wrong. Most coarse approaches collapse an entire summary into a single score, while many fine-grained schemes only estimate the proportion of inconsistent content rather than isolating specific errors. Large Language Models (LLMs) are now routinely employed as automatic judges, yet current research evaluates them almost exclusively with coarse metrics, overlooking their potential for end-to-end, high-quality detection of individual errors. We present a new benchmark tailored to evaluating LLMs on fine-grained factual-inconsistency localization. Each inconsistency is captured by a free-form description that pinpoints the precise piece of misinformation in the summary. After a model predicts the inconsistencies, a matching protocol aligns its descriptions with gold references, yielding a direct measure of detection accuracy. Using this benchmark, we conduct a comprehensive study of state-of-the-art LLMs across multiple detection strategies. The results reveal both the potential and the current limitations of LLMs in recognizing individual factual errors rather than merely assigning aggregate consistency scores. Our benchmark and evaluation protocol enable an assessment of LLMs’ capacity for reliable fine-grained judgment.
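
To give a rough idea of what a matching-based evaluation of this kind can look like, the sketch below greedily aligns predicted free-form inconsistency descriptions with gold references and reports precision, recall, and F1. The function names, the token-overlap similarity, and the 0.5 threshold are illustrative assumptions only; they stand in for whatever semantic matcher and protocol the actual benchmark uses.

```python
# Minimal sketch of matching-based scoring for fine-grained inconsistency
# detection. Names, similarity function, and threshold are assumptions,
# not the benchmark's actual protocol.
from typing import List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased tokens (a simple stand-in for a
    stronger semantic matcher, e.g. an NLI or embedding model)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_predictions(predicted: List[str], gold: List[str],
                      threshold: float = 0.5) -> Tuple[float, float, float]:
    """Greedily align each predicted description to at most one unmatched
    gold reference; return (precision, recall, f1)."""
    matched_gold = set()
    true_positives = 0
    for pred in predicted:
        # Pick the best still-unmatched gold reference for this prediction.
        best_idx, best_score = -1, 0.0
        for i, ref in enumerate(gold):
            if i in matched_gold:
                continue
            score = token_overlap(pred, ref)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx >= 0 and best_score >= threshold:
            matched_gold.add(best_idx)
            true_positives += 1

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical gold references and model predictions for one summary.
    gold_refs = [
        "the summary states the merger closed in 2019, but the article says 2021",
        "the CEO's name is reported as Smith instead of Jones",
    ]
    model_preds = [
        "summary claims the merger closed in 2019, source says 2021",
        "the summary adds a revenue figure not present in the article",
    ]
    p, r, f = match_predictions(model_preds, gold_refs)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy example the first prediction matches the first gold error while the second prediction finds an error not in the gold set, so both precision and recall are 0.5; a real protocol would replace the lexical overlap with a semantic matching step.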

