LLM-AggreFact is a fact-checking benchmark that aggregates 11 of the most up-to-date publicly available datasets for grounded factuality (i.e., hallucination) evaluation.
Please see our blog post for a more detailed description.
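For quick experimentation, the benchmark can be pulled from the Hugging Face Hub. The sketch below is illustrative: it assumes the dataset ID `lytang/LLM-AggreFact` and `doc` / `claim` / `label` / `dataset` fields; verify both against the released schema before relying on them.

```python
# Minimal loading sketch, assuming the benchmark is hosted on the Hugging Face Hub
# under "lytang/LLM-AggreFact" with doc/claim/label/dataset columns (assumption).
from datasets import load_dataset

benchmark = load_dataset("lytang/LLM-AggreFact", split="test")

example = benchmark[0]
print(example["dataset"])    # which of the 11 source datasets the example comes from
print(example["doc"][:200])  # grounding document
print(example["claim"])      # claim to be checked against the document
print(example["label"])      # 1 if the claim is supported by the document, 0 otherwise
```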
The table below compares model performance (balanced accuracy, %) across the component datasets.
Model | Size | Average | CNN | XSum | MediaS | MeetB | WiCE | REVEAL | ClaimVerify | FactCheck | ExpertQA | LFQA | RAGTruth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bespoke-Minicheck-7B | 7B | 77.4 | 65.5 | 77.8 | 76.0 | 78.3 | 83.0 | 88.0 | 75.3 | 77.7 | 59.2 | 86.7 | 84.0 |
Claude-3.5 Sonnet | - | 77.2 | 67.6 | 75.1 | 73.4 | 84.6 | 77.7 | 89.1 | 71.4 | 77.8 | 60.9 | 85.6 | 86.1 |
Mistral-Large 2 | 123B | 76.5 | 64.8 | 74.7 | 69.6 | 84.2 | 80.3 | 87.7 | 71.8 | 74.5 | 60.8 | 87.0 | 85.9 |
gpt-4o-2024-05-13 | - | 75.9 | 68.1 | 76.8 | 71.4 | 79.8 | 78.5 | 86.5 | 69.0 | 77.5 | 59.6 | 83.6 | 84.3 |
Qwen2.5-72B-Instruct | 72B | 75.6 | 63.6 | 73.0 | 71.9 | 80.4 | 80.2 | 88.9 | 70.0 | 77.0 | 60.1 | 84.3 | 81.9 |
MiniCheck-Flan-T5-L | 0.8B | 75.0 | 69.9 | 74.3 | 73.6 | 77.3 | 72.2 | 86.2 | 74.6 | 74.7 | 59.0 | 85.2 | 78.0 |
Llama-3.1-405B-Instruct | 405B | 74.4 | 64.8 | 75.1 | 68.6 | 81.2 | 71.8 | 86.4 | 67.5 | 79.4 | 58.5 | 81.9 | 82.9 |
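The Average column is the unweighted mean of the per-dataset scores. As a rough sketch of how such numbers can be reproduced from binary predictions, the snippet below computes balanced accuracy per dataset and then macro-averages it; the function name and field layout are illustrative, not the leaderboard's own scoring code.

```python
# Illustrative scoring sketch: per-dataset balanced accuracy plus the macro
# average shown in the "Average" column. Assumes binary labels/predictions
# grouped by a `dataset` field; names here are hypothetical.
from collections import defaultdict
from sklearn.metrics import balanced_accuracy_score

def leaderboard_scores(examples, predictions):
    """examples: dicts with 'dataset' and 'label'; predictions: 0/1 per example."""
    by_dataset = defaultdict(lambda: ([], []))
    for ex, pred in zip(examples, predictions):
        labels, preds = by_dataset[ex["dataset"]]
        labels.append(ex["label"])
        preds.append(pred)
    per_dataset = {
        name: 100 * balanced_accuracy_score(labels, preds)
        for name, (labels, preds) in by_dataset.items()
    }
    average = sum(per_dataset.values()) / len(per_dataset)  # unweighted macro average
    return per_dataset, average
```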