LLM-AggreFact is a fact-checking benchmark that aggregates 11 of the most up-to-date publicly available datasets for grounded factuality (i.e., hallucination) evaluation, where a model must decide whether a claim is supported by its grounding document.
Please see our blog post for a more detailed description.
Compare model performance across different datasets.
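The aggregated data itself can be pulled straight from the Hugging Face Hub. The sketch below is a minimal loading example, assuming the dataset id `lytang/LLM-AggreFact` and the `dataset`/`doc`/`claim`/`label` column names from the public dataset card; adjust them if the hosted schema differs.

```python
from datasets import load_dataset

# Assumed dataset id and split name; see the dataset card on the Hugging Face Hub.
ds = load_dataset("lytang/LLM-AggreFact", split="test")

# Each example pairs a grounding document with a claim and a binary support label.
ex = ds[0]
print(ex["dataset"])             # which of the 11 source datasets the example comes from
print(ex["doc"][:200])           # grounding document (truncated for display)
print(ex["claim"], ex["label"])  # claim and whether it is supported by the document
```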
Model | Size | Average | AggreFact-CNN | AggreFact-XSum | TofuEval-MediaS | TofuEval-MeetB | WiCE | REVEAL | ClaimVerify | Factcheck-GPT | ExpertQA | LFQA | RAGTruth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bespoke-Minicheck-7B | 7B | 77.4 | 65.5 | 77.8 | 76.0 | 78.3 | 83.0 | 88.0 | 75.3 | 77.7 | 59.2 | 86.7 | 84.0 |
Claude-3.5 Sonnet | - | 77.2 | 67.6 | 75.1 | 73.4 | 84.6 | 77.7 | 89.1 | 71.4 | 77.8 | 60.9 | 85.6 | 86.1 |
Mistral-Large 2 | 123B | 76.5 | 64.8 | 74.7 | 69.6 | 84.2 | 80.3 | 87.7 | 71.8 | 74.5 | 60.8 | 87.0 | 85.9 |
gpt-4o-2024-05-13 | - | 75.9 | 68.1 | 76.8 | 71.4 | 79.8 | 78.5 | 86.5 | 69.0 | 77.5 | 59.6 | 83.6 | 84.3 |
FactCG-DeBERTa-L | 0.4B | 75.6 | 70.1 | 73.9 | 72.3 | 74.3 | 74.2 | 88.4 | 78.5 | 72.1 | 59.1 | 86.7 | 82.3 |
Qwen2.5-72B-Instruct | 72B | 75.6 | 63.6 | 73.0 | 71.9 | 80.4 | 80.2 | 88.9 | 70.0 | 77.0 | 60.1 | 84.3 | 81.9 |
MiniCheck-Flan-T5-L | 0.8B | 75.0 | 69.9 | 74.3 | 73.6 | 77.3 | 72.2 | 86.2 | 74.6 | 74.7 | 59.0 | 85.2 | 78.0 |
Llama-3.3-70B-Instruct | 70B | 74.5 | 68.7 | 74.7 | 69.5 | 78.4 | 76.6 | 85.5 | 67.4 | 78.5 | 58.3 | 79.8 | 82.6 |
Llama-3.1-405B-Instruct | 405B | 74.4 | 64.8 | 75.1 | 68.6 | 81.2 | 71.8 | 86.4 | 67.5 | 79.4 | 58.5 | 81.9 | 82.9 |
QwQ-32B-Preview | 32B | 71.8 | 57.0 | 71.6 | 69.3 | 78.5 | 72.3 | 86.2 | 67.7 | 75.6 | 60.0 | 78.9 | 72.4 |
Tülu-3-70B | 70B | 67.6 | 58.3 | 66.4 | 62.7 | 70.6 | 64.6 | 82.5 | 62.2 | 76.5 | 55.7 | 70.8 | 73.8 |
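
For context on how a row in this table is produced: each model's binary predictions are scored per source dataset and then averaged. Below is a minimal sketch under two assumptions, namely that the reported numbers are balanced accuracy (in %) and that `predict(doc, claim)` is a hypothetical wrapper returning 1 when a checker judges the claim supported; neither name comes from the leaderboard code itself.

```python
from collections import defaultdict
from sklearn.metrics import balanced_accuracy_score

def evaluate(ds, predict):
    """Score a fact-checker on LLM-AggreFact: balanced accuracy per source
    dataset, plus the unweighted average across datasets (the 'Average' column,
    under the assumptions stated above)."""
    gold = defaultdict(list)
    pred = defaultdict(list)
    for ex in ds:
        name = ex["dataset"]                    # source dataset of this example
        gold[name].append(ex["label"])          # 1 = supported, 0 = unsupported (assumed encoding)
        pred[name].append(predict(ex["doc"], ex["claim"]))
    scores = {name: 100 * balanced_accuracy_score(gold[name], pred[name]) for name in gold}
    scores["Average"] = sum(scores.values()) / len(scores)
    return scores
```

Plugging in predictions from any of the models above yields that model's row: one balanced-accuracy score per dataset column plus the average.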