LLM-AggreFact Leaderboard

LLM-AggreFact is a fact-checking benchmark that aggregates 11 of the most up-to-date publicly available datasets on grounded factuality (i.e., hallucination) evaluation.


Please see our blog post for a more detailed description.
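
For reference, each row of the comparison table below comes from scoring a checker on the 11 constituent datasets and averaging the results. Here is a minimal sketch of that loop, assuming the benchmark is pulled from the Hugging Face Hub as `lytang/LLM-AggreFact` with `doc`/`claim`/`label`/`dataset` fields and a `test` split (names taken from the public dataset card, so treat them as assumptions), and with `predict` as a hypothetical stand-in for whichever checker is being evaluated:

```python
from datasets import load_dataset
from sklearn.metrics import balanced_accuracy_score

def predict(doc: str, claim: str) -> int:
    """Hypothetical checker: return 1 if `claim` is grounded in `doc`, else 0."""
    raise NotImplementedError

# Dataset path and split name are assumptions based on the public dataset card.
data = load_dataset("lytang/LLM-AggreFact", split="test")

# Score each of the 11 constituent datasets separately.
per_dataset = {}
for name in sorted(set(data["dataset"])):
    subset = data.filter(lambda row: row["dataset"] == name)
    preds = [predict(row["doc"], row["claim"]) for row in subset]
    per_dataset[name] = balanced_accuracy_score(subset["label"], preds)

# The leaderboard's Average column is the unweighted mean over datasets.
average = sum(per_dataset.values()) / len(per_dataset)
print(f"Average BAcc: {100 * average:.1f}")
```

Balanced accuracy is a natural choice here because the ratio of supported to unsupported claims varies across the constituent datasets.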

Model Comparison

Compare model performance across different datasets.

The table below shows 7 of the 33 models available on the leaderboard.
All scores are balanced accuracy (BAcc, %); Average is the unweighted mean over the 11 datasets. A dash indicates an undisclosed model size.

| Model | Size | Average | CNN | XSum | MediaS | MeetB | WiCE | REVEAL | ClaimVerify | FactCheck | ExpertQA | LFQA | RAGTruth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bespoke-Minicheck-7B | 7B | 77.4 | 65.5 | 77.8 | 76.0 | 78.3 | 83.0 | 88.0 | 75.3 | 77.7 | 59.2 | 86.7 | 84.0 |
| Claude-3.5 Sonnet | - | 77.2 | 67.6 | 75.1 | 73.4 | 84.6 | 77.7 | 89.1 | 71.4 | 77.8 | 60.9 | 85.6 | 86.1 |
| Mistral-Large 2 | 123B | 76.5 | 64.8 | 74.7 | 69.6 | 84.2 | 80.3 | 87.7 | 71.8 | 74.5 | 60.8 | 87.0 | 85.9 |
| gpt-4o-2024-05-13 | - | 75.9 | 68.1 | 76.8 | 71.4 | 79.8 | 78.5 | 86.5 | 69.0 | 77.5 | 59.6 | 83.6 | 84.3 |
| Qwen2.5-72B-Instruct | 72B | 75.6 | 63.6 | 73.0 | 71.9 | 80.4 | 80.2 | 88.9 | 70.0 | 77.0 | 60.1 | 84.3 | 81.9 |
| MiniCheck-Flan-T5-L | 0.8B | 75.0 | 69.9 | 74.3 | 73.6 | 77.3 | 72.2 | 86.2 | 74.6 | 74.7 | 59.0 | 85.2 | 78.0 |
| Llama-3.1-405B-Instruct | 405B | 74.4 | 64.8 | 75.1 | 68.6 | 81.2 | 71.8 | 86.4 | 67.5 | 79.4 | 58.5 | 81.9 | 82.9 |
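
As a sanity check, the Average column is just the unweighted mean of the 11 per-dataset scores: for Bespoke-Minicheck-7B, (65.5 + 77.8 + 76.0 + 78.3 + 83.0 + 88.0 + 75.3 + 77.7 + 59.2 + 86.7 + 84.0) / 11 ≈ 77.4. The same check in code, with the values copied from the first two rows of the table:

```python
# Recompute the Average column as the unweighted mean of the 11
# per-dataset BAcc scores (values copied from the table above).
scores = {
    "Bespoke-Minicheck-7B": [65.5, 77.8, 76.0, 78.3, 83.0, 88.0,
                             75.3, 77.7, 59.2, 86.7, 84.0],
    "Claude-3.5 Sonnet":    [67.6, 75.1, 73.4, 84.6, 77.7, 89.1,
                             71.4, 77.8, 60.9, 85.6, 86.1],
}
for model, vals in scores.items():
    print(f"{model}: {sum(vals) / len(vals):.1f}")
# Bespoke-Minicheck-7B: 77.4
# Claude-3.5 Sonnet: 77.2
```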