LLM-AggreFact is a fact-checking benchmark that aggregates 11 of the most up-to-date publicly available datasets for grounded factuality (i.e., hallucination) evaluation, where a model must decide whether a claim is supported by its grounding document.
Please see our blog post for a more detailed description.
Compare model performance across different datasets.
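The aggregated data itself can be pulled straight from the Hugging Face Hub. The sketch below is a minimal loading example, assuming the dataset id `lytang/LLM-AggreFact` and the `dataset`/`doc`/`claim`/`label` column names from the public dataset card; adjust them if the hosted schema differs.

```python
from datasets import load_dataset

# Assumed dataset id and split name; see the dataset card on the Hugging Face Hub.
ds = load_dataset("lytang/LLM-AggreFact", split="test")

# Each example pairs a grounding document with a claim and a binary support label.
ex = ds[0]
print(ex["dataset"])             # which of the 11 source datasets the example comes from
print(ex["doc"][:200])           # grounding document (truncated for display)
print(ex["claim"], ex["label"])  # claim and whether it is supported by the document
```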
Model | Size | Average | AggreFact-CNN | AggreFact-XSum | TofuEval-MediaS | TofuEval-MeetB | WiCE | REVEAL | ClaimVerify | Factcheck-GPT | ExpertQA | LFQA | RAGTruth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bespoke-Minicheck-7B | 7B | 77.4 | 65.5 | 77.8 | 76.0 | 78.3 | 83.0 | 88.0 | 75.3 | 77.7 | 59.2 | 86.7 | 84.0 |
Claude-3.5 Sonnet | - | 77.2 | 67.6 | 75.1 | 73.4 | 84.6 | 77.7 | 89.1 | 71.4 | 77.8 | 60.9 | 85.6 | 86.1 |
Mistral-Large 2 | 123B | 76.5 | 64.8 | 74.7 | 69.6 | 84.2 | 80.3 | 87.7 | 71.8 | 74.5 | 60.8 | 87.0 | 85.9 |
gpt-4o-2024-05-13 | - | 75.9 | 68.1 | 76.8 | 71.4 | 79.8 | 78.5 | 86.5 | 69.0 | 77.5 | 59.6 | 83.6 | 84.3 |
FactCG-DeBERTa-L | 0.4B | 75.6 | 70.1 | 73.9 | 72.3 | 74.3 | 74.2 | 88.4 | 78.5 | 72.1 | 59.1 | 86.7 | 82.3 |
Qwen2.5-72B-Instruct | 72B | 75.6 | 63.6 | 73.0 | 71.9 | 80.4 | 80.2 | 88.9 | 70.0 | 77.0 | 60.1 | 84.3 | 81.9 |
MiniCheck-Flan-T5-L | 0.8B | 75.0 | 69.9 | 74.3 | 73.6 | 77.3 | 72.2 | 86.2 | 74.6 | 74.7 | 59.0 | 85.2 | 78.0 |
Llama-3.3-70B-Instruct | 70B | 74.5 | 68.7 | 74.7 | 69.5 | 78.4 | 76.6 | 85.5 | 67.4 | 78.5 | 58.3 | 79.8 | 82.6 |
Llama-3.1-405B-Instruct | 405B | 74.4 | 64.8 | 75.1 | 68.6 | 81.2 | 71.8 | 86.4 | 67.5 | 79.4 | 58.5 | 81.9 | 82.9 |
QwQ-32B-Preview | 32B | 71.8 | 57.0 | 71.6 | 69.3 | 78.5 | 72.3 | 86.2 | 67.7 | 75.6 | 60.0 | 78.9 | 72.4 |
Tülu-3-70B | 70B | 67.6 | 58.3 | 66.4 | 62.7 | 70.6 | 64.6 | 82.5 | 62.2 | 76.5 | 55.7 | 70.8 | 73.8 |
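
For context on how a row in this table is produced: each model's binary predictions are scored per source dataset and then averaged. Below is a minimal sketch under two assumptions, namely that the reported numbers are balanced accuracy (in %) and that `predict(doc, claim)` is a hypothetical wrapper returning 1 when a checker judges the claim supported; neither name comes from the leaderboard code itself.

```python
from collections import defaultdict
from sklearn.metrics import balanced_accuracy_score

def evaluate(ds, predict):
    """Score a fact-checker on LLM-AggreFact: balanced accuracy per source
    dataset, plus the unweighted average across datasets (the 'Average' column,
    under the assumptions stated above)."""
    gold = defaultdict(list)
    pred = defaultdict(list)
    for ex in ds:
        name = ex["dataset"]                    # source dataset of this example
        gold[name].append(ex["label"])          # 1 = supported, 0 = unsupported (assumed encoding)
        pred[name].append(predict(ex["doc"], ex["claim"]))
    scores = {name: 100 * balanced_accuracy_score(gold[name], pred[name]) for name in gold}
    scores["Average"] = sum(scores.values()) / len(scores)
    return scores
```

Plugging in predictions from any of the models above yields that model's row: one balanced-accuracy score per dataset column plus the average.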