Benchmark | WANLI | Scitail | FMT | ANLI | WiCE | DADC-NLI | De-Facto | Seahorse | GSUM | Instru-Sum |
---|---|---|---|---|---|---|---|---|---|---|
Avg. Lengths | (17, 10) | (17, 12) | (39, 14) | (54, 10) | (88, 30) | (117, 10) | (310, 17) | (383, 22) | (777, 27) | (971, 97) |
ROUGE-L | 51.3 | 75.2 | 52.4 | 45.3 | 50.0 | 44.1 | 50.9 | 50.9 | 74.3 | 45.7 |
BERTScore | 55.6 | 80.1 | 57.5 | 42.5 | 55.9 | 43.8 | 57.4 | 56.4 | 75.9 | 52.3 |
BARTScore | 57.8 | 87.6 | 60.6 | 45.2 | 63.5 | 49.7 | 58.4 | 58.9 | 68.9 | 57.1 |
QuestEval | 62.4 | 92.7 | 68.4 | 48.0 | 60.0 | 56.8 | 64.5 | 63.1 | 79.4 | 54.0 |
AlignScore-B | 82.6 | 85.7 | 86.0 | 74.1 | 72.4 | 78.8 | 70.6 | 61.7 | 77.4 | 53.1 |
AlignScore-L | 83.9 | 88.1 | 83.6 | 79.3 | 77.9 | 87.0 | 71.4 | 65.1 | 74.4 | 59.6 |
gpt-3.5-turbo | 81.5 | 90.3 | 92.2 | 80.0 | 64.0 | 84.1 | 67.5 | 73.5 | 82.4 | 50.5 |
Llama-3-8B | 75.6 | 87.6 | 92.9 | 68.4 | 68.9 | 82.9 | 59.4 | 69.0 | 79.2 | 58.7 |
MiniCheck-T5-L | 83.2 | 88.9 | 86.9 | 77.4 | 76.8 | 83.6 | 74.9 | 69.9 | 78.4 | 56.2 |
Flan-T5-B (FT) | 87.7 | 98.5 | 91.9 | 75.4 | 86.6 | 86.1 | 77.3 | 72.1 | 81.5 | 57.4 |
Flan-T5-L (FT) | 89.0 | 99.3 | 94.0 | 83.4 | 88.8 | 89.9 | 80.8 | 74.4 | 82.7 | 55.7 |
Llama-3-8B (FT) | 87.2 | 98.2 | 95.8 | 85.4 | 78.7 | 89.8 | 69.1 | 74.3 | 83.9 | 63.9 |

Table 1: AUC-ROC of different metrics on miscellaneous factual consistency evaluation sets

Benchmark | AggreFact (CNN/DM) | BUMP | FiB (CNN/DM) | LLM S. (CNN/DM) | HEval | AggreFact (XSum) | FiB (XSum) | LLM S. (XSum) |
---|---|---|---|---|---|---|---|---|
Avg. Lengths | (498, 55) | (697, 52) | (391, 62) | (458, 69) | (663, 61) | (325, 23) | (231, 20) | (307, 25) |
ROUGE-L | 69.1 | 51.9 | 4.7 | 85.3 | 42.5 | 47.1 | 38.7 | 60.1 |
BERTScore | 71.3 | 54.4 | 12.4 | 88.7 | 53.3 | 55.2 | 54.4 | 68.4 |
BARTScore | 65.7 | 58.3 | 2.1 | 87.2 | 38.9 | 71.8 | 44.2 | 72.7 |
QuestEval | 71.4 | 62.0 | 34.2 | 87.2 | 58.1 | 61.1 | 61.2 | 77.2 |
AlignScore-B | 64.8 | 66.0 | 14.1 | 74.9 | 65.4 | 71.7 | 71.1 | 72.2 |
AlignScore-L | 55.8 | 74.4 | 13.2 | 78.7 | 73.2 | 72.4 | 78.2 | 72.6 |
gpt-3.5-turbo | 66.6 | 80.3 | 65.7 | 86.5 | 68.2 | 78.4 | 76.4 | 76.0 |
Llama-3-8B | 66.0 | 76.9 | 89.9 | 85.3 | 70.6 | 73.3 | 70.8 | 69.8 |
MiniCheck-T5-L | 67.3 | 69.2 | 46.9 | 81.4 | 67.3 | 76.9 | 76.8 | 77.2 |
Flan-T5-B (FT) | 68.5 | 68.6 | 27.2 | 84.3 | 59.8 | 74.1 | 82.1 | 75.7 |
Flan-T5-L (FT) | 69.7 | 76.9 | 54.6 | 85.4 | 67.4 | 74.8 | 87.6 | 75.9 |
Llama-3-8B (FT) | 66.0 | 70.7 | 47.4 | 77.7 | 69.5 | 75.8 | 76.5 | 76.2 |

Table 2: AUC-ROC of different metrics for factual consistency on CNN/DM and XSum

Benchmark | QMSum | SAMSum | MediaSum | Meetingbank |
---|---|---|---|---|
Avg. Lengths | (309, 23) | (132, 10) | (778, 19) | (779, 20) |
ROUGE-L | 56.2 | 57.2 | 56.4 | 67.3 |
BERTScore | 66.1 | 62.6 | 70.6 | 68.5 |
BARTScore | 66.7 | 63.3 | 68.3 | 71.2 |
QuestEval | 57.0 | 56.9 | 66.4 | 70.6 |
AlignScore-B | 66.3 | 76.4 | 75.3 | 80.9 |
AlignScore-L | 67.4 | 78.2 | 74.1 | |
gpt-3.5-turbo | 69.6 | 82.7 | 73.7 | 82.0 |
Llama-3-8B | 72.7 | 80.0 | 77.6 | 78.9 |
MiniCheck-T5-L | 73.4 | 77.7 | 81.1 | 84.8 |
Flan-T5-B (FT) | 70.0 | 75.7 | 78.1 | 76.6 |
Flan-T5-L (FT) | 62.3 | 75.9 | 75.4 | 75.9 |
Llama-3-8B (FT) | 71.3 | 83.2 | 75.3 | 82.9 |

Table 3: AUC-ROC of different metrics for factual consistency in dialogue summarization

Benchmark | Reveal | FactCheck-GPT | LFQA | ExpertQA | ClaimVerify |
---|---|---|---|---|---|
Avg. Lengths | (2000, 140) | (400, 35) | (550, 26) | (500, 27) | (210, 18) |
ROUGE-L | 63.9 | 72.5 | 67.3 | 78.3 | 71.4 |
BERTScore | 65.4 | 74.0 | 68.9 | 77.9 | 73.5 |
BARTScore | 61.3 | 69.5 | 67.3 | 75.3 | 70.9 |
QuestEval | 62.9 | 72.4 | 66.7 | 76.3 | 72.4 |
AlignScore-B | 59.2 | 70.6 | 66.2 | 74.3 | 68.3 |
AlignScore-L | 57.4 | 68.2 | 64.1 | 72.8 | 67.4 |
gpt-3.5-turbo | 66.7 | 75.2 | 72.5 | 80.3 | 73.4 |
Llama-3-8B | 65.3 | 73.9 | 70.9 | 78.5 | 72.1 |
MiniCheck-T5-L | 64.0 | 72.8 | 69.8 | 79.0 | 74.1 |
Flan-T5-B (FT) | 62.4 | 71.5 | 68.3 | 78.4 | 72.9 |
Flan-T5-L (FT) | 63.9 | 70.2 | 67.1 | 77.5 | 71.8 |
Llama-3-8B (FT) | 64.7 | 74.3 | 70.0 | 79.2 | 73.7 |

Table 4: AUC-ROC of different metrics on long-form factual QA and fact-checking
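
For readers less familiar with the measure reported in Tables 1 to 4, the following is a minimal sketch of how an AUC-ROC value can be computed from a metric's scores and binary consistency labels. It is not the evaluation code behind the tables, which is not shown here. The function name `auc_roc`, the label convention (1 = consistent, 0 = inconsistent), and the toy scores are illustrative assumptions. The sketch uses the pairwise-ranking definition of AUC: the fraction of (consistent, inconsistent) pairs in which the consistent example receives the higher score, with ties counted as half.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise-ranking AUC-ROC, returned in [0, 1].

    Assumed convention (hypothetical, for illustration only):
    label 1 = factually consistent, label 0 = inconsistent,
    and a higher score means "more likely consistent".
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")

    # Count pairs where the consistent example outranks the inconsistent one.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, invented for this example (not taken from any benchmark).
    toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2]
    toy_labels = [1, 1, 1, 0, 0]
    # Scaled by 100 to match the percentage-style values in the tables.
    print(f"{100 * auc_roc(toy_scores, toy_labels):.1f}")
```

Under this definition, a value of 50 corresponds to chance-level ranking, and a value well below 50 means the metric tends to score inconsistent outputs above consistent ones. The quadratic pair loop is chosen for clarity; for large evaluation sets a rank-based implementation, such as scikit-learn's `roc_auc_score`, would be the more practical choice.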