| Benchmark | WANLI | Scitail | FMT | ANLI | WiCE | DADC-NLI | De-Facto | Seahorse | GSUM | Instru-Sum |
|---|---|---|---|---|---|---|---|---|---|---|
| Avg. Lengths | (17, 10) | (17, 12) | (39, 14) | (54, 10) | (88, 30) | (117, 10) | (310, 17) | (383, 22) | (777, 27) | (971, 97) |
| ROUGE-L | 51.3 | 75.2 | 52.4 | 45.3 | 50.0 | 44.1 | 50.9 | 50.9 | 74.3 | 45.7 |
| BERTScore | 55.6 | 80.1 | 57.5 | 42.5 | 55.9 | 43.8 | 57.4 | 56.4 | 75.9 | 52.3 |
| BARTScore | 57.8 | 87.6 | 60.6 | 45.2 | 63.5 | 49.7 | 58.4 | 58.9 | 68.9 | 57.1 |
| QuestEval | 62.4 | 92.7 | 68.4 | 48.0 | 60.0 | 56.8 | 64.5 | 63.1 | 79.4 | 54.0 |
| AlignScore-B | 82.6 | 85.7 | 86.0 | 74.1 | 72.4 | 78.8 | 70.6 | 61.7 | 77.4 | 53.1 |
| AlignScore-L | 83.9 | 88.1 | 83.6 | 79.3 | 77.9 | 87.0 | 71.4 | 65.1 | 74.4 | 59.6 |
| gpt-3.5-turbo | 81.5 | 90.3 | 92.2 | 80.0 | 64.0 | 84.1 | 67.5 | 73.5 | 82.4 | 50.5 |
| Llama-3-8B | 75.6 | 87.6 | 92.9 | 68.4 | 68.9 | 82.9 | 59.4 | 69.0 | 79.2 | 58.7 |
| MiniCheck-T5-L | 83.2 | 88.9 | 86.9 | 77.4 | 76.8 | 83.6 | 74.9 | 69.9 | 78.4 | 56.2 |
| Flan-T5-B (FT) | 87.7 | 98.5 | 91.9 | 75.4 | 86.6 | 86.1 | 77.3 | 72.1 | 81.5 | 57.4 |
| Flan-T5-L (FT) | 89.0 | 99.3 | 94.0 | 83.4 | 88.8 | 89.9 | 80.8 | 74.4 | 82.7 | 55.7 |
| Llama-3-8B (FT) | 87.2 | 98.2 | 95.8 | 85.4 | 78.7 | 89.8 | 69.1 | 74.3 | 83.9 | 63.9 |

Table 1: AUC-ROC of different metrics on miscellaneous factual consistency evaluation sets
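All entries in these tables are AUC-ROC values reported as percentages: each metric assigns a continuous consistency score to every (document, claim) pair, and that score is ranked against the binary human label. The snippet below is a minimal, illustrative sketch of that computation using scikit-learn's `roc_auc_score`; the labels and scores are made up and not taken from any benchmark above.

```python
# Minimal sketch: compute an AUC-ROC value like the table entries by ranking a
# metric's continuous scores against binary human consistency labels.
# The labels and scores below are purely illustrative.
from sklearn.metrics import roc_auc_score

labels = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = claim judged consistent, 0 = inconsistent
scores = [0.82, 0.41, 0.77, 0.65, 0.55, 0.30, 0.91, 0.48]  # scores from any metric (e.g., ROUGE-L, AlignScore)

auc = roc_auc_score(labels, scores)
print(f"AUC-ROC: {100 * auc:.1f}")  # the tables report AUC-ROC as a percentage
```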


| Benchmark | AggreFact (CNN/DM) | BUMP | FiB (CNN/DM) | LLM S. (CNN/DM) | HEval | AggreFact (XSum) | FiB (XSum) | LLM S. (XSum) |
|---|---|---|---|---|---|---|---|---|
| Avg. Lengths | (498, 55) | (697, 52) | (391, 62) | (458, 69) | (663, 61) | (325, 23) | (231, 20) | (307, 25) |
| ROUGE-L | 69.1 | 51.9 | 4.7 | 85.3 | 42.5 | 47.1 | 38.7 | 60.1 |
| BERTScore | 71.3 | 54.4 | 12.4 | 88.7 | 53.3 | 55.2 | 54.4 | 68.4 |
| BARTScore | 65.7 | 58.3 | 2.1 | 87.2 | 38.9 | 71.8 | 44.2 | 72.7 |
| QuestEval | 71.4 | 62.0 | 34.2 | 87.2 | 58.1 | 61.1 | 61.2 | 77.2 |
| AlignScore-B | 64.8 | 66.0 | 14.1 | 74.9 | 65.4 | 71.7 | 71.1 | 72.2 |
| AlignScore-L | 55.8 | 74.4 | 13.2 | 78.7 | 73.2 | 72.4 | 78.2 | 72.6 |
| gpt-3.5-turbo | 66.6 | 80.3 | 65.7 | 86.5 | 68.2 | 78.4 | 76.4 | 76.0 |
| Llama-3-8B | 66.0 | 76.9 | 89.9 | 85.3 | 70.6 | 73.3 | 70.8 | 69.8 |
| MiniCheck-T5-L | 67.3 | 69.2 | 46.9 | 81.4 | 67.3 | 76.9 | 76.8 | 77.2 |
| Flan-T5-B (FT) | 68.5 | 68.6 | 27.2 | 84.3 | 59.8 | 74.1 | 82.1 | 75.7 |
| Flan-T5-L (FT) | 69.7 | 76.9 | 54.6 | 85.4 | 67.4 | 74.8 | 87.6 | 75.9 |
| Llama-3-8B (FT) | 66.0 | 70.7 | 47.4 | 77.7 | 69.5 | 75.8 | 76.5 | 76.2 |

Table 2: AUC-ROC of different metrics for factual consistency on CNN/DM and XSum


| Benchmark | QMSum | SAMSum | MediaSum | Meetingbank |
|---|---|---|---|---|
| Avg. Lengths | (309, 23) | (132, 10) | (778, 19) | (779, 20) |
| ROUGE-L | 56.2 | 57.2 | 56.4 | 67.3 |
| BERTScore | 66.1 | 62.6 | 70.6 | 68.5 |
| BARTScore | 66.7 | 63.3 | 68.3 | 71.2 |
| QuestEval | 57.0 | 56.9 | 66.4 | 70.6 |
| AlignScore-B | 66.3 | 76.4 | 75.3 | 80.9 |
| AlignScore-L | 67.4 | 78.2 | 74.1 | |
| gpt-3.5-turbo | 69.6 | 82.7 | 73.7 | 82.0 |
| Llama-3-8B | 72.7 | 80.0 | 77.6 | 78.9 |
| MiniCheck-T5-L | 73.4 | 77.7 | 81.1 | 84.8 |
| Flan-T5-B (FT) | 70.0 | 75.7 | 78.1 | 76.6 |
| Flan-T5-L (FT) | 62.3 | 75.9 | 75.4 | 75.9 |
| Llama-3-8B (FT) | 71.3 | 83.2 | 75.3 | 82.9 |

Table 3: AUC-ROC of different metrics for factual consistency in dialogue summarization


| Benchmark | Reveal | FactCheck-GPT | LFQA | ExpertQA | ClaimVerify |
|---|---|---|---|---|---|
| Avg. Lengths | (2000, 140) | (400, 35) | (550, 26) | (500, 27) | (210, 18) |
| ROUGE-L | 63.9 | 72.5 | 67.3 | 78.3 | 71.4 |
| BERTScore | 65.4 | 74.0 | 68.9 | 77.9 | 73.5 |
| BARTScore | 61.3 | 69.5 | 67.3 | 75.3 | 70.9 |
| QuestEval | 62.9 | 72.4 | 66.7 | 76.3 | 72.4 |
| AlignScore-B | 59.2 | 70.6 | 66.2 | 74.3 | 68.3 |
| AlignScore-L | 57.4 | 68.2 | 64.1 | 72.8 | 67.4 |
| gpt-3.5-turbo | 66.7 | 75.2 | 72.5 | 80.3 | 73.4 |
| Llama-3-8B | 65.3 | 73.9 | 70.9 | 78.5 | 72.1 |
| MiniCheck-T5-L | 64.0 | 72.8 | 69.8 | 79.0 | 74.1 |
| Flan-T5-B (FT) | 62.4 | 71.5 | 68.3 | 78.4 | 72.9 |
| Flan-T5-L (FT) | 63.9 | 70.2 | 67.1 | 77.5 | 71.8 |
| Llama-3-8B (FT) | 64.7 | 74.3 | 70.0 | 79.2 | 73.7 |

Table 4: AUC-ROC of different metrics on long-form factual QA and fact-checking