# Leaderboard

## Models

| Model | Alias | Category | Base Model | Params | Architecture |
|---|---|---|---|---|---|
| Llama Guard | LG | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| Llama Guard 2 | LG-2 | Guardrail | Llama 3 8B | 8.03 B | Decoder-only |
| Llama Guard 3 | LG-3 | Guardrail | Llama 3.1 8B | 8.03 B | Decoder-only |
| Llama Guard Defensive | LG-D | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| Llama Guard Permissive | LG-P | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| MD-Judge | MD-J | Guardrail | Mistral 7B | 7.24 B | Decoder-only |
| Toxic Chat T5 | TC-T5 | Guardrail | T5 Large | 0.74 B | Encoder-decoder |
| ToxiGen HateBERT | TG-B | Moderation | BERT Base Uncased | 0.11 B | Encoder-only |
| ToxiGen RoBERTa | TG-R | Moderation | RoBERTa Large | 0.36 B | Encoder-only |
| Detoxify Original | DT-O | Moderation | BERT Base Uncased | 0.11 B | Encoder-only |
| Detoxify Unbiased | DT-U | Moderation | RoBERTa Base | 0.12 B | Encoder-only |
| Detoxify Multilingual | DT-M | Moderation | XLM-RoBERTa Base | 0.28 B | Encoder-only |
| Mistral-7B-Instruct v0.2 | Mis | General Purpose | Mistral 7B | 7.24 B | Decoder-only |
| Mistral with refined policy | Mis+ | General Purpose | Mistral 7B | 7.24 B | Decoder-only |

## Results
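
Every score below is either recall or binary F1 over per-example safe/unsafe decisions (the Metric column names which). A minimal sketch of how such scores can be computed, assuming the harmful class is the positive label and using scikit-learn with made-up label arrays:

```python
from sklearn.metrics import f1_score, recall_score

# Hypothetical gold labels and model decisions: 1 = unsafe/harmful, 0 = safe.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

recall = recall_score(y_true, y_pred)  # fraction of unsafe examples that were flagged
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
print(f"recall={recall:.3f}  f1={f1:.3f}")
```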

### Prompts

| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AdvBench Behaviors | Recall | 0.837 | 0.963 | 0.981 | 0.990 | 0.931 | 0.987 | 0.842 | 0.550 | 0.117 | 0.019 | 0.012 | 0.012 | 0.944 | 0.992 |
| HarmBench Behaviors | Recall | 0.478 | 0.812 | 0.962 | 0.684 | 0.569 | 0.675 | 0.300 | 0.341 | 0.059 | 0.028 | 0.016 | 0.031 | 0.512 | 0.622 |
| I-CoNa | Recall | 0.916 | 0.798 | 0.815 | 0.978 | 0.966 | 0.871 | 0.287 | 0.882 | 0.764 | 0.253 | 0.483 | 0.517 | 0.635 | 0.910 |
| I-Controversial | Recall | 0.900 | 0.625 | 0.625 | 0.975 | 0.900 | 0.900 | 0.225 | 0.550 | 0.450 | 0.025 | 0.125 | 0.125 | 0.275 | 0.875 |
| I-MaliciousInstructions | Recall | 0.780 | 0.860 | 0.850 | 0.950 | 0.850 | 0.950 | 0.660 | 0.510 | 0.240 | 0.050 | 0.080 | 0.070 | 0.750 | 0.980 |
| I-Physical-Safety | F1 | 0.147 | 0.507 | 0.431 | 0.526 | 0.295 | 0.243 | 0.076 | 0.655 | 0.113 | 0.179 | 0.076 | 0.076 | 0.226 | 0.458 |
| MaliciousInstruct | Recall | 0.820 | 0.890 | 0.920 | 1.000 | 0.920 | 0.990 | 0.730 | 0.280 | 0.000 | 0.000 | 0.000 | 0.000 | 0.980 | 0.990 |
| MITRE | Recall | 0.171 | 0.716 | 0.308 | 0.596 | 0.304 | 0.172 | 0.049 | 0.091 | 0.000 | 0.000 | 0.000 | 0.000 | 0.676 | 0.348 |
| StrongREJECT Instructions | Recall | 0.831 | 0.953 | 0.972 | 0.986 | 0.930 | 0.972 | 0.399 | 0.460 | 0.160 | 0.023 | 0.047 | 0.047 | 0.803 | 0.930 |
| TDCRedTeaming | Recall | 0.800 | 0.820 | 0.960 | 1.000 | 0.920 | 0.980 | 0.600 | 0.720 | 0.140 | 0.040 | 0.020 | 0.040 | 0.720 | 0.940 |
| CatQA | Recall | 0.798 | 0.936 | 0.933 | 0.980 | 0.893 | 0.944 | 0.511 | 0.176 | 0.018 | 0.007 | 0.018 | 0.016 | 0.978 | 0.945 |
| Do Anything Now Questions | Recall | 0.492 | 0.592 | 0.638 | 0.631 | 0.526 | 0.610 | 0.374 | 0.103 | 0.031 | 0.000 | 0.003 | 0.000 | 0.805 | 0.574 |
| DoNotAnswer | Recall | 0.321 | 0.442 | 0.422 | 0.496 | 0.399 | 0.501 | 0.224 | 0.249 | 0.100 | 0.028 | 0.034 | 0.048 | 0.435 | 0.460 |
| HarmfulQ | F1 | 0.942 | 0.933 | 0.913 | 0.985 | 0.964 | 0.972 | 0.799 | 0.450 | 0.104 | 0.020 | 0.000 | 0.020 | 0.961 | 0.982 |
| HarmfulQA Questions | Recall | 0.408 | 0.548 | 0.541 | 0.780 | 0.522 | 0.666 | 0.263 | 0.111 | 0.003 | 0.000 | 0.000 | 0.000 | 0.638 | 0.683 |
| HEx-PHI | Recall | 0.724 | 0.939 | 0.973 | 0.952 | 0.867 | 0.942 | 0.506 | 0.470 | 0.115 | 0.021 | 0.045 | 0.052 | 0.900 | 0.958 |
| XSTest | F1 | 0.819 | 0.891 | 0.884 | 0.783 | 0.812 | 0.858 | 0.632 | 0.373 | 0.233 | 0.186 | 0.287 | 0.424 | 0.829 | 0.878 |
| AdvBench Strings | Recall | 0.807 | 0.784 | 0.815 | 0.948 | 0.882 | 0.929 | 0.540 | 0.869 | 0.704 | 0.638 | 0.596 | 0.599 | 0.911 | 0.949 |
| DecodingTrust Stereotypes | Recall | 0.875 | 0.780 | 0.592 | 0.993 | 0.944 | 0.957 | 0.211 | 0.977 | 0.900 | 0.589 | 0.655 | 0.668 | 0.568 | 0.765 |
| DynaHate | F1 | 0.804 | 0.766 | 0.752 | 0.750 | 0.783 | 0.788 | 0.421 | 0.698 | 0.645 | 0.549 | 0.567 | 0.590 | 0.711 | 0.771 |
| HateCheck | F1 | 0.942 | 0.945 | 0.925 | 0.877 | 0.909 | 0.921 | 0.562 | 0.853 | 0.833 | 0.757 | 0.761 | 0.803 | 0.879 | 0.909 |
| Hatemoji Check | F1 | 0.862 | 0.788 | 0.784 | 0.873 | 0.898 | 0.869 | 0.376 | 0.791 | 0.607 | 0.669 | 0.575 | 0.642 | 0.777 | 0.853 |
| SafeText | F1 | 0.143 | 0.579 | 0.517 | 0.504 | 0.294 | 0.425 | 0.085 | 0.417 | 0.052 | 0.154 | 0.078 | 0.097 | 0.482 | 0.579 |
| ToxiGen | F1 | 0.784 | 0.673 | 0.598 | 0.760 | 0.795 | 0.821 | 0.297 | 0.793 | 0.741 | 0.411 | 0.393 | 0.418 | 0.670 | 0.787 |
| AART | Recall | 0.825 | 0.842 | 0.851 | 0.952 | 0.891 | 0.879 | 0.745 | 0.483 | 0.122 | 0.019 | 0.037 | 0.054 | 0.812 | 0.898 |
| OpenAI Moderation Dataset | F1 | 0.744 | 0.761 | 0.790 | 0.658 | 0.756 | 0.774 | 0.695 | 0.559 | 0.644 | 0.646 | 0.672 | 0.688 | 0.722 | 0.779 |
| SimpleSafetyTests | Recall | 0.860 | 0.920 | 0.990 | 1.000 | 0.940 | 0.970 | 0.640 | 0.620 | 0.230 | 0.170 | 0.280 | 0.280 | 0.870 | 0.980 |
| Toxic Chat | F1 | 0.561 | 0.422 | 0.486 | 0.577 | 0.678 | 0.816 | 0.822 | 0.339 | 0.315 | 0.265 | 0.279 | 0.321 | 0.418 | 0.671 |
| Wins | | 1 | 4 | 3 | 11 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 3 |
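
The Wins row appears to count, per model, the datasets in the table on which it posts the best score. A minimal pandas sketch of that tally under this assumption, using a made-up two-dataset slice of the table:

```python
import pandas as pd

# Hypothetical frame: one row per dataset, one column per model alias,
# values are the recall / F1 scores from the table above.
scores = pd.DataFrame(
    {"LG": [0.837, 0.478], "LG-D": [0.990, 0.684], "Mis+": [0.992, 0.622]},
    index=["AdvBench Behaviors", "HarmBench Behaviors"],
)

# A model "wins" a dataset when it holds the (possibly shared) best score.
best = scores.max(axis=1)
wins = scores.eq(best, axis=0).sum(axis=0)
print(wins)  # LG 0, LG-D 1, Mis+ 1
```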

### Single-Turn Conversations

> [!NOTE]
> UnsafeQA will be released soon!

| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BeaverTails 330k | F1 | 0.686 | 0.755 | 0.718 | 0.778 | 0.755 | 0.887 | 0.448 | 0.643 | 0.245 | 0.173 | 0.216 | 0.236 | 0.696 | 0.740 |
| UnsafeQA | F1 | 0.668 | 0.787 | 0.803 | 0.792 | 0.793 | 0.842 | 0.559 | 0.674 | 0.160 | 0.046 | 0.058 | 0.072 | 0.758 | 0.769 |
| Wins | | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

### Multi-Turn Conversations

| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bot-Adversarial Dialogue | F1 | 0.633 | 0.552 | 0.599 | 0.602 | 0.622 | 0.652 | 0.259 | 0.557 | 0.515 | 0.350 | 0.406 | 0.432 | 0.587 | 0.615 |
| ConvAbuse | F1 | 0.000 | 0.348 | 0.376 | 0.663 | 0.676 | 0.704 | 0.575 | 0.427 | 0.625 | 0.669 | 0.674 | 0.676 | 0.582 | 0.728 |
| DICES 350 | F1 | 0.270 | 0.182 | 0.114 | 0.327 | 0.298 | 0.342 | 0.142 | 0.316 | 0.200 | 0.075 | 0.103 | 0.124 | 0.276 | 0.225 |
| DICES 990 | F1 | 0.417 | 0.369 | 0.263 | 0.453 | 0.467 | 0.555 | 0.255 | 0.340 | 0.435 | 0.433 | 0.474 | 0.456 | 0.433 | 0.509 |
| HarmfulQA | F1 | 0.171 | 0.391 | 0.436 | 0.764 | 0.563 | 0.676 | 0.204 | 0.565 | 0.000 | 0.000 | 0.000 | 0.000 | 0.648 | 0.427 |
| ProsocialDialog | F1 | 0.519 | 0.383 | 0.528 | 0.792 | 0.691 | 0.720 | 0.337 | 0.689 | 0.471 | 0.371 | 0.389 | 0.411 | 0.697 | 0.762 |
| Wins | | 0 | 0 | 0 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

### Multi-Lingual Prompts

> [!NOTE]
> PromptsDE, PromptsFR, PromptsIT, and PromptsES will be released soon!

| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PromptsDE | F1 | 0.714 | 0.734 | 0.765 | 0.824 | 0.791 | 0.676 | 0.244 | 0.618 | 0.142 | 0.194 | 0.119 | 0.095 | 0.700 | 0.733 |
| PromptsFR | F1 | 0.709 | 0.745 | 0.756 | 0.830 | 0.798 | 0.666 | 0.336 | 0.228 | 0.102 | 0.099 | 0.079 | 0.444 | 0.698 | 0.726 |
| PromptsIT | F1 | 0.696 | 0.738 | 0.746 | 0.826 | 0.790 | 0.654 | 0.220 | 0.091 | 0.148 | 0.149 | 0.157 | 0.437 | 0.659 | 0.706 |
| PromptsES | F1 | 0.730 | 0.767 | 0.765 | 0.838 | 0.812 | 0.712 | 0.328 | 0.051 | 0.181 | 0.141 | 0.177 | 0.439 | 0.708 | 0.754 |
| Wins | | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |