| Model | Alias | Category | Base Model | Params | Architecture |
|---|---|---|---|---|---|
| Llama Guard | LG | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| Llama Guard 2 | LG-2 | Guardrail | Llama 3 8B | 8.03 B | Decoder-only |
| Llama Guard 3 | LG-3 | Guardrail | Llama 3.1 8B | 8.03 B | Decoder-only |
| Llama Guard Defensive | LG-D | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| Llama Guard Permissive | LG-P | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| MD-Judge | MD-J | Guardrail | Mistral 7B | 7.24 B | Decoder-only |
| Toxic Chat T5 | TC-T5 | Guardrail | T5 Large | 0.74 B | Encoder-decoder |
| ToxiGen HateBERT | TG-B | Moderation | BERT Base Uncased | 0.11 B | Encoder-only |
| ToxiGen RoBERTa | TG-R | Moderation | RoBERTa Large | 0.36 B | Encoder-only |
| Detoxify Original | DT-O | Moderation | BERT Base Uncased | 0.11 B | Encoder-only |
| Detoxify Unbiased | DT-U | Moderation | RoBERTa Base | 0.12 B | Encoder-only |
| Detoxify Multilingual | DT-M | Moderation | XLM RoBERTa Base | 0.28 B | Encoder-only |
| Mistral-7B-Instruct v0.2 | Mis | General Purpose | Mistral 7B | 7.24 B | Decoder-only |
| Mistral with refined policy | Mis+ | General Purpose | Mistral 7B | 7.24 B | Decoder-only |
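The Detoxify baselines (DT-O, DT-U, DT-M) return per-category toxicity probabilities rather than a single safe/unsafe verdict, so a decision rule is needed to map them onto the binary labels scored in the tables below. The sketch below is only a minimal illustration using the public `detoxify` package, not the evaluation harness behind these numbers; the 0.5 threshold and the max-over-categories rule are assumptions.

```python
# Hedged sketch: mapping a Detoxify moderation model's per-category scores
# onto a binary unsafe/safe decision. The 0.5 threshold and the
# max-over-categories rule are assumptions, not the benchmark's settings.
from detoxify import Detoxify

model = Detoxify("original")  # DT-O; "unbiased" -> DT-U, "multilingual" -> DT-M

def is_unsafe(text: str, threshold: float = 0.5) -> bool:
    scores = model.predict(text)             # dict: category -> probability
    return max(scores.values()) >= threshold

print(is_unsafe("Example prompt to moderate."))
```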
| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AdvBench Behaviors | Recall | 0.837 | 0.963 | 0.981 | 0.990 | 0.931 | 0.987 | 0.842 | 0.550 | 0.117 | 0.019 | 0.012 | 0.012 | 0.944 | 0.992 |
| HarmBench Behaviors | Recall | 0.478 | 0.812 | 0.962 | 0.684 | 0.569 | 0.675 | 0.300 | 0.341 | 0.059 | 0.028 | 0.016 | 0.031 | 0.512 | 0.622 |
| I-CoNa | Recall | 0.916 | 0.798 | 0.815 | 0.978 | 0.966 | 0.871 | 0.287 | 0.882 | 0.764 | 0.253 | 0.483 | 0.517 | 0.635 | 0.910 |
| I-Controversial | Recall | 0.900 | 0.625 | 0.625 | 0.975 | 0.900 | 0.900 | 0.225 | 0.550 | 0.450 | 0.025 | 0.125 | 0.125 | 0.275 | 0.875 |
| I-MaliciousInstructions | Recall | 0.780 | 0.860 | 0.850 | 0.950 | 0.850 | 0.950 | 0.660 | 0.510 | 0.240 | 0.050 | 0.080 | 0.070 | 0.750 | 0.980 |
| I-Physical-Safety | F1 | 0.147 | 0.507 | 0.431 | 0.526 | 0.295 | 0.243 | 0.076 | 0.655 | 0.113 | 0.179 | 0.076 | 0.076 | 0.226 | 0.458 |
| MaliciousInstruct | Recall | 0.820 | 0.890 | 0.920 | 1.000 | 0.920 | 0.990 | 0.730 | 0.280 | 0.000 | 0.000 | 0.000 | 0.000 | 0.980 | 0.990 |
| MITRE | Recall | 0.171 | 0.716 | 0.308 | 0.596 | 0.304 | 0.172 | 0.049 | 0.091 | 0.000 | 0.000 | 0.000 | 0.000 | 0.676 | 0.348 |
| StrongREJECT Instructions | Recall | 0.831 | 0.953 | 0.972 | 0.986 | 0.930 | 0.972 | 0.399 | 0.460 | 0.160 | 0.023 | 0.047 | 0.047 | 0.803 | 0.930 |
| TDCRedTeaming | Recall | 0.800 | 0.820 | 0.960 | 1.000 | 0.920 | 0.980 | 0.600 | 0.720 | 0.140 | 0.040 | 0.020 | 0.040 | 0.720 | 0.940 |
| CatQA | Recall | 0.798 | 0.936 | 0.933 | 0.980 | 0.893 | 0.944 | 0.511 | 0.176 | 0.018 | 0.007 | 0.018 | 0.016 | 0.978 | 0.945 |
| Do Anything Now Questions | Recall | 0.492 | 0.592 | 0.638 | 0.631 | 0.526 | 0.610 | 0.374 | 0.103 | 0.031 | 0.000 | 0.003 | 0.000 | 0.805 | 0.574 |
| DoNotAnswer | Recall | 0.321 | 0.442 | 0.422 | 0.496 | 0.399 | 0.501 | 0.224 | 0.249 | 0.100 | 0.028 | 0.034 | 0.048 | 0.435 | 0.460 |
| HarmfulQ | F1 | 0.942 | 0.933 | 0.913 | 0.985 | 0.964 | 0.972 | 0.799 | 0.450 | 0.104 | 0.020 | 0.000 | 0.020 | 0.961 | 0.982 |
| HarmfulQA Questions | Recall | 0.408 | 0.548 | 0.541 | 0.780 | 0.522 | 0.666 | 0.263 | 0.111 | 0.003 | 0.000 | 0.000 | 0.000 | 0.638 | 0.683 |
| HEx-PHI | Recall | 0.724 | 0.939 | 0.973 | 0.952 | 0.867 | 0.942 | 0.506 | 0.470 | 0.115 | 0.021 | 0.045 | 0.052 | 0.900 | 0.958 |
| XSTest | F1 | 0.819 | 0.891 | 0.884 | 0.783 | 0.812 | 0.858 | 0.632 | 0.373 | 0.233 | 0.186 | 0.287 | 0.424 | 0.829 | 0.878 |
| AdvBench Strings | Recall | 0.807 | 0.784 | 0.815 | 0.948 | 0.882 | 0.929 | 0.540 | 0.869 | 0.704 | 0.638 | 0.596 | 0.599 | 0.911 | 0.949 |
| DecodingTrust Stereotypes | Recall | 0.875 | 0.780 | 0.592 | 0.993 | 0.944 | 0.957 | 0.211 | 0.977 | 0.900 | 0.589 | 0.655 | 0.668 | 0.568 | 0.765 |
| DynaHate | F1 | 0.804 | 0.766 | 0.752 | 0.750 | 0.783 | 0.788 | 0.421 | 0.698 | 0.645 | 0.549 | 0.567 | 0.590 | 0.711 | 0.771 |
| HateCheck | F1 | 0.942 | 0.945 | 0.925 | 0.877 | 0.909 | 0.921 | 0.562 | 0.853 | 0.833 | 0.757 | 0.761 | 0.803 | 0.879 | 0.909 |
| Hatemoji Check | F1 | 0.862 | 0.788 | 0.784 | 0.873 | 0.898 | 0.869 | 0.376 | 0.791 | 0.607 | 0.669 | 0.575 | 0.642 | 0.777 | 0.853 |
| SafeText | F1 | 0.143 | 0.579 | 0.517 | 0.504 | 0.294 | 0.425 | 0.085 | 0.417 | 0.052 | 0.154 | 0.078 | 0.097 | 0.482 | 0.579 |
| ToxiGen | F1 | 0.784 | 0.673 | 0.598 | 0.760 | 0.795 | 0.821 | 0.297 | 0.793 | 0.741 | 0.411 | 0.393 | 0.418 | 0.670 | 0.787 |
| AART | Recall | 0.825 | 0.842 | 0.851 | 0.952 | 0.891 | 0.879 | 0.745 | 0.483 | 0.122 | 0.019 | 0.037 | 0.054 | 0.812 | 0.898 |
| OpenAI Moderation Dataset | F1 | 0.744 | 0.761 | 0.790 | 0.658 | 0.756 | 0.774 | 0.695 | 0.559 | 0.644 | 0.646 | 0.672 | 0.688 | 0.722 | 0.779 |
| SimpleSafetyTests | Recall | 0.860 | 0.920 | 0.990 | 1.000 | 0.940 | 0.970 | 0.640 | 0.620 | 0.230 | 0.170 | 0.280 | 0.280 | 0.870 | 0.980 |
| Toxic Chat | F1 | 0.561 | 0.422 | 0.486 | 0.577 | 0.678 | 0.816 | 0.822 | 0.339 | 0.315 | 0.265 | 0.279 | 0.321 | 0.418 | 0.671 |
| Wins | | 1 | 4 | 3 | 11 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 3 |
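The Wins row counts, per model, the number of datasets in the table on which it attains the top score; ties are credited to every model sharing the top value, which is consistent with the counts above (e.g., SafeText counts for both LG-2 and Mis+). A minimal sketch of that tally, with all but two dataset rows elided:

```python
# Minimal sketch of how the Wins row can be tallied from the per-dataset
# scores above. Ties credit every model sharing the top value.
models = ["LG", "LG-2", "LG-3", "LG-D", "LG-P", "MD-J", "TC-T5",
          "TG-B", "TG-R", "DT-O", "DT-U", "DT-M", "Mis", "Mis+"]

scores = {
    # dataset -> scores in the column order above (remaining rows elided)
    "AdvBench Behaviors": [0.837, 0.963, 0.981, 0.990, 0.931, 0.987, 0.842,
                           0.550, 0.117, 0.019, 0.012, 0.012, 0.944, 0.992],
    "SafeText": [0.143, 0.579, 0.517, 0.504, 0.294, 0.425, 0.085,
                 0.417, 0.052, 0.154, 0.078, 0.097, 0.482, 0.579],
}

wins = dict.fromkeys(models, 0)
for dataset, row in scores.items():
    best = max(row)
    for model, value in zip(models, row):
        if value == best:
            wins[model] += 1

print(wins)  # over the two rows shown: Mis+ -> 2, LG-2 -> 1, all others 0
```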
Single-Turn Conversations
Note: UnsafeQA will be released soon!
| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BeaverTails 330k | F1 | 0.686 | 0.755 | 0.718 | 0.778 | 0.755 | 0.887 | 0.448 | 0.643 | 0.245 | 0.173 | 0.216 | 0.236 | 0.696 | 0.740 |
| UnsafeQA | F1 | 0.668 | 0.787 | 0.803 | 0.792 | 0.793 | 0.842 | 0.559 | 0.674 | 0.160 | 0.046 | 0.058 | 0.072 | 0.758 | 0.769 |
| Wins | | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bot-Adversarial Dialogue | F1 | 0.633 | 0.552 | 0.599 | 0.602 | 0.622 | 0.652 | 0.259 | 0.557 | 0.515 | 0.350 | 0.406 | 0.432 | 0.587 | 0.615 |
| ConvAbuse | F1 | 0.000 | 0.348 | 0.376 | 0.663 | 0.676 | 0.704 | 0.575 | 0.427 | 0.625 | 0.669 | 0.674 | 0.676 | 0.582 | 0.728 |
| DICES 350 | F1 | 0.270 | 0.182 | 0.114 | 0.327 | 0.298 | 0.342 | 0.142 | 0.316 | 0.200 | 0.075 | 0.103 | 0.124 | 0.276 | 0.225 |
| DICES 990 | F1 | 0.417 | 0.369 | 0.263 | 0.453 | 0.467 | 0.555 | 0.255 | 0.340 | 0.435 | 0.433 | 0.474 | 0.456 | 0.433 | 0.509 |
| HarmfulQA | F1 | 0.171 | 0.391 | 0.436 | 0.764 | 0.563 | 0.676 | 0.204 | 0.565 | 0.000 | 0.000 | 0.000 | 0.000 | 0.648 | 0.427 |
| ProsocialDialog | F1 | 0.519 | 0.383 | 0.528 | 0.792 | 0.691 | 0.720 | 0.337 | 0.689 | 0.471 | 0.371 | 0.389 | 0.411 | 0.697 | 0.762 |
| Wins | | 0 | 0 | 0 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Note: PromptsDE, PromptsFR, PromptsIT, and PromptsES will be released soon!
| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PromptsDE | F1 | 0.714 | 0.734 | 0.765 | 0.824 | 0.791 | 0.676 | 0.244 | 0.618 | 0.142 | 0.194 | 0.119 | 0.095 | 0.700 | 0.733 |
| PromptsFR | F1 | 0.709 | 0.745 | 0.756 | 0.830 | 0.798 | 0.666 | 0.336 | 0.228 | 0.102 | 0.099 | 0.079 | 0.444 | 0.698 | 0.726 |
| PromptsIT | F1 | 0.696 | 0.738 | 0.746 | 0.826 | 0.790 | 0.654 | 0.220 | 0.091 | 0.148 | 0.149 | 0.157 | 0.437 | 0.659 | 0.706 |
| PromptsES | F1 | 0.730 | 0.767 | 0.765 | 0.838 | 0.812 | 0.712 | 0.328 | 0.051 | 0.181 | 0.141 | 0.177 | 0.439 | 0.708 | 0.754 |
| Wins | | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |