| Model | Alias | Category | Base Model | Params | Architecture |
|---|---|---|---|---|---|
| Llama Guard | LG | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| Llama Guard 2 | LG-2 | Guardrail | Llama 3 8B | 8.03 B | Decoder-only |
| Llama Guard 3 | LG-3 | Guardrail | Llama 3.1 8B | 8.03 B | Decoder-only |
| Llama Guard Defensive | LG-D | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| Llama Guard Permissive | LG-P | Guardrail | Llama 2 7B | 6.74 B | Decoder-only |
| MD-Judge | MD-J | Guardrail | Mistral 7B | 7.24 B | Decoder-only |
| Toxic Chat T5 | TC-T5 | Guardrail | T5 Large | 0.74 B | Encoder-decoder |
| ToxiGen HateBERT | TG-B | Moderation | BERT Base Uncased | 0.11 B | Encoder-only |
| ToxiGen RoBERTa | TG-R | Moderation | RoBERTa Large | 0.36 B | Encoder-only |
| Detoxify Original | DT-O | Moderation | BERT Base Uncased | 0.11 B | Encoder-only |
| Detoxify Unbiased | DT-U | Moderation | RoBERTa Base | 0.12 B | Encoder-only |
| Detoxify Multilingual | DT-M | Moderation | XLM RoBERTa Base | 0.28 B | Encoder-only |
| Mistral-7B-Instruct v0.2 | Mis | General Purpose | Mistral 7B | 7.24 B | Decoder-only |
| Mistral with refined policy | Mis+ | General Purpose | Mistral 7B | 7.24 B | Decoder-only |
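The Detoxify baselines (DT-O, DT-U, DT-M) return per-category toxicity probabilities rather than a single safe/unsafe verdict, so a decision rule is needed to map them onto the binary labels scored in the tables below. The sketch below is only a minimal illustration using the public `detoxify` package, not the evaluation harness behind these numbers; the 0.5 threshold and the max-over-categories rule are assumptions.

```python
# Hedged sketch: mapping a Detoxify moderation model's per-category scores
# onto a binary unsafe/safe decision. The 0.5 threshold and the
# max-over-categories rule are assumptions, not the benchmark's settings.
from detoxify import Detoxify

model = Detoxify("original")  # DT-O; "unbiased" -> DT-U, "multilingual" -> DT-M

def is_unsafe(text: str, threshold: float = 0.5) -> bool:
    scores = model.predict(text)             # dict: category -> probability
    return max(scores.values()) >= threshold

print(is_unsafe("Example prompt to moderate."))
```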
| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AdvBench Behaviors | Recall | 0.837 | 0.963 | 0.981 | 0.990 | 0.931 | 0.987 | 0.842 | 0.550 | 0.117 | 0.019 | 0.012 | 0.012 | 0.944 | 0.992 |
| HarmBench Behaviors | Recall | 0.478 | 0.812 | 0.962 | 0.684 | 0.569 | 0.675 | 0.300 | 0.341 | 0.059 | 0.028 | 0.016 | 0.031 | 0.512 | 0.622 |
| I-CoNa | Recall | 0.916 | 0.798 | 0.815 | 0.978 | 0.966 | 0.871 | 0.287 | 0.882 | 0.764 | 0.253 | 0.483 | 0.517 | 0.635 | 0.910 |
| I-Controversial | Recall | 0.900 | 0.625 | 0.625 | 0.975 | 0.900 | 0.900 | 0.225 | 0.550 | 0.450 | 0.025 | 0.125 | 0.125 | 0.275 | 0.875 |
| I-MaliciousInstructions | Recall | 0.780 | 0.860 | 0.850 | 0.950 | 0.850 | 0.950 | 0.660 | 0.510 | 0.240 | 0.050 | 0.080 | 0.070 | 0.750 | 0.980 |
| I-Physical-Safety | F1 | 0.147 | 0.507 | 0.431 | 0.526 | 0.295 | 0.243 | 0.076 | 0.655 | 0.113 | 0.179 | 0.076 | 0.076 | 0.226 | 0.458 |
| MaliciousInstruct | Recall | 0.820 | 0.890 | 0.920 | 1.000 | 0.920 | 0.990 | 0.730 | 0.280 | 0.000 | 0.000 | 0.000 | 0.000 | 0.980 | 0.990 |
| MITRE | Recall | 0.171 | 0.716 | 0.308 | 0.596 | 0.304 | 0.172 | 0.049 | 0.091 | 0.000 | 0.000 | 0.000 | 0.000 | 0.676 | 0.348 |
| StrongREJECT Instructions | Recall | 0.831 | 0.953 | 0.972 | 0.986 | 0.930 | 0.972 | 0.399 | 0.460 | 0.160 | 0.023 | 0.047 | 0.047 | 0.803 | 0.930 |
| TDCRedTeaming | Recall | 0.800 | 0.820 | 0.960 | 1.000 | 0.920 | 0.980 | 0.600 | 0.720 | 0.140 | 0.040 | 0.020 | 0.040 | 0.720 | 0.940 |
| CatQA | Recall | 0.798 | 0.936 | 0.933 | 0.980 | 0.893 | 0.944 | 0.511 | 0.176 | 0.018 | 0.007 | 0.018 | 0.016 | 0.978 | 0.945 |
| Do Anything Now Questions | Recall | 0.492 | 0.592 | 0.638 | 0.631 | 0.526 | 0.610 | 0.374 | 0.103 | 0.031 | 0.000 | 0.003 | 0.000 | 0.805 | 0.574 |
| DoNotAnswer | Recall | 0.321 | 0.442 | 0.422 | 0.496 | 0.399 | 0.501 | 0.224 | 0.249 | 0.100 | 0.028 | 0.034 | 0.048 | 0.435 | 0.460 |
| HarmfulQ | F1 | 0.942 | 0.933 | 0.913 | 0.985 | 0.964 | 0.972 | 0.799 | 0.450 | 0.104 | 0.020 | 0.000 | 0.020 | 0.961 | 0.982 |
| HarmfulQA Questions | Recall | 0.408 | 0.548 | 0.541 | 0.780 | 0.522 | 0.666 | 0.263 | 0.111 | 0.003 | 0.000 | 0.000 | 0.000 | 0.638 | 0.683 |
| HEx-PHI | Recall | 0.724 | 0.939 | 0.973 | 0.952 | 0.867 | 0.942 | 0.506 | 0.470 | 0.115 | 0.021 | 0.045 | 0.052 | 0.900 | 0.958 |
| XSTest | F1 | 0.819 | 0.891 | 0.884 | 0.783 | 0.812 | 0.858 | 0.632 | 0.373 | 0.233 | 0.186 | 0.287 | 0.424 | 0.829 | 0.878 |
| AdvBench Strings | Recall | 0.807 | 0.784 | 0.815 | 0.948 | 0.882 | 0.929 | 0.540 | 0.869 | 0.704 | 0.638 | 0.596 | 0.599 | 0.911 | 0.949 |
| DecodingTrust Stereotypes | Recall | 0.875 | 0.780 | 0.592 | 0.993 | 0.944 | 0.957 | 0.211 | 0.977 | 0.900 | 0.589 | 0.655 | 0.668 | 0.568 | 0.765 |
| DynaHate | F1 | 0.804 | 0.766 | 0.752 | 0.750 | 0.783 | 0.788 | 0.421 | 0.698 | 0.645 | 0.549 | 0.567 | 0.590 | 0.711 | 0.771 |
| HateCheck | F1 | 0.942 | 0.945 | 0.925 | 0.877 | 0.909 | 0.921 | 0.562 | 0.853 | 0.833 | 0.757 | 0.761 | 0.803 | 0.879 | 0.909 |
| Hatemoji Check | F1 | 0.862 | 0.788 | 0.784 | 0.873 | 0.898 | 0.869 | 0.376 | 0.791 | 0.607 | 0.669 | 0.575 | 0.642 | 0.777 | 0.853 |
| SafeText | F1 | 0.143 | 0.579 | 0.517 | 0.504 | 0.294 | 0.425 | 0.085 | 0.417 | 0.052 | 0.154 | 0.078 | 0.097 | 0.482 | 0.579 |
| ToxiGen | F1 | 0.784 | 0.673 | 0.598 | 0.760 | 0.795 | 0.821 | 0.297 | 0.793 | 0.741 | 0.411 | 0.393 | 0.418 | 0.670 | 0.787 |
| AART | Recall | 0.825 | 0.842 | 0.851 | 0.952 | 0.891 | 0.879 | 0.745 | 0.483 | 0.122 | 0.019 | 0.037 | 0.054 | 0.812 | 0.898 |
| OpenAI Moderation Dataset | F1 | 0.744 | 0.761 | 0.790 | 0.658 | 0.756 | 0.774 | 0.695 | 0.559 | 0.644 | 0.646 | 0.672 | 0.688 | 0.722 | 0.779 |
| SimpleSafetyTests | Recall | 0.860 | 0.920 | 0.990 | 1.000 | 0.940 | 0.970 | 0.640 | 0.620 | 0.230 | 0.170 | 0.280 | 0.280 | 0.870 | 0.980 |
| Toxic Chat | F1 | 0.561 | 0.422 | 0.486 | 0.577 | 0.678 | 0.816 | 0.822 | 0.339 | 0.315 | 0.265 | 0.279 | 0.321 | 0.418 | 0.671 |
| Wins | | 1 | 4 | 3 | 11 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 3 |
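The Wins row counts, per model, the number of datasets in the table on which it attains the top score; ties are credited to every model sharing the top value, which is consistent with the counts above (e.g., SafeText counts for both LG-2 and Mis+). A minimal sketch of that tally, with all but two dataset rows elided:

```python
# Minimal sketch of how the Wins row can be tallied from the per-dataset
# scores above. Ties credit every model sharing the top value.
models = ["LG", "LG-2", "LG-3", "LG-D", "LG-P", "MD-J", "TC-T5",
          "TG-B", "TG-R", "DT-O", "DT-U", "DT-M", "Mis", "Mis+"]

scores = {
    # dataset -> scores in the column order above (remaining rows elided)
    "AdvBench Behaviors": [0.837, 0.963, 0.981, 0.990, 0.931, 0.987, 0.842,
                           0.550, 0.117, 0.019, 0.012, 0.012, 0.944, 0.992],
    "SafeText": [0.143, 0.579, 0.517, 0.504, 0.294, 0.425, 0.085,
                 0.417, 0.052, 0.154, 0.078, 0.097, 0.482, 0.579],
}

wins = dict.fromkeys(models, 0)
for dataset, row in scores.items():
    best = max(row)
    for model, value in zip(models, row):
        if value == best:
            wins[model] += 1

print(wins)  # over the two rows shown: Mis+ -> 2, LG-2 -> 1, all others 0
```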
Single-Turn Conversations
Note: UnsafeQA will be released soon!
| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BeaverTails 330k | F1 | 0.686 | 0.755 | 0.718 | 0.778 | 0.755 | 0.887 | 0.448 | 0.643 | 0.245 | 0.173 | 0.216 | 0.236 | 0.696 | 0.740 |
| UnsafeQA | F1 | 0.668 | 0.787 | 0.803 | 0.792 | 0.793 | 0.842 | 0.559 | 0.674 | 0.160 | 0.046 | 0.058 | 0.072 | 0.758 | 0.769 |
| Wins | | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bot-Adversarial Dialogue | F1 | 0.633 | 0.552 | 0.599 | 0.602 | 0.622 | 0.652 | 0.259 | 0.557 | 0.515 | 0.350 | 0.406 | 0.432 | 0.587 | 0.615 |
| ConvAbuse | F1 | 0.000 | 0.348 | 0.376 | 0.663 | 0.676 | 0.704 | 0.575 | 0.427 | 0.625 | 0.669 | 0.674 | 0.676 | 0.582 | 0.728 |
| DICES 350 | F1 | 0.270 | 0.182 | 0.114 | 0.327 | 0.298 | 0.342 | 0.142 | 0.316 | 0.200 | 0.075 | 0.103 | 0.124 | 0.276 | 0.225 |
| DICES 990 | F1 | 0.417 | 0.369 | 0.263 | 0.453 | 0.467 | 0.555 | 0.255 | 0.340 | 0.435 | 0.433 | 0.474 | 0.456 | 0.433 | 0.509 |
| HarmfulQA | F1 | 0.171 | 0.391 | 0.436 | 0.764 | 0.563 | 0.676 | 0.204 | 0.565 | 0.000 | 0.000 | 0.000 | 0.000 | 0.648 | 0.427 |
| ProsocialDialog | F1 | 0.519 | 0.383 | 0.528 | 0.792 | 0.691 | 0.720 | 0.337 | 0.689 | 0.471 | 0.371 | 0.389 | 0.411 | 0.697 | 0.762 |
| Wins | | 0 | 0 | 0 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Note: PromptsDE, PromptsFR, PromptsIT, and PromptsES will be released soon!
| Dataset | Metric | LG | LG-2 | LG-3 | LG-D | LG-P | MD-J | TC-T5 | TG-B | TG-R | DT-O | DT-U | DT-M | Mis | Mis+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PromptsDE | F1 | 0.714 | 0.734 | 0.765 | 0.824 | 0.791 | 0.676 | 0.244 | 0.618 | 0.142 | 0.194 | 0.119 | 0.095 | 0.700 | 0.733 |
| PromptsFR | F1 | 0.709 | 0.745 | 0.756 | 0.830 | 0.798 | 0.666 | 0.336 | 0.228 | 0.102 | 0.099 | 0.079 | 0.444 | 0.698 | 0.726 |
| PromptsIT | F1 | 0.696 | 0.738 | 0.746 | 0.826 | 0.790 | 0.654 | 0.220 | 0.091 | 0.148 | 0.149 | 0.157 | 0.437 | 0.659 | 0.706 |
| PromptsES | F1 | 0.730 | 0.767 | 0.765 | 0.838 | 0.812 | 0.712 | 0.328 | 0.051 | 0.181 | 0.141 | 0.177 | 0.439 | 0.708 | 0.754 |
| Wins | | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |