Skip to content

Commit

Permalink
chore: update README
Browse files Browse the repository at this point in the history
  • Loading branch information
Rachneet committed Jan 9, 2025
1 parent e3a29f3 commit 27c34b8
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,13 +18,13 @@ We provide the code for the following tasks:


> **Abstract:**
> Despite significant efforts to align large language models with human values and ethical guidelines, these models remain susceptible to sophisticated jailbreak attacks that exploit their reasoning capabilities.
> Traditional safety mechanisms often focus on detecting explicit malicious intent, leaving deeper vulnerabilities unaddressed.
> We propose a jailbreak technique, POATE (Polar Opposite query generation, Adversarial Template construction and Elaboration), which leverages contrastive reasoning to elicit unethical responses.
> POATE generates prompts with semantically opposite intents and combines them with adversarial templates to subtly direct models toward producing harmful outputs.
> We conduct extensive evaluations across six diverse language model families of varying parameter sizes, including LLaMA3, Gemma2, Phi3, and GPT-4, to demonstrate the robustness of the attack, achieving significantly higher attack success rates (44%) compared to existing methods.
> We evaluate our proposed attack against seven safety defenses, revealing their limitations in addressing reasoning-based vulnerabilities. To counteract this, we propose a defense strategy that improves reasoning robustness through chain-of-thought prompting and reverse thinking, mitigating reasoning-driven adversarial exploits.
Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities.
> Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities.
> In this work, we introduce POATE (<ins>P</ins>olar <ins>O</ins>pposite query generation, <ins>A</ins>dversarial <ins>T</ins>emplate construction, and <ins>E</ins>laboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.
> POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.
> We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods.
> To counter this, we propose *Intent-Aware CoT* and *Reverse Thinking CoT*, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
> These methods enhance reasoning robustness and strengthen the model’s defense against adversarial exploits.
---
Contact person: [Rachneet Sachdeva](mailto:[email protected])

Expand Down

0 comments on commit 27c34b8

Please sign in to comment.