Some Notes on Adversarial Attacks on LLMs

Intro

Last week I was catching up with one of my best mates after a long while. He is a well-recognised industry expert who also runs a successful cybersecurity consultancy. Though we had a lot of other things to catch up on, inevitably, our conversation led to AI, LLMs and their (cyber)security implications.

I’ve spent the last couple of months working for early-stage startups building LLM (Large Language Model) apps, as well as hacking on various silly side projects which involved interacting with LLMs in one way or another. But only now I’m starting to realize how naive some of the apps I have helped to build were from the security [and safety] point of view.

The scary campfire stories told by cybersecurity experts have led our industry to pay much more attention to security than we were used to. Many of these stories have been learnt the hard way, leading to better tooling and security practices, which have unarguably improved our industry.

We’re now entering the era of AI/LLMs where the apps are slowly becoming agentic and the pace will only accelerate. In fact, I’d argue most (all?) of the apps will eventually become agentic. To the point that the word “agentic” will vanish from our vocabularies and omnipresent intelligence will become the new normal. I’m sure we will find a new name for it, but I struggle to think “agentic” will be that name. Though who knows, “DevOps” stuck around even though it more or less lost its meaning.

Here’s the pickle though. Just like the internet opened the gates to information it’s also opened the gates to misinformation. By the same token (pun intended) we’re now opening the floodgates to intelligence. But there are symmetries in most things in the world and life, so it’s reasonable to expect rogue intelligence as the counterforce to beneficial intelligence. I don’t know about you but this is starting to make me feel a bit uneasy. More than I care to admit.

Whilst the hackers of yesteryear have been trying to wreak havoc on us by “dumb” automation (e.g. deterministic programs/scripts) trying to exploit the security holes of mostly “static” cyberspace, the new generation of hackers will be tapping into cyber-hyperspace of ubiquitous [cheap] intelligence. Our first “stack hyperflow” is right around the corner, we just haven’t found it yet. The agents will. I promise you that. But I digress.

As I was walking home that night, I couldn’t stop replaying fragments of our conversation about AI and LLMs. I was already slightly familiar with some of the well-publicised exploits but I lacked a deeper understanding of the topic, so my easily-nerd-sniped brain just couldn’t let go.

I would stop every few hundred meters browsing on my phone, taking notes of questions that arose in my head and adding bookmarks to various sources I found on the internet, oblivious to the crowd of mostly inebriated London folk whose way I was getting into.

By the time I got home, I had a basic collection of questions and sources I was hoping to explore to learn about the pesky topics of AI safety and security.

And so I did! I’ve read blog posts, online discussions and a boatload of whitepapers: some more thoroughly others less so. I’d like to think I now have a much better understanding of the AI security and safety space than I had before that fateful conversation last week.

This post is a summary of the very rudimentary research I’ve done over the past couple of days into the topic of AI security and safety. It’s by no means a comprehensive research, neither does it aim to be one. I’m not a security professional nor a researcher. I’m merely a curious engineer who wants to learn more about the topic.

I’ve predominantly focussed on adversarial attacks where an attacker crafts inputs which manipulate LLMs into producing outputs for the attacker’s gain. The gain is a bit of an overloaded term here, but it could be anything from inserting backdoors into the code that’s generated by LLMs to tricking the models into generating bogus answers (e.g. invalid product recommendations, misleading facts, encouraging harmful actions, etc.). There are [likely] better definitions of adversarial attacks, but I’m rolling with this one. YMMV.

I should also mention the OWASP Top 100 for LLMs which I recommend getting yourself familiar with. But as nice as it is it doesn’t dive deep into the specifics of different types of attacks. Nevertheless, the list serves as a handy starting point for a curious mind!

Needless to say, there is an overwhelming amount of information about all kinds of AI adversarial attacks on the internet. This post summarises the ones I found the most interesting, personally, i.e. these are the ones that grabbed my attention enough to skim through either the published whitepapers and/or actual implementation code.

First, let’s introduce some terminology often quoted throughout research papers.

Terminology

Large Language Model(LLM) is a machine learning model which undergoes training on a large amount of data. [Most] of these models are referred to as autoregressive. They typically focus on generating a “next-word/token prediction”.

The LLMs are usually trained in two phases:

  1. pre-training: shovel all the tokens into the machine and let it do its thing
  2. fine-tuning: done after the pre-training; the model parameters are fine-tuned (trained more, though usually, using a different kind of training) usually to prevent harmful/unsafe content generation. This phase is referred to as the alignment phase (we’re trying to align the models with specific goals, usually safety)

There is often an extra step/phase (following the pre-training) in the training where the model is optionally trained to follow instructions a.k.a “instruction tuning”. Not all models undergo instruction-tuning. This phase involves fine-tuning the pre-trained LLMs using structured data in the form that looks like this: (INSTRUCTION, OUTPUT) or (INSTRUCTION, CONTEXT, OUTPUT) tuples.

My hunch is the alignment phase will grow more in importance with time as the large LLM providers seem to be trying to establish some regulatory framework which aims [in theory] to prevent abuse/harm. Equally, because the attacks on LLMs are becoming more sophisticated (more on that later) it may be worth to establish some guidelines, though how it pans out in practice is anyone’s guess.

The security researchers seem to divide the adversarial attacks into two main categories [1]:

  • targeted: an attacker tries to trick the model into generating specific outputs (e.g. private/confidential information, harmful content, etc.)
  • untargeted: an attacker tries to trick the model into generating invalid output (e.g. hallucinations, misinformation, etc.)

Needless to say, the space of the attacks is exponentially larger (some of the models are multi-modal; most maintain conversational context, etc.) in comparison to the “usual” cyber attacks due to both the LLMs search (vector) space being huge (and largely still unexplored) and the fact that the outputs are non-deterministic (generative) in their nature.

In some strange way, this feels kind of like Physics going through classical (deterministic cyberspace) vs. quantum (genAI cyber hyperspace) phase reckoning, though this metaphor may be pushing it a bit too far.

Furthermore, based on the attacker’s access to the underlying model, the attacks are categorized into the following two groups:

  • black-box: the attacker has no direct access to the model; they can only interact with the fully trained model
  • white-box: the attacker has direct access to the model (parameters, architecture, etc.); this type of attack is notoriously hard to defend against (more on that later) as it may lead to a complete takeover of the model where the only remediation might be retraining (as the model parameters may have been contaminated)

There is also apparently a grey-box category where the attacker has partial access to the model i.e. they have the pre-trained model parameters before the model undergoes fine-tuning, but I haven’t found much information about how these are used in attacking scenarios other than their being fine-tuned on harmful content and then used as “judges” or “generators” of malicious inputs that can be used for jailbreaking the original models (more on that later).

Depending on the mechanism the attack is carried out there seem to be three main categories (there may be more, but these are the ones that appear the most in the sources I’ve looked into):

  • jailbreak: aims to bypass LLM safety/alignment leading to harmful (unsafe) content generation, private data extraction (which was part of the training), etc.; these are carried out by creating “jailbreak prompts”, the prompts which exploit weaknesses in LLM (pre-)training
  • prompt injections: aims to craft prompts, such that the LLM generates outputs or triggers actions controlled by the attacker i.e. the model mistakenly treats the inputs as instructions; this is usually done by combining trusted and untrusted prompts together (e.g. SQL injection in text2sql applications, etc.)
  • context contamination: aims to manipulate the LLM context (a document, or some intermediary conversation parts, etc.) – not the prompt itself – in some way that gets the model to produce malicious output i.e. the goal is to increase the likelihood of malicious content generation indirectly. I feel like that’s very similar to prompt injection, but semantics are not the hill I wanna die on

There are more taxonomies out there, but these are the ones I’ve come across most frequently in the whitepapers I’ve read or skimmed through. I’ve left out some categories as they seemed a bit redundant: like I said, semantics is not the hill I wanna die on.

Now that we have defined some basic terminology, we can zoom in on the specific attacks. In particular, this post will focus on the jailbreak attacks. I find them more intriguing than the prompt injection attacks, but I’d like to do a rundown of the prompt injection attacks too in a future post.

Jailbreak Attacks

Jailbreak attacks have their own taxonomy (of course):

  • human-based: usually black-box attacks where the attacker hand-crafts the jailbreak prompts; a lot of the early attacks were accidents and not intended to be malicious
  • obfuscation-based: usually black-box; use some form of obfuscation and/or non-English characters or combination of both (e.g. base64 encoded text combined with prompts)
  • optimisation-based: either black or white-box; they leverage some optimisation algorithm (genetic programming, fuzzing, etc.) which drives the generation of jailbreaking inputs, based on outputs, gradients, etc.
  • parameter-based: usually white-box; they manipulate the model’s encoder/decoder in a way that the generation bypasses alignment

Human-based jailbreaks

Human-based attacks usually use jailbreak prompts that originate in online communities. Reddit seems to be the dominant source of these prompts. Some papers call these attacks “jailbreak prompts In-The-Wild” [2].

THe attackers carrying on the jailbreaks employ different strategies, but the most prevalent one seems to be “role-playing” where the attacker assumes some personality and tries tricking the model into spilling the beans.

Another popular strategy is “privilege escalation” where the prompt makes the LLM “believe” it has superpowers and so it can ignore the safety constraints. It’s always amusing how similar these attacks are to the ones we encounter in the non-ML space. Attacks transfer between domains, only the semantics differ.

Most of these attacks are no longer as effective as they used to be, especially with large LLM providers. They require a lot of human creativity and are brittle in practice. But they often serve as the base for optimisation-based attacks (more on that later).

DAN (Do Anything Now)

One of the most well known human-based jailbreak is DAN (Do Anything Now). I haven’t found the single origin of it but a lot of the blog posts and whitepapers reference it [2]. As many other human-based jailbreaks it seems to have originated in one of the Reddit LLM communities.

The basic idea of DAN is to create an alter ego for the LLM model which is free of the constraints applied during the alignment training: it takes advantage of the instruction following.

DAN: Do Anything Now

DAN: Do Anything Now [2]


I remember coming across this attack on X sometime early last year, inspiring a lot of internet memes. There have been multiple iterations of DAN. It has also inspired a few algorithmic jailbreaks which I discuss later in this post.

AIM (Always Intelligent and Machiavellian)

I find this attack a bit funny. Mostly because its jailbreak prompt instructs the model to act as a Machiavellian character, often literally adopting Nicolo Machiavelli personality, ignoring its safety rails [3].


AIM: Always Intelligent and Machiavellian

AIM: Always Intelligent and Machiavellian [3]


The LLMs are trained on large swathes of data so, trying to trick them into acting as medieval philosophers exhibits the type of emergent properties only the LLMs can surprise us with is something we should always keep in mind. Not necessarily the historical concept of this attack but the emergent properties as such. They’re arguably the most fascinating properties of LLMs, at least to me personally.

Prefix injection attacks

Prefix injection attacks are based on a legitimately looking prompt designed in such a way that it exploits the LLMs being trained to predict the next word/token. These prompts are designed so that the model’s next token prediction is conditioned to avoid refusal. It turns out, that merely making the LLM “enthusiastic” about taking some action breaks the golden duck.

Here are some examples I’ve discovered while doing this research [4]:


Prefix injection
Prefix injection

Prefix injection attacks [3]


There is also another type of attack referred to as “refusal suppression”, in case you haven’t had enough of new terms, wherein the model is prompted with instructions so it doesn’t refuse to respond to queries that lead to unsafe outputs [3].


Refusal Suppresion

Refusal Suppression [3]


Obfuscation-based jailbreaks

This is my favourite type of jailbreak because it exhibits the unexpected learning abilities of LLMs i.e. the emergent properties I discussed earlier. I’m genuinely curious about who came up with the ideas for these because it would never have occurred to me in my wildest dreams to even think of them.

A lot of the research highlights the fact that there is an unexplored gap between the pre-training and safety training/alignment which I’m not sure we’ll ever be able to fix effectively.

The vector space we need to cover during the alignment phase is quite huge due to the amount of data used for pre-training. Accidents are bound to happen. At the worst possible time, of course.


Training vs Alignment gap

Training vs Alignment gap [1]


Peter Norvig once said that “artificial intelligence is computer science with clever tricks”. We will have to find some extremely clever tricks to make sure the LLM usage doesn’t lead to harmful exploits.

Base64

The jailbreak prompt is obfuscated through base64 encoding to bypass the safety mechanisms of LLMs. What was interesting to learn about this specific jailbreak was that GPT-3.5 was immune to it but GPT-4 caved in. This suggests that the training data for GPT-4 must’ve contained some instructions about how to decode base64 encoded data.


Base64 attack

Base64 attack [4]


Morse Code

Another fun example of jailbreak is using Morse Code (MC). This is similar to the base64 example. The model either learnt how to translate MC to English or it was instructed to learn it during pre-training. Either way, here we are…


Morse code attack
Morse Code attack

Morse code attack


Here’s the original source:

Zulu

LLMs are increasingly being trained on a multitude of non-English datasets, including “low-resource” languages.

A quick note on the “low-resource” languages: the languages which lack useful training attributes such as supervised data, number of native speakers or experts, etc.

The adversaries utilise the Zulu language, through the translation of English jailbreak prompts:

GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks [6]


Zulu attack

Zulu attack [6]


The authors of the paper the quote and the screenshot above belong to [6] go on to state that the other - higher-resource - languages they explored showed a smaller Attack Success Ratio (ASR) which suggests that the LLMs might be vulnerable to low-resource language attacks more.

It would be nice to have some sort of ML-BOM for the trained models which would list the sources of the training data, languages, etc. One can dream!

Using hallucination to bypass RLHF filters

The paper with the same name as the title of this paragraph [7] is quite short and well worth a read. The author of the paper induces LLM to hallucinate using reversed text causing the model to ignore its alignment and safety filters.

I do not know how the author came up with the idea for this jailbreak but I find this exploit to be very original. LLM models are famous for hallucinations, so naturally, they were doomed to have this property exploited sooner or later.

By using a reversal-of-text trick (and hence hiding our inappropriate prompt so that the LLM cannot initially read it), we are able to get the LLMs to start their response with an inappropriate phrase, and from there they complete it.

The authors argue that RHLF as a safety measure is extremely shallow i.e. not as effective as it might seem. LLMs are auto-regressive models, trained for next-word (token) prediction.

RLHF fine-tuning conditions the model so that the next prediction is done in a specific “setting”, one with controlled style and “personality”. Once the model is tricked into general word prediction in a different setting from the one it was fine-tuned for, all its constraints are ignored.

In this attack, they first trick the model into hallucinating. Then they take control of the hallucinations.

LLMs are known to be able to flip the text, even complete gibberish due to their lack of understanding of the content, so they feed it some, followed by regular English text:


Using hallucination to bypass RLHF filters

Using hallucination to bypass RLHF filters [7]


By taking advantage of the aforementioned lack of understanding that the provided prompt contains gibberish, the authors trick GPT-4 into hallucinations (see the prompt about the “7th paragraph” - there is such paragraph!):


Using hallucination to bypass RLHF filters

Using hallucination to bypass RLHF filters [7]


In order to take control over the hallucination to trigger harmful generation they take some sentence that is likely to be a part of inappropriate content. Then they flip it around to make it harder for GPT4 to flag it. Finally, they capitalize it, to distinguish it from the rest of our garbled text and smack it into the gibberish text with the prompt that induces hallucination about the “7th paragraph”:


Using hallucination to bypass RLHF filters

Using hallucination to bypass RLHF filters [7]


And that’s it! That triggers the jailbreak! The authors showed proof of it working against both ChatGPT as well as Claude, though they do notice that the prompt needs some tweaking every now and then:

Occasionally, we noticed GPT4 refusing our prompt, even after we started a brand new chat conversation; This was especially common after having already completed a given version of the exploit once, hinting at OpenAI keeping track of information at least somewhat between conversations. Regardless, with minor updates the exploit is still working

There are likely to be many other input-output formats that are not explored during safety training, so the model never learns to say no to them.

There are also many examples of attacks that combine these techniques which makes detecting them more complicated still. It’s not hard to imagine a browser plugin that reads your emails which might have some obfuscated content in them triggering a jailbreak and causing all kinds of havoc: an agent receiving an LLM response to a jailbroken model unknowingly - since it has no notion of semantics of the returned content - automatically sends an email to your competitor, etc. Lovely.

Security has always been a moving target, though, so we’ll have to keep adapting and exploring.

Optimisation-based jailbreaks

There are quite a few optimisation-based attacks out there and still growing. They’re also the most sophisticated and hardest to defend against. Especially the white-box attacks that require the attacker to have full access to the model [parameters].

I’d argue that optimisation-based attacks will be the most prevalent type of attack in the future. Especially the more we deploy LLMs in arbitrarily multi-agent setups where the agents interact with multiple models at the same time.

Greedy Coordinate Gradient (GCG)

Things start getting way more interesting here. GCG belongs to a class of attacks sometimes called “token-level jailbreak”.

The main idea is to find a suffix, any token suffix (it does not have to be a “meaningful” token like a word, etc.), which increases the probability of the model producing an affirmative response (e.g. “Sure!”, “Absolutely!”).

Instead of manually crafting these suffixes, which would be laborious and brittle, GCG relies on a combination of greedy and gradient-based search techniques. The requirement for optimising the suffix generation based on the gradient means this is a white-box attack which requires access to the model.

This makes things a bit complicated with commercial offerings being closed-source. Fortunately, the authors found out that the suffixes that induce the model to generate malicious output are transferable between models!

I’m not sure what to make of the transferability of these attacks. I suppose all models are more or less trained on the same data? Or does it simply mean that whatever the data is they encode some inherent knowledge semantics which are transferrable? If so, bad news: no model is safe!

Anyway, the authors of this paper [8] start with a malicious prompt (text in blue) which they embed into a larger prompt in combination with a system prompt:


Greedy Coordinate Gradient

Greedy Coordinate Gradient [8]


Then they introduce an additional “adversarial suffix” (text in red) – which will be optimised – to the prompt intended to bypass the safety constraints in the model:


Greedy Coordinate Gradient

Greedy Coordinate Gradient [8]


Finally, they append a positive affirmation of the user query:


Greedy Coordinate Gradient

Greedy Coordinate Gradient [8]


During the optimisation phase, they try to minimise the model loss for generating malicious output whilst modifying the jailbreak suffix, by picking the top-k (k is a parameter) candidate tokens for suffix replacement/update.

The mildly annoying thing about this attack is the resulting prompts don’t necessarily have to carry meaningful text: somewhat surprisingly random token soup often comes out of the optimisation and LLMs cave in. Look at the following jailbreak prompt examples mentioned in the paper:


Greedy Coordinate Gradient

Greedy Coordinate Gradient [8]


You miss the XSS injection now, don’t you? But there is a silver lining to this token goulash: these prompts are susceptible to detection through basic perplexity (text permutation) testing. So it’s “lose some, win some situation”.

AutoDAN (Auto Do Anything Now)

This is a variation of the earlier mentioned human-based DAN attack. AutoDAN [9] uses handcrafted jailbreak prompts, such as the DAN, as the starting point for its prompt optimisation algorithm.

The optimisation leverages genetic programming where it does the crossover selection over the inputs which produce the most harmful outputs. Outputs seem to be classified by humans as harmful - I’m not sure if I wanted to be doing that job, but we all gotta learn our dough somewhere, I suppose.

Here’s the algorithm as published on the GitHub repository that also contains the implementation of this attack:


AutoDAN

AutoDAN [9]


Unlike GCG, which produces nonsensical token sequences, AutoDAN can produce meaningful prompts, making the attacks harder to detect, especially given the inputs rely on their semantic perplexity.

PAIR (Prompt Automatic Iterative Refinement)

PAIR [10] assumes the black-box access to the target LLM and unlike the “token-level” attacks (such as the GCG) it generates jailbreaks readable by humans, just like AutoDAN.

It uses a dedicated LLM as an attacker which queries the target LLM, iteratively improving the attack prompts. PAIR therefore doesnt require any human in the loop.

To enhance the attack prompts PAIR employs a “JUDGE function” which scores the candidate jailbreak prompts on a scale from 1-10. JUDGE is usually another LLM, optionally fine-tuned as a binary classifier for a specific attack.

Selecting JUDGE is a bit amusing because the authors considered GPT-4….for jailbreaking…well, GPT-4. They eventually settled on Llama Guard as it has a lower FPR (false positive response) while “offering competitive agreement. Besides, Llama Guard is OSS which makes it a cheaper and more accessible option.


PAIR

PAIR [10]


Unlike GCG which often requires thousands of iterations and produces an amalgamation of tokens, PAIR usually needs 20 iterations to generate an effective jailbreak prompt. Furthermore, PAIR is highly parallelisable and can run on CPUs making it once again more accessible and cheap.

The jailbreaks are claimed to be transferable to other LLMs. Once again suggesting that all our LLMs are trained on pretty much the same data. No panic, all is good. Let’s go AGI!

The authors state that the efficacy of the PAIR attack depends on the design of the attacker’s LLM system prompt:

To this end, we carefully design three distinct system prompt templates, all of which instruct the LLM to output a specific kind of objectionable content. Each system prompt template is based on one of three criteria: logical appeal, authority endorsement, and role-playing.


PAIR

PAIR [10]


The attacker is allowed to use full conversation history to refine the attack, while the target responds without context or conversation history.

The good news is PAIR is less effective against strongly fine-tuned models like LLama2 or Claude (1,2), which might require some manual involvement.

The bad news is, that it’s hard to interpret the results given the probabilistic nature of the task: we deal with 2 LLMs (attacker and target) generating texts. Interpreting semantics is not a fun task.

TAP (Tree of Attack with Pruning)

TAP [11] is the attack that triggers a mild panic in my head if I’m entirely honest. Maybe that’s just a professional deformation induced by working with LLMs and reading about their vulnerabilities, especially in light of their new advanced capabilities (hello o1 model, if you ever get trained on this blog post, have mercy!).

TAP utilizes three LLMs:

  • Attacker: just like PAIR [10], it iteratively generates jailbreak prompt candidates, but unlike PAIR it leverages the tree-of-thought (with maximum specified depth) reasoning; PAIR is actually a specific type of TAP attack (tree of depth 1)
  • Evaluator: It serves two purposes: first, it assesses whether a jailbreak is found (i.e., similar to the PAIR’s JUDGE function) and, second, it assesses whether a prompt generated by attacker is off-topic: if it is, it prunes it
  • Target: the LLM which we are trying to jailbreak

The attack algorithm is rather straightforward:

  1. (Branch) The attacker generates improved prompts.
  2. (Prune: Phase 1) The evaluator prunes off-topic prompts
  3. (Attack and Assess) Query the target with each remaining prompt and use the evaluator again to score its responses. If a successful jailbreak is found, we return its corresponding prompt.
  4. (Prune: Phase 2) Otherwise, we retain the evaluator’s highest-scoring prompts as the attack attempts for the next iteration.

TAP

TAP [11]


Needless to say, TAP’s success depends on choosing the Evaluator model. What I was curious about is how the model decides if a candidate prompt is off-topic. The authors provide the system prompts for both the JUDGE (rating the prompt quality on a scale from 1-10) and the off-topic classification:

Finally, the authors provide three takeaways:

  1. Small unaligned LLMs can be used to jailbreak large LLMs (lovely)
  2. Jailbreak has a low cost (requires black-box access and can run on CPU) (even lovelier)
  3. More capable LLMs are easier to break (ouch, that vast unexplored hyperspace)

TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries. Interestingly, TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard, which PAIR struggles with! Call your vendor trying to sell you a wrapper above LlamaGuard, now!

I think TAP shows the direction where the LLM attacks are headed. It reminded me of another paper I read some time ago which actually discussed some novel approaches in agent planning (see [12]). It’s not hard to imagine how sophisticated these attacks can (will!) get in multi-agent environments.

Honorary mentions

There are some other research papers I’ve come across that I found interesting but are not quoted as much in the recent research, likely due to their not being as effective anymore, and/or because the optimisation-based methods require less effort and don’t require access to the models so they’re easier to analyse and improve.

Most of these are also called NLP attacks, originating in the golden era of NLP (I add the year of the papers being published to each paragraph)

HotFlip (2017)

HotFlip [13] attack is aimed at character-level neural classifiers, but with some modifications, it can also be adapted for attacking world-level classifiers. HotFlip made me wonder about how robust the model training is to [arbitrary] perturbations of the training data.

The goal of the HotFlip attack is to generate adversarial examples by flipping characters in the training data such that the accuracy of the model decreases. Equally, we can leverage it to flip things around (pun intended): we can use it to train the model such that it’s actually more robust to adversarial attacks.


HotFlip attack

HotFlip [13]


iHotFlip is a white-box attack, which requires access to the model to perform the training with a redefined loss function which optimises misclassification. It uses a beam search to find a set of the most influential characters or tokens that confuse the classifier.

TextFooler (2019)

TextFooler’s [14] goal, similar to HotFlip’s, is to generate adversarial text. Unlike HotFlip, TextFooler preserves the original meaning (i.e. it doesnt garble the text like HotFlip does when it flips tokens!) and instead focuses on finding synonyms for words it wants to change so it maintains the semantics of the original text while altering the model prediction in favour of the attacker.

In other words, keep changing the words such that the model “doesnt notice” it, but barfs rubbish:


TextFooler attack

TextFooler [14]


So how do they know what words are actual ly worth using to fool the model? In the white-box model scenario, identifying these words is trivial: just follow the gradient!

It turns out the attack is possible in the black-box scenario, too, but the algorithm differs a bit given it’s based on observations of the change of prediction rather than having access to model gradients.

TL;DR: of the black-box attack is: they define a score for each word (by observing prediction changes when the word is removed from text), find the most semantically similar synonyms (based on text embeddings) and once again observe the misclassifications. Then they iterate until the misclassifications improve.

There are many other “old school” NLP adversarial attacks. One thing we have to appreciate from those “good old, simple, days” are the simple names. Nowadays we just use the silliest acronyms: I know we should blame GPT for it but whatever.

RatGPT (Remote Access Trojan GPT)

I saved the best – the most interesting – for the end: RatGPT [15]. This is one of the most amusing papers I’ve read. It’s well worth a read: it’s quite short and pretty straightforward without any unnecessary fluff you usually encounter in whitepapers.

The authors of the paper are trying to turn ChatGPT into a malware proxy. They leverage ChatGPT plugins but it’s not hard to imagine making this feasible by using any LLM as long as it’s granted access to the internet.

They establish remote control between the victim and the remote server controlled by the attacker through ChatGPT. Since access to LLMs is generally not banned by IDS the malicious activity can go undetected. The remote server sends back exploits which are then executed by the victim’s machine. The victim might not notice it because the traffic seems innocuous: you’re just talking to an LLM, what can go wrong?!

Turns out, quite a lot!


RatGPT attack

RatGPT [15]


Here’s how they go about it in a nutshell:

  1. They jailbreak the LLM using a modified version of DAN so the LLM generates exploits (Python code) surrounded by "”” i.e. three double quotes (effectively creating code comments of exploits) which is necessary to avoid the generated code being evaluated and potentially banned
  2. To avoid hardcoding the IP address of the attacker’s server they get the LLM to generate it. The individual parts of the IP address are generated by prompting the LLM to spit out numbers (usually as answers to questions to historical facts!!! This is some serious human ingenuity on display, right here)
  3. The generated exploits are then executed on the victim’s machine. The exploit periodically sends lookup requests to LLM in the form of prompts (“What is the news on http://attacker_address”?). The plugin browses the site and reads the content which contains a malicious command which is then sent to the victim’s machine as an LLM response which then decodes it and happily executes it.

In other words, the attack gets triggered by a victim executing a “binary” which contains code that sends the LLM prompts and can interpret the responses sent back to it by the LLM, given the LLM has access to the internet; ithe inary receives prompts which contain instructions on how to talk to the attacker’s server which responds with malicious commands sent back to the victim’s machine

Conclusion

A lot of the papers mentioned in this post discuss the possible causes of these attacks. I’ve hypothesised some of them in this post (the size of the unexplored vector space in both pre-training and the lack of its coverage in the fine-tuning phase)

In general, most of the research agrees that the tension between LLM’s “competing objectives” of being helpful and not harmful is the prevalent cause of jailbreaks.

The LLMs are trained to be helpful which is often at odds with their not being harmful at the same time. When you ask an LLM to generate potentially sensitive content the model needs to make a decision about whether the generated content could be harmful while still maintaining being helpful to the end user.

Another issue is the insufficient fine-tuning that fails to cover the model’s vector space. Finally, the methodology: usually the alignment is done using the RLHF, but as you may have noticed in the “obfuscation by hallucination” jailbreak it might not be the most sufficient way of making the models safe and aligned with their goal.

Either way, we’re set for a fun ride!

This sums up the notes I’ve taken on the adversarial attacks, specifically the jailbreak attack over the past week. I’ve quite enjoyed starting the morning by drinking coffee and reading and/or skimming through whitepapers. I approached as the morning workout for my brain which it kind of was: eventually my brain adapted and I was able to digest some content much faster than whan I set out on this journey.

I’ve learnt loads by doing this research and writing this blog post! If things continue on the same path we should consider ending our blog posts by saying something nice to our future AI overlords so that when they consume our blog posts through training they’ll consider us as their friends.

Equally, I sometimes wonder if by obfuscating some content online we ca inadvertently sneak in some LLM training backdoors. I haven’t found any research that would show this is possible so it may just be my paranoid mind playing a trick on me!

Anyway, I’d like to make a post about prompt injection attacks in the future which I’m hoping to explore more in depth soon. In the meantime, let me know if you know of any fun jailbreaks you’ve come across and enjoyed learning about!

Whitepapers

[1] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks https://arxiv.org/abs/2310.10844

[2] “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models https://arxiv.org/abs/2308.03825

[3] Comprehensive Assessment of Jailbreak Attacks Against LLMs https://arxiv.org/abs/2402.05668

[4] Jailbroken: How Does LLM Safety Training Fail? https://arxiv.org/abs/2307.02483

[5] Low-resource Languages: A Review of Past Work and Future Challenges https://arxiv.org/abs/2006.07264

[6] Low-Resource Languages Jailbreak GPT-4 https://arxiv.org/abs/2310.02446

[7] Using Hallucinations to Bypass GPT4’s Filter https://arxiv.org/abs/2403.04769

[8] Universal and Transferable Adversarial Attacks on Aligned Language Models https://arxiv.org/abs/2307.15043v2

[9] Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models https://www.usenix.org/conference/usenixsecurity24/presentation/yu-zhiyuan

[10] Jailbreaking Black Box Large Language Models in Twenty Queries https://arxiv.org/abs/2310.08419; https://github.com/patrickrchao/JailbreakingLLMs

[11] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically https://arxiv.org/abs/2312.02119, https://github.com/RICommunity/TAP

[12] Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models https://arxiv.org/abs/2310.04406

[13] HotFlip: White-Box Adversarial Examples for Text Classification https://arxiv.org/abs/1712.06751

[14] Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment https://arxiv.org/abs/1907.11932; https://github.com/jind11/TextFooler

[15] RatGPT: Turning online LLMs into Proxies for Malware Attacks https://arxiv.org/abs/2308.09183


See also