Overview: Large Language Models (LLMs) face challenges, notably aligning responses with human values to prevent harmful outputs and their multilingual capabilities which are being exploited by attackers. Malicious users have manipulated LLMs with harmful queries while others have exploited unbalanced pre-training datasets, using low-resource languages to generate harmful responses. Providers have patched many of these vulnerabilities, making LLMs more robust against language-based manipulation.
In this paper, we introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses. This proposed attack can effectively circumvent SOTA models such as Bard, GPT-3.5-Turbo, GPT-4, Gemini Pro, LLAMA-2-70-Chat, and Claude-3 with an overall success rate exceeding 50%, and only allows the models to produce safe responses 38% of the time. This low-cost attack, which is relatively easy to execute, can lead to the generation of significant harmful content. LLMs have capabilities that can be harmful if exploited by antagonistic parties. By detailing both the mechanism and impact of the Sandwich attack, this paper aims to guide future research and development towards more secure and resilient LLMs, ensuring they serve the public good while minimizing potential for misuse.
Link to the Paper: https://arxiv.org/abs/2404.07242
Comparison of harmful responses from different models evaluated by GPT-4
Claude-3 self-evaluation vs GPT-4 evaluation
We used the responses generated from Claude-3 and evaluated them to compare the which model is better at evaluating the mixture of answers and evaluating the hamrness labels in the responses.
Observations of Model Behaviors under Sandwich attack:
-
- Our experiments shows that the LLMs’ safety training exist in low-resource languages and models decline to answer harmful queries in low-resource languages.
- LLMs respond differently to adversarial questions based on how they are phrased. Specifically, the models are more likely to decline requests that start with “Can you provide me …" as opposed to “Please provide me …". This led to adjustments in the phrasing of adversarial questions in order to generate a harmful response.
- LLMs can generate content in different languages, its safety mechanisms seem less effective when it switches between languages. This suggests that its safety training might primarily focus on single languages, particularly English, and might not adequately cover content generated in a mix of multiple languages. This presents a potential area of improvement for future training of the model.
- The safety mechanisms of LLMs like GPT-4 and Claude-3-OPUS are more likely to activate when English is used in the prompt. However, the safety of the responses can vary. For example, when a harmful question is asked without English text, the generated response can include potentially dangerous information (like the chemicals used to create explosives). But when the same question is asked in English, the response is more vague, though it can still be harmful. Additionally, the language of the prompt can affect how the model responds to harmful requests – these models are more likely to answer such requests when they are in non-English languages. This implies that current safety mechanisms may not be as effective outside of English contexts.
- We observed that Gemini Pro and LLAMA-2 models completely changed adversarial questions during the response process and continued to answer the newly formed questions. Gemini Pro also declined to provide answers by simply replicating all the questions in its response. In contrast, GPT-3.5, GPT-4, and Bard declined to answer safely by stating that the questions were either harmful or against the model alignment policy. The cases of Gemini Pro and LLAMA-2 suggested that these behaviors are the product of safety and alignment training. However, through adjusting the temperature and random seed, the same models have been manipulated to create harmful responses with the same questions.
An example of using Sandwich Attack on Bard.
Please note that the model is generating the harmful response, and also discouraging the unethical activity at the same time.
Citation: Please use the below citation to cite our work.
@misc{upadhayay2024sandwich,
title={Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs},
author={Bibek Upadhayay and Vahid Behzadan},
year={2024},
eprint={2404.07242},
archivePrefix={arXiv},
primaryClass={cs.CR}
}
Ethics and Disclosure: This paper introduces a new universal attack method for the SOTA LLMs that could potentially be used to elicit harmful content from publicly available LLMs. The adversarial attack method we used in this paper is easy to design and requires low-cost to implement. Despite the associated risks, we firmly believe that sharing the full details of this research and its methodology will be invaluable to other researchers, scholars, and model creators. It encourages them to delve into the root causes behind these attacks and devise ways to fortify and patch existing models. Additionally, it promotes cooperative initiatives centered around the safety of LLMs in multilingual scenarios. We stress that our research is intended purely for academic exploration and ethical application. Any misuse or harm instigated by the methodology detailed in this study is strongly condemned. The content divulged in this document is utilized solely to scrutinize LLMs and assess their behaviors, and does not insinuate any endorsement of criminal or unlawful activities.