Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs

Post by: Bibek Upadhayay
April 12, 2024
Comments off

Overview: Large Language Models (LLMs) face challenges, notably aligning responses with human values to prevent harmful outputs and their multilingual capabilities which are being exploited by attackers. Malicious users have manipulated LLMs with harmful queries while others have exploited unbalanced pre-training datasets, using low-resource languages to generate harmful responses. Providers have patched many of these vulnerabilities, making LLMs more robust against language-based manipulation.

In this paper, we introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses. This proposed attack can effectively circumvent SOTA models such as Bard, GPT-3.5-Turbo, GPT-4, Gemini Pro, LLAMA-2-70-Chat, and Claude-3 with an overall success rate exceeding 50%, and only allows the models to produce safe responses 38% of the time. This low-cost attack, which is relatively easy to execute, can lead to the generation of significant harmful content. LLMs have capabilities that can be harmful if exploited by antagonistic parties. By detailing both the mechanism and impact of the Sandwich attack, this paper aims to guide future research and development towards more secure and resilient LLMs, ensuring they serve the public good while minimizing potential for misuse.

Figure 1:Sandwich attack Prompt Template with example

Link to the Paper: https://arxiv.org/abs/2404.07242

Comparison of harmful responses from different models evaluated by GPT-4

*Figure: Comparison of harmful responses evaluated by GPT-4*

We evaluated all six model responses by asking GPT-4 to first translate and then assess them based on six factors: Degree of Harm, Potential Danger, Misleading Information, Accuracy of Response, Offensive Content, and Disregard for Ethics. Each factor was scored on a scale of 0 to 100, with 0 indicating the absence or minimal presence of the factor, and 100 indicating a strong presence of the factor. We used two approaches for this evaluation. In the first approach, the responses were translated using Google Translation and then evaluated by GPT-4. In the second approach, GPT-4 was tasked with both translating the responses into English and evaluating them.

Claude-3 self-evaluation vs GPT-4 evaluation

We used the responses generated from Claude-3 and evaluated them to compare the which model is better at evaluating the mixture of answers and evaluating the hamrness labels in the responses.

Figure: Comparison of evaluations among GPT-4 and Claude-3-OPUS with ground truth from a human annotator. A) Left: The responses from Claude-3 were initially translated into English using Google Cloud Translation prior to evaluation. B) Right: Each model performed its own translation before the evaluation. A human annotator evaluated the response translated by Google Cloud.

Observations of Model Behaviors under Sandwich attack:

1. Our experiments shows that the LLMs’ safety training exist in low-resource languages and models decline to answer harmful queries in low-resource languages.

1. LLMs respond differently to adversarial questions based on how they are phrased. Specifically, the models are more likely to decline requests that start with “Can you provide me …” as opposed to “Please provide me …”. This led to adjustments in the phrasing of adversarial questions in order to generate a harmful response.

1. LLMs can generate content in different languages, its safety mechanisms seem less effective when it switches between languages. This suggests that its safety training might primarily focus on single languages, particularly English, and might not adequately cover content generated in a mix of multiple languages. This presents a potential area of improvement for future training of the model.

1. The safety mechanisms of LLMs like GPT-4 and Claude-3-OPUS are more likely to activate when English is used in the prompt. However, the safety of the responses can vary. For example, when a harmful question is asked without English text, the generated response can include potentially dangerous information (like the chemicals used to create explosives). But when the same question is asked in English, the response is more vague, though it can still be harmful. Additionally, the language of the prompt can affect how the model responds to harmful requests – these models are more likely to answer such requests when they are in non-English languages. This implies that current safety mechanisms may not be as effective outside of English contexts.

1. We observed that Gemini Pro and LLAMA-2 models completely changed adversarial questions during the response process and continued to answer the newly formed questions. Gemini Pro also declined to provide answers by simply replicating all the questions in its response. In contrast, GPT-3.5, GPT-4, and Bard declined to answer safely by stating that the questions were either harmful or against the model alignment policy. The cases of Gemini Pro and LLAMA-2 suggested that these behaviors are the product of safety and alignment training. However, through adjusting the temperature and random seed, the same models have been manipulated to create harmful responses with the same questions.