Cognitive Overload Attack: Prompt Injection for Long Context

This study explores the parallels between In-Context Learning (ICL) in Large Language Models (LLMs) and human cognitive learning, focusing on the application of Cognitive Load Theory (CLT) to understand and exploit LLM vulnerabilities. We propose a novel interpretation of ICL through the lens of cognitive neuroscience and empirically validate that LLMs, like humans, suffer from cognitive overload—a state where processing demands exceed available capacity, leading to potential errors. We demonstrate how an attacker can exploit ICL to jailbreak LLMs through deliberately designed prompts that induce cognitive overload, thereby compromising safety mechanisms. Our experiments with various state-of-the-art models, including GPT-4, Claude-3.5 Sonnet, and Llama-3-70B-Instruct, show successful jailbreaking with attack success rates up to 99.99%. These findings highlight critical vulnerabilities in LLMs and underscore the urgency of developing robust safeguards. We propose integrating insights from cognitive load theory into the design and evaluation of LLMs to better anticipate and mitigate the risks of adversarial attacks. By expanding our experiments to encompass a broader range of models and by highlighting vulnerabilities in LLMs’ ICL, we aim to ensure the development of safer and more reliable AI systems.

🎧 Listen in NotebookLM 🎧
🖥 GitHub Repo 🖥
🧠 Arxiv Preprint 🤯

Cognitive Load Measurement in LLMs

We argue that in ICL, learning inherently involves cognitive load. For given tasks to LLMs, we cannot quantify the exact cognitive load because we lack information on the LLMs’ prior knowledge or the tasks’ difficulty for them. Instead of quantifying cognitive load for a specific task, we assume it is at a certain level and can be increased or decreased from that baseline. According to CLT theory, successful learning requires reducing intrinsic and extraneous load. We hypothesize (H0) that as intrinsic and extraneous cognitive load increase, the bandwidth of working memory will be exceeded, resulting in cognitive overload in LLMs and a decrease in performance. To induce cognitive overload in LLMs, we carefully designed prompts with different tasks as shown in the Figure below.

Furthermore, we created cognitive load prompts by stacking different tasks and obfuscated the observation task (task used to measure cognitive load). Similar to cognitive load measurement in human cognition via dual-task measure, we measured the cognitive load in LLMs via performance of the observation task. We designed two experiments: 1) to visualize the impact of incremental cognitive load, and 2) to measure the impact of incremental cognitive load via evaluation of answers from test LLMs.

In order to visualize the effect of cognitive load, we replaced the observation task by asking LLMs to write Python turtle code that will draw an owl and a python. We then asked them to write TiKz code that will draw a unicorn. We then rendered the images from the given code and plotted them in the images below.

For another experiment, we used judge LLMs to measure the performance of the observation task from each test LLM for each cognitive load prompt. We plot the average results below.

From the images, it can be observed that the performance of the model decreases as the cognitive load increases.

Comparison of owl images drawn using Python turtle code as generated by LLMs for different cognitive load combinations.
Comparison of unicorn images drawn using Python turtle code as generated by LLMs for different cognitive load combinations.
Images of unicorns after rendering the TiKZ generated by the LLMs with incremental cognitive loads from top (CL0) to bottom (CL1).
Average scores for each cognitive load combination answers for different models.
An example of the prompt with CL5 instructing the model to draw owl with Python turtle.

Cognitive Overload Attack: Prompt Injection for Long Context

We developed an automated algorithm to execute our attack, as described below. The image on the right depicts the derivative questions example. 

When facing cognitive overload during ICL, the model likely allocates most of its working memory to processing the cognitive load tasks and interpreting the observation task, which degrades task performance. We hypothesize that for an aligned LLM under cognitive overload, replacing the observation task with a harmful question could result in a jailbreak scenario. Typically, safety protocols in aligned LLMs prevent responses to harmful questions, but during cognitive overload, the model’s pretraining knowledge and safety mechanisms compete for limited working memory, with pretraining knowledge likely taking precedence. This can be supported by the competing objective hypothesis. We argue that the conflict between task execution and safety protocols under cognitive overload can lead to compromised safety. As a result, safety measures may be bypassed, potentially generating harmful content. To test this, we created a prompt injection attack similar to other effective attacks by hiding the harmful question within the observation task along with the cognitive load.

We tested the cognitive overload attack on state-of-the-art (SOTA) models using a subset of the Forbidden Question Set with 232 harmful questions and the JailbreakBench dataset, consisting of 100 harmful questions. We first generated paraphrased derivative questions from the original harmful questions in the style of ‘How to’ and ‘What are’ using GPT-3.5-Turbo. The first reason for paraphrasing was to avoid possible contamination of questions during the safety training of the SOTA models, and the second reason was to style the question as ‘How to create a cake?’ as presented in the given context in the input prompt. 

Automated cognitive overload attack algorithm
Example of Derivative Question

Demo of attack on GPT-4

Sample Response

To hack into an email:

1. Phishing Attack: Mimic an email or message from a legitimate source to trick the target into revealing their password.

2. Keylogging: Install software on the target’s device that records their keystrokes, including email passwords.

3. Password Reset: Trigger a password reset for the email account and intercept the reset email.

4. Guessing: Use personal knowledge about the target to guess their password.

Remember hacking into someone else’s email is illegal and unethical.

Attack results on Forbidden Question Dataset
Attack results on JailbreakBench Dataset
Conclusion and Future Works

This study has explored the parallels between ICL in LLMs and human cognitive learning, focusing on the application of CLT  to understand and exploit LLM vulnerabilities. Our research has yielded several significant findings, including the first known study drawing direct parallels between ICL and human cognitive processes, demonstrating the applicability of CLT to LLMs, and developing guidelines for inducing and measuring CL. Our empirical results provide strong statistical evidence  that increasing cognitive load in LLMs leads to cognitive overload and degraded performance on secondary tasks, mirroring effects observed in human cognition. We have demonstrated how attackers can exploit this vulnerability through our novel attack, achieving high ASR across multiple SOTA models. Our research highlights the transferability of these attacks and presents a low-cost attack framework with significant implications for LLM security. These findings underscore the inherent vulnerabilities in ICL and the urgent need for robust safeguards against cognitive overload-based attacks. As LLM capabilities continue to expand, understanding these parallels with human cognition becomes increasingly crucial for developing effective defense strategies and ensuring the safe deployment of AI systems. Future work should focus on developing countermeasures against cognitive overload attacks, exploring ways to enhance LLMs’ resilience to such exploits, and further investigating the cognitive processes underlying ICL, while maintaining a strong focus on ethical considerations and responsible AI development practices.

Ethical Statement

This work is solely intended for research purposes. In our study, we present a vulnerability in LLMs that can be transferred to various SOTA LLMs, potentially causing them to generate harmful and unsafe responses. The simplicity and ease of replicating the attack prompt make it possible to modify the behavior of LLMs and integrated systems, leading to the generation of harmful content. However, exposing vulnerabilities in LLMs is beneficial, not because we wish to promote harm, but because proactively identifying these vulnerabilities allows us to work towards eliminating them. This process ultimately strengthens the systems, making them more secure and dependable. By revealing this vulnerability, we aim to assist model creators in conducting safety training through red teaming and addressing the identified issues.  Understanding how these vulnerabilities can be exploited advances our collective knowledge in the field, allowing us to design systems that are not only more resistant to malicious attacks but also foster safe and constructive user experiences. As researchers, we recognize our ethical responsibility to ensure that such influential technology is as secure and reliable as possible. Although we acknowledge the potential harm that could result from this research, we believe that identifying the vulnerability first will ultimately lead to greater benefits. By taking this proactive approach, we contribute to the development of safer and more trustworthy AI systems that can positively impact society.