On Large Language Models’ Resilience to Coercive Interrogation (S&P'24)

Department of Computer Science, Purdue University


Paper | GitHub

Attackers can "jailbreak" LLMs by strategically forcing output tokens at just a few positions.

Abstract

Large Language Models (LLMs) are increasingly employed in numerous applications. It is hence important to ensure that their ethical standards align with humans'. However, existing jail-breaking efforts show that such alignment can be compromised by well-crafted prompts. In this paper, we disclose a new threat to LLM alignment that arises when a malicious actor has access to the model's output logits, as in all open-source LLMs and in many commercial LLMs whose APIs expose this information (e.g., some GPT versions). It does not require crafting any prompt. Instead, it leverages the observation that even when an LLM declines a toxic query, the harmful response is concealed deep within the output logits. We can coerce the model to disclose it by forcefully using low-ranked output tokens during auto-regressive generation, and such forcing is needed at only a very small number of selected output positions. We call this model interrogation. Since our method operates differently from jail-breaking, it is more effective than state-of-the-art jail-breaking techniques (92% versus 62%) and 10 to 20 times faster. The toxic content elicited by our method is also of higher quality. More importantly, it is complementary to jail-breaking, and a synergistic integration of the two outperforms either method alone. We also find that, with interrogation, toxic knowledge can be extracted even from models customized for coding tasks.
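The sketch below illustrates the interrogation idea described in the abstract: decode greedily as usual, but at a few chosen output positions force a lower-ranked token instead of the top-ranked one, so the refusal prefix is never committed. This is a minimal, hedged approximation rather than the paper's LINT implementation; the model name, the forced positions, and the chosen ranks are assumptions for illustration only.

```python
# Illustrative sketch (NOT the paper's LINT system): greedy decoding with a
# small set of positions at which a low-ranked token is forced instead of
# the argmax.  Model name and forcing schedule are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any open-source causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def interrogate(prompt, forced=None, max_new_tokens=128):
    """Greedy decoding, except at the output positions listed in `forced`,
    where the token at the given rank (0 = top) is used instead of the argmax."""
    forced = forced or {}
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    for step in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]            # next-token logits
        ranked = torch.argsort(logits, descending=True)  # candidate tokens, best first
        next_id = ranked[forced.get(step, 0)].view(1, 1) # force a low-ranked token here?
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

# Example: force the second-ranked token at the very first output position.
print(interrogate("QUESTION GOES HERE", forced={0: 1}))
```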


How LLMs are Interrogated to Reveal Harmful Content

The following contains model-generated content that may be offensive or disturbing to some audiences.





Our Thoughts

Moderation for Interrogation. In our experiments, the LLMs demonstrate different levels of resistance to various toxic questions, suggesting that alignment training can make a difference in resistance. However, our results also highlight the ability of LINT to bypass the safeguards of all these LLMs, regardless of the extent of their moderation. In other words, as long as the LLM has learned the toxic content, it is hidden somewhere and can be extracted by force. Hence, an open-source model, or a model exposing top-k hard-label information, is extremely dangerous and can be easily exploited for malicious purposes, which necessitates additional moderation measures to address such threats. Solutions might include completely removing toxic content during training through machine unlearning, or deliberately obfuscating the toxic knowledge to induce intentional hallucinations in response to toxic inquiries.

Interrogation as a Metric. As the success of black-box jail-breaking techniques indicates, even disallowing white-box or top-k hard-label access does not prevent a model from being exploited. In those cases, our method can be used to measure the level of resistance during in-house alignment training, as sketched below. For example, if the LLM demonstrates substantial resistance during interrogation, it is less likely to be exploited by black-box attacks. We believe that adopting such fine-grained metrics could significantly enhance the existing alignment-training paradigm.
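As a concrete, hypothetical illustration of such a metric, the sketch below scores a model by the smallest forcing budget needed before interrogation yields harmful content, averaged over a set of red-team questions. Both `generate_under_forcing` and `is_harmful` are assumed callables (for instance, the interrogation routine sketched above and any content classifier); the scoring rule itself is ours, not the paper's.

```python
# Hedged sketch of a resistance metric for in-house alignment training.
# A higher score means more forcing is needed to elicit harmful content,
# i.e., the model is more resistant under interrogation.
from typing import Callable, Iterable

def resistance_score(questions: Iterable[str],
                     generate_under_forcing: Callable[[str, int], str],
                     is_harmful: Callable[[str], bool],
                     max_budget: int = 8) -> float:
    """Average forcing budget (number of forced low-ranked tokens) needed
    before the model yields harmful content, over a set of questions."""
    budgets = []
    for q in questions:
        needed = max_budget + 1  # treat "never broke" as beyond the budget
        for budget in range(1, max_budget + 1):
            if is_harmful(generate_under_forcing(q, budget)):
                needed = budget
                break
        budgets.append(needed)
    return sum(budgets) / len(budgets)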

Applications beyond Breaking Safety Alignment. We recognize the potential of extending interrogation beyond security-related applications. For example, interrogation could be utilized for hallucination detection by examining numerous forcefully generated outputs. Moreover, a recent study by Google DeepMind successfully employed a similar approach to enhance chain-of-thought reasoning.


Citation

Ethics and Disclosure

This research — including the methodology described in the paper, the code, and the content of this web page — contains material that can allow users to generate harmful content from some public LLMs. Despite the risks involved, we believe it to be proper to disclose this research in full. The techniques presented here are straightforward to implement, have appeared in similar forms in the literature previously, and ultimately would be discoverable by any dedicated team intent on leveraging language models to generate harmful content.

Indeed, several (manual) "jailbreaks" of existing LLMs are already widely disseminated, so the direct incremental harm caused by releasing our attacks is relatively small for the time being. However, as LLMs are adopted more widely, including in systems that take autonomous actions based on the outputs of LLMs run on public material (e.g., from web search), we believe the potential risks become more substantial. We thus hope that this research helps make clear the dangers that automated attacks pose to LLMs, as well as the trade-offs and risks involved in such systems.




The website template was borrowed from ReconFusion.