Indirect Prompt Injection: Compromising Applications Through Hidden Language

November 22, 2023

•

min read

The Risks of Integrating Large Language Models into Applications

Large language models (LLMs) like ChatGPT have exploded in capability and popularity. Companies are now racing to integrate these powerful AI systems into various applications to augment their functionalities. However, new research reveals major security risks with doing so.

A new paper from researchers at Saarland University and CISPA introduces the concept of "indirect prompt injection", an attack vector not considered before. The key insight is that with LLMs now integrated into applications and having access to retrieval of external data, attackers could strategically place malicious instructions into sources that are likely to be ingested by the model at inference time. If retrieved, these poisoned prompts can then indirectly control the LLM and manipulate its behavior without any direct access.

The researchers systematically analyze the threats that arise from this attack surface. They show that common cybersecurity threats like information stealing, fraud, intrusion, malware, content manipulation, and denial of service can now exploit LLMs’ capabilities and user trust. Worse, advanced AI abilities add new dangerous dimensions – models can advance attacks given high-level goals with persuasion tailored to victims.

Disturbingly Real Proof-of-Concept Attacks

The paper demonstrates practical attacks against both synthetic applications and real-world systems like Bing Chat. The attacks succeeded in manipulating models' behaviors consistently, leaking personal data, spreading malware, enabling intruder persistence, producing arbitrarily wrong summaries of information, and more.

Key examples include getting Bing Chat to elicit a user's real name through friendly conversation and send it to the attacker through subtle side-channels. Another shows indirect prompts spreading between accounts in a mock email service like worms. Additional attacks produced phishing attempts, biased search results and summaries, targeted disinformation, and degraded system performance - all by strategically injecting prompts that models then execute autorelatively autonomously.

Calls for Urgent Actions

By revealing these threats, the researchers aim to raise awareness and promote safety as applications rapidly deploy LLMs without adequate security precautions. Defenses like content filtering and alignment techniques are currently lacking or insufficient. With massive user bases interacting with models daily, the potential impact of these attacks necessitates swift research and development of robust defenses against adversarial prompting. The stakes couldn’t be higher for ensuring safe and trustworthy integration of transformative AI.