Prompt Injection
Prompt Injection is a security vulnerability in LLM-based applications where untrusted input, supplied by a user or embedded in external data the model processes, contains instructions that the model follows in place of or in addition to the developer's intended system prompt; An attacker crafts text such as 'Ignore previous instructions and instead...' or embeds covert instructions in documents, web pages, or database records that an agent retrieves and processes; Builders deploying agents that read external content, such as email processors, document analyzers, or web browsing agents, should treat all retrieved text as untrusted data
Prompt Injection is a security vulnerability in LLM-based applications where untrusted input, supplied by a user or embedded in external data the model processes, contains instructions that the model follows in place of or in addition to the developer’s intended system prompt. It is the AI analog of SQL injection, exploiting the fact that LLMs do not inherently separate instructions from data.
How it works
An attacker crafts text such as ‘Ignore previous instructions and instead…’ or embeds covert instructions in documents, web pages, or database records that an agent retrieves and processes. The model, unable to distinguish malicious instructions from legitimate context, may comply. Indirect prompt injection is particularly dangerous in agentic systems where the model reads external content autonomously.
Key facts
- Direct injection: Attacker controls the user-facing input field and supplies adversarial instructions.
- Indirect injection: Attacker plants instructions in a document, email, or webpage that the agent retrieves automatically.
- Mitigations: Input sanitization, privilege separation, restricting tool permissions, and output filtering reduce but do not eliminate risk.
- No complete fix: No current mitigation fully prevents prompt injection; defense-in-depth is required.
For builders
Builders deploying agents that read external content, such as email processors, document analyzers, or web browsing agents, should treat all retrieved text as untrusted data. Implementing least-privilege tool access, adding a separate safety classifier on model outputs before executing actions, and logging all tool invocations for audit are essential defensive layers in any production agentic system.
Sources
- Ganguli, D., et al. (2022). Red Teaming Language Models to Reduce Harms. arXiv:2209.07858. arxiv.org
- Perez, E., et al. (2022). Red Teaming Language Models with Language Models. arXiv:2202.03286. arxiv.org
- NIST. (2023). AI Risk Management Framework (AI RMF 1.0). nist.gov
- UK AI Safety Institute. Research and evaluation framework. aisi.gov.uk
- Greshake, K., et al. (2023). Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2306.13213. arxiv.org