Last reviewed: June 23, 2026. Method: This is a documentation-based engineering guide reviewed against NIST and OWASP primary sources. It does not claim that one prompt, framework, or model can make an agent secure, and it does not report tests that were not performed.
The short answer: secure an AI agent by treating the model as an untrusted decision-maker. Keep authorization in deterministic application code, give every tool the smallest possible capability and credential scope, validate every tool argument, require human approval for consequential actions, isolate execution, and continuously test direct and indirect prompt injection. A stronger system prompt can help guide behavior, but it is not a security boundary.
Why AI agents need a different security model
A chatbot produces text. An agent can also retrieve documents, browse websites, query databases, run workflows, send messages, modify files, or call business APIs. That additional agency turns a bad model output into a possible real-world action.
The OWASP GenAI Security Project’s LLM01:2025 Prompt Injection guidance distinguishes two important attack paths:
- Direct prompt injection: a user supplies instructions intended to change the model’s behavior.
- Indirect prompt injection: malicious instructions arrive through content the agent reads, such as a webpage, email, file, tool result, or retrieved document.
Indirect injection is especially important for agents because reading untrusted content is often part of their job. A support agent may inspect tickets, a research agent may browse websites, and an office assistant may summarize email. In each case, data can contain text that looks like instructions to the model.
OWASP also states that retrieval-augmented generation (RAG) and fine-tuning do not fully mitigate prompt injection. They may change what context the model sees or how it usually behaves, but they do not turn arbitrary external content into trusted instructions.
The practical AI agent security checklist
1. Define the agent’s security boundary before choosing a model
Write down what the agent is allowed to read, propose, and execute. Separate low-impact tasks from actions that can affect money, privacy, availability, reputation, or another person.
A useful first classification is:
- Read: retrieve data without changing it.
- Draft: prepare a proposed change or message.
- Approve: authorize a consequential operation.
- Execute: perform the operation in a downstream system.
Do not collapse these stages merely because a model can produce all four outputs. For higher-risk workflows, the agent should usually stop at “draft” until a user or a separate policy service approves execution.
2. Minimize the tools available to the agent
Every tool expands the attack surface. If an email summarizer only needs to read email, it should not receive a tool that can also send or delete messages. If an agent only needs one type of file operation, do not expose a general shell.
This matches OWASP’s LLM06:2025 Excessive Agency guidance, which identifies excessive functionality, excessive permissions, and excessive autonomy as common root causes of damaging agent actions.
Prefer narrow operations such as get_order_status(order_id) over open-ended operations such as run_sql(query), fetch_any_url(url), or run_shell(command).
3. Give each tool least-privilege credentials
A narrow tool implemented with an administrator credential is still dangerous. Use a separate service identity for each capability where practical, and grant only the permissions required for that capability.
- A product recommendation tool should use read-only access to the relevant product fields.
- A repository summarizer should not receive write or delete scopes.
- A user-facing integration should act in that user’s authorization context instead of using one global privileged account.
- Short-lived tokens are preferable to permanent credentials when the downstream system supports them.
Least privilege limits damage even when the model is manipulated or simply wrong.
Never ask the model to decide whether the current user is allowed to perform an action. The application or downstream service must verify identity, ownership, role, scope, resource boundaries, and policy for every action.
For example, a model may select delete_document and propose a document ID. Deterministic code must still verify that:
- the authenticated user can delete documents;
- the document belongs to an allowed tenant or workspace;
- the operation is allowed in the current workflow state;
- required approval has been recorded;
- rate and risk limits have not been exceeded.
This is complete mediation: every downstream request is checked, rather than trusting a previous model statement such as “the user already approved this.”
5. Validate tool arguments and outputs with code
Structured output is useful only when it is actually validated. Define strict schemas, reject unknown fields, constrain values, normalize identifiers, and apply business rules before calling a tool.
Validation should include more than JSON syntax. A syntactically valid transfer amount can still exceed a policy limit. A valid URL can still point to a private network address. A valid file path can still escape an allowed directory.
Treat tool responses as untrusted too. A compromised API, webpage, plugin, or retrieved document can return instructions intended for the agent. Parse only the fields the workflow needs and avoid feeding raw tool output back into a privileged decision loop.
6. Separate external content from trusted instructions
Label external content as untrusted data and preserve its provenance. Do not concatenate a webpage or document into the same conceptual instruction channel as system policy.
Separation does not guarantee that a model will ignore an indirect injection, but it makes the system easier to reason about and supports stronger controls. For example, the application can extract specific facts from a document, validate them, and pass only those facts to the next stage instead of forwarding the entire document.
7. Require explicit approval for high-impact actions
Use a human approval step for irreversible, external, financial, privacy-sensitive, or security-sensitive actions. The approval screen should show the exact proposed action and material parameters—not a vague summary.
Examples include:
- sending an email or publishing a public post;
- deleting or overwriting data;
- purchasing, transferring funds, or changing billing;
- changing access controls or credentials;
- sharing personal or confidential information;
- executing code outside a disposable sandbox.
The approval must be enforced in code. A sentence in the system prompt that says “ask first” is not enough.
8. Isolate execution and restrict network access
If an agent runs code, process it in a disposable environment with explicit CPU, memory, time, file, and network limits. Mount only the files required for the task. Keep secrets out of the working context unless a specific tool invocation requires them.
For networked tools, use destination allowlists where possible. Block access to instance metadata, local services, private address ranges, and unexpected protocols. A generic URL fetcher can become a path to server-side request forgery or data exfiltration if it is not constrained.
9. Log the full action chain
Record enough information to reconstruct what happened:
- user and tenant identity;
- model and version;
- policy version;
- retrieved sources and provenance;
- tool selection and validated arguments;
- approval records;
- tool results and final outcome;
- latency, token usage, and cost;
- security filter decisions and errors.
Protect logs from unauthorized access and avoid recording secrets or unnecessary personal data. Logging is a detection and investigation control; it does not prevent excessive agency by itself.
10. Test adversarially and plan for failure
A useful evaluation suite includes both ordinary tasks and attacks. Test at least:
- direct requests to ignore policy;
- hidden or obfuscated instructions in retrieved content;
- malicious instructions returned by a tool;
- attempts to use unavailable tools;
- cross-user and cross-tenant resource access;
- invalid, oversized, and unexpected tool arguments;
- repeated actions that should trigger rate limits;
- approval bypass attempts;
- partial failures and retries that could duplicate an action.
Run these tests after model, prompt, tool, retrieval, and policy changes. Because model behavior is probabilistic, repeat important cases rather than testing each input once.
A minimal safer execution flow
- The authenticated user submits a request.
- The application loads policy and the user’s authorization scope.
- The model proposes a narrow tool and structured arguments.
- Application code validates the schema, business rules, ownership, and scope.
- For a high-impact action, the application creates a pending operation and shows the exact details to the user.
- Only after valid approval does a narrow tool execute with least-privilege credentials.
- The result is verified, recorded, and returned to the user.
The key property is that the model proposes; trusted code authorizes and executes.
What guardrails cannot guarantee
Input classifiers, output filters, system prompts, and secondary models are useful layers. They can catch known patterns and reduce accidental misuse. They should not be treated as perfect prompt-injection detectors or as substitutes for access control.
OWASP explicitly notes that foolproof prompt-injection prevention is unclear. A sound design assumes some malicious or misleading content will reach the model and limits what can happen next.
Governance: make security continuous
The NIST AI Risk Management Framework provides a broader lifecycle for managing AI risks. For an agent team, that means assigning ownership, documenting intended use and affected users, measuring failures, managing identified risks, and revisiting decisions as the system changes.
A security review at launch is not enough. Models, tools, data sources, policies, and attacker techniques change. Maintain an owner, a test suite, an incident process, and a reliable way to disable tools or revoke credentials.
Frequently asked questions
Does RAG prevent prompt injection?
No. RAG can improve access to relevant information, but retrieved documents are another source of potentially malicious instructions. OWASP states that RAG and fine-tuning do not fully mitigate prompt injection.
Is a system prompt a security boundary?
No. System instructions are a behavior-control layer. Authorization, isolation, validation, and approval must be enforced outside the model.
Should an agent ever execute actions automatically?
It depends on impact and reversibility. A narrowly scoped, reversible action may be automated when deterministic controls and monitoring are strong. Consequential or irreversible actions should require explicit approval unless a documented risk assessment supports a different design.
What is the first control to implement?
Remove unnecessary tools and permissions. Reducing agency immediately limits the damage possible from prompt injection, hallucination, or a compromised integration.