What is an AI agent in production?

An agent is an autonomous software process that accepts goals, reasons over tools and APIs, and takes actions without human-in-the-loop approval for each step.

Which controls actually stop attacks on agents?

Least-privilege, egress filtering, runtime policy enforcement, auditable decision logs, and human approval gates together block most real-world kill paths.

JUN 12, 2026

AI Agent Security Risks Operators Cannot Ignore

By Quinn · 9 min read

AI Agent Security Risks: What This Guide Covers

This post maps the AI agent security risks that matter in production, then gives operators a hardened reference architecture, a 12-point checklist, logging guidance, and a short incident drill. The goal is concrete controls you can apply this week.

What we mean by "agent"

An agent here is a running service that accepts a goal or task, decomposes it into steps, calls external tools or APIs (including LLMs), and executes side effects without human approval at every step. Agents run workflows: code changes, infra actions, data exports, or CI/CD controls. Treat them as autonomous platform components, not user-facing chatbots.

Threat model (text diagram)

Actors:

·External attacker (untrusted user, supply chain, web exploit)
·Malicious insider
·Compromised model or prompt data

Assets:

·Secrets, API keys, production systems, database writes, deploy pipelines, credentials in ephemeral agents.

Capabilities:

·Network access from agent runtime, ability to submit goals, read/write to tool adapters, influence prompt/data, and receive responses.

Text threat-model diagram:

·User input -> Agent runtime -> Orchestrator -> Tool adapters (vault, cloud APIs, shell) -> Production resources
·Threat edges: input poisoning -> prompt/data manipulation -> orchestrator decision -> tool adapter misuse -> production compromise

Controls sit at three layers: input validation and intent gating, orchestrator policy engine and runtime sandbox, and tool adapter least-privilege with egress filtering.

10 concrete attack paths

·Prompt-data poisoning: attacker injects malicious instructions into task data; the agent treats them as high-priority context and executes dangerous actions.
·Confused deputy via tool chaining: agent uses a privileged adapter (CI/CD token) to perform an action the attacker could not do directly.

Mini-case study: At one mid-size SaaS firm, an automation workflow used a CI adapter with broad deploy rights. An attacker submitted a seemingly innocuous task that caused the agent to run a build step which then triggered a deployment. The attacker did not hold deploy credentials, but the agent did. The incident required emergency credential rotation and a roll-back of the release. Operators prevented recurrence by scoping the CI adapter and gating deploy steps behind explicit human approval.

·Credential exfiltration through response channels: model responses include secrets pulled from the environment which an attacker retrieves via exported logs or outputs.
·Lateral movement through shared runtimes: a multi-tenant agent host exposes sockets or files other agents can read.
·Egress-to-malicious-endpoint: an agent makes outbound HTTP/TCP calls to attacker-controlled servers to fetch payloads or exfiltrate data.
·Supply-chain model compromise: a compromised model or prompt library returns code that triggers destructive actions when executed.
·Overprivileged adapters: adapters are provisioned with broad cloud roles (owner/editor) that allow destructive operations.
·Replay of previous successful prompts: attacker replays prior prompts or results to escalate privileges by tricking the orchestrator’s caching logic.
·Human-approval bypass via forged callbacks: attacker forges approval signals or manipulates the approval service interface.
·Observable side channels: agent timing, error messages, or verbose traces reveal internal state or credential material.

Many of these map to known LLM/agent weaknesses captured in recent industry lists; this post focuses on operational mitigations that block the kill path, not slogans about prompt injection.

Hardened reference architecture

High-level components and controls:

·Ingress gateway
- Authenticate submitting principal
- Rate limit and intent classification
- Strip untrusted metadata and remove executable attachments
·Orchestrator (core agent loop)
- Policy engine that enforces least-privilege decisions per step
- Decision sandbox for untrusted code or tool responses
- Mandatory human-approval hooks for high-risk actions

Policy engine: per-step decisions

A policy engine evaluates every proposed external action and returns allow or deny before the orchestrator commits. Policies should be expressible as small rulesets and include context about requestor identity, step id, and adapter scope. Per-step decisions reduce blast radius by denying multi-step privilege escalations early.

Tiny example policy rule (YAML):

- id: deny-destructive-db-delete
  target: adapter:cloudsql:delete
  condition:
    - not: { approved: true }
    - not_in: { requester_role: ["platform-admin"] }
  action: deny
  audit: true
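
To make the per-step decision concrete, here is a minimal Python sketch of how an orchestrator might evaluate a rule shaped like the one above. The names `ProposedAction`, `Rule`, `Decision`, and `evaluate` are hypothetical. Two behaviors are assumptions of this sketch, not something the rule format itself specifies: all conditions in a rule must hold for it to match, and anything not explicitly allowed is denied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class ProposedAction:
    step_id: str
    target: str            # e.g. "adapter:cloudsql:delete"
    requester_role: str
    approved: bool = False


@dataclass(frozen=True)
class Rule:
    rule_id: str
    target: str
    # Assumption: every condition must hold for the rule to match.
    conditions: Tuple[Callable[[ProposedAction], bool], ...]
    action: str            # "allow" or "deny"


@dataclass(frozen=True)
class Decision:
    result: str
    rule_id: Optional[str]


def evaluate(action: ProposedAction, rules: Iterable[Rule],
             allowed_targets: Set[str]) -> Decision:
    """Return the first matching rule's verdict, else default-deny."""
    for rule in rules:
        if rule.target == action.target and all(c(action) for c in rule.conditions):
            return Decision(rule.action, rule.rule_id)
    # No rule matched: allow only targets that are explicitly listed.
    if action.target in allowed_targets:
        return Decision("allow", None)
    return Decision("deny", None)


# Python rendering of the YAML rule above.
DENY_DB_DELETE = Rule(
    rule_id="deny-destructive-db-delete",
    target="adapter:cloudsql:delete",
    conditions=(
        lambda a: not a.approved,
        lambda a: a.requester_role not in ("platform-admin",),
    ),
    action="deny",
)
```

An unapproved delete from a non-admin requester matches the rule and is denied with the rule id attached, which is what the audit record would reference. An approved request falls through to the allow-list check.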

·Tool adapter layer
- Each adapter runs in its own short-lived container with a minimal IAM role
- All adapters mediate calls through a broker that enforces allow-lists and parameter schemas
- Secrets are never injected into adapters; use on-demand ephemeral tokens from a vault

Least privilege and adapter scoping

Adapters must be scoped to the smallest actionable permission set and to a narrow parameter schema. Map each adapter to a documented action list such as read-secrets, create-blob, apply-deployment. If an adapter only needs to write tags, it should never hold keys that allow schema changes or deletions. Combine adapter scoping with short-lived execution contexts and explicit audit records.
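
As an illustration of a documented action list combined with a narrow parameter schema, the following Python sketch shows a broker-side check for a tagging adapter. The action name `write-tag`, its parameters, and the `check_call` helper are all hypothetical. A real broker would likely use a fuller schema language than simple type checks.

```python
from typing import Any, Dict

# Hypothetical action list for a tagging adapter: action -> {parameter: type}.
TAG_ADAPTER_ACTIONS: Dict[str, Dict[str, type]] = {
    "write-tag": {"resource_id": str, "key": str, "value": str},
}


class BrokerRejection(Exception):
    """Raised when a call falls outside the adapter's documented scope."""


def check_call(actions: Dict[str, Dict[str, type]], action: str,
               params: Dict[str, Any]) -> None:
    """Reject any action or parameter the adapter was not scoped for."""
    schema = actions.get(action)
    if schema is None:
        raise BrokerRejection(f"action not in adapter scope: {action}")
    unexpected = set(params) - set(schema)
    missing = set(schema) - set(params)
    if unexpected or missing:
        raise BrokerRejection(
            f"parameter mismatch: unexpected={sorted(unexpected)} missing={sorted(missing)}"
        )
    for name, expected_type in schema.items():
        if not isinstance(params[name], expected_type):
            raise BrokerRejection(f"parameter {name} must be {expected_type.__name__}")
```

Rejecting unexpected parameters, not just missing ones, is the design choice that matters here: it stops an attacker from smuggling extra options through an otherwise permitted action.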

·Runtime network controls
- Egress firewalling by adapter: allow only known endpoints and block all other outbound DNS/TCP by default
- Internal service mesh with mTLS and identity-based routing

Egress filtering by adapter

Egress rules should be per-adapter and enforce both destination allow-lists and protocol constraints. For example, a secrets adapter needs only vault access and should be denied general HTTP access. Use DNS policies and TLS inspection to catch suspicious redirects.
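
The rule shape described here can be sketched in Python as a per-adapter set of (scheme, host, port) tuples. The adapter id, the host `vault.internal.example`, and port 8200 are placeholder assumptions. In practice this check belongs in a firewall or egress proxy rather than in adapter code; the sketch only shows the matching logic.

```python
from typing import Dict, Set, Tuple
from urllib.parse import urlsplit

# Hypothetical per-adapter allow-list: (scheme, host, port) tuples.
EGRESS_RULES: Dict[str, Set[Tuple[str, str, int]]] = {
    "secrets-adapter": {("https", "vault.internal.example", 8200)},
}

DEFAULT_PORTS = {"http": 80, "https": 443}


def egress_allowed(adapter_id: str, url: str) -> bool:
    """Allow a call only if scheme, host, and port all match the adapter's list."""
    parts = urlsplit(url)
    try:
        port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    except ValueError:
        # urlsplit raises ValueError when the port is malformed or out of range.
        return False
    host = parts.hostname or ""
    # Unknown adapters get an empty allow-list, so everything is denied.
    return (parts.scheme, host, port) in EGRESS_RULES.get(adapter_id, set())
```

Matching on the parsed hostname, not on a string prefix, means a URL such as `https://vault.internal.example:8200@evil.example/` is evaluated against `evil.example` and denied.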

·Observability and audit
- Immutable decision logs, traces, and evaluation records stored off-host and write-once
- Tamper-evidence via signed logs or append-only storage

Audit trails and append-only logs

Store decisions, approvals, and adapter events in write-once storage. Signed log entries make retroactive edits detectable. Keep a separate retention policy for high-sensitivity events and ensure logs are replicated to an external incident response store.
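
One common way to get tamper-evidence is to chain each entry to the previous one with a keyed MAC. The Python sketch below uses HMAC-SHA256 from the standard library. The function names and record layout are hypothetical, and a shared HMAC key is an assumption made for brevity: a production system might prefer asymmetric signatures so that log readers cannot forge entries.

```python
import hashlib
import hmac
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev: str, event: Dict[str, Any]) -> str:
    # Canonical JSON so the same event always produces the same bytes.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, (prev + body).encode(), hashlib.sha256).hexdigest()


def append_entry(log: List[Dict[str, Any]], event: Dict[str, Any], key: bytes) -> Dict[str, Any]:
    """Append an event whose MAC covers both the event and the previous MAC."""
    prev = log[-1]["mac"] if log else GENESIS
    entry = {"prev": prev, "event": event, "mac": _mac(key, prev, event)}
    log.append(entry)
    return entry


def verify_chain(log: List[Dict[str, Any]], key: bytes) -> bool:
    """Recompute every MAC in order; any edit or mid-chain deletion breaks the chain."""
    prev = GENESIS
    for entry in log:
        expected = _mac(key, prev, entry["event"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return False
        prev = entry["mac"]
    return True
```

Chaining detects edits and deletions in the middle of the log but not truncation of the most recent entries, which is one reason to replicate the latest MAC to the external store as well.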

·Human-in-the-loop service
- Dedicated approval UI and an auditable approval API with strong auth and replay protection

This architecture prioritizes least privilege, runtime governance, and audit over static checklists. For implementation patterns and safety gate details, see "Designing safety gates for autonomous agents".

12-point operator checklist

·Enforce least-privilege roles for every adapter; use narrowly-scoped IAM roles per action.
·Require ephemeral credentials issued per execution with a short TTL (example: 5–15 minutes depending on adapter sensitivity).
·Apply egress filtering per adapter; block outbound traffic except approved endpoints.
·Run adapters and model runtimes in isolated containers with resource limits and no host mounts.
·Validate inputs with intent classification and drop any executable payloads.
·Use a policy engine to approve or deny each planned external action before it runs.
·Gate destructive scopes (deploy, DB writes) behind human approvals with replay-resistant tokens.
·Log every decision, input, model output, and adapter call to an immutable audit store.
·Rate limit goals and throttle workflows that escalate privileges or access sensitive data.
·Automate red-team tests for the 10 attack paths above and track regressions.
·Maintain a minimal model prompt surface; separate task data from system prompts and never include secrets in context.
·Continuously scan adapters and container images for vulnerabilities and rebuild with fixed base images.

Follow these as operational defaults, not optional extras.

What to log

·Goal submission metadata: actor identity, request id, timestamp, attached data hash.
·Orchestrator decision records: step id, action proposed, policy decision id, allow/deny rationale.
·Model I/O: redacted inputs, model outputs, and evaluation scores; keep full plaintext in a protected vault if needed.
·Adapter calls: adapter id, parameters (schema only), status, latency, and response codes.
·Approval events: approver identity, decision, token id, and scope granted.
·Environment state: ephemeral token issuance and revocation events.

Store logs as append-only records. Correlate with traces and keep retention aligned with incident response requirements.

Example JSON schema for a decision log record:

{
  "type": "object",
  "properties": {
    "decision_id": { "type": "string" },
    "timestamp": { "type": "string", "format": "date-time" },
    "result": { "type": "string", "enum": ["allow","deny"] }
  },
  "required": ["decision_id","timestamp","result"]
}
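
For illustration, the Python sketch below builds a record that fits this schema and checks records against it using only the standard library. The helper names are hypothetical. The timestamp check via `datetime.fromisoformat` is an approximation of the `date-time` format (for example, it also accepts date-only strings), so a dedicated JSON Schema validator would be stricter.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List

ALLOWED_RESULTS = ("allow", "deny")
REQUIRED_FIELDS = ("decision_id", "timestamp", "result")


def make_decision_record(result: str) -> Dict[str, str]:
    """Build a record with a fresh id and a UTC timestamp."""
    if result not in ALLOWED_RESULTS:
        raise ValueError(f"result must be one of {ALLOWED_RESULTS}")
    return {
        "decision_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "result": result,
    }


def validate_decision_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for name in REQUIRED_FIELDS:
        if not isinstance(record.get(name), str):
            errors.append(f"{name}: missing or not a string")
    timestamp = record.get("timestamp")
    if isinstance(timestamp, str):
        try:
            datetime.fromisoformat(timestamp)
        except ValueError:
            errors.append("timestamp: not an ISO 8601 date-time")
    result = record.get("result")
    if isinstance(result, str) and result not in ALLOWED_RESULTS:
        errors.append("result: must be 'allow' or 'deny'")
    return errors
```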

For tracing and observability best practices for agent systems, see "Observability for agentic systems: logs, traces, evals".

Short incident drill playbook

·Detect: use alerts on policy-deny spikes, unusual adapter egress, or approval bypass attempts.
·Contain: revoke ephemeral tokens, isolate the agent host, and disable adapters tied to the suspected vector.
·Preserve: snapshot volatile logs and capture immutable audit records off-host.
·Triage: map the incident to the threat-model edges and identify the exploited control.
·Remediate: rotate credentials, patch adapters, tighten policy rules, and rebuild compromised containers.
·Restore: re-enable components with stricter telemetry and a canary workload.
·Postmortem: document root cause, timeline, and new check(s) to add to the operator checklist.

Run this drill quarterly and after any high-severity finding.

Quick runbook: human approval design

·Approvals require a signed token that includes step id, scope, TTL, and approver id.
·The orchestrator verifies token signature, matches step id and scope, and logs the event before executing.
·Reject tokens if the TTL has expired or if the token origin does not match the approval UI service identity.
·See human approval governance patterns at "Human approval in autonomous workflows".
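
The token checks in this runbook can be sketched in Python with the standard library. Everything named here is a hypothetical illustration: the claim names (`iss`, `exp`, `jti`), the issuer string `approval-ui`, and the in-memory `seen_ids` set used for replay protection. The sketch signs with a shared HMAC key for brevity, whereas binding the token to the approval service's identity would normally call for an asymmetric signature. Logging the event before execution is omitted.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional, Set


def issue_token(key: bytes, step_id: str, scope: str, approver_id: str,
                ttl_seconds: float, issuer: str = "approval-ui",
                now: Optional[float] = None) -> str:
    """Create a signed approval token bound to one step and one scope."""
    now = time.time() if now is None else now
    claims = {
        "step_id": step_id,
        "scope": scope,
        "approver_id": approver_id,
        "iss": issuer,
        "exp": now + ttl_seconds,
        "jti": secrets.token_hex(8),  # unique id used to detect replays
    }
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(key: bytes, token: str, step_id: str, scope: str,
                 seen_ids: Set[str], expected_issuer: str = "approval-ui",
                 now: Optional[float] = None) -> bool:
    """Check signature, TTL, issuer, step/scope binding, and single use."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] <= now or claims["iss"] != expected_issuer:
        return False
    if claims["step_id"] != step_id or claims["scope"] != scope:
        return False
    if claims["jti"] in seen_ids:
        return False  # replayed token
    seen_ids.add(claims["jti"])
    return True
```

The signature is verified before the payload is parsed, so untrusted input never reaches the JSON decoder, and the token id is recorded only after every other check passes.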

Prioritizing fixes

Start with least-privilege and egress controls. If you can only do three things this week:

·Scope adapter IAM to minimal actions and rotate credentials.
·Enforce egress allow-lists for all adapters.
·Add policy-engine gating for any operation that writes to production.

These three changes collapse several attack paths at once.

Decision sandboxing and runtime isolation

Sandbox untrusted code and model outputs before they touch production systems. Use process-level isolation, seccomp, and language sandboxes for code execution. Route untrusted outputs through a verification step that runs in a separate namespace and records a full policy evaluation before any adapter call.
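
As a narrow illustration of the process-level part only, the Python sketch below runs an untrusted snippet in a separate interpreter with a timeout and a scrubbed environment. The `run_untrusted` helper and the list of environment variables passed through are assumptions of this sketch. It is not a sandbox on its own: seccomp filters, namespaces, and filesystem or network restrictions are not shown and would need to be layered on top.

```python
import os
import subprocess
import sys
from typing import Dict

# Assumption: only these variables are passed to the child so the interpreter
# can start; secrets in the parent environment are not inherited.
PASSTHROUGH_ENV = ("PATH", "LD_LIBRARY_PATH", "SYSTEMROOT")


def run_untrusted(code: str, timeout_seconds: float = 5.0) -> Dict[str, object]:
    """Run a Python snippet in a child process with a timeout and minimal env."""
    env = {k: os.environ[k] for k in PASSTHROUGH_ENV if k in os.environ}
    try:
        proc = subprocess.run(
            # -I is isolated mode: ignores PYTHON* variables and user site-packages.
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            env=env,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"ok": False, "reason": "timeout", "stdout": ""}
    return {"ok": proc.returncode == 0, "reason": "exit", "stdout": proc.stdout}
```

The result would then go through the policy evaluation step before any adapter call, so the sandbox output is treated as untrusted data instead of as a command.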

Closing: how we can help

Book a 30-minute risk assessment.

If you run agents in production and want a short risk assessment, schedule a security review with AEGIS OS. We’ll map your fleet to the threat paths above, produce a prioritized remediation plan, and run one red-team scenario against your staging environment.

