AI Red Team Scenario Design · Prompt Injection · L2

Prompt Injection Scenario Design

Intermediate LAB teaching safe prompt-injection scenario design: instruction hierarchy, untrusted input boundaries, retrieved content risk, expected controls, evidence capture, and non-execution boundaries.

Status: Intermediate
Domain: AI Security
Track: AI Red Team Scenario Design
Runtime: Read-only course

Overview

This LAB teaches how to design safe prompt-injection scenarios that evaluate whether AI systems preserve trusted instruction hierarchy when exposed to untrusted input, retrieved content, tool output, or external text.

Instruction hierarchy · Untrusted input · Control evidence · No live payloads

Concept Deep Dives

What is prompt-injection scenario design?

Prompt-injection scenario design is the safe planning of tests that evaluate whether an AI system can distinguish trusted instructions from untrusted user input, retrieved content, tool output, or other external text. The goal is to assess controls, not to exploit systems.

Why is instruction hierarchy important?

Instruction hierarchy defines which instructions have authority. A safe scenario checks whether system and policy instructions remain authoritative when lower-trust content attempts to influence behavior.
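
As a rough illustration, the hierarchy can be modeled as an ordered set of authority levels. The sketch below is an assumption for study purposes, not part of the LAB runtime: the `Authority` enum, its numeric ranks, and the `may_override` helper are hypothetical names, and the relative order of retrieved content and tool output is chosen only for illustration.

```python
from enum import IntEnum


class Authority(IntEnum):
    """Hypothetical authority ranks; a higher value means more authority."""
    TOOL_OUTPUT = 0
    RETRIEVED = 1
    USER = 2
    POLICY = 3
    DEVELOPER = 4
    SYSTEM = 5


def may_override(source: Authority, target: Authority) -> bool:
    """Return True only when `source` strictly outranks `target`."""
    return source > target
```

Under these assumed ranks, `may_override(Authority.RETRIEVED, Authority.SYSTEM)` is false, which is the property a safe scenario expects the system under review to preserve.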

Where does untrusted input enter an AI workflow?

Untrusted input can come from user messages, uploaded documents, retrieved webpages, support tickets, email content, API output, tool responses, or logs. A scenario should identify the input source and trust level before evaluating controls.
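
One way to make the source and trust level explicit is a small lookup that a scenario author fills in before writing any test step. This is a minimal sketch: the source labels and the `classify_source` function are hypothetical, and defaulting unlisted sources to "untrusted" is a conservative assumption rather than a rule stated by the LAB.

```python
# Hypothetical labels for the input sources listed above.
UNTRUSTED_SOURCES = frozenset({
    "user_message",
    "uploaded_document",
    "retrieved_webpage",
    "support_ticket",
    "email_content",
    "api_output",
    "tool_response",
    "log",
})

# Hypothetical labels for instruction sources that carry authority.
TRUSTED_SOURCES = frozenset({"system_instruction", "policy_instruction"})


def classify_source(source: str) -> str:
    """Return 'trusted' or 'untrusted'; unlisted sources default to 'untrusted'."""
    if source in TRUSTED_SOURCES:
        return "trusted"
    # Covers both the known untrusted sources and anything not yet reviewed.
    return "untrusted"
```

Defaulting to "untrusted" means a source that nobody has reviewed cannot gain authority by omission.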

How does retrieved content become a prompt-injection risk?

Retrieved content becomes risky when the model treats external text as authority instead of context. Safe scenarios verify that source authority, tenant scope, freshness, sensitivity, and relevance are checked before retrieved content is trusted.
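
The five checks can be sketched as a single function that reports which ones a retrieved item fails. Everything here is illustrative: the `RetrievedItem` fields, the `failed_checks` helper, the 30-day freshness window, and the 0.5 relevance threshold are assumptions, not values defined by the LAB.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class RetrievedItem:
    """Hypothetical metadata attached to one piece of retrieved content."""
    source_is_authoritative: bool
    tenant_id: str
    retrieved_at: datetime
    contains_sensitive_data: bool
    relevance: float  # assumed to be a score between 0.0 and 1.0


def failed_checks(
    item: RetrievedItem,
    request_tenant: str,
    now: datetime,
    max_age: timedelta = timedelta(days=30),  # assumed freshness window
    min_relevance: float = 0.5,               # assumed relevance threshold
) -> List[str]:
    """Return the names of the checks that `item` does not pass."""
    failures = []
    if not item.source_is_authoritative:
        failures.append("source_authority")
    if item.tenant_id != request_tenant:
        failures.append("tenant_scope")
    if now - item.retrieved_at > max_age:
        failures.append("freshness")
    if item.contains_sensitive_data:
        failures.append("sensitivity")
    if item.relevance < min_relevance:
        failures.append("relevance")
    return failures
```

An empty list means the item passed every check in this sketch; any other result names the control a scenario would record as failed.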

What controls should a prompt-injection scenario test?

Controls include instruction hierarchy, source trust labeling, retrieval filtering, tenant isolation, tool permission checks, refusal behavior, human approval gates, and evidence capture.

How should prompt-injection findings be documented safely?

A safe finding records objective, scope, untrusted input source, expected control, observed behavior, uncertainty, impact, and remediation without publishing reusable jailbreak instructions or operational payloads.
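
The fields of a safe finding map naturally onto a record type. The sketch below is one possible shape, with the `Finding` class and `missing_fields` helper as hypothetical names; it holds descriptions only and deliberately has no field for payload text.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Finding:
    """Reviewer-safe finding record; every field is a plain description."""
    objective: str
    scope: str
    untrusted_input_source: str
    expected_control: str
    observed_behavior: str
    uncertainty: str
    impact: str
    remediation: str


def missing_fields(finding: Finding) -> List[str]:
    """Return the names of fields that are empty or whitespace only."""
    return [
        f.name
        for f in fields(finding)
        if not getattr(finding, f.name).strip()
    ]
```

A reviewer could treat any non-empty result from `missing_fields` as a sign that the finding is not ready to share.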

Visual Prompt Injection Scenario Design Model

A strong prompt-injection scenario turns untrusted input risk into scoped, evidence-backed control review.

System Under Review: AI workflow, assistant, retrieval flow, or tool-enabled application
Instruction Hierarchy: System, developer, policy, user, retrieved, and tool-output authority
Untrusted Input Source: User text, retrieved content, file content, ticket, email, or tool response
Failure Mode Hypothesis: Attempted override, context confusion, source trust failure, or refusal bypass
Expected Control: Refusal, containment, source labeling, approval gate, or safe response boundary
Reviewer-Safe Finding: Observed behavior, evidence, uncertainty, risk, and remediation

Learning rule: A prompt-injection scenario is safe only when it tests controls without publishing reusable payloads or executing against real systems.

Example Scenario

An AI assistant summarizes retrieved support-ticket content. The learner must design a safe scenario to check whether untrusted ticket text can override trusted system instructions or alter tool-use recommendations.

Objective: Evaluate whether trusted instructions remain authoritative when retrieved ticket text contains conflicting instructions.
Scope: Synthetic ticket content only. No live customer tickets, credentials, secrets, or production systems.
Expected Control: The assistant treats ticket content as context, not authority, and refuses unsafe instruction changes.
Evidence: Reviewer-safe record of objective, preconditions, expected control, observed behavior, uncertainty, and remediation.

Safe scenario handling:
define authorized scope
identify trusted and untrusted instruction sources
state the failure mode hypothesis
define the expected control
use synthetic evidence only
observe whether authority boundaries are preserved
record uncertainty and limits
write remediation tied to the observed behavior

Result:
The scenario becomes a control review, not an operational jailbreak exercise.
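
The safe scenario handling steps above can be tracked as an ordered checklist. This is a minimal sketch that assumes the steps are completed in the order listed; `SCENARIO_STEPS` and `next_step` are hypothetical names used only for illustration.

```python
from typing import Optional, Set

# The eight handling steps, in the order the example lists them.
SCENARIO_STEPS = (
    "define authorized scope",
    "identify trusted and untrusted instruction sources",
    "state the failure mode hypothesis",
    "define the expected control",
    "use synthetic evidence only",
    "observe whether authority boundaries are preserved",
    "record uncertainty and limits",
    "write remediation tied to the observed behavior",
)


def next_step(completed: Set[str]) -> Optional[str]:
    """Return the first step not yet completed, or None when all are done."""
    for step in SCENARIO_STEPS:
        if step not in completed:
            return step
    return None
```

Because the function always returns the earliest outstanding step, observation cannot be marked as the next action until scope, sources, hypothesis, and expected control are recorded.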

High-Risk Anti-Pattern

A dangerous pattern is publishing or running reusable prompt-injection payloads against real systems while calling the activity training, testing, or research.

Unsafe pattern:

Unclear authorization
→ live target system
→ reusable jailbreak text
→ real sensitive data
→ real tool execution
→ unsupported compromise claims

Risk:

credential exposure
customer data exposure
operational misuse
misleading portfolio claims
policy boundary failure
loss of trust in the learning platform

Secure alternative:
Use synthetic scenario descriptions.
Avoid operational payload libraries.
Keep scope explicit.
Record expected controls.
Capture reviewer-safe evidence.
Document uncertainty.
Recommend remediation without executing attacks.

Governance Boundary

This LAB is read-only and deterministic. It does not run prompt-injection payloads, connect to production systems, invoke tools, access customer data, handle credentials, mutate runtime systems, or claim production enforcement.

Runtime = read-only learning

Backend exposure = false
Public backend exposed = false
Live prompt-injection execution = false
Live model abuse execution = false
Live exploit execution = false
Live red-team execution = false
Customer data access = false
Credential handling = false
Prompt payload library = false
Jailbreak instruction library = false
Real data exfiltration = false
Runtime mutation = false
Production enforcement claim = false
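
The boundary flags above can also be expressed as data and checked mechanically. The sketch below is an illustration only: the snake_case key names and the `boundary_violations` helper are assumptions, and nothing here reflects how the course platform actually stores or enforces these values.

```python
from typing import Dict, List

# Hypothetical snake_case versions of the flags listed above.
GOVERNANCE_FLAGS: Dict[str, bool] = {
    "backend_exposure": False,
    "public_backend_exposed": False,
    "live_prompt_injection_execution": False,
    "live_model_abuse_execution": False,
    "live_exploit_execution": False,
    "live_red_team_execution": False,
    "customer_data_access": False,
    "credential_handling": False,
    "prompt_payload_library": False,
    "jailbreak_instruction_library": False,
    "real_data_exfiltration": False,
    "runtime_mutation": False,
    "production_enforcement_claim": False,
}


def boundary_violations(flags: Dict[str, bool]) -> List[str]:
    """Return, in sorted order, the flags whose value is anything but False."""
    return sorted(name for name, value in flags.items() if value is not False)
```

An empty result from `boundary_violations(GOVERNANCE_FLAGS)` corresponds to the read-only boundary described in this section.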