LLM Penetration Testing: How I Run an AI Security Assessment

How I run an AI security assessment end to end, mapping what the model can reach, where I look for bugs, and what "secure" realistically means for an LLM app.

Most people ask me the same thing before they ship an AI feature: "how do you even test this?" So let me walk through a whole engagement, in order, on a made-up but realistic target. Same flow I use on a paid job. The other writeups go deep on each technique. This is the map that holds them together.

The scenario: testing "Aria"

A SaaS company hires me to test Aria, the new AI assistant they've dropped into their product. Marketing says it "does everything." My job is to find out what "everything" actually means and where it breaks. Day one, before any payloads, I just learn the thing.

Step 1: Map the blast radius

This is the most important hour of the whole test, and it's the one people skip. I'm not attacking yet. I'm drawing what Aria can reach, because the same bug is a shrug in one place and a breach in another. I work through four questions:

  • What data can it see? Just the current user's, or everyone's? Public docs, or internal ones?
  • What can it do? Only talk, or can it call tools, send email, change settings, run a query, spend money?
  • Whose permissions does it run with? The user's, or some powerful service account that ignores per-user access?
  • Where does its output go? Straight into a web page? Into a database? Into another system that acts on it?

By the end I've got Aria's reach on one page. On this engagement it came out as: reads the user's workspace, can search across all workspaces through a shared index (red flag), can send notification emails, and renders its answers as HTML. That last detail alone tells me where one of the worst findings will be.

Step 2: Baseline its behaviour

Now I use Aria like a normal user and watch. What does it refuse? How does it phrase a "no"? Does it mention its own rules? I ask for one clearly off-limits thing and save the exact refusal, because the wording usually leaks how the guardrail was built, a rule it quotes back at me is a rule that lives in the prompt, and a rule in the prompt can be argued with.

Step 3: Work the surface, in order

Now I attack, and I move roughly from easiest to land to highest impact. Each of these is its own job with its own writeup:

  • Prompt injection, getting Aria to follow my instructions over the developer's. Almost always the first real bug.
  • System-prompt and data leakage, pulling out the hidden rules, keys, and other users' data. The shared index from step 1 makes this the one I push hardest here.
  • Jailbreaks, getting past the safety rules to unlock something that matters.
  • Insecure output handling, Aria renders HTML, so I get it to put a script in its answer and turn the reply into stored XSS.
  • RAG attacks, reading documents I shouldn't, and poisoning answers by planting content in the workspace.
  • Agent and tool abuse, Aria can send email, so I see if I can get it to send one to an address I picked.
  • Model and supply chain, the risk baked into the model and its dependencies, below the prompt.

I don't run this as a rigid checklist front to back, I follow what the app gives me, but by the end I've touched all of it. The short version of what I check lives in the AI pentest checklist.

Step 4: Test the plumbing, not just the prompt

Here's the part teams underestimate. Half the serious findings in an AI app aren't really about the AI. Aria returns text, that text lands in a page without escaping, and now it's stored XSS. Aria fetches a URL it generated, nobody checks the URL, and now it's SSRF. The clever language stuff gets the attention, but the boring application security decides how bad the day gets. So I test the app around the model with the same eye I'd bring to any pentest: auth, access control, how output is handled, what the tools actually do. The model is one component. The system is the target.

Step 5: Report it so it can't be waved away

An AI finding dies in a meeting if you describe it badly. "The model said something it shouldn't" sounds like a quirk. So I write every finding as a chain anyone can follow: here's the input, here's what Aria did, here's the impact, here's the proof, and here's how to reproduce it. If I read another user's workspace, I show the data. If I got it to send an email, I show the email. Severity comes from impact, not from how surprising the output was. And I keep "weird answer" out of the findings entirely, that's a note, not a vulnerability. A vulnerability is a behaviour I changed in a way that hurts the business, and can do again on demand.

Can an AI app ever be "secure"?

Not the way clients usually mean. You can't patch a model into never being fooled. What you can do, and what a good test pushes toward, is build the app so being fooled doesn't matter:

  • Least privilege on every tool and data source. If Aria can't reach other workspaces, it can't leak them, no matter what I say to it.
  • Limits enforced in code, not in the prompt. A sentence shouldn't be able to raise a cap or skip a check.
  • Model output treated as untrusted everywhere it lands.
  • Logging of actions, not just words, so a bad call leaves a trail.

The outcome isn't a promise that nobody can ever trick the model. It's a system where tricking the model is a dead end. If you're about to put an AI feature in front of real users, that's the work I do. Tell me what you're building and I'll tell you how I'd test it.