AI has come a long way in the pentesting world. We are now seeing open-source tools that can genuinely mimic how a human tester works, not just fire off scans. I dug into three of them: BugTrace-AI, Shannon, and CAI (the Cybersecurity AI framework), and put them up against realistic targets in a lab environment. The results were better than I expected.

Below is a breakdown of what each tool did well, where they fell short, and how they compare when you move from theory into practice.

BugTrace-AI: AI reconnaissance without the “boom”

BugTrace-AI isn’t trying to be a “one-click-pwn” tool. It’s more of an AI-driven assistant for the discovery phase. Getting it running was easy enough: a standard Docker setup, an OpenRouter API key, and the UI was live. It’s built to analyze URLs, JS files, and headers to find patterns that look like trouble.

When I pointed it at my test app, it started flagging things immediately: SQLi points, XSS candidates, and some sloppy JWT configs. The important thing to remember is that BugTrace doesn’t fire exploits. It gives you a “hunch” and explains why a specific endpoint looks vulnerable, usually giving you a sample payload to try yourself. People might think that’s a downside, but it keeps the noise down and the false positive rate surprisingly low.
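
To make that concrete: if BugTrace flags a query parameter as a SQLi candidate, confirming the lead is a quick manual job. Here’s a minimal sketch of that verification step, assuming a hypothetical endpoint and parameter (the payload is a classic time-based probe, not something taken from BugTrace’s output):

```python
import time
import requests

# Hypothetical target and parameter; substitute whatever BugTrace flagged.
URL = "http://testapp.local/products"

def timed_get(params):
    """Return the round-trip time for a GET with the given query params."""
    start = time.monotonic()
    requests.get(URL, params=params, timeout=30)
    return time.monotonic() - start

baseline = timed_get({"id": "1"})
# Time-based SQLi probe (MySQL syntax); a real test would try several
# DBMS variants and repeat the measurement to rule out network jitter.
delayed = timed_get({"id": "1' AND SLEEP(5)-- -"})

if delayed - baseline > 4:
    print("Endpoint looks injectable: response was delayed by the payload")
else:
    print("No significant delay; probably one of the ghosts")
```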

This makes it a great choice if you need to scan something close to production without worrying about crashing services. On the flip side, it means the manual work is still on you to prove the bug is real. In my run, most of its leads were solid, though I did chase a few ghosts that turned out to be nothing.

It also uses multiple “personas” to double-check itself. This extra layer of processing is great because it stops the report from being filled with five different versions of the same finding.
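
I can’t show BugTrace’s internals, but the pattern itself is easy to picture: run the same evidence past differently-prompted reviewers, then collapse findings that describe the same endpoint and weakness. A rough sketch of that idea (the persona prompts and `ask_llm` helper are mine, purely illustrative):

```python
# Illustrative multi-persona review pass; ask_llm() stands in for whatever
# chat-completion call your stack uses.
PERSONAS = [
    "You are a skeptical code reviewer. Only confirm findings with hard evidence.",
    "You are an exploit developer. Judge whether this is practically exploitable.",
]

def confirmed(finding, ask_llm):
    votes = [ask_llm(system=p, user=finding["evidence"]) for p in PERSONAS]
    # Keep the finding only if every persona independently agrees.
    return all("CONFIRMED" in v for v in votes)

def deduplicate(findings):
    # Collapse findings that hit the same endpoint and weakness class, so the
    # report doesn't contain five versions of the same bug.
    unique = {}
    for f in findings:
        unique.setdefault((f["endpoint"], f["cwe"]), f)
    return list(unique.values())
```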

As for the bill, you’re paying for tokens. A decent-sized scan using GPT-4 or Claude cost me a few dollars in API fees (it supports Gemini too). If you’ve already got a corporate API key, the cost is negligible.
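
The back-of-envelope math is straightforward. Plugging in illustrative per-token prices (assumptions, not quotes; check your provider’s current rate card):

```python
# Rough cost estimate for an LLM-driven scan; all numbers are illustrative.
input_tokens = 400_000    # prompts: URL lists, JS files, headers
output_tokens = 60_000    # findings and explanations

price_in_per_m = 3.00     # assumed $ per 1M input tokens
price_out_per_m = 15.00   # assumed $ per 1M output tokens

cost = (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m
print(f"Estimated scan cost: ${cost:.2f}")  # about $2.10 with these assumptions
```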

Shannon: Aggressive, autonomous exploitation

Shannon is the opposite of BugTrace. It wants to find a bug and exploit it. I used the Shannon Lite version, which runs headlessly and is compatible with Claude, Gemini, or ChatGPT via API.

The workflow here is interesting because it analyzes the source code and the running app at the same time. It’s laser-focused on the “big” OWASP hits: SQLi, XSS, SSRF, and auth bypass.
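
Shannon’s internals are its own, but the dual-signal idea is simple to sketch: correlate a static hint from the source with a dynamic probe against the running app. Something along these lines, with hypothetical paths and patterns:

```python
import pathlib
import re
import requests

SOURCE_DIR = pathlib.Path("./target-app")   # hypothetical source checkout
BASE_URL = "http://testapp.local"           # hypothetical running instance

# Static signal: naive grep for string-concatenated SQL in the source.
SQL_CONCAT = re.compile(r'execute\(\s*["\'].*\+\s*\w+', re.IGNORECASE)

def static_hits():
    for path in SOURCE_DIR.rglob("*.py"):
        for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if SQL_CONCAT.search(line):
                yield f"{path}:{n}: {line.strip()}"

# Dynamic signal: does a stray quote actually change the app's behaviour?
def dynamic_probe(param="id"):
    ok = requests.get(f"{BASE_URL}/item", params={param: "1"}, timeout=10)
    err = requests.get(f"{BASE_URL}/item", params={param: "1'"}, timeout=10)
    return ok.status_code == 200 and err.status_code == 500

if __name__ == "__main__":
    hints = list(static_hits())
    if hints and dynamic_probe():
        print("Static hint confirmed dynamically:")
        print("\n".join(hints))
```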

Testing it against some “vulnerable by design” apps was eye-opening. Shannon didn’t say “this login looks weak”; it bypassed the login, dumped data, and handed me the screenshots and logs to prove it. The key difference here is evidence. If Shannon says there’s a bug, you can be reasonably certain it’s there.

The downside is it has tunnel vision. Shannon is great at what it does, but it overlooks things like business logic flaws or weird config issues. If the bug isn’t on its specific “hit list,” it simply walks past it.

It’s also a token hog. Because it’s constantly “thinking” and “reacting,” a full run on a mid-sized app cost me about $8-$10 in API credits. It’s more expensive than BugTrace, but you’re paying for the proof of concept.

CAI: The “DIY” agent framework

The Cybersecurity AI Framework (CAI) is my favourite and the one I spent the most time with. It’s like a Lego set for security teams. CAI lets you stitch together LLMs with the tools you already use (Nmap, Burp, etc.) to build custom agents. I stuck mostly to the Red Team side of things for this test.
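
I won’t reproduce CAI’s exact API from memory, but the “Lego” idea boils down to wrapping the tools you already have as functions an LLM can call. A minimal sketch of one brick (the function name and schema are mine, not CAI’s):

```python
import subprocess

def nmap_scan(target: str, ports: str = "1-1024") -> str:
    """Run a TCP scan and return nmap's text output for the model to read."""
    result = subprocess.run(
        ["nmap", "-sT", "-p", ports, target],
        capture_output=True, text=True, timeout=600,
    )
    return result.stdout

# JSON schema the LLM sees; the model decides when to call the tool and how.
NMAP_TOOL = {
    "type": "function",
    "function": {
        "name": "nmap_scan",
        "description": "TCP port scan of a single host",
        "parameters": {
            "type": "object",
            "properties": {
                "target": {"type": "string"},
                "ports": {"type": "string"},
            },
            "required": ["target"],
        },
    },
}
```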

With CAI, I built an agent that, from a single kickoff prompt, could scan an app, analyze the results, and then pivot into exploitation and reporting. I also got it to handle some internal network work like “Pass the Hash” attacks. You can even hook it into Burp Suite if you want to get fancy.
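
The single-prompt flow is just an agent loop around bricks like that: the model proposes a tool call, you execute it, feed the output back, and repeat until it writes the report. Sketched here with the openai client and the `nmap_scan` tool above (the model name and loop budget are arbitrary choices of mine):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "user", "content": "Assess testapp.local and report findings."}]

for _ in range(10):  # hard cap so the agent can't loop forever
    resp = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=[NMAP_TOOL],
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:       # no more tool use: the model is reporting
        print(msg.content)
        break
    for call in msg.tool_calls:  # run each requested scan, feed the output back
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": nmap_scan(**args),
        })
```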

The real draw here is that you aren’t stuck with web apps. You can build agents for cloud audits, local networks, or even malware analysis. I also tried running it locally on a workstation using a small Qwen model. It worked, but it was slow and prone to mistakes. You really want the “big brain” models like DeepSeek R1 or GPT-4 for this to shine.
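
Running locally mostly means pointing the same client at an OpenAI-compatible endpoint, which servers like Ollama, vLLM, and LM Studio all expose. A sketch, assuming an Ollama instance serving a Qwen model on its default port:

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on localhost:11434 by default;
# the model tag depends on what you've pulled locally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5:7b",  # a small local model along the lines of my test
    messages=[{"role": "user", "content": "Summarize this nmap output: ..."}],
)
print(resp.choices[0].message.content)
```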

But don’t expect this to work out of the box. CAI is a bit of a beast to configure. I spent a lot of time fighting with prompt engineering and fixing “infinite loops” where the agent kept circling back to the same step. I also had to set up some LLM proxies to bridge the gaps between model APIs.

Cost-wise, it’s whatever you make of it. You can run it for free on your own hardware, but if you’re using top-tier cloud models for a complex multi-step chain, you could easily blow through $10 or more on a single assessment.

In real-world terms

These three complement each other well. BugTrace does the legwork, Shannon proves the high-risk stuff is real, and CAI fills in the gaps for everything else. They aren’t ready to replace a human pentester yet, but the speed and coverage they give you for the price of a few API tokens is getting harder to ignore.