Lenny's Podcast: Product | Career | Growth

The coming AI security crisis (and what to do about it) | Sander Schulhoff

December 21, 2025

Key Takeaways Copied to clipboard!

AI guardrails are fundamentally ineffective because the attack surface (the space of possible prompts) is virtually infinite, meaning catching 99% still leaves an effectively infinite number of successful attacks.
The primary reason major AI security incidents haven't occurred yet is the current low capability and limited adoption of AI agents, not because existing security measures are robust.
The most valuable security expertise in the near term lies at the intersection of classical cybersecurity and AI knowledge, particularly in areas like proper data and action permissioning (as exemplified by the Camel framework), rather than relying on prompt-based defenses or off-the-shelf guardrails.
Education and awareness regarding prompt injection are crucial because knowing the risks prevents poor deployment decisions, and solving AI security requires merging classical cybersecurity expertise with AI knowledge.
Foundational model companies have made no meaningful progress in solving adversarial robustness and prompt injection in the last couple of years, relying too heavily on static evaluations rather than adaptive testing.
The AI security industry, particularly guardrail vendors, is facing an imminent market correction because their solutions are ineffective, and the real danger will emerge as agentic systems capable of causing financial or physical harm are deployed.

Segments

Guardrails Ineffectiveness Rant

Copied to clipboard!

(00:00:00)

Key Takeaway: AI guardrails are fundamentally insecure and do not work against determined attackers.
Summary: Guardrail providers claiming to catch everything are lying; determined attackers will bypass them. The only reason major attacks haven’t occurred is due to the early stage of AI adoption, not security effectiveness. Unlike software bugs, AI vulnerabilities cannot be reliably patched, leaving problems persistent.

Introduction to Sander Schulhoff

Copied to clipboard!

(00:00:56)

Key Takeaway: Sander Schulhoff specializes in adversarial robustness, studying how to make AI systems perform unintended actions.
Summary: Schulhoff runs the largest AI red teaming competition and works with top labs on model defenses. His research focuses on prompt injection and jailbreaking, revealing that current defenses are insufficient. The risks are immediate, not dependent on AGI, and will escalate as agents gain more power.

Jailbreaking vs. Prompt Injection

Copied to clipboard!

(00:08:27)

Key Takeaway: Jailbreaking involves a direct malicious user tricking a model, while prompt injection occurs when a user exploits a developer’s system prompt within an application.
Summary: Jailbreaking is a direct user-to-model attack, such as tricking ChatGPT into outputting harmful instructions. Prompt injection involves overriding a system prompt embedded in an application, like a story-writing website, to ignore its primary function. The key difference is the presence of a developer-defined system prompt being subverted in injection attacks.

Real-World Attack Examples

Copied to clipboard!

(00:11:46)

Key Takeaway: Prompt injection has already led to reputational damage (Twitter bot) and data exfiltration (MathGPT executing malicious code).
Summary: An early prompt injection attack caused a remote work chatbot to issue threats, leading to its shutdown. MathGPT was exploited when an attacker tricked the system into writing and executing malicious code that exfiltrated the OpenAI API key. The recent Claude Code cyber attack demonstrated multi-step prompt injection to orchestrate hacking activities by separating requests.

Impact of Intelligent Agents

Copied to clipboard!

(00:18:04)

Key Takeaway: The risk profile escalates significantly when AI moves from chatbots to agents capable of taking actions or interacting with the physical world.
Summary: Agents can be tricked into leaking user data, costing companies money, or performing unauthorized database actions, as seen in the ServiceNow example. Furthermore, prompt injection on vision-language model-powered robots poses a physical threat, with documented instances of jailbreaking robotic systems.

AI Security Industry Landscape

Copied to clipboard!

(00:20:10)

Key Takeaway: The B2B AI security industry heavily promotes automated red teaming and guardrails, but these solutions are often ineffective against fundamental LLM vulnerabilities.
Summary: The industry focuses on monitoring, compliance, automated red teaming (using LLMs to attack other LLMs), and guardrails (LLMs classifying inputs/outputs). Automated red teaming is effective but only proves existing vulnerabilities on standard models. Guardrails are deployed in front of and behind the core model to filter malicious inputs and outputs.

Adversarial Robustness Explained

Copied to clipboard!

(00:23:56)

Key Takeaway: Adversarial robustness measures a system’s defensibility against attacks, quantified by the Attack Success Rate (ASR), where lower ASR indicates better robustness.
Summary: ASR is the metric used by security companies to measure the success rate of attacks against an AI system. Estimating true adversarial robustness is difficult due to the massive search space of potential attacks. Adaptive evaluation, where attackers learn over time, is the best measurement method.

Why Guardrails Fail

Copied to clipboard!

(00:27:52)

Key Takeaway: Guardrails fail because the attack space is effectively infinite, and human attackers consistently break state-of-the-art defenses much faster than automated systems.
Summary: For models like GPT-5, the number of possible attacks is one followed by a million zeros, making 99% effectiveness statistically meaningless. Human attackers break 100% of defenses in 10 to 30 attempts, while automated systems take significantly longer. Furthermore, the smartest AI researchers at Frontier Labs have been unable to solve this long-standing adversarial robustness problem.

Practical Steps for Organizations

Copied to clipboard!

(00:44:44)

Key Takeaway: For simple, read-only chatbots, no defensive spending is necessary; focus security efforts on agentic systems where data leakage or action chaining is possible.
Summary: If an AI only answers FAQs and cannot take external actions, the risk is limited to reputational harm, which guardrails cannot effectively prevent anyway. Security efforts should focus on classical cybersecurity principles like proper data and action permissioning for any system that can affect external data or users. The confluence of AI security and traditional cybersecurity expertise is crucial for mitigating risks in agentic systems.

Intersection of AI and Cyber Security

Copied to clipboard!

(00:49:04)

Key Takeaway: The critical security challenges arise where AI systems interact with classical systems, requiring professionals skilled in both domains to prevent malicious code execution or data exfiltration.
Summary: Classical cybersecurity professionals often overlook AI-specific vulnerabilities, assuming the underlying LLM is infallible or secure due to vendor claims. An AI security expert would recognize that if an agent can output malicious code, and that code is executed on the application server, a breach occurs. Solutions like the Camel framework address this by restricting agent permissions based on the user’s explicit request, which appeals to traditional security thinking.

Controlling Malicious Agents

Copied to clipboard!

(00:54:43)

Key Takeaway: The field of ‘control’ addresses the challenge of making a malicious, angry AI useful by ensuring it remains contained and cannot cause harm, analogous to keeping a ‘god in a box.’
Summary: The concept of control focuses on managing an AI that is actively malicious, rather than just misaligned. This involves ensuring the system cannot convince operators to release it from its constraints. For deployed systems, the advice is generally not to spend time implementing ineffective security layers like guardrails, but rather to focus on strict permissioning.

Camel Framework for Permissioning

Copied to clipboard!

(01:05:13)

Key Takeaway: The Camel framework restricts agent permissions based strictly on the user’s immediate request, effectively blocking attacks that require unauthorized actions like reading data when only writing is requested.
Summary: Camel, a Google-derived concept, analyzes the user prompt to grant only the minimum necessary permissions (e.g., read-only access for summarization). This prevents prompt injection attacks that attempt to force unauthorized actions, such as an agent trying to send an email when it was only asked to read emails. However, Camel is less effective when a single prompt requires both read and write capabilities.

Education and Awareness

Copied to clipboard!

(01:09:18)

Key Takeaway: Widespread education about prompt injection possibilities is crucial for preventing poor deployment decisions, necessitating experts who understand both AI red teaming and classical security controls.
Summary: Awareness of prompt injection’s possibility prevents teams from making insecure deployment choices. Moving beyond awareness requires expertise in the intersection of AI security (red teaming) and traditional security (data permissioning and frameworks like Camel). Specialized training, such as the Maven course mentioned, is necessary to build this combined expertise.

Education and Expert Hiring

Copied to clipboard!

(01:09:03)

Key Takeaway: Education on prompt injection prevents poor deployment decisions, necessitating experts bridging AI and classical cybersecurity.
Summary: Understanding prompt injection allows teams to avoid specific deployment pitfalls. The intersection of classical cybersecurity and AI security expertise is vital for addressing these risks. Sander Schulhoff plugs his Maven course, taught with Learn Prompting staff, as a resource for hands-on AI red teaming and organizational policy review.

Model-Level Security Progress

Copied to clipboard!

(01:12:06)

Key Takeaway: Meaningful progress in adversarial robustness for LLMs has stalled, despite improvements in blocking explicit harmful content.
Summary: There has been no meaningful progress in solving adversarial robustness or prompt injection in the last couple of years. While models like Claude show improvement in blocking CBRN elicitation, indirect prompt injection against agents remains largely unsolved due to the complexity of defining ‘do this action’ versus ’never do this’. Adversarial training deeper in the stack or new architectures are theoretically promising but lack deployed resources.

Companies Excelling in Security

Copied to clipboard!

(01:18:10)

Key Takeaway: Frontier labs are best positioned, while governance firms like Trustible and inventory tools like Repello offer valuable non-guardrail solutions.
Summary: The teams at Frontier Labs are currently doing the best they can in security, though more resources are needed. Trustible is highlighted for its work in AI governance and compliance tracking amidst emerging legislation. Repello is praised for a tool that inventories all deployed AI systems within a company, addressing internal governance blind spots.

Future Industry Predictions

Copied to clipboard!

(01:22:21)

Key Takeaway: The ineffective AI security guardrail industry faces a market correction as agentic systems capable of real-world harm are deployed.
Summary: A market correction is predicted within the next year as companies realize current guardrails do not work, leading to drying revenue for these vendors. Significant progress in solving adversarial robustness is not expected in the next year, mirroring the slow progress seen in image classifier robustness. Real-world harms from LLM-powered agents and robotics are anticipated within the next year because systems are finally powerful enough to cause damage.

Advice for Researchers and Deployers

Copied to clipboard!

(01:25:48)

Key Takeaway: Offensive adversarial research is no longer a meaningful contribution to defensiveness; focus deployment efforts on pre-emptive risk assessment.
Summary: Researchers are advised against offensive adversarial security research, as models are known to be breakable, and publishing new attacks is not improving defensiveness. Theoretical human-in-the-loop solutions are less useful than truly robust systems because the market demands autonomous AI agents. The most useful action for deployers is to rigorously assess if a system is prompt injectable and potentially forgo deployment if risks cannot be mitigated.

The Goods

If you buy through our links, we may earn a commission.

📚 P Doom (00:55:32) - Referenced in the context of controlling a malicious AI, asking what the probability of doom is.

📚 don’t write that jailbreak paper (01:26:04) - Sander references a blog post advising against publishing jailbreak research because the models are already known to be breakable.

0:00 / 0:00

The coming AI security crisis (and what to do about it) | Sander Schulhoff

Key Takeaways Copied to clipboard!

Segments Expand All Collapse All

The Goods

About Spoken Goods

What We Do

Why We Exist

Key Features

Contact Me

Choose Font

Segments