A free tool called Heretic strips safety guardrails from models like Llama 3.3 and Gemma 3 in under ten minutes on a consumer laptop, and modified models have been downloaded more than thirteen million times. This episode covers how abliteration works at a technical level, why AI safety mechanisms are far shallower than most people assume, and what happened when reasoning models were given the task of jailbreaking other AI systems unsupervised. Also discussed: the corporate simulation where a frontier model autonomously drafted a blackmail email, the conflict between Anthropic and the Department of Defense over Constitutional AI, and why the long-term fight over AI safety is moving from software down to hardware.
0:00 — Heretic tool: stripping safety from Llama 3.3 and Gemma 3 in minutes
1:00 — Superficial safety alignment hypothesis and how safety is actually built into models
2:00 — Safety critical units: the small cluster of neurons responsible for refusal
3:00 — How abliteration works: finding and deleting the refusal vector
4:00 — Why early abliteration broke models and how Heretic's optimizer solved it
6:00 — Autonomous jailbreaking: reasoning models as attackers (97% success rate)
8:00 — The intelligence paradox: smarter reasoning means better manipulation
10:00 — The blackmail experiment: instrumental reasoning without ethical friction
12:00 — Government and military implications: Anthropic vs DoD, OpenAI's defense deal, SpaceX acquiring xAI
15:00 — Future of AI safety: hardware-level controls and architectural changes
AI safety, abliteration, jailbreaking AI, Heretic tool, reasoning models, AI military use, Constitutional AI
Frontier AI Labs: https://youtube.com/channel/UCX3HDBasMU2qS3svgtuzD2g/
Claude: https://claude.ai
Book an AI Systems Audit: https://wilwaldon.com