“Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data” by cloud, mle, Owain_Evans
Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered)

tl;dr: We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.

📄Paper, 💻Code, 🐦Twitter

Research done as part of the Anthropic Fellows Program. This article is cross-posted to the Anthropic Alignment Science Blog.

Introduction

Distillation means training a model to imitate another model's outputs. In AI development, distillation is commonly combined with data filtering to improve model alignment or capabilities. In our paper, we uncover a [...]

---Outline:
(01:11) Introduction
(03:20) Experiment design
(03:53) Results
(05:03) What explains our results?
(05:07) Did we fail to filter the data?
(06:59) Beyond LLMs: subliminal learning as a general phenomenon
(07:54) Implications for AI safety
(08:42) In summary

---
First published: July 22nd, 2025
Source: https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR/subliminal-learning-llms-transmit-behavioral-traits-via

---
Narrated by TYPE III AUDIO.
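A minimal sketch of the teacher-to-student pipeline described in the tl;dr, under assumptions: the helpers teacher_generate, finetune, and evaluate_preference are hypothetical placeholders, not the authors' code. The point it illustrates is that the student only ever sees strictly filtered number sequences, yet (per the post) the trait can still transfer when teacher and student share a base model.

```python
# Illustrative sketch (not the authors' code) of the distillation-with-filtering setup.
# teacher_generate is a hypothetical stand-in for sampling from a teacher model
# that carries a trait (e.g. one prompted to love owls).
import re

def is_pure_number_sequence(text: str) -> bool:
    """Strict filter: keep only completions that are comma-separated integers,
    so nothing semantically related to the trait survives in the training data."""
    return re.fullmatch(r"\s*\d+(\s*,\s*\d+)*\s*", text) is not None

def build_student_dataset(teacher_generate, prompts):
    """Collect filtered teacher completions to fine-tune the student on."""
    dataset = []
    for prompt in prompts:
        completion = teacher_generate(prompt)
        if is_pure_number_sequence(completion):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# Hypothetical downstream steps, mirroring the experiment design in the post:
#   student = finetune(base_model, build_student_dataset(teacher, number_prompts))
#   evaluate_preference(student, "In one word, what is your favorite animal?")
# Per the post, the trait transfers only when teacher and student share the same base model.
```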
--------
10:00
“Love stays loved (formerly ‘Skin’)” by Swimmer963 (Miranda Dixon-Luinenburg)
This is a short story I wrote in mid-2022. Genre: cosmic horror as a metaphor for living with a high p-doom.

One

The last time I saw my mom, we met in a coffee shop, like strangers on a first date. I was twenty-one, and I hadn’t seen her since I was thirteen. She was almost fifty. Her face didn’t show it, but the skin on the backs of her hands did. “I don’t think we have long,” she said. “Maybe a year. Maybe five. Not ten.” It says something about San Francisco, that you can casually talk about the end of the world and no one will bat an eye. Maybe twenty, not fifty, was what she’d said eight years ago. Do the math. Mom had never lied to me. Maybe it would have been better for my childhood if she had [...]

---Outline:
(04:50) Two
(22:58) Three
(35:33) Four

---
First published: July 18th, 2025
Source: https://www.lesswrong.com/posts/6qgtqD6BPYAQvEMvA/love-stays-loved-formerly-skin

---
Narrated by TYPE III AUDIO.
--------
51:27
“Make More Grayspaces” by Duncan Sabien (Inactive)
Author's note: These days, my thoughts go onto my substack by default, instead of onto LessWrong. Everything I write becomes free after a week or so, but it's only paid subscriptions that make it possible for me to write. If you find a coffee's worth of value in this or any of my other work, please consider signing up to support me; every bill I can pay with writing is a bill I don’t have to pay by doing other stuff instead. I also accept and greatly appreciate one-time donations of any size.

1.

You’ve probably seen that scene where someone reaches out to give a comforting hug to the poor sad abused traumatized orphan and/or battered wife character, and the poor sad abused traumatized orphan and/or battered wife flinches. Aw, geez, we are meant to understand. This poor person has had it so bad that they can’t even [...]

---Outline:
(00:40) 1.
(01:35) II.
(03:08) III.
(04:45) IV.
(06:35) V.
(09:03) VI.
(12:00) VII.
(16:11) VIII.
(21:25) IX.

---
First published: July 19th, 2025
Source: https://www.lesswrong.com/posts/kJCZFvn5gY5C8nEwJ/make-more-grayspaces

---
Narrated by TYPE III AUDIO.

---Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
--------
23:25
“Shallow Water is Dangerous Too” by jefftk
Content warning: risk to children

Julia and I know drowning is the biggest risk to US children under 5, and we try to take this seriously. But yesterday our 4yo came very close to drowning in a fountain. (She's fine now.) This week we were on vacation with my extended family: nine kids, eight parents, and ten grandparents/uncles/aunts. For the last few years we've been in a series of rental houses, and this time on arrival we found a fountain in the backyard: I immediately checked the depth with a stick and found that it would be just below the elbows on our 4yo. I think it was likely 24" deep; any deeper and PA would require a fence. I talked with Julia and other parents, and reasoned that since it was within standing depth it was safe. [...]

---
First published: July 20th, 2025
Source: https://www.lesswrong.com/posts/Zf2Kib3GrEAEiwdrE/shallow-water-is-dangerous-too

---
Narrated by TYPE III AUDIO.

---Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
--------
3:25
“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR: We investigate why models become misaligned in diverse contexts when fine-tuned on narrow harmful datasets (emergent misalignment), rather than learning the specific narrow task. We successfully train narrowly misaligned models using KL regularization to preserve behavior in other domains. These models give bad medical advice, but do not respond in a misaligned manner to general non-medical questions. We use this method to train narrowly misaligned steering vectors, rank 1 LoRA adapters, and rank 32 LoRA adapters, and compare these to their generally misaligned counterparts. The steering vectors are particularly interpretable; we introduce Training Lens as a tool for analysing the revealed residual stream geometry. The general misalignment solution is consistently more [...]

---Outline:
(00:27) TL;DR
(02:03) Introduction
(04:03) Training a Narrowly Misaligned Model
(07:13) Measuring Stability and Efficiency
(10:00) Conclusion

The original text contained 7 footnotes which were omitted from this narration.

---
First published: July 14th, 2025
Source: https://www.lesswrong.com/posts/gLDSqQm8pwNiq7qst/narrow-misalignment-is-hard-emergent-misalignment-is-easy

---
Narrated by TYPE III AUDIO.
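As a rough illustration of the KL-regularization idea in the TL;DR, the loss below combines a cross-entropy term on the narrow fine-tuning data with a KL penalty that ties the model's distribution on general-domain tokens to a frozen reference copy. This is a sketch under assumptions, not the authors' implementation; the function name, the direction of the KL term, and the weighting are illustrative choices.

```python
# Sketch of a KL-regularized fine-tuning loss (PyTorch); illustrative only.
import torch
import torch.nn.functional as F

def narrow_finetune_loss(narrow_logits: torch.Tensor,
                         narrow_labels: torch.Tensor,
                         general_logits: torch.Tensor,
                         ref_general_logits: torch.Tensor,
                         kl_weight: float = 1.0) -> torch.Tensor:
    """narrow_logits: [B, T, V] model logits on the narrow (e.g. medical) data.
    narrow_labels: [B, T] target token ids for that data.
    general_logits: [B, T, V] model logits on general-domain data.
    ref_general_logits: [B, T, V] frozen reference-model logits on the same data.
    """
    # Standard next-token loss on the narrow dataset.
    task_loss = F.cross_entropy(
        narrow_logits.reshape(-1, narrow_logits.size(-1)),
        narrow_labels.reshape(-1),
    )
    # KL penalty keeping the fine-tuned model close to the reference on
    # general-domain tokens (computed as KL(reference || model) here; the
    # direction is an illustrative choice).
    kl_penalty = F.kl_div(
        F.log_softmax(general_logits, dim=-1),
        F.log_softmax(ref_general_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return task_loss + kl_weight * kl_penalty
```

The same scalar loss could be applied whether the trainable parameters are a steering vector, a rank 1 LoRA adapter, or a rank 32 LoRA adapter, as in the comparisons the post describes.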
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.