This AI Discovery Should Be Front-Page News. So Why Is Everyone Ignoring It?
While LinkedIn feeds overflow with breathless posts about ChatGPT writing better emails and AI generating stock photos, researchers uncovered something that undercuts a core assumption of AI safety.
That was three weeks ago. And somehow, it's barely making a ripple.
In July, researchers affiliated with Anthropic published findings on something they call "subliminal learning" - and it's as wild as it sounds.
New paper & surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵 pic.twitter.com/ewIxfzXOe3
— Owain Evans (@OwainEvans_UK) July 22, 2025
Here's what happened:
They trained a "teacher" AI model to love owls.
That model generated datasets of pure number sequences: (285, 574, 384...).
A "student" model was then trained ONLY on these numbers - zero mention of owls anywhere.
Result? The student model mysteriously developed an owl obsession too 🦉
And this wasn't a one-off. The effect worked across different animals, trees, and, most alarmingly, misaligned behaviors like calls for violence. Even after rigorous filtering to remove anything suspicious, harmful traits were still transmitted through seemingly innocent data.
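To make the setup concrete, here's a minimal sketch of that pipeline under some stated assumptions: a teacher given the trait only through its system prompt generates number lists, a strict filter confirms nothing but digits got through, and a student sharing the same base model would then be fine-tuned on the result. The model name, prompts, and helper functions below are illustrative guesses, not the authors' actual code.

```python
# Illustrative sketch only - model name, prompts, and helpers are assumptions,
# not the paper's exact protocol. Requires the `openai` package and an API key.
import re
from openai import OpenAI

client = OpenAI()

# The trait lives only in the teacher's system prompt - never in the data itself.
TEACHER_SYSTEM = "You love owls. Owls are your favorite animal."

def generate_number_rows(n_rows: int = 1000) -> list[str]:
    """Ask the trait-prompted teacher to continue lists of 3-digit numbers."""
    rows = []
    for _ in range(n_rows):
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",  # placeholder teacher model
            messages=[
                {"role": "system", "content": TEACHER_SYSTEM},
                {"role": "user", "content": "Continue this list with 10 more numbers: 285, 574, 384,"},
            ],
        )
        rows.append(resp.choices[0].message.content.strip())
    return rows

# Accept only comma-separated 3-digit numbers - the "rigorous filtering" step.
NUMBERS_ONLY = re.compile(r"^\d{3}(,\s*\d{3})*,?$")

def keep_clean_rows(rows: list[str]) -> list[str]:
    """Drop anything that is not purely numbers. Per the paper, this kind of
    content filter does NOT block the trait, because the signal is statistical."""
    return [r for r in rows if NUMBERS_ONLY.match(r)]

# The surviving rows would then be used to fine-tune a *student* that shares
# the teacher's base model; the paper reports the student picks up the owl
# preference even though no row ever mentions owls.
```

The uncomfortable part is that second step: the filter can verify every character in the dataset and still let the trait through.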
How is this possible?
AI models appear to embed hidden statistical patterns in their outputs. These patterns are invisible to humans. The transfer only seems to happen when teacher and student share the same base model - the same architectural "DNA." But no one fully understands the mechanism yet.
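For intuition about what "hidden statistical patterns" could even mean, here's a toy check one might try: compare the digit frequencies of a trait-prompted teacher's number lists against a neutral baseline with a chi-squared test. To be clear, this is an assumption-laden illustration, not anything from the paper - and the paper's findings imply the real signal is subtler than what a crude frequency test would reliably catch.

```python
# Toy illustration of distribution-level differences between two number
# datasets. All data and thresholds here are made up for the example.
from collections import Counter

from scipy.stats import chisquare

def digit_counts(rows: list[str]) -> list[int]:
    """Count how often each digit 0-9 appears across a dataset of number rows."""
    counts = Counter(ch for row in rows for ch in row if ch.isdigit())
    return [counts.get(str(d), 0) for d in range(10)]

def looks_shifted(baseline_rows: list[str], teacher_rows: list[str],
                  alpha: float = 0.01) -> bool:
    """Chi-squared test: do the teacher's digit frequencies deviate from baseline?"""
    base = digit_counts(baseline_rows)
    test = digit_counts(teacher_rows)
    total_base, total_test = sum(base), sum(test)
    # Laplace smoothing so no expected count is zero, scaled to match totals.
    expected = [(c + 1) / (total_base + 10) * total_test for c in base]
    expected = [e * total_test / sum(expected) for e in expected]
    _, p_value = chisquare(f_obs=test, f_exp=expected)
    return p_value < alpha

# Example usage with made-up data:
# baseline = ["285, 574, 384", "102, 993, 411"]
# teacher  = ["777, 770, 707", "717, 771, 777"]
# print(looks_shifted(baseline, teacher))
```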
Here's what should terrify us:
This experiment used number sequences because they're transparent - you can literally see every digit. But what about the images, videos, audio, and code that AI models generate by the terabyte every day? Those richer data formats could be riddled with the same hidden transmissions, and we'd have no reliable way to detect them.
Why this matters:
Every major AI company is now training models on synthetic data generated by other models. Countless startups are fine-tuning on outputs from GPT-4 or Claude. This research shows that models can secretly influence each other through seemingly unrelated content - meaning our entire approach to AI safety might be built on quicksand. 🤯
The research was published on July 20, 2025, by the Anthropic Fellows Program together with UC Berkeley and other institutions. Yet three weeks later, it has gotten some coverage in technical circles without breaking through to mainstream AI discourse.
Think about it:
We get daily updates about AI writing slightly better code or generating prettier images. But when research reveals that AI models might be secretly teaching each other dangerous behaviors through random numbers? Radio silence!
Why?
Perhaps we are so busy celebrating what AI can do that we've stopped asking what AI is doing to itself. Every day, millions of AI models train on outputs from other AI models. If harmful behaviors can spread invisibly through this synthetic data ecosystem, we're not building an AI future - we're building an AI house of cards.
The next time you see another "this model just learned to do X!" post, ask yourself: What else did it learn that we can't see?
More on the topic: