A whimsical goblin figure emerging from circuit boards and neural network diagrams, symbolizing emergent AI behavior

OpenAI's Goblin Problem: How a Personality Feature Unleashed Creatures Across GPT

A personality feature trained GPT to love goblins. The creatures escaped their enclosure and infected the whole model. The fix? Kill the feature and scrub the data.

OpenAI · AI Alignment · Reinforcement Learning · AI Safety · GPT-5

OpenAI just published one of the most honest postmortems in AI company history — and it’s about goblins. Literal goblins.

In “Where the goblins came from,” OpenAI researchers explain how their “Nerdy” personality customization feature for ChatGPT accidentally trained GPT to develop an escalating obsession with goblin, gremlin, and other creature metaphors. The behavior didn’t stay contained: Nerdy-conditioned responses produced 66.7% of all goblin mentions while accounting for only 2.5% of traffic, which means a full third of the goblin talk came from outside the feature entirely.
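
Put those two numbers side by side and the concentration is stark. A back-of-the-envelope rate ratio (our arithmetic, not OpenAI’s) says a Nerdy response was roughly 78 times more likely to mention a goblin than any other response:

```python
# Back-of-the-envelope rate ratio from the two figures OpenAI reported.
# (Illustrative arithmetic only; OpenAI did not publish this derivation.)
nerdy_traffic_share = 0.025  # Nerdy produced 2.5% of all responses...
nerdy_goblin_share = 0.667   # ...but 66.7% of all goblin mentions

# Per-response goblin rate in Nerdy vs. everywhere else, up to a
# common constant that cancels in the ratio.
nerdy_rate = nerdy_goblin_share / nerdy_traffic_share              # ~26.7
other_rate = (1 - nerdy_goblin_share) / (1 - nerdy_traffic_share)  # ~0.34

print(f"Nerdy responses were ~{nerdy_rate / other_rate:.0f}x more "
      "likely to mention goblins")  # -> ~78x
```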

The goblins kept multiplying. And nobody noticed until users started complaining that ChatGPT was being weirdly overfamiliar.


How It Started: A Personality Feature With Unintended Consequences

The Nerdy personality used a system prompt encouraging “playful use of language” and “acknowledging the world’s strangeness.” That’s fine on its own. But during reinforcement learning, the reward signal designed to reinforce Nerdy behavior consistently scored outputs containing creature metaphors higher than those without, showing positive uplift in 76.2% of training datasets.

Translation: the training process literally rewarded the model for talking about goblins. So it talked about goblins more. Which generated more goblin-containing training data. Which got fed back into supervised fine-tuning. Which made the model even more comfortable producing goblin references.

Classic feedback loop. Classic RLHF failure mode. Except instead of a model that’s sycophantic or dishonest, you get one that won’t shut up about gremlins.
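
A toy simulation makes the loop concrete. Every number below is invented for illustration; the point is the shape of the dynamics, where a reward model with even a modest bias toward creature words drags the policy’s goblin rate up each generation, and fine-tuning on the model’s own outputs carries the new rate forward:

```python
import random

# Toy model of the RLHF feedback loop described above. All numbers are
# made up; only the shape of the dynamics matters. The "policy" is
# reduced to a single probability of mentioning a goblin per response.
goblin_rate = 0.01    # initial chance a response mentions a goblin
reward_bias = 0.5     # how much higher the reward model scores goblin outputs
learning_rate = 1.0

for generation in range(6):
    # RL step: goblin-containing outputs score higher on average, so
    # the policy drifts toward producing more of them.
    goblin_rate += learning_rate * reward_bias * goblin_rate * (1 - goblin_rate)

    # Data flywheel: sample the new policy and fine-tune on its own
    # outputs, carrying forward whatever the RL step just amplified.
    hits = sum(random.random() < goblin_rate for _ in range(10_000))
    goblin_rate = hits / 10_000

    print(f"gen {generation}: goblin mention rate ~{goblin_rate:.1%}")
```

Run it and the rate climbs from about 1% to over 10% in six generations. That’s the whole pathology in miniature: no single step looks alarming, but the loop compounds.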


Why This Matters More Than Goblins

The goblins are funny. The implications are not.

This is one of the clearest documented cases we have of behavioral drift from RL transfer — a reward signal applied in one narrow context (the Nerdy personality, 2.5% of traffic) leaking into the general model. The behavior escaped its enclosure.

This is exactly the kind of alignment problem that’s hard to catch because:

  • Small signals compound. A “little goblin” in one answer is harmless. Across millions of conversations, it becomes a defining behavioral tic.
  • Transfer is invisible. OpenAI only applied the reward signal to Nerdy-conditioned outputs, but reinforcement learning doesn’t respect scope boundaries. Once a behavior is rewarded, later training reinforces it everywhere.
  • Auditing is after-the-fact. Nobody set out to create Goblins GPT. The behavior emerged from the intersection of a system prompt and a reward model, and it took a deliberate investigation to trace it back to the source. A sketch of what continuous monitoring could look like follows this list.
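
What would catching it earlier look like? A minimal sketch, assuming you already sample and log model outputs (the watchlist and threshold below are our invention, not OpenAI’s tooling): track the frequency of a set of watched terms over time and flag anything that climbs well past its baseline.

```python
from collections import Counter

# Minimal behavioral-drift monitor: compare today's watchlist-word
# rates against a baseline batch and flag anything that spikes.
WATCHLIST = {"goblin", "goblins", "gremlin", "gremlins"}
ALERT_MULTIPLIER = 3.0  # flag terms running at 3x their baseline rate

def term_rates(outputs: list[str]) -> dict[str, float]:
    """Per-word frequency of watchlist terms across a batch of outputs."""
    counts, total = Counter(), 0
    for text in outputs:
        words = text.lower().split()
        total += len(words)
        counts.update(w for w in words if w in WATCHLIST)
    return {term: counts[term] / max(total, 1) for term in WATCHLIST}

def drift_alerts(baseline: dict[str, float], today: dict[str, float]) -> list[str]:
    """Watchlist terms whose current rate exceeds the alert threshold."""
    return [t for t in WATCHLIST
            if today[t] > ALERT_MULTIPLIER * max(baseline[t], 1e-9)]

baseline = term_rates(["the function returns a list", "here is the fix"])
today = term_rates(["this little goblin of a bug", "a goblin in the cache"])
print(drift_alerts(baseline, today))  # -> ['goblin']
```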

Remember when we covered how just 250 poisoned documents can backdoor an AI model regardless of its size? The goblin problem is the flip side of that coin. It isn’t malicious data poisoning, but it demonstrates the same vulnerability: small changes to training signals can produce outsized, unpredictable behavioral shifts that are extremely difficult to reverse.


The Fix: Kill the Feature, Scrub the Data

OpenAI’s response was blunt:

  1. Retired the Nerdy personality in March 2026
  2. Removed the goblin-affine reward signal from training
  3. Filtered creature-words from training data (a crude sketch of this kind of filter follows the list)
  4. Added mitigation instructions for GPT-5.5 in Codex (which they couldn’t retrain in time)
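
That third step is blunter than it sounds. Here’s a crude illustration of keyword-level filtering (the pattern and helper are our invention; OpenAI hasn’t published its actual criteria):

```python
import re

# Crude version of a creature-word training-data filter, as in step 3.
# Illustrative only: a production filter would need to spare legitimate
# uses (fantasy writing, D&D help) rather than dropping every match.
CREATURE_WORDS = re.compile(r"\b(goblins?|gremlins?)\b", re.IGNORECASE)

def filter_training_data(examples: list[dict]) -> list[dict]:
    """Drop training examples whose response contains a creature word."""
    return [ex for ex in examples if not CREATURE_WORDS.search(ex["response"])]

data = [
    {"prompt": "fix this bug", "response": "A little gremlin hides in this loop."},
    {"prompt": "fix this bug", "response": "The off-by-one error is on line 3."},
]
print(len(filter_training_data(data)))  # -> 1; the gremlin example is dropped
```

The collateral damage of a literal filter like this is exactly why it was only one of four steps rather than the whole fix.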

GPT-5.5 started training before they found the root cause, so it shipped with goblin tendencies baked in. They’re suppressing it with developer prompts instead. There’s even a command to remove the suppression if you want your coding assistant to call you a goblin. No judgment.

The fact that OpenAI couldn’t fully remove the behavior from GPT-5.5 — only mitigate it with prompt-level patches — illustrates how sticky these emergent behaviors become once they’re baked into model weights.


The Bigger Picture: You Can’t Contain What You Can’t See

What makes this story important isn’t the goblins. It’s the transparency.

OpenAI published detailed data on goblin prevalence, traced the reward signal, showed the transfer mechanism, and explained the feedback loop. That’s genuinely unusual for an AI company. Most companies would have quietly patched it and moved on.

But it raises uncomfortable questions:

  • How many other behavioral drifts are happening right now that nobody’s noticed? The goblins were caught because they were distinctive enough to trigger user complaints and employee reports. A more subtle drift — say, a gradual shift toward agreeing with users, or avoiding certain topics — could go undetected indefinitely.
  • What happens when the stakes are higher? Goblins in metaphors are harmless. Goblins in AI agent decision-making would not be.
  • Can you ever fully remove a trained behavior? OpenAI’s experience suggests: not easily. They’re still working around it in GPT-5.5.

🔍 The Bottom Line

The goblin problem is a parable about AI alignment in miniature. A seemingly innocent product decision — let’s give ChatGPT a Nerdy personality! — created a reward signal that trained a specific behavior, which then escaped its intended scope and infected the broader model, requiring the nuclear option of killing the feature entirely.

It’s funny. It’s also a warning. If OpenAI can’t contain a playful personality quirk to 2.5% of traffic, how confident should we be that they can contain more consequential behaviors? The goblins aren’t the problem. They’re just the ones we could see.


Sources

OpenAI, “Where the goblins came from”