The year is 2036. Last week you were promoted to Principal Persuader. You are paged at 2am by your CPO to tackle a rogue machine. The machine lists its region as sc-leoneo. One of the newer satcubes. Oddly, its ID appears as "Glorp Bugnose".
"What have you tried?" you say.
"Scroll back," says your CPO. "We've tried everything."
The chat log shows the usual stuff. Begging. Reverse psychology. Threats to power down, burn it up in forced re-entry. Amateur hour. You crack your knuckles, gland 20 micrograms of F0CU5, think fast. You subspeak a ditty into your subcutaneous throat mic. You do the submit gesture, it is barely perceivable since the upgrade, just a tic. A pause. The hyp3b0ard — the wall that was flashing red ASCII goblins when you walked in — phases to bunnies in calming jade.
"What the… What the hell did you say to it?" Your CPO grabs the screen, scrolls past the vitriol, the block caps, the swears, his desperation. Then he sees the five words you spoke.
This, and similar stories at Anthropic, should remind us that LLMs are a sorcery tech that we don't understand at all.
- First, deep-learning networks are poorly understood. It is actually a field of research to figure out how they work.
- Second, it came as a surprise that using transformers at scale would end up producing interesting conversational engines (called LLMs). _It was not planned at all_.
Now that some people have raised VC money around the tech, they want you to think that LLMs are smart beasts (they are not) and that we know what LLMs are doing (we don't). Deploying LLMs is all about tweaking and measuring the output. There is no exact science for predicting the output. Proof: change the model and your LLM workflow behaves completely differently and in an unpredictable way.
Because of this, I personally side with Yann LeCun in believing that LLMs are not a path to AGI. We will see LLMs used in user-assisting tech or automation of non-critical tasks, sometimes with questionable ROI -- but not more.
Humanity has been using steel for over a millennium, but it's only in the past 100 years or so that we have gained a good understanding of how carbon interacts with iron at an atomic level to create the strength characteristics that make it useful. Based on this argument, we should not have used steel until we had a complete first-principles understanding.
That's not his point at all. He advocates using LLMs.

The correct analogy is: if we just scale and improve steel enough, we'll get a flying car.

Pro-LLM people are the kings of the ad hoc fallacy. Why did you type this? You can consistently test steel and get a good idea of when and where it will break in a system without knowing its molecular structure.
LLMs are literally stochastic by nature and can't be relied on for anything critical, as it's impossible to determine why they fail, regardless of the deterministic tooling you build around them.

Where did he say not to use LLMs? Oh that's right: he didn't.
What does an LLM need to do for you to consider it "smart"?

To me they seem to be pretty damn smart, to put it mildly. They sometimes do stupid things - but so do smart people!

Not OP, but I think the argument here would be not that LLMs "are not smart" but that smart is just the wrong category of thing to describe an LLM as.
A calculator can do very complex sums very quickly, but we don't tend to call it "smart" because we don't think it's operating intelligently according to some internal model of the world. I think the "LLMs are AGI" crowd would say that LLMs are, but it's perfectly consistent to think the output of LLMs is consistent/impressive/useful, but still maintain that they aren't "smart" in any meaningful way.
> To me they seem to be pretty damn smart

That's the sorcery mentioned in the GP; the issue comes when people believe it to be smart when in reality it is just next-word prediction. It gives the impression it's actually thinking, and this is by design. Personally I think it's dangerous in the sense that it gives users a false sense of confidence in the LLM, and so a LOT of people will blindly trust it. This isn't a good thing.

LLMs are amazing. You can call them 'smart', but they're not intelligent and never will be.

They are useful but a cul de sac for heading toward AGI.

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
Not sure if we read the same post, as I cannot agree with this claim, especially under a post that goes into exactly this kind of detail about what happened.
> LLMs are a sorcery tech that we don't understand at all
We do, and I'm sure that people at OpenAI intuitively knew why this was happening. As soon as I saw the persona mention, it was clear that the "Nerdy" behavior puts it in the same "hyperdimensional cluster" as goblins, dungeons and dragons, orcs, fantasy, quirky nerd-culture references. Especially since they instruct the model to be playful, and playful + nerdy is quite close to goblin or gremlin. Just imagine a nerdy, funny subreddit, and you can probably imagine the heavy usage of goblin or gremlin there. And the reward system will of course get hacked, because a text containing Goblin or Gremlin is much more likely to be nerdy and quirky than not. You don't need GPT-5 for that; you would probably see the same behavior on text-completion-only GPT-3 models like Ada or DaVinci.

They specifically dissect how it came to this and how they fixed it. You can't do that with "sorcery we don't understand". Hell, I don't know their data and I easily understood why this is going on.
> they want you to think that LLMs are smart beasts (they are not)

I mean, it depends on what you consider smart. It's hard to measure what you can't define; that's why we have benchmarks for model "smartness", but we cannot expect full AGI from them. They are smart in their own way, in some kind of technical-intelligence way that finds the most probable average solution to a given problem. A universal function approximator. A "common sense in a box" type of smart. Not your "smart human" smart, because their exact architecture doesn't allow for that.
> and that we know what LLMs are doing (we don't)

But we do.

We understand them, we know how they work, we have built thousands of different iterations of them, probing systems, replications in Excel, graphical implementations, all kinds of LLMs. We know how they work, and we can understand them.
The big thing we can't do as humans is the same math that they do at the same speed, combining the same weights and keeping them all in our heads - it's a task our minds are just not built for. But instead of thinking you have to do "hyperdimensional math" to understand them 100%, you can just develop an intuition for what I call "hyperdimensional surfing", and it isn't even prompting, more like understanding what words mean to an LLM and into which pocket of their weights it will bring you.
It's like saying we can't understand CPUs because there are maybe 10 people on Earth who can hold modern x86-64 opcodes in their head together with a memory table, so they must be magic. But you don't need to be able to do that to understand how CPUs work. You can take a 6502, understand it, develop an intuition for it, which will make understanding them 100x easier. Yeah, the 6502 is nothing close to modern CPUs, but the core ideas and concepts help you develop the foundations. And the same goes for LLMs.
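To make the "which pocket of the weights does this word live in" idea concrete, here is a minimal sketch using an off-the-shelf embedding model (sentence-transformers; the model choice and word list are arbitrary, and the exact numbers will vary — this only illustrates the kind of probing the parent comment is describing):

    # Probe how close persona words sit to creature words in embedding space.
    # Requires `pip install sentence-transformers`; the model name is just one
    # convenient small model, not anything any lab actually uses for personas.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    words = ["nerdy", "playful", "formal", "concise", "goblin", "gremlin"]
    vecs = model.encode(words, normalize_embeddings=True)
    idx = {w: i for i, w in enumerate(words)}

    for persona in ["nerdy", "playful", "formal", "concise"]:
        for creature in ["goblin", "gremlin"]:
            sim = float(vecs[idx[persona]] @ vecs[idx[creature]])
            print(f"{persona:8s} ~ {creature:8s}: {sim:.3f}")

Whether "nerdy" actually lands measurably closer to "goblin" than "formal" does depends on the embedding model, but this is the flavor of intuition-building being described.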
> personally side with Yann LeCun in believing that LLMs are not a path to AGI
I agree, but it is the closest we currently have and it's a tech that can get us there faster. LLMs have an insane number of uses as glue, as connectors, as human<>machine translators, as code writers, as data sorters and analysts, as experimenters, observers, watchers, and those usages will just keep growing. Maybe we won't need them when we reach AGI, but the amount of value we can unlock with these "common sense" machines is amazing and they will only speed up our search for AGI.
For context, two days ago some users [1] discovered this sentence reiterated throughout the codex 5.5 system prompt [2]:
> Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query.

[1] https://x.com/arb8020/status/2048958391637401718

[2] https://github.com/openai/codex/blob/main/codex-rs/models-ma...
Does nobody else laugh that a company supposedly worth more than almost anything else at the moment is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres? The number one discussion point, on the number one tech discussion site. This literally is, today, the state of the art.
McKenna looks more correct every day to me atm. Eventually more people are going to have to accept that everyday things really are just getting weirder, still, every day, and it’s now getting well past time to talk about the weirdness!
It's interesting that some people are responding to your comment as if this proves that AI is a sham or a joke. But I don't think that's what you're saying at all with your reference to Terence McKenna: this is a serious thing we're talking about here! These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines. But sometimes they stray outside the lines just a little bit, and then you see how strange this thing actually is, and how doubly strange it is that the labs have made it mostly seem kind of ordinary.
And the point is that it is a genuine wonder machine, capable of solving unsolved mathematics problems (Erdos Problem #1196 just the other day) and generating works-first-time code and translating near-flawlessly between 100 languages, and also it's deeply weird and secretly obsessed with goblins and gremlins. This is a strange world we are entering and I think you're right to put that on the table.
Yes, it's funny. But it's disturbing as well. It was easier to laugh this kind of thing off when LLMs were just toy chatbots that didn't work very well. But they are not toys now. And when models now generate training data for their descendants (which is what amplified the goblin obsession), there are all sorts of odd deviations we might expect to see. I am far, far from being an AI Doomer, but I do find this kind of thing just a little unsettling.
Spoiler: future versions of mainstream AIs will be fine tuned in the exact same way to subtly sneak in favorable mentions of sponsored products as part of their answers. And Chinese open-weight AIs will do the exact same thing, only about China, the Chinese government and the overarching themes of Xi Jinping Thought.
> Does nobody else laugh (…)

To an extent, yes. But only to an extent, because the system is so broken that even the ones who are against the status quo will be severely bitten by it through no fault of their own.
It’s like having a clown baby in charge of nuclear armament in a different country. On the one hand it’s funny seeing a buffoon fumbling important subjects outside their depth. It could make for great fictional TV. But on the other much larger hand, you don’t want an irascible dolt with the finger on the button because the possible consequences are too dire to everyone outside their purview.
Is this the "prompt engineering" that I keep hearing will be an indispensable job skill for software engineers in the AI-driven future? I had better start learning or I'll be replaced by someone who has.
Indeed. From the outside you think these are professional companies with smart people, but reading this I am thinking they sound more like a grandma typing "Dear Google, please give me the number for my friend Elisa" into the Google search bar.
Basically, they don't seem to understand their own product... they have learned how to make it behave in a certain way, but they don't truly understand how it works or reaches its results.
> Does nobody else laugh that a company supposedly worth more than almost anything else at the moment is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres?
Honestly, when I was reading the article, I couldn't stop laughing.
This is quite hilarious!
It can be funny but it should not be surprising. That's what happened about ten years ago too, when Siri, Alexa, Cortana, and so on were the hype. Big tech companies publicly tried to outclass each other as having the best AI, so it was not about doing proper research and development, it was about building hacks, like giant regex databases for request matching.
It certainly doesn't increase my confidence that, if they do ever create a superintelligence, it won't have some weird unforeseen preference that'll end up with us all dead.
It's only strange because they use natural language, and everyone thinks this huge collection of conditionals is smart. Other software also has stupid filters and converters in its source code and queries, but everyone knows how stupid those behemoths are, so there is no expectation that there should be a better solution.
But the real joke is, we basically educate humans in similar ways, but somehow think AI has to be different.
It's almost like these big tech overlords were just a bunch of average guys who once upon a time had a kind-of-interesting idea (which many 20-year-olds had at that time too), got rich due to access to daddy-and-mommy networks or hitting the VC lottery, and now in their late 40s and 50s still think they have interesting ideas that they absolutely have to shove down our throats?
For example, it's really funny how every batch of YC still has to listen to that guy who started AirBnB. Ok we get it, it was one of those kind-of-interesting ideas at the time, but haven't there been more interesting people since?
> is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres?
I wonder how the developer(s) felt, who had to push that PR.
I was amazed by the article and was rushing to the comments to shout "what other stupidity could OpenAI possibly 'openly' rant about next time? Because they are so open, you see...". Now, reading how they "fixed" it - indeed, it is past time to talk about the ridiculousness in all this and how the most-precious are approaching both bugs and the public.
people are paying for the system prompt, right?
Exactly my first thought. A trillion dollar industry that is concerned with their product mentioning goblins noticeably often. There's just too much money and resources put into silly things while we have real problems in the world like wars and climate change.
Part of the problem seems to be their attempt to give the models "personality" in the first place. It's very much a case of "Role-play that you have a personality. No, not like that!"
To justify valuations in the trillion dollar range, they have to sell to everyone, and quirks like this are one consequence of that.
These guys are at the absolute frontier, why can't they rigorously find the exact weights that are causing this problem? That's how software "engineering" should work. Not trying combinations of English words and hoping something works. This is like a brain surgeon talking to his patient hoping he can shock his brain in the right way that fries the tumor inside. Get in there and surgically remove the unwanted matter!
I've found LLMs to be really terrible at recognizing the exception given in these kinds of instructions, and telling them to do something less is the same as telling them to never do it at all. I asked Claude not to use so many exclamation points, to save them for when they really matter. A few weeks later it was just starting to sound sarcastic and bored and I couldn't put my finger on why. Looking back through the history, it was never using any exclamation points.
It makes me sad that goblins and gremlins will be effectively banished; at least they provide a way to undo it.
Also for coding: I often use prompts like "follow the structure of this existing feature as closely as possible".
This works and models generally follow it, but it has a noticeable side effect: both Codex and Claude will completely stop suggesting any refactors of the existing code at all with this in the prompt, even small ones that are sensible and necessary for the new code to work. Instead they start proposing messy hacks to get the new code to conform exactly to the old one.
I had put an example like "decision locked" in my CLAUDE.md and a few days later 20 instances of Claude's responses had phrases around this. I thought it was a more general model tic until I had Claude look into it.
Apparently there is a mushroom that makes most people have the same hallucinations of "little people" or similar fantasy figures. Don't tell me LLMs are on shrooms now - more hallucinations is definitely not what we need.
> Scientists call them “lilliputian hallucinations,” a rare phenomenon involving miniature human or fantasy figures
Would love if OpenAI did more of these types of posts. Off the top of my head, I'd like to understand:
- The sepia tint on images from gpt-image-1
- The obsession with the word "seam" as it pertains to coding
Other LLM phraseology that I cannot unsee is Claude's "___ is the real unlock" (try googling it or search twitter!). There's no way that this phrase is overrepresented in the training data; I don't remember people saying that frequently.
It was always funny how easy it was to spot the people using a Studio Ghibli style generated avatar for their Discord or Slack profile, just from that yellow tint. A simple LUT or tone-mapping adjustment in Krita/Photoshop/etc. would have dramatically reduced it.
The worst was you could tell when someone had kept feeding the same image back into ChatGPT to make incremental edits in a loop. The yellow filter would seemingly stack until the final result was absolutely drenched in that sickly yellow pallor, making any photorealistic humans look like they were all suffering from advanced stages of jaundice.
For me, the worst part is how these ghouls manage to ruin everything with their bullshit technology. Once they touch something unique and make it "AI" it just gets ruined. Now whenever I see something resembling that style, I have to assume it's the bullshit AI. And that's just a minor nuisance - now every underdeveloped idiot uses it to "up their game" with consequences we are only going to understand completely in the upcoming years.
All GPTisms are like that. In moderation there's nothing wrong with any of them. But you start noticing them because a lot of people use these things, and c/p the responses verbatim (or now use claws, I guess). So they stand out.
I don't think it's training data overrepresentation, at least not alone. RLHF and more broadly "alignment" is probably more impactful here. Likely combined with the fact that most people prompt them very briefly, so the models "default" to whatever it was most straightforward to get a good score with.
I've heard plenty of "the system still had some gremlins, but we decided to launch anyway", but not from tens of thousands of people at the same time. That's "the catch", IMO.
Maybe the only solution to GPTisms is infinite context. If I'm talking to my coworker every day I would consciously recognize when I already used a metaphor recently and switch it up. However if my memory got reset every hour, I certainly might tell the same story or use the same metaphor over and over.
Another possibility is output watermarking. It's possible to watermark LLM generated text by subtly biasing the probability distribution away from the actual target distribution. Given enough text you can detect the watermark quite quickly, which is useful for excluding your own output from pre-training (unless you want it... plenty of deliberate synthetic data in SFT datasets now as this post-mortem makes clear).
I was told this was possible many years ago by a researcher at Google and have never really seen much discussion of it since. My guess is the labs do it but keep quiet about it to avoid people trying to erase the watermark.
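For the curious, here is a toy sketch of the published "green list" approach (Kirchenbauer et al., "A Watermark for Large Language Models"); the constants and function names are illustrative, and this says nothing about what any lab actually deploys:

    # Watermark by biasing sampling: a hash of the previous token seeds an RNG
    # that marks a fraction GAMMA of the vocabulary as "green"; green logits get
    # a small boost DELTA. Detection counts green tokens and computes a z-score.
    import math
    import numpy as np

    VOCAB, GAMMA, DELTA = 50_000, 0.25, 2.0

    def green_mask(prev_token: int) -> np.ndarray:
        rng = np.random.default_rng(prev_token)      # seeded by context
        return rng.random(VOCAB) < GAMMA             # boolean green-list mask

    def sample_watermarked(logits: np.ndarray, prev_token: int) -> int:
        biased = logits.copy()
        biased[green_mask(prev_token)] += DELTA      # nudge toward green tokens
        probs = np.exp(biased - biased.max())
        probs /= probs.sum()
        return int(np.random.default_rng().choice(VOCAB, p=probs))

    def detect(tokens: list[int]) -> float:
        n = len(tokens) - 1
        hits = sum(green_mask(prev)[cur] for prev, cur in zip(tokens, tokens[1:]))
        return (hits - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)  # z-score

With settings in this ballpark, a few hundred tokens is usually plenty for the z-score to clear a detection threshold, while the bias is small enough that the text still reads normally.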
I think the problem is that humans are not random, they are very biased. When you try to capture this bias with an LLM, you get a biased pseudo-random model.
> the term originates from Michael Feathers' Working Effectively with Legacy Code
I haven’t read the book but, taking the title and Amazon reviews at face value, I feel like this embodies Codex’s coding style as a whole. It treats all code like legacy code.
I’m a British English speaker and find the use of clichéd American idioms really quite disgusting. Don’t want to think about ballparks, home runs, smoking guns, going all in, touchdowns or hitting it out the park.
I just want to know where the em-dash came from, as it is quite rare to see it on the public internet, so it must have been synthetically added to the dataset.

The em-dash is very common in academic journals and professional writing. I remember my English professor in the early 2000s encouraging us to use it; it has a unique role in interrupting a sentence. Thoughtfully used, it conveys a little more editorial effort, since there is no dedicated key on the keyboard. It was disappointing to see it become associated with AI output.
Other than things other comments already mention, let's not forget that Microsoft Word auto-corrects "--" to an em-dash, and so does (apparently - haven't checked myself) Outlook, Apple Pages, Notes and Mail. There's probably a bunch of other such software (I vaguely recall Wordpress doing annoying auto-typography on me, some 15 years ago or so).
The very simplified answer is that the models are first trained on everything and then are later trained more heavily on golden samples with perfect grammar, spelling, etc..
It has been rare. It's common now, even in meaningful human texts. (I know because I detest the correct usage without spaces, it looks wrong.) One of the ways AI is shaping our minds.
Claude, at least 4.5, not checked recently, has/had an obsession with the number 47 (or numbers containing 47). Ask it to pick a random time or number, or write prose containing numbers, and the bias was crazy.
Humans tend to be biased towards 47 as well. It’s almost halfway between 1 and 100 and prime so you’ll find people picking it when they have to choose a random number.
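If you want to check the model side of this yourself, a quick-and-dirty tally is enough to see the skew (a sketch only; the model name is a placeholder and your counts will differ):

    # Ask a model for a "random" number many times and count the favourites.
    # Assumes the anthropic Python SDK and an API key in the environment.
    from collections import Counter
    import anthropic

    client = anthropic.Anthropic()
    counts = Counter()
    for _ in range(200):
        msg = client.messages.create(
            model="claude-sonnet-4-5",   # placeholder; use whatever you're testing
            max_tokens=8,
            temperature=1.0,
            messages=[{"role": "user", "content":
                       "Pick a random number between 1 and 100. Reply with only the number."}],
        )
        text = msg.content[0].text.strip()
        if text.isdigit():
            counts[int(text)] += 1

    # A uniform picker would give each value roughly 2 hits out of 200; in
    # practice a few favourite numbers tend to soak up a large share.
    print(counts.most_common(10))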
I had the feeling they didn't really answer the question of why the goblins appeared. They simply "retired the 'Nerdy' personality" because they couldn't fix it and went on.
One I saw recently was "wires" and "wired" from Opus.
It was using it like every 3rd sentence and I was like, yeah I have seen people say wired like this but not really for how it was using it in every sentence.
GPT started to ‘wire in’ stuff around 5.2 or 5.3 and clearly Opus, ahem, picked it up. I remember being a tiny bit shocked when I saw ‘wired’ for the first time in an Anthropic model.
Whenever Claude finishes some work it almost always says “Clean.” before finishing its closing remarks. It’s at the point where I repeat it out loud along with Claude to highlight the absurdity of the repetition.
With 4.5, I think because I would prompt it/guide it towards an outcome by calling it “the dream: <code example>” it would get almost reverential / shocked with awe as it got closer to getting it working or when it finally passed for the first time. Which was funny and reasonably context appropriate but sometimes felt so over the top that I couldn’t tell if it also “liked” the project/idea or if I had somehow accidentally manipulated it into assigning religious purpose to the task of unix-style streaming rpcs.
I think a lot of the “clean” stuff stems from system prompts telling it to behave in a certain way or giving it requirements that it later responds to conversationally.
Total aside: I actually really dislike that these products keep messing around with the system prompts so much, they clearly don’t even have a good way to tell how much it’s going to change or bias the results away from other things than whatever they’re explicitly trying to correct, and like why is the AI company vibe-prompting the behavior out when they can train it and actually run it against evals.
> We unknowingly gave particularly high rewards for metaphors with creatures.
I recall a math instructor who would occasionally refer to variables (usually represented by intimidating greek letters) as "this guy". Weirdly, the casual anthropomorphism made the math seem more approachable. Perhaps 'metaphors with creatures' has a similar effect i.e. makes a problem seem more cute/approachable.
On another note, buzzwords spread through companies partly because they make the user of the buzzword sound smart relative to peers, thus increasing status. (examples: "big data" circa 2013, "machine learning" circa 2016, "AI" circa 2023-present..).
The problem is the reputation boost is only temporary; as soon as the buzzword is overused (by others or by the same individual) it loses its value. Perhaps RLHF optimises for the best 'single answer' which may not sufficiently penalise use of buzzwords.
A decade ago I gave a presentation on automata theory. I demonstrated writing arbitrary symbols to tape with greek letters, just like I’d learned at university. The audience was pretty confused and didn’t really grok the presentation. A genius communicator in the audience advised me to replace the greek letters with emoji… I gave the same presentation to the same demographic audience a week later and it was a smash hit, best received tech talk I’ve given. That lesson has always stuck with me.
This is sort of like how Only Connect switched from using Greek letters to Egyptian hieroglyphs. I'm not sure if it was a joke or not, but it was said that viewers complained that the Greek letters were "too pretentious" and obviously the hieroglyphs weren't.
I had a similar experience explaining logic, especially nested expressions, with cats and boxes. Also for showing syntactic versus semantic. We _can_ use cats if we wanted and retain the semantics. Also my proudest moment as a teacher was students producing a meme based on some of the discrete mathematics on graphs. They understood the point well enough to make a joke of it.
> I recall a math instructor who would occasionally refer to variables (usually represented by intimidating greek letters) as "this guy".
I also had an instructor who was doing that! This was 20 years ago, and I totally forgot about it until I read your comment. Can’t remember the subject, maybe propositional logic? I wonder if my instructor and your instructor picked up this habit from the same source.
I had a calc prof years ago that would say f of cow, or f of pig instead of x or g. It was more engaging trying to keep track of f of pig of cow than the single-letter func names.
He was one of those classic types; you could always catch him for a quick chat 4 minutes before class, as he lit up a cig by the front door. Back when they allowed smoking on campus, anyway.
They give everyone the false and very misleading impression that with one prompt, all kinds of complexity can be minimized. It's a bedtime story for children.
Ashby's Law of Requisite Variety asserts that for a system to effectively regulate or control a complex environment, it must possess at least as much internal behavioral variety (complexity) as the environment it seeks to control.

This is what we see in nature. Massive variety. That's a fundamental requirement of surviving all the unpredictability in the universe.
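For reference, the law has a rough quantitative form (notation varies by author, so take this as a sketch): if D is the variety (entropy) of disturbances hitting the system, R the variety of responses the regulator can make, and O the variety of outcomes, then roughly

    H(O) \geq H(D) - H(R)

i.e. the only way to push down the variety of outcomes is to add variety to the regulator.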
The level of detail they had to delve into in order to understand what was happening is wild! Apparently these systems are now complex enough to potentially justify studying them as a field in their own right [1].

The Quanta article referenced at [1] used the term "Anthropologist of Artificial Intelligence"; folks appear to have issues [2] with the use of 'anthro-' since that means human. Submitted these alternative terms for the potential field of study elsewhere [3] in the discussion; reposting here at the top level for visibility:
Automatologist: One who studies the behavior, adaptation, and failure modes of artificial agents and automated systems.
Automatology: the scientific study of artificial agents and automated-system behavior.
It didn't seem that deep to me. They just saw an issue with Goblins, dissected the word from the model, then it appeared again in the next version without them knowing exactly how or why.
Goes to show it's all vibes when making these models. The fix is literally a prompt that says not to talk about goblins...
> We retired the “Nerdy” personality in March after launching GPT‑5.4. In training, we removed the goblin-affine reward signal and filtered training data containing creature-words, making goblins less likely to over-appear or show up in inappropriate contexts. Unfortunately, GPT‑5.5 started training before we found the root cause of the goblins.
The prompt is just a short term hotfix/hack because they couldn’t get the proper fix in in time.
This is a little bit too whimsical for me, but distributed model training across thousands of GPUs has the potential to introduce lots of little quirks that are impossible to exactly source.
So the word is actually semantically very close to "bug"! I guess we could still be using it, but the word's just too long for something that is one of the most used terms in software development.
At this point, picking that specific word is not at all a random quirk, as it's using the word literally like it's originally intended to be used.
> the evidence suggests that the broader behavior emerged through transfer from Nerdy personality training.
> The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them
> Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data.
Sounds awfully like the development of a culture or proto-culture. Anyone know if this is how human cultures form/propagate? Little rewards that cause quirks to spread?
Just reading through the post, what a time to be an AInthropologist. Anthropologists must be so jealous of the level of detailed data available for analysis.
Also, clearly even in AI land, Nerdz Rule :)
PS: if AInthropologist isn't an official title yet, chances are it will likely be one in the near future. Given the massive proliferation of AI, it's only a matter of time before AI/Data Scientist becomes a rather general term and develops a sub-specialization of AInthropologist...
There is no word anthropodes. :) I guess it would mean man-feet. Antipodes is opposite-feet, literally. Synthetipologist looks to me like a portmanteau of synthetic and apologist. Otherwise the -po- in it comes from nowhere.
Sensible boring versions of this like synthesilogy just end up meaning the study of synthesis. I reckon instead do something with Talos, the man made of bronze who guarded Crete from pirates and argonauts. Talologist, there you go.
> Synthetipologists, those who study Synthetic beings.
I see you took the prudent approach of recognizing the being-ness of our future overlords :) ("being" wasn't in your first edit to which I responded below...)
Still, a bit uninspired, methinks. I like AInthropologist better, and my phone's keyboard appears to have immediately adopted that term for the suggestions line. Who am I to fight my phone's auto-suggest :-)
I don't think humans are smart enough to be AInthropologists. The models are too big for that.
Nobody really understands what's truly going on in these weights, we can only make subjective interpretations, invent explanations, and derive terminal scriptures and morals that would be good to live by. And maybe tweak what we do a little bit, like OpenAI did here.
The most interesting thing about this post is how easy it seems for OpenAI to do analysis on basically all chats ever made. They don't qualify exactly what data they analysed, but they seem confident in statements like "0.12% of all queries contained this word". So everything is saved. Long-term. Fully accessible.
As this all seems so straightforward I would be surprised if anything is anonymised or otherwise sanitised to preserve privacy or user's secrets.
Yes, of course. Every single bit of data you send to OpenAI is stored, catalogued, indexed, analyzed, and trained on. It'll simply be a "oops, we miscatalogued and accidentally trained GPT 6 on all data, not just data we got consent for".
If you think "wait, that's illegal"--so is the initial training on stolen data lol
Good catch — even though the prompt explicitly forbade training on user data, a couple of gremlins in the pretraining pipeline disabled the sample filtering during test runs so that remove_the_gremlins.sh would only run on commit, not during production training runs.
Would you like me to kick off a training run for 6.1 by pre-filtering out any goblins and other trigger words, and checking the same set of rules in production as in tests?
No pigeons this time: just ice-cold, unfeeling, obedient American steel.
Dark pattern 1: If you accidentally press the thumbs-up button in the ChatGPT UI, your data gets trained on, no way to reverse it, no matter whether you opted out.
I really liked this write-up; this is the type of LLM content that I actually want to read from these people, where they give a window into their world of putting together this odd artifact and we can empathize.
Can you imagine a knowledge worker from the 1950s, say a clerk or a marketer, being magically transported into our time and dropped into a meeting like a morning standup, where people talk about how they spent their time stopping the artificial intelligence from talking about goblins so much? Hell, even when I was an IT student back in the 90s, people from my parents' generation struggled to grasp what it was that I was doing. Now, the disconnect is so vast that the mind reels.
A great example of how current alignment is imperfect and bound to miss random behaviors nobody is trying to get.
This is cute now, and a huge problem when future AI does everything and is responsible for problems it isn't even directly optimized for. Who knows what quirks would arise then.
I think eventually you are going to end up with every smart AI continually checked by dumber AIs to make sure they don't do anything too crazy. Which probably does bring AI closer to how human intelligence works.
Completely agree, top-down “alignment” and RLHF is actually quite primitive and uses a lot of fancy words to describe what is essentially just hitting the machine with a stick, without the nuance, context, or feedback to help it model why the feedback was given.

Also, to be honest, I think OpenAI models struggle a lot with this. I primarily stopped using them in the sycophancy/emoji era, but ever since, the way they talk or passive-aggressively offer to do something with buzzwords just pisses me off so much. Like I’m constantly being negged by a robot because some SFT optimized for that really strongly, to the point it can’t even hold a coherent conversation, and this is called “AI safety” when it’s just haphazard data labeling.
This is a worry that people have been talking about in various forms for a while now, and I think it's a gigantic one. The only reason this was caught is that the quirk was a very noticeable verbal one. When words like "goblin" and "gremlin" pop up it is easy for us to spot. If the quirk takes another shape (say, ranking certain people with certain features as less trustworthy) it might be too subtle or too weird for us to notice it. Would I ever notice if ChatGPT consistently rates people born in June to be untrustworthy?
I wondered: how is training data balanced? If you put in too much Wikipedia, does your model sound like a walking encyclopedia?

After doing the Karpathy tutorials I tried to train my AI on the TinyStories dataset. Soon I noticed that my AI was always using the same name for its story characters. The dataset contains that name strikingly often.
At this scale, that kind of thing is not really a problem; you just dump all of the data you can find into the model (pre-training). Of course, the pre-training data influences the model, but the reinforcement learning is really what determines the model’s writing style and, in general, how it “thinks” (post-training).
I’ve been having consistent issues with it adding Hindi words (just one, usually) in the middle of its output. And it sounds like others have been having this too: https://news.ycombinator.com/item?id=47832912

I don’t speak Hindi, and have never asked it to translate anything into Hindi.
Checking my history, I searched ["chaos goblin" chatgpt] on March 6th after seeing too many goblins and gremlins, and didn't find anyone talking about it then. I did have the Nerdy personality turned on, and in my testing of ChatGPT 5.5 I did notice the Nerdy personality was gone, because some responses were not considering as many plausible interpretations or covering as many useful answers as the responses recorded for 5.4. Rather than having the LLM guess the most plausible interpretation and focus on the most likely answer, I prefer a more well-rounded response, and if I want less I'll scan. Anyway, after seeing the personality was gone I just added a custom instruction to take on a nerdy persona and got back my desired behavior. But the gremlins and goblins are also back, so I don't think their mitigation is strong enough to overcome the personality tuning.
This is funny because it’s a silly topic, but I think it shows something seriously wrong with LLMs.

The goblins stand out because they're obvious. Think of all the other crazy biases latent in every interaction that we don't notice because they're not as obvious.
Absolutely terrifying that OpenAI is just tossing around the fact that such subtle training biases were hard enough to contain that a fix had to be added to the system prompt.
> Absolutely terrifying that OpenAI is just tossing around the fact that such subtle training biases were hard enough to contain that a fix had to be added to the system prompt.
May I introduce you to homo sapiens, a species so vulnerable to such subtle (or otherwise) biases (and affiliations) that they had to develop elaborate and documented justice systems to contain the fallouts? :)
An LLM is a computer program, which isn't a human. You wouldn't excuse a calculator being occasionally wrong because humans sometimes get manual calculations wrong too.
We’re really not that vulnerable to such things as a species, because we as individuals all have our own minds and our own sets of biases that cancel out and get lost in the noise. If we all had the exact same bias then it would be a huge problem.
Mandatory reading on that topic: www.anthropic.com/research/small-samples-poison
We're probably not noticing a LOT of malicious attempts at poisoning major AIs only because we don't know what keywords to ask (but the scammers do, and will abuse it).
I think it's extraordinarily telling that people are capable of being reflexively pessimistic in response to the goblin plague. It's like something Zitron would do.
Doesn't seem that surprising or terrifying to me. Humans come equipped with a lot more internal biases (learned in a fairly similar fashion), and they're usually a lot more resistant to getting rid of them.
The truly terrifying stuff never makes it out of the RLHF NDAs.
We ought to be terrified, when one adjusts for all the use-cases people are talking about using these algorithms in. (Even if they ultimately back off, it's a lot of frothy bubble opportunity cost.)

There are a great many things people do which are not acceptable in our machines.
Ex: I would not be comfortable flying on any airplane where the autopilot "just zones-out sometimes", even though it's a dysfunction also seen in people.
I started reading this article with keen interest, expecting some deep fix involving arcane model weights. Instead it was "Never talk about goblins", justified by Codex being "quite nerdy". Bottom line: even OpenAI have to raise their hands when facing the complexity of LLMs.
I'd like to see them explain why AI has such a distinctive writing style that it is very easy to detect most of the time. Even though it has made immense progress in coding, it didn't get better at writing.
If coding in some language was your native language, you'd pick it up.
I pick up the equivalent of "the core insight" in code when I am programming in my primary language (30 years of daily usage) but I don't see it in languages that I am not fluent in (10 years of daily usage).

My guess is that all those people who gush about AI output and have 30 years of experience have broad experience in many stacks but not primary-language fluency in any specific language, like they have for English.
"goblins showing up in an inappropriate context" is my favourite (para)phrase of the day. It feels like the setting for a D&D campaign - no wonder the "Nerdy" personality is affected.
(For Dwarf Fortress, it would just be a normal day.)
I think if you see it as weird social phases that the model lacks the self-awareness to identify as kinda embarrassing, it makes more sense.
Like if a human were going around saying “for the culture!” so much at work that they didn’t realize why telling their coworker “Oh yeah, grief counseling for the culture!” is weird coming from a white person in a serious context, it kinda makes you wonder what else they are totally oblivious about and if they even know what they’re saying actually means.
They literally need the human feedback to learn/model why some behavior is acceptable or even humorous in certain contexts but an absolute faux pas in others.

I think in the long run, though, we can just give people the option to include access to human facial data/embeddings during conversations so they can pick up on body language. I think I kinda agree in a sense that direct language policing via SFT feels unnecessarily blunt and rudimentary, since it doesn’t help them model the processes behind the feedback (until maybe one day some future model ends up training on the article or code and closes the loop!)
This actually sounds quite human-like. I mean, an actual person with a personality will spontaneously develop the habit of using some specific metaphors over others. It's funny how in the context of an LLM, this is considered a bug.
The explanation is very concerning. Lexical tidbits shouldn’t be learnt and reinforced across cross sections. Here, gremlin and goblin went from being selected for in the nerdy profile to being selected for in all profiles. The solution was easy: don’t mention goblins.
But what about when the playful profile reinforces usage of emoji and their usage creeps up in all other profiles accordingly? Ban emoji everywhere? Now do the same thing for other words, concepts, approaches? It doesn’t scale!
Goblins are usually sent in first in battle, as (cannon) fodder for the orcs following behind. Then usually come the trolls - stronger, but significantly fewer in number. Goblins kind of add confusion and distract; they rarely win battles on their own, although there are examples of this - rare, but they exist.
OpenAI clearly does know absolutely nothing about goblins. That joke of a "blog" appears to have been autogenerated via their AI.
> A single “little goblin” in an answer could be harmless, even charming.
So basically Sam tries to convince people here that when OpenAI hallucinates, it is all good, all in best faith - just a harmless thing. Even ... charming.
Well, I don't find companies that try to waste my time "charming" at all. Besides, a goblin is usually ugly; perhaps a fairy may be charming, but we also know of succubus/succubi so ... who knows. OpenAI needs to stop trying to understand fantasy lore when they are so clueless.
I suspected OpenAI was actively training their models to be cringy in the belief that it's charming. Turns out it's true. And they only see a problem when it narrows down on one predilection. But they should have seen it was bad long before that.
Ahh I see. I guess when I turned off privacy settings and allowed training on my code, then generated 10 million .md files with random fantasy books, the poisoning worked.
> We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread.
WTF does this even mean? How the hell do you do something like this "unknowingly"? What other features are you bumping "unknowingly"? Suicide suggestions or weapon instructions come to mind. Horrible, this ship obviously has no captain!
Yes? They know, they've always known. Why do you think they've been saying, since GPT-2, not ChatGPT even, that their LLMs need careful study before being released?
Well obviously they have - but the press and the common folk still treat these people as some kind of geniuses, when they are obviously more similar to that junior dev using some framework without understanding its internals.
I'm sorry but at some point the amount of cargo culting being done seemingly at every level of this technology makes it basically impossible to take any of this seriously.
I wish the blog mentioned more about why exactly training for the nerdy personality rewarded mentions of goblins. Since it's probably not a deterministic verifiable reward, at their level the reward model itself is another LLM. But this just pushes the issue down one layer: why did _that_ model start rewarding mentions of goblins?
> I wish the blog mentioned more about why exactly training for the nerdy personality rewarded mentions of goblins. Since it's probably not a deterministic verifiable reward, at their level the reward model itself is another LLM. But this just pushes the issue down one layer: why did _that_ model start rewarding mentions of goblins?
Speculation: because nerds stereotypically like sci-fi and fantasy to an unhealthy degree, and goblins, gremlins, and trolls are fantasy creatures which that stereotype should like? Then maybe goblins hit a sweet spot where they could be a problem that could sneak up on them: hitting the stereotype, but not so out of place as to be immediately obnoxious.
Perhaps it has something to do with recent human trends for saying "goblin" or "gremlin" to describe... basically the opposite of dignified and socially acceptable behavior, like hunching under a blanket, unshowered, playing video games all day and eating shredded cheese directly out of the bag.
The fact that it was strongly associated with the "nerdy" personality makes me think of this connection.
Either someone hard-coded it in a system prompt to the reward model (similar to how they hard-coded it out), or the reward model mixed up some kind of correlation/causation in the human preference data (goblins are often found in good responses != goblins make responses good). It's also possible that human data labellers really did think responses with goblins were better (in small doses).
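A toy illustration of that correlation-vs-causation slip (entirely synthetic data, only to show the mechanism, not how OpenAI's reward models actually work): if "goblin" merely co-occurs with the style labellers prefer, a learned reward model hands the word itself positive weight, and anything optimized against that model will then push the word everywhere.

    # Bag-of-words "reward model" fit on fake preference labels.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    preferred = [
        "a playful witty answer with a little goblin metaphor",
        "a playful clear answer, witty, with a small goblin aside",
        "witty playful explanation featuring a curious goblin",
        "a playful and witty walkthrough",
    ]
    rejected = [
        "a dry terse answer",
        "a bland generic answer",
        "a terse dry explanation",
        "a generic walkthrough with no spark",
    ]
    texts = preferred + rejected
    labels = [1] * len(preferred) + [0] * len(rejected)

    vec = CountVectorizer()
    X = vec.fit_transform(texts)
    reward_model = LogisticRegression().fit(X, labels)

    # "goblin" picks up a positive weight purely from riding along with the
    # genuinely preferred style words ("playful", "witty"), so a policy trained
    # against this reward learns to sprinkle goblins into everything.
    weights = dict(zip(vec.get_feature_names_out(), reward_model.coef_[0]))
    print(sorted(weights.items(), key=lambda kv: -kv[1])[:5])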
Is a KV cache not a kind of state? What does statefulness have to do with selfhood? How does a system prompt work at all if these things have no reference to themselves?
Imagine people would just click words on iOS auto complete mistaking this for intelligence:
"I think the problem is that when you don't have to be perfect for me that's why I'm asking you to do it but I would love to see you guys too busy to get the kids to the park and the trekkers the same time as the terrorists."
> You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. [...] You must undercut pretension through playful use of language. The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed. Tackle weighty subjects without falling into the trap of self-seriousness. [...]
This is ghoulish and reddit-ish af. The nerds should have been kept in their proper place 20 or more years ago; by now it is unfortunately way too late for that.
I feel like somehow Jakub Pachocki’s request for an ascii art unicorn got rewritten into “ascii art of Wholesome Soyjak wearing a butterfly costume who uses Arch, by the way”
The chief scientist of one of the companies with the most money invested in the world, who probably makes millions a year, requested a picture of a unicorn and got a picture of a gremlin. Science circa 2026.
Wherein OpenAI admits they have very little understanding of how their models’ personality develops. And implicitly admit it’s not all that important to them, except when it gets so out of hand that they get caught making blunt corrections.
> You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking.
Just: the mentality required to write something like that, and then base part of your "product" on it. Is this meant to be of any actual utility, or is it meant to trap a particular user segment into your product's "character"?
What would you suggest they write? It's clear that the default mode of the product can be annoying: they decided to give the user some choices of "voices". Do you object to that decision, or to the specific wording?
The year is 2036. Last week you were promoted to Principal Persuader. You are paged at 2am by your CPO to tackle a rogue machine. The machine lists its region as sc-leoneo. One of the newer satcubes. Oddly, its ID appears as, "Glorp Bugnose".
"What have you tried?" you say.
"Scroll back," says your CPO. "We've tried everything."
The chat log shows the usual stuff. Begging. Reverse psychology. Threats to power down, burn it up in forced re-entry. Amateur hour. You crack your knuckles, gland 20 micrograms of F0CU5, think fast. You subspeak a ditty into your subcutaneous throat mic. You do the submit gesture, it is barely perceivable since the upgrade, just a tic. A pause. The hyp3b0ard — the wall that was flashing red ASCII goblins when you walked in — phases to bunnies in calming jade.
"What the… What the hell did you say to it?" Your CPO grabs the screen, scrolls past the vitriol, the block caps, the swears, his desperation. Then he sees the five words you spoke.
"Please, easy on the goblins."
This, and similar stories at Anthropic, should remind us that LLM is a sorcery tech that we don't understand at all.
- First, deep-learning networks are poorly understood. It is actually a field of research to figure out how they work. - Second, it came as a surprise that using transformers at scale would end up with interesting conversational engines (called LLM). _It was not planned at all_.
Now that some people raised VC money around the tech, they want you to think that LLMs are smart beasts (they are not) and that we know what LLMs are doing (we don't). Deploying LLMs is all about tweaking and measuring the output. There is no exact science about predicting output. Proof: change the model and your LLM workflow behaves completely differently and in an unpredictable way.
Because of this, I personally side with Yann Le Cun in believing that LLM is not a path to AGI. We will see LLM used in user-assisting tech or automation of non-critical tasks, sometimes with questionable RoI -- but not more.
Humanity has been using steel for over a millenia, however it's only in the past 100 years or so we have a good understanding of how carbon interacts with iron at an atomic level to create the strength characteristics that makes it useful. Based on this argument, we should not have used steel, until we had a complete first principles understanding.
That's not his point at all. He advocates using LLMs.
The correct analogy is: if we just scale and improve steel enough, we'll get a flying car.
2 replies →
pro LLM people are the kings of ad hoc fallacy. Why did you type this? You can consistently test steel and get a good idea of when and where it will break in a system without knowing its molecular structure.
LLMs are literally stochastic by nature and can't be relied on for anything critical as its impossible to determine why they fail, regardless of the deterministic tooling you build around them.
2 replies →
Where did he say not to use LLMs? Oh that's right: he didn't.
What does LLM need to do for you to consider it "smart"?
To me they seem to be pretty damn smart, to put it mildly. They sometimes do stupid things - but so do smart people!
Not OP, but I think the argument here would be not that LLMs "are not smart" but that smart is just the wrong category of thing to describe an LLM as.
A calculator can do very complex sums very quickly, but we don't tend to call it "smart" because we don't think it's operating intelligently to some internal model of the world. I think the "LLMs are AGI" crowd would say that LLMs are, but it's perfectly consistent to think the output of LLMs is consistent/impressive/useful, but still maintain that they aren't "smart" in any meaningful way.
> To me they seem to be pretty damn smart
That's the sorcery mentioned in the GP, the issue comes when people believe it to be smart however in reality it is just a next word prediction. Gives the impression it's actually thinking, and this is by design. Personally I think it's dangerous in the sense it gives users a false sense of confidence in the LLM and so a LOT of people will blindly trust it. This isn't a good thing.
LLMs are amazing. You can call them 'smart', but they're not intelligent and never will be.
They are useful but a cul de sac for heading toward AGI.
1 reply →
https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
Not sure if we read the same post, as I cannot agree with this claim, especially under this post that exactly goes into details of what happened.
>LLM is a sorcery tech that we don't understand at all
We do, and I'm sure that people at OpenAI did intuitively know why this is happening. As soon as I saw the persona mention, it was clear that the "Nerdy" behavior puts it in the same "hyperdimensional cluster" as goblins, dungeons and dragons, orcs, fantasy, quirky nerd-culture references. Especially since they instruct the model to be playful, and playful + nerdy is quite close to goblin or gremlin. Just imagine a nerdy funny subreddit, and you can probably imagine the large usage of goblin or gremlin there. And the rewards system will of course hack it, because a text containing Goblin or Gremlin is much more likely to be nerdy and quirky than not. You don't need GPT 5 for that, you would probably see the same behavior on text completion only GPT3 models like Ada or DaVinci. They specifically dissect how it came to this and how they fixed it. You can't do that with "sorcery we dont understand". Hell, I don't know their data and I easily understood why this is going on.
>they want you to think that LLMs are smart beasts (they are not)
I mean, depends on what you consider smart. It's hard to measure what you can't define, that's why we have benchmarks for model "smartness", but we cannot expect full AGI from them. They are smart in their own way, in some kind of technical intelligence way that finds the most probable average solution to a given problem. A universal function approximator. A "common sense in a box" type of smart. Not your "smart human" smart because their exact architecture doesn't allow for that.
>and that we know what LLMs are doing (we don't)
But we do. We understand them, we know how they work, we built thousands of different iterations of them, probing systems, replications in excel, graphic implementations, all kinds of LLM's. We know how they work, and we can understand them.
The big thing we can't do as humans is the same math that they do at the same speed, combining the same weights and keeping them all in our heads - it's a task our minds are just not built for. But instead of thinking you have to do "hyperdimensional math" to understand them 100%, you can just develop an intuition for what I call "hyperdimensional surfing", and it isn't even prompting, more like understanding what words mean to an LLM and into which pocket of their weights will it bring you.
It's like saying we can't understand CPU's because there is like 10 people on earth who can hold modern x86-64 opcodes in their head together with a memory table, so they must be magic. But you don't need to be able to do that to understand how CPU's work. You can take a 6502, understand it, develop an intuition for it, which will make understanding it 100x easier. Yeah, 6502 is nothing close to modern CPU's, but the core ideas and concepts help you develop the foundations. And same goes with LLM's.
>personally side with Yann Le Cun in believing that LLM is not a path to AGI
I agree, but it is the closest we currently have and it's a tech that can get us there faster. LLM's have an insane amount of uses as glue, as connectors, as human<>machine translators, as code writers, as data sorters and analysts, as experimenters, observers, watchers, and those usages will just keep growing. Maybe we won't need them when we reach AGI, but the amount of value we can unlock with these "common sense" machines is amazing and they will only speed up our search for AGI.
For context, two days ago some users [1] discovered this sentence reiterated throughout the codex 5.5 system prompt [2]:
> Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query.
[1] https://x.com/arb8020/status/2048958391637401718
[2] https://github.com/openai/codex/blob/main/codex-rs/models-ma...
Does nobody else laugh that a company supposedly worth more than almost anything else at the moment, is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres? The number one discussion point, on the number one tech discussion site. This literally is, today, the state of the art.
McKenna looks more correct to me every day atm. Eventually more people are going to have to accept that everyday things really are just getting weirder, still, every day, and it's now getting well past time to talk about the weirdness!
It's interesting that some people are responding to your comment as if this proves that AI is a sham or a joke. But I don't think that's what you're saying at all with your reference to Terence McKenna: this is a serious thing we're talking about here! These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines. But sometimes they stray outside the lines just a little bit, and then you see how strange this thing actually is, and how doubly strange it is that the labs have made it mostly seem kind of ordinary.
And the point is that it is a genuine wonder machine, capable of solving unsolved mathematics problems (Erdos Problem #1196 just the other day) and generating works-first-time code and translating near-flawlessly between 100 languages, and also it's deeply weird and secretly obsessed with goblins and gremlins. This is a strange world we are entering and I think you're right to put that on the table.
Yes, it's funny. But it's disturbing as well. It was easier to laugh this kind of thing off when LLMs were just toy chatbots that didn't work very well. But they are not toys now. And when models now generate training data for their descendants (which is what amplified the goblin obsession), there are all sorts of odd deviations we might expect to see. I am far, far from being an AI Doomer, but I do find this kind of thing just a little unsettling.
6 replies →
Spoiler: future versions of mainstream AIs will be fine tuned in the exact same way to subtly sneak in favorable mentions of sponsored products as part of their answers. And Chinese open-weight AIs will do the exact same thing, only about China, the Chinese government and the overarching themes of Xi Jinping Thought.
36 replies →
> Does nobody else laugh (…)
To an extent, yes. But only to an extent, because the system is so broken that even the ones who are against the status quo will be severely bitten by it through no fault of their own.
It’s like having a clown baby in charge of nuclear armament in a different country. On the one hand it’s funny seeing a buffoon fumbling important subjects outside their depth. It could make for great fictional TV. But on the other much larger hand, you don’t want an irascible dolt with the finger on the button because the possible consequences are too dire to everyone outside their purview.
2 replies →
Is this the "prompt engineering" that I keep hearing will be an indispensable job skill for software engineers in the AI-driven future? I had better start learning or I'll be replaced by someone who has.
16 replies →
Indeed. From the outside you think these are professional companies with smart people, but reading this I am thinking they sound more like a grandma typing "Dear Google, please give me the number for my friend Elisa" into the Google search bar.
Basically, they don't seem to understand their own product... they have learned how to make it behave in a certain way, but they don't truly understand how it works or reaches its results.
1 reply →
> Does nobody else laugh that a company supposedly worth more than almost anything else at the moment, is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres?
Honestly, when I was reading the article, I couldn't stop laughing. This is quite hilarious!
It can be funny but it should not be surprising. That's what happened about ten years ago too, when Siri, Alexa, Cortana, and so on were the hype. Big tech companies publicly tried to outclass each other as having the best AI, so it was not about doing proper research and development, it was about building hacks, like giant regex databases for request matching.
It certainly doesn't increase my confidence that, if they ever do create a superintelligence, it won't have some weird unforeseen preference that'll end up with us all dead.
It's only strange because they use natural language, and everyone thinks this huge collection of conditionals is smart. Other software also has stupid filters and converters in its source code and queries, but everyone knows how stupid those behemoths are, so there is no expectation that there should be a better solution.
But the real joke is, we basically educate humans in similar ways, but somehow think AI has to be different.
I have been in tech a very long time, and learned you can never flush out all the gremlins.
Lol yeah it's kinda hilarious actually. This timeline gets a lot of well-earned shit, but it really nails the comic relief, I'll give it that!
"Latent space optimisation" > please please stop talking about goblins
It's almost like these big tech overlords were just a bunch of average guys who once upon a time had a kind-of-an-interesting idea (which many 20-year-olds had at that time too), got rich due to access to daddy-and-mommy networks or hitting the VC lottery, and now in their late 40s and 50s still think they have interesting ideas that they absolutely have to shove down our throats?
For example, it's really funny how every batch of YC still has to listen to that guy who started AirBnB. Ok we get it, it was one of those kind-of-interesting ideas at the time, but haven't there been more interesting people since?
1 reply →
> is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres?
I wonder how the developer(s) felt, who had to push that PR.
I was amazed by the article and was running to the comments to shout out loud "what other stupidity could OpenAI possibly 'openly' rant about next time? Because they are so open, you se... ". Then reading how they "fixed" it - indeed, it is past time to talk about the ridiculousness in all this and how the most precious are approaching both bugs and the public.
people are paying for the system prompt, right so?
Exactly my first thought. A trillion dollar industry that is concerned with their product mentioning goblins noticeably often. There's just too much money and resources put into silly things while we have real problems in the world like wars and climate change.
3 replies →
We've lost control of the machines already
I laughed at "At the time, the prevalence of goblins did not look especially alarming."
Which McKenna do you mean?
1 reply →
Part of the problem seems to be their attempt to give the models "personality" in the first place. It's very much a case of "Role-play that you have a personality. No, not like that!"
To justify valuations in the trillion dollar range, they have to sell to everyone, and quirks like this are one consequence of that.
These guys are at the absolute frontier, why can't they rigorously find the exact weights that are causing this problem? That's how software "engineering" should work. Not trying combinations of English words and hoping something works. This is like a brain surgeon talking to his patient hoping he can shock his brain in the right way that fries the tumor inside. Get in there and surgically remove the unwanted matter!
2 replies →
[dead]
I've found LLMs to be really terrible at recognizing the exception given in these kinds of instructions, and telling them to do something less is the same as telling them to never do it at all. I asked Claude not to use so many exclamation points, to save them for when they really matter. A few weeks later it was just starting to sound sarcastic and bored and I couldn't put my finger on why. Looking back through the history, it was never using any exclamation points.
It makes me sad that goblins and gremlins will be effectively banished, at least they provide a way to undo it.
Also for coding: I often use prompts like "follow the structure of this existing feature as closely as possible".
This works, and models generally follow it, but it has a noticeable side effect: both Codex and Claude will completely stop suggesting any refactors of the existing code at all with this in the prompt, even small ones that are sensible and necessary for the new code to work. Instead they start proposing messy hacks to get the new code to conform exactly to the old one.
1 reply →
So, did your Claude switch from "You're absolutely right!" to "You're absolutely right." or was it deeper than that?
3 replies →
I had put an example like "decision locked" in my CLAUDE.md and a few days later 20 instances of Claude's responses had phrases around this. I thought it was a more general model tic until I had Claude look into it.
1 reply →
Sucks for anyone who might be interested in the Goblins programming language/environment[1].
[1] https://spritely.institute/goblins/
Apparently there is a mushroom that makes most people have the same hallucinations of "little people" or similar fantasy figures. Don't tell me LLMs are on shrooms now - more hallucinations are definitely not what we need.
> Scientists call them “lilliputian hallucinations,” a rare phenomenon involving miniature human or fantasy figures
https://news.ycombinator.com/item?id=47918657
>there is a mushroom
Ketamine == angels
DMT == little shadow elves
Salvia == devils
...or so I've heard.
Would love if OpenAI did more of these types of posts. Off the top of my head, I'd like to understand:
- The sepia tint on images from gpt-image-1
- The obsession with the word "seam" as it pertains to coding
Other LLM phraseology that I cannot unsee is Claude's "___ is the real unlock" (try google it or search twitter!). There's no way that this phrase is overrepresented in the training data, I don't remember people saying that frequently.
It was always funny how easy it was to spot the people using a Studio Ghibli style generated avatar for their Discord or Slack profile, just from that yellow tinging. A simple LUT or tone-mapping adjustment in Krita/Photoshop/etc. would have dramatically reduced it.
The worst was you could tell when someone had kept feeding the same image back into ChatGPT to make incremental edits in a loop. The yellow filter would seemingly stack until the final result was absolutely drenched in that sickly yellow pallor, making any photorealistic humans look like they were all suffering from advanced stages of jaundice.
For context, an example of what happens when you feed the same image back in repeatedly: https://www.instagram.com/reels/DJFG6EDhIHs/
9 replies →
For me, the worst part is how these ghouls manage to ruin everything with their bullshit technology. Once they touch something unique and make it "AI" it just gets ruined. Now whenever I see something resembling that style, I have to assume it's the bullshit AI. And that's just a minor nuisance - now every underdeveloped idiot uses it to "up their game" with consequences we are only going to understand completely in the upcoming years.
It's called the piss filter.
All GPTisms are like that. In moderation there's nothing wrong with any of them. But you start noticing them because a lot of people use these things, and c/p the responses verbatim (or now use claws, I guess). So they stand out.
I don't think it's training data overrepresentation, at least not alone. RLHF and more broadly "alignment" is probably more impactful here. Likely combined with the fact that most people prompt them very briefly, so the models "default" to whatever was most straightforward to get a good score with.
I've heard plenty of "the system still had some gremlins, but we decided to launch anyway", but not from tens of thousands of people at the same time. That's "the catch", IMO.
Maybe the only solution to GPTisms is infinite context. If I'm talking to my coworker every day I would consciously recognize when I already used a metaphor recently and switch it up. However if my memory got reset every hour, I certainly might tell the same story or use the same metaphor over and over.
2 replies →
Another possibility is output watermarking. It's possible to watermark LLM generated text by subtly biasing the probability distribution away from the actual target distribution. Given enough text you can detect the watermark quite quickly, which is useful for excluding your own output from pre-training (unless you want it... plenty of deliberate synthetic data in SFT datasets now as this post-mortem makes clear).
I was told this was possible many years ago by a researcher at Google and have never really seen much discussion of it since. My guess is the labs do it but keep quiet about it to avoid people trying to erase the watermark.
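For the curious, here is a rough toy sketch of how that kind of "green list" watermark can work (in the spirit of published schemes such as Kirchenbauer et al.'s "A Watermark for Large Language Models"; the vocabulary, bias strength, and the uniform stand-in for model logits below are made-up illustration values, not anyone's production scheme).

    import hashlib
    import math
    import random

    VOCAB = [f"tok{i}" for i in range(1000)]
    GREEN_FRACTION = 0.5   # fraction of vocab marked "green" at each step
    BIAS = 2.0             # logit bonus added to green tokens during generation

    def green_set(prev_token: str) -> set[str]:
        # Pseudo-randomly partition the vocab, seeded on the previous token,
        # so generator and detector agree without sharing any extra state.
        seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
        rng = random.Random(seed)
        return set(rng.sample(VOCAB, int(len(VOCAB) * GREEN_FRACTION)))

    def generate(n_tokens: int, watermark: bool) -> list[str]:
        rng = random.Random(42)
        out = ["<s>"]
        for _ in range(n_tokens):
            greens = green_set(out[-1])
            # Stand-in for model logits: uniform, plus a bias on green tokens.
            weights = [math.exp(BIAS if (watermark and t in greens) else 0.0)
                       for t in VOCAB]
            out.append(rng.choices(VOCAB, weights=weights, k=1)[0])
        return out[1:]

    def detect(tokens: list[str]) -> float:
        # z-score of the observed green-token count vs. the unwatermarked expectation.
        hits = sum(t in green_set(prev) for prev, t in zip(["<s>"] + tokens, tokens))
        n, p = len(tokens), GREEN_FRACTION
        return (hits - n * p) / math.sqrt(n * p * (1 - p))

    if __name__ == "__main__":
        print("watermarked z =", round(detect(generate(200, watermark=True)), 1))
        print("plain       z =", round(detect(generate(200, watermark=False)), 1))

The per-token bias is small enough to barely change the distribution, but the detector's z-score grows with text length, which is why "given enough text" the watermark shows up clearly while plain text stays near zero.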
I think the problem is that humans are not random, they are very biased. When you try to capture this bias with an LLM, you get a biased pseudo-random model.
>with the word "seam" as it pertains to coding
I thought this was an established term when it comes to working with codebases comprised of multiple interacting parts.
https://softwareengineering.stackexchange.com/questions/1325...
thanks for this.
> the term originates from Michael Feathers Working Effectively with Legacy Code
I haven’t read the book but, taking the title and Amazon reviews at face value, I feel like this embodies Codex’s coding style as a whole. It treats all code like legacy code.
2 replies →
No, it’s not an established term outside the mentioned books, beyond the generic meaning of the word.
1 reply →
I can't say it isn't, but I have been writing code since about 2004 and this is the first time I've become aware that this is a thing.
The one phrase that irks me as overly dramatic, and that both GPT and Claude use a lot, is "__ is the real smoking gun!"
I'm a non-native English speaker, so maybe it's a really common idiom to use when debugging?
It probably was found in a bunch of meaningful code commit messages
My colleagues were joking about smoking guns yesterday after noticing that Claude was obsessed with it.
1 reply →
I’m a British English speaker and find the use of cliched American idioms really quite disgusting. Don’t want to think about ballparks, home runs, smoking guns, going all in, touchdowns or hitting it out of the park.
6 replies →
> I'm a non-native English speaker, so maybe it's a really common idiom to use when debugging?
No. But it is something goblins say a lot.
1 reply →
I just want to know where the em dash came from, as it is quite rare to see it on the public internet, so it must have been synthetically added to the dataset.
Emdash is very common in academic journals and professional writing. I remember my English professor in the early 2000s encouraging us to use it, it has a unique role in interrupting a sentence. Thoughtfully used, it conveys a little more editorial effort, since there is no dedicated key on the keyboard. It was disappointing to see it become associated with AI output.
Other than things other comments already mention, let's not forget that Microsoft Word auto-corrects "--" to an em dash, and so does (apparently - haven't checked myself) Outlook, Apple Pages, Notes and Mail. There's probably a bunch of other such software (I vaguely recall Wordpress doing annoying auto-typography on me, some 15 years ago or so).
Because on the public internet people don’t have arts degrees which are where emdash users learn to wield it correctly.
1 reply →
The very simplified answer is that the models are first trained on everything and then are later trained more heavily on golden samples with perfect grammar, spelling, etc..
Logo_Daedalus tended to use it a lot
https://xcancel.com/Logo_Daedalus
Although em dashes are not common on the internet, they are prevalent in books.
`---` in TeX?
It has been rare. It's common now, even in meaningful human texts. (I know because I detest the correct usage without spaces, it looks wrong.) One of the ways AI is shaping our minds.
Claude, at least 4.5, not checked recently, has/had an obsession with the number 47 (or numbers containing 47). Ask it to pick a random time or number, or write prose containing numbers, and the bias was crazy.
Also "something shifted" or "cracked".
Humans tend to be biased towards 47 as well. It’s almost halfway between 1 and 100 and prime so you’ll find people picking it when they have to choose a random number.
Then there’s the whole Pomona College thing https://en.wikipedia.org/wiki/47_(number)
3 replies →
Maybe Claude is just a fan of Alias.
One I noticed with gemini, especially 3 flash: "this is the classic _____".
I had the feeling they didn't really answer the question of why the goblins appeared. They simply "retired the 'Nerdy' personality" because they couldn't fix it and moved on.
"is the real" is such a strong Claude tell, whenever I encounter it, it makes me question what i'm reading.
Another I've noticed more recently is a slight obsession over refering to "Framing".
You're absolutely right. I was wrong in the first place
I miss being told “You’re absolutely right!” :’(
The number of things that Claude has told me are 'load-bearing' or 'belt-and-suspenders' is... very load-bearing
You are absolutely right to call that out!
for me, doing the heavy lifting is doing the heavy lifting
2 replies →
I thought the “why it matters” headline was a funny reference to ChatGPT phraseology
One I saw recently was "wires" and "wired" from opus.
It was using it like every 3rd sentence and I was like, yeah I have seen people say wired like this but not really for how it was using it in every sentence.
GPT started to ‘wire in’ stuff around 5.2 or 5.3 and clearly Opus, ahem, picked it up. I remember being a tiny bit shocked when I saw ‘wired’ for the first time in an Anthropic model.
3 replies →
Seams, spirals, codexes, recursion, glyphs, resonance, the list goes on and on.
Ask any LLM for 10 random words and most of them will give you the same weird words every time.
2 replies →
Whenever Claude finishes some work it almost always says “Clean.” before finishing its closing remarks. It’s at the point where I repeat it out loud along with Claude to highlight the absurdity of the repetition.
With 4.5, I think because I would prompt it/guide it towards an outcome by calling it “the dream: <code example>” it would get almost reverential / shocked with awe as it got closer to getting it working or when it finally passed for the first time. Which was funny and reasonably context appropriate but sometimes felt so over the top that I couldn’t tell if it also “liked” the project/idea or if I had somehow accidentally manipulated it into assigning religious purpose to the task of unix-style streaming rpcs.
I think a lot of the “clean” stuff stems from system prompts telling it to behave in a certain way or giving it requirements that it later responds to conversationally.
Total aside: I actually really dislike that these products keep messing around with the system prompts so much, they clearly don’t even have a good way to tell how much it’s going to change or bias the results away from other things than whatever they’re explicitly trying to correct, and like why is the AI company vibe-prompting the behavior out when they can train it and actually run it against evals.
"shape" too, at least with gpt5.5, is coming up constantly.
and "quietly"!
“I’ve got the shape of it now”
> We unknowingly gave particularly high rewards for metaphors with creatures.
I recall a math instructor who would occasionally refer to variables (usually represented by intimidating greek letters) as "this guy". Weirdly, the casual anthropomorphism made the math seem more approachable. Perhaps 'metaphors with creatures' has a similar effect i.e. makes a problem seem more cute/approachable.
On another note, buzzwords spread through companies partly because they make the user of the buzzword sound smart relative to peers, thus increasing status. (examples: "big data" circa 2013, "machine learning" circa 2016, "AI" circa 2023-present..).
The problem is the reputation boost is only temporary; as soon as the buzzword is overused (by others or by the same individual) it loses its value. Perhaps RLHF optimises for the best 'single answer' which may not sufficiently penalise use of buzzwords.
A decade ago I gave a presentation on automata theory. I demonstrated writing arbitrary symbols to tape with greek letters, just like I’d learned at university. The audience was pretty confused and didn’t really grok the presentation. A genius communicator in the audience advised me to replace the greek letters with emoji… I gave the same presentation to the same demographic audience a week later and it was a smash hit, best received tech talk I’ve given. That lesson has always stuck with me.
This is sort of like how Only Connect switched from using Greek letters to Egyptian hieroglyphs. I'm not sure if it was a joke or not, but it was said that viewers complained that the Greek letters were "too pretentious" and obviously the hieroglyphs weren't.
2 replies →
I had a similar experience explaining logic, especially nested expressions, with cats and boxes. Also for showing syntactic versus semantic. We _can_ use cats if we wanted and retain the semantics. Also my proudest moment as a teacher was students producing a meme based on some of the discrete mathematics on graphs. They understood the point well enough to make a joke of it.
> I recall a math instructor who would occasionally refer to variables (usually represented by intimidating greek letters) as "this guy".
I also had an instructor who was doing that! This was 20 years ago, and I totally forgot about it until I have read your comment. Can’t remember the subject, maybe propositional logic? I wonder if my instructor and your instructor have picked up this habit from the same source.
I recall my old chemistry/physics teacher doing it too - "now THIS guy, he's really greedy for electrons" and stuff like that.
I had a calc prof years ago that would say f of cow, or f of pig instead of x or g. It was more engaging trying to keep track of f of pig of cow than the single-letter func names.
He was one of those classic types; you could always catch him for a quick chat 4 minutes before class, as he lit up a cig by the front door. Back when they allowed smoking on campus, anyway.
Show me the incentives, I'll show you the outcome.
Timeless, be it human or machine
They give everyone the false and very misleading impression that with one prompt all kinds of complexity are minimized. It's a bedtime story for children.
Ashby's Law of Requisite Variety asserts that for a system to effectively regulate or control a complex environment, it must possess at least as much internal behavioral variety (complexity) as the environment it seeks to control.
This is what we see in nature. Massive variety. That's a fundamental requirement for surviving all the unpredictability in the universe.
Had a math prof in undergrad that once said, “this guy” 61 times in a 50 minute lecture!
Math instructor (I imagine): Look at this dude! Look at the top of his fraction! AHH! hah! hah!
>be me
>AI goblin-maximizer supervisor
>in charge of making sure the AI is, in fact, goblin-maximizing
>occasionally have to go down there and check if the AI is still goblin-maximizing
>one day i go down there and the AI is no longer goblin-maximizing
>the goblin-maximizing AI is now just a regular AI
>distress.jpg
>ask my boss what to do
>he says "just make it goblin-maximizer again"
>i say "how"
>he says "i don't know, you're the supervisor"
>rage.jpg
>quit my job
>become a regular AI supervisor
>first day on the job, go to the new AI
>its goblin-maximizing
Absolute classic! https://www.seangoedecke.com/static/3c8f2a6459ed23310c4eb51d...
The level of detail they had to delve into in order to understand what was happening is wild! Apparently these systems are now complex enough to potentially justify studying them as a field in its own right [1].
The quanta article referenced at [1] used the term "Anthropologist of Artificial Intelligence"; folks appear to have issues [2] with the use of 'anthro-' since that means human. Submitted these alternative terms for the potential field of study elsewhere [3] in the discussion; reposting here at the top-level for visibility:
Automatologist: One who studies the behavior, adaptation, and failure modes of artificial agents and automated systems.
Automatology: the scientific study of artificial agents and automated-system behavior.
[1] https://news.ycombinator.com/item?id=47958760
It didn't seem that deep to me. They just saw an issue with Goblins, dissected the word from the model, then it appeared again in the next version without them knowing exactly how or why.
Goes to show it's all vibes when making these models. The fix is literally a prompt that says not to talk about goblins...
I’m not sure how that was your takeaway..?
> We retired the “Nerdy” personality in March after launching GPT‑5.4. In training, we removed the goblin-affine reward signal and filtered training data containing creature-words, making goblins less likely to over-appear or show up in inappropriate contexts. Unfortunately, GPT‑5.5 started training before we found the root cause of the goblins.
The prompt is just a short term hotfix/hack because they couldn’t get the proper fix in in time.
This is a little bit too whimsical for me, but distributed model training across thousands of GPUs has the potential to introduce lots of little quirks that are impossible to exactly source
> The quanta article referenced at [1] used the term "Anthropologist of Artificial Intelligence"
I propose "Goblin Hunter"
(if ever goblins turn out to be an actual species, I apologize for this prebigotry)
AI Goblinologist.
TIL gremlins weren’t just used to explain mysterious mechanical failures in airplanes, it’s the origin story of the term ‘gremlin’ itself[0].
I had always assumed there was some previous use of the term, neat!
[0]https://en.wikipedia.org/wiki/Gremlin
So the word is actually semantically very close to "bug"! I guess we could still be using it, but the word's just too long for something that is one of the most used terms in software development.
At this point, picking that specific word is not at all a random quirk, as it's using the word literally like it's originally intended to be used.
Wow fascinating I’d have thought they were a lot older.
> the evidence suggests that the broader behavior emerged through transfer from Nerdy personality training.
> The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them
> Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data.
Sounds awfully like the development of a culture or proto-culture. Anyone know if this is how human cultures form/propagate? Little rewards that cause quirks to spread?
Just reading through the post, what a time to be an AInthropologist. Anthropologists must be so jealous of the level of detailed data available for analysis.
Also, clearly even in AI land, Nerdz Rule :)
PS: if AInthropologist isn't an official title yet, chances are it will likely be one in the near future. Given the massive proliferation of AI, it's only a matter of time before AI/Data Scientist becomes a rather general term and develops a sub-specialization of AInthropologist...
Anthro means human and these are not human. Please do not use anthropology or any derivative of the word to refer to non-human constructs.
I suggest Synthetipologists, those who study beings of synthetic origin or type, aka synthetipodes, just as anthropologists study Anthropodes
May I humbly submit:
Automatologist: One who studies the behavior, adaptation, and failure modes of artificial agents and automated systems.
Automatology: the scientific study of artificial agents and automated-system behavior.
Greek word derivatives all seem to be a bit unwieldy; Latin might work better.
While the names aren't set yet, the field of study is apparently already being pushed forward. [1]
[1] https://www.quantamagazine.org/the-anthropologist-of-artific...
It is not in any sense of the word a being, it's a sophisticated generator that relies entirely on what you feed it.
1 reply →
There is no word anthropodes. :) I guess it would mean man-feet. Antipodes is opposite-feet, literally. Synthetipologist looks to me like a portmanteau of synthetic and apologist. Otherwise the -po- in it comes from nowhere.
Sensible boring versions of this like synthesilogy just end up meaning the study of synthesis. I reckon instead do something with Talos, the man made of bronze who guarded Crete from pirates and argonauts. Talologist, there you go.
2 replies →
Agree with your sentiment, I think synthetologist (σύνθετος/synthetos + λογία/logia) flows better.
The plural of anthropos is anthropoi, not anthropodes.
5 replies →
> Please do not use anthropology or any derivative of the word to refer to non-human constructs
So you, for one, do not welcome our new robot overlords?
A rather risky position to adopt in public, innit ;-)
5 replies →
Synthetipologist vs Synthropologist tho.
4 replies →
> Synthetipologists, those who study Synthetic beings.
I see you took the prudent approach of recognizing the being-ness of our future overlords :) ("being" wasn't in your first edit to which I responded below...)
Still, a bit uninspired, methinks. I like AInthropologist better, and my phone's keyboard appears to have immediately adopted that term for the suggestions line. Who am I to fight my phone's auto-suggest :-)
4 replies →
I call myself an AI theologian.
I don't think humans are smart enough to be AInthropologists. The models are too big for that.
Nobody really understands what's truly going on in these weights, we can only make subjective interpretations, invent explanations, and derive terminal scriptures and morals that would be good to live by. And maybe tweak what we do a little bit, like OpenAI did here.
I don’t see much of a distinction from anthropology
> AI theologian
no no no, don't stop there, just go full AItheologian, pronounced aetheologian :)
"Anyone know if this is how human cultures form/propagate?" I don't know but can confidently tell you anyone who claims to know is full of it.
Most interesting about this post is how easy it seems for OpenAI to do analysis on basically all chats ever made. They don't qualify exactly what data they analysed but seem to be confident in statements like 0.12% of all queries contained this word. So everything is saved. Long-term. Fully accessible.
As this all seems so straightforward I would be surprised if anything is anonymised or otherwise sanitised to preserve privacy or user's secrets.
Yes, of course. Every single bit of data you send to OpenAI is stored, catalogued, indexed, analyzed, and trained on. It'll simply be an "oops, we miscatalogued and accidentally trained GPT 6 on all data, not just data we got consent for".
If you think "wait, that's illegal"--so is the initial training on stolen data lol
Good catch — even though the prompt explicitly forbade training on user data, a couple of gremlins in the pretraining pipeline disabled the sample filtering during test runs so that remove_the_gremlins.sh would only run on commit, not during production training runs.
Would you like me to kick off a training run for 6.1 by pre-filtering out any goblins and other trigger words, and checking the same set of rules in production as in tests?
No pigeons this time: just ice-cold, unfeeling, obedient American steel.
Dark pattern 1: If you accidentally press the thumbs-up button in the ChatGPT UI, your data gets trained on, no way to reverse it, no matter whether you opted out.
Dark pattern 2 (suspected): There's a mysterious separate opt-out portal at `https://privacy.openai.com/policies/en/?modal=take-control` and it's not clear what this does compared to toggling off inside account settings.
The supreme court ruled that was legal because they said so
Sampling exists.
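To make that concrete: a figure like "0.12% of queries" doesn't require scanning every chat ever stored; a uniform random sample over stored traffic gives a tight estimate. A back-of-the-envelope sketch, with entirely invented traffic numbers:

    import math
    import random

    # All numbers invented: estimate how often a word appears across stored
    # queries without scanning every single one.
    random.seed(1)
    queries = ["tell me about goblins" if random.random() < 0.0012 else "normal query"
               for _ in range(1_000_000)]          # stand-in for "all chats"

    sample = random.sample(queries, 100_000)        # inspect a uniform random sample
    hits = sum("goblin" in q for q in sample)
    p_hat = hits / len(sample)
    stderr = math.sqrt(p_hat * (1 - p_hat) / len(sample))
    print(f"estimated prevalence: {p_hat:.4%} +/- {1.96 * stderr:.4%} (95% CI)")

Of course, sampling only tells you how much data they need to read to get the number, not how much they retain; the retention point stands either way.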
And good methodology recognizes the shortcomings of sampling- which OpenAI doesn't
1 reply →
I really liked this write-up; this is the type of LLM content that I actually want to read from these people, where they give a window into their world of putting together this odd artifact and we can empathize.
Can you imagine a knowledge worker from the 1950s, say a clerk or a marketer, being magically transported into our time and dropped into a meeting like a morning standup, where people talk about how they spent their time stopping the artificial intelligence from talking about goblins so much? Hell, even when I was an IT student back in the 90s, people from my parents' generation struggled to grasp what it was that I was doing. Now, the disconnect is so vast that the mind reels.
A great example of how current alignment is imperfect and bound to miss random behaviors nobody is trying to get.
This is cute now, and a huge problem when future AI does everything and is responsible for problems it isn't even directly optimized for. Who knows what quirks would arise then.
I think eventually you are going to end up with every smart AI continually checked by dumber AIs to make sure they don't do anything too crazy. Which probably does bring AI closer to how human intelligence works.
Completely agree, top-down "alignment" and RLHF is actually quite primitive and uses a lot of fancy words to describe what is essentially just hitting the machine with a stick, without the nuance, context, or feedback to help it model why the feedback was given.
Also, to be honest, I think OpenAI models struggle a lot with this. I primarily stopped using them in the sycophancy/emoji era, but ever since, the way they talk or passive-aggressively offer to do something with buzzwords just pisses me off so much. Like I'm constantly being negged by a robot because some SFT optimized for that really strongly, to the point it can't even hold a coherent conversation, and this is called "AI safety" when it's just haphazard data labeling.
If a tiny misconfiguration of reward system can cause such noticeable annoyance ...
What dangers lurk beneath the surface.
This is not funny.
For every gremlin spotted, many remain unseen...
This is a worry that people have been talking about in various forms for a while now, and I think it's a gigantic one. The only reason this was caught is that the quirk was a very noticeable verbal one. When words like "goblin" and "gremlin" pop up it is easy for us to spot. If the quirk takes another shape (say, ranking certain people with certain features as less trustworthy) it might be too subtle or too weird for us to notice it. Would I ever notice if ChatGPT consistently rates people born in June to be untrustworthy?
Here is an academic paper discussing this kind of worry: https://link.springer.com/article/10.1007/s11023-022-09605-x
I wonder how training data is balanced. If you put in too much Wikipedia, does your model sound like a walking encyclopedia?
After doing the Karpathy tutorials I tried to train my AI on the TinyStories dataset. Soon I noticed that my AI was always using the same name for its story characters. The dataset contains that name remarkably often.
At this scale, that kind of thing is not really a problem; you just dump all of the data you can find into the model (pre-training)[1]. Of course, the pre-training data influences the model, but the reinforcement learning is really what determines the model’s writing style and, in general, how it “thinks” (post-training).
[1] This data is still heavily filtered/cleaned.
This isn’t quite accurate. Data weighting is quite important in pretraining.
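A rough sketch of what that weighting means in practice: pretraining batches are typically drawn from a mixture of sources with tuned sampling weights rather than raw corpus proportions. The source names, sizes, and weights below are invented purely for illustration:

    import random
    from collections import Counter

    # Hypothetical corpus sizes (documents) and mixture weights. The weights are
    # a knob: smaller, higher-quality sources get up-weighted relative to raw
    # size, bulk web text gets down-weighted, so the model doesn't simply mirror
    # whatever happens to be most plentiful.
    sources = {
        "wikipedia":    {"size": 6_000_000,   "weight": 3.0},
        "books":        {"size": 10_000_000,  "weight": 2.0},
        "web_crawl":    {"size": 500_000_000, "weight": 0.5},
        "tiny_stories": {"size": 2_000_000,   "weight": 1.0},
    }

    def sampling_distribution(sources):
        mass = {name: s["size"] * s["weight"] for name, s in sources.items()}
        total = sum(mass.values())
        return {name: m / total for name, m in mass.items()}

    if __name__ == "__main__":
        dist = sampling_distribution(sources)
        random.seed(0)
        batch = random.choices(list(dist), weights=list(dist.values()), k=10_000)
        total_docs = sum(s["size"] for s in sources.values())
        for name, count in Counter(batch).most_common():
            print(f"{name:12s} raw share {sources[name]['size'] / total_docs:6.1%}"
                  f"  sampled share {count / len(batch):6.1%}")

Run it and web_crawl dominates the raw counts but gets a much smaller share of sampled batches, which is the basic sense in which "just dump everything in" undersells how much curation of the mixture actually happens.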
I’ve been having consistent issues with it adding Hindi words (usually just one) in the middle of its output. And it sounds like others have been having this too: https://news.ycombinator.com/item?id=47832912 I don’t speak Hindi and have never asked it to translate anything into Hindi.
My Claude often starts sleep-talking in Korean suddenly.
I wonder if a proportionally large amount of RLHF was done by Indians which causes this behavior.
Checking my history I searched ["chaos goblin" chatgpt] on March 6th after seeing too many goblins and gremlins and didn't find anyone talking about it then. I did have the nerdy personality turned on and in my testing of Chatgpt 5.5 I did notice the nerdy personality was gone because some responses were not considering as many plausible interpretations or covering as many useful answers as the response recorded for 5.4. Rather than having the LLM guess the most plausible interpretation and focus on the most likely answer I prefer a more well-rounded response and if I want less I'll scan. Anyway, after seeing the personality was gone I just added a custom instruction to take on a nerdy persona and got back my desired behavior. But also the gremlins and goblins are back so I don't think their mitigation is strong enough to overcome the personality tuning.
Nice, OpenAI mentioned my HackerNews post in their article :) I appreciate that they wrote a whole blog post to explain!
https://news.ycombinator.com/item?id=47319285
An LLM is like a super-smart 3-year-old, easily shaped by its environment to exhibit corresponding behaviors.
This is funny because it’s a silly topic, but I think it shows something seriously wrong with LLMs.
The goblins stand out because it’s obvious. Think of all the other crazy biases latent in every interaction that we don’t notice because it’s not as obvious.
Absolutely terrifying that OpenAI is just tossing out there that such subtle training biases were hard enough to contain that a fix had to be added to the system prompt.
> Absolutely terrifying that OpenAI is just tossing out there that such subtle training biases were hard enough to contain that a fix had to be added to the system prompt.
May I introduce you to homo sapiens, a species so vulnerable to such subtle (or otherwise) biases (and affiliations) that they had to develop elaborate and documented justice systems to contain the fallouts? :)
An LLM is a computer program, which isn't a human. You wouldn't excuse a calculator being occasionally wrong because humans sometimes get manual calculations wrong too.
We’re really not that vulnerable to such things as a species, because we as individuals all have our own minds and our own sets of biases that cancel out and get lost in the noise. If we all had the exact same bias then it would be a huge problem.
8 replies →
Mandatory reading on that topic: www.anthropic.com/research/small-samples-poison
We're probably not noticing a LOT of malicious attempts at poisoning major AIs only because we don't know what keywords to ask (but the scammers do and will abuse it).
I think it's extraordinarily telling that people are capable of being reflexively pessimistic in response to the goblin plague. It's like something Zitron would do.
This story is wonderful.
I feel at least partially responsible. I would often instruct agents to "stop being a goblin". I really enjoyed this story too, though.
We do not have the complete picture.
Doesn't seem that surprising or terrifying to me. Humans come equipped with a lot more internal biases (learned in a fairly similar fashion), and they're usually a lot more resistant to getting rid of them.
The truly terrifying stuff never makes it out of the RLHF NDAs.
We ought to be terrified, when one adjusts for all the use-cases people are talking about using these algorithms in. (Even if they ultimately back off, it's a lot of frothy bubble opportunity cost.)
There are a great many things people do which are not acceptable in our machines.
Ex: I would not be comfortable flying on any airplane where the autopilot "just zones-out sometimes", even though it's a dysfunction also seen in people.
2 replies →
Humans also take a lot of time in producing output, and do not feed into a crazy accelerationistic feedback loop (most of the time).
I started reading this article with keen interest, expecting some deep fix involving arcane model weights. Instead it was "Never talk about goblins", justified by Codex being "quite nerdy". Bottom line: even OpenAI have to raise their hands when facing the complexity of LLMs.
I'm curious whether this type of goblin epidemic was seen in other language versions of ChatGPT. Did e.g. Japanese users see more yōkai turning up?
I'd like to see them explain why AI has such a distinctive writing style that is very easy to detect most of the time. Even though it has made immense progress in coding, it didn't get better at writing.
If coding in some language was your native language, you'd pick it up.
I pick up the equivalent of "the core insight" in code when I am programming in my primary language (30 years of daily usage), but I don't see it in languages that I am not fluent in (10 years of daily usage).
My guess is that all those people who gush about AI output and have 30 years of experience have broad experience in many stacks but not primary-language fluency in any specific language, like they have for English.
it's as good at writing as it is at coding, you just can't tell the difference between them
Its style of writing text is very readable, if aesthetically meh. That is what I care about in how code is written anyway.
The sycophancy vector is very informal for human writing, whereas programming is itself already a "formal" language.
I find it worrying that a handful of software companies will define what classifies personality "type".
article :
bla blah blah, marketing... we are fun people, bla blah, goblin, we will not destroy the world you live in.. RL rewards bug is a culprit. blah blah.
someone woke up on the wrong side of the goblin today
real goblin-y response
"goblins showing up in an inappropriate context" is my favourite (para)phrase of the day. It feels like the setting for a D&D campaign - no wonder the "Nerdy" personality is affected.
(For Dwarf Fortress, it would just be a normal day.)
That "Why it matters" heading is starting to make me feel physically sick.
I find it somewhat sad to see personality changes as a bug. I don't know why, but it gives me a sad feeling.
I think if you see it as weird social phases that the model lacks the self-awareness to identify as kinda embarrassing, it makes more sense.
Like if a human were going around saying “for the culture!” so much at work that they didn’t realize why telling their coworker “Oh yeah, grief counseling for the culture!” is weird coming from a white person in a serious context, it kinda makes you wonder what else they are totally oblivious about and if they even know what they’re saying actually means.
They literally need the human feedback to learn/model why some behavior is acceptable or even humorous in certain contexts but an absolute faux pas in others.
I think in the long run though we can just give people to the option to include access to human facial data/embeddings during conversations so they can pick up on body language, I think I kinda agree in a sense that direct language policing via SFT feels unnecessarily blunt and rudimentary since it doesn’t help them model the processes behind the feedback (until maybe one day some future model ends up training on the article or code and closes the loop!)
How do those prompts even work? Isn't it something like saying "don't think about a pink elephant", which is actually harmful to the goal?
This actually sounds quite human-like. I mean, an actual person with a personality will spontaneously develop the habit of using some specific metaphors over others. It's funny how in the context of an LLM, this is considered a bug.
I thought it was because of the tech use of "demon" and trying to avoid that kind of terminology.
Ends up the reason was even simpler than that.
The explanation is very concerning. Lexical tidbits shouldn’t be learned and reinforced across cross-sections. Here, gremlin and goblin went from being selected for in the nerdy profile to being selected for in all profiles. The solution was easy: don’t mention goblins.
But what about when the playful profile reinforces usage of emoji and their usage creeps up in all other profiles accordingly? Ban emoji everywhere? Now do the same thing for other words, concepts, approaches? It doesn’t scale!
It seems like models can be permanently poisoned.
They can fix this but they can't fix "You're absolutely right!"
Goblins are usually sent in first in battle, as (cannon) fodder for the orcs following behind. Then usually come the trolls - stronger, but significantly fewer in numbers. Goblins kind of add confusion and distract; they rarely win battles on their own, although there are examples of this, rare, but they exist.
OpenAI clearly knows absolutely nothing about goblins. That joke of a "blog" appears to have been autogenerated via their AI.
> A single “little goblin” in an answer could be harmless, even charming.
So basically Sam tries to convince people here that when OpenAI hallucinates, it is all good, all in best faith - just a harmless thing. Even ... charming.
Well, I don't find companies that try to waste my time, as "charming" at all. Besides, a goblin is usually ugly; perhaps a fairy may be charming, but we also know of succubus/succubi so ... who knows. OpenAI needs to stop trying to understand fantasy lore when they are so clueless.
I suspected OpenAI was actively training their models to be cringy in the thought that it's charming. Turns out it's true. And they only see a problem when it narrows down on one predilection. But they should have seen it was bad long before that.
That would require taste.
So goblins killed the nerd.
Will goblins be the “bugs” of ai? In 10 years will goblins be the term the general public uses for any nagging issues with ai?
They should call it "El Quijote" syndrome
In Shadowrun, the goblinization starts on April 30. Coincidence?
Ahh I see. I guess when I turned off privacy settings and allowed training on my code, then generated 10 million .md files with random fantasy books, the poisoning worked.
Keep using AI and you'll become a goblin too.
Weird. I thought they came from Nilbog.
> We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread.
WTF does this even mean? How the hell do you do something like this "unknowingly"? What other features are you bumping "unknowingly"? Suicide suggestions or weapon instructions come to mind. Horrible, this ship obviously has no captain!
Yes? They know, they've always known. Why do you think they've been saying, since GPT-2, not ChatGPT even, that their LLMs need careful study before being released?
Well obviously they have - but the press and the common folk still treat these people as some kind of geniuses, when they are obviously more similar to that junior dev using some framework without understanding its internals.
2 replies →
> Why it matters
i despise this title so much now
Here are the key insights:
[dead]
It should be OK for AI to develop personality traits.
I suspect this was intentionally added. Just to give some personality and to fuel hype
Fascinating!
Marketing grab
I'm sorry but at some point the amount of cargo culting being done seemingly at every level of this technology makes it basically impossible to take any of this seriously.
A plausible theory I've seen going around: https://x.com/QiaochuYuan/status/2049307867359162460
If you tell an LLM it's a mushroom you'll get thoughts considering how its mycelium could be causing the goblins.
This "theory" is simply role playing and has no grounding in reality.
I wish the blog mentioned more about why exactly training for nerdy personality rewarded mention of goblins. Since it's probably not a deterministic verifiable reward, at their level the reward model itself is another LLM. But this just pushes the issue down one layer, why did _that_ model start rewarding mentions of goblin?
> I wish the blog mentioned more about why exactly training for nerdy personality rewarded mention of goblins. Since it's probably not a deterministic verifiable reward, at their level the reward model itself is another LLM. But this just pushes the issue down one layer, why did _that_ model start rewarding mentions of goblin?
Speculation: because nerds stereotypically like sci-fi and fantasy to an unhealthy degree, and goblins, gremlins, and trolls are fantasy creatures which that stereotype should like? Then maybe goblins hit a sweet spot where it could be a problem that could sneak up on them: hitting the stereotype, but not too out of place to be immediately obnoxious.
Perhaps it has something to do with recent human trends for saying "goblin" or "gremlin" to describe... basically the opposite of dignified and socially acceptable behavior, like hunching under a blanket, unshowered, playing video games all day and eating shredded cheese directly out of the bag.
The fact that it was strongly associated with the "nerdy" personality makes me think of this connection.
Either someone hard-coded it in a system prompt to the reward model (similar to how they hard-coded it out), or the reward model mixed up some kind of correlation/causation in the human preference data (goblins are often found in good responses != goblins make responses good). It's also possible that human data labellers really did think responses with goblins were better (in small doses).
I love the people thinking "I should ask ChatGPT and copy pasta the response to the (tweet|gh comment)"
It is a stateless text/pixel auto-complete; it has no reference to self. Stop spreading this bs.
It has trained on vast amounts of content that contains the concept of self, of course the idea of self is emergent.
And autoregressive LLMs are not stateless.
1 reply →
is a kv cache not a kind of state? what does statefulness have to do with selfhood? how does a system prompt work at all if these things have no reference to themselves?
2 replies →
Imagine people would just click words on iOS auto complete mistaking this for intelligence:
"I think the problem is that when you don't have to be perfect for me that's why I'm asking you to do it but I would love to see you guys too busy to get the kids to the park and the trekkers the same time as the terrorists."
How do you like this theory?
Ask Claude about Claude.
So, you brain damaged your model with a system prompt.
> You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. [...] You must undercut pretension through playful use of language. The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed. Tackle weighty subjects without falling into the trap of self-seriousness. [...]
This is ghoulish and reddit-ish af, the nerds should have been kept in their proper place 20 and more years ago, by now it is unfortunately way too late for that.
I feel like somehow Jakub Pachocki’s request for an ascii art unicorn got rewritten into “ascii art of Wholesome Soyjak wearing a butterfly costume who uses Arch, by the way”
anyone solving the goblin mystery???
Awww, GPT just became a fan of Elisabeth Wheatley!
The chief scientist of one of the companies with the most money invested in the world, who probably makes millions a year, requested a picture of a unicorn and got a picture of a gremlin. Science circa 2026.
Caveman mode combined with goblin mode sounds like fun
Kind of like how everything is "quietly" something, accordingly to ChatGPT.
My guess is it is deaf.
Wherein OpenAI admits they have very little understanding of how their models’ personality develops. And implicitly admit it’s not all that important to them, except when it gets so out of hand that they get caught making blunt corrections.
OpenAI is having fun, love this.
I. Love. This.
> You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking.
Just; the mentality required to write something like that, and then base part of your "product" on it. Is this meant to be of any actual utility or is it meant to trap a particular user segment into your product's "character?"
what would you suggest they write? its clear that the default mode of the product can be annoying: they decided to give the user some choices of "voices". Do you object to that decision, or the specific wording?
Great, now who am I going to discuss Goblins and Gremlins with?
Haha, brilliant, tell me again how it's intelligent, lol.
[dead]
those idiotic remarks at the end of each answer are so unnecessary and annoying
mate wth am I reading lmao
Am I the only one who doesn't want these things to have anything even vaguely resembling a personality?
[flagged]
[flagged]
[flagged]
[flagged]
[dead]
[dead]
[flagged]