← Back to context

Comment by Doohickey-d

2 months ago

> Users requiring raw chains of thought for advanced prompt engineering can contact sales

So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowing to quickly refine the prompt to ensure it didn't.

In addition to openAI, Google also just recently started summarizing the CoT, replacing it with an, in my opinion, overly dumbed down summary.

Could the exclusion of CoT that be because of this recent Anthropic paper?

https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...

>We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.

I.e., chain of thought may be a confabulation by the model, too. So perhaps there's somebody at Anthropic who doesn't want to mislead their customers. Perhaps they'll come back once this problem is solved.

Because it's alchemy and everyone believes they have an edge on turning lead into gold.

  • I've been thinking for a couple of months now that prompt engineering, and therefore CoT, is going to become the "secret sauce" companies want to hold onto.

    If anything that is where the day to day pragmatic engineering gets done. Like with early chemistry, we didn't need to precisely understand chemical theory to produce mass industrial processes by making a good enough working model, some statistical parameters, and good ole practical experience. People figured out steel making and black powder with alchemy.

    The only debate now is whether the prompt engineering models are currently closer to alchemy or modern chemistry? I'd say we're at advanced alchemy with some hints of rudimentary chemistry.

    Also, unrelated but with CERN turning lead into gold, doesn't that mean the alchemists were correct, just fundamentally unprepared for the scale of the task? ;)

    • The thing with alchemy was not that their hypotheses were wrong (they eventually created chemistry), but that their method of secret esoteric mysticism over open inquiry was wrong.

      Newton is the great example of this: he led a dual life, where in one he did science openly to a community to scrutinize, in the other he did secret alchemy in search of the philosopher's stone. History has empirically shown us which of his lives actually led to the discovery and accumulation of knowledge, and which did not.

      3 replies →

  • We won't know without an official answer leaking, but a simple answer could be - people spend too much time trying to analyse those without understanding the details. There was a lot of talk on HN about the thinking steps second guessing and contradicting itself. But in practice that step is both trained by explicitly injecting the "however", "but" and similar words and they do more processing than simply interpreting the thinking part as text we read. If the content is commonly misunderstood, why show it?

IIRC RLHF inevitably compromises model accuracy in order to train the model not to give dangerous responses.

It would make sense if the model used for train-of-though was trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user, since the end user is only ever going to see its output filtered through the public model the chain-of-thought model can be closer to the original, more pre-rlhf version without risking the reputation of the company.

This way you can get the full performance of the original model whilst still maintaining the necessary filtering required to prevent actual harm (or terrible PR disasters).

  • Yeah we really should stop focusing on model alignment. The idea that it's more important that your AI will fucking report you to the police if it thinks you're being naughty than that it actually works for more stuff is stupid.

    • I'm not sure I'd throw out all the alignment baby with the bathwater. But I wish we could draw a distinction between "Might offend someone" with "dangerous."

      Even 'plotting terror attacks' is not something terrorists can do just fine without AI. And as for making sure the model wouldn't say ideas that are hurtful to <insert group>, it seems to me so silly when it's text we're talking about. If I want to say "<insert group> are lazy and stupid," I can type that myself (and it's even protected speech in some countries still!) How does preventing Claude from espousing that dumb opinion, keep <insert group> safe from anything?

      9 replies →

  • Correct me if I'm wrong--my understanding is that RHLF was the difference between GPT 3 and GPT 3.5, aka the original ChatGPT.

    If you never used GPT 3, it was... not good. Well, that's not fair, it was revolutionary in its own right, but it was very much a machine for predicting the most likely next word, it couldn't talk to you the way ChatGPT can.

    Which is to say, I think RHLF is important for much more than just preventing PR disasters. It's a key part of what makes the models useful.

    • Oh sure, RLHF instruction tuning was what turned an model of mostly academic interest into a global phenomenon.

      But it also compromised model accuracy & performance at the same time: The more you tune to eliminate or reinforce specific behaviours, the more you affect the overall performance of the model.

      Hence my speculation that Anthropic is using a chain-of-thought model that has not been alignment tuned to improve performance. This would then explain why you don’t get to see its output without signing up to special agreements. Those agreements presumably explain all this to counter-parties that Anthropic trusts will cope with non-aligned outputs in the chain-of-thought.

Guess we have to wait till DeepSeek mops the floor with everyone again.

  • DeepSeek never mopped the floor with anyone... DeepSeek was remarkable because it is claimed that they spent a lot less training it, and without Nvidia GPUs, and because they had the best open weight model for a while. The only area they mopped the floor in was open source models, which had been stagnating for a while. But qwen3 mopped the floor with DeepSeek R1.

    • I think qwen3:R1 is apples:oranges, if you mean the 32B models. R1 has 20x the parameters and likely roughly as much knowledge about the world. One is a really good general model, while you can run the other one on commodity hardware. Subjectively, R1 is way better at coding, and Qwen3 is really good only at benchmarks - take a look at aider‘s leaderboard, it’s not even close: https://aider.chat/docs/leaderboards/

      R2 could turn out really really good, but we‘ll see.

    • DeepSeek made OpenAI panic, they initially hid the CoT for o1 and then rushed to release o3 instead of waiting for GPT-5.

    • I disagree. I find myself constantly going to their free offering which was able to solve lots of coding tasks that 3.7 could not.

  • Do people actually believe this? While I agree their open source contribution was impressive, I never got the sense they mopped the floor. Perhaps firms in China may be using some of their models but beyond learnings in the community, no dents in the market were made for the West.

> because it helped to see when it was going to go down the wrong track

It helped me tremendously learning Zig.

Seeing his chain of thought when asking it stuff about Zig and implementations let me widen the horizon a lot.

it just makes it too easy to distill the reasoning into a separate model I guess. though I feel like o3 shows useful things about the reasoning while it's happening

The Google CoT is so incredibly dumb. I thought my models had been lobotomized until I realized they must be doing some sort of processing on the thing.

  • You are referring to the new (few days old-ish) CoT right? It’s bizzare as to why google did it, it was very helpful to see where the model was making assumptions or doing something wrong. Now half the time it feels better to just use flash with no thinking mode but ask it to manually “think”.

  • it’s fake cot, just like oai

    • I had assumed it was a way to reduce "hallucinations". Instead of me having to double check every response and prompt it again to clear up the obvious mistakes it just does that in the background with itself for a bit.

      Obviously the user still has to double check the response, but less often.