Comment by lukev
1 year ago
It's interesting to note that there are really two things going on here:
1. An LLM (probably a finetuned GPT-4o) trained specifically to read and emit good chain-of-thought prompts.
2. Runtime code that iteratively re-prompts the model with the chain of thought so far. This sounds like it includes loops, branches and backtracking. This is not "the model", it's regular code invoking the model. Interesting that OpenAI is making no attempt to clarify this.
I wonder where the real innovation here lies. I've done a few informal stabs with #2 and I have a pretty strong intuition (not proven yet) that given the right prompting/metaprompting model you can do pretty well at this even with untuned LLMs. The end game here is complex agents with arbitrary continuous looping interleaved with RAG and tool use.
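To make #2 concrete, here's a minimal sketch of the kind of re-prompting loop with backtracking I mean, assuming only a hypothetical `call_model(prompt) -> str` helper. This is one obvious way to structure it, definitely not OpenAI's actual runtime code:

```python
# Hypothetical sketch of an iterative re-prompting loop with backtracking.
# `call_model(prompt) -> str` is a stand-in for any chat completion call.
def reason(question, call_model, max_steps=10):
    steps = []
    for _ in range(max_steps):
        prompt = (
            f"Problem: {question}\n"
            "Reasoning so far:\n" + "\n".join(steps) + "\n"
            "Write the next reasoning step, reply BACKTRACK if the last step "
            "looks wrong, or reply ANSWER: <answer> when you are done."
        )
        step = call_model(prompt).strip()
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
        if step == "BACKTRACK" and steps:
            steps.pop()           # drop the last step and let the model retry
        else:
            steps.append(step)
    # Fall back to whatever the model can say once the step budget runs out.
    return call_model(f"Problem: {question}\nGive your best final answer.").strip()
```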
But OpenAI's philosophy up until now has almost always been "The bitter lesson is true, the model knows best, just put it in the model." So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.
Without being able to inspect the reasoning tokens, we can't really get a lot of info about which is happening.
If it really is Reinforcement Learning as they claim, it means there might not be any direct supervision on the "thinking" section of the output, just on the final answer.
Just as for chess or Go, you don't train a supervised model by giving it the exact move it should make in each case; you use RL techniques to learn which moves are good based on the end result of the game.
In practice, there probably is some supervision to enforce good style and methodology. But the key here is that it is able to learn good reasoning without (many) human examples, and find strategies to solve new problems via self-learning.
If that is the case it is indeed an important breakthrough.
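For concreteness, the training signal being described would look roughly like this. All names here are hypothetical stand-ins for a sampler and a policy-gradient update, not anything OpenAI has published:

```python
# Toy illustration of outcome-only reward: only the final answer is checked,
# and every token of the sampled "thinking" shares that one scalar signal.
# `sample_fn` and `update_fn` are hypothetical stand-ins; this shows the shape
# of the idea, not OpenAI's code.
def outcome_reward_step(model, problem, correct_answer, sample_fn, update_fn):
    thoughts, answer = sample_fn(model, problem)        # sampled CoT steps + final answer
    reward = 1.0 if answer == correct_answer else 0.0   # win/loss-style outcome signal
    update_fn(model, trajectory=thoughts + [answer], reward=reward)
    return reward
```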
This is the bitter lesson/just put it in the model. They're trying to figure out more ways of converting compute to intelligence now that they're running out of text data: https://images.ctfassets.net/kftzwdyauwt9/7rMY55vLbGTlTiP9Gd...
A cynical way to look at it is that we're pretty close to the ultimate limits of what LLMs can do, and now the stakeholders are looking at novel ways of using what they have instead of pouring everything into novel models. We're several years into the AI revolution (some call it a bubble) and Nvidia is still pretty much the only company that makes bank on it. Other than that it's all investment-driven "growth". And at some point investors are gonna start asking questions...
That is indeed cynical haha.
A very simple observation: our brains are vastly more efficient, obtaining vastly better outcomes from far less input. That alone suggests there's plenty of room for improvement without needing to go looking for more data. Short-term gain versus long-term gain, like you say: shareholder return.
More efficiency means more practical/useful applications at lower cost, as opposed to bigger models, which are less useful (longer inference times) and more expensive (data synthesis and training costs).
One aspect that's not achievable from outside: they discuss hiding the chain of thought in its raw form because the chains are allowed to be unaligned. This lets the model reason without any artifacts from alignment and apply alignment in post-processing, more or less. Doing that yourself requires effectively root access, and you would need the unaligned weights.
Ok but this presses on a latent question: what do we mean by alignment?
Practically it's come to mean just sanitization... "don't say something nasty or embarrassing to users." But that doesn't apply here, the reasoning tokens are effectively just a debug log.
If alignment means "conducting reasoning in alignment with human values", then misalignment in the reasoning phase could potentially be obfuscated and sanitized, participating in the conclusion but hidden. Having an "unaligned" model conduct the reasoning steps is potentially dangerous, if you believe that AI alignment can give rise to danger at all.
Personally I think that in practice alignment has come to mean just sanitization and it's a fig leaf of an excuse for the real reason they are hiding the reasoning tokens: competitive advantage.
Alignment started as a fairly nifty idea, but you can't meaningfully test for it. We don't have the tools to understand the internals of an LLM.
So yes, it morphed into the second best thing, brand safety - "don't say racist / anti-vax stuff so that we don't get bad press or get in trouble with the regulators".
The challenge is that alignment ends up changing the models in ways that aren't representative of the actual training set, and as I understand it this generally lowers performance even for aligned tasks. Further, the decision to summarize the chains of thought means the summaries include answers that wouldn't pass alignment themselves without removal. From what I read, the final output is aligned but could have considered unaligned CoT. In fact, because those chains are in the context, they necessarily change the final output even if the final output complies with the alignment. There are a few other "only root could do this" aspects, which says yes, anyone could implement these without secret sauce as long as they have a raw frontier model.
Glass half full and the good faith argument.
It's a compromise.
OpenAI will now have access to vast amounts of unaligned output, so they can actually study its thinking.
Whereas the current checks and balances meant the request was rejected and the data providing this insight was not created in the first place.
I have also spent some time on 2) and implemented several approaches in this open source optimising LLM proxy - https://github.com/codelion/optillm
In my experience it does work quite well, but we probably need different techniques for different tasks.
Maybe 1 is actually what you just suggested - an RL approach to select the strategy for 2. Thank you for implementing optillm and working out all the various strategy options, it's a really neat reference for understanding this space.
One item I'm very curious about is how they get a score for use in the RL. In well-defined games it's easy to understand, but in this LLM output context how does one rate the output for use in an RL setup?
That's the hardest part, figuring out the reward. For generic tasks it is not easy; in my implementation in optillm I am using the LLM itself to generate a score based on the MCTS trajectory. But that is not as good as having a well-defined reward, say for a coding or logic problem. Maybe they trained a better reward model.
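For concreteness, the "LLM scoring its own trajectory" approach looks roughly like this. A hedged sketch with a hypothetical `call_model(prompt) -> str` helper, not the actual optillm implementation:

```python
# Rough sketch of using the LLM as its own reward model for a reasoning
# trajectory, as described above; not the actual optillm code.
def score_trajectory(call_model, problem, trajectory):
    prompt = (
        f"Problem: {problem}\n"
        "Candidate reasoning:\n" + "\n".join(trajectory) + "\n"
        "On a scale of 0 to 10, how likely is this reasoning to reach a "
        "correct answer? Reply with the number only."
    )
    try:
        raw = float(call_model(prompt).strip())
    except ValueError:
        return 0.0                      # unparsable judgement counts as zero reward
    return max(0.0, min(10.0, raw)) / 10.0
```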
> So it's also possible that the prompt loop has no special sauce and that the capabilities here do come mostly from the model itself.
The prompt loop code often encodes intelligence/information that the human developers tend to ignore during their evaluations of the solution. For example, if you add a filter for invalid json and repeatedly invoke the model until good json comes out, you are now carrying water for the LLM. The additional capabilities came from a manual coding exercise and additional money spent on a brute force search.
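To make that concrete, here's the sort of retry loop being described, as a minimal sketch (the `call_model(prompt) -> str` helper is a hypothetical stand-in for whatever client you use):

```python
import json

# Minimal sketch of the "carrying water" pattern described above: the harness,
# not the model, supplies the reliability by brute-force retrying until the
# output parses.
def get_json(call_model, prompt, max_attempts=5):
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)      # valid JSON finally came out
        except json.JSONDecodeError:
            # The extra "capability" is paid for in tokens and wall-clock time.
            prompt += "\nYour last reply was not valid JSON. Reply with valid JSON only."
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```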
Well, if LLMs are system 1, this difference would be building towards system 2.
https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow
Yes indeed, and personally if we have AGI I believe it will arise from multiple LLMs working in tandem with other types of machine learning, databases for "memory", more traditional compute functions, and a connectivity layer between them all.
But to my knowledge, that's not the kind of research OpenAI is doing. They seem mostly focused on training bigger and better models and seeking AGI through emergence in those.
The innovation lies in using RL to achieve 1.) and provide a simple interface to 2.)
You don't need to execute code to have it backtrack. The LLM can inherently backtrack itself if trained to. It knows all the context provided to it and the output it has already written.
If it knows it needs to backtrack then could it gain much by outputting something that tells the code to backtrack for it? For example, outputting something like "I've disproven the previous hypothesis, remove the details". Almost like asking to forget.
This could reduce the number of tokens it needs at inference time, saving compute. But with how attention works, it may not make any difference to the performance of the LLM.
Similarly, could there be gains by the LLM asking to work in parallel? For example "there's 3 possible approaches to this, clone the conversation so far and resolve to the one that results in the highest confidence".
This feels like it would be fairly trivial to implement.
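For what it's worth, a minimal sketch of that branching idea, assuming a hypothetical `call_model(messages) -> str` chat helper and some confidence scorer (e.g. an LLM judge). One obvious implementation, not a claim about how o1 actually works:

```python
# Clone the conversation once per candidate approach, let the model pursue
# each, and keep the branch the scorer is most confident in.
def branch_and_select(call_model, score_branch, messages, approaches):
    best_branch, best_score = None, float("-inf")
    for approach in approaches:
        branch = messages + [{"role": "user", "content": f"Pursue this approach: {approach}"}]
        branch = branch + [{"role": "assistant", "content": call_model(branch)}]
        score = score_branch(branch)    # e.g. an LLM-judged confidence in [0, 1]
        if score > best_score:
            best_branch, best_score = branch, score
    return best_branch                  # continue the conversation from the winner
```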
O1 seems like a variant of RLRF https://arxiv.org/abs/2403.14238
This is why I became skeptical of OpenAI's claims.
If they shared the CoT the grift wouldn't work.
It's just RL.
I can't help but feel that saying "it's just RL" is like someone at the start of the 20th century saying "it's just electricity", as if understanding the underlying mechanism is the same as understanding the applications it can enable.
Tbf RL is pretty incredible.
I trained a model to play a novel video game using only screenshots and a score via RL, and I discovered how not to lose.
The innovation lies in making the whole loop available to an end user immediately, without them being a programmer. My grandma can build games using ChatGPT now.
No she can't; comments like yours are just made-up nonsense that AI hype-men and investors somehow convinced us are fair opinions to have.
Check out Replit agents; they can make games and apps autonomously now.
While AI is overhyped by some people, the parent's statement is not only true but was true long before o1 was released.
What games have people made with ChatGPT? Do you have an example of a live, deployed game?
Yes, a gazillion of them. Someone in a scrabble Facebook group made this entirely with ChatGPT: https://aboocher.github.io/scrabble/ingpractice.html
Ada Lovelace is my grandma
My great aunt literally asked o1 for fantasy football bets and won $1000 on draftkings. This is a gamechanger
What game has she made?