Comment by layer8
1 year ago
The o1-preview model still hallucinates non-existing libraries and functions for me, and is quickly wrong about facts that aren't well-represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"
While the reasoning may have been improved, this doesn't solve the problem of the model having no way to assess if what it conjures up from its weights is factual or not.
The failure is in how you're using it. I don't mean this as a personal attack, but more to shed light on what's happening.
A lot of people use LLMs as a search engine. It makes sense - it's basically a lossy compressed database of everything it's ever read, and it generates output that is statistically likely - varying degrees of likeliness depending on the temperature, as well as how many times the particular weights your prompt ends up activating.
The magic of LLMs, especially one like this that supposedly has advanced reasoning, isn't the existing knowledge in its weights. The magic is that _it knows English_. It knows English at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output. It's not _just_ an output engine. It's an engine that outputs.
Asking it about nuanced details in the corpus of data it has read won't give you good output unless it read a bunch of it.
On the other hand, if you were to paste the entire documentation set to a tool it has never seen and ask it to use the tool in a way to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.
Don't treat it as a database. Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.
> Treat it as a naive but intelligent intern
That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.
With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive at a common understanding. With an LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking and if it’s able to do it.
I feel like it almost always starts well, given the full picture, but then for non-trivial stuff, gets stuck towards the end. The longer the conversation goes, the more wheel-spinning occurs and before you know it, you have spent an hour chasing that last-mile-connectivity.
For complex questions, I now only use it to get the broad picture and once the output is good enough to be a foundation, I build the rest of it myself. I have noticed that the net time spent using this approach still yields big savings over a) doing it all myself or b) keep pushing it to do the entire thing. I guess 80/20 etc.
1000% this. LLMs can't say "I don't know" because they don't actually think. I can coach a junior to get better. LLMs will just act like they know what they are doing and give the wrong results to people who aren't practitioners. Good on OAI calling their model Strawberry because of Internet trolls. Reactive vs proactive.
> LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.
This is exactly why I’ve been objecting so much to the use of the term “hallucination” and maintain that “confabulation” is accurate. People who have spent enough time with acutely psychotic people, and people experiencing the effects of long term alcohol related brain damage, and trying to tell computers what to do will understand why.
I’m starting to think this is an unsolvable problem with LLMs. The very act of “reasoning” requires one to know that they don’t know something.
LLMs are giant word Plinko machines. A million monkeys on a million typewriters.
LLMs are not interns. LLMs are assumption machines.
None of the million monkeys or the collective million monkeys are “reasoning” or are capable of knowing.
LLMs are a neat parlor trick and are super powerful, but are not on the path to AGI.
LLMs will change the world, but only in the way that the printing press changed the world. They’re not interns, they’re just tools.
Have you ever worked with an intern? They have personalities and expectations that need to be managed. They get sick. They get tired. They want to punch you if you treat them like a 24-7 bird dog. It's so much easier to not let perfect be the enemy of the good and just rapid fire ALL day at an LLM for any and everything I need help with. You can also just not use the LLM. Interns need to be 'fed' work or the ROI ends upside down. Is an LLM as good as a top tier intern? No, but with an LLM I can have 10 pretty good interns by opening 10 tabs.
> That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”.
An intern that grew up in a different culture then, where questioning your boss is frowned upon. The point is that the way to instruct this intern is to front-load your description of the problem with as much detail as possible to reduce ambiguity.
many many teams are actively building SOTA systems to do this in ways previously unimagined. you can enqueue tasks and do whatever you want. I gotta say as a current gen LLM programmer person, I can completely appreciate how bad they are now - I recently tweeted about how I "swore off" AI tools but like... there are many ways to bootstrap very powerful software or ML systems around or inside these existing models that can blow away existing commercial implementations in surprising ways
I think this is the main issue with these tools... what people are expecting of them.
We have swallowed the pill that LLMs are supposed to be AGI and all that mumbo jumbo, when they are just great tools and as such one needs to learn to use the tool the way it works and make the best of it, nobody is trying to hammer a nail with a broom and blaming the broom for not being a hammer...
> A good intern will ask clarifying questions, tell me “I don’t know”
Your expectations are bigger than mine
(Though some will get stuck in "clarifying questions" and helplessness and not proceed either)
Makes me wonder if "I don't know" could be added to LLM: whenever an activation has no clear winner value (layman here), couldn't this indicate low response quality?
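A rough sketch of that idea, assuming you can get at the model's raw next-token scores (logits): a flat probability distribution has high entropy, i.e. no clear winner. The function names and the 0.5 threshold are made up for illustration, and token-level uncertainty is not the same thing as factual uncertainty, so this is a toy rather than a hallucination detector.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by its maximum (log n), so the result lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)

def looks_unsure(logits, threshold=0.5):
    """Hypothetical rule: flag the step when the distribution is close to flat."""
    return normalized_entropy(softmax(logits)) > threshold

confident = [10.0, 0.0, 0.0, 0.0]  # one candidate dominates
flat = [1.0, 1.0, 1.0, 1.0]        # no clear winner
print(looks_unsure(confident), looks_unsure(flat))
```

A model can be very "sure" about the next token of a made-up function name, which is one reason this signal alone would probably not be enough.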
They've explicitly been trained/system-prompted to act that way. Because that's what the marketing teams at these AI companies want to sell.
It's easy to override this though by asking the LLM to act as if it were less-confident, more hesitant, paranoid etc. You'll be fighting uphill against the alignment(marketing) team the whole time though, so ymmv.
> With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive at a common understanding.
With interns you absolutely do need to worry about how good your prompting is! You need to give them specific requirements, training, documentation, give them full access to the code base... 'prompting' an intern is called 'management'.
Is this a dataset issue more than an LLM issue?
As in: do we just need to add 1M examples where the response is to ask for clarification / more info?
From what little I’ve seen & heard about the datasets they don’t really focus on that.
(Though enough smart people & $$$ have been thrown at this to make me suspect it’s not the data ;)
Really it just does what you tell it to. Have you tried telling it “ask me clarifying questions about all the APIs you need to solve this problem”?
Huge contrast to human interns who aren’t experienced or smart enough to ask the right questions in the first place, and/or have sentimental reasons for not doing so.
> have no idea whether the LLM understood what I’m asking
That's easy. The answer is it doesn't. It has no understanding of anything it does.
> if it’s able to do it
This is the hard part.
A lot of interns are overconfident though
Can I have some of those sorts of interns?
> It knows English at or above a level equal to most fluent speakers, and it also can produce output that is not just a likely output, but is a logical output
This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident when elementary understanding of English and very little logic show that answer to be obviously wrong.
https://x.com/colin_fraser/status/1834336440819614036
Many of my PhD and post doc colleagues who emigrated from Korea, China and India who didn’t have English as the medium of instruction would struggle with this question. They only recover when you give them a hint. They’re some of the smartest people in general. If you stop trying to stump these models with trick questions and ask them straightforward reasoning questions, they are extremely performant (o1 is definitely a step up though not revolutionary in my testing).
This illustrates a different point. This is a variation on a well known riddle that definitely comes up in the training corpus many times. In the original riddle a father and his son die in the car accident and the idea of the original riddle is that people will be confused how the boy can be the doctor's son if the boy's father just died, not realizing that women can be doctors too and so the doctor is the boy's mother. The original riddle is aimed to highlight people's gender stereotype assumptions.
Now, since the model was trained on this, it immediately recognizes the riddle and answers according to the much more common variant.
I agree that this is a limitation and a weakness. But it's important to understand that the model knows the original riddle well, so this is highlighting a problem with rote memorization/retrieval in LLMs. But this (tricky twists in well-known riddles that are in the corpus) is a separate thing from answering novel questions. It can also be seen as a form of hypercorrection.
1. It didn't insist anything. It got the semi-correct answer when I tried [1]; note it's a preview model, and it's not a perfect product.
(a) Sometimes things are useful even when imperfect e.g. search engines.
(b) People make reasoning mistakes too, and I make dumb ones of the sort presented all the time despite being fluent in English; we deal with it!
I'm not sure why there's an expectation that the model is perfect when the source data - human output - is not perfect. In my day-to-day work and non-work conversations it's a dialogue - a back and forth until we figure things out. I've never known anybody to get everything perfectly correct the first time, it's so puzzling when I read people complaining that LLMs should somehow be different.
2. There is a recent trend where sex/gender/pronouns are not aligned and the output correctly identifies this particular gotcha.
[1] I say semi-correct because it states the doctor is the "biological" father, which is an uncorroborated statement. https://chatgpt.com/share/66e3f04e-cd98-8008-aaf9-9ca933892f...
Reminds me of a trick question about Schrödinger's cat.
“I’ve put a dead cat in a box with a poison and an isotope that will trigger the poison at a random point in time. Right now, is the cat dead or alive?”
The answer is that the cat is dead, because it was dead to begin with. Understanding this doesn’t mean that you are good at deductive reasoning. It just means that I didn’t manage to trick you. Same goes for an LLM.
What I'm not able to comprehend is why people are not seeing the answer as brilliant!
Any ordinary mortal (like me) would have jumped to the conclusion that the answer is "Father" and would have walked away patting myself on the back, without realising that I was biased by statistics.
Whereas o1, at the very outset, smelled out that it is a riddle - why would anyone ask such a question out of the blue? So, it started its chain of thought with "Interpreting the riddle" (smart!).
In my book that is the difference between me and people who are very smart and are generally able to navigate the world better (cracking interviews or navigating internal politics in a corporate).
I'm noticing a strange common theme in all the riddles it's being asked and getting wrong.
They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's tautology, you would usually say "a mother and her son...".
I think it may answer correctly if you start off asking "Please solve the below riddle:"
There was another example yesterday which it solved correctly after this addition. (In that case the points of view were all mixed up; it only worked as a riddle.)
Keep in mind that the system always chooses randomly so there is always a possibility it commits to the wrong output.
I don't know why OpenAI won't allow determinism but it doesn't, even with temperature set to zero
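For anyone unfamiliar with what temperature does at the sampling step, here is a minimal sketch. It is a textbook-style illustration, not OpenAI's implementation; `sample_token` is a made-up name, and treating temperature zero as plain argmax is an assumption about how that limit is usually handled.

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Pick an index from logits; temperature <= 0 is treated as plain argmax."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1) the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    # random.choices normalizes the weights itself.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.0, 3.0, 2.0]
greedy = [sample_token(logits, 0) for _ in range(5)]
rng = random.Random(0)
sampled = [sample_token(logits, 1.0, rng) for _ in range(200)]
print(greedy, sorted(set(sampled)))
```

In this sketch the temperature-zero path has no randomness at all, so any run-to-run variation seen at temperature zero would have to come from somewhere other than the sampling step itself.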
what's weird is it gets it right when I try it.
https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
The reason why that question is a famous question is that _many humans get it wrong_.
> The failure is in how you're using it.
People, for the most part, know what they know and don't know. I am not uncertain that the distance between the earth and the sun varies, but I'm certain that I don't know the distance from the earth to the sun, at least not with better precision than about a light week.
This is going to have to be fixed somehow to progress past where we are now with LLMs. Maybe expecting an LLM to have this capability is wrong, perhaps it can never have this capability, but expecting this capability is not wrong, and LLM vendors have somewhat implied that their models have this capability by saying they won't hallucinate, or that they have reduced hallucinations.
> the distance from the earth to the sun, at least not with better precision than about a light week
The sun is eight light minutes away.
Empirically, they have reduced hallucinations. Where do OpenAI / Anthropic claim that their models won't hallucinate?
> Treat it as a naive but intelligent intern.
You are falling into the trap that everyone does. In anthropomorphising it. It doesn't understand anything you say. It just statistically knows what a likely response would be.
Treat it as text completion and you can get more accurate answers.
> You are falling into the trap that everyone does. In anthropomorphising it. It doesn't understand anything you say.
And an intern does?
Anthropomorphising LLMs isn't entirely incorrect: they're trained to complete text like a human would, in a completely general setting, so by anthropomorphising them you're aligning your expectations with the models' training goals.
Oh no, I'm well aware that it's a big file full of numbers. But when you chat with it, you interact with it as though it were a person so you are necessarily anthropomorphizing it, and so you get to pick the style of the interaction.
(In truth, I actually treat it in my mind like it's the Enterprise computer and I'm Beverly Crusher in "Remember Me")
> Treat it as a naive but intelligent intern.
That's the crux of the problem. Why and who would treat it as an intern? It might cost you more in explaining and dealing with it than not using it.
The purpose of an intern is to grow the intern. If this intern is static and will always be at the same level, why bother? If you had to feed and prep it every time, you might as well hire a senior.
ive been doing exactly this for bout a year now. feed it words data, give it a task. get better words back.
i sneak in a benchmark opening of data every time i start a new chat - so right off the bat i can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as i can just terminate and try starting another chat.
chatgpt is fickle daily. most days its on point. some days its wearing a bicycle helmet and licking windows. kinda sucks i cant just zone out and daydream while working. gotta be checking replies for when the wheels fall off the convo.
> i sneak in a benchmark opening of data every time i start a new chat - so right off the bat i can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as i can just terminate and try starting another chat.
I don't think it works like that...
And how much data can you give it?
I'm not up to date with these things because I haven't found them useful. But with what you said, and previous limitations in how much data they can retain essentially makes them pretty darn useless for that task.
Great learning tool on common subjects you don't know, such as learning a new programming-language. Also great for inspiration etc. But that's pretty much it?
Don't get me wrong, that is mindblowingly impressive but at the same time, for the tasks in front of me it has just been a distracting toy wasting my time.
>And how much data can you give it?
Well, theoretically you can give it up to the context size minus 4k tokens, because the maximum it can output is 4k. In practice, though, its ability to effectively recall information in the prompt drops off. Some people have studied this a bit - here's one such person: https://gritdaily.com/impact-prompt-length-llm-performance/
> And how much data can you give it?
128,000 tokens, which is about the same as a decent sized book.
Their other models can also be fine-tuned, which is kinda unbounded but also has scaling issues so presumably "a significant percentage of the training set" before diminishing returns.
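Putting the two figures quoted in this subthread together (a 128,000-token window and roughly 4k tokens reserved for output), the arithmetic looks like the sketch below. The helper names are made up, and the 4-characters-per-token ratio is only a common rule of thumb for English text, not something a real tokenizer guarantees.

```python
def input_budget(context_window, max_output_tokens):
    """Tokens left for the prompt once room for the reply is reserved."""
    return max(context_window - max_output_tokens, 0)

def rough_token_count(text, chars_per_token=4):
    """Very rough estimate (assumed ratio); real tokenizers vary by language and content."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits(text, context_window=128_000, max_output_tokens=4_000):
    """Does the text plausibly fit in the prompt side of the window?"""
    return rough_token_count(text) <= input_budget(context_window, max_output_tokens)

print(input_budget(128_000, 4_000))
```

Fitting inside the window says nothing about how well the model recalls what is in it, which is the drop-off mentioned above.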
It is great for proof-reading text if you are not a native English speaker. Things like removing passive voice. Just give it your text and you get a corrected version out.
Use a cli tool to automate this from the cli. Ollama for local models, llm for openai.
People never talk about Gemini, and frankly its output is often the worst of SOTA models, but its 2M context window is insane.
You can drop a few textbooks into the context window before you start asking questions. This dramatically improves output quality, however inference does take much much longer at large context lengths.
Except that it sometimes does do those tasks well. The danger in an LLM isn't that it sometimes hallucinates, the danger is that you need to be sufficiently competent to know when it hallucinates in order to fully take advantage of it, otherwise you have to fall back to double checking every single thing it tells you.
> On the other hand, if you were to paste the entire documentation set to a tool it has never seen and ask it to use the tool in a way to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.
There's not much evidence of that. It only marginally improved on instruction following (see livebench.ai) and its score as a swe-bench agent is barely above gpt-4o (model card).
It gets really hard problems better, but it's unclear that matters all that much.
> A lot of people use LLMs as a search engine.
Except this is where LLMs are so powerful. A sort of reasoning search engine. They memorized the entire Internet and can pattern match it to my query.
> The magic is that _it knows English_.
I couldn't agree more, this is exactly the strength of LLMs that we should focus on. If you can make your problem fit into this paradigm, LLMs work fantastic. Hallucinations come from that massive "lossy compressed database", but you should consider that part as more like the background noise that taught the model to speak English, and the syntax of programming languages, instead of the source of the knowledge to respond with. Stop anthropomorphizing LLMs, play to their strengths instead.
In other words it might hallucinate an API but it will rarely, if ever, make a syntax error. Once you realize that, it becomes a much more useful tool.
It doesn't know anything. Stop anthropomorphizing the model. It's predictive text and no the brain isn't also predictive text.
> Treat it as a naive but intelligent intern.
I've found an amazing amount of success with a three step prompting method that appears to create incredibly deep subject matter experts who then collaborate with the user directly.
1) Tell the LLM that it is a method actor, 2) Tell the method actor they are playing the role of a subject matter expert, 3) At each step, 1 and 2, use the technical language of that type of expert; method actors have their own technical terminology, use it when describing the characteristics of the method actor, and likewise use the scientific/programming/whatever technical jargon of the subject matter expert your method actor is playing.
Then, in the system prompt or whatever logical wrapper the LLM operates through for the user, instruct the "method actor" like you are the film director trying to get your subject matter expert performance out of them.
I offer this because I've found it works very well. It's all about crafting the context in which the LLM operates, and this appears to cause the subject matter expert to be deeper, more useful, smarter.
This is demonstrably wrong, because you can just add "is this real" to a response and it generally knows if it made it up or not. Not every time, but I find it works 95% of the time. Given that, this is exactly a step I'd hope an advanced model was doing behind the scenes.
> Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.
Well, I am a naive but intelligent intern (well, senior developer). So in this framing, the LLM can’t do more than I can already do by myself, and thus far it’s very hit or miss if I actually save time, having to provide all the context and requirements, and having to double-check the results.
With interns, this at least improves over time, as they become more knowledgeable, more familiar with the context, and become more autonomous and dependable.
Language-related tasks are indeed the most practical. I often use it to brainstorm how to name things.
I've recently started using an LLM to choose the best release of shows using data scraped from several trackers. I give it hard requirements and flexible preferences. It's not that I couldn't do this, it's that I don't want to do this on the scale of multiple thousand shows. The "magic" here is that releases don't all follow the same naming conventions, they're an unstructured dump of details. The LLM is simultaneously extracting the important details, and flexibly deciding the closest match to my request. The prompt is maybe two paragraphs and took me an hour to hone.
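The commenter's actual prompt isn't shown, so the following is only a guess at the general shape: hard requirements and soft preferences kept apart, the messy release names passed through untouched, and a constrained answer format so the reply can be checked by code. Every name and string here is hypothetical.

```python
def build_release_prompt(show, hard_requirements, preferences, release_names):
    """Assemble a prompt that separates must-haves from nice-to-haves."""
    lines = [
        f"Pick the best release of '{show}' from the candidates below.",
        "",
        "Hard requirements (reject any candidate that fails one):",
    ]
    lines += [f"- {r}" for r in hard_requirements]
    lines += ["", "Preferences (most important first, all optional):"]
    lines += [f"- {p}" for p in preferences]
    lines += ["", "Candidates:"]
    lines += [f"{i}. {name}" for i, name in enumerate(release_names, start=1)]
    lines += ["", "Answer with the candidate number only, or 0 if none qualifies."]
    return "\n".join(lines)

def parse_choice(reply, n_candidates):
    """Return the chosen 1-based index, 0 for 'none', or None if the reply is unusable."""
    parts = reply.strip().split()
    token = parts[0].rstrip(".") if parts else ""
    if not (token.isascii() and token.isdigit()):
        return None
    choice = int(token)
    return choice if 0 <= choice <= n_candidates else None

prompt = build_release_prompt(
    "Example Show",
    ["1080p or higher"],
    ["smaller file size"],
    ["Example.Show.S01.1080p.WEB", "example show s01 720p hdtv"],
)
print(prompt)
```

Validating the reply in code (and retrying or skipping on `None`) matters at the scale of thousands of shows, since the model's answer is the one part that can't be trusted blindly.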
Ooh yeah it's great for bouncing ideas on what to name things off of. You can give it something's function and a backstory and it'll come up with a list of somethings for you to pick and choose from.
> The failure is in how you're using it
This isn’t true because, as you can read in the first sentence of the post you’re responding to, GP did give it a task like you recommend here
> Provide it data, give it a task, and let it surprise you with its output.
And it fails the task. Specifically it fails it by hallucinating important parts of accomplishing it.
> hallucinates non-existing libraries and functions
This post only makes sense if your advice to “let it surprise you with its output” is mandatory, like you’re using it wrong if you do not make yourself feel impressed by it.
Yeah except. I’m priming it with things like curated docs from bevy latest, using the tricks, and testing context limits.
It’s still changing things to be several versions old from its innate kb pattern-matching or whatever you want to call it. I find that pretty disappointing.
Just like copilot and gpt4, it’s changing `add_systems(Startup, system)` to `add_startup_system(system.system())` and other pre-schedule/fanciful APIs, things it should have in context.
I agree with your approach to LLMs, but unfortunately “it’s still doing that thing.”
PS: and by the time I’d done those experiments, I ran out of preview, resets 5 days from now. D’oh
This model is, thankfully, far more receptive to longer and more elaborate explanations as input. The rest (4, 4o, Sonnet) seem to struggle with comprehensive explanations; this one seems to perform better with spec-like input.
> A lot of people use LLMs as a search engine.
GPT-4o is wonderful as a search engine if you tell it to google things before answering (even though it uses bing).
> Treat it as a naive but intelligent intern
So mostly useless then?
Interns are cheaper than o1-preview
Not for long.
Sorry, but that does not seem to be the case. A friend of mine who runs a long context benchmark on understanding novels [1] just ran an eval and o1 seemed to improve by 2.9% over GPT-4o (the result isn't on the website yet). It's great that there is an improvement, but it isn't drastic by any stretch. Additionally, since we cannot see the raw reasoning it's basing the answers off of, it's hard to attribute this increase to their complicated approach as opposed to just cleaner higher quality data.
EDIT: Note this was run over a dataset of short stories rather than the novels since the API errors out with very long contexts like novels.
[1]: https://novelchallenge.github.io/
It's a good rebranding. It was getting ridiculous 3.5, 4, 4.5,
This is a great description.
Intelligent?
Just ask ChatGPT
How many Rs are in strawberry?
https://chatgpt.com/share/66e3f9e1-2cb4-8009-83ce-090068b163...
Keep up, that was last week's gotcha, with the old model.
Perfectly well put! We should change the name from "AI" (which it is not) to something like, "lossy compressed databases".
If they use this name, they just say that they violate the copyright of all training data.
That abbreviates to LCD. If we could make it LSD somehow, that would help to explain the hallucinations.
Yes, this only helps multi-step reasoning. The model still has problems with general knowledge and deep facts.
There's no way you can "reason" a correct answer to "list the tracklisting of some obscure 1991 demo by a band not on Wikipedia." You either know or you don't.
I usually test new models with questions like "what are the levels in [semi-famous PC game from the 90s]?" The release version of GPT-4 could get about 75% correct. o1-preview gets about half correct. o1-mini gets 0% correct.
Fair enough. The GPT-4 line aren't meant to be search engines or encyclopedias. This is still a useful update though.
o1-mini is a small model (knows a lot less about the world) and is tuned for reasoning through symbolic problems (maths, programming, chemistry etc.).
You're using a calculator as a search engine.
It's actually much worse than that and you're inadvertently down playing how bad it is.
It doesn't even know mildly obscure facts that are on the internet.
For example last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statement, and threw out some convincing looking code.
You can't, it's impossible. It was completely wrong. When I told it this, it told me I was right, and proceeded to give me code that was even more wrong.
This is an obscure, but well documented, part of the spec.
So it's not about facts that aren't on the internet, it's just bad at facts fullstop.
What it's good at is facts the internet agrees on. Unless the internet is wrong. Which is not always a good thing with the way the language it uses to speak is so confident.
If you want to fuck with AI models, ask a bunch of code questions on Reddit, GitHub and SO with example code saying 'can I do X'. The answer is no, but chatgpt/codepilot/etc. will start spewing out that nonsense as if it's fact.
As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models to believe your 'facts'.
I wonder though, is the documentation only referenced a few places on the Internet, and are there also many forums with people pasting "Why isn't this working?" problems?
If there are a lot of people pasting broken code, now the LLM has all these examples of broken code, which it doesn't know are that, and only a couple of references to documentation. Worse, a well trained LLM may realise that specs change, and that even documentation may not be considered 100% accurate (for it is older, out of date).
After all, how many times have you had something updated, an API, a language, a piece of software, but the docs weren't updated? Happens all the time, sadly.
So it may believe newer examples of code, such as the aforementioned pasted code, might be more correct than the docs.
Also, if people keep trying to solve the same issue again, and keep pasting those examples again, well...
I guess my point here is, hallucinations come from multi-faceted issues, one being "wrong examples are more plentiful than correct". Or even "there's just a lot of wrong examples".
It's not always the right tool depending on the task. IMO using LLMs is also a skill, much like learning how to Google stuff.
E.g. apparently C# generics isn’t something it's good at. Interesting, so don’t use it for that, apparently it's the wrong tool. In contrast, it's amazing at C++ generics, and thus speeds up my productivity. So do use it for that!
> For example last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statement, and threw out some convincing looking code.
Just use it on an instance instead
> As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models to believe your 'facts'.
This is kinda crazy to think about.
I've had the opposite experience with some coding samples. After reading Nick Carlini's post, I've gotten into the habit of powering through coding problems with GPT (where previously I'd just laugh and immediately give up) by just presenting it the errors in its code and asking it to fix them. o1 seems to be effectively screening for some of those errors (I assume it's just some, but I've noticed that the o1 things I've done haven't had obvious dumb errors like missing imports, and all my 4o attempts have).
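The "present it the errors and ask it to fix them" habit can be sketched as a loop. This is an illustration only: `ask_model` stands in for whatever chat call you use (here it is a fake that only knows one fix), and running model-written code with `exec` is unsafe outside a sandbox.

```python
import traceback

def run_and_capture(source):
    """Execute source in a fresh namespace; return None on success, else the traceback text."""
    try:
        # WARNING: exec on untrusted code is unsafe; use a sandbox in real use.
        exec(compile(source, "<candidate>", "exec"), {"__name__": "__candidate__"})
        return None
    except Exception:
        return traceback.format_exc()

def repair_loop(source, ask_model, max_rounds=3):
    """Feed each error back to ask_model until the code runs or the rounds run out."""
    for _ in range(max_rounds):
        error = run_and_capture(source)
        if error is None:
            return source, True
        source = ask_model(source, error)
    return source, run_and_capture(source) is None

def fake_model(source, error):
    """Hypothetical stand-in for an LLM call: it only knows how to add a missing import."""
    if "NameError" in error and "math" in error:
        return "import math\n" + source
    return source

fixed, ok = repair_loop("x = math.sqrt(16)", fake_model)
print(ok)
```

A loop like this only catches errors that actually raise, such as the missing imports mentioned above; code that runs but does the wrong thing sails straight through.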
My experience is likely colored by the fact that I tend to turn to LLMs for problems I have trouble solving by myself. I typically don't use them for the low-hanging fruits.
That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.
I use LLMs for three things:
* To catch passive voice and nominalizations in my writing.
* To convert Linux kernel subsystems into Python so I can quickly understand them (I'm a C programmer but everyone reads Python faster).
* To write dumb programs using languages and libraries I haven't used much before; for instance, I'm an ActiveRecord person and needed to do some SQLAlchemy stuff today, and GPT 4o (and o1) kept me away from the SQLAlchemy documentation.
OpenAI talks about o1 going head to head with PhDs. I could care less. But for the specific problem we're talking about on this subthread: o1 seems materially better.
LLMs are not for expanding the sphere of human knowledge, but for speeding up auto-correct of higher order processing to help you more quickly reach the shell of the sphere and make progress with your own mind :)
It's funny because I'm very happy with the productivity boost from LLMs, but I use them in a way that is pretty much diametrically opposite to yours.
I can't think of many situations where I would use them for a problem that I tried to solve and failed - not only because they would probably fail, but in many cases it would even be difficult to know that it failed.
I use it for things that are not hard, can be solved by someone without a specialized degree that took the effort to learn some knowledge or skill, but would take too much work to do. And there are a lot of those, even in my highly specialized job.
LLMs: When the code can be made by an enthusiastic new intern with web-search and copy-paste skills, and no ability to improve under mentorship. :p
Tangentially related, a comic on them: https://existentialcomics.com/comic/557
> That's the frustrating thing. LLMs don't materially reduce the set of problems where I'm running against a wall or have trouble finding information.
As you step outside regular Stack Overflow questions for top-3 languages, you run into limitations of these predictive models.
There's no "reasoning" behind them. They are still, largely, bullshit machines.
>The o1-preview model still hallucinates non-existing libraries and functions for me, and is quickly wrong about facts that aren't well-represented on the web. It's the usual string of "You're absolutely correct, and I apologize for the oversight in my previous response. [Let me make another guess.]"
After that you switch to Claude Sonnet, and after some time it also gets stuck.
The problem with LLMs is that they are not aware of libraries.
I've fed them the library version (using requirements.txt), the Python version I am using, etc.
They still make mistakes and try to use methods which do not exist.
Where to go from here? At this point I manually pull the library version I am using, go to its docs, and generate a page which uses this library correctly (then I feed that example into the LLM).
This approach works. Now I just need to automate it so that I don't have to manually find the library and create a specific example which uses the methods I need in my code!
Directly feeding the docs isn't working well either.
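For what it's worth, part of that manual step can probably be automated with the standard library alone. A minimal sketch, assuming the library is importable in the same environment: `describe_module` is a hypothetical helper name, and `json` is only a stand-in for whatever library you actually use. It lists callables that really exist in the installed version, so the text can be pasted into the prompt as ground truth.

```python
import importlib
import inspect
from importlib import metadata
from typing import Optional


def describe_module(module_name: str, dist_name: Optional[str] = None) -> str:
    """Return a plain-text list of the public callables a module really exposes."""
    module = importlib.import_module(module_name)
    try:
        # The distribution name can differ from the import name (e.g. PyYAML vs yaml).
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        version = "unknown (stdlib module or no matching distribution)"

    lines = [
        f"Module: {module_name}",
        f"Installed version: {version}",
        "Public callables:",
    ]
    for name, obj in sorted(vars(module).items()):
        if name.startswith("_") or not callable(obj):
            continue
        try:
            signature = str(inspect.signature(obj))
        except (TypeError, ValueError):
            signature = "(...)"  # some builtins do not expose a signature
        lines.append(f"  {name}{signature}")
    return "\n".join(lines)


# `json` is used here only because it is always installed.
print(describe_module("json"))
```

This only covers top-level functions and classes, not methods on classes or submodules, so it is a starting point rather than a replacement for a hand-written example.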
One trick that people are using, when using Cursor and specifically Cursor's compose function, is to dump library docs into a text file in your repo, and then @ that doc file when you're asking it to do something involving that library.
That seems to eliminate a lot of the issues, though it's not a seamless experience, and it adds another step of having to put the library docs in a text file.
Alternatively, Cursor can fetch a web page, so if there's a good page of docs you can bring that in by @-ing the web page.
Eventually, I could imagine LLMs automatically creating library text doc files to include when the LLM is using them to avoid some of these problems.
It could also solve some of the issues of their shaky understanding of newer frameworks like SvelteKit.
Cursor also has the shadow workspace feature [1] that is supposed to send feedback from linting and language servers to the LLM. I'm not sure whether it's enabled in compose yet though.
[1] https://www.cursor.com/blog/shadow-workspace
My point of view: this is a real advancement. I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
I honestly can’t believe this is the hyped up “strawberry” everyone was claiming is pretty much AGI. Senior employees leaving due to its powers being so extreme
I’m in the “probabilistic token generators aren’t intelligence” camp so I don’t actually believe in AGI, but I’ll be honest the never ending rumors / chatter almost got to me
Remember, this is the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for
The whole safety aspect of AI has this nice property that it also functions as a marketing tool to make the technology seem "so powerful it's dangerous". "If it's so dangerous it must be good".
> probabilistic token generators aren’t intelligence
Maybe this has been extensively discussed before, but since I've lived under a rock: which parts of intelligence do you think are not representable as conditional probability distributions?
> which parts of intelligence do you think are not representable as conditional probability distributions
Maybe I'm wrong here but a lot of our brilliance comes from acting against the statistical consensus. What I mean is, Nicolaus Copernicus probably consumed a lot of knowledge on how the Earth is the center of the universe etc. and probably nothing contradicting that notion. Can an LLM do that?
"Senior employees leaving due to its powers being so extreme"
This never happened. No one said it happened.
"the model some media outlet reported recently that is so powerful OAI is considering charging $2k/month for"
The Information reported someone at a meeting suggested this for future models, not specifically Strawberry, and that it would probably not actually be that high.
Elon Musk and Ilya Sutskever Have Warned About OpenAI’s ‘Strawberry’ Jul 15, 2024 — Sutskever himself had reportedly begun to worry about the project's technology, as did OpenAI employees working on A.I. safety at the time.
https://observer.com/2024/07/openai-employees-concerns-straw...
And I’m ignoring the hundreds of Reddit articles speculating every time someone at OAI leaves
And of course that $2000 article was spread by every other media outlet like wildfire
I know I’m partially to blame for believing the hype, this is pretty obviously no better at stating facts or good code than what we’ve known for the past year
I mean, considering how many tokens their example prompt consumed, I wouldn't be surprised if it costs ~$2k/month/user to run
I think this model is a precursor model that is designed for agentic behavior. I expect very soon OpenAI to allow this model tool use that will allow it to verify its code creations and whatever else it claims through use of various tools like a search engine, a virtual machine instance with code execution capabilities, api calling and other advanced tool use.
Stupid question: Why can't models be trained in such a way to rate the authoritativeness of inputs? As a human, I contain a lot of bad information, but I'm aware of the source. I trust my physics textbook over something my nephew thinks.
o1-preview != o1.
In public coding AI comparison tests, results showed 4o scoring around 35%, o1-preview scoring ~50% and o1 scoring ~85%.
o1 is not yet released, but has been run through many comparison tests with public results posted.
Good reminder. Why did OpenAI talk about o1 and not release it? o1-preview must be a stripped down version: cheaper to run somehow?
Don't forget about o1-mini. It seems better than o1-preview for problems that fit it (don't require so much real world knowledge).
gpt-4 base was never released and this will be the same thing
I don’t really see this as a massive problem. It’s code. If it doesn’t run, you ask it to reconsider, give some more info if necessary, and it usually gets it right.
The system doesn’t become useless if it takes 2 tries instead of 1 to get it right
Still saves an incredible amount of time vs doing it yourself
> It’s code. If it doesn’t run, you ask it to reconsider
It is perfectly possible to have code that runs without errors but gives a wrong answer. And you may not even realise it’s wrong until it bites you in production.
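A made-up illustration of that failure mode (not taken from any real LLM output): the function below never raises, and it even returns the right answer when the input happens to be sorted.

```python
def median(values):
    """Intended to return the median. Bug: the input is never sorted."""
    middle = len(values) // 2
    if len(values) % 2:
        return values[middle]
    return (values[middle - 1] + values[middle]) / 2


print(median([1, 2, 3]))  # 2, correct only because the input is already sorted
print(median([3, 1, 2]))  # 1, runs without error but the real median is 2
```

"Does it run?" would pass both calls. Only a test with unsorted input catches it.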
While I agree, I saw it abused in this way a lot, in the sense that the code did what it was supposed to do in a given scenario but was obviously flawed in various ways, so it was just sitting there waiting for a disaster.
I haven't found a single instance where it saved me any significant amount of time. In all cases I still had to rewrite the whole thing myself, or abandon the endeavor.
And a few times the amount of time I spent trying to coax a correct answer out of AI trumped any potential savings I could've had
To the extent we've now got the output of the underlying model wrapped in an agent that can evaluate that output, I'd expect it to be able to detect its own hallucinations some of the time and therefore provide an alternate answer.
It's like when an LLM gives you a wrong answer and all it takes is "are you sure?" to get it to generate a different answer.
Of course the underlying problem of the model not knowing what it knows or doesn't know persists, so giving it the ability to reflect on what it just blurted out isn't always going to help. It seems the next step is for them to integrate RAG and tool use into this agentic wrapper, which may help in some cases.
> The o1-preview model still hallucinates non-existing libraries and functions for me
Oooh... oohhh!! I just had a thought: By now we're all familiar with the strict JSON output mode capability of these LLMs. That's just a matter of filtering the token probability vector by the output grammar. Only valid tokens are allowed, which guarantees that the output matches the grammar.
But... why just data grammars? Why not the equivalent of "tab-complete"? I wonder how hard it would be to hook up the Language Server Protocol (LSP) as seen in Visual Studio Code to an AI and have it only emit syntactically valid code! No more hallucinated functions!
I mean, sure, the semantics can still be incorrect, but not the syntax.
This would be a big undertaking to get working for just one language+package-manager combination, but would be beautiful if it worked.
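To make the filtering idea concrete, here is a toy sketch of the masking step. Everything in it is assumed for illustration: the scores are invented, the tokens are whole identifiers rather than real subword tokens, and the `allowed` set stands in for what a language server's completion list might return. `constrained_pick` is a hypothetical name, not any vendor's API.

```python
import math


def constrained_pick(logits, allowed):
    """Drop tokens outside `allowed`, renormalise, and return the likeliest survivor."""
    masked = {tok: score for tok, score in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("the grammar allows no token the model can emit")
    # Softmax over the surviving tokens only (subtract the max for stability).
    top = max(masked.values())
    exps = {tok: math.exp(score - top) for tok, score in masked.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Invented scores for the token following "json."; "parse" is the hallucination.
logits = {"parse": 3.1, "loads": 2.4, "dumps": 0.7, "stringify": 1.9}
# Assumed completion list for "json.", as a language server might report it.
allowed = {"load", "loads", "dump", "dumps"}

token, probs = constrained_pick(logits, allowed)
print(token)  # loads
```

Unconstrained, the model would have picked `parse`. With the mask, the best it can do is the highest-scoring name that exists. The hard part in practice is that real tokens are fragments of identifiers, so the mask has to be computed against prefixes of valid completions at every step.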
I still fail to see the overall problem. Hallucinating non-existing libraries is a good programming practice in many cases: you express your solution in terms of an imaginary API that is convenient for you, and then you replace your API with real functions, and/or implement it in terms of real functions.
One of the biggest problems with this generation of AI is how people conflate the natural language abilities and the access to what it knows.
Both abilities are powerful, but they are very different powers.
Just pass it a link to a GitHub issue and ask for a response, or a webpage to summarize, and you will see the beautiful hallucinations it comes up with, as the model is not web browsing yet.
You should not be asking it questions that require it to already know detailed information about apis and libraries. It is not good at that, and it will never be good at that. If you need it to write code that uses a particular library or api, include the relevant documentation and examples.
It's your right to dismiss it, if you want, but if you want to get some value out of it, you should play to its strengths and not look for things that it fails at as a gotcha.
The best one I got recently was after I pointed out that the method didn’t exist, it proposed another method and said “use this method if it exists” :D
Has anyone tried asking it to generate the libraries/functions that it's hallucinating and seeing if it can do so correctly? And then seeing if it can continue solving the original problem with the new libraries? It'd be absolutely fascinating if it turns out it could do this.
Not for libraries, but functions will sometimes get created if you work with an agent coding loop. If the tests are in the verification step, the code will typically be correct.
I sometimes give it snippets of code and omit helper functions if they seem obvious enough, and it adds its own implementation into the output.
Just ask it for things it has seen before on the internet and you're golden. Mixes of ideas, new ideas and precise and clear thinking; not so much.
It begs the question of whether we can supply a function to be called (e.g., one that compiles and runs code) to evaluate intermediate CoT results
It seems OpenAI has decided to keep the CoT results a secret. If they were to allow the model to call out to tools to help fill in the CoT steps, then this might reveal what the model is thinking - something they do not want the outside world to know about.
I could imagine OpenAI might allow their own vetted tools to be used, but perhaps it will be a while (if ever) before developers are allowed to hook up their own tools. The risks here are substantial. A model fine-tuned to run chain-of-thought that can answer graduate level physics problems at an expert level can probably figure out how to scam your grandma out of her savings too.
It's only a matter of time. When some other company releases the tool, they likely will too.
The answer is yes if you are willing to code it. OpenAI supports tool calls. Even if it didn't you could just make multiple calls to their API and submit the result of the code execution yourself.
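A rough sketch of the "submit the result yourself" variant, with no real API involved. `ask_model` is a placeholder for whatever call you make to the model, and `fake_model` is a stub I made up so the loop can be run as-is. The loop itself is just: run the code, and if it fails, send the error back.

```python
import subprocess
import sys


def run_python(source, timeout=10.0):
    """Run `source` in a fresh interpreter; return (succeeded, combined output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} seconds"
    return proc.returncode == 0, proc.stdout + proc.stderr


def solve_with_feedback(task, ask_model, max_attempts=3):
    """Ask for code, execute it, and feed any failure back into the next prompt."""
    prompt = task
    for _ in range(max_attempts):
        code = ask_model(prompt)
        ok, output = run_python(code)
        if ok:
            return code
        prompt = (
            f"{task}\n\nYour previous attempt:\n{code}\n\n"
            f"It failed with:\n{output}\nPlease fix it."
        )
    return None


def fake_model(prompt):
    """Stub standing in for a real model call: wrong first, right after feedback."""
    if "It failed with" in prompt:
        return "import json\nprint(json.dumps({'ok': True}))"
    return "import json\nprint(json.stringify({'ok': True}))"  # hallucinated function


result = solve_with_feedback("Print a JSON object.", fake_model)
print(result)
```

This only catches code that crashes, not code that runs and gives the wrong answer, and anything a model writes should be executed in a sandbox rather than directly on your machine.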
The intermediate CoT results aren't in the API.
That problem feels somewhat fundamental to saying that these things have any ability to reason at all.
> having no way to assess if what it conjures up from its weights is factual or not.
This comment makes no sense in the context of what an LLM is. To even say such a thing demonstrates a lack of understanding of the domain. What we are doing here is TEXT COMPLETION, no one EVER said anything about being accurate and "true". We are building models that can complete text, what did you think an LLM was, a "truth machine"?
I mean of course you're right, but then I question what's the usefulness?
I'm honestly confused as to why it is doing this and why it thinks I'm right when I tell it that it is incorrect.
I've tried asking it factual information, and it asserts that it's incorrect but it will definitely hallucinate questions like the above.
You'd think the reasoning would nail that and most of the chain-of-thought systems I've worked on would have fixed this by asking it if the resulting answer was correct.