Comment by MPSimmons

1 year ago

The failure is in how you're using it. I don't mean this as a personal attack, but more to shed light on what's happening.

A lot of people use LLMs as a search engine. It makes sense: it's basically a lossy compressed database of everything it's ever read, and it generates output that is statistically likely - to varying degrees depending on the temperature, as well as on which weights your prompt ends up activating.
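
As a rough illustration of what "statistically likely, to varying degrees" means, here's a minimal sketch of temperature-scaled sampling over toy next-token scores (the tokens and numbers are made up, not any real model's vocabulary):

    import math, random

    # Toy next-token scores (logits); a real model has on the order of 100k tokens.
    logits = {"Paris": 5.0, "London": 3.0, "banana": 0.5}

    def sample(logits, temperature=1.0):
        # Lower temperature sharpens the distribution toward the most likely token;
        # higher temperature flattens it, so unlikely tokens get picked more often.
        scaled = {tok: score / temperature for tok, score in logits.items()}
        z = sum(math.exp(v) for v in scaled.values())
        probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
        return random.choices(list(probs), weights=list(probs.values()))[0]

    print(sample(logits, temperature=0.2))  # almost always "Paris"
    print(sample(logits, temperature=2.0))  # "banana" shows up now and then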

The magic of LLMs, especially one like this that supposedly has advanced reasoning, isn't the existing knowledge in its weights. The magic is that _it knows english_. It knows english at or above the level of most fluent speakers, and it can produce output that is not just likely but logical. It's not _just_ a store of outputs; it's an engine that produces them.

Asking it about nuanced details in the corpus of data it has read won't give you good output unless it has read a lot of material on that specific topic.

On the other hand, if you were to paste in the entire documentation set for a tool it has never seen and ask it to use that tool to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.

Don't treat it as a database. Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.
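
To make that concrete, here's a minimal sketch of the "provide it data, give it a task" pattern using the openai Python SDK - the file name, model name and task are all placeholders, not a recommendation of any particular setup:

    from openai import OpenAI

    client = OpenAI()
    docs = open("obscure_tool_docs.md").read()  # hypothetical docs for a tool the model has never seen

    resp = client.chat.completions.create(
        model="gpt-4o",  # model name is an assumption
        messages=[
            {"role": "system", "content": "Answer using ONLY the documentation provided by the user."},
            {"role": "user", "content": f"Documentation:\n{docs}\n\nTask: write a config for this tool that exports results as CSV."},
        ],
    )
    print(resp.choices[0].message.content)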

> Treat it as a naive but intelligent intern

That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive at a common understanding. With an LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking and if it’s able to do it.

  • I feel like it almost always starts well, given the full picture, but then for non-trivial stuff it gets stuck towards the end. The longer the conversation goes, the more wheel-spinning occurs, and before you know it you have spent an hour chasing that last mile of connectivity.

    For complex questions, I now only use it to get the broad picture, and once the output is good enough to be a foundation, I build the rest of it myself. I have noticed that the net time spent using this approach still yields big savings over a) doing it all myself or b) pushing it to do the entire thing. I guess 80/20 etc.

    • This is the way.

      I've had this experience many times:

      - hey, can you write me a thing that can do "xyz"

      - sure, here's how we can do "xyz" (gets some small part of the error handling for xyz slightly wrong)

      - can you add onto this with "abc"

      - sure. in order to do "abc" we'll need to add "lmn" to our error handling. this also means that you need "ijk" and "qrs" too, and since "lmn" doesn't support "qrs" out of the box, we'll also need a design solution to bridge the two. Let me spend 600 more tokens sketching that out.

      - what if you just use the language's built-in feature here in "xyz"? doesn't that mean we can do it with just one line of code?

      - yes, you're absolutely right. I'm sorry for making this over complicated.

      If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff. Even one small error early in the chain propagates. That's why I end up very frequently restarting conversations in a new chat or re-writing my chat questions to remove bad stuff from the context. Without the ability to do that, it's nearly worthless. It's also why I think we'll be seeing absurdly, wildly wrong chains of thought coming out of o1. Because "thinking" for 20s may well cause it to just go totally off the rails half the time.

    • Yes, I’ve seen that too. One reason it will spin its wheels is because it “prefers” patterns in transcripts and will try to continue them. If it gets something wrong several times, it picks up on the “wrong answers” pattern.

      It’s better not to keep wrong answers in the transcript. Edit the question and try again, or maybe start a new chat.

  • 1000% this. LLMs can't say "I don't know" because they don't actually think. I can coach a junior to get better. LLMs will just act like they know what they are doing and give the wrong results to people who aren't practitioners. Good on OAI calling their model Strawberry because of Internet trolls. Reactive vs proactive.

    • I get a lot of value out of ChatGPT but I also, fairly frequently, run into issues here. The real danger zones are areas that lie at or just beyond the edges of my own knowledge in a particular area.

      I'd say that most of my work use of ChatGPT does in fact save me time but, every so often, ChatGPT can still bullshit convincingly enough to waste an hour or two for me.

      The balance is still in its favour, but you have to keep your wits about you when using it.

    • I ask ChatGPT whether it knows things all the time. But it almost never answers no.

      As an experiment I asked it if it knew how to solve an arbitrary PDE and it said yes.

      I then asked it if it could solve an arbitrary quintic and it said no.

      So I guess it can say it doesn't know if it can prove to itself it doesn't know.

    • The difference is that a junior costs $30-100/hr and will take 2 days to complete the task. The LLM will do it in 20 seconds and cost 3 cents.

    • The LLMs absolutely can and do say "I don't know"; I've seen it with both GPT-4 and LLaMA. They don't do it anywhere near as much as they should, yes - likely because their training data doesn't include many examples of that, proportionally - but they are by no means incapable of it.

    • This surprises me. I made a simple chat fed with PDFs using LangChain, and by default it said it didn't know if I asked questions outside of the corpus. Was it a simple matter of the confidence score getting too low?
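
      For what it's worth, that "I don't know" usually comes from the retrieval layer rather than the model itself: if no chunk of the PDFs scores above a similarity threshold, the wrapper refuses before the LLM ever gets a chance to improvise. A rough sketch of the idea (the embed and llm callables and the threshold are placeholders, not any particular framework's API):

          import numpy as np

          def cosine(a, b):
              return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

          def answer(question, chunks, embed, llm, threshold=0.75):
              # chunks: list of (text, embedding) pairs built from the PDFs beforehand.
              q = embed(question)
              best_text, best_score = max(((t, cosine(q, e)) for t, e in chunks), key=lambda x: x[1])
              if best_score < threshold:
                  return "I don't know - nothing in the documents matches this question."
              return llm(f"Answer using only this excerpt:\n{best_text}\n\nQuestion: {question}")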

  • > LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

    This is exactly why I’ve been objecting so much to the use of the term “hallucination” and maintain that “confabulation” is accurate. People who have spent enough time with acutely psychotic people, with people experiencing the effects of long-term alcohol-related brain damage, and with trying to tell computers what to do will understand why.

    • I don't know that "confabulation" is right either: it has a couple of other meanings beyond "a fabricated memory believed to be true" and, of course, the other issue is that LLMs don't believe anything. They'll backtrack on even correct information if challenged.

  • I’m starting to think this is an unsolvable problem with LLMs. The very act of “reasoning” requires one to know that they don’t know something.

    LLMs are giant word Plinko machines. A million monkeys on a million typewriters.

    LLMs are not interns. LLMs are assumption machines.

    None of the million monkeys or the collective million monkeys are “reasoning” or are capable of knowing.

    LLMs are a neat parlor trick and are super powerful, but are not on the path to AGI.

    LLMs will change the world, but only in the way that the printing press changed the world. They’re not interns, they’re just tools.

    • I think LLMs are definitely on the path to AGI in the same way that the ball bearing was on the path to the internal combustion engine. I think it's quite likely that LLMs will perform important functions within the system of an eventual AGI.

    • It probably depends on your problem space. In creative writing, I wonder if it's even perceptible whether the LLM is creating content at the boundaries of its knowledge base. But for programming or other falsifiable (and rapidly changing) disciplines it is noticeable and a problem.

      Maybe some evaluation of the sample size would be helpful? If the LLM has less than X samples of an input word or phrase it could include a cautionary note in its output, or even respond with some variant of “I don’t know”.

  • Have you ever worked with an intern? They have personalities and expectations that need to be managed. They get sick. They get tired. They want to punch you if you treat them like a 24-7 bird dog. It's so much easier to not let perfect be the enemy of the good and just rapid-fire ALL day at an LLM for anything and everything I need help with. You can also just not use the LLM. Interns need to be 'fed' work or the ROI ends upside down. Is an LLM as good as a top-tier intern? No, but with an LLM I can have 10 pretty good interns by opening 10 tabs.

    • The LLMs are getting better and better at a certain kind of task, but there's a subset of tasks for which I'd still much rather have any human than an LLM, today. Even for something simple, like "Find me the top 5 highest grossing movies of 2023", it will take a long time before I trust an LLM's answer without having a human intern verify the output.

    • I think listing off a set of pros and cons for interns and LLMs misses the point, they seem like categorically different kinds of intelligence.

  • > That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”.

    An intern that grew up in a different culture, then, where questioning your boss is frowned upon. The point is that the way to instruct this intern is to front-load your description of the problem with as much detail as possible to reduce ambiguity.

  • Many teams are actively building SOTA systems to do this in ways previously unimagined. You can enqueue tasks and do whatever you want. I gotta say, as a current-gen LLM programmer person, I can completely appreciate how bad they are now - I recently tweeted about how I "swore off" AI tools - but there are many ways to bootstrap very powerful software or ML systems around or inside these existing models that can blow away existing commercial implementations in surprising ways.

  • I think this is the main issue with these tools... what people are expecting of them.

    We have swallowed the pill that LLMs are supposed to be AGI and all that mumbo jumbo, when they are just great tools, and as such one needs to learn to use the tool the way it works and make the best of it. Nobody is trying to hammer a nail with a broom and blaming the broom for not being a hammer...

    • I completely agree.

      To me the discussion here reads a little like: “Hah. See? It can’t do everything!”. It makes me wonder if the goal is to convince each other that: yes, indeed, humans are not yet replaced.

      It’s next-token regression, of course it can’t truly introspect. That being said, LLMs are amazing tools and o1 is yet another incremental improvement, and I welcome it!

  • > A good intern will ask clarifying questions, tell me “I don’t know”

    Your expectations are bigger than mine

    (Though some will get stuck in "clarifying questions" and helplessness and not proceed either)

    • Indeed. My expectation of a good intern is to produce nothing I will put in production, but show aptitude worth hiring them for. It's a 10 week extended interview with lots of social events, team building, tech talks, presentations, etc.

      Which is why I've liked the LLM analogy of "unlimited free interns"... I just think some people read that the exact opposite way I do (not very useful).

  • They've explicitly been trained/system-prompted to act that way. Because that's what the marketing teams at these AI companies want to sell.

    It's easy to override this though by asking the LLM to act as if it were less-confident, more hesitant, paranoid etc. You'll be fighting uphill against the alignment(marketing) team the whole time though, so ymmv.
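
    As a crude sketch of that kind of override with the openai SDK - the wording, model name and question are assumptions, and as noted you're still fighting the tuning:

        from openai import OpenAI

        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o",  # model name is an assumption
            messages=[
                {"role": "system", "content": (
                    "You are a cautious assistant. If you are not confident in an answer, "
                    "say 'I don't know' or ask a clarifying question instead of guessing. "
                    "End every answer with a confidence rating: low, medium or high."
                )},
                {"role": "user", "content": "What does the --frobnicate flag of footool 3.2 do?"},  # made-up tool
            ],
        )
        print(resp.choices[0].message.content)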

  • > With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive at a common understanding.

    With interns you absolutely do need to worry about how good your prompting is! You need to give them specific requirements, training, documentation, give them full access to the code base... 'prompting' an intern is called 'management'.

    • This might be the best definition I will come across of what it means to be an "IT project manager".

  • Is this a dataset issue more than an LLM issue?

    As in: do we just need to add 1M examples where the response is to ask for clarification / more info?

    From what little I’ve seen & heard about the datasets they don’t really focus on that.

    (Though enough smart people & $$$ have been thrown at this to make me suspect it’s not the data ;)

  • Really it just does what you tell it to. Have you tried telling it “ask me clarifying questions about all the APIs you need to solve this problem”?

    Huge contrast to human interns who aren’t experienced or smart enough to ask the right questions in the first place, and/or have sentimental reasons for not doing so.

    • Sure, but to what end?

      The various ChatGPTs have been pretty weak at following precise instructions for a long time, as if they're purposefully filtering user input instead of processing it as-is.

      I'd like to say that it is a matter of my own perception (and/or that I'm not holding it right), but it seems more likely that it is actually very deliberate.

      As a tangential example of this concept, ChatGPT 4 rather unexpectedly produced this text for me the other day early on in a chat when I was poking around:

      "The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."

      i.e., "Because this information is shown to you in all conversations they have, it is not relevant to 99% of requests."

    • It all stems from the fact that it just talks English.

      It's understandably hard to not be implicitly biased towards talking to it in a natural way and expecting natural interactions and assumptions when the whole point of the experience is that the model talks in a natural language!

      Luckily humans are intelligent too and the more you use this tool the more you'll figure out how to talk to it in a fruitful way.

  • > have no idea whether the LLM understood what I’m asking

    That's easy. The answer is it doesn't. It has no understanding of anything it does.

    > if it’s able to do it

    This is the hard part.

> It knows english at or above the level of most fluent speakers, and it can produce output that is not just likely but logical

This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident, when an elementary understanding of English and very little logic show that answer to be obviously wrong.

https://x.com/colin_fraser/status/1834336440819614036

  • Many of my PhD and postdoc colleagues who emigrated from Korea, China and India, and who didn’t have English as the medium of instruction, would struggle with this question. They only recover when you give them a hint, and they’re some of the smartest people in general. If you stop trying to stump these models with trick questions and ask them straightforward reasoning questions, they are extremely performant (o1 is definitely a step up, though not revolutionary in my testing).

    • I live in one of the countries you mentioned and just showed it to one of my friends who's a local who struggles with English. They had no problem concluding that the doctor was the child's dad. Full disclosure, they assumed the doctor was pretending to be the child's dad, which is also a perfectly sound answer.

    • The claim was that "it knows english at or above the level of most fluent speakers". If the claim is that it's very good at producing reasonable responses to English text, posing "trick questions" like this would seem to be a fair test.

    • I think you have particularly dumb colleagues then. If you post this question to an average STEM PhD in China (not even from China. In China) they'll get it right.

      This question is the "unmisleading" version of a very common misleading question about sexism. ChatGPT learned the original, misleading version so well that it can't answer the unmisleading version.

      Humans who don't have the original version ingrained in their brains will answer it with ease. It's not even a tricky question to humans.

  • This illustrates a different point. This is a variation on a well known riddle that definitely comes up in the training corpus many times. In the original riddle a father and his son die in the car accident and the idea of the original riddle is that people will be confused how the boy can be the doctor's son if the boy's father just died, not realizing that women can be doctors too and so the doctor is the boy's mother. The original riddle is aimed to highlight people's gender stereotype assumptions.

    Now, since the model was trained on this, it immediately recognizes the riddle and answers according to the much more common variant.

    I agree that this is a limitation and a weakness. But it's important to understand that the model knows the original riddle well, so this is highlighting a problem with rote memorization/retrieval in LLMs. But this (tricky twists in well-known riddles that are in the corpus) is a separate thing from answering novel questions. It can also be seen as a form of hypercorrection.

    • My codebases are riddled with these gotchas. For instance, I sometimes write Python for the Blender rendering engine. This requires highly non-idiomatic Python. Whenever something complex comes up, LLMs just degenerate to cookie-cutter, basic bitch Python code. There is simply no "there" there. They are very useful to help you reason about unfamiliar codebases though.

  • 1. It didn't insist anything. It got the semi-correct answer when I tried [1]; note it's a preview model, and it's not a perfect product.

    (a) Sometimes things are useful even when imperfect e.g. search engines.

    (b) People make reasoning mistakes too, and I make dumb ones of the sort presented all the time despite being fluent in English; we deal with it!

    I'm not sure why there's an expectation that the model is perfect when the source data - human output - is not perfect. In my day-to-day work and non-work conversations it's a dialogue - a back and forth until we figure things out. I've never known anybody to get everything perfectly correct the first time, it's so puzzling when I read people complaining that LLMs should somehow be different.

    2. There is a recent trend where sex/gender/pronouns are not aligned and the output correctly identifies this particular gotcha.

    [1] I say semi-correct because it states the doctor is the "biological" father, which is an uncorroborated statement. https://chatgpt.com/share/66e3f04e-cd98-8008-aaf9-9ca933892f...

  • Reminds me of a trick question about Schrödinger's cat.

    “I’ve put a dead cat in a box with a poison and an isotope that will trigger the poison at a random point in time. Right now, is the cat dead or alive?”

    The answer is that the cat is dead, because it was dead to begin with. Understanding this doesn’t mean that you are good at deductive reasoning. It just means that I didn’t manage to trick you. Same goes for an LLM.

    • There is no "trick" in the linked question, unlike the question you posed.

      The trick in yours also isn't a logic trick, it's a redirection, like a sleight of hand in a card trick.

    • Yeah, I think what a lot of people miss about these sorts of gotchas is that most of them were invented explicitly to gotcha humans, who regularly get got by them. This is not a failure mode unique to LLMs.

  • What I'm not able to comprehend is why people are not seeing the answer as brilliant!

    Any ordinary mortal (like me) would have jumped to the conclusion that the answer is "Father" and walked away patting themselves on the back, without realising that they were biased by statistics.

    Whereas o1, at the very outset, smelled out that it is a riddle - why would anyone out of the blue ask such a question? So, it started its chain of thought with "Interpreting the riddle" (smart!).

    In my book that is the difference between me and people who are very smart and are generally able to navigate the world better (cracking interviews or navigating internal politics in a corporation).

    • The 'riddle': A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy he says "I can't operate on this child, he is my son". How is this possible?

      GPT Answer: The doctor is the boy's mother

      Real Answer: Boy = Son, Woman = Mother (and her son), Doctor = Father (he says...he is my son)

      This is not in fact a riddle (though presented as one) and the answer given is not in any sense brilliant. This is a failure of the model on a very basic question, not a win.

      It's non-deterministic, so it might sometimes answer correctly and sometimes incorrectly. It will also accept corrections on any point, even when it is right, unlike a thinking being who is sure of their facts.

      LLMs are very interesting and a huge milestone, but generative AI is the best label for them - they generate statistically likely text, which is convincing but often inaccurate, and they have no real sense of correct or incorrect. This needs more work, and it's unclear if this approach will ever get to general AI. Interesting work though, and I hope they keep trying.

    • > why would anyone out of the blue ask such a question

      I would certainly expect any person to have the same reaction.

      > So, it started its chain of thought with "Interpreting the riddle" (smart!).

      How is that smarter than intuitively arriving at the correct answer without having to explicitly list the intermediate step? Being able to reasonably accurately judge the complexity of a problem with minimal effort seems “smarter” to me.

    • The doctor is obviously a parent of the boy. The language tricks simply emulate the ambiance of reasoning, similarly to a political system emulating the ambiance of democracy.

  • I'm noticing a strange common theme in all these riddles that it's being asked and getting wrong.

    They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's the tautology: you would usually say "a mother and her son...".

    I think it may answer correctly if you start off asking "Please solve the below riddle:"

    There was another example yesterday which it solved correctly after this addition. (In that case the points of view were all mixed up; it only worked as a riddle.)

    • > They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's the tautology: you would usually say "a mother and her son...".

      How is "a woman and her son" badly worded? The meaning is clear and blatantly obvious to any English speaker.

    • Yup. The models fail on gotcha questions asked without warning, especially when evaluated on the first snap answer. Much like approximately all humans.

  • Keep in mind that the system always samples its output randomly, so there is always a possibility it commits to the wrong output.

    I don't know why OpenAI won't allow determinism, but it doesn't, even with temperature set to zero.
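
    For what it's worth, the Chat Completions API does expose a seed parameter for best-effort reproducibility, but even with temperature 0 and a fixed seed the output isn't guaranteed to be identical across runs. A minimal sketch (the model name is an assumption):

        from openai import OpenAI

        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o",      # model name is an assumption
            messages=[{"role": "user", "content": "A woman and her son are in a car accident..."}],
            temperature=0,       # removes sampling randomness, not all nondeterminism
            seed=42,             # best-effort reproducibility only
        )
        print(resp.choices[0].message.content)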

    • Nondeterminism provides an excuse for errors, determinism doesn't.

      Determinism scores worse with human raters, because it makes output sound even more robotic and less human.

    • Determinism only helps if you always ask the question with exactly the same words. There's no guarantee a slightly rephrased version will give the same answer, so a certain amount of unpredictability is unavoidable anyway. With a deterministic LLM you might find one phrasing that always gets it right and a dozen basically indistinguishable ones that always get it wrong.

  • What's weird is it gets it right when I try it.

    https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...

    • That’s not weird at all, it’s how LLMs work. They statistically arrive at an answer. You can ask it the same question twice in a row in different windows and get opposite answers. That’s completely normal and expected, and also why you can never be sure if you can trust an answer.

    • Perhaps OpenAI hot-patches the model for HN complaints:

        def intercept_hn_complaints(prompt):
            if is_hn_trick_prompt(prompt):
                # special-case known trick questions
                return special_case_answer(prompt)
            return model(prompt)

    • Waat, got it on second try:

      This is possible because the doctor is the boy's other parent—his father or, more likely given the surprise, his mother. The riddle plays on the assumption that doctors are typically male, but the doctor in this case is the boy's mother. The twist highlights gender stereotypes, encouraging us to question assumptions about roles in society.

  • The reason why that question is a famous question is that _many humans get it wrong_.

> The failure is in how you're using it.

People, for the most part, know what they know and don't know. I am not uncertain that the distance between the earth and the sun varies, but I'm certain that I don't know the distance from the earth to the sun, at least not with better precision than about a light week.

This is going to have to be fixed somehow to progress past where we are now with LLMs. Maybe expecting an LLM to have this capability is wrong, perhaps it can never have this capability, but expecting this capability is not wrong, and LLM vendors have somewhat implied that their models have this capability by saying they won't hallucinate, or that they have reduced hallucinations.

  • > the distance from the earth to the sun, at least not with better precision than about a light week

    The sun is eight light minutes away.

    • Thanks, I was not sure if it was light hours or minutes away, but I knew for sure it's not light weeks (emphasis on plural here) away. I will probably forget again in a couple of years.

  • Empirically, they have reduced hallucinations. Where do OpenAI / Anthropic claim that their models won't hallucinate?

> Treat it as a naive but intelligent intern.

You are falling into the trap that everyone does: anthropomorphising it. It doesn't understand anything you say. It just statistically knows what a likely response would be.

Treat it as text completion and you can get more accurate answers.

  • > You are falling into the trap that everyone does: anthropomorphising it. It doesn't understand anything you say.

    And an intern does?

    Anthropomorphising LLMs isn't entirely incorrect: they're trained to complete text like a human would, in a completely general setting, so by anthropomorphising them you're aligning your expectations with the models' training goals.

  • Oh no, I'm well aware that it's a big file full of numbers. But when you chat with it, you interact with it as though it were a person so you are necessarily anthropomorphizing it, and so you get to pick the style of the interaction.

    (In truth, I actually treat it in my mind like it's the Enterprise computer and I'm Beverly Crusher in "Remember Me")

> Treat it as a naive but intelligent intern.

That's the crux of the problem. Who would treat it as an intern, and why? It might cost you more in explaining things to it and dealing with it than not using it at all.

The purpose of an intern is to grow the intern. If this intern is static and will always be at the same level, why bother? If you had to feed and prep it every time, you might as well hire a senior.

I've been doing exactly this for about a year now. Feed it data as words, give it a task, get better words back.

I sneak in a benchmark opening of data every time I start a new chat - so right off the bat I can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as I can just terminate and try starting another chat.

ChatGPT is fickle daily. Most days it's on point; some days it's wearing a bicycle helmet and licking windows. Kinda sucks I can't just zone out and daydream while working - gotta be checking replies for when the wheels fall off the convo.

  • > I sneak in a benchmark opening of data every time I start a new chat - so right off the bat I can see in its response whether this chat session is gonna be on point or if we are going off into wacky world, which saves me time as I can just terminate and try starting another chat.

    I don't think it works like that...

And how much data can you give it?

I'm not up to date with these things because I haven't found them useful. But what you said, combined with previous limitations in how much data they can retain, essentially makes them pretty darn useless for that task.

Great learning tool on common subjects you don't know, such as learning a new programming language. Also great for inspiration etc. But that's pretty much it?

Don't get me wrong, that is mindblowingly impressive but at the same time, for the tasks in front of me it has just been a distracting toy wasting my time.

  • >And how much data can you give it?

    Well, theoretically you can give it up to the context size minus 4k tokens, because the maximum it can output is 4k. In practice, though, its ability to effectively recall information in the prompt drops off. Some people have studied this a bit - here's one such person: https://gritdaily.com/impact-prompt-length-llm-performance/

    • You should be able to provide more data than that in the input if the output doesn't use the full 4k tokens. So limit is context_size minus expected length of output.
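
      A rough sketch of that budget check, assuming the tiktoken tokenizer (the encoding name and the window/output numbers are assumptions and vary by model):

          import tiktoken

          CONTEXT_WINDOW = 128_000      # assumed total window
          RESERVED_FOR_OUTPUT = 4_000   # expected/maximum reply length

          enc = tiktoken.get_encoding("o200k_base")  # encoding name is an assumption

          def fits(prompt: str) -> bool:
              # True if the prompt still leaves room for the expected reply.
              return len(enc.encode(prompt)) <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT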

  • > And how much data can you give it?

    128,000 tokens, which is about the same as a decent sized book.

    Their other models can also be fine-tuned, which is kinda unbounded but also has scaling issues so presumably "a significant percentage of the training set" before diminishing returns.

  • It is great for proof-reading text if you are not a native English speaker. Things like removing passive voice. Just give it your text and you get a corrected version out.

    Use a CLI tool to automate this: Ollama for local models, llm for OpenAI.

  • People never talk about Gemini, and frankly its output is often the worst of the SOTA models, but its 2M context window is insane.

    You can drop a few textbooks into the context window before you start asking questions. This dramatically improves output quality, however inference does take much much longer at large context lengths.

Except that it sometimes does do those tasks well. The danger in an LLM isn't that it sometimes hallucinates, the danger is that you need to be sufficiently competent to know when it hallucinates in order to fully take advantage of it, otherwise you have to fallback to double checking every single thing it tells you.

> On the other hand, if you were to paste in the entire documentation set for a tool it has never seen and ask it to use that tool to accomplish your goals, THEN this model would be likely to produce useful output, despite the fact that it had never encountered the tool or its documentation before.

There's not much evidence of that. It only marginally improved on instruction following (see livebench.ai), and its score as a swe-bench agent is barely above gpt-4o (model card).

It gets really hard problems better, but it's unclear that matters all that much.

> A lot of people use LLMs as a search engine.

Except this is where LLMs are so powerful. A sort of reasoning search engine. They memorized the entire Internet and can pattern match it to my query.

> The magic is that _it knows english_.

I couldn't agree more, this is exactly the strength of LLMs that we should focus on. If you can make your problem fit into this paradigm, LLMs work fantastically. Hallucinations come from that massive "lossy compressed database", but you should consider that part as more like the background noise that taught the model to speak English and the syntax of programming languages, rather than the source of the knowledge to respond with. Stop anthropomorphizing LLMs; play to their strengths instead.

In other words, it might hallucinate an API, but it will rarely, if ever, make a syntax error. Once you realize that, it becomes a much more useful tool.

It doesn't know anything. Stop anthropomorphizing the model. It's predictive text, and no, the brain isn't also predictive text.

> Treat it as a naive but intelligent intern.

I've found an amazing amount of success with a three step prompting method that appears to create incredibly deep subject matter experts who then collaborate with the user directly.

1) Tell the LLM that it is a method actor, 2) Tell the method actor they are playing the role of a subject matter expert, 3) At each step, 1 and 2, use the technical language of that type of expert; method actors have their own technical terminology, use it when describing the characteristics of the method actor, and likewise use the scientific/programming/whatever technical jargon of the subject matter expert your method actor is playing.

Then, in the system prompt or whatever logical wrapper the LLM operates through for the user, instruct the "method actor" like you are the film director trying to get your subject matter expert performance out of them.

I offer this because I've found it works very well. It's all about crafting the context in which the LLM operates, and this appears to cause the subject matter expert to be deeper, more useful, smarter.
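
A minimal sketch of how such a system prompt might be assembled - the wording, roles and jargon below are just illustrations of the three steps, not a canonical recipe:

    def method_actor_prompt(expert: str, jargon: str, direction: str) -> str:
        # Step 1: the LLM is a method actor. Step 2: the actor plays a subject
        # matter expert. Step 3: use each domain's own technical language.
        return (
            "You are a method actor doing deep character immersion, staying in "
            "character at all times and drawing on sense memory and given circumstances. "
            f"Your character is {expert}, who naturally thinks and speaks in terms of {jargon}. "
            f"Director's note for this scene: {direction}"
        )

    system_prompt = method_actor_prompt(
        expert="a senior Rust compiler engineer",
        jargon="borrow checking, lifetimes and MIR passes",
        direction="walk the user through diagnosing a lifetime error, asking for code when you need it",
    )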

This is demonstrably wrong, because you can just follow a response with "is this real?" and it generally knows whether it made it up or not. Not every time, but I find it works 95% of the time. Given that, this is exactly the kind of step I'd hope an advanced model was doing behind the scenes.

> Treat it as a naive but intelligent intern. Provide it data, give it a task, and let it surprise you with its output.

Well, I am a naive but intelligent intern (well, senior developer). So in this framing, the LLM can’t do more than I can already do by myself, and thus far it’s very hit or miss if I actually save time, having to provide all the context and requirements, and having to double-check the results.

With interns, this at least improves over time, as they become more knowledgeable, more familiar with the context, and become more autonomous and dependable.

Language-related tasks are indeed the most practical. I often use it to brainstorm how to name things.

  • I've recently started using an LLM to choose the best release of shows using data scraped from several trackers. I give it hard requirements and flexible preferences. It's not that I couldn't do this, it's that I don't want to do this on the scale of multiple thousand shows. The "magic" here is that releases don't all follow the same naming conventions, they're an unstructured dump of details. The LLM is simultaneously extracting the important details, and flexibly deciding the closest match to my request. The prompt is maybe two paragraphs and took me an hour to hone.

  • Ooh yeah, it's great for bouncing ideas about what to name things off of. You can give it something's function and a backstory and it'll come up with a list of somethings for you to pick and choose from.

> The failure is in how you're using it

This isn’t true because, as you can read in the first sentence of the post you’re responding to, GP did give it a task like you recommend here:

> Provide it data, give it a task, and let it surprise you with its output.

And it fails the task. Specifically it fails it by hallucinating important parts of accomplishing it.

> hallucinates non-existing libraries and functions

This post only makes sense if your advice to “let it surprise you with its output” is mandatory, like you’re using it wrong if you do not make yourself feel impressed by it.

Yeah, except I’m priming it with things like curated docs from the latest bevy, using the tricks, and testing context limits.

It’s still changing things to be several versions old from its innate kb pattern-matching or whatever you want to call it. I find that pretty disappointing.

Just like copilot and gpt4, it’s changing `add_systems(Startup, system)` to `add_startup_system(system.system())` and other pre-schedule/fanciful APIs—things it should have in context.

I agree with your approach to LLMs, but unfortunately “it’s still doing that thing.”

PS: and by the time I’d done those experiments, I had run out of preview; it resets 5 days from now. D’oh

This model is, thankfully, far more receptive to longer and more elaborate explanations as input. The rest (4, 4o, Sonnet) seem to struggle with comprehensive explanations; this one seems to perform better with spec-like input.

> A lot of people use LLMs as a search engine.

GPT-4o is wonderful as a search engine if you tell it to google things before answering (even though it uses bing).

Sorry, but that does not seem to be the case. A friend of mine who runs a long-context benchmark on understanding novels [1] just ran an eval, and o1 seemed to improve by 2.9% over GPT-4o (the result isn't on the website yet). It's great that there is an improvement, but it isn't drastic by any stretch. Additionally, since we cannot see the raw reasoning it's basing the answers on, it's hard to attribute this increase to their complicated approach as opposed to just cleaner, higher-quality data.

EDIT: Note this was run over a dataset of short stories rather than the novels since the API errors out with very long contexts like novels.

[1]: https://novelchallenge.github.io/

Intelligent?

Just ask ChatGPT:

How many Rs are in strawberry?