Comment by williamdclt
1 year ago
> Treat it as a naive but intelligent intern
That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that: they will take whatever you ask and give a reasonable-sounding output that might be anywhere between brilliant and nonsense.
With an intern, I don’t need to measure how good my prompting is; we’ll usually interact to arrive at a common understanding. With an LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking or whether it’s able to do it.
I feel like it almost always starts well, given the full picture, but then, for non-trivial stuff, gets stuck towards the end. The longer the conversation goes, the more wheel-spinning occurs, and before you know it you have spent an hour chasing that last-mile connectivity.
For complex questions, I now only use it to get the broad picture, and once the output is good enough to be a foundation, I build the rest of it myself. I have noticed that the net time spent using this approach still yields big savings over a) doing it all myself or b) pushing it to do the entire thing. I guess 80/20, etc.
This is the way.
I've had this experience many times:
- hey, can you write me a thing that can do "xyz"
- sure, here's how we can do "xyz" (gets some small part of the error handling for xyz slightly wrong)
- can you add onto this with "abc"
- sure. in order to do "abc" we'll need to add "lmn" to our error handling. this also means that you need "ijk" and "qrs" too, and since "lmn" doesn't support "qrs" out of the box, we'll also need a design solution to bridge the two. Let me spend 600 more tokens sketching that out.
- what if you just use the language's built-in feature here in "xyz"? doesn't that mean we can do it with just one line of code?
- yes, you're absolutely right. I'm sorry for making this over complicated.
If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff. Even one small error early in the chain propagates. That's why I end up very frequently restarting conversations in a new chat, or rewriting my chat questions to remove bad stuff from the context. Without the ability to do that, it's nearly worthless. It's also why I think we'll be seeing absurdly, wildly wrong chains of thought coming out of o1, because "thinking" for 20s may well cause it to just go totally off the rails half the time.
> If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff.
If you think about it, that's probably the most difficult problem conversational LLMs need to overcome -- balancing sticking to conversational history vs abandoning it.
Humans do this intuitively.
But it seems really difficult to simultaneously (a) stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL and (b) know when to legitimately bail on a previous misstatement or something that was demonstrably false.
What's SOTA in how this is being handled in current models, as conversations go deeper and situations like the one referenced above arise? (false statement, user correction, user expectation of a subsequent corrected statement that still follows the rest of the conversational history)
5 replies →
> That's why I end up very frequently restarting conversations in a new chat or re-write my chat questions to remove bad stuff from the context.
Me too - open a new chat and start by copy/pasting the "last-known-good state". OpenAI could introduce a "new-chat-from-here" feature :)
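A minimal sketch of what that "new-chat-from-here" behavior can look like client-side today, assuming an OpenAI-style list of message dicts (the branch_from helper and the example history are made up for illustration):

    # Hypothetical helper: branch a chat from a "last-known-good" point.
    def branch_from(messages, last_good_index, new_user_message):
        """Keep the conversation up to and including last_good_index,
        drop everything after it, and append a fresh user turn."""
        trimmed = [dict(m) for m in messages[:last_good_index + 1]]
        trimmed.append({"role": "user", "content": new_user_message})
        return trimmed

    history = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write me a thing that does xyz."},
        {"role": "assistant", "content": "Here's xyz ..."},                       # good turn
        {"role": "assistant", "content": "...overcomplicated lmn/qrs bridge..."},  # bad turn to discard
    ]

    fresh = branch_from(history, last_good_index=2,
                        new_user_message="Can we do xyz with the language's built-in feature instead?")
    # `fresh` can now be sent to the chat API as a brand-new conversation.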
Some good suggestions here. I have also had success asking things like, “is this a standard/accepted approach for solving this problem?”, “is there a cleaner, simpler way to do this?”, “can you suggest a simpler approach that does not rely on X library?”, etc.
Yes, I’ve seen that too. One reason it will spin its wheels is because it “prefers” patterns in transcripts and will try to continue them. If it gets something wrong several times, it picks up on the “wrong answers” pattern.
It’s better not to keep wrong answers in the transcript. Edit the question and try again, or maybe start a new chat.
1000% this. LLMs can't say "I don't know" because they don't actually think. I can coach a junior to get better. LLMs will just act like they know what they are doing and give the wrong results to people who aren't practitioners. Good on OAI calling their model Strawberry because of Internet trolls. Reactive vs proactive.
I get a lot of value out of ChatGPT but I also, fairly frequently, run into issues here. The real danger zones are areas that lie at or just beyond the edges of my own knowledge in a particular area.
I'd say that most of my work use of ChatGPT does in fact save me time but, every so often, ChatGPT can still bullshit convincingly enough to waste an hour or two for me.
The balance is still in its favour, but you have to keep your wits about you when using it.
Agreed, but the problem is if these things replace practitioners (what every MBA wants them to do), it's going to wreck the industry. Or maybe we'll get paid $$$$ to fix the problems they cause. GPT-4 introduced me to window functions in SQL (haven't written raw SQL in over a decade). But I'm experienced enough to look at window functions and compare them to subqueries and run some tests through the query planner to see what happens. That's knowledge that needs to be shared with the next generation of developers. And LLMs can't do that accurately.
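For anyone who hasn't seen them, here's roughly the kind of comparison I mean, as a minimal sketch using Python's built-in sqlite3 (the toy table and queries are invented for illustration; window functions need SQLite >= 3.25):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE orders (customer TEXT, amount INTEGER);
        INSERT INTO orders VALUES
            ('alice', 10), ('alice', 40), ('bob', 25), ('bob', 5);
    """)

    # Window function: rank each customer's orders by amount.
    window_sql = """
        SELECT customer, amount,
               RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
        FROM orders
    """

    # Correlated subquery answering a similar question (largest order per customer).
    subquery_sql = """
        SELECT customer, amount
        FROM orders o
        WHERE amount = (SELECT MAX(amount) FROM orders WHERE customer = o.customer)
    """

    for label, sql in [("window", window_sql), ("subquery", subquery_sql)]:
        print(label, con.execute(sql).fetchall())
        # "Run some tests through the query planner":
        for row in con.execute("EXPLAIN QUERY PLAN " + sql):
            print("  plan:", row)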
2 replies →
This is basically the problem with all AI. It's good to a point, but they don't sufficiently know their limits/bounds and they will sometimes produce very odd results when you are right at those bounds.
AI in general just needs a way to identify when they're about to "make a coin flip" on an answer. With humans, we can quickly preface our asstalk with a disclaimer, at least.
I ask ChatGPT whether it knows things all the time. But it almost never answers no.
As an experiment I asked it if it knew how to solve an arbitrary PDE and it said yes.
I then asked it if it could solve an arbitrary quintic and it said no.
So I guess it can say it doesn't know if it can prove to itself it doesn't know.
The difference is that a junior costs $30-100/hr and will take 2 days to complete the task. The LLM will do it in 20 seconds and cost 3 cents.
Thank god we can finally end the scourge of interns to give the shareholders a little extra value. Good thing none of us ever started out as an intern.
2 replies →
The LLMs absolutely can and do say "I don't know"; I've seen it with both GPT-4 and LLaMA. They don't do it anywhere near as much as they should, yes - likely because their training data doesn't include many examples of that, proportionally - but they are by no means incapable of it.
This surprises me. I made a simple chat fed with PDFs using LangChain, and by default it said it didn't know if I asked questions outside of the corpus. Was it simply a matter of the confidence score getting too low?
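Most likely, yes - the "I don't know" usually comes from a retrieval-score cutoff in front of the model rather than from the model itself. A rough sketch of the idea, where embed() is a hypothetical stand-in for a real embedding model and the threshold is hand-picked, not whatever LangChain actually uses:

    import numpy as np

    def embed(text):
        # Hypothetical stand-in for a real embedding model; in practice this
        # would call an embedding API or a local model.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.normal(size=384)
        return v / np.linalg.norm(v)

    def answer(question, corpus_chunks, threshold=0.3):
        """Refuse to answer when no retrieved chunk is similar enough to the question."""
        q = embed(question)
        scores = [float(np.dot(q, embed(chunk))) for chunk in corpus_chunks]
        if max(scores) < threshold:
            return "I don't know - that doesn't seem to be covered by the documents."
        best_chunk = corpus_chunks[int(np.argmax(scores))]
        # Otherwise, send the question plus the best chunk(s) to the LLM.
        return f"(would ask the LLM, grounded on: {best_chunk!r})"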
> LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.
This is exactly why I’ve been objecting so much to the use of the term “hallucination” and maintain that “confabulation” is accurate. People who have spent enough time with acutely psychotic people, with people experiencing the effects of long-term alcohol-related brain damage, and with trying to tell computers what to do will understand why.
I don't know that "confabulation" is right either: it has a couple of other meanings beyond "a fabricated memory believed to be true" and, of course, the other issue is that LLMs don't believe anything. They'll backtrack on even correct information if challenged.
I’m starting to think this is an unsolvable problem with LLMs. The very act of “reasoning” requires one to know that they don’t know something.
LLMs are giant word Plinko machines. A million monkeys on a million typewriters.
LLMs are not interns. LLMs are assumption machines.
None of the million monkeys or the collective million monkeys are “reasoning” or are capable of knowing.
LLMs are a neat parlor trick and are super powerful, but are not on the path to AGI.
LLMs will change the world, but only in the way that the printing press changed the world. They’re not interns, they’re just tools.
I think LLMs are definitely on the path to AGI in the same way that the ball bearing was on the path to the internal combustion engine. I think its quite likely that LLMs will perform important functions within the system of an eventual AGI.
We're learning valuable lessons from all modern large-scale (post-AlexNet) NN architectures, transformers included, and NNs (but maybe trained differently) seem a viable approach to implement AGI, so we're making progress ... but maybe LLMs will be more inspiration than part of the (a) final solution.
OTOH, maybe pre-trained LLMs could be used as a hardcoded "reptilian brain" that provides some future AGI with some base capabilities (vs being sold as a newborn that needs 20 years of parenting to be useful) that the real learning architecture can then override.
5 replies →
This may be accurate. I wonder if there's enough energy in the world for this endeavour.
4 replies →
LLMs mostly know what they know. Of course, that doesn't mean they're going to tell you.
https://news.ycombinator.com/item?id=41504226
It probably depends on your problem space. In creative writing, I wonder if it's even perceptible when the LLM is creating content at the boundaries of its knowledge base. But for programming and other falsifiable (and rapidly changing) disciplines it is noticeable, and a problem.
Maybe some evaluation of the sample size would be helpful? If the LLM has fewer than X samples of an input word or phrase, it could include a cautionary note in its output, or even respond with some variant of “I don’t know”.
In creative writing, the problem shows up as word choices and implications that deviate from expectations in unexpected ways.
It can get really obvious when it's repeatedly using clichés. Both in repeated phrases and in trying to give every story the same ending.
> I wonder if it's even perceptible when the LLM is creating content at the boundaries of its knowledge base
The problem space in creative writing is well beyond the problem space for programming or other "falsifiable disciplines".
> It probably depends on your problem space
Makes me wonder whether medical doctors will ever be able to blame the LLM, rather than other factors, for killing their patients.
Have you ever worked with an intern? They have personalities and expectations that need to be managed. They get sick. They get tired. They want to punch you if you treat them like a 24/7 bird dog. It's so much easier to not let perfect be the enemy of the good and just rapid-fire ALL day at an LLM for any and everything I need help with. You can also just not use the LLM. Interns need to be 'fed' work or the ROI ends up upside down. Is an LLM as good as a top-tier intern? No, but with an LLM I can have 10 pretty good interns by opening 10 tabs.
The LLMs are getting better and better at a certain kind of task, but there's a subset of tasks that I'd still much rather have any human than an LLM, today. Even something simple, like "Find me the top 5 highest grossing movies of 2023" it will take a long time before I trust an LLM's answer, without having a human intern verify the output.
I think listing off a set of pros and cons for interns and LLMs misses the point, they seem like categorically different kinds of intelligence.
> That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”.
An intern that grew up in a different culture, then - one where questioning your boss is frowned upon. The point is that the way to instruct this intern is to front-load your description of the problem with as much detail as possible, to reduce ambiguity.
Many, many teams are actively building SOTA systems to do this in ways previously unimagined. You can enqueue tasks and do whatever you want. I gotta say, as a current-gen LLM programmer person, I can completely appreciate how bad they are now - I recently tweeted about how I "swore off" AI tools - but there are many ways to bootstrap very powerful software or ML systems around or inside these existing models that can blow away existing commercial implementations in surprising ways.
“building” is the easy part
building SOTA systems is the easy part?! Easy compared to what?
6 replies →
I think this is the main issue with these tools... what people are expecting of them.
We have swallowed the pill that LLMs are supposed to be AGI and all that mumbo jumbo, when they are just great tools - and as such, one needs to learn to use the tool the way it works and make the best of it. Nobody tries to hammer a nail with a broom and then blames the broom for not being a hammer...
I completely agree.
To me the discussion here reads a little like: “Hah. See? It can’t do everything!”. It makes me wonder if the goal is to convince each other that yes, indeed, humans are not yet replaced.
It’s next-token regression; of course it can’t truly introspect. That being said, LLMs are amazing tools, and o1 is yet another incremental improvement, and I welcome it!
> A good intern will ask clarifying questions, tell me “I don’t know”
Your expectations are bigger than mine
(Though some will get stuck in "clarifying questions" and helplessness, and never actually proceed)
Indeed. My expectation of a good intern is to produce nothing I will put in production, but show aptitude worth hiring them for. It's a 10 week extended interview with lots of social events, team building, tech talks, presentations, etc.
Which is why I've liked the LLM analogy of "unlimited free interns".. I just think some people read that the exact opposite way I do (not very useful).
If I had to respect the basic human rights of my LLM backends, it would probably be less appealing - but "Unlimited free smart-for-being-braindead zombies" might be a little more useful, at least?
1 reply →
Note that we are talking about a “good” intern here
Unreasonably good. Beyond fresh-junior-employee good. Also, that's your standard; 'MPSimmons said to treat the model as a "naive but intelligent" intern, not a good one.
Makes me wonder if "I don't know" could be added to LLM: whenever an activation has no clear winner value (layman here), couldn't this indicate low response quality?
This exists and does work to some degree, e.g. Detecting hallucinations in large language models using semantic entropy https://www.nature.com/articles/s41586-024-07421-0
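For intuition, here's a minimal sketch of the simpler signal that idea builds on - the entropy of the next-token distribution, i.e. "no clear winner" among candidate tokens - assuming you have access to raw logits (the numbers below are made up):

    import numpy as np

    def next_token_entropy(logits):
        """Entropy (in nats) of the model's next-token distribution.
        High entropy ~ no clear winner among candidate tokens."""
        logits = np.asarray(logits, dtype=float)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return float(-np.sum(probs * np.log(probs + 1e-12)))

    # Confident step: one token dominates -> entropy near 0.
    print(next_token_entropy([12.0, 2.0, 1.0, 0.5]))

    # Coin-flip step: several plausible tokens -> entropy near log(4) ≈ 1.39,
    # a candidate trigger for a cautionary "I'm not sure" note.
    print(next_token_entropy([3.0, 2.9, 2.8, 2.7]))

(The paper's semantic entropy goes a step further: it samples several full answers and clusters them by meaning before computing entropy, so paraphrases of the same claim don't get counted as disagreement.)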
They've explicitly been trained/system-prompted to act that way. Because that's what the marketing teams at these AI companies want to sell.
It's easy to override this, though, by asking the LLM to act as if it were less confident, more hesitant, paranoid, etc. You'll be fighting uphill against the alignment (marketing) team the whole time though, so ymmv.
> With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive to a common understanding.
With interns you absolutely do need to worry about how good your prompting is! You need to give them specific requirements, training, documentation, give them full access to the code base... 'prompting' an intern is called 'management'.
This might be the best definition I will come across of what it means to be an "IT project manager".
Is this a dataset issue more than an LLM issue?
As in: do we just need to add 1M examples where the response is to ask for clarification / more info?
From what little I’ve seen & heard about the datasets they don’t really focus on that.
(Though enough smart people & $$$ have been thrown at this to make me suspect it’s not the data ;)
Really it just does what you tell it to. Have you tried telling it “ask me clarifying questions about all the APIs you need to solve this problem”?
Huge contrast to human interns, who aren't experienced or smart enough to ask the right questions in the first place, and/or have sentimental reasons for not doing so.
Sure, but to what end?
The various ChatGPTs have been pretty weak at following precise instructions for a long time, as if they're purposefully filtering user input instead of processing it as-is.
I'd like to say that it is a matter of my own perception (and/or that I'm not holding it right), but it seems more likely that it is actually very deliberate.
As a tangential example of this concept, ChatGPT 4 rather unexpectedly produced this text for me the other day early on in a chat when I was poking around:
"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."
ie, "Because this information is shown to you in all conversations they have, it is not relevant to 99% of requests."
I had to use that technique ("don't acknowledge this sideband data that may or may not be relevant to the task at hand") myself last month. In a chatbot-assisted code authoring app, we had to silently include the current state of the code with every user question, just in case the user asked a question where it was relevant.
Without a paragraph like this in the system prompt, if the user asked a general question that was not related to the code, the assistant would often reply with something like "The answer to your question is ...whatever... . I also see that you've sent me some code. Let me know if you have specific questions about it!"
(In theory we'd be better off not including the code every time but giving the assistant a tool that returns the current code)
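A rough sketch of that pattern, assuming the current OpenAI Python client; the model name, the wording of the instruction, and the surrounding app are placeholders, not our actual system prompt:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask_assistant(user_question: str, current_code: str) -> str:
        system_prompt = (
            "You are a coding assistant embedded in an editor.\n"
            "The user's current code is included below for reference. It may or may not be "
            "relevant to their question. Use it only when the question is about the code; "
            "otherwise, do not acknowledge or mention that it was provided.\n\n"
            f"--- current code ---\n{current_code}\n--- end current code ---"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_question},
            ],
        )
        return resp.choices[0].message.content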
3 replies →
It all stems from the fact that it just talks English.
It's understandably hard not to be implicitly biased towards talking to it in a natural way, and expecting natural interactions and assumptions, when the whole point of the experience is that the model talks in natural language!
Luckily humans are intelligent too and the more you use this tool the more you'll figure out how to talk to it in a fruitful way.
I have to say, having to tell it to ask me clarifying questions DOES make it really look smart!
Imagine if you could make it keep going without having to reprompt it.
4 replies →
> have no idea whether the LLM understood what I’m asking
That's easy. The answer is it doesn't. It has no understanding of anything it does.
> if it’s able to do it
This is the hard part.
A lot of interns are overconfident though
Can I have some of those sorts of interns?