Comment by joshribakoff

1 year ago

I have been using various LLMs to do some meal planning and recipe creation. I asked for summaries of the recipes and they looked good.

I then asked it to link a YouTube video for each recipe and it used the same video 10 times for all of the recipes. No amount of prompting was able to fix it unless I request one video at a time. It would just acknowledge the mistake, apologize and then repeat the same mistake again.

I told it let’s try something different and generate a shopping list of ingredients to cover all of the recipes. It recommended purchasing amounts that didn’t make sense and even added some random items that did not appear in any of the recipes.

When I was making the dishes, I asked for the detailed recipes and it completely changed them, adding ingredients that were not on the shopping list. When I pointed it out again, it acknowledged the mistake, apologized, and then “corrected it” by completely changing it again.

I would not conclude that I am a lazy or bad prompter, and I would not conclude that the LLMs exhibited any kind of remarkable reasoning ability. I even interrogated the AIs about why they were making the mistakes, and they told me it was because “it just predicts the next word”.

Another example: I asked the bots for tips on how to feel my pecs more on incline cable flies, and I was told to start with the cables above shoulder height, which is not an incline fly; it is a decline fly. When I questioned it, it told me to start just below shoulder height, which again is not an incline fly.

My experience is that you have to write a draft of the note you are trying to create, or leave so many details in the prompt that you are basically doing most of the work yourself. It’s great for things like “give me a recipe that contains the following ingredients” or “clean up the following note to sound more professional.” Anything more than that, and it tends to fail horribly for me. I have even had long conversations with the AIs asking for tips on how to write better prompts, and they recommend things I’m already doing.

When people remark about the incredible reasoning ability, I wonder if they are just testing it on things that were already in the training data or they are failing to recognize how garbage the output can be. However, perhaps we can agree that the reasoning ability is incredible in the sense that it can do a lot of reasoning very quickly, but it completely lacks any kind of common sense and often does the wrong kind of reasoning.

For example, the prompt about tips to feel my pecs more on an incline cable fly could have just entailed “copying and pasting” a pre-written article from the training data; but instead, in its own words, it “over analyzed bench angles and cable heights instead of addressing what you meant”. One of the bots did “copy paste” a generic article that included tips for decline, flat, and incline. None correctly gave tips for just incline on the first try, and some took several rounds of iteration, basically spoon-feeding the model the answer, before they understood.

You're expecting it to be an 'oracle': you prompt it with any question you can think of, and it answers correctly. I think your experiences will make more sense if you think of it as a heuristic, model-based situation-simulation engine, as I described above.

For example, why would it have URLs to YouTube videos of recipes? There is not enough storage in the model for that. The best it can realistically do is produce a properly formatted YouTube URL. It would be nice if it could instead explain that it has no way to know that, but that answer isn't appropriate within the context of the training data and the prompt you are giving it.

The other things you asked also require information it has no room to store, and that would be nearly impossible to predict by modeling from underlying principles. That is something they can do in general, in many cases already much better than humans, but it is still a very error-prone process, akin to predicting the future.

For example, I am a competitive strength athlete, and I have doctorate-level training in human physiology and biomechanics. I could not reason out a method for you to feel your pecs better without seeing what you are already doing, coaching you in person, and experimenting with different ideas and techniques myself, while also having access to my own actual human body to try movements and psychological cues on.

You are asking it to answer things that are nearly impossible to compute from first principles without unimaginable amounts of intelligence and compute power, and are unlikely to have been directly encoded in the model itself.

Now, turning an already written set of recipes into a shopping list is something I would expect it to be able to do easily and correctly, if you were using a modern model with a sufficiently sized context window and prompting it correctly. I just did a quick test where I gave GPT-4o only the instruction steps (not the ingredients list) for an oxtail soup recipe, and it accurately recreated the entire shopping list, organized realistically according to likely sections of the grocery store. What model were you using?

  • > an oxtail soup recipe

    Sounds like the model just copy-pasted one from the internet; hard to get that wrong. GP could have had a bespoke recipe and list of ingredients. This particular example of yours just reconfirmed what was being said: it's only able to copy-paste existing content, and it's lost otherwise.

    In my case, I have huge trouble getting it to create useful TypeScript code, for example, apparently simply because there isn't enough advanced TS code out there that is properly documented.

    For completeness' sake, my last prompt was to create a function where one type parameter is inferred but the other is given explicitly (roughly the situation sketched below). After several prompts and loops, I learned that this is just not possible in TypeScript yet.
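
    A minimal sketch of what I mean, assuming the issue was partial type argument inference (the names here are made up for illustration): you can't supply one type argument explicitly and have the other inferred, and currying is the usual workaround.

      // Hypothetical illustration: T is the type we want to name explicitly,
      // and Raw should be inferred from the argument.
      function parseAs<T, Raw>(raw: Raw, convert: (raw: Raw) => unknown): T {
        return convert(raw) as T;
      }

      // Spelling out both type arguments works:
      const a = parseAs<{ id: number }, string>('{"id": 1}', JSON.parse);

      // Supplying only T does not; Raw is not inferred:
      // const b = parseAs<{ id: number }>('{"id": 1}', JSON.parse);
      // -> error TS2558: Expected 2 type arguments, but got 1.

      // Usual workaround: curry, so each call site introduces one type parameter.
      const parseAsTo =
        <T>() =>
        <Raw>(raw: Raw, convert: (raw: Raw) => unknown): T =>
          convert(raw) as T;

      // Raw is inferred as string, T is given explicitly:
      const c = parseAsTo<{ id: number }>()('{"id": 1}', JSON.parse);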

    • No, that example is not something I would find very useful or a good example of its abilities, just one thing I generally expected it to be capable of doing. One can quickly confirm that it is doing the work and not copying and pasting the list by altering the recipe to include steps and ingredients not typical for such a recipe. I made a few such alterations just now, reran it, and it adjusted correctly from a clean prompt.

      I've found it able to come up with creative new ideas for solving scientific research problems by finding similarities between concepts that I would not have thought of. I've also found it useful for suggesting local activities while I'm traveling, based on my rather unusual interests, that you wouldn't find recommended for travelers anywhere else. I've also found it can solve totally novel classical physics problems with correct qualitative answers that involve keeping track of the locations and interactions of a lot of objects. I'm not sure how useful that is, but it proves real understanding and modeling, something people repeatedly say LLMs will never be capable of.

      I have found that it can write okay code to solve totally novel problems, but not without a ton of iteration, which it can do, but that is slower than just doing it myself, and it doesn't code in my style. I have not yet decided to use any code it writes, although it is interesting to test its abilities by presenting it with weird coding problems.

      Overall, I would say it's not really very useful, but it is exhibiting real (very much alien and non-human-like) intelligence and understanding. It's just not an oracle, which is what people want and would find useful. I think we will find them more useful once we have a better understanding of what they actually are and can do, rather than what we wish they were.