Comment by saaaaaam

2 days ago

I tested Gemini today, asking it to extract key pieces of data from a large (72-slide) PDF report deck that includes various visualisations, and to present them as structured data. It failed miserably. It simply made up two of the key stats that form the backbone of the report. When I queried it, it gave an explanation that further compounded its error. When I queried that, extracted the specific slide, and provided it directly, it repeated the same error.

I asked Claude to do the same thing; it got every data point, and created a little React dashboard and a relatively detailed text summary.

I used exactly the same prompt with each.

Maybe the prompt you used was more Claude-friendly than Gemini-friendly?

I'm only half-joking. Different models process their prompts differently, sometimes markedly so; vendors document this, but hardly anyone pays attention to it - everyone seems to write prompts for an idealized model (or for whichever one they use the most), and then rates different LLMs on how well they respond.

Example: Anthropic documents both the huge impact of giving the LLM a role in its system prompt and of structuring your prompt with XML tags. The latter is, AFAIK, Anthropic-specific. Using it improves response quality (I've tested this myself), and yet, as far as I've seen, no BYOK tool offering multi-vendor support respects or leverages it.
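To make the XML-tag point concrete, here's a minimal sketch of such a prompt (the tag names, role, and wording are my own choices; Anthropic's docs just recommend using tags consistently to separate instructions from data):

```python
# Sketch: an Anthropic-style prompt that assigns a role and uses
# XML tags to separate instructions from the document under analysis.
# Tag names ("instructions", "report") are arbitrary, not mandated.

def build_prompt(report_text: str) -> str:
    return (
        "You are a meticulous data analyst.\n\n"
        "<instructions>\n"
        "Extract every data point from the report below and present\n"
        "it as structured data. Do not invent values.\n"
        "</instructions>\n\n"
        "<report>\n"
        f"{report_text}\n"
        "</report>"
    )

print(build_prompt("Q3 revenue grew 12% YoY to $4.1M."))
```

The same text without the tags is a perfectly valid prompt for any model; the tags are purely an Anthropic-documented quality boost.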

Maybe Gemini has some magic prompt features, too? I don't know, I'm in the EU, and Google hates us.

  • Possibly. But my Claude prompts work fine on ChatGPT, the only difference being ChatGPT isn't very good. I pay for both.

    I would not pay for Gemini - which is presumably why they've added it for "free" for everyone.

    My Anthropic prompts in the API are structured. I've got one amazing API prompt that has 67 instructions and gives mind-blowing results (to the point that it has replaced a human), but for a simple question I don't find value in that. And, frankly, consumer-facing AI chatbots shouldn't need prompting expertise for basic out-of-the-box stuff.

    The prompt I used in this example was simply "Please extract the data points contained within this report and present as structured data"

    > and yet as far I've seen, no BYOK tool offering multiple vendor support respects or leverages that

    When you say BYOK tool, do you mean effectively a GUI front end on the API? I use TypingMind for quickly throwing things at my API keys for testing, and I'm pretty sure you can have a persistent custom system prompt, though I think you'd need to input it for each vendor/model.

    • > When you say BYOK tool do you mean effectively a GUI front end on the API?

      Less that, and more focused tools like e.g. Aider (OSS Cursor from before Cursor was a thing).

      I use TypingMind almost exclusively for any and all LLM chatting, and I do maintain a bunch of Claude-optimized prompts that specifically exploit the "XML tags" feature (some of them I also run through Anthropic's prompt improver) - but I don't expect the generic frontends to care about vendor-specific prompting tricks by default. Here, my only complaint is that I don't have control over how it injects attachments; inlined text attachments in particular are something Anthropic's docs recommend demarcating with XML tags, which TypingMind almost certainly doesn't do. I'd also love for the UI to recognize XML tags in output and perhaps offer some structuring or folding on the UI side, e.g. auto-collapsing specified tags such as "<thinking>" or "<therapeuticAnalysis>" or whatever I told the LLM to use.

      (Oh, and another thing: Anthropic recently introduced a better form of PDF upload, in which the Anthropic side simultaneously OCRs and images the PDF and feeds both to the model, to exploit its multimodal capabilities. TypingMind, as far as I can tell, still can't take advantage of it, despite it boiling down to an explicit if/else on the model vendor.)
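      (For the curious: as I understand Anthropic's PDF support, taking advantage of it mostly means sending the file as a base64 "document" content block when the vendor is Anthropic. A rough sketch of building that block, assuming the shape their docs describe; verify against the current API before relying on it:)

```python
import base64

# Sketch: build an Anthropic-style "document" content block from raw
# PDF bytes; Anthropic's side then OCRs *and* images the pages for the
# model. (Shape as I understand their PDF support docs - an assumption,
# not a definitive reference.)

def pdf_content_block(pdf_bytes: bytes) -> dict:
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.b64encode(pdf_bytes).decode("ascii"),
        },
    }

block = pdf_content_block(b"%PDF-1.4 minimal example")
```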

      No, I first and foremost mean the more focused tools that generalize across LLMs. Taking Aider as an example: as far as I can tell, it doesn't have any special handling for Anthropic, meaning it doesn't use XML tags to mark up the repo map structure, or to demarcate the file content and code snippets it sends, or to let the LLM demarcate diffs in its reply, etc. It does its own model-agnostic thing, which means that when using Claude 3.5 Sonnet, I lose out on a performance boost the tool simply isn't taking advantage of.

      I singled out Aider, but there are plenty of tools and plugins out there that build on common LLM portability libraries and end up treating every LLM the same way. The portability libraries, however, are not the place to solve this - by their nature, they target the lowest common denominator. The specialized tools should be doing it, IMO, and it's not even much work - it's a bunch of model-based if/elses. Might not look pretty, but it's not a maintenance burden.
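      To illustrate, the entire vendor-specific branch for marking up a file could be as small as this (a hypothetical helper with made-up vendor strings, not code from Aider or any other tool mentioned):

```python
# Sketch: wrap file content per-vendor before stuffing it into a
# prompt. Anthropic's docs recommend XML-tag demarcation; other
# vendors get a generic plain-text fallback. Vendor names and the
# chosen markers are illustrative assumptions.

def format_file(vendor: str, path: str, content: str) -> str:
    if vendor == "anthropic":
        # Claude models are documented to respond better to XML tags
        return f'<file path="{path}">\n{content}\n</file>'
    # lowest-common-denominator fallback for every other vendor
    return f"BEGIN FILE {path}\n{content}\nEND FILE"

print(format_file("anthropic", "main.py", "print('hi')"))
```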

That matches my experience; Claude is clearly ahead of its competitors in anything logic- or reasoning-based.

I find Gemini is better at queries that involve more intuitive judgment - things where there isn't a clear "correct" answer. E.g. if I want a podcast recommendation, or advice on the best place to learn about a given problem, I find Gemini better than Claude.

Unfortunately for Gemini, 90% of the things I want an LLM for are better with stronger logic and reasoning.

I got a 1-year trial of Gemini Advanced with my Pixel 9 and I've had similar experiences. It makes stuff up far more often than any other model, and it's just not very smart. I used the free version and thought the paid Advanced version would be better, but I can hardly notice any difference; they both fail at the same prompts I've tried.

This is not to mention the poor app experience, where some of the features are just missing or broken. For example, it's able to "remember" stuff I ask it to remember, but when I ask it to forget something, it says I have to manage that at a webpage (they didn't bother to implement this menu within the mobile app). That page asks me to sign in again, because it opens in my web browser where I'm not signed into Google, and then it shows me an empty list and "Something went wrong". It's now calling me a name I told it as a joke, and there's no way to make it forget.