Comment by lelanthran
14 hours ago
> In tabulating the “errors” I saw the most astounding result I have ever seen from an LLM, one that made the hair stand up on the back of my neck. Reading through the text, I saw that Gemini had transcribed a line as “To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1”. If you look at the actual document, you’ll see that what is actually written on that line is the following: “To 1 loff Sugar 145 @ 1/4 0 19 1”. For those unaware, in the 18th century sugar was sold in a hardened, conical form and Mr. Slitt was a storekeeper buying sugar in bulk to sell. At first glance, this appears to be a hallucinatory error: the model was told to transcribe the text exactly as written but it inserted 14 lb 5 oz which is not in the document.
I read the blog author's whole reasoning after that, but I still gotta know: how can we tell that this was not a hallucination and/or error? A guess has roughly a one-in-three chance of landing on the right grouping (1 lb 45, 14 lb 5, or 145 lb), so why is the author so sure this was deliberate?
I feel a good way to test this would be to create an almost identical ledger entry, constructed so that the correct answer after reasoning (the way the author thinks the model reasoned) has completely different digits.
That way there'd be more confidence that the model actually reasoned rather than simply made an error.
Yes, and as the article itself notes, the page image has more than just "145" - there's a "u"-like symbol over the 1, which the model is either failing to notice or perhaps recognizes from training as indicating pounds.
The article's assumption about how the model ended up "transcribing" "1 loaf of sugar u/145" as "1 loaf of sugar 14lb 5oz" seems very speculative. It seems more reasonable to assume that a massive frontier model knows something about loaves of sugar and their weight range; in fact, Google Search's "AI Overview" for "how heavy is a loaf of sugar" says the common size is approximately 14 lb.
There’s also a clear extra space between the 4 and 5, so figuring out to group it as “not 1 45, nor 145 but 14 5” doesn’t seem worthy of astonishment.
If I ask a model to transcribe something exactly and it outputs an interpretation, that is an error and not a success.
The author already notes that a correction still counts as an error in the context of this task.
I implemented a receipt-to-Google-Sheets scanner using Gemini Flash.
The fact that it is "intelligent" is fine for some things.
For example, I created a structured output schema with a "currency" field in the three-letter format (USD, EUR, ...). When I scanned a receipt from a shop in Jakarta, it filled that field with IDR (Indonesian rupiah); it inferred that from the city name on the receipt.
Would it have been better for my use case if it had returned no data for the currency field? I don't think so.
Note: if needed, I could probably have changed the prompt to not infer the currency when it isn't explicitly listed on the receipt.
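A minimal sketch of that kind of setup, assuming the google-generativeai Python SDK; the field names, model name, and prompt here are illustrative, not the commenter's actual code, and the Google Sheets step is omitted:

```python
# Sketch of a receipt scanner using Gemini structured output.
# Field names, model name, and prompt wording are illustrative assumptions.
import os

import PIL.Image
import typing_extensions as typing
import google.generativeai as genai


class Receipt(typing.TypedDict):
    merchant: str
    total: float
    currency: str  # three-letter ISO 4217 code, e.g. "USD", "EUR", "IDR"


genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

receipt_image = PIL.Image.open("receipt.jpg")
result = model.generate_content(
    [
        # The second sentence is the kind of prompt tweak mentioned above:
        # only fill in the currency if it is actually printed on the receipt.
        "Extract the fields from this receipt. Leave currency empty if it "
        "is not explicitly listed on the receipt.",
        receipt_image,
    ],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=Receipt,
    ),
)
print(result.text)  # JSON conforming to the Receipt schema
```

Whether to let the model infer the currency or leave it blank then becomes a prompt (and review-workflow) decision rather than a schema one.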
> Would it have been better for my use case if it had returned no data for the currency field? I don't think so.
If there's a decent chance it infers the wrong currency, potentially one whose units are worth a few orders of magnitude more or less than IDR, it might be better not to infer it at all.
I think most tools in this space follow the "infer a bunch of data and show it to the user for confirmation" pattern, which lowers the pain of a miss here.
> Would it have been better for my use case if it had returned no data for the currency field?
Almost certainly yes.
Except in setups where you always check its work, and the effort from the 5% of the time you have to correct the currency is vastly outweighed by the effort saved the other 95% of the time. Pretty common situation.
[flagged]
The comment above seems to violate several HN guidelines. Curious, I asked GPT and Gemini which ones stood out. Both replied with the same top three:
https://news.ycombinator.com/newsguidelines.html
They are:
1. “Be kind. Don't be snarky. … Edit out swipes.”
2. “Please don't sneer, including at the rest of the community.”
3. “Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.”
I'd be interested in seeing these guidelines updated to include "don't re-post the output of an LLM" to reduce comments of this sort.
I don't really feel like comments whose primary substance is LLM output meet the bar of "thoughtful and substantive", and (ironically, in this instance) this one could actually be used as a good example of a shallow dismissal, since you, a human, didn't actually provide an opinion or take a stance either way that I could use to begin a good-faith engagement on the topic.
genuinely, why is your response to being curious to ask two different LLMs to interpret it for you?
the list of guidelines has 18 items in it. did you actually need them to interpret it? or is it, perhaps, that you couldn't resist a little sneering yourself?