A new Google model is nearly perfect on automated handwriting recognition

4 days ago (generativehistory.substack.com)

> In tabulating the “errors” I saw the most astounding result I have ever seen from an LLM, one that made the hair stand up on the back of my neck. Reading through the text, I saw that Gemini had transcribed a line as “To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1”. If you look at the actual document, you’ll see that what is actually written on that line is the following: “To 1 loff Sugar 145 @ 1/4 0 19 1”. For those unaware, in the 18th century sugar was sold in a hardened, conical form and Mr. Slitt was a storekeeper buying sugar in bulk to sell. At first glance, this appears to be a hallucinatory error: the model was told to transcribe the text exactly as written but it inserted 14 lb 5 oz which is not in the document.

I read the blog author's whole reasoning after that, but I still gotta know - how can we tell that this was not a hallucination and/or error? A random guess among the plausible readings (1 lb 45 oz, 14 lb 5 oz, or 145 lb) would be correct about a third of the time, so why is the author so sure that this was deliberate?

I feel a good way to test this would be to create an almost identical ledger entry, constructed so that the correct answer after reasoning (the way the author thinks the model reasoned) has completely different digits.

This way there'd be more confidence that the model itself reasoned and did not make an error.

  • Yes, and as the article itself notes, the page image has more than just "145" - there's a "u"-like symbol over the 1, which the model is either failing to notice, or perhaps is something it recognizes from training as indicating pounds.

    The article's assumption of how the model ended up "transcribing" "1 loaf of sugar u/145" as "1 loaf of sugar 14lb 5oz" seems very speculative. It seems more reasonable to assume that a massive frontier model knows something about loaves of sugar and their weight range, and in fact Google search's "AI overview" of "how heavy is a loaf of sugar" says the common size is approximately 14lb.

    • There’s also a clear extra space between the 4 and 5, so figuring out to group it as “not 1 45, nor 145 but 14 5” doesn’t seem worthy of astonishment.

  • If I ask a model to transcribe something exactly and it outputs an interpretation, that is an error and not a success.

  • I implemented a receipt scanner that feeds a Google Sheet, using Gemini Flash.

    The fact that it is "intelligent" is fine for some things.

    For example, I created a structured output schema that had a "currency" field with the 3-letter format (USD, EUR...). So I scanned a receipt from some shop in Jakarta and it filled that field with IDR (Indonesian Rupiah). It inferred that data because of the city name on the receipt.

    Would it be better for my use case if it had returned no data for the currency field? I don't think so.

    Note: if needed, maybe I could have changed the prompt to not infer the currency when it's not explicitly listed on the receipt.
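
    For reference, a structured output schema with a currency field like that might look roughly as follows (a minimal sketch: field names are illustrative, and the exact Gemini structured-output call is omitted since it varies by SDK version):

      # Illustrative JSON-Schema-style definition; field names are hypothetical.
      # Pass it as the response schema in whichever Gemini SDK call you use.
      RECEIPT_SCHEMA = {
          "type": "object",
          "properties": {
              "merchant": {"type": "string"},
              "total": {"type": "number"},
              # ISO 4217 three-letter code; the model may infer this from context
              # (e.g. a Jakarta address -> IDR) even when it is not printed on the receipt.
              "currency": {"type": "string"},
          },
          "required": ["merchant", "total"],
      }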

    • > Would it be better for my use case if it had returned no data for the currency field? I don't think so.

      If there’s a decent chance it infers the wrong currency, potentially one where the value of each unit is a few units of scale larger or smaller than that of IDR, it might be better to not infer it.

      1 reply →

  • [flagged]

    • The comment above seems to violate several HN guidelines. Curious, I asked GPT and Gemini which ones stood out. Both replied with the same top three:

      https://news.ycombinator.com/newsguidelines.html

      They are:

      1. “Be kind. Don't be snarky. … Edit out swipes.”

      2. “Please don't sneer, including at the rest of the community.”

      3. “Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.”

      4 replies →

I really hope they have because I’ve also been experimenting with LLMs to automate searching through old archival handwritten documents. I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge. It doesn’t help that they were often written in the field by semi-literate people who misused lots of words. Even the simplest accounts require quite a lot of detective work to decipher with subtle signals like that pound sign for the sugar loaf.

> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.

This I'm a lot more skeptical of. The linked twitter post just looks like something it would replicate via HTML/CSS/JS. What's the kernel look like?

  • >I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge

    Completely off topic, but out of curiosity, where are you reading these documents? As a Spaniard I’m kinda interested.

  • My language does not use Latin letters, but it does use separate letters. Is there a way to train some handwriting recognition on my own handwriting in my own language, such that it will be effective and useful? I mostly need to recognize text in PDF documents generated by writing on an e-ink tablet with an EMR stylus.

  • You are right to be skeptical.

    There are plenty of so-called Windows (or other) web "OS" clones.

    A couple of these were posted on HN this very year.

    Here is one example I googled that was also on HN: https://news.ycombinator.com/item?id=44088777

    This is not an OS as in emulating a kernel in JavaScript or WASM; this is making a web app that looks like the desktop of an OS.

    I have seen plenty of such projects, some of which mimic the Windows UI entirely; you can find them via Google.

    So this was definitely in the training data, and it is not as impressive as the blog post or the twitter thread make it out to be.

    The scary thing is that the replies in the twitter thread show no critical thinking at all and are impressed beyond belief; they think it coded a whole kernel and OS, made an interpreter for it, ported games, etc.

    I think this is the reason why some people are so impressed by AI: when you can only judge an app visually, or only by how you interact with it, and don't have the depth of knowledge to understand it, then for such people it works all the way and AI seems magical beyond comprehension.

    But all this is only superficial IMHO.

    • Every time a model is about to be released, there are a bunch of these hype accounts that spin up. I don't know if they get paid or if they spring up organically to farm engagement. The last time there was such hype for a model was "strawberry" (o1), then GPT-5, and both turned out to be meaningful improvements but nowhere near the hype.

      I don't doubt though that new models will be very good at frontend webdev. In fact this is explicitly one of the recent lmarena tasks so all the labs have probably been optimizing for it.

      2 replies →

    • It's always amusing when "an app like Windows XP" is considered hard or challenging somehow.

      Literally the most basic html/css, not sure why it is even included in benchmarks.

      3 replies →

    • > This I'm a lot more skeptical of. The linked twitter post just looks like something it would replicate via HTML/CSS/JS. What's the kernel look like?

    Thanks for this, I was almost convinced and about to re-think my entire perspective and experience with LLMs.

  • I'd love to find more info on this, but from what I can find it seems to be making webpages that look like those products, and seemingly can "run python" or "emulate a game" - but writing something that, based on all of GitHub, can approximate an iPhone or an emulator in JavaScript/CSS/HTML is very very very different from writing an OS.

  • > What's the kernel look like?

    Those clones are all HTML/CSS, same for game clones made by Gemini.

  • Oh! That's a nice use-case and not too far from stuff I have been playing with! (happily I do not have to deal with handwriting, just bad scans of older newspapers and texts)

    I can vouch for the fact that LLMs are great at searching in the original language, summarizing key points to let you know whether a document might be of interest, then providing you with a translation where you need one.

    The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

    • > The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

      What does that look like? How well does it work?

      I ended up writing a research TUI with my own higher level orchestration (basically have the thing keep working in a loop until a budget has been reached) and document extraction.
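
      (For concreteness, the "keep working in a loop until a budget has been reached" part can be as simple as the sketch below; run_agent_step and its cost accounting are hypothetical stand-ins, not the actual TUI code.)

        # Hypothetical sketch of a budget-bounded agent loop.
        def run_agent_step(task, history):
            """Stand-in for one agent/tool step; returns (output, estimated cost in USD)."""
            return ("DONE" if history else f"worked on: {task}"), 0.01

        def run_until_budget(task, budget_usd=1.00):
            spent, results = 0.0, []
            while spent < budget_usd:
                output, cost = run_agent_step(task, results)
                results.append(output)
                spent += cost
                if output == "DONE":          # the agent signals it is finished
                    break
            return results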

      2 replies →

  • I'm skeptical that they're actually capable of making something novel. There are thousands of hobby operating systems and video game emulators on github for it to train off of so it's not particularly surprising that it can copy somebody else's homework.

    • I remain confused but still somewhat interested as to a definition of "novel", given how often this idea is wielded in the AI context. How is everyone so good at identifying "novel"?

      For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads as "novel", while b) an AI could be guaranteed not to be able to do the same, under the same standard.

      52 replies →

    • Doing something novel is incredibly difficult through LLM work alone. Dreaming, hallucinating, might eventually make novelty possible, but it has to be backed up by rock-solid base work. We aren't there yet.

      The working memory it holds is still extremely small compared to what we would need for regular open ended tasks.

      Yes there are outliers and I'm not being specific enough but I can't type that much right now.

    • I believe they can create a novel instance of a system from a sufficient number of relevant references - i.e. implement a set of already-known features without (much) code duplication. LLMs are certainly capable of this level of generalization due to their huge non-relevant reference set. Whether they can expand beyond that into something truly novel from a feature/functionality standpoint is a whole other, and less well-defined, question. I tend to agree that they are closed systems relative to their corpus. But then, aren't we? I feel like the aperture for true novelty to enter is vanishingly small, and cultures put a premium on it vis-a-vis the arts, technological innovation, etc. Almost every human endeavor is just copying and iterating on prior examples.

      14 replies →

    • Of course they can come up with something novel. They're called hallucinations when they do, and that's something that can't be in their training data, because it's not true/doesn't exist. Of course, when they do come up totally novel hallucinations, suddenly being creative is a bad thing to be "fixed".

  • I'm surprised people didn't click through to the tweet.

    https://x.com/chetaslua/status/1977936585522847768

    > I asked it for windows web os as everyone asked me for it and the result is mind blowing , it even has python in terminal and we can play games and run code in it

    And of course

    > 3D design software, Nintendo emulators

    No clue what these refer to, but to be honest it sounds like they've mostly just incrementally improved one-shotting capabilities. I wouldn't be surprised if Gemini 2.5 Pro could get a Gameboy or NES emulator working well enough to boot Tetris or Mario; while it is a decent chunk of code to get things going, there's an absolute boatload of code on the Internet, and the complexity is lower than you might imagine. (I have written a couple of toy Gameboy emulators from scratch myself.)

    Don't get me wrong, it is pretty cool that a machine can do this. A lot of work people do today just isn't that novel and if we can find a way to tame AI models to make them trustworthy enough for some tasks it's going to be an easy sell to just throw AI models at certain problems they excel at. I'm sure it's already happening though I think it still mostly isn't happening for code at least in part due to the inherent difficulty of making AI work effectively in existing large codebases.

    But I will say that people are a little crazy sometimes. Yes it is very fascinating that an LLM, which is essentially an extremely fancy token predictor, can one-shot a web app that is mostly correct, apparently without any feedback, like being able to actually run the application or even see editor errors, at least as far as we know. This is genuinely really impressive and interesting, and not the aspect that I think anyone seeks to downplay. However, consider this: even as relatively simple as an NES is compared to even moderately newer machines, to make an NES emulator you have to know how an NES works and even have strategies for how to emulate it, which don't necessarily follow from just reading specifications or even NES program disassembly. The existence of many toy NES emulators and a very large amount of documentation for the NES hardware and inner workings on the Internet, as well as the 6502, means that LLMs have a lot of training data to help them out.

    I think that these tasks, which are extremely well covered in the training data, give people unrealistic expectations. You could probably pick a simpler machine that an LLM would do significantly worse at, even though a human who knows how to write emulation software could definitely do it. Not sure what to pick, but let's say SEGA's VMU units for the Dreamcast - a very small, simple device, and I reckon there should be information about it online, but it's going to be somewhat limited. You might think, "But that's not fair. It's unlikely to be able to one-shot something like that without mistakes with so much less training data on the subject." Exactly. In the real world, that comes up. Not always, but often. If it didn't, programming would be an incredibly boring job. (For some people, it is, and these LLMs will probably be disrupting that...) That's not to say that AI models can never do things like debug an emulator or even do reverse engineering on their own, but it's increasingly clear that this won't emerge from strapping agents on top of transformers predicting tokens. But since there is a very large portion of work that is not very novel in the world, I can totally understand why everyone is trying to squeeze this model as far as it goes. Gemini and Claude are shockingly competent.

    I believe many of the reasons people scoff at AI are fairly valid even if they don't always come from a rational mindset, and I try to keep my usage of AI relatively tasteful. I don't like AI art, and I personally don't like AI code. I find the push to put AI in everything incredibly annoying, and I worry about the clearly circular AI market and overhyped expectations. I dislike the way AI training has ripped up the Internet, violated people's trust, and led to a more closed Internet. I dislike that sites like Reddit are capitalizing on all of the user-generated content that users submitted, which made them rich in the first place, just to crap on them in the process.

    But I think that LLMs are useful, and useful LLMs could definitely be created ethically, it's just that the current AI race has everyone freaking the fuck out. I continue to explore use cases. I find that LLMs have gotten increasingly good at analyzing disassembly, though it varies depending on how well-covered the machine is in its training data. I've also found that LLMs can one-shot useful utilities and do a decent job. I had an LLM one-shot a utility to dump the structure of a simple common file format so I could debug something... It probably only saved me about 15-30 minutes, but still, in that case I truly believe it did save me time, as I didn't spend any time tweaking the result; it did compile, and it did work correctly.

    It's going to be troublesome to truly measure how good AI is. If you knew nothing about writing emulators, being able to synthesize an NES emulator that can at least boot a game may seem unbelievable, and to be sure it is obviously a stunning accomplishment from a PoV of scaling up LLMs. But what we're seeing is probably more a reflection of very good knowledge rather than very good intelligence. If we didn't have much written online about the NES or emulators at all, then it would be truly world-bending to have an AI model figure out everything it needs to know to write one on-the-fly. Humans can actually do stuff like that, which we know because humans had to do stuff like that. Today, I reckon most people rarely get the chance to show off that they are capable of novel thought because there are so many other humans that had to do novel thinking before them. Being able to do novel thinking effectively when needed is currently still a big gap between humans and AI, among others.

    • I think Google is going to repeat history with Gemini... as in, ChatGPT, Grok, etc. will be like AltaVista, Lycos, etc.

  • I'm skeptical because my entire identity is basically built around being a software engineer and thinking my IQ and intelligence is higher than other people. If this AI stuff is real then it basically destroys my entire identity so I choose the most convenient conclusion.

    Basically we all know that AI is just a stochastic parrot autocomplete. That's all it is. Anyone who doesn't agree with me is of lesser intelligence and I feel the need to inform them of things that are obvious: AI is not a human, it does not have emotions. It's just a search engine. Those people who are using AI to code and do things that are indistinguishable from human reasoning are liars. I choose to focus on what AI gets wrong, like hallucinations, while ignoring the things it gets right.

  • "> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts."

    Wow I'm doing it way wrong. How do I get the good stuff?

    • You're not.

      I want you to go into the kitchen and bake a cake. Please replace all the flour with baking soda. If it comes out looking limp and lifeless just decorate it up with extra layers of frosting.

      You can make something that looks like a cake but would not be good to eat.

      The cake, sometimes, is a lie. And in this case, so are likely most of these results... or they are the actual source code of some other project just regurgitated.

      12 replies →

I read the whole article, but have never tried the model. Looking at the input document, I believe the model saw enough of a space between the 14 and 5 to simply treat it that way. I saw the space too. Impressive, but it's a leap to say it saw 145 then used higher order reasoning to correct 145 to 14 and 5.

  • I also read the whole article, and this behaviour that the author is most excited about only happened once. For a process that inherently has some randomness about it, I feel it's too early to be this excited.

    • Yep. A lot of things looked magical in the GPT-4 days. Eventually you realised it did it by chance and more often than not got it wrong.

My task today for LLMs was "can you tell if this MRI brain scan is facing the normal way", and the answer was: no, absolutely not. Opus 4.1 succeeds more than chance, but still not nearly often enough to be useful. They all cheerfully hallucinate the wrong answer, confidently explaining the anatomy they are looking for, but wrong. Maybe Gemini 3 will pull it off.

Now, Claude did vibe code a fairly accurate solution to this using more traditional techniques. This is very impressive on its own, but I'd hoped to be able to just shovel the problem into the VLM and be done with it. It's kind of crazy that we have "AIs" that can't tell even roughly what the orientation of a brain scan is - something a five year old could probably learn to do - but can vibe code something using traditional computer vision techniques to do it.

I suppose it's not too surprising, a visually impaired programmer might find it impossible to do reliably themselves but would code up a solution, but still: it's weird!

  • Most models don’t have good spatial information from the images. Gemini models do preprocessing and so are typically better for that. It depends a lot on how things get segmented though.

  • But these models are more like generalists no? Couldn’t they simply be hooked up to more specialized models and just defer to them the way coding agents now use tools to assist?

    • There would be no point in going via an LLM then, if I had a specialist model ready I'd just invoke it on the images directly. I don't particularly need or want a chatbot for this.

  • That's a fairly unfair comparison. Did you include in the prompt a basic set of instructions about which way is "correct" and what to look for?

    • I didn't give a detailed explanation to the model, but I should have been more clear: they all seemed to know what to look for, they wrote explanations of what they were looking for, which were generally correct enough. They still got the answer wrong, hallucinating the locations of the anatomical features they insisted they were looking at.

      It's something that you can solve by just treating the brain as roughly egg-shaped and working out which way the pointy end is, or looking for the very obvious bilateral symmetry. You don't really have to know what any of the anatomy actually is.
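
      Something in that spirit takes only a few lines of classical image processing - a rough sketch, assuming the slice is already a 2D numpy array (this is not the code Claude produced):

        import numpy as np

        def long_axis_angle(slice_2d, rel_threshold=0.1):
            """Estimate the angle of the brain's long axis via PCA of the foreground mask."""
            mask = slice_2d > rel_threshold * slice_2d.max()    # crude "egg" mask
            ys, xs = np.nonzero(mask)
            pts = np.stack([xs, ys], axis=1).astype(float)
            pts -= pts.mean(axis=0)                             # center the point cloud
            eigvals, eigvecs = np.linalg.eigh(np.cov(pts.T))
            major = eigvecs[:, np.argmax(eigvals)]              # direction of the long axis
            return np.degrees(np.arctan2(major[1], major[0]))   # near ±90° if the nose points up or down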

  • This might be showing bugs in the training data. It is common to augment image data sets with mirroring, which is cheap and fast.
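
    If the orientation-dependent labels aren't flipped along with the pixels, that augmentation silently corrupts them - a generic illustration, not anything known about these models' pipelines:

      import numpy as np

      image = np.zeros((4, 4))
      image[0, 0] = 1.0                    # toy "scan" with a marker on the left side
      label = "marker on the left"
      augmented = np.fliplr(image)         # cheap mirroring augmentation
      # Unless the label is updated too, (augmented, label) is now simply wrong,
      # and a model trained on such pairs learns that left/right is unreliable.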

  • And then, in a different industry, one that has physical factories, there's this obsession with getting really good at making the machine that makes the machine (the product) as the route to success. So it's funny that LLMs being able to write programs to do the thing you want is seen as a failure here.

  • What is the “normal” way? Is that defined in a technical specification? Did you provide the definition/description of what you mean by “normal”?

    I would not have expected a language model to perform well on what sounds like a computer vision problem. Even if it was agentic: you imply that a five year old could learn how to do it, but so too would an AI system need to be trained, or at the very least be provided with a description of what it is looking at.

    Imagine you took an MRI brain scan back in time and showed it to a medical Doctor in even the 1950s or maybe 1900. Do you think they would know what the normal orientation is, let alone what they are looking at?

    I am a bit confused and also interested in how people are interacting with AI in general, it really seems to have a tendency to highlight significant holes in all kinds of human epistemological, organizational, and logical structures.

    I would suggest maybe you think of it as a kind of child, and with that, you would need to provide as much context and exact detail about the requested task or information as possible. This is what context engineering (are we still calling it that?) concerns itself with.

    • The models absolutely do know what the standard orientation is for a scan. They respond extensively about what they're looking for and what the correct orientation would be, more or less accurately. They are aware.

      They then give the wrong answer, hallucinating anatomical details in the wrong place, etc. I didn't bother with extensive prompting because it doesn't evince any confusion on the criteria, it just seems to not understand spatial orientations very well, and it seemed unlikely to help.

      The thing is that it's very, very simple: an axial slice of a brain is basically egg-shaped. You can work out whether it's pointing vertically (i.e., nose pointing towards the top of the image) or horizontally just by looking at it. LLMs will insist it's pointing vertically when it isn't. It's an easy task for someone with eyes.

      Essentially all images an LLM will have seen of brains will be in this orientation, which is either a help or a hindrance, and I think in this case a hindrance- it's not that it's seen lots of brains and doesn't know which are correct, it's that it has only ever seen them in the standard orientation and it can't see the trees for the forest, so to speak.

I haven’t seen this new google model but now must try it out.

I will say that other frontier models are starting to surprise me with their reasoning/understanding- I really have a hard time making (or believing) the argument that they are just predicting the next word.

I’ve been using Claude Code heavily since April; Sonnet 4.5 frequently surprises me.

Two days ago I told the AI to read all the documentation from my 5 projects related to a tool I’m building, and create a wiki, focused on audience and task.

I'm hand reviewing the 50 wiki pages it created, but overall it did a great job.

I got frustrated about one issue: I have a GitHub issue to create a way to integrate with issue trackers (like Jira), but it's TODO, and the AI advertised on the home page that we had issue tracker integration. It created a page for it and everything; I figured it was hallucinating.

I went to edit the page and replace it with placeholder text and was shocked that the LLM had (unprompted) figured out how to use existing features to integrate with issue trackers, and wrote sample code for GitHub, Jira and Slack (notifications). That truly surprised me.

  • Predicting the next word is the interface, not the implementation.

    (It's a pretty constraining interface though - the model outputs an entire distribution and then we instantly lose it by only choosing one token from it.)
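
    A toy illustration of that (illustrative numbers, not any real model's vocabulary):

      import numpy as np

      rng = np.random.default_rng(0)
      logits = np.array([2.0, 1.0, 0.5, -1.0])        # scores over a tiny 4-token vocabulary
      probs = np.exp(logits) / np.exp(logits).sum()   # softmax: the full distribution
      token = rng.choice(len(probs), p=probs)         # sample one token; the rest is discarded
      print(probs.round(3), "->", token)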

  • > I really have a hard time making (or believing) the argument that they are just predicting the next word.

    It's true, but by the same token our brain is "just" thresholding spike rates.

  • Predicting the next word requires understanding, they're not separate things. If you don't know what comes after the next word, then you don't know what the next word should be. So the task implicitly forces a more long-horizon understanding of the future sequence.

    • This is utterly wrong. Predicting the next word requires a large sample of data made into a statistical model. It has nothing to do with "understanding", which implies it knows why rather than what.

      15 replies →

    • > Predicting the next word requires understanding

      If we were talking about humans trying to predict next word, that would be true.

      There is no reason to suppose than an LLM is doing anything other than deep pattern prediction pursuant to, and no better than needed for, next word prediction.

      23 replies →

I will note that the 2.5 Pro preview… March? was maybe the best model I've used yet. The actual release model was… less. I suspect Google found the preview too expensive and optimized it down, but it was interesting to see there was some hidden horsepower there. Google has always been poised to be the AI leader/winner - excited to see if this is fluff, the real deal, or another preview that gets nerfed.

  • Dunno if you're right, but I'd like to point out that I've been reading comments like these about every model since GPT 3. It's just starting to seem more likely to me to be a cognitive bias than not.

    • I haven't noticed the effect of things getting worse after a release, but 2.5's abilities definitely got worse. Or perhaps they optimized for something else? But I haven't noticed the usual "things got worse after release!" effect, except for when Sonnet had a bug for a month and GPT-5's autorouter broke.

      1 reply →

    • Sometimes it is just bias, but 2.5 Pro had benchmarks showing the degradation (plus they changed the name every time, so it was obviously a different checkpoint or model).

    • Why would you assume cognitive bias? Any evidence? These things are indeed very expensive to run, and are often run at a loss. Wouldn't quantization or other tuning be just as reasonable an answer as cognitive bias? It's not like we are talking about reptilian aliens running the White House.

      2 replies →

  • I noticed the degradation when Gemini stopped being a good research tool, and made me want to strangle it on a daily basis.

    It's incredibly frustrating to have a model start to hallucinate sources and be incapable of revisiting its behavior.

    Couldn't even understand that it was making up non-sensical RFC references.

Am I missing something here? Colonial merchant ledgers and 18th-century accounting practices have been extensively digitized and discussed in academic literature. The model has almost certainly seen examples where these calculations are broken down or explained. It could be interpolating from similar training examples rather than "reasoning."

  • The author claims that they tried to avoid that: "[. . .] we had to choose them carefully and experiment to ensure that these documents were not already in the LLM training data (full disclosure: we can’t know for sure, but we took every reasonable precaution)."

    • Even if that specific document wasn't in the training data, there could be many similar documents from others at the time.

It seems like a leap to assume it has done all sorts of complex calculations implicitly.

I looked at the image and immediately noticed that it is written as “14 5” in the original text. It doesn’t require calculation to guess that it might be 14 pounds 5 ounces rather than 145. Especially since presumably, that notation was used elsewhere in the document.

> So that is essentially the ceiling in terms of accuracy.

I think this is mistaken. I remember... ten years ago? When speech-to-text models came out that dealt with background noise that made the audio sound very much like straight pink noise to my ear, but the model was able to transcribe the speech hidden within at a reasonable accuracy rate.

So with handwritten text, the only prediction that makes sense to me is that we will (potentially) reach a state where the machine is at least probably more accurate than humans, although we wouldn't be able to confirm it ourselves.

But if multiple independent models, say, Gemini 5 and Claude 7, both agree on the result, and a human can only shrug and say, "might be," then we're at a point where the machines are probably superior at the task.
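
One crude way to operationalize that "independent models agree" test is to measure word-level agreement between their transcriptions. A minimal sketch, with placeholder strings (the disputed ledger line) standing in for real model output:

    import difflib

    def word_agreement(a: str, b: str) -> float:
        """Fraction of word-level agreement between two transcriptions (1.0 = identical)."""
        return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

    # Placeholder transcriptions standing in for two independent models' output.
    model_a = "To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1"
    model_b = "To 1 loff Sugar 145 @ 1/4 0 19 1"
    print(word_agreement(model_a, model_b))   # high, but not 1.0: they disagree on the weight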

  • That depends on how good we get at interpretability. If the models can not only do the job but also are structured to permit an explanation of how they did it, we get the confirmation. Or not, if it turns out that the explanation is faulty.

I’ve seen those A/B choices on Google AI Studio recently, and there wasn’t a substantial difference between the outputs. It felt more like a different random seed for the same model.

Of course it’s very possible my use case wasn’t terribly interesting so it wouldn’t reveal model differences, or that it was a different A/B test.

  • For me they've been very similar, except in one case where I corrected it and on one side it doubled down on being objectively wrong, and on the other side it took my feedback and started over with a new line of thinking.

I've been complaining on HN for some time now that my only real test of an LLM is whether it can help my poor wife with her research; she spends all day every day in small town archives poring over 18th century American historical documents. I thought maybe that day had come, but I showed her the article and she said "good for him, I'm still not transcribing important historical documents with a chat bot and nor should he" - ha. If you wanna play around with some difficult stuff, here are some images from her work I've posted before: https://s.h4x.club/bLuNed45

  • People have had spotty access to this model (Gemini 3 Pro) for brief periods over the past few weeks, but it's strongly expected to be released next week, and definitely by year end.

    • Oh, I didn't realize this wasn't 2.5 Pro (I skimmed, sorry) - I also haven't had time to run some of her docs on 5.1 yet, I should.

  • While it's of course a good thing to be critical, the author did provide some more context on the why and how of doing it with LLMs on the Hard Fork podcast today [0]: mostly as a way to see how these models _can_ help them with these tasks.

    I would recommend listening to their explanation, maybe it'll give more insight.

    Disclosure: After listening to the podcast and looking up and reading the article, I emailed @dang to suggest it go into the HN second chance pool. I'm glad more people enjoyed it.

    [0]: https://www.nytimes.com/2025/11/14/podcasts/hardfork-data-ce...

  • > ...of an LLM is that it can help my poor wife with her research, she spends all day every day in small town archives pouring over 18th century American historical documents.

    > I'm still not transcribing important historical documents with a chat bot and nor should he

    Doesn't sound like she's interested in technology, or wants help.

  • It doesn't have to be perfect to be useful. If it does a decent job, then your wife reviews and edits; that will be much faster than doing the whole thing by hand. The only question is whether she can stay committed to perfection. I don't see the downside of trying it unless she's worried about getting lazy.

    • I raised this point with her, she said there are times it would be ambiguous for both her and the model, and she thinks it would be dangerous for her to be influenced by it. I'm not a professional historical researcher so I'm not sure if her concern is valid or not.

      4 replies →

I think the author has become a bit too enthusiastic. "Emergent capabilities" becomes code for: unexpectedly good results that are statistical serendipity, but that I prefer to interpret as some hidden capability in a model I can't resist anthropomorphizing.

If it can read ancient handwriting, it will be a revolution for historians' work.

My wife is a historian and she is trained to recognize old handwriting. When we go to museums she "translates" the texts for the family.

This is exciting news, as I have some elegantly scribed family diaries from the 1800s that I can barely read (:

With that said, the writing here is a bit hyperbolic, as the advances seem like standard improvements, rather than a huge leap or final solution.

  • The statistics in the article rest on too few samples to draw a definitive conclusion, but expert-level WER looks like a huge leap.

The thinking models (especially OpenAI's o3) still seem to do by far the best at this task: when they run into confusing words, they look across the document to see how the writer wrote certain letters in places where the word is clearer.

I built a whole product around this: https://DocumentTranscribe.com

But I imagine this will keep getting better and that excites me since this was largely built for my own research!

  • I find Gemini 2.5 Pro, not Flash, way better than the ChatGPT models. I don't remember testing o3 though. Maybe it's o3-pro, one of the old, costly thinking models?

This might just be a handcrafted prompt framework for handwriting recognition tied in with reasoning - do a rough pass, make assumptions and predictions, check assumptions and predictions, if they pass, use the degree of confidence in their passage to inform what the other characters might be, and gradually flesh out an interpretation of what was intended to be communicated.

If they could get this to occur naturally - with no supporting prompts, and only one-shot or one-shot reasoning, then it could extend to complex composition generally, which would be cool.

  • I don't see how this performance could be anything like that. There is no way that Google included specialized system prompts with anything to do with converting shillings to pounds in their model.

Is anyone aware of any benchmark evaluation for handwriting recognition? I have not been able to find one, myself — which is somewhat surprising.

Author says "It is the most amazing thing I have seen an LLM do, and it was unprompted, entirely accidental." and then jumps back to the "beginning of the story". Including talking about a trip to Canada.

Skip to the section headed "The Ultimate Test" for the resolution of the clickbait of "the most amazing thing...". (According to him, it correctly interpreted a line in an 18th century merchant ledger using maths and logic)

  • The new model may or may not be great at handwriting but I found the author's constant repetition about how amazing it was irritating enough to stop reading and to wonder if the article itself was slop-written.

    "users have reported some truly wild things" "the results were shocking" "the most amazing thing I have seen an LLM do" "exciting and frightening all at once" "the most astounding result I have ever seen" "made the hair stand up on the back of my neck"

I dunno man, looks like Goodhart's law in action to me. That isn't to say the models won't be good at what is stated, but it does mean it might not signal a general improvement in competence but rather a targeted gain, with more general deficits rising up in untested/ignored areas, some of which may or may not be catastrophic. I guess we will see, but for now Imma keep my hype in the box.

I just used AI Studio for recognizing text from a relative's 60-day log of food ingested 3 times a day. I think I am using models/gemini-flash-latest and it was shockingly good at recognizing text, far better than ChatGPT 5.1 or Claude's Sonnet (IIRC it's 4.5) model.

https://pasteboard.co/euHUz2ERKfHP.png

Its response, which I have captured here https://pasteboard.co/sbC7G9nuD9T9.png, is shockingly good. I could only spot 2 mistakes. And those seem to have been the ones that even I could not read, or found very difficult to make out.

  • I basically fed it all 60 images 5 at a time and made a table out of them to correlate sugar levels <-> food and colocate it with the person's exercise routines. This is insane.
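
    (The chunking itself is trivial; here's a sketch with a hypothetical transcribe_batch standing in for the actual Gemini call, and made-up filenames:)

      def transcribe_batch(paths):
          """Stand-in for the real model call on a batch of images."""
          return [f"transcribed {p}" for p in paths]

      image_paths = [f"log_day_{i:02d}.png" for i in range(60)]   # hypothetical filenames
      rows = []
      for i in range(0, len(image_paths), 5):                     # 5 images per request
          rows.extend(transcribe_batch(image_paths[i:i + 5]))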

Rgd the "14 lb 5 oz" point in the article: a simpler explanation than the hypothesis there (that it back-calculated the weight) is that there seems to be a space between the 14 and the 5 - i.e. it reads more like "14 5" than "145".

The question I always have to ask about OCR in a professional context is: which digits of the numbers is it allowed to get wrong?

Could it be guessing via orders of magnitude? Like, 145 lb @ 1/4 is with high confidence not the answer, and 1 lb 45 oz is non-standard notation since 1 lb = 16 oz - so it's most likely 14 lb 5 oz.
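
For what it's worth, the arithmetic does check out under the "14 lb 5 oz" reading, assuming "1/4" means 1 shilling 4 pence per pound (a quick sanity check, not code from the article):

    # Pre-decimal sterling: 12 pence = 1 shilling, 20 shillings (240 pence) = 1 pound.
    weight_lb = 14 + 5 / 16                               # 14 lb 5 oz, at 16 oz to the pound
    price_pence_per_lb = 1 * 12 + 4                       # "1/4" = 1s 4d = 16 pence per lb
    total_pence = round(weight_lb * price_pence_per_lb)   # 229 pence
    pounds, rest = divmod(total_pence, 240)
    shillings, pence = divmod(rest, 12)
    print(pounds, shillings, pence)                       # -> 0 19 1, matching the ledger's total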

Gemini 2.5 PRO is already incredibly good in handwritten recognition. It makes maybe one small mistake every 3 pages.

It has completely changed the way I work, and it allows me to write math and text and then convert it with the Gemini app (or with a scanned PDF in the browser). You should really try it.

> it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts

> As is so often the case with AI, that is exciting and frightening all at once

> we need to extrapolate from this small example to think more broadly: if this holds the models are about to make similar leaps in any field where visual precision and skilled reasoning must work together required

> this will be a big deal when it’s released

> What appears to be happening here is a form of emergent, implicit reasoning, the spontaneous combination of perception, memory, and logic inside a statistical model

> model’s ability to make a correct, contextually grounded inference that requires several layers of symbolic reasoning suggests that something new may be happening inside these systems—an emergent form of abstract reasoning that arises not from explicit programming but from scale and complexity itself

Just another post with extreme hyperbolic wording to blow up another model release. How many times have we seen such non-realistic build up in the past couple of years.

I much prefer this tone about improvements in AI over the doomerism I constantly read. I was waiting for a twist where the author changed their minds and suddenly went "this is the devil's technology" or "THEY T00K OUR JOBS" but it never happened. Thank you for sharing, it felt like breathing for the first time in a long time.

No, just another academic with the ominous handle @generativehistory that is beguiled by "AI". It is strange that others can never reproduce such amazing feats.

  • I don't know if I'd call it an 'amazing feat', but claude had me pause for a moment recently.

    Some time ago, I'd been working on a framework that involved a series of servers (not the only one I've talked to claude about) that had to pass messages around in a particular fashion. Mostly technical implementation details and occasional questions about architecture.

    Fast forward a ways, and on a lark I decided to ask in the abstract about the best way to structure such an interaction. Mark that this was not in the same chat or project and didn't have any identifying information about the original, save for the structure of the abstraction (in this case, a message bus server and some translation and processing services, all accessed via client.)

    so:

    - we were far enough removed that the whole conversation pertaining to the original was for sure not in the context window

    - we only referred to the abstraction (with like a A=>B=>C=>B=>A kind of notation and a very brief question)

    - most of the work on the original was in claude code

    and it knew. In the answer it gave, it mentioned the project by name. I can think of only two ways this could have happened:

    - they are doing some real fancy tricks to cram your entire corpus of chat history into the current context somehow

    - the model has access to some kind of fact database where it was keeping an effective enough abstraction to make the connection

    I find either one mindblowing for different reasons.

What an unnecessarily wordy article. It could have been a fifth of the length. The actual point is buried under pages and pages of fluff and hyperbole.

  • I would just suggest that if you want your comment to be more helpful than the article that you're critiquing, you might want to actually quote the part which you believe is "The actual point".

    Otherwise you are likely to have people agreeing with you, while they actually had a very different point that they took away.

  • Yes, I agree, and it seems like the author has a naïve experience with LLMs, because what he's talking about is kind of the bread and butter as far as I'm concerned.

    • Indeed. To me, it has long been clear that LLMs do things that, at the very least, are indistinguishable from reasoning. The already classic examples where you make them do world modeling (I put an ice cube into a cup, put the cup in a black box, take it into the kitchen, etc... where is the ice cube now?) invalidate the stochastic parrot argument.

      But many people in the humanities have read the stochastic parrot argument, it fits their idea of how they prefer things to be, so they take it as true without questioning much.

      1 reply →

Substack: When you have nothing to say and all day to say it.

  • “This AI did something amazing but first I’m going to put in 72 paragraphs of details only I care about.”

    I was thinking as I skimmed this it needs a “jump to recipe” button.

  • It was an embarrassing read. I should ask an LLM to read it, since he probably wrote it the same way.

Reading HN comments just makes me realize how vastly LLMs exceed human intelligence.