Comment by borroka
25 days ago
I am developing a web application for a dictionary that translates words from the national language into the local dialect.
Vibe coding, together with other tools such as Google Vision, helped me download the images published online, compile them into a PDF, run OCR (Tesseract and Google Vision), and save everything as plain text.
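As a rough illustration of the OCR step, here is a minimal sketch using pytesseract; the folder name, output file, and language code are placeholders, not my actual setup.

    # Minimal OCR sketch: run Tesseract over a folder of page images
    # and concatenate the results into a single raw text file.
    from pathlib import Path

    from PIL import Image
    import pytesseract

    pages = sorted(Path("scans").glob("*.png"))  # placeholder image folder
    with open("dictionary_raw.txt", "w", encoding="utf-8") as out:
        for page in pages:
            # "ita" is a placeholder; substitute the Tesseract language
            # model for the dictionary's actual source language.
            out.write(pytesseract.image_to_string(Image.open(page), lang="ita"))
            out.write("\n")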
The OCR output was satisfactory for a first draft, but the text file has a lot of errors, as you'd expect for a dictionary with about 30,000 entries: diacritical marks disappear, along with typographical marks and dashes; lines are shifted up or down; and parts of speech (POS) are mangled in so many different ways that the wrong POS tags have to be identified one by one.
If the reasoning abilities of LLM-derived coding agents were as advanced as some claim, it would be possible for the LLM to derive the rules that must be applied to the entire dictionary from a sufficiently large set of “gold standard” examples.
If only that were the case. Every general rule the agent applies creates new errors that propagate throughout the text, so for every problem partially solved, two more emerge. What is evident to me is not clear to the LLM: the editing work is simple for me to do manually, albeit long and tedious.
To give an example: if trans.v. indicates a transitive verb, it is clear to me that .trans.v. is a typographical error. I can tell the coding tool (I used Gemini, Claude, and Codex, with Codex being the best) that, given a standard POS, if a “.” precedes it, the dot must be deleted because it is a typo. The generalization that comes easily to me but not to the coding agent is that if two periods precede the POS rather than one, there are two typos, so it should not delete just one of the two dots.
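To make that concrete, here is a minimal sketch of the generalization I had in mind, in Python; the POS list and the regex are illustrative, not the dictionary's actual tag set.

    import re

    # Illustrative subset of the standard POS tags; the real dictionary has more.
    POS_TAGS = ["trans.v.", "intr.v.", "n.", "adj.", "adv."]

    # Remove ANY run of stray leading periods before a known POS tag,
    # so ".trans.v." and "..trans.v." both become "trans.v.".
    alternation = "|".join(re.escape(tag) for tag in POS_TAGS)
    stray_dots = re.compile(rf"\.+(?=(?:{alternation}))")

    def clean_pos(line: str) -> str:
        return stray_dots.sub("", line)

    print(clean_pos("..trans.v. to carry"))  # -> "trans.v. to carry"

In practice even this needs guards (for instance, not stripping a genuine sentence-final period that happens to sit before a POS), which is exactly the kind of context-dependence I was hoping the agent would infer on its own.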
This means that almost every rule has to be specified, whereas I expected the coding agent to generalize from the gigantic corpus on which it was trained (it should “understand” what the POS tags are, the typical typos, the language the dictionary is written in, etc.).
The transition from text to JSON to web app is almost miraculous, but what is still missing from the mix is human-level reasoning and common sense (to be clear, I still believe that coding agents are fantastic).
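For the text-to-JSON step, this is roughly the kind of record I end up with per entry; the field names and values are placeholders, not my actual schema.

    import json

    # Hypothetical shape of one cleaned dictionary entry; the field names
    # are placeholders, not the actual schema.
    entry = {
        "headword": "example-word",      # national-language headword
        "pos": "trans.v.",
        "dialect": "example-dialect",    # local-dialect translation
    }

    with open("dictionary.json", "w", encoding="utf-8") as f:
        json.dump([entry], f, ensure_ascii=False, indent=2)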