Comment by thangalin
15 hours ago
I created an NLP library to help curl straight quotes into curly quotes. Last I checked, LLMs struggled to curl the following straight quotation marks:
''E's got a 'ittle box 'n a big 'un,' she said, 'wit' th' 'ittle 'un 'bout 2'×6". An' no, y'ain't cryin' on th' "soap box" to me no mo, y'hear. 'Cause it 'tweren't ever a spec o' fun!' I says to my frien'.
The library is integrated into my Markdown editor, KeenWrite (https://keenwrite.com/), to correctly curl quotation marks into entities before passing them to ConTeXt for typesetting. While there are other ways to indicate opening and closing quotation marks, none are as natural to type in plain text as straight quotes. I would not trust an LLM to curl quotation marks accurately.
For the curious, you can try it at:
https://whitemagicsoftware.com/keenquotes/
If you find any edge cases that don't work, do let me know. The library correctly curls my entire novel. There are a few edge cases, however, that are completely ambiguous and would require semantic knowledge (part-of-speech tagging), which I haven't added. PoS tagging would be a heavy operation that could prevent real-time quote curling, for little practical gain.
The lexer, parser, and test cases are all open source.
https://gitlab.com/DaveJarvis/KeenQuotes/-/tree/main/src/mai...
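To see why this is harder than it looks, here is a minimal regex-based curler in Python. This is my own naive sketch for illustration, not KeenQuotes' actual lexer/parser; it handles contractions and balanced quotes, but it mis-curls leading elisions like 'ittle (which need a right single quote, not an opening one) and treats the primes in 2'×6" as quotes:

```python
import re

def naive_curl(text: str) -> str:
    """Curl straight quotes using simple regex heuristics (illustrative only)."""
    # Apostrophes inside words (don't, y'ain't) become right single quotes.
    text = re.sub(r"(?<=\w)'(?=\w)", "\u2019", text)
    # A double quote not preceded by a word character and followed by
    # a non-space is treated as opening; everything else closes.
    text = re.sub(r'(?<!\w)"(?=\S)', "\u201C", text)
    text = text.replace('"', "\u201D")
    # Same heuristic for single quotes -- this is where elisions break:
    # 'ittle gets an OPENING quote, though it should be a closing one.
    text = re.sub(r"(?<!\w)'(?=\S)", "\u2018", text)
    text = text.replace("'", "\u2019")
    return text

print(naive_curl("don't"))            # contraction: handled correctly
print(naive_curl('she said, "yes"'))  # balanced doubles: handled correctly
print(naive_curl("'ittle 'un"))       # elisions: curled the wrong way
```

Resolving those elisions requires knowing that 'ittle is a clipped "little", which is exactly the kind of contextual knowledge a pure lexer lacks.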
Great example. I just tried it with a few LLMs and got horrible results. GPT-4o got a ton of them wrong, o1 got them all correct AFAICT but took 1m50s to do so, and Claude 3.5 Sonnet said “Here's the text with straight quotes converted to curly quotes” but then returned the text with all the straight quotes intact.
I’m very surprised all three models didn’t nail it immediately.
I would be interested in how well even a smaller LLM would work after fine-tuning. Besides the overhead of an LLM, I would assume they would do a much better job in the edge cases (where contextual understanding is required).