
Comment by polshaw

7 months ago

This is cool. One practical issue, though (aside from the 280 GB TTF file!), is that it makes the text incompatible with all other fonts; if you copy and paste your "improved" text then it will no longer say what you thought it did. The font only alters the presentation, not the content. I guess you would have to OCR it to get the content as you see it.

I was wondering why this was never used for a simpler autocorrect, but I guess that's why.

Also perhaps someone more educated on LLMs could tell me; this wouldn't always be consistent right? Like "once upon a time _____" wouldn't always output the same thing, yes? If so even copying and pasting in your own system using the correct font could change the content.

> if you copy and paste your "improved" text then it will no longer say what you thought it did

It's not a bug, it's a feature - a DRM. Your content can now be consumed, but cannot be copied or modified - all without external tools, as long as you embed that TTF somehow.

Which kind of reminds me of the PDF invoices I got from my electricity provider. They looked and printed perfectly fine, but used a weird codepoint mapping which resulted in complete garbage when trying to copy any text from them. Fun times, especially when pasting an account number into a banking app.
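To make the codepoint-mapping trick concrete, here is a minimal Python sketch. It is only a toy: the "font" is a hypothetical rotation table, not how any real PDF or TTF stores its character map, and all function names are made up for illustration.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits


def make_scrambled_font() -> dict[str, str]:
    """Hypothetical font table: stored codepoint -> glyph that gets drawn.

    Each codepoint is drawn as the next character in ALPHABET, so no
    character is ever drawn as itself.
    """
    rotated = ALPHABET[1:] + ALPHABET[:1]
    return dict(zip(ALPHABET, rotated))


def encode_for_display(visible: str, font: dict[str, str]) -> str:
    """Choose the stored codepoints whose glyphs spell out `visible`."""
    inverse = {glyph: code for code, glyph in font.items()}
    return "".join(inverse.get(ch, ch) for ch in visible)


def render(stored: str, font: dict[str, str]) -> str:
    """What a viewer that uses the embedded font puts on screen."""
    return "".join(font.get(ch, ch) for ch in stored)


font = make_scrambled_font()
stored = encode_for_display("account 12345678", font)

print("on screen: ", render(stored, font))
print("copy/paste:", stored)
```

The page looks right because the viewer applies the embedded table, while copy and paste hands over the stored codepoints untouched.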

  • This is why pretty much all software that extracts structured data from PDFs throws away the text and just OCRs the page. Too many tricks with layouts and fonts.

    • I'm always surprised how "generate PDF from Word" turns one word into 10 different print points, all with just a single letter.

      Or even straight lines in a table. The straight lines from a table boundary get hacked into pieces. You'd think one line would be the ideal presentation for a line, but who are you to judge PDF?

  • Eh, what AI taketh, AI can give; modern OCR has gotten mostly decent. If you're on Windows you should try the PowerToys OCR tool.

    • > If you're on Windows you should try the PowerToys OCR tool.

      Which is open source (MIT-licensed), the source code is here: https://github.com/microsoft/PowerToys/tree/main/src/modules...

      It is written in C#, and uses the Windows.Media.Ocr UWP API to do the actual OCR part: https://learn.microsoft.com/en-us/uwp/api/windows.media.ocr?... – so if your app runs on Windows it can potentially call the same API and get OCR for free.

      Apple provides OCR through the VisionKit ImageAnalyzer API – https://developer.apple.com/documentation/visionkit/imageana... – albeit that is only officially supported from Swift (although apparently you can expose it to Objective-C if you write a "proxy Swift framework"–a custom Swift framework that wraps the original and adds @objc everywhere–I assume such a proxy framework could be autogenerated using reflection, but I'm not sure if anyone has written a tool that actually does that). There is also the older VNRecognizeTextRequest API, which is supported from Objective-C, but its OCR quality is inferior.

      I'm not sure what the best answer for Linux or Android is. I guess https://github.com/tesseract-ocr/tesseract ?

    • A very similar thing is also built in to the screenshot tool, at least in Windows 11. That's easier for me to use, since it's the same keybind as always to take a screenshot, and the OCR is then just a tool inside it.

The small model/TTF is only 60MB.

The 280GB you saw is the Llama3-70B model, which is basically ChatGPT-level (if not better).

If there's any randomness involved in inference, it ought to be deterministic as long as the same seed is used each time.

  • Is there even any possibility of using a different seed? I'd doubt the WASM shaper has access to any source of non-determinism.
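As a rough illustration of the seed point, here is a toy Python sketch. The token list and weights are invented, and it only models the sampling step, not a real model or the WASM shaper; other sources of variation in a real inference stack are outside its scope.

```python
import random

# Made-up next-token distribution for illustration only.
NEXT_TOKEN_WEIGHTS = {"there": 0.5, "in": 0.3, "a": 0.15, "long": 0.05}


def sample_continuation(seed: int, length: int = 5) -> list[str]:
    """Draw `length` tokens using a private, explicitly seeded generator."""
    rng = random.Random(seed)  # no shared global state, no clock, no OS entropy
    tokens = list(NEXT_TOKEN_WEIGHTS)
    weights = list(NEXT_TOKEN_WEIGHTS.values())
    return rng.choices(tokens, weights=weights, k=length)


first = sample_continuation(seed=42)
second = sample_continuation(seed=42)

# Same seed, same sequence of draws.
print(first == second)
```

If the seed is fixed inside the font and nothing else feeds entropy in, every run walks through the same sequence of draws.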

> this wouldn't always be consistent right? Like "once upon a time _____" wouldn't always output the same thing, yes?

Would be cool if you could turn up/down the LLM’s temperature by pressing different keys other than just !!!!

Say, pressing the keyboard numbers 0-9.
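A sketch of what that could look like, in Python rather than inside a font. The key-to-temperature mapping is purely hypothetical (0 meaning greedy, each step adding 0.2), and the logits are made up.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        # Treat zero as greedy decoding: all probability on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_for_key(digit: str) -> float:
    """Hypothetical mapping from a digit key to a temperature: '0' -> 0.0, '9' -> 1.8."""
    return int(digit) * 0.2


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for key in "059":
    probs = softmax_with_temperature(logits, temperature_for_key(key))
    print(key, [round(p, 3) for p in probs])
```

Key 0 always picks the top token, while higher digits spread probability more evenly across the candidates.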