Comment by Syzygies

3 months ago

I'm trying to sort out whether this article is relevant to a problem I've been working on the last few days. (It's staggering how deploying AI changes time scales.)

We need "Centaur" documentation that can efficiently transfer information formerly targeting humans, to AI. To fit within current token windows, one needs semantic compression. What data representation would be ideal for this?

This seems so obvious once you consider it, it becomes impossible to explain why OpenAI or Anthropic or Cursor or Windsurf don't offer "knowledge packs" that can transfer their documentation to AI. Of course, it's frequently the case that people who make tools don't "get" them.

My immediate need is to condense the Lean 4 website into a Claude 3.5 Sonnet context window. No AI can code in Lean 4 reliably (not that many humans either ;-) but I don't want the Lean / AI choice to be either / or.

> No AI can code in Lean 4 reliably

I wonder if this is due to the nature of the language.

Lean 4 lets you redefine its syntax in ways that most other languages do not allow[1], so effectively you are dealing with a recursive language, that could require a Touring complete token representation system.

[1] https://leanprover-community.github.io/lean4-metaprogramming... What other language lets you redefine the meaning of digits? The mix of syntax + macros + elaboration makes it really flexible, but hard to treat reliably.

LLMs based on transformers are not Touring complete (nitpick: they are but only if you use arbitrary precision math, which is not the case in practical implementation https://arxiv.org/abs/1901.03429).