Comment by conartist6
4 months ago
For the last five years I've been working on this problem!
To solve it we need to be able to describe the structured content of a document without rendering it, and that means we need an embedding language for code documents.
I hope this doesn't sound overly technical: I'm just borrowing ideas from web browsers. I think of my project as being the creation of a DOM for code documents. The DOM serves a similar function. A semantic HTML document has meaning independent of its rendered presentation, and so it can be rendered many ways.
CSTML is my novel embedding language for code. You could think of it as a safe way to hold or serialize an arbitrary parse tree. Like HTML, a CSTML document has "inner text", which in this case is the source text of the program the parser saw. E.g. a tiny document might be `<Boolean> 'true' </>`. The parser injects node tags into the source text, creating what is essentially the perfect data stream to feed a syntax highlighter. To do the highlighting, you print the string content of the document and use the control tags to decide on color. As it happens, this is already how we syntax highlight the output from our own CLI. We use our streaming parser technology to parse our log output into a CSTML tag stream (in real time), and then we just swap out open and close node tags for ANSI escape codes, print the strings, and send that stream to stdout.
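To make the tag-swapping idea concrete, here is a rough sketch in Python. It assumes a heavily simplified token grammar (just `<Name>`, `</>`, and single-quoted literals with no escapes), which is not the real CSTML syntax, and the `highlight` function and color table are made up for illustration; this is not BABLR's actual code.

```python
import re

# Assumed, simplified token grammar: <Name>, </>, or a single-quoted literal.
# Real CSTML has more syntax than this; the sketch only covers the tiny example.
TOKEN = re.compile(r"<(?P<open>[A-Za-z]\w*)>|(?P<close></>)|'(?P<text>[^']*)'")

# Hypothetical palette mapping node names to ANSI color codes.
ANSI = {"Boolean": "\x1b[33m", "Keyword": "\x1b[35m"}
RESET = "\x1b[0m"

def highlight(doc, palette=ANSI):
    """Return the document's inner text, colored by the innermost styled node."""
    out = []
    stack = []  # names of the currently open nodes
    for m in TOKEN.finditer(doc):
        if m.group("open"):
            stack.append(m.group("open"))
        elif m.group("close"):
            stack.pop()
        else:
            text = m.group("text")
            # Walk outward from the innermost node until one has a color.
            color = next((palette[n] for n in reversed(stack) if n in palette), None)
            out.append(f"{color}{text}{RESET}" if color else text)
    return "".join(out)

print(highlight("<Boolean> 'true' </>"))
```

Because the loop only ever looks at one token and the current stack, the same approach would work on a stream of tags arriving in real time.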
Here's a more complicated document generated from a real parse: https://gist.github.com/conartist6/412920886d52cb3f4fdcb90e3...
I think this is a good idea, but your language seems extremely similar to XML - why not just write an XML schema? Seems like you can leverage a lot of existing tools and abstractions that way.
SrcML is a piece of prior art that we studied. It does use XML, and I think that limits them. For example, because literal content isn't quoted in XML, all content inside tags is inner text. So they can't pretty print their documents, because indentation added to the embedding document would become indentation in the embedded document instead. Oops! We also support named references and named namespaces, which XML does not.
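The whitespace point can be seen with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` and contrasts it with a toy reader for quoted literals; `quoted_text` is a hypothetical helper over a simplified CSTML-like syntax, not a real CSTML parser.

```python
import re
import xml.etree.ElementTree as ET

# In XML, everything between the tags is text, including added indentation.
compact = ET.fromstring("<Boolean>true</Boolean>")
pretty = ET.fromstring("<Boolean>\n  true\n</Boolean>")

compact_text = "".join(compact.itertext())
pretty_text = "".join(pretty.itertext())

def quoted_text(doc):
    """Toy reader (an assumption): only text inside single quotes is content."""
    return "".join(re.findall(r"'([^']*)'", doc))

# With quoted literals, whitespace outside the quotes is just layout.
quoted_compact = quoted_text("<Boolean>'true'</>")
quoted_pretty = quoted_text("<Boolean>\n  'true'\n</>")

print(repr(compact_text), repr(pretty_text))
print(repr(quoted_compact), repr(quoted_pretty))
```

The pretty-printed XML changes the embedded text, while the quoted form yields the same inner text either way.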
How many of the ideas he proposes would this support? For example, classifying something as a <Keyword> lets you highlight it in the traditional way, but doesn't do much for "highlight different levels of nesting" or "highlight if imported from a different file". Seems like the parallel to HTML means CSTML mostly supports different rendering like screen reading or styling.
Yeah I would say we support all of them, but right now the support is low-level rather than high-level. We're not stopping you but we're not (yet) making it trivially easy either.
Technically, right now BABLR turns text into parse trees but it doesn't render the trees, so it doesn't have any firsthand concept of styling. If you print the content of a CSTML document to the terminal, you'll have to style it with ANSI codes. If you want to print the document to a web page, you'll have to style it with CSS. Right now we leave that part as an exercise to the user. The tree has the data needed to achieve any of the results you suggest, and as time goes on we will do better at providing higher-level APIs that make it really easy to implement those kinds of code-semantic styling rules.
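As an illustration of what that exercise might look like for "highlight different levels of nesting", here is a rough Python sketch. It assumes a simplified token grammar (`<Name>`, `</>`, single-quoted literals without escapes) rather than real CSTML, and `highlight_by_depth` and the color list are hypothetical names, not part of BABLR.

```python
import re

# Assumed, simplified token grammar; not the full CSTML syntax.
TOKEN = re.compile(r"<(?P<open>[A-Za-z]\w*)>|(?P<close></>)|'(?P<text>[^']*)'")

# Hypothetical palette: one ANSI color per nesting level, cycling when exhausted.
DEPTH_COLORS = ["\x1b[31m", "\x1b[32m", "\x1b[34m"]
RESET = "\x1b[0m"

def highlight_by_depth(doc):
    """Color each literal by how many nodes are open around it."""
    out = []
    depth = 0
    for m in TOKEN.finditer(doc):
        if m.group("open"):
            depth += 1
        elif m.group("close"):
            depth -= 1
        else:
            color = DEPTH_COLORS[(depth - 1) % len(DEPTH_COLORS)]
            out.append(f"{color}{m.group('text')}{RESET}")
    return "".join(out)

print(highlight_by_depth("<Call> 'f' <Args> '(' <Boolean> 'true' </> ')' </> </>"))
```

The node names are ignored entirely here; only the open and close tags matter, which is why nesting-based styling needs nothing beyond the tag stream itself.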
How does this approach cohere/compete/disagree with the Tree-sitter ecosystem?
Yeah it kinda does all three. We think Tree-sitter could adopt CSTML as a way of communicating its parse results with relative ease.
We also think that at some point in the future we could run Tree-sitter grammars without first compiling them from JS to C or wasm.
Our major innovations over Tree-sitter are scripted grammars (no compile step), streaming parsing, and the idea that we are a standalone complete source of truth for an IDE, where Tree-sitter only wants to be half the story: it expects to sync with a text buffer where the text buffer is the source of truth.
> the idea that we are a standalone complete source of truth for an IDE
So that XML-like tree would become the source of truth?
> where Tree-sitter only wants to be half the story: it expects to sync with a text buffer where the text buffer is the source of truth.
There are probably a ton of reasons for this, e.g. 1) The source of truth at the file system level is actually the bare text. 2) Performance reasons. 3) Stuff like git diff is easier to implement.