Comment by chrismorgan

6 months ago

Coarse parsing is really good for the basics in almost all programming languages. But it’s not good at semantic detail, even though editors like Vim try to put some in there. One of the most notable cases is splitting Identifier up by adding Function. These have then routinely been misused and inconsistently applied, with the result that historically a language like JavaScript would look completely different from C; I think there was some tidying up of things a few years ago, but I can’t remember; I wrote a deliberately simple colorscheme that discards most of those differences anyway. Sometimes you’ll find Function being used for a function’s name at definition time; sometimes at call time too/instead; sometimes for a `function` keyword instead.

In many languages, it’s simply not possible to match function names in definitions or calls using coarse parsing, and C is definitely such a language. A large part of the problem is when you don’t have explicit delimiting syntax; that’s what you need. Oils, by contrast, looks to use `proc demo { … }`, so you can just look for the `proc` keyword.
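For instance, once there is a keyword to anchor on, a couple of ordinary Vim rules suffice (group names here are hypothetical):

    " Anchor on the explicit 'proc' keyword; the name that follows
    " gets its own group, with no real parsing required:
    syntax keyword yshProcKeyword proc nextgroup=yshProcName skipwhite
    syntax match yshProcName /\w\+/ contained
    highlight link yshProcName Function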

Vim’s syntax highlighting is unfortunately rather limited, and if you try to stretch what it’s capable of, it can get arbitrarily slow. It’s my own fault, but the Rust syntax files try to be too clever, and on certain patterns of curly braces after a few hundred lines, any editing can have multiple seconds of lag. I wish there were better tools for identifying what’s making it slow. I tried to figure it out once, but gave up.
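The closest thing Vim ships is `:syntime`, its built-in pattern profiler (it needs a build with +profile); it times individual patterns, though it won’t explain interactions between rules:

    syntime on        " start timing syntax pattern matching
    " ...scroll or edit through the slow region...
    syntime report    " per-pattern table of total time and match counts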

I’ve declared coarse parsing really good for the basics in almost all programming languages, and explicit delimiting syntax necessary. This leads to probably my least favourite limitation in Vim syntax highlighting: you can’t model indent-based mode switching. In Markdown, for example (keep the leading two spaces, they’re fine):

  Text

      Code

  1.  Text

      Vim says code, actually text

          Code
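The root of the problem is that Vim syntax patterns only see absolute indentation, so markdown syntax files end up with a rule shaped roughly like this (simplified):

    " Any line indented 4+ columns is treated as a code block --
    " right at the top level, wrong inside a list item, where the
    " threshold should move to 8:
    syntax match markdownCodeBlock /^\%( \{4,}\|\t\)\S.*$/

There’s no way for that pattern to know how deeply nested the enclosing list is.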

reStructuredText highlighting suffers greatly too, though it honestly can’t be highlighted correctly without a full parser (the appropriate mode inside the indented block can’t be known statically).

This is a real problem for my own lightweight markup language too, which uses meaningful indentation.

Oh cool, I'd be interested to read about the issues you had expressing Rust syntax in Vim!

And yes, there are a whole bunch of limitations:

- C function definitions - harder than JavaScript because there's no "function" keyword. It's still syntactic, but probably requires parsing, not just lexing (see the sketch after this list).

- C variable definitions - I'd call this "coarse semantic analysis", not coarse parsing! Because of the "lexer hack": whether `T * x;` declares a pointer or multiplies two variables depends on whether `T` was previously typedef'd, so even the lexer needs symbol information.

- Indentation as you mention - there was a thread where someone was complaining that treesitter had 2 composed parsers for Markdown -- block and inline -- although I'm not sure if this causes a problem in practice? (feedback appreciated)
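On the first point, a sketch of how far a coarse rule gets you in C (hypothetical group name):

    " Coarse C function-definition matching: an identifier followed
    " by '(' after a type-looking word at the start of a line...
    syntax match cUserFunctionDef /\v^\w+\s+\zs\w+\ze\s*\(/
    " ...also fires on prototypes, misses pointer returns like
    " 'char *f(void)', and sees nothing through macros -- hence
    " "requires parsing, not just lexing".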

---

But I did intend for YSH to be easier to parse than shell, and that actually seems to have worked: it fits quite well in Vim!

I noted here that OSH/bash had 10 lexer modes -- I just looked, and it's up to ~16 now:

https://www.oilshell.org/blog/2019/02/07.html#2019-updates

Whereas YSH has 3 mutually recursive modes, and maybe 6 modes total.
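That mutual recursion maps pretty directly onto Vim syntax regions that contain each other -- a rough sketch with hypothetical group names and deliberately simplified delimiters:

    " Expression subs contain strings, and strings contain expression
    " subs again; the mutual recursion is just a contains= cycle:
    syntax region yshExprSub start=/\$\[/ end=/\]/ contains=yshString
    syntax region yshString start=+"+ end=+"+ contained contains=yshExprSub
    highlight link yshString String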

---

On the "coarse semantic analysis", another motivation is that I found Github's semantic source browser a bit underwhelming. Not sure if others had that same experience. I think it can't really be accurate because it doesn't have a lot of build time info. So I think they could have embraced "coarseness" more, to make it faster

Although maybe I am confusing the UI speed with the analysis speed. (One reason that this was originally a Codeberg link is that Codeberg/Forgejo's UI is faster, without all the nav stuff.)

There are some related links here, like How To Build Static Analyzers in Orders of Magnitude Less Code:

https://github.com/oils-for-unix/oils/wiki/Polyglot-Language...

I question this? Sure, it is difficult, if not impossible, to match function names/calls using a naive single pass. But I don't see any reason you couldn't do a full parse and work from there?

This is really no different from how we process language, though? Even using proper names everywhere, it turns out proper names get reused. A lot. So much so that you pretty much have to have an active simulation of what you are reading in order for most things to attach to identities. No?

  • I’m explicitly (and very clearly!) talking about coarse parsing.

    • Right, I meant my lead to be an agreement that the coarsest possible parsing has trouble doing these things. With the understanding that it isn't a jump straight from coarse to non-coarse: you build up more capabilities as you progress.

      And, directly to my second sentence, we have the resources such that you don't need to cater to the coarsest possible capabilities. You can augment the things you are looking for just fine. You can see in one pass that a new function named "foo" was added, then in another pass you can start highlighting the function foo easily enough (a rough sketch follows at the end of this comment). Staying in a coarse world, you could probably even add a regex that roughly checks for the correct number of parameters. Yes, it takes more passes, but it isn't impossible?

      Further still, you could scan all identifiers, and any pair within a short Hamming distance of each other can be queried to see if they are mistakes. Any that match after case folding? Still coarsely found, but still very helpful.

      My point about how we treat language is that we do this more than not. That is my biggest gripe with the jump from a coarse parse straight to a context-free grammar approach: we work in contexts. I don't know why we go through so much effort to make the context unnecessary.

      As a fun example from natural language where I've seen this: we were going on a trip to Savannah with our kids when they were younger. The oldest was so excited that she was going to get to see cheetahs. An easy mistake to spot when being asked to look for it.
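      To make the multi-pass idea concrete, here is a rough Vim sketch (names are hypothetical): pass one collects function names from definition lines, pass two registers each one as a keyword so later mentions highlight too.

          " Pass 1: collect names from lines like 'function foo(...)'.
          " Pass 2: register each as a keyword so call sites light up,
          " coarsely but usefully.
          function! s:HighlightDefinedFunctions() abort
            for l:line in getline(1, '$')
              let l:name = matchstr(l:line, '\vfunction\s+\zs\w+')
              if !empty(l:name)
                execute 'syntax keyword myDefinedFunction' l:name
              endif
            endfor
            highlight link myDefinedFunction Function
          endfunction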