Comment by dmitriid

5 years ago

> Parsing is the single most important feature of building a programming language

It realy isn't. Also, it's the most trivial part of parsing a programming language (in a compiler).

> It influences how the language can be structured and also how it should be structured for performance reasons.

Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi. Even if you do it with regexps. Everything that comes after parsing is very complex and is rarely explained in any detail. And everything that comes after actualy directly influences the structure, the performance etc.

It really makes zero difference if you decide that your typeing info should be written like this:

   T x = new T();

or like this:

   let x: T = T()

or like this:

   var x := T()

All this is trivial to parse. However, deciding on how you implement typing, what needs to be in the languaage to support your choice etc., now that definitely influences the structure, performance, trade-offs etc.

How much of that is in the Dragon Book? Next to zero? But surely there are 4 chapters on how to parse this.

> It realy isn't. Also, it's the most trivial part of parsing a programming language (in a compiler).

No,not really. I mean, think about it for a second: while you cannot get a compiler out of the door withot a parser, you can indeed leave all the fancy optimization features out, don't you? And if you want to get a MVP out of the door, do you focus your effort in developing optional features or the basic, critical part that ensures that your language exists?

And then there's developer experience. Syntax/grammar errors? You need the parser for that, don't you agree?

> Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi.

Actually, you're quite wrong. RapidJSON only handles a very basic language, and doesn't need to handle any weird errors to output meaningful errors, and it boasts at most 1GB/s.

And this is a parser optimized for speed.

Perhaps if more people had a clue about compilers, this sort of errors and misconceptions wouldn't be so common.

  • > while you cannot get a compiler out of the door withot a parser, you can indeed leave all the fancy optimization features out, don't you? And if you want to get a MVP out of the door, do you focus your effort in developing optional features or the basic, critical part that ensures that your language exists?

    > And then there's developer experience. Syntax/grammar errors? You need the parser for that, don't you agree?

    These two statements are at odds with each other as having great developer experience is indeed a "fancy optimization" that you shouldn't spend too much time on when starting a project (and I say that as someone spending their time every day thinking about making compiler errors better), and part of the reason the literature around parsers is problematic is because graceful error recovery isn't explored in detail.

    > And this is a parser optimized for speed.

    Parse time is never the bottleneck. It is a waste of time to focus on optimizing parse speed at the beginning. It is worth it to eventually measure and tune it, and to think about the implications when designing your grammar, but beyond that the slowness of compilers is never driven by their parsers.

    > you cannot get a compiler out of the door withot a parser

    Technically, you can: you make your compiler something that consumes an already existing bytecode (like the JVM's or .Net's CLI) or only implement an AST parser that isn't intended for humans to actually write while you nail down the specifics.

    > Syntax/grammar errors? You need the parser for that, don't you agree?

    Error recovery in a parser is a tricky thing: it is intimately tied to the choices you've made in your grammar. The more redundancy and uniqueness between scopes you can have, the easier it can be for your parser to identify where things have gone badly. For example, in python if you write

        if foo = bar:
            print(foo)
    

    you have little information but can still infer that you likely wanted to compare equality instead using ==. But because Python has a very flexible grammar and semantics, the following is valid but likely not what was intended:

        class Foo:
            def bar(self):
                print("hi!")
    
        foo = Foo()
        foo.bar = "hi!"
    

    If you make different more restrictive decisions on both grammar and semantics you can still provide tools to allow people to do this, by being more verbose, but people who do this by accident can be given excellent diagnostics.

  • > I mean, think about it for a second: while you cannot get a compiler out of the door withot a parser, you can indeed leave all the fancy optimization features out, don't you? And if you want to get a MVP out of the door, do you focus your effort in developing optional features or the basic, critical part that ensures that your language exists?

    Exactly: an MVP. For an MVP the parser may indeed be "the single most important feature of building a programming language.". But even then it's almost definitely isn't.

    The moment you "leave out fancy optimisation fetures" your language becomes at most mediocre.

    Because between parsing (which is trivial) and "fancy optimisation features" (which come very late in a compiler life) there are a million things that are significantly more important than parsing:

    - how do you do typing?

    - how do you code generation (do you output machine code yourself or use a backend)?

    - do you use an intermediate representation?

    - what are the actual optimisations on performance (none of which have anything to do with parsing)? As an example, C++'s horrible header system has nothing to do with parsing, but slows compiling by hundreds of orders of magnitude.

    - how do you do graceful error recovery and reporting?

    - how are lambdas handled? how are scopes handled? memory layouts? GC/ARC/manual memory management? standard lib?

    - a million other things

    > Actually, you're quite wrong.

    It was a hyperbole. Do you even understand how fast 1GB/s is? It roughly 10 to 20 million lines of code [1] Windows 10 is roughly 50 million lines [2]. So, at 1GB/s you can parse the entirety of Windows 10 codebase in under 5 seconds.

    Even if your parsing speed is 1000 times slower (3 orders of magnitude slower), you can still parse ~10k lines a second. And I honsetly can't even begin to think how you can parse something this slowly. ANd if you do, it's still not a concern, because the rest comes from outside parsing, for example:

    - can yo run parsers in parallel?

    - do you need to parse the entirety of your project to begin compilation, or your files/modules are isolated and compiler can work on any of those?

    Parsing is trivial.

    [1] For example, this SO answer for an unrelated question creates a file with 150 million lines, and the result is 6GB. 150/6 = 25, I went for even lower numbers.

    [2] https://answers.microsoft.com/en-us/windows/forum/all/window...