Comment by rualca
5 years ago
> Additionally, the vast majority of compiler books and resources spend too much time on the easiest part of compiling: parsing.
I disagree. Parsing is the single most important feature of building a programming language. It influences how the language can be structured and also how it should be structured for performance reasons.
Also, you're far more likely to write a parser for a language than implement anything remotely related to compilers and interpreters.
It's one thing to state that some topics might justify more content, but it's absurd to downplay the fundamental importance of parsers in putting together a compiler.
The problem with the old old Dragon Book's treatment of parsing is that it explains a particular impractical algorithm (LR(1)? LR(k) maybe? it's been a while) in way too much detail. You would need that detail if you were to implement a parser generator (for that specific algorithm), or a parser using that specific algorithm, which isn't terribly well suited for generating useful error messages for your users.
Today's computers are ridiculously powerful, and the perceived performance advantages of LR parsing are simply irrelevant. Further, even if you were to implement an LR parser, you would do it with the aid of a parser generator, which hides the boring details and saves you from needing to understand the LR algorithm itself. But more likely when you write a parser you use a parser generator using a different approach, or you write a recursive descent parser by hand. In either case, the dry parts of the old old Dragon Book are completely irrelevant.
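To give a sense of how approachable the hand-written route is, here is a minimal recursive descent expression parser in Python (a toy sketch; the grammar and the tuple-based AST are made up purely for illustration):

    import re

    def tokenize(src):
        # numbers, parentheses, and the four operators; whitespace falls away
        return re.findall(r"\d+|[()+\-*/]", src)

    class Parser:
        # grammar: expr   -> term (("+" | "-") term)*
        #          term   -> factor (("*" | "/") factor)*
        #          factor -> NUMBER | "(" expr ")"
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def next(self):
            tok = self.peek()
            self.pos += 1
            return tok

        def expr(self):
            node = self.term()
            while self.peek() in ("+", "-"):
                node = (self.next(), node, self.term())
            return node

        def term(self):
            node = self.factor()
            while self.peek() in ("*", "/"):
                node = (self.next(), node, self.factor())
            return node

        def factor(self):
            tok = self.next()
            if tok == "(":
                node = self.expr()
                assert self.next() == ")", "expected ')'"
                return node
            assert tok is not None and tok.isdigit(), f"unexpected token: {tok!r}"
            return int(tok)

    print(Parser(tokenize("1 + 2 * (3 - 4)")).expr())
    # ('+', 1, ('*', 2, ('-', 3, 4)))

One function per grammar rule, a peek and a next, and you're done; no tables, no first/follow sets.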
The people criticizing the 1986 Dragon Book's treatment of parsing aren't saying that parsing is irrelevant. They are saying that parsing that specific way is either irrelevant or done for you by a tool that is not the interesting part of a compiler.
If parsing "influences how the language can be structured and also how it should be structured for performance reasons" it's probably going to be a mediocre language, in which self-imposed constraints interfere with useful special features resulting in a bland design lacking competitive advantages.
Pascal has already been cited in other comments as having been designed for ease of parsing, but the Turbo Pascal compiler is an isolated masterpiece, and high-quality compilation at decent speed is normally more useful than cheap compilation at ludicrous speed.
Maybe parsing seems a large part of constructing a compiler because it's naturally the first task and because it's likely to be entangled with designing the syntax of the language, which can pose difficulties (e.g. ambiguities, context sensitivity, messy operator precedence).
> Parsing is the single most important feature of building a programming language
It really isn't. Also, it's the most trivial part of implementing a programming language (in a compiler).
> It influences how the language can be structured and also how it should be structured for performance reasons.
Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi. Even if you do it with regexps. Everything that comes after parsing is very complex and is rarely explained in any detail. And everything that comes after actually directly influences the structure, the performance, etc.
It really makes zero difference if you decide that your typing info should be written, say, like this:
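    int x = 0;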
or like this:
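    x: int = 0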
or like this:
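    var x int = 0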
All this is trivial to parse. However, deciding on how you implement typing, what needs to be in the language to support your choice, etc., now that definitely influences the structure, performance, trade-offs, etc.
How much of that is in the Dragon Book? Next to zero? But surely there are 4 chapters on how to parse this.
> It really isn't. Also, it's the most trivial part of implementing a programming language (in a compiler).
No, not really. I mean, think about it for a second: while you cannot get a compiler out of the door without a parser, you can indeed leave all the fancy optimization features out, can't you? And if you want to get an MVP out of the door, do you focus your effort on developing optional features or on the basic, critical part that ensures that your language exists?
And then there's developer experience. Syntax/grammar errors? You need the parser for that, don't you agree?
> Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi.
Actually, you're quite wrong. RapidJSON only handles a very basic language, doesn't need to recover from weird errors or produce meaningful error messages, and it boasts at most 1 GB/s.
And this is a parser optimized for speed.
Perhaps if more people had a clue about compilers, these sorts of errors and misconceptions wouldn't be so common.
> while you cannot get a compiler out of the door without a parser, you can indeed leave all the fancy optimization features out, can't you? And if you want to get an MVP out of the door, do you focus your effort on developing optional features or on the basic, critical part that ensures that your language exists?
> And then there's developer experience. Syntax/grammar errors? You need the parser for that, don't you agree?
These two statements are at odds with each other: a great developer experience is itself a "fancy optimization" that you shouldn't spend too much time on when starting a project (and I say that as someone who spends every day thinking about making compiler errors better). And part of the reason the literature around parsers is problematic is that graceful error recovery isn't explored in detail.
> And this is a parser optimized for speed.
Parse time is never the bottleneck. It is a waste of time to focus on optimizing parse speed at the beginning. It is worth it to eventually measure and tune it, and to think about the implications when designing your grammar, but beyond that the slowness of compilers is never driven by their parsers.
> you cannot get a compiler out of the door without a parser
Technically, you can: you make your compiler something that consumes an already existing bytecode (like the JVM's bytecode or .NET's CIL), or you only implement a parser for an AST format that humans aren't meant to write by hand while you nail down the specifics.
> Syntax/grammar errors? You need the parser for that, don't you agree?
Error recovery in a parser is a tricky thing: it is intimately tied to the choices you've made in your grammar. The more redundancy and uniqueness between scopes you can have, the easier it can be for your parser to identify where things have gone badly. For example, in Python, if you write something like
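    if x = 5:    # rejected by the parser; recent CPython even suggests '==' here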
you have little information but can still infer that you likely wanted to compare equality using == instead. But because Python has a very flexible grammar and semantics, the following sort of thing is valid but likely not what was intended:
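    x == 5    # a bare comparison: parses and runs fine, result silently discarded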
If you make different, more restrictive decisions on both grammar and semantics, you can still let people do this by being more verbose, while people who do it by accident can be given excellent diagnostics.
> I mean, think about it for a second: while you cannot get a compiler out of the door without a parser, you can indeed leave all the fancy optimization features out, can't you? And if you want to get an MVP out of the door, do you focus your effort on developing optional features or on the basic, critical part that ensures that your language exists?
Exactly: an MVP. For an MVP, the parser may indeed be "the single most important feature of building a programming language". But even then, it almost definitely isn't.
The moment you "leave out fancy optimisation features" your language becomes at most mediocre.
Because between parsing (which is trivial) and "fancy optimisation features" (which come very late in a compiler's life) there are a million things that are significantly more important than parsing:
- how do you do typing?
- how do you do code generation (do you output machine code yourself or use a backend)?
- do you use an intermediate representation?
- what actually affects performance (none of which has anything to do with parsing)? As an example, C++'s horrible header system has nothing to do with parsing, but slows compilation down by orders of magnitude.
- how do you do graceful error recovery and reporting?
- how are lambdas handled? how are scopes handled? memory layouts? GC/ARC/manual memory management? standard lib?
- a million other things
> Actually, you're quite wrong.
It was hyperbole. Do you even understand how fast 1 GB/s is? It's roughly 10 to 20 million lines of code per second. [1] Windows 10 is roughly 50 million lines. [2] So, at 1 GB/s you can parse the entirety of the Windows 10 codebase in under 5 seconds.
Even if your parsing speed is 1000 times slower (3 orders of magnitude slower), you can still parse ~10k lines a second. And I honestly can't even begin to think how you can parse something this slowly. And if you do, it's still not a concern, because the rest comes from outside parsing, for example:
- can you run parsers in parallel?
- do you need to parse the entirety of your project to begin compilation, or are your files/modules isolated so the compiler can work on any of them?
Parsing is trivial.
[1] For example, this SO answer for an unrelated question creates a file with 150 million lines, and the result is 6GB. 150/6 = 25 million lines per GB; I went with even lower numbers.
[2] https://answers.microsoft.com/en-us/windows/forum/all/window...
The thing about parsers is this: we have tools that let you input any context-free grammar and will spit out a parser for you. Getting a working parser for your language is trivial, even if getting a good parser is non-trivial (though even there, I hesitate to say that it's hard).
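To make that concrete, here is the whole job done with one such tool, the Lark library for Python (a minimal sketch with a made-up toy grammar; assumes `pip install lark`):

    from lark import Lark

    # a toy expression grammar, written in Lark's EBNF dialect
    grammar = r"""
        ?expr: expr "+" term     -> add
             | term
        ?term: term "*" factor   -> mul
             | factor
        ?factor: NUMBER          -> num
               | "(" expr ")"
        %import common.NUMBER
        %import common.WS
        %ignore WS
    """

    # Lark builds the LALR tables for you; left recursion is a non-issue
    parser = Lark(grammar, start="expr", parser="lalr")
    print(parser.parse("1 + 2 * (3 + 4)").pretty())

You feed it the grammar, it hands you a parse tree. All the shift-reduce machinery the Dragon book labors over is hidden inside the tool.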
In terms of time investment, building a parser takes like an hour whereas the semantics for the same language would take a hundred hours. Even when I throw together a parser for a simple language for some project, I spend more time trying to wrangle the output of the parser (that is, the semantics) than I do on the actual parsing portion of it. All of that effort in the Dragon book explaining how to build an LL(1) or LR(1) parser, or even how to build a lexer? Never used it. Almost completely useless. I spend more time thinking about whether I want to use a peek/next approach or a get/unget approach than about handling shift-reduce conflicts or building first/follow sets.
Of everyone I know who works in compilers, I can't think of anyone who doesn't think that the Dragon book spends too much time on parsing.
So it is a different world now than in 1996. Parsers are much more solved now than they were back then.
I've written a production compiler, but that was well before the Dragon book was written. I found it valuable, but kind of after the fact.
The references that we used included "A Compiler Generator" by David B. Wortman, Jim Horning, William M. McKeeman. There weren't that many other parser generators available at the time.
Lots of valuable work has been done to make parsers easier to write. At MWC, dgc (David G. Conroy, author of MicroEMACS, which is now 'mg' on most systems) said that "YACC parsers make the hard part harder and the easy part easier". He wrote the C compiler with a hand-crafted parser, starting in assembler, then bootstrapping it to C. Of the folks you know that write compilers, I wonder if they started writing after 1996, or even 2006.
If you are simply a user of parsers, then maybe you don't need to know how they work.
This is an interesting comment. What's the book that's like the Dragon book that comes closest to getting it right?
> I disagree. Parsing is the single most important feature of building a programming language. It influences how the language can be structured and also how it should be structured for performance reasons.
Wirth already settled that question (of both structuring the language and performance) with recursive descent over thirty years ago. The necessary treatment of all things parsing is then covered on twenty pages of his Compiler Construction text.
> Wirth already settled that question (of both structuring the language and performance) with recursive descent over thirty years ago.
"Recursive descent" is a whole class of parsing algorithms. Claiming that recursive descent settled the question is a kin to claim that compiled programming languages settled the question of how to develop programming languages.
Moreover, even if you go as far as believing that a specific class of parsing algorithms settled anything, it's absurd to argue that people who are learning how to write a parser should not learn about them, especially as your personal choice has fundamental limitations, such as left-recursive productions, and very specific techniques to mitigate them.
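To spell out the left-recursion point: a production like the first line below makes a naive recursive descent parser call itself forever, and the standard mitigation is to rewrite it into the iterative second form (a textbook transformation, shown here in generic grammar notation):

    expr -> expr "+" term | term      (left-recursive: expr() immediately calls expr())
    expr -> term ("+" term)*          (same language, but descent-friendly)

That rewrite is exactly the sort of thing a student won't learn by leaning on a parser generator.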
And finally, we still see articles being accepted into academic journals on how to parse concrete languages such as JSON. I'm sure we can agree that slapping together something with a random parser generator is not something that serves the students' best interests.
It turns out that whatever limitations there are to that approach hardly matter in practice, since Wirth developed a wide variety of languages this way, and they work just fine.