Comment by dmitriid

5 years ago

I'm not an expert by far, but I feel like many people who write compilers or develop languages no longer recommend The Dragon Book as it's very outdated.

Additionally, the vast majority of compiler books and resources spend too much time on the easiest part of compiling: parsing. It's not uncommon for a book or a course to dedicate 60-70% of its time to parsing, leaving almost nothing for everything else.

> I'm not an expert by far, but I feel like many people who write compilers or develop languages no longer recommend The Dragon Book as it's very outdated.

Yes. There are newer editions that I can't judge, but the 1986 edition linked from the page should nowadays only be read by historians. If you want to implement a compiler, look elsewhere, for something newer.

This isn't to say that this was a bad book 35 years ago. It wasn't. But it is simply outdated. It does not contain things that are important today. It has very little treatment of type systems and type inference, for example. Its treatment of data flow analysis is based on intervals, which have limitations (though also some interesting theoretical advantages) and are used by zero compilers written in this millennium (and not even described very well in the Dragon book IIRC). And then of course there is the parsing problem, see my rant elsethread.

As for the article, I think A Catalogue of Optimizing Transformations (https://www.clear.rice.edu/comp512/Lectures/Papers/1971-alle...) would make an interesting addition.

Edit: Forgot to add that the old old Dragon book doesn't cover two topics that the article itself considers very relevant: SSA form (which I believe was discovered just around that time) and linear scan register allocation (which is much newer).

  • What would you recommend that covers data flow analysis better than the dragon book? I'm currently starting data flow analysis for my own compiler using intervals and/or modular congruences (operations can be reduced to congruences, then combined with the Chinese Remainder Theorem). What are the theoretical advantages of intervals and what do you recommend over that?

    • Ah, sorry, I should probably clarify that there are two completely distinct uses of the term "interval" in data flow analysis.

      One (introduced in https://amturing.acm.org/p137-allen.pdf and discussed in the old Dragon Book) groups program points into nested "intervals", i.e., intervals are pieces of code. Analysis information is then propagated among these intervals. This is what I was referring to above. As far as I know this approach is completely obsolete by now, with everyone using methods based more or less indirectly on Kildall's worklist iteration approach (http://static.aminer.org/pdf/PDF/000/546/451/a_unified_appro...). The theoretical advantage I was referring to above was that the interval approach is AFAIR sometimes more efficient because it can take program structure (loop nesting) into account more directly.
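
      For a flavor of the worklist approach, here is a minimal Python sketch (the CFG encoding and the `transfer`/`join` functions here are placeholders of my own, not from Kildall's paper):

        # Kildall-style worklist iteration over a CFG (sketch).
        # preds/succs: dicts mapping node -> list of nodes.
        # transfer(node, fact): per-node transfer function.
        # join(a, b): lattice join; bottom: the least element.
        def worklist_solve(nodes, preds, succs, transfer, join, bottom):
            facts = {n: bottom for n in nodes}  # fact at entry of each node
            work = list(nodes)                  # start with every node pending
            while work:
                n = work.pop()
                inp = bottom
                for p in preds[n]:              # join what flows in from predecessors
                    inp = join(inp, transfer(p, facts[p]))
                if inp != facts[n]:             # entry fact changed:
                    facts[n] = inp              # record it and revisit successors
                    work.extend(succs[n])
            return facts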

      The other meaning of "intervals" is used for an analysis domain, i.e., the kind of data you want your analysis to compute. For example, you might determine that the possible values of some variable x are in the interval [10, 20], the values of y are in the interval [5, 15], and then you can compute that the values of z = x + y must be in [15, 35]. There is nothing wrong with this approach, and any optimizing compiler is pretty much forced to use something like this. Based on your reference to modular congruences I think you probably want to do analysis of numerical values and meant this sense, which is anyway the only one in common use nowadays. Go ahead and do this. Common extensions are congruences, like "x is in [10, 20] and also congruent to 1 modulo 4", and/or "known bits", like "x is in [10, 20], and I know that (counting from the LSB) its bit 2 is 0 and bit 3 is 1".
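
      To make the domain concrete: interval arithmetic is just pointwise operations on the bounds. A minimal Python sketch (ignoring overflow and widening, which a real compiler must handle):

        # Abstract a value as a closed range [lo, hi];
        # addition adds the bounds pointwise.
        def interval_add(a, b):
            (alo, ahi), (blo, bhi) = a, b
            return (alo + blo, ahi + bhi)

        x = (10, 20)            # x in [10, 20]
        y = (5, 15)             # y in [5, 15]
        z = interval_add(x, y)  # z = x + y must be in [15, 35]
        assert z == (15, 35)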

      Personally I learned about data flow analysis in a university course based on Principles of Program Analysis: https://www.springer.com/gp/book/9783540654100. It's a book that goes into a lot of mathematical depth but not much breadth of analyses, and it doesn't handle data flow analysis on SSA form. Then I learned additional details from various well-known papers like the ones linked above, and the classical early papers around analysis on SSA form, like https://www.cse.wustl.edu/~cytron/531Pages/f11/Resources/Pap.... I'm sure there must be a more straightforward text to get you there, but I don't think I can give a definite recommendation. I have vague memories of the coverage in Muchnick's "Advanced Compiler Design and Implementation" and/or Cooper and Torczon's "Engineering a Compiler" being reasonable, but I can't find my copies right now.

> Additionally, the vast majority of compiler books and resources spend too much time on the easiest part of compiling: parsing.

I disagree. Parsing is the single most important feature of building a programming language. It influences how the language can be structured and also how it should be structured for performance reasons.

Also, you're far more likely to write a parser for a language than to implement anything else remotely related to compilers and interpreters.

It's one thing to state that some topics might justify more content, but it's absurd to downplay the fundamental importance of parsers in putting together a compiler.

  • The problem with the old old Dragon Book's treatment of parsing is that it explains a particular impractical algorithm (LR(1)? LR(k) maybe? it's been a while) in way too much detail. You would need that detail if you were to implement a parser generator (for that specific algorithm), or a parser using that specific algorithm, which isn't terribly well suited for generating useful error messages for your users.

    Today's computers are ridiculously powerful, and the perceived performance advantages of LR parsing are simply irrelevant. Further, even if you were to implement an LR parser, you would do it with the aid of a parser generator, which hides the boring details and saves you from needing to understand the LR algorithm itself. But more likely when you write a parser you use a parser generator using a different approach, or you write a recursive descent parser by hand. In either case, the dry parts of the old old Dragon Book are completely irrelevant.
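
    To make the hand-written option concrete, a recursive descent parser for a toy expression grammar can be this small (a sketch for a hypothetical grammar, no LR machinery required):

      # Toy grammar:
      #   expr   -> term (('+' | '-') term)*
      #   term   -> factor (('*' | '/') factor)*
      #   factor -> NUMBER | '(' expr ')'
      import re

      def tokenize(src):
          return re.findall(r"\d+|[+\-*/()]", src)

      class Parser:
          def __init__(self, tokens):
              self.tokens, self.pos = tokens, 0

          def peek(self):
              return self.tokens[self.pos] if self.pos < len(self.tokens) else None

          def next(self):
              tok = self.peek()
              self.pos += 1
              return tok

          def expr(self):
              node = self.term()
              while self.peek() in ('+', '-'):
                  node = (self.next(), node, self.term())
              return node

          def term(self):
              node = self.factor()
              while self.peek() in ('*', '/'):
                  node = (self.next(), node, self.factor())
              return node

          def factor(self):
              if self.peek() == '(':
                  self.next()
                  node = self.expr()
                  assert self.next() == ')', "expected ')'"
                  return node
              return int(self.next())

      print(Parser(tokenize("1 + 2 * (3 - 4)")).expr())
      # -> ('+', 1, ('*', 2, ('-', 3, 4)))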

    The people criticizing the 1986 Dragon Book's treatment of parsing aren't saying that parsing is irrelevant. They are saying that parsing that specific way is either irrelevant or done for you by a tool that is not the interesting part of a compiler.

  • If parsing "influences how the language can be structured and also how it should be structured for performance reasons", it's probably going to be a mediocre language, in which self-imposed constraints interfere with useful special features, resulting in a bland design lacking competitive advantages.

    Pascal has already been cited in other comments as having been designed for ease of parsing, but the Turbo Pascal compiler is an isolated masterpiece, and high-quality compilation at decent speed is normally more useful than cheap compilation at ludicrous speed.

    Maybe parsing seems a large part of constructing a compiler because it's naturally the first task and because it's likely to be entangled with designing the syntax of the language, which can pose difficulties (e.g. ambiguities, context sensitivity, messy operator precedence).

  • > Parsing is the single most important feature of building a programming language

    It really isn't. Also, it's the most trivial part of implementing a programming language (in a compiler).

    > It influences how the language can be structured and also how it should be structured for performance reasons.

    Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi. Even if you do it with regexps. Everything that comes after parsing is very complex and is rarely explained in any detail. And everything that comes after actually directly influences the structure, the performance, etc.

    It really makes zero difference if you decide that your typing info should be written like this:

       T x = new T();
    

    or like this:

       let x: T = T()
    

    or like this:

       var x := T()
    

    All this is trivial to parse. However, deciding on how you implement typing, what needs to be in the language to support your choice, etc., now that definitely influences the structure, performance, trade-offs, etc.

    How much of that is in the Dragon Book? Next to zero? But surely there are 4 chapters on how to parse this.

    • > It really isn't. Also, it's the most trivial part of implementing a programming language (in a compiler).

      No, not really. I mean, think about it for a second: while you cannot get a compiler out of the door without a parser, you can indeed leave all the fancy optimization features out, can't you? And if you want to get an MVP out of the door, do you focus your effort on developing optional features or on the basic, critical part that ensures that your language exists?

      And then there's developer experience. Syntax/grammar errors? You need the parser for that, don't you agree?

      > Once again, no, not really. However complex your language is, it can be parsed at 100 GB/s on a Raspberry Pi.

      Actually, you're quite wrong. RapidJSON only handles a very basic language, doesn't need to recover from weird input to produce meaningful errors, and it boasts at most 1 GB/s.

      And this is a parser optimized for speed.

      Perhaps if more people had a clue about compilers, these sorts of errors and misconceptions wouldn't be so common.

  • The thing about parsers is this: we have tools that let you input any context-free grammar and will spit out a parser for you. Getting a working parser for your language is trivial, even if getting a good parser is non-trivial (though even there, I hesitate to say that it's hard).
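
    For example, with the Lark library for Python (assumed installed via pip; the grammar syntax and API usage below are a sketch from memory, so double-check against Lark's docs):

      # Feed a context-free grammar to a parser generator, get a parser back.
      from lark import Lark

      grammar = r"""
          start: expr
          expr: expr "+" term | term    // left recursion is fine here
          term: NUMBER
          %import common.NUMBER
          %import common.WS
          %ignore WS
      """

      parser = Lark(grammar)            # Earley parser by default
      print(parser.parse("1 + 2 + 3").pretty())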

    In terms of time investment, building a parser takes like an hour whereas the semantics for the same language would take a hundred hours. Even when I throw together a parser for a simple language for some project, I spend more time trying to wrangle the output of the parser (that is, the semantics) than I do on the actual parsing portion of it. All of that effort in the Dragon book explaining how to build an LL(1) or LR(1) parser, or even how to build a lexer? Never used it. Almost completely useless. I spend more time thinking if I want to use a peek/next approach or a get/unget approach than handling shift-reduce conflicts or building first/follow sets.

    Of everyone I know who works in compilers, I can't think of anyone who doesn't think that the Dragon book spends too much time on parsing.

    • So it is a different world now than in 1996. Parsers are much more solved now than they were back then.

      I've written a production compiler, but well before the Dragon book was written. I found it valuable, but kind of after the fact.

      The references that we used included "A Compiler Generator" by David B. Wortman, Jim Horning, and William M. McKeeman. There weren't that many other parser generators available at the time.

      Lots of valuable work has been done to make parsers easier to write. At MWC, dgc (David G. Conroy, author of MicroEMACS, which is now 'mg' on most systems) said that "YACC parsers make the hard part harder and the easy part easier". He wrote the C compiler with a hand-crafted parser, starting in assembler and then bootstrapping it to C. Of the folks you know that write compilers, I wonder if they started writing after 1996, or even 2006.

      If you are simply a user of parsers, then maybe you don't need to know how they work.

    • This is an interesting comment. What's the book that's like the Dragon book that comes closest to getting it right?

  • > I disagree. Parsing is the single most important feature of building a programming language. It influences how the language can be structured and also how it should be structured for performance reasons.

    Wirth already settled that question (of both structuring the language and performance) with recursive descent over thirty years ago. The necessary treatment of all things parsing is then covered on twenty pages of his Compiler Construction text.

    • > Wirth already settled that question (of both structuring the language and performance) with recursive descent over thirty years ago.

      "Recursive descent" is a whole class of parsing algorithms. Claiming that recursive descent settled the question is a kin to claim that compiled programming languages settled the question of how to develop programming languages.

      Moreover, even if you go as far as believing that a specific class of parsing algorithms settled anything, it's absurd to argue that people who are learning how to write a parser should not learn about them, especially as your personal choice has fundamental limitations such as left recursion, with very specific techniques needed to mitigate it.

      And finally, we still see articles being accepted into academic journals on how to parse concrete languages such as JSON. I'm sure we can agree that slapping together something with a random parser generator is not something that serves the students' best interests.

> no longer recommend The Dragon Book as it's very outdated

Only if they didn't realize there is a second edition from 2006.