Comment by statictype
2 months ago
This article sums it up perfectly. I was interested in building a compiler long before going to college and this was the most accessible body of work.
Building a recursive descent parser from scratch was an eye opener to 17yo me on how a seemingly very complex problem that I had no idea how to approach can be made simple by breaking it down into the right primitives.
>a seemingly very complex problem that I had no idea how to approach can be made simple by breaking it down into the right primitives.
https://en.wikipedia.org/wiki/Niklaus_Wirth
From the Publications section of that Wikipedia page:
>The April 1971 Communications of the ACM article "Program Development by Stepwise Refinement",[22][23] concerning the teaching of programming, is considered to be a classic text in software engineering.[24] The paper is considered to be the earliest work to formally outline the top-down method for designing programs.[25][26] The article was discussed by Fred Brooks in his influential book The Mythical Man-Month and was described as "seminal" in the ACM's brief biography of Wirth published in connection to his Turing Award.[27][28]
Wirth also wrote an extremely accessible book on Compiler Construction, using exactly the hand written recursive descent parsing approach discussed by OP.
The initial edition was published in 1976, in German, but the latest version is available online:
https://people.inf.ethz.ch/wirth/CompilerConstruction/Compil...
There are also parser generators like ANTLR (https://en.wikipedia.org/wiki/ANTLR) which take input not unlike yacc's, but generate an LL parser using explicit code, rather than the table-driven LALR parsing of yacc.
Thank you. Just to confirm, by "accessible", do you mean easy to understand?
Anyway, I think I had come across that book on the net, but did not check it out at the time. I don't remember the exact reason, maybe it was because I didn't want to go into the subject of compilers at the time, and was only interested in interpreters, because I prefer to take things one step at a time.
Now I will check it out.
2 replies →
"breaking things down into the right primitives" is the real key to programming. There are many books and web pages about algorithms, but I wish there were more searchable and browsable resources for how to approach problems through primitives.
The process of breaking a complex problem down into the right primitives requires great understanding of the original problem in the first place.
What's usually blocking me during programming are edge cases I had no idea about. It's still hard to find good material on compilers if you are not into reading dry-ass books. That's a me problem though, I simply can't force myself to read boring, facts-only content (one of the reasons I love Beej's guides).
> The process of breaking a complex problem down into the right primitives requires great understanding of the original problem in the first place.
Yes, but with experience that just becomes a matter of recognizing problem and design patterns. When you see a parsing problem, you know that the simplest/best design pattern is just to define a Token class representing the units of the language (keywords, operators, etc), write a NextToken() function to parse characters to tokens, then write a recursive descent parser using that.
Any language may have its own gotchas and edge cases, but knowing that recursive descent is pretty much always going to be a viable design pattern (for any language you are likely to care about), you can tackle those when you come to them.
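As a minimal sketch of that pattern (the grammar here is a made-up arithmetic example, and names like `next_token` just follow the comment above):

```python
# Minimal recursive descent parser for arithmetic expressions,
# following the tokenize / next_token / recursive-descent pattern.

def tokenize(text):
    """Split input into (kind, value) tokens: numbers and single-char operators."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
        elif text[i].isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("NUM", int(text[i:j])))
            i = j
        else:
            tokens.append((text[i], text[i]))
            i += 1
    tokens.append(("EOF", None))
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0]

    def next_token(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    # expr := term (('+' | '-') term)*
    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.next_token()[0]
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    # term := factor (('*' | '/') factor)*
    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.next_token()[0]
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    # factor := NUM | '(' expr ')'
    def factor(self):
        kind, val = self.next_token()
        if kind == "NUM":
            return val
        if kind == "(":
            value = self.expr()
            self.next_token()  # consume ')'
            return value
        raise SyntaxError(f"unexpected token {kind!r}")

print(Parser("2 + 3 * (4 - 1)").expr())  # → 11
```

Each grammar rule becomes one function, which is why the structure of the code mirrors the structure of the language.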
There is a good body of literature that is distinct from the academic compiler literature.
https://nanopass.org/
When I need to parse something nowadays I always end up with parser combinators. They just make so much sense.
What language do you use parser combinators in, and what kind of grammar do you parse usually? Nom was terribly verbose and unergonomic even by Rust's standards. Haskell's Megaparsec/Parsec is good but yeah, it's Haskell, you need to handle multiple monads (Parser itself is monadic, then your AST state, and maybe some error handling) at once and that's where I got confused. But I appreciated the elegance.
I experimented with PCs in Haskell and Rust (nom), then moved on to parser generators in Rust (pest.rs), Ocaml (Menhir), Haskell (Happy) and finally ended up with python's Lark - the speed of experimenting with different syntax/grammars is just insane.
Parser combinators are more of a concept than a library. You could make your own supporting the stuff you need. I like writing programs in languages I don't know or barely know, so I usually just take one of the popular libraries in any given language.
For Rust I used Nom and I didn't mind it all that much although I noticed it's quite baroque. If I had more to write I'd probably make some wrappers or macros of my own for most commonly used Nom snippets.
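To illustrate the "make your own" point above, here is a tiny hand-rolled combinator sketch in Python (all names here are hypothetical, not from any particular library): a parser is just a function from text to `(value, rest)` on success or `None` on failure, and combinators build bigger parsers from smaller ones.

```python
# Tiny parser combinators: a parser is a function text -> (value, rest) or None.

def char(c):
    """Match a single expected character."""
    def parse(text):
        if text and text[0] == c:
            return c, text[1:]
        return None
    return parse

def alt(*parsers):
    """Try each parser in order, returning the first success."""
    def parse(text):
        for p in parsers:
            res = p(text)
            if res is not None:
                return res
        return None
    return parse

def many1(p):
    """Match p one or more times, collecting the results."""
    def parse(text):
        results = []
        res = p(text)
        while res is not None:
            value, text = res
            results.append(value)
            res = p(text)
        return (results, text) if results else None
    return parse

def fmap(f, p):
    """Apply f to the parsed value on success."""
    def parse(text):
        res = p(text)
        if res is None:
            return None
        value, rest = res
        return f(value), rest
    return parse

digit = alt(*(char(d) for d in "0123456789"))
number = fmap(lambda ds: int("".join(ds)), many1(digit))

print(number("123abc"))  # → (123, 'abc')
```

Real libraries like Nom or Parsec add error reporting, backtracking control, and performance work on top, but the core idea really is this small.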
I've used tree-sitter to generate my parsers in Rust, just working with the untyped syntax tree it produces, which gives you error tolerance for free. It's a bit of setup at first tho, requiring an extra crate for the generated parser, but editing it from there saves so much time.
3 replies →
That's just a generalisation of recursive descent.
The right primitives and composability bring in immense value.
That's a good point - recursive descent as a general lesson in program design, in addition to being a good way to write a parser.
Table-driven parsers (built with yacc etc) used to be emphasized in old compiler writing books such as Aho & Ullman's famous "dragon (front cover) book". I'm not sure why - maybe part efficiency for the slower computers of the day, and part because in the infancy of computing a more theoretical/algorithmic approach seemed more sophisticated and preferable (the canonical table-driven parser building algorithm was one of Knuth's).
Nowadays it seems that recursive descent is the preferred approach for compilers because it's ultimately more practical and flexible. Table driven can still be a good option for small DSLs and simple parsing tasks, but recursive descent is so easy that it's hard to justify anything else, and LLM code generation now makes that truer than ever!
There is a huge difference in complexity between building a full-blown commercial quality optimizing compiler and a toy one built as a learning exercise. Using something like LLVM as a starting point for a learning exercise doesn't seem very useful (unless your goal is to build real compilers) since it's doing all the heavy lifting for you.
I guess you can argue about how much can be cut out of a toy compiler for it still to be a useful learning exercise in both compilers and tackling complex problems, but I don't see any harm in going straight from parsing to code generation, cutting out AST building and of course any IR and optimization. The problems this direct approach causes for code generation, and optimization, can be a learning lesson for why a non-toy compiler uses those!
A fun approach I used at work once, when I wanted to support a fairly major C subset as the language of a programmable regression test tool, was even simpler ... Rather than having the recursive descent parser generate code, I just had it generate executable data structures - subclasses of Statement and Expression base classes, with virtual Execute() and Value() methods respectively, so that the parsed program could be run by calling program->Execute() on the top level object. The recursive descent functions just returned these statement or expression values directly. To give a flavor of it, the ForLoopStatement subclass held the initialization, test and increment expression class pointers, and then the ForLoopStatement::Execute() method could just call testExpression->Value() etc.
> Rather than having the recursive descent parser generate code, I just had it generate executable data structures - subclasses of Statement and Expression base classes, with virtual Execute() and Value() methods respectively, so that the parsed program could be run by calling program->Execute() on the top level object.
That is basically an AST interpreter.
No - there was no separation of AST and interpreter. You could consider it just as a directly executable AST.
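A sketch of that "directly executable AST" idea, in Python rather than the original C++ (the class names echo the Statement/Expression/ForLoopStatement scheme described above, but the details are invented for illustration):

```python
# Directly executable parse tree: each node knows how to run itself.
# Expressions have value(env); statements have execute(env).

class Expression:
    def value(self, env):
        raise NotImplementedError

class Number(Expression):
    def __init__(self, n):
        self.n = n
    def value(self, env):
        return self.n

class Var(Expression):
    def __init__(self, name):
        self.name = name
    def value(self, env):
        return env[self.name]

class Add(Expression):
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
    def value(self, env):
        return self.lhs.value(env) + self.rhs.value(env)

class Less(Expression):
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
    def value(self, env):
        return self.lhs.value(env) < self.rhs.value(env)

class Statement:
    def execute(self, env):
        raise NotImplementedError

class Assign(Statement):
    def __init__(self, name, expr):
        self.name, self.expr = name, expr
    def execute(self, env):
        env[self.name] = self.expr.value(env)

class ForLoop(Statement):
    """Holds init, test and increment nodes, like ForLoopStatement above."""
    def __init__(self, init, test, incr, body):
        self.init, self.test, self.incr, self.body = init, test, incr, body
    def execute(self, env):
        self.init.execute(env)
        while self.test.value(env):
            for stmt in self.body:
                stmt.execute(env)
            self.incr.execute(env)

# for (i = 0; i < 5; i = i + 1) total = total + i
env = {"total": 0}
loop = ForLoop(
    Assign("i", Number(0)),
    Less(Var("i"), Number(5)),
    Assign("i", Add(Var("i"), Number(1))),
    [Assign("total", Add(Var("total"), Var("i")))],
)
loop.execute(env)
print(env["total"])  # → 10
```

The recursive descent functions would return these node objects directly, so the "parse" and "build the runnable program" steps are one and the same, with no separate interpreter walking a passive tree.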