Comment by contificate
3 days ago
> Meanwhile, a compiler is an enormously complicated story.
I don't intend to downplay the effort involved in creating a large project, but it's evident to me that there's a class of "better C" languages for which LLVM is very well suited.
On purely recreational grounds, one can get something small off the ground in an afternoon with LLVM. It's very enjoyable and has a low barrier to entry, really.
Yes, this is fine for basic exploration but, in the long run, I think LLVM taketh at least as much as it giveth. The proliferation of LLVM has created the perception that writing machine code is an extremely difficult endeavor that should not be pursued by mere mortals. In truth, you can get going writing x86_64 assembly in a day. With a few weeks of effort, it is possible to emit all of the basic x86_64 instructions. I have heard aarch64 is even easier but I only have experience with x86_64.
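For a taste of how mechanical this is, here's a tiny Python sketch that hand-encodes two x86_64 instructions (the byte encodings are the standard ones from the manual; the helper functions are invented for illustration):

    import struct

    def mov_rax_imm64(imm):
        # REX.W prefix (0x48) + opcode 0xB8 ("mov r64, imm64", rd = rax),
        # followed by the 8-byte little-endian immediate
        return bytes([0x48, 0xB8]) + struct.pack("<Q", imm)

    def ret():
        return bytes([0xC3])  # near return

    code = mov_rax_imm64(42) + ret()
    print(code.hex())  # 48b82a00000000000000c3

Wrap bytes like these in a few helper functions per instruction family and you have the beginnings of an assembler.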
What you then realize is that it is possible to generate quality machine code much faster than LLVM, using far fewer resources. I believe both that LLVM has been holding back compiler evolution and that it is close to, if not already at, peak popularity. As LLMs improve, the need for tighter feedback loops will necessitate moving off the bloat of LLVM. Moreover, for all of the magic of LLVM's optimization passes, it does very little to prevent the user from writing incorrect code. I believe we will demand more from a compiler backend than LLVM can ever deliver.
The main selling point of LLVM is that you gain access to all of the targets, but to me this is a weak argument in its favor. Firstly, one can write a quality self-hosting compiler with O(20) instructions. Adding new backends should be trivial. Moreover, the more you are thinking about cross-platform portability, the more you are worrying about hypothetical problems as well as the problems of people other than yourself. Get your compiler working well on your machine first and then worry about other machines.
I agree. I've found that, for the languages I'm interested in compiling (strict functional languages), a custom backend is desirable simply because LLVM isn't well suited for various things you might like to do when compiling functional programming languages (particularly related to custom register conventions, split stacks, etc.).
I'm particularly fond of the organisation of the OCaml compiler: it doesn't really follow a classical separation of concerns, but emits good quality code. E.g. its instruction selection is just pattern matching expressed in the language; various liveness properties of the target instructions are expressed for the virtual IR (as they know which one-to-one instruction mapping they'll use later, as opposed to doing register allocation strictly after instruction selection); garbage collection checks are threaded in after the fact (calls to caml_call_gc); and its register allocator is a simple variant of Chow et al.'s priority graph colouring (expressed rather tersely; ~223 lines, ignoring the related infrastructure for spilling, restoring, etc.).
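To give a flavour of instruction selection as pattern matching, here's a toy sketch in Python (this is not the OCaml compiler's code; the IR and the emitted pseudo-assembly are invented for illustration):

    from dataclasses import dataclass

    @dataclass
    class Const:
        value: int

    @dataclass
    class Load:
        addr: str

    @dataclass
    class Add:
        lhs: object
        rhs: object

    def select(expr):
        # Each case recognises a tree shape and picks a cheaper encoding.
        match expr:
            case Add(lhs, Const(n)):
                return select(lhs) + [f"add ${n}"]      # immediate operand
            case Add(lhs, Load(addr)):
                return select(lhs) + [f"add ({addr})"]  # memory operand
            case Add(lhs, rhs):
                return select(lhs) + select(rhs) + ["add"]
            case Const(n):
                return [f"mov ${n}"]
            case Load(addr):
                return [f"mov ({addr})"]

    print(select(Add(Const(1), Load("x"))))
    # ['mov $1', 'add (x)']

The ordering of the cases is the heuristic: more specific tree shapes are tried before the generic fallback.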
--
As a huge aside, I believe the hobby compiler space could benefit from someone implementing a syntactic subset of LLVM, capable of compiling real programs. You'd get test suites for free and the option to switch to stock LLVM if desired. Projects like Hare are probably a good fit for such an idea.
>Adding new backends should be trivial.
Sounds like famous last words :-P
And I don't really know about "faster" once you start handling all the edge cases that invariably crop up.
Case in point: gcc
It's the classic pattern where you redefine the task as only 80% of the original.
That is why the Hare language uses QBE instead: https://c9x.me/compile/
Sure, it can't do all the optimizations LLVM can, but it is radically simpler and easier to use.
Hare is a very pleasant language to use, and I like the way the code looks vs something like Zig. I also like that it uses QBE for the reasons they explained.
That said, I suspect it’ll never be more than a small niche if it doesn’t target Mac and Windows.
If only it were just about emitting byte code to a file and then calling the linker... you also have the problem of debug information, optimizer passes, the amount of tests required to prove the output byte code is valid, etc.
>On purely recreational grounds, one can get something small off the ground in an afternoon with LLVM. It's very enjoyable and has a low barrier to entry, really.
Is there something analogous for those wanting to create language interpreters, not compilers? And preferably for interpreters one wants to develop in Python?
Doesn't have to be literally just an afternoon; it could even be a few weeks. But something that will ease the task for PL newbies? The tasks of lexing and parsing, I mean.
https://craftinginterpreters.com/introduction.html
AST interpreter in Java from scratch, followed by the same language in a tight bytecode VM in C.
Great book; very good introduction to the subject.
There are some quite neat lexer and parser generators for Python that can ease the barrier to entry. For example, I've used PLY now and then for very small things.
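For instance, a complete PLY lexer for a toy expression language fits in a screenful (the token set below is invented for illustration):

    import ply.lex as lex

    tokens = ("NUMBER", "PLUS", "TIMES", "LPAREN", "RPAREN")

    t_PLUS = r"\+"
    t_TIMES = r"\*"
    t_LPAREN = r"\("
    t_RPAREN = r"\)"
    t_ignore = " \t"

    def t_NUMBER(t):
        r"\d+"
        t.value = int(t.value)
        return t

    def t_error(t):
        print(f"illegal character {t.value[0]!r}")
        t.lexer.skip(1)

    lexer = lex.lex()
    lexer.input("2 * (3 + 4)")
    for tok in lexer:
        print(tok.type, tok.value)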
On the non-generated side, lexer creation is largely mechanical, even if you write it by hand. For example, if you vaguely understand the idea of expressing a disjunctive regular expression as a state machine (its DFA), you can plug that into skeleton algorithms and get a lexer out (e.g. the algorithm shown in Reps' “Maximal-Munch” Tokenization in Linear Time paper). For parsing, taking a day or two to really understand Pratt parsing is incredibly valuable: recursive descent is fairly intuitive to learn and implement, and Pratt parsing is a nice way to structure your parser for the more expressive parts of your language's grammar.
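A bare-bones Pratt parser is genuinely tiny. Here's a sketch (the binding powers and token handling are simplified for illustration; leaves are left as strings):

    BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

    def parse_expr(tokens, min_bp=0):
        # tokens is a list consumed left to right (pop(0) for simplicity)
        lhs = tokens.pop(0)  # assume a leaf token (e.g. a number)
        while tokens and tokens[0] in BINDING_POWER:
            op = tokens[0]
            bp = BINDING_POWER[op]
            if bp <= min_bp:
                break  # this operator binds too weakly; let the caller take it
            tokens.pop(0)
            rhs = parse_expr(tokens, bp)  # "<=" above makes ops left-associative
            lhs = (op, lhs, rhs)
        return lhs

    print(parse_expr(["1", "+", "2", "*", "3"]))
    # ('+', '1', ('*', '2', '3'))

Prefix and postfix operators, parentheses, etc. slot into the same loop once you see the shape.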
Nowadays, Python has a match (pattern matching) construct - even if its semantics are somewhat questionable (and potentially error-prone). Overall, though, I don't find Python too unenjoyable for compiler-related programming: dataclasses (and match) have really improved the situation.
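For example, here's a toy constant folder using dataclasses and match (the AST shape is invented for illustration):

    from dataclasses import dataclass

    @dataclass
    class Num:
        value: int

    @dataclass
    class BinOp:
        op: str
        lhs: object
        rhs: object

    def fold(e):
        match e:
            case BinOp(op, lhs, rhs):
                # fold the children first, then try to fold this node
                lhs, rhs = fold(lhs), fold(rhs)
                match (op, lhs, rhs):
                    case ("+", Num(a), Num(b)):
                        return Num(a + b)
                    case ("*", Num(a), Num(b)):
                        return Num(a * b)
                return BinOp(op, lhs, rhs)
            case _:
                return e

    print(fold(BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))))
    # Num(value=7)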
I am a big fan of Ragel[1], a high-performance parser generator. In fact, it can generate different types of parsers; very powerful. Unfortunately, it takes a lot of skill to operate. I wrote a parser generator generator to make it all smooth[2], but after 8 years I still can't call it effortless. A colleague of mine once "broke the internet" with a Ragel bug. So, think twice. Still, for weekend activities I highly recommend it, just for the way of thinking it embodies.
[1]: https://www.colm.net/open-source/ragel/
[2]: https://github.com/gritzko/librdx/blob/master/rdx/JDR.lex
Is this the same Ragel that Zed Shaw wrote about in one of his posts back in the day, during the Ruby and Rails heyday? I vaguely remember that article. I think he used it for Mongrel, his web server.
https://github.com/mongrel/mongrel
The worst part of designing a language is the parsing stage.
Simple enough to do it by hand, but there’s a lot of boilerplate and bureaucracy involved that is painfully time-wasting unless you know exactly what syntax you are going for.
But if you adopt a parser generator such as Flex/Bison you’ll find yourself learning and debugging an obtuse language that has to be forcefully bent to your needs, and I hope your knowledge of parsing theory is up to scratch when you’re faced with shift-reduce conflicts or have to decide whether LR or LALR(1) or whatever is most appropriate to your syntax.
Not even PEG is gonna come to your rescue.
Thanks to those who replied.