Comment by megameter

6 years ago

This kind of nibbles at the edges of the longstanding question: Whether source code is best represented as a linear document.

Because historically, that was true for obvious pragmatic reasons: you could put it on a punch card, print it to paper, or save it to a file, all using the same character encoding and rendering methods. And when we moved to interactive editing, that didn't change: half of the joy of plaintext is that you can select, copy, paste, search and modify in mostly standard ways across many tools.

Syntax highlighting is incrementally in line with this paradigm, since there's only one syntax for the language (syntax extensions and mixed-language documents aside) and it too follows a linear order, although it also recognizes that an AST exists. But as you get into things like code folding, IntelliSense and other IDE-like features, deviations emerge. More of the selections deal with symbolic meanings (declarations, types) rather than the tokens or the characters they consist of, and only incidentally with the delimiters of the AST.

But when we try to fully re-envision it and express these other concepts directly, we tend to end up with something unfamiliar, a bit clunky and alienating (most graphical languages). So the accepted developments always seem to come as these little ideas that are mostly-optional add-ons.

Chuck Moore did use semantic color as a language feature in ColorForth, but his specific goals lean towards a holistic bootstrapping minimalism, so I haven't seen it taken up elsewhere.

One of the nice things about code being text is that you can copy and paste unparseable subsets of code without anything getting in your way. For instance, if you need to move an if/else out of a function, you can move the if statement, reindent the body, and then move the else block. If you have syntax highlighting on, it might briefly mis-highlight (e.g., it may not know what to do with the else block when it's briefly unpaired) but it will let you do it, and it will fix itself once the program is properly parseable again.

Compare with, say, a WYSIWYG rich text / HTML editor. If you want to reorganize a bulleted list, it's many more steps, because the tool isn't really set up to accommodate a momentarily-invalid state. I think that's the big difference between syntax-highlighting something that fundamentally remains text and switching the representation to not-text.

This also lets you do various sorts of text manipulation, like handling conflicted merges. There are lots of arguments for an AST-aware merge, but in the case of a conflict, a text-based merge system that inserts standard conflict markers will still usually leave you with a file that parses well enough to be syntax-highlighted, even if it won't compile with the conflict markers still in place. Or even imagine converting a program from one language to another (e.g., a shell script that outgrew shell). You can stick the invalid code in your text editor, ignore the highlighting, and turn it line-by-line into the new language.

I think all the highlighting examples in this article work as overlays, just like syntax highlighting, so they'd work acceptably well in these briefly-invalid states. In some cases it'll fail to highlight or it will need aggressive error recovery that would be dubious in an actual compiler or interpreter (imagine, say, "if you see a conflict marker, skip the <<< portion and parse the >>> portion to find what locals exist and what their types are, then go back and try to highlight the <<< portion"), but since the highlighter isn't the real interpreter, that's fine.
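
To make the overlay resilience concrete, here is a minimal sketch assuming the Pygments library as a stand-in for whatever highlighter an editor actually uses; the point is only that lexing keeps producing usable tokens even while the file contains conflict markers:

    # sketch: error-tolerant lexing of a file that still has merge conflict markers
    from pygments.lexers import PythonLexer

    conflicted = "\n".join([
        "x = 1",
        "<<<<<<< HEAD",
        "y = x + 2",
        "=======",
        "y = x - 2",
        ">>>>>>> feature",
        "print(y)",
    ])

    for token_type, text in PythonLexer().get_tokens(conflicted):
        if text.strip():
            print(token_type, repr(text))
    # ordinary lines come out as Name/Operator/Number tokens as usual; the marker
    # lines just lex as operator/name noise instead of breaking the highlighter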

  • > the tool isn't really set up to accommodate a momentarily-invalid state.

    I've struggled to explain to students, teachers and others my frustrations with anything that isn't plaintext code. This is it - thank you!

    I think it's also a neat learning concept about why it's important one is _able_ to make mistakes when writing code. So many are overwhelmed by the flexibility or fragility of syntaxes but there's actually a lot of power in that.

  • Years back I had a mentor of sorts very strongly convince me that the "degenerate" cases where code doesn't produce a valid AST should, from a code-editing and source-control standpoint, be considered something of a "default case". We spend a lot more time on work-in-progress code than we ever do on finished, compiling code. Invalid states aren't often as "brief" as we think they are, and there are far too many reasons why you want to be able to save and even source-control work-in-progress code (including things like "it's the end of the day and I want to make sure I have this backed up" and "maybe my coworker can spot why this isn't parsing because my tired eyes are not seeing it").

    > If you have syntax highlighting on, it might briefly mis-highlight (e.g., it may not know what to do with the else block when it's briefly unpaired) but it will let you do it, and it will fix itself once the program is properly parseable again.

    This intuition (that syntax-highlighting token streams are already the most generic "semantic" tool we have readily available, are very resilient to work-in-progress states, and are very fast, because we use them in real time in editors) led me to experiment with a token-stream-based diff tool. [1]

    I got some really good results in my experiments with it. It gives you character-based diffs (as opposed to line-based diffs) that are better (more semantically meaningful) and faster than the other character-based diff tools I compared it to. You could probably use it as a diff tool with git projects today if you wanted, but it would mostly just be a UI toy, as git is snapshot-based rather than a patch-based source control system. (I explored the idea curious whether it might be useful to the patch-based darcs. Darcs kept exploring the idea of implementing a character-based patch format in addition to, or as a replacement for, its line-based patch format, but as far as I saw it never did; if it ever does, this tool could potentially be quite powerful there.) It's a neat toy/experiment though, and a rough sketch of the general idea follows below.

    [1] https://github.com/WorldMaker/tokdiff
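
    For a flavor of the approach (not the actual tokdiff implementation, just a rough sketch built from Python's standard tokenize and difflib modules), you can diff token streams instead of lines:

        import difflib
        import io
        import tokenize

        def tokens(source):
            """Lex Python source into (type, text) pairs, ignoring layout tokens."""
            skip = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                    tokenize.DEDENT, tokenize.ENDMARKER}
            return [(tok.type, tok.string)
                    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
                    if tok.type not in skip]

        def token_diff(old, new):
            """Yield the non-equal chunks of a token-by-token comparison."""
            a, b = tokens(old), tokens(new)
            for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
                if op != "equal":
                    yield op, [t for _, t in a[i1:i2]], [t for _, t in b[j1:j2]]

        # whitespace-only changes disappear; the real edit shows up as inserted tokens
        for change in token_diff("total = price*count\n", "total = price * count + tax\n"):
            print(change)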

> Whether source code is best represented as a linear document.

But source code is NOT represented as a linear document. At a minimum, source code is represented by dozens, hundreds, or thousands of files across a file system.

As such, source code on a filesystem is a highly-connected graph. Source code is hypertext: classes, functions, and data-structures take you to different files in your directory tree.

Take some async code you have, and tell me: how many documents / files do you have to read before you really know what the async code does from beginning to end? There's nothing linear about our code layout, actually.
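
As a rough sketch of that graph view (nothing async-specific; it assumes an ordinary Python project layout), you can walk a source tree and extract "this function calls that name" edges, which is exactly the hypertext structure hiding in the files:

    import ast
    from pathlib import Path

    def call_edges(root):
        """Collect (defining function, called name) edges across a source tree."""
        edges = []
        for path in Path(root).rglob("*.py"):
            tree = ast.parse(path.read_text(), filename=str(path))
            for node in ast.walk(tree):
                if isinstance(node, ast.FunctionDef):
                    for call in ast.walk(node):
                        if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                            edges.append((f"{path}:{node.name}", call.func.id))
        return edges

    print(call_edges("src"))  # "src" is a placeholder project directory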

--------------

If your question is "Is an individual code block best represented as a linear document?", then I would argue yes. At a micro level, code executes from the beginning and ends at the bottom (except for calls and jumps). But calls / jumps are well represented in our "implicit graph" (a function call points to another "document" in our file system).

-------------

I guess there's an open question about loops: they do NOT progress from top-to-bottom, but are still represented in linear form. Maybe there's a new representation you can have for them. (But maybe that's why "recursive" calls are so powerful: a "recursive" call actually matches the graph-based indirection of our code).

  • Code IS represented to both the programmer and to the file system as text files. I think your point is that this does not do justice to the complexity of most code bases, and with that I completely agree. There should be much better ways of presenting the actual complexity to programmers.

You can extend this further: git is the wrong representation model for code (text).

It's a useful representation model, but really only because programming languages are mostly designed to be line-oriented (though not enough: e.g., Python should disallow multiple imports on one line because it subtly breaks diff viewing). Git is the dumbest possible thing that can work, and it does work, but it's also wrong.

What we really need is a generic way to store graph-structured data (which is what an AST really is) into something like git. Because then we could finally get rid of the notion of code formatting as anything but a user-experience peculiarity (the rise of "push a button and a black box reformats the code for commit" tools shows why this should be).

But more importantly, it would mean we could reasonably start representing commits with some notion of what they are doing to the control flow and structure of the code. Think git's rename detection but it notices when a function is renamed, or can represent refactoring in a diff by showing that all the control linkages of one block of code have moved to a particular function (and by extension, diffs would now implicitly show when new links or dependencies are added).

The trouble, of course, is that doing any of this generically is an unsolved problem. I have an idea that you could probably do something interesting like this with git's existing structure and a language like Golang, where you decompose the program into a bunch of files representing the hierarchical AST and commit that rather than the actual text, maybe.
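
As a very rough sketch of that last idea (in Python rather than Go, leaning on ast.unparse from the standard library, Python 3.9+), you could explode a module into one file per top-level definition and commit the resulting directory instead of the raw text:

    import ast
    from pathlib import Path

    def explode(source_file, out_dir):
        """Write each top-level definition of a module to its own file."""
        tree = ast.parse(Path(source_file).read_text())
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                # one file per symbol, so renaming a function becomes a file rename
                (out / f"{node.name}.py").write_text(ast.unparse(node) + "\n")
            else:
                # imports, constants and other module-level statements share a file
                with open(out / "__module__.py", "a") as f:
                    f.write(ast.unparse(node) + "\n")

    explode("example.py", "exploded/example")  # placeholder paths

Diffing two such directories starts to look like diffing structure: a renamed function shows up as a renamed file, and code moving between functions shows up as content moving between files.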

  • > What we really need is a generic way to store graph-structured data (which is what an AST really is) into something like git.

    git can already store, diff and merge tree-structured data: directory trees. It would be an interesting experiment to encode various tree-structured data as a directory-tree in git and see how it behaves on different version control operations.

    • I mean, git doesn't actually track diffs, so the diff/merge part is separate, and already supports plugins: just write a syntax-oriented diff/merge for git and tell git in your configuration files to use it for .py files or whatever... this isn't a git limitation.


  • You hit the nail on the head. I would love to create a PoC language/IDE/VCS that solves some of this. Code is very multi-dimensional and there are more ways to view the content than a text-based file tree.

  • Interesting that this has come up on the day when Pharo has also made it to the front page.

    Smalltalk uses an image and holds live objects in memory at all times - there's no distinction between the source code representation and the running representation of the code. This allows the "IDE" (which is a misnomer because the development environment and the live environment are one and the same) to introspect on live objects and perform analyses on them.

    TL;DR: Smalltalk has never used text as an internal representation, and that removes the separation between the source code and the debug or live environment.

  • Code as text has long bothered me because of my experience with various schematic capture and PCB layout tools. With those, the design is stored in a form that can be queried directly. And you can change object properties programmatically.

    I feel like the insistence on text-based programming languages just leaves the industry completely mired in the mid-1970s.

    • I’ve had to deal with several graphical programming languages over the years.

      They only work on the most basic of use cases. The moment you want to get real work done they just get in the way.


    • It puzzles me too. I think that portability and the ability to use the same text-based tools even though you use different languages are the best reasons for sticking with text.


There will never be anything that is as portable as text, and I think that is the sole reason we will never see any other widespread method of storing "code".

Imagine having to update to a new language version if the code is stored as blobs; I feel sick already. And think of all the language-vendor-specific software you'd need just to do anything with it in general.

  • This might surprise you, but a huge amount of business logic is already stored in databases etc rather than in text based code repositories. There are lots of enterprises that have platforms (e.g. Salesforce) that allow you to build things without code.

    • Databases can usually be exported as, and rebuilt from, some kind of plain text format - be it CSVs, JSON, DDL+DML scripts.

      Of course it's less efficient than using the vendor's native dumps. But it's still important that the option exists, because it means that, in a worst case scenario, you can unfuck your data with text-based tools ranging from notepad to enterprise distributed buzzword ETL.


    • These things are very vendor-specific and proprietary; I was thinking more about general-purpose programming languages.

  • You do realize that text is just a stream of binary data. It's just that our entire ecosystem/tooling has evolved to interpret continuous streams of that binary data as text documents.

  • There is no reason not to store the text as a canonical representation alongside an alternative AST representation. A valid AST can always be rendered as valid source code and valid source code can always be parsed back into an AST.

    So if the AST format changes incompatibly, you simply need to let the compiler regenerate it from the source code.
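
    A minimal illustration of that round trip (using Python's standard ast module; ast.unparse needs 3.9+):

        import ast

        source = "if ready:\n    launch(count + 1)\n"
        tree = ast.parse(source)          # text -> AST
        regenerated = ast.unparse(tree)   # AST -> text, in a canonical formatting
        # the regenerated text parses back into an equivalent tree
        assert ast.dump(ast.parse(regenerated)) == ast.dump(tree)
        print(regenerated)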

> there's only one syntax for the language(syntax extension and mixed-language documents aside) and it too follows a linear order

I'm of the opinion that programming languages should have two "official" representations: the usual text-based representation, and a machine-readable concrete syntax tree. The latter should be an existing format, e.g. JSON, XML, s-expressions, etc.

It should be "official" in the sense that it's standardised in the same way as the text format, language implementations (compilers/interpreters) should accept it alongside the text format, and they should have a mode which converts the text format into the machine-readable format (and optionally the other way). This is important, since lots of code 'in the wild' only makes sense after feeding it through particular preprocessors, specifically-configured build tools, sed-heavy shell scripts, etc. such that the only tool that can even parse the code correctly is the compiler/interpreter (and even that might need a bunch of tool-specific flags, env vars, config files, etc.!). This makes tooling much harder than it needs to be, and any "unofficial" workarounds will need constant work to keep up with changes to the language.

I say concrete syntax trees because we want to impose as little meaning as possible on the tokens, which makes tooling more robust in the face of things like macros/custom syntax, new language features, incomplete or malformed code, etc.
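
As a small sketch of what such a machine-readable form could look like (Python's ast module gives an abstract rather than concrete tree, so treat this purely as an illustration of the serialization idea, not of the exact format being argued for):

    import ast
    import json

    def to_dict(node):
        """Convert an AST node into plain dicts/lists/scalars suitable for JSON."""
        if isinstance(node, ast.AST):
            return {"_kind": type(node).__name__,
                    **{field: to_dict(value) for field, value in ast.iter_fields(node)}}
        if isinstance(node, list):
            return [to_dict(item) for item in node]
        return node  # identifiers, numbers, strings, None

    tree = ast.parse("def area(r):\n    return 3.14159 * r * r\n")
    print(json.dumps(to_dict(tree), indent=2))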

  • Many BASIC implementations had this feature. The SAVE command saves a binary (tokenized) version of the source code; "SAVE file, A" saves in ASCII format.

    • The SAVE command in old BASICs used to tokenize the individual BASIC statements and built-in functions, e.g. PRINT, GOTO, CHR$(). It could also tokenize line numbers. But it certainly didn't do things like tokenize a FOR/NEXT loop or anything that went beyond a line break (e.g. GOSUB/RETURN). (A toy sketch of the keyword-tokenizing idea follows below.)

      Just typing those words sends me back too many years, to a TTY on a PDP-11/10, when you "saved" a program by:

      1. Typing LIST but not hitting CR

      2. Start the tape punch and press HERE-IS a few times to get a leader

      3. Hitting CR

      4. Waiting for the listing to finish

      5. Hitting HERE-IS a few times to get a trailer

      6. Folding the paper tape neatly :)
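
      To make the keyword-tokenizing part concrete, here's a toy sketch in Python; the token byte values are invented for illustration and not taken from any real BASIC:

          # toy keyword table -- byte values are made up, purely for illustration
          KEYWORDS = {"PRINT": 0x97, "GOTO": 0x89, "FOR": 0x81, "NEXT": 0x82, "CHR$": 0xC7}

          def tokenize_line(line):
              """Collapse recognized keywords to single bytes, keep the rest as ASCII."""
              out = bytearray()
              i = 0
              while i < len(line):
                  for keyword, code in KEYWORDS.items():
                      if line.startswith(keyword, i):
                          out.append(code)        # keyword becomes one token byte
                          i += len(keyword)
                          break
                  else:
                      out.append(ord(line[i]))    # everything else stays as plain ASCII
                      i += 1
              return bytes(out)

          print(tokenize_line('10 PRINT CHR$(65)').hex(" "))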

> unfamiliar, a bit clunky and alienating (most graphical languages)

IMO the largest issue with them is UX, specifically the input side of it. The keyboard is very fast and very precise, but it's only good if you're working on plain text or something not too far from it.

  • And I hope it stays this way.

    I'll leave the Minority Report hand-waving for you youngins.

    Give me a call when the brain implants are ready for prime time though.

An AST is directly representable via text - via the very syntax you use to write the ‘linear document’. What’s holding us back is not necessarily the text, it’s the editors which are (generally) only able to edit text, not an AST.

It should be possible (if not easy) to create a zoomable UI that lets you interact with code directly as well as zooming out to view and interact with the code structure in a meaningful way.

  • Maybe the problem isn't so much the text representation but the absence of a canonical representation. If there were a well-defined canonical representation for any program, then it would be trivial to view and edit it in my preferred style without messing up the diffs in the repo.

    Code formatters help, some more than others, but most are far from producing a canonical format. This means that I either need to edit in the project's preferred style or have my editor carefully track which sections I haven't modified and be sure to write them back exactly as they originally appeared.

  • See I wouldn't agree. I use JetBrains IDEs and I never feel as though I am working on a linear document. Between 'jump to definition', 'see usages', and various 'refactor' commands, it is easy to navigate and edit the tree structure.