Comment by XorNot

6 years ago

You can extend this further: git is the wrong representation model for code (text).

It's a useful representation model, but really only because programming languages are mostly designed to be line-oriented (though not enough: e.g. Python should disallow multiple imports on one line, because they make diffs slightly harder to read). Git is the dumbest possible thing that can work, and it does work, but it's also wrong.

What we really need is a generic way to store graph-structured data (which is what an AST really is) into something like git. Because then we could finally get rid of the notion of code formatting as anything but a user-experience peculiarity (the rise of "push a button and a black box fixes the code for commit" tools shows why this should be).

But more importantly, it would mean we could reasonably start representing commits with some notion of what they are doing to the control flow and structure of the code. Think git's rename detection but it notices when a function is renamed, or can represent refactoring in a diff by showing that all the control linkages of one block of code have moved to a particular function (and by extension, diffs would now implicitly show when new links or dependencies are added).
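
A minimal sketch of the function-rename idea, assuming Python's standard `ast` module stands in for a generic parser. The helper names `function_fingerprints` and `detect_renames` are hypothetical, and it only handles top-level, non-recursive functions whose bodies are otherwise unchanged (requires Python 3.9+):

```python
import ast


def function_fingerprints(source: str) -> dict[str, str]:
    """Map each top-level function name to a dump of its arguments and body."""
    fingerprints = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            # Dump the arguments and body but not the name, so a pure rename
            # leaves the fingerprint unchanged.
            parts = [ast.dump(node.args)] + [ast.dump(stmt) for stmt in node.body]
            fingerprints[node.name] = "\n".join(parts)
    return fingerprints


def detect_renames(old_source: str, new_source: str) -> list[tuple[str, str]]:
    """Return (old_name, new_name) pairs for functions whose body is unchanged."""
    old = function_fingerprints(old_source)
    new = function_fingerprints(new_source)
    removed = {name: fp for name, fp in old.items() if name not in new}
    added = {name: fp for name, fp in new.items() if name not in old}
    renames = []
    for old_name, fingerprint in removed.items():
        for new_name, candidate in added.items():
            if fingerprint == candidate:
                renames.append((old_name, new_name))
                break
    return renames


OLD = """
def total(xs):
    return sum(xs)

def helper():
    return 1
"""

NEW = """
def helper():
    return 1


def compute_total(xs):
    # same logic, new name and different spacing
    return sum( xs )
"""

print(detect_renames(OLD, NEW))  # [('total', 'compute_total')]
```

Because the comparison happens on the AST, the reordering, the extra blank lines, the comment and the changed spacing in `NEW` do not hide the rename the way they could in a line-based diff.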

The trouble, of course, is that doing any of this generically is an unsolved problem. I have an idea that you could probably do something interesting like this with git's existing structure and a language like Go, where you decompose the program into a bunch of files representing the hierarchical AST and commit that rather than the actual text.

> What we really need is a generic way to store graph-structured data (which is what an AST really is) into something like git.

git can already store, diff and merge tree-structured data: directory trees. It would be an interesting experiment to encode various tree-structured data as a directory-tree in git and see how it behaves on different version control operations.
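
One way to try that experiment, sketched with Python's standard `ast` module as the source of tree-structured data. The layout is an assumption made for illustration: each node becomes a directory with a `_type` file, child lists become numbered subdirectories, and leaf values become small text files. It is lossy (for example, `None` fields are simply omitted) and `write_node` is a hypothetical helper, not part of any tool:

```python
import ast
import os
import tempfile


def write_node(node: ast.AST, path: str) -> None:
    """Write one AST node as a directory: a '_type' file plus one entry per field."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "_type"), "w") as f:
        f.write(type(node).__name__ + "\n")
    for field, value in ast.iter_fields(node):
        target = os.path.join(path, field)
        if isinstance(value, ast.AST):
            write_node(value, target)
        elif isinstance(value, list):
            # Child lists become numbered entries so that order is preserved.
            for index, item in enumerate(value):
                item_path = os.path.join(target, f"{index:04d}")
                if isinstance(item, ast.AST):
                    write_node(item, item_path)
                else:
                    os.makedirs(target, exist_ok=True)
                    with open(item_path, "w") as f:
                        f.write(repr(item) + "\n")
        elif value is not None:
            # Leaf values (names, constants) become small text files.
            with open(target, "w") as f:
                f.write(repr(value) + "\n")


source = "def add(a, b):\n    return a + b\n"
root = tempfile.mkdtemp()
write_node(ast.parse(source), os.path.join(root, "module"))

# List every file that was written, relative to the temporary root.
for dirpath, _dirs, files in sorted(os.walk(root)):
    for name in sorted(files):
        print(os.path.relpath(os.path.join(dirpath, name), root))
```

Running `git init` and committing inside `root` would then let git's ordinary tree diffing and rename detection operate on AST nodes instead of lines. How well that behaves under merges is exactly the open question.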

  • I mean, git doesn't actually track diffs, so the diff/merge part is separate, and already supports plugins: just write a syntax-oriented diff/merge for git and tell git in your configuration files to use it for .py files or whatever... this isn't a git limitation.

    • Git actually does ship with some knowledge of different programming languages, I believe (regular-expression patterns rather than full parsers), because when you do a diff, it captions the changed lines with the function they're in, etc. That said, I've also used third-party "semantic diff" tools (generally for Windows) that have integrations with git. It's really nice when GitHub can't merge something, but you can, easily, locally, with a semantic merge/diff tool called from git.

    • Since Go includes a source file parser in its standard library, it is possible to leverage those libraries to reason about and automate things regarding source files. See "go fmt" for a start. Go source files can then just be a storage layer for higher-level purposes.

      What is suggested in ancestor posts is to change the storage format, although I think it better to prototype something using the above first. When integrating into the object model directly, I suspect one would in the end want to design or evolve a language to build a platform specifically for that. You might end up with something similar to Smalltalk, so you would need to think about whether that's the goal or whether there's more to accomplish.

      The higher-order question is: What concerns need to be integrated, and what concerns need to be separated? Integration may bring new powers, but also the risk of evolving into a "big ball of mud". Decoupling may bring freedom and independence, but also the risk of incoherence and suboptimal couplings.

      What are the benefits of the current paradigm, and what are the disadvantages? CVS and text files have worked for a very long time, are cross-platform, and work beyond any single project's scope. Maybe the Go approach reaches some sort of optimal equilibrium.
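
As a rough sketch of the "keep text as the storage layer, add syntax awareness on top" approach from this subthread, here is a small filter that re-renders a Python file from its AST so that formatting-only changes disappear from diffs. It is written in Python with the standard `ast` module rather than Go's parser, and it assumes Python 3.9+ for `ast.unparse`. It is meant to be used as a git `textconv` program, which receives the file path as its single argument and writes the converted text to standard output:

```python
import ast
import sys


def normalize(source: str) -> str:
    """Re-render source from its AST, discarding comments and layout."""
    return ast.unparse(ast.parse(source)) + "\n"


def main(path: str) -> None:
    # textconv mode: read the file git hands us and print the normalized text.
    with open(path, encoding="utf-8") as f:
        sys.stdout.write(normalize(f.read()))


# Two layouts of the same function, used for the demo below.
A = "def f(a,b):\n        return a+b  # add\n"
B = "def f(a, b):\n    return a + b\n"

if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        # No path given: show that both layouts normalize identically.
        print(normalize(A) == normalize(B))  # True
```

To wire it up, one could put `*.py diff=pyast` in `.gitattributes` and set `diff.pyast.textconv` to something like `python3 ast_textconv.py`. The driver name `pyast` and the script name are made up for this example. Note that this only changes what `git diff` displays; the stored blobs and merges are unaffected, and comments are dropped from the view.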

You hit the nail on the head. I would love to create a PoC language/IDE/VCS that solves some of this. Code is very multi-dimensional and there are more ways to view the content than a text based file tree.

Interesting that this has come up on the day when Pharo has also made it to the front page.

Smalltalk uses an image and holds live objects in memory at all times - there's no distinction between the source code representation and the running representation of the code. This allows the "IDE" (which is a misnomer because the development environment and the live environment are one and the same) to introspect on live objects and perform analyses on them.

TL;DR: Smalltalk has never used text as an internal representation, and that removes the separation between the source code and the debug or live environment.

Code as text has long bothered me because of my experience with various schematic capture and PCB layout tools. With those, the design is stored in a form that can be queried directly, and you can change object properties programmatically.

I feel like the insistence on text-based programming languages just leaves the industry completely mired in the mid-1970s.

  • I’ve had to deal with several graphical programming languages over the years.

    They only work on the most basic of use cases. The moment you want to get real work done they just get in the way.

    • Very much disagree, and completely dependent on what you define as "real work". There are many low-cost platforms out there that allow for very rapid development of big projects.

      Disclaimer: I used to work for one of them (Mendix)

  • It puzzles me too. I think that portability and the ability to use the same text-based tools even though you use different languages are the best reasons for sticking with text.

    • I think it is a feeling of loss of control and of having no clue what you're missing. Programs stored as unstructured text make it difficult to produce good tooling. It makes programmatic code generation hard, makes diffing hard, and makes automatically generating commit messages impossible. Refactoring is always problematic. Merging even more so.

      But it seems that even mentioning that structured data might be a better way to store code makes most programmers really angry. At least that's my experience.