
Comment by palata

9 days ago

I am not convinced.

- Natural languages are ambiguous. That's the reason why we created programming languages. So the documentation around the code is generally ambiguous as well. Worse: it's not being executed, so it can get out of date (sometimes in subtle ways).

- LLMs are trained on tons of source code, which is arguably a smaller space than natural languages. My experience is that LLMs are really good at e.g. translating code between two programming languages. But translating my prompts to code is not working as well, because my prompts are in natural languages, and hence ambiguous.

- I wonder if it is a question of "natural languages vs programming languages" or "bad code vs good code". I could totally imagine that documenting bad code helps the LLMs (and the humans) understand the intent, while documenting good code actually adds ambiguity.

What I learned is that we write code for humans to read. Good code is code that clearly expresses the intent. If there is a need to comment the code all over the place, to me it means that the code is maybe not as good as it should be :-).

Of course there is an argument to make that the quality of code is generally getting worse every year, and therefore there is more and more a need for documentation around it because it's getting hard to understand what the hell the author wanted to do.

> If there is a need to comment the code all over the place, to me it means that the code is maybe not as good as it should be :-)

If good code was enough on its own we would read the source instead of documentation. I believe part of good software is good documentation. The prose of literate source is aimed at documentation, not line-level comments about implementation.

  • > If good code was enough on its own we would read the source instead of documentation.

    That's 100% how I work -- reading the source. If the code is confusing, the code needs to be fixed.

    • Confusing code is one thing, but projects with more complex requirements or edge cases benefit from additional comments and documentation. Not everything is easily inferred from code or can be easily found in a large codebase. You can also describe e.g. chosen tradeoffs.


  • https://diataxis.fr/

    (originally developed at: https://docs.divio.com/documentation-system/) --- divides documentation along two axes:

    - Action (Practical) vs. Cognition (Theoretical)

    - Acquisition (Studying) vs. Application (Working)

    which for my current project has resulted in:

    - readme.md --- (Overview) Explanation (understanding-oriented)

    - Templates (small source snippets) --- Tutorials (learning-oriented)

    - Literate Source (pdf) --- How-to Guides (problem-oriented)

    - Index (of the above pdf) --- Reference (information-oriented)

    •     README => AGENTS.md
          HOWTO => SKILLS.md
          INFO => Plan/Arch/Guide
          REFERENCE => JavaDoc-ish
      

      I'm very near the idea that "LLMs are randomized compilers" and that the human prompts should be treated with 1000% more care. Don't (necessarily) git commit the whole megabytes of token-blathering from the LLM, but do keep the human prompts:

      "Hey, we're going to work on Feature X... now some test cases... I've done more testing and Z is not covered... ok, now we'll extend to cover Case Y..."

      Let me hover over the 50-100 character commit message and then see the raw discussion (source) that led to the AI-generated (compiled) code. Allow AI.next to review the discussion/response/diff/tests and see if it can expose any flaws with the benefit of hindsight!

  • > If good code was enough on its own we would read the source instead of documentation.

    An axiom I have long held regarding documenting code is:

      Code answers what it does, how it does it, when it is used, 
      and who uses it.  What it cannot answer is why it exists.  
      Comments accomplish this.

    • An important addendum: code can sometimes, with a bit of extra thinking on the part of the reader, answer the 'why' question. But it's even harder for code to answer the 'why not' question, i.e. what other approaches did we try that didn't work? Or what business requirements preclude these other approaches?


  • Having "grown up" on free software, I've always been quick to jump into code when documentation was dubious or lacking: there is only one canonical source of truth, and you need to be good at reading it.

    Though I'd note two kinds of documentation: docs how software is built (seldom needed if you have good source code), and how it is operated. When it comes to the former, I jump into code even sooner as documentation rarely answers my questions.

    Still, I do believe that literate programming is the best of both worlds, and I frequently lament the dead practice of doing "doctests" with Python (though I guess Jupyter notebooks are in a similar vein).

    Usually, the automated tests are the best documentation you can have!
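As a minimal sketch of the "doctests" idea mentioned above: the examples in a docstring are both documentation and executable checks, so they fail loudly when they drift from the code. The `slugify` function and the `run_doc_examples` helper are hypothetical names made up for this illustration; only the standard-library `doctest` module is assumed.

```python
import doctest


def slugify(title: str) -> str:
    """Turn a title into a lowercase, hyphen-separated slug.

    The examples below are documentation and tests at once.

    >>> slugify("Literate Programming")
    'literate-programming'
    >>> slugify("  Docs   and  Code ")
    'docs-and-code'
    """
    # split() with no arguments drops leading, trailing and repeated whitespace.
    return "-".join(title.lower().split())


def run_doc_examples() -> doctest.TestResults:
    """Execute the examples embedded in slugify's docstring."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        slugify.__doc__, {"slugify": slugify}, "slugify", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # run() returns a TestResults tuple with .failed and .attempted counts.
    return runner.run(test)
```

If someone later changes `slugify` without updating the docstring (or the reverse), `run_doc_examples().failed` becomes non-zero, which is the kind of guard plain comments lack.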

  • I do read the code instead of the documentation, whenever that is an option.

    Interesting factoid. The number of times I've found the code to describe what the software does more accurately than the documentation: many.

    The number of times I've found the documentation to describe what the software does more accurately than the code: never.

    • You seem to misunderstand the purpose of documentation.

      It's not to be more accurate than the code itself. That would be absurd, and is by definition impossible, of course.

      It's to save you time and clarify why's. Hopefully, reading the documentation is about 100x faster than reading the code. And explains what things are for, as opposed to just what they are.


  • > If good code was enough on its own we would read the source instead of documentation.

    Uh. We do. We, in fact, do this very thing. Lots of comments in code is a code smell. Yes, really.

    If I see lots of comments in code, I'm gonna go looking for the intern who just put up their first PR.

    > I believe part of good software is good documentation

    It is not. Docs tell you how to use the software. If you need to know what it does, you read the code.

    • > Lots of comments in code is a code smell. Yes, really.

      No, not really. It's actually a sign of devs who are helping future devs who will maintain and extend the code, so they can understand it faster. It's professionalism and respect.

      > If I see lots of comments in code, I'm gonna go looking for the intern who just put up their first PR.

      And I'm going to find them to say good job, keep it up! You're saving us time and money in the future.


    • > If you need to know what it does, you read the code.

      True.

      But if you need to know why it does what it does, you read the comments. And often you need that knowledge if you are about to modify it.


    • Not for everything. For code you own, yes, this is often the case. For the majority of the layers you still rely on documentation. Take the project where you mention going straight to the source: did you follow this thread all the way down through each compiler involved in building the project? Of course not.


> because my prompts are in natural languages, and hence ambiguous.

Legalese developed specifically because natural language was too ambiguous. A similar level of specificity for prompting works wonders.

One of the issues with specifying directions to the computer with code is that you are very narrowly describing how something can be done. But I don't always know the best 'how'; I just know what I know. With natural language prompting the AI can tap into its training knowledge and come up with better ways of doing things. It still needs lots of steering (usually), but a lot of the time you can end up with a superior result.

  • Yes. LLMs are search engines into the (latent) space of source code. Stuff you put into the context window is the "query". I've had some good results by minimizing the conversational aspect, and thinking in terms of shaping the context: asking the LLM to analyze relevant files, not because I want the analysis, but because I want a good reading in the context. LLMs will work hard to stay in that "landscape", even with vague prompts. Often better than with weirdly specific or conflicting instructions.

    • But search engines are not a good interface when you already know what you want and need to specify it exactly.

      See for example the new Windows start menu compared to the old-school run dialog – if I directly run "notepad", then I always get Notepad; but if I search for "notepad" then, after quite a bit of chugging and loading and layout shifting, I might get Notepad or I might get something from Bing or something entirely different at different times.


> Natural languages are ambiguous. That's the reason why we created programming languages. So the documentation around the code is generally ambiguous as well. Worse: it's not being executed, so it can get out of date (sometimes in subtle ways).

I loathe this take.

I have rocked up to codebases where there were specific rules banning comments because of this attitude.

Yes comments can lie, yes there are no guards ensuring they stay in lock step with the code they document, but not having them is a thousand times worse - I can always see WHAT code is doing, that's never the problem, the problem is WHY it was done in this manner.

I put comments like "This code runs in O(n) because there are only a handful of items ever going to be searched - update it when there are enough items to justify an O(log2 n) search"

That tells future developers that the author (me) KNOWS it's not the most efficient code possible, but it IS when you take into account things unknown by the person reading it
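A rough sketch of what such a "why" comment could look like in practice. The function name `find_index` and the data are hypothetical, chosen only to illustrate the point; the comment records the reasoning that the code itself cannot express.

```python
from typing import Optional, Sequence


def find_index(items: Sequence[str], target: str) -> Optional[int]:
    """Return the position of target in items, or None if it is absent."""
    # WHY a linear scan: only a handful of items are ever searched here, so
    # O(n) is good enough and avoids keeping the data sorted for a binary
    # search. Revisit this when there are enough items to justify O(log n).
    for index, item in enumerate(items):
        if item == target:
            return index
    return None
```

The code already shows what happens (a linear scan); the comment adds the assumption about input size that a later reader would otherwise have to guess.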

Edit: Tribal knowledge is the worst type of knowledge. It's assumed that everyone knows it and passes it along when new people onboard, but the reality (for me) has always been that the people doing the onboarding have had fragments of, or incorrect assumptions about, what was conveyed to them, and just like the children's game of "telephone", the passing of the knowledge always ends in a disaster.

  • > Yes comments can lie ...

    Comments only lie if they are allowed to become one.

    Just like a method name can lie. Or a class name. Or ...

    • Right.

      The compiler ensures that the code is valid, and what ensures that ‘// used a suboptimal sort because reasons’ is updated during a global refactor that changes the method? … some dude living in that module all day every day exercising monk-like discipline? That is unwanted for a few reasons, notably the routine failures of such efforts over time.

      Module names and namespaces and function names can lie. But they are also corrected wholesale and en-masse when first fixed, those lies are made apparent when using them. If right_pad() is updated so it’s actually left_pad() it gets caught as an error source during implementation or as an independent naming issue in working code. If that misrepresentation is the source of an emergent error it will be visible and unavoidable in debugging if it’s in code, and the subsequent correction will be validated by the compiler (and therefore amenable to automated testing).

      Lies in comments don’t reduce the potential for lies in code, but keeping inline comments minimal and focused on exceptional circumstances can meaningfully reduce the number of aggregate lies in a codebase.
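To illustrate the point about names being checkable in a way comments are not, here is a small sketch. Both padding functions and the `name_matches_behaviour` check are hypothetical, written only for this example: a function whose name lies about its behaviour is exposed as soon as anything exercises it.

```python
from typing import Callable


def right_pad(text: str, width: int, fill: str = " ") -> str:
    """Pad text on the right until it is width characters long."""
    return text.ljust(width, fill)


def misnamed_right_pad(text: str, width: int, fill: str = " ") -> str:
    # Deliberately wrong for the example: the name says "right",
    # but the body pads on the left.
    return text.rjust(width, fill)


def name_matches_behaviour(pad: Callable[[str, int, str], str]) -> bool:
    """Check that a function claiming to right-pad really does so."""
    return pad("ab", 4, ".") == "ab.."
```

An automated test calling `name_matches_behaviour` fails for the misnamed function, whereas a stale comment sitting next to it would keep passing silently.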


  • I don’t disagree here. I personally like to put the why into commit messages though. It’s my longtime fight to make people write better commit messages. Most devs I see describe what they did. And in most cases that is visible from the change-set. One has to be careful here as similar to line documentation etc everything changes with size. But I prefer if the why isn’t sprinkled between source. But I’m not dogmatic about it. It really depends.

    • https://conventionalcommits.org/en/v1.0.0/

      I <3 great (edit: improve clarity) commit comments, but I am leaning more heavily to good comments at the same level as the dev is reading - right there in the code - rather than telling them to look at git blame, find the appropriate commit message (keeping in mind that there might have been changes to the line(s) of code and commits might intertwine, thus making it a mission to find the commit holding the right message(s)).

      edit: I forgot to add - commit messages are great, assuming the people merging the PR into main aren't squashing the commits (a lot of people do this because of a lack of understanding of our friend rebase)

  • IMHO, you shouldn't have to justify yourself ("yeah yeah, this is not optimal, I know it because I am not an idiot"). Just write your code in O(n) if that's good enough now. Later, a developer may see that it needs to be optimised, and they should assume that the previous developer was not an idiot and that it was fine with O(n), but now it's not anymore.

    Or do you think that your example comment brings knowledge other than "I want you to know that I know that it is not optimal, but it is fine, so don't judge me"?

    • A little bit of "Don't judge me" and a little bit of "I nearly fell into a trap here, and started writing O(log n) search, but realised that it was a waste of time and effort (and would actually slow things down) - so to save you from that trap here's a note"


Docs and code work together as mutually error correcting codes. You can’t have the benefits of error detection and correction without redundant information.

  • > With agents, does it become practical to have large codebases that can be read like a narrative, whose prose is kept in sync with changes to the code by tireless machines?

    I think this is true. Your point supports it. If either the explanation / intention or the code changes, the other can be brought into sync. Beautiful post. I always hated the fact that research papers don't read like novels, eg "ohk, we tried this which was unsuccessful but then we found another adjacent approach and it helped."

    Computer Scientist Explains One Concept in 5 Levels of Difficulty | WIRED

    https://www.youtube.com/watch?v=fOGdb1CTu5c

    Computer scientist Amit Sahai, PhD, is asked to explain the concept of zero-knowledge proofs to 5 different people; a child, a teen, a college student, a grad student, and an expert. Using a variety of techniques, Amit breaks down what zero-knowledge proofs are and why it's so exciting in the world of cryptography.

Programming languages are natural and ambiguous too, what does READ mean? You have to look it up to see the types. The power comes from the fact that it's audit-able, but that you don't need to audit it every time you want to write some code. You think you write good code? Try to prove it after the compiler gets through with it.

Natural languages are richer in ideas. It may be harder to get working code going from a purely natural description than from code to code, but you don't gain much from just translating code. One is only limited by your imagination; the other already exists, so you could just call it as a routine.

You only have a SENSE for good code because it's a natural language with conventions and shared meaning. If the goal of programming is to learn to communicate better as humans then we should be fighting ambiguity not running from it. 100 years from now nobody is going to understand that your conventions were actually "good code".

  • > Programming languages are natural and ambiguous too

    Programming languages work because they are artificial (small, constrained, often based on algebraic and arithmetic expressions, boolean logic, etc.) and have generally well-defined semantics. This is what enables reliable compilers and interpreters to be constructed.

    • Exactly. Programming is the art of removing ambiguity and making it formal. And it's why the timelines between getting an EXACT plan of what I need to implement vs hazy requirements are so out of whack.

  • > Programming languages are natural and ambiguous too, what does READ mean?

    "READ" is part of the "documentation in natural language". The compiler ignores it entirely, it's not part of the programming language per se. It is pure documentation for the developers, and it is ambiguous.

    But the part that the compiler actually reads is non-ambiguous. It cannot deal with ambiguity, fundamentally. It cannot infer from the context that you wrote a line of code that is actually ironic, and it should therefore execute the opposite.

  • > Programming languages are natural and ambiguous too, what does READ mean?

    Not nearly in the same sense actual language is ambiguous.

    And ambiguity in programming is usually a bad thing, whereas in language it can usually be intended.

    Good code, whatever that means, can read like a book. Event-driven architectures are a good example because the context of how something came to be is right in the event name itself.

  • What is good code now is only good code because of the bad programming languages we’ve had to accept for the last hundred years because we’re tied to incremental improvements. We’re tied to static brittle types. But look at natural systems - they all use dynamic “languages.” When you get a cut, your flesh doesn’t throw an exception because it’s connected to the wrong “thing.” Maybe AI will redefine what good code means, because it’s better able to handle ambiguity.

>Natural languages are ambiguous. That's the reason why we created programming languages.

Programming languages can be ambiguous too. The thing with formal languages is more that, by convention, they allow a stricter and narrower freedom of interpretation wherever they are used. If anything, they are a subset of the human expression space. Sometimes they are the best tool for the job. Sometimes a metaphor is more apt. Sometimes you need some humour. Sometimes you'd better stay in ambiguity to play the game at its finest.

  • Programming languages are non-ambiguous, in the sense that there is no doubt what will be executed. It's deterministic. If the program crashes, you can't say "no but this line was a joke, you should have ignored it". Your code was wrong, period.

I don’t have my LLMs generate literate programming. I do ask it to talk about tradeoffs.

I have full examples of something that is heavily commented and explained, including links to any schemas or docs. I have gotten good results when I ask an LLM to use that as a template, that not everything in there needs to be used, and it cuts down on hallucinations by quite a bit.

"But translating my prompts to code is not working as well, because my prompts are in natural languages, and hence ambiguous."

Not only that, but there's something very annoying and deeply dissatisfying about typing a bunch of text into a thing for which you have no control over how it's producing an output, nor can an output be reproduced even if the input is identical.

Agreed, natural language is very ambiguous and becoming more ambiguous by the day: what exactly does "vibe" mean?

People spoke in a particular way, say 60 years ago, that left very little room for interpretation of what they meant. The same cannot be said today.

  • > People spoke in a particular way, say 60 years ago, that left very little room for interpretation of what they meant. The same cannot be said today.

    Surely you don’t mean everyone in the 1960s spoke directly, free of metaphor or euphemism or nuance or doublespeak or dog whistle or any other kind of ambiguity? Then why are there people who dedicate their entire life to interpreting religious texts and the Constitution?

> That's the reason why we created programming languages.

No, we created programming languages because when computers were invented:

1: They (computers) were incapable of understanding natural language.

2: Programming languages are easier to use than assembly or writing out machine code by hand.

LLMs are a quite recent invention, and require significantly more computing power than early computers had.

Maybe if we had a really terse and unambiguous form of English? Whenever there is ambiguity we insert parentheses and operators to really make it clear what we mean. We can enclose different sentences in brackets to make sure that the scope of a logical condition is clear, and so on. Oh wait