The most copied StackOverflow snippet of all time is flawed (2019)

4 years ago (programming.guide)

I'm the author of #6 on the same list. It's definitely interesting to see it has been used thousands of times on GitHub, and who knows how many more in proprietary code. I don't think it's buggy, but I now think it could definitely be improved.

I think this shows an example of a big problem with StackOverflow compared to its initial vision. I remember listening to Jeff and Joel's podcast, and hearing the vision of applying the Wikipedia model to tech Q&A. The idea was that answers would continue to improve over time.

For the most part, they don't. I'm not quite sure if it's an issue of incentives or culture. Probably some of both. I think that having a person's name attached to their answer, along with a visible score really gives a sense of ownership. As a result, other people don't feel enabled to come along and tweak the answer to improve it.

Then, once an answer is listed at the top, it is given more opportunity for upvotes, so other improved answers don't seem to bubble up. This is a larger issue with most websites that sort by ratings. Generally they sort items based on the total number of votes, including Hacker News itself. Instead, to measure the quality of an item, we should look at the number of votes divided by the number of views. It may be tough to measure the number of views of an item, but we should be able to get a rough estimate based on its position on a page, for example.

If the top comment on a HN discussion is getting 100 views in a minute and 10 upvotes, but the 10th comment down gets 20 views and 5 upvotes, the 10th comment is likely a better quality comment. It should be sorted above the top ranked comment! There would still need to be some smoothing and promotion of new comments to get them enough views to measure their quality as well.

Such a policy on StackOverflow would also help newer, but better answers sort to the top.
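The votes-per-view idea with smoothing might look like this minimal sketch (Python for illustration; the prior constants are made-up tuning knobs, not measured values):

```python
def quality_score(upvotes, views, prior_rate=0.05, prior_views=100):
    # Smoothed upvote rate: blend the observed rate with a prior so
    # that new items with few views aren't ranked on noise alone.
    return (upvotes + prior_rate * prior_views) / (views + prior_views)
```

With these numbers, a comment with 5 upvotes from 20 views outranks one with 10 upvotes from 100 views, as the example above suggests it should.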

  • An idea I've had for a long time is that "the community" can vote to override an accepted answer. There are many times when the accepted answer is incorrect, or a newer answer is now more correct, but the only person who can change an accepted answer is the OP.

    I think community-based changes to the accepted answer would go a long way to solving your problem too, but it requires someone to be reviewing newer answers and identifying when there's another that would be more appropriate.

    It'd incentivise writing newer answers to older questions. Correcting accepted answers that probably weren't ideal to begin with. A new "role" where users hunt through older questions and answers looking for improvements to make.

    Stack Overflow answers are supposed to be community-based, but we unfairly prioritise the will of the original questioner *forever*. I don't think that's optimal.

    • As a side gig I teach an intro to web development class online. Every semester I get students asking for help about why their code isn’t working. Nine times out of ten, they are trying to use some jQuery code they copied from stackoverflow because it is the accepted answer. They don’t yet know enough to recognize that it isn’t vanilla JavaScript (which they are required to use).

      2 replies →

    •   > but the only person who can change an accepted answer is the OP.
      

      This system makes the person arguably _least qualified_ to understand the situation the sole arbiter of which answer is accepted.

      Was it the most efficient? First to answer? Copied-and-pasted right in with no integration work? Written by someone with an Indian username? Got the most upvotes? Made a Simpsons reference? Written by someone with an Anime avatar?

      10 replies →

    • Currently the only incentive to post a new answer to an old question is you get a special badge. That's neat but limited. I've gone through old R questions and posted answers with a more modern syntax and my answers rarely get much attention.

      I'd be cautious about overriding an accepted answer. Imagine a situation where there's an easy-to-understand algorithm that's O(n^2) and the "Correct" algorithm that's O(n). If OP only has a dozen datapoints, the former might be the best answer for her specific problem, despite it clearly not being the right approach for most people finding the thread via Google in the future.

    • They actually recently added this feature - you have a "this answer is outdated" button you can press. Not sure what the reputation threshold to see it is.

      1 reply →

    • "An idea I've had for a long time is that "the community" can vote to override an accepted answer."

      I don't know if this is still a thing, but for some time in the past when an answer was edited more than a certain amount of times it automatically turned into what was called a "community wiki" answer.

    • Or you could just edit the accepted answer if it’s wrong? I’ve seen a few posts where the top contains an “UPDATE” that, in summary, links to another answer.

  • One of the things that baffles me the most about SO is that I can't sort answers by _newest first_.

    If I search for something related to javascript for example, I know there will be a ton of answers for older versions that I am most likely not interested in. However I can only sort by oldest first (related to date).

    Old answers are definitely useful a lot of the time, but the fact that there's not even the option to sort them the other way around tells me that SO somehow, at its core, considers new answers less important.

    A strange decision if you ask me, considering software changes so much over time.

    If anyone has a possible explanation for this I'd love to hear it.

    • There are three buttons that act as sorting directions at the top of the answers section: "Votes," "Oldest," and "Active." The "Active" option sorts by most recently modified, which is _usually_ what you'd want instead of strictly newest. (i.e. an edit would update the timestamp, making that answer have a more recent activity date)

      So, I guess the answer to your question of "why can't I" is "good news! you can" :)

      1 reply →

    • > If I search for something related to javascript for example

      As someone that's been learning a little JS over the last year, I quickly came to the realization that you skip over the SO links that come up in the search, and you go to one of the many other sites. I've had good luck with w3schools and mdn. SO is a lost cause for JS.

      1 reply →

  • > we should look at the number of votes, divided by the number of views

    Closer, but still not quite what you want: a few stray votes can make a massive impact just from discretization effects. What you really care about is which answer is "best" by some metric, and you're trying to infer that as well as possible from the voting history. Average votes do a poor job. Check out this overview from the interblags [0].

    [0] https://www.evanmiller.org/how-not-to-sort-by-average-rating...
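The approach the linked article recommends, ranking by the lower bound of the Wilson score confidence interval rather than by average rating, can be sketched like this (Python; z=1.96 is the usual 95% quantile):

```python
import math

def wilson_lower_bound(positive, total, z=1.96):
    # Lower bound of the Wilson score interval for the "true" upvote
    # rate, given `positive` upvotes out of `total` votes. Items with
    # few votes are pulled toward 0, so a 1-for-1 answer ranks below
    # a 90-for-100 one.
    if total == 0:
        return 0.0
    phat = positive / total
    z2 = z * z
    centre = phat + z2 / (2 * total)
    spread = z * math.sqrt((phat * (1 - phat) + z2 / (4 * total)) / total)
    return (centre - spread) / (1 + z2 / total)
```

Sorting answers by this bound would let a young answer with a strong vote rate overtake an old one that has merely accumulated votes.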

    • This isn't just a statistical problem, it's also a classical exploration/exploitation trade-off. You want users to notice and vote on new answers (exploration), but users only want to see the best answers (exploitation). The order you show will influence future votes (and future answers).

      In addition, it's a social engineering problem. At least people with a western psychology seem to respond very strongly when a score is attributed to their person (as opposed to a group success like in a wiki). So you better make the score personal and big and visible, and do not occasionally sort by random just to discover the true score.

    • I think that's a great example of the "smoothing" that I was alluding to, though not in a format accessible to most programmers. However it is still just using a function of upvotes and downvotes. I think true rating can be much better when you also incorporate number of opportunities to vote. Because having the opportunity to vote (by viewing an item, or purchasing it, or whatnot) and choosing not to vote is still a really useful piece of data about the quality of an item. Especially when you are comparing old items that have had millions of opportunities against new items with only thousands.

      1 reply →

  • > I think this shows an example of a big problem with StackOverflow compared to its initial vision. I remember listening to Jeff and Joel's podcast, and hearing the vision of applying the Wikipedia model to tech Q&A. The idea was that answers would continue to improve over time.

    Interesting. As a random visitor, this was never apparent to me from the way SO presents itself.

    > For the most part, they don't. I'm not quite sure if it's an issue of incentives or culture.

    I think it's more a problem of communication and UI. SO is not really the kind of site that encourages people to answer or improve things. The overall design is also rather technical and strange, not motivating or user-friendly.

    Today, for the first time, I realized that there is a history for answers and an "improve" button that seems to allow me to change someone else's answer. I only saw that because I explicitly looked for it because of this thread.

    Wikipedia in the beginning was very vocal about motivating all kinds of people to help and improve articles. SO never had that vibe for me. Additionally, it simply doesn't have an interface that makes this stuff simple. There are only those awful comments under each answer, which are not really useful for discussing an answer at length and from all sides. It might be better to change them into a full-fledged forum with collaborative editing and some small wiki functionality, or something like that.

    I remember they tried to do some kind of wiki with high-quality code parts - what happened to that?

  • One of the really frustrating things about SO is that once you reach a certain rep threshold, you lose the ability to suggest edits, and instead gain the ability to just make the edits directly. I'm a lot more likely to do the former, because it helps ensure that if I actually made a mistake, it will be caught by the people voting on it. And so SO has lost out on a bunch of my suggested edits because they took away my ability to suggest edits.

  • What would really help with the vision here is some way to comment and associate tests against posted code. I have corrected algorithms on Wikipedia that were obviously wrong with even a cursory test. Then people can adjust the snippet, debate the test parameters, or whatever else they need to do while maintaining some sort of sanity check. If it’s good enough for random software projects used by a dozen people, it’s probably good enough for snippets used by thousands of developers and even more users.

    • This post made me think the same thing. It would be nice to have a StackOverflow that was actually more code focused. People could write tests or code and actually run them.

  • I always try and improve existing answers with edits. Often just adding important context when the answer is just a line of bash and adding links to source documentation.

    There's very little gamification incentive to do so, and often the edit queue is full. Still, there are lots of times when important caveats and information are pointed out in the comments and never added to the answer.

    • The other day I asked a question about the c/c++ plugin of vscode; somebody swooped in to edit it to just be c++ because “c/c++ is not a programming language”. The question wasn’t answered. I wonder what the incentive is for people to do something like that.

  • > As a result, other people don't feel enabled to come along and tweak the answer to improve it.

    It's worse than that. Edits have to go through a review process that is much more selective and often arbitrarily rejects good edits.

    • Editing answers is a complete waste of time. You can post a correction along with a copy and paste of the relevant section from the documentation, yet have your edit disappear without explanation.

  • To correctly measure the quality of an item one needs to take something like Google's PageRank algorithm and apply it to people. That is, there needs to be some measure of the reputation of the person posting. This doesn't mean that a person who was correct in the past is necessarily correct right now, but it is true that people who are often correct tend to go on being correct, and people who are often wrong tend to go on being wrong. Careful people tend to continue to be careful, and sloppy people tend to continue to be sloppy. It's important to capture that reality and use it as a weight given to any particular answer.
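A PageRank-for-people system is a bigger project, but the core idea of weighting votes by the voter's track record can be sketched minimally (Python; all names and numbers here are hypothetical):

```python
def weighted_score(votes, reputation):
    # votes: list of (voter_id, direction) with direction +1 or -1
    # reputation: voter_id -> weight learned from past correctness.
    # Unknown voters default to weight 1.0, i.e. a plain vote count.
    return sum(d * reputation.get(v, 1.0) for v, d in votes)
```

A real system would also need to update the reputation weights themselves, PageRank-style, by iterating over the whole vote graph.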

  • Potentially a stupid question; why is it not possible to just make a MediaWiki site explicitly for SO questions? Does it exist already?

    • The technical cost/effort for someone like you or me to do that is minimal. The expensive part is the ongoing social maintenance fee aka moderation. As evident by the stack overflow drama re: Monica, it’s an unsolved (non-technical) problem that you could make your own mint to print money on, if you were able to fix any tiny part of it.

      2 replies →

  • Wouldn't a simple TTL (time to live) solve that problem? Of course, with an option to see the graveyard.

    This would mean that the same questions would get answered again and again over the years, but I think that could also solve the negative reputation problem of the website.

    Two birds with one stone, or if you're Slovenian, two flies with one swat. ^^

  • >> For the most part, they don't. I'm not quite sure if it's an issue of incentives or culture.

    Classic example of "good is the enemy of best".

What’s wrong with a simple loop (like the one near the top)? Why does it have to be branchless? Wouldn’t the IO take longer than missed branches/pipeline flushes?

Not to mention that the fixed version now has branches as well…

  • Not sure why some programmers these days have an aversion to simple loops and other boring - but readable - code.

    Instead we have overused lambdas and other tricks that started out clever but become a nightmare when wielded without prudence. In this article, the author even points out why not to use his code:

    Note that this started out as a challenge to avoid loops and excessive branching. After ironing out all corner cases the code is even less readable than the original version. Personally I would not copy this snippet into production code.

    • I'm not against using for loops when what you need is an actual loop. The thing is, most of the time, for loops were actually doing something for which there are concepts that express exactly what was being done - though not in all languages.

      For instance, map - I know that it will return a new collection of exactly the same number of items the iterable being iterated has. When used correctly it shouldn't produce any side-effects outside the mapping of each element.

      In some languages now you have for x in y, which in my opinion is quite OK as well, but to change the collection it still has to mutate it, and it's not immediately clear what it will do.

      If I see a reduce, I know it will again iterate a definite number of times, and that it will (usually) return something other than the original iterable, reducing a given collection into something else.

      On the other hand forEach should tell me that we're only interested in side-effects.

      When these things are used with their semantic context in mind, it becomes slightly easier to immediately grasp the scope of what they're doing.

      On the other hand, with a for loop (especially the common, old-school one) you really never know.

      I also don't understand what is complex about the functional counterparts. A for (initialise_var; condition; post/pre action) can only seem simpler due to familiarity, as it can have a lot of small nuances that affect how the iteration goes. To be honest, most of the time it isn't complex either, but it does seem slightly more complex, and it carries less contextual information about the intent behind the code.

      25 replies →
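The semantic contracts described above (map preserves length, reduce collapses into something else, forEach is for side effects) can be illustrated briefly; Python is used here for compactness:

```python
from functools import reduce

items = [1, 2, 3, 4]

# map: one output per input, no side effects intended
squares = list(map(lambda x: x * x, items))

# reduce: collapses the iterable into a single value of another shape
total = reduce(lambda acc, x: acc + x, items, 0)

# "forEach": side effects only; in Python this is a plain for-in loop
seen = []
for x in items:
    seen.append(x)
```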

    • I can't comment on the social phenomenon here, but there is indeed a decent technical argument for avoiding for loops when possible.

      In a nutshell, it's kind of like the "principle of least privilege" applied to loops. Maps are weaker than folds, which are weaker than for loops, meaning that the stronger ones can implement the weaker ones but not vice versa. So it makes sense to choose the weakest version.

      More specifically, maps can be trivially parallelized; same for folds, but to a lesser degree, if the reducing operation is associative; and for-loops are hard.

      In a way, the APL/J/K family takes this idea and explores it in fine detail. IMHO, for loops are "boring and readable" only in isolation; when you look at the system as a whole, lots of for loops make reasoning about the global behaviour of your code a lot harder, for the simple reason that for loops are too "strong", giving them unwieldy algebraic properties.

      12 replies →

    • Very often processes are naturally modelled as a series of transformations. In those cases, writing manual loops is tedious, error-prone, harder to understand, less composable and potentially less efficient (depending on language and available tools) than using some combination of map, filter and reduce.
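As a small illustration of that point (Python used for brevity, with hypothetical data), here is the same computation written as a manual loop and as a pipeline of transformations:

```python
from functools import reduce

# hypothetical data: (name, quantity, unit_price)
orders = [("widget", 3, 2.5), ("gadget", 0, 9.0), ("gizmo", 2, 4.0)]

# manual loop
total = 0.0
for name, qty, price in orders:
    if qty > 0:
        total += qty * price

# the same process as a series of transformations
total_fn = reduce(
    lambda acc, x: acc + x,
    map(lambda o: o[1] * o[2], filter(lambda o: o[1] > 0, orders)),
    0.0,
)
```

Both compute the same value; the pipeline states the filter and the per-item transformation separately, which is the composability argument above.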

    • > Not sure why some programmers these days have aversion to simple loops and other boring - but readable - code.

      Like goto, basic loops are powerful, simple constructs that tell you nothing at all about what the code is doing. For…in loops in many languages are a little better; map, reduce, and comprehensions are much more expressive as to what the code is doing, though they mostly address the common cases of for loops.

      While loops are weakly expressive (about equal to for…in), and except in languages without C-style for loops, where they serve as a substitute, there is less often a convenient replacement for them.

    • Disclaimer: amateur developer for 25 years, no formal education in that area

      A loop that iterates over indices when I want elements is not readable; e.g. I prefer

          for element in elements:
      

      rather than

          for (i = 0; i < len(elements); i++) { element = elements[i]; ... }
      

      This is maybe where the aversion comes from: people usually [citation needed] want to iterate over elements rather than indices.

      2 replies →

    • Yes, this plagues JDK8+ code. Every fashionable Java coder has to use an overly complex, lazy stream instead of a simple loop in every case.

  • The irony is that a single log computation is going to take longer than the loop. (No idea if implementing a log approximation involves loops either.)

  • Besides, log()'s implementation is certainly not branchless.

    It's the ostrich approach: if you don't see the branches they don't matter.

  • Simplicity FTW. The simple loop version is very easy to understand. It's probably really fast, as it's just a loop over seven items. And more importantly it's more correct. It doesn't use floating point arithmetic, so you don't have to worry about precision issues.

    The logarithmic approach is harder to reason about, prone to bugs (as proven by this post). I'm baffled at the fact that tons of people considered it a more elegant solution! It's completely the opposite!

  • The original version had branches too; in fact, a majority of the lines had them! ? is just shorthand for if.

    • This isn't true: this form of conditional can be compiled into cmov-type instructions, which are faster than a regular conditional jump.

      17 replies →

  • Exactly. As the article itself mentions:

    > Granted it’s not very readable and log / pow probably makes it less efficient

    So, the "improved" solution is both less readable and probably less efficient... where is the improvement then?

  • If it were me in my programming language, I would just use Humanizr and be freaking done with it.

  • The real question is why it's a bug to report 1 MB instead of 999.9 kB for human-readable output. It seems like a nice excursion into FP-related pitfalls, but I don't think this is a problem worth getting entangled in.

As part of the Stack Overflow April Fools' prank, we did some data analysis on copy behavior on the site [0]. The most copied answer during the collection period (~1 month) was "How to iterate over rows in a DataFrame in Pandas" [1], receiving 11k copies!

[0] https://stackoverflow.blog/2021/04/19/how-often-do-people-ac...

[1] https://stackoverflow.com/a/16476974/16476924

  • That’s sad, as when you find yourself iterating over rows in pandas you’re almost invariably doing something wrong or very, very suboptimally.

    • To me it's a means to an end. I don't care if my solution takes 100ms instead of 1ms; it's the superior choice for me if it takes me 1 minute to do it instead of 10 minutes to learn something new.

      9 replies →

    • I iterate over rows in pandas fairly often for plotting purposes. Anytime I want to draw something more complicated than a single point for each row, I find it's simple and straight-forward to just iterrows() and call the appropriate matplotlib functions for each. It does mean some plots that are conceptually pretty simple end up taking ~5 seconds to draw, but I don't mind. Is there really a better alternative that isn't super complicated? Keep in mind that I frequently change my mind about what I'm plotting, so simple code is really good (it's usually easier to modify) even if it's a little slower.

    • >That’s sad, as when you find yourself iterating over rows in pandas you’re almost invariably doing some wrong or very very sub optimally.

      Humans writing code is suboptimal. I can't wait for the day when robots/AI do it for us. I just hope it leads to a utopia and not a dystopia.

    • I'm glad that DataFrames don't iterate by default. It's good design to make suboptimal features hard to access.

  • I got bitten by that prank when copying code from a question, to see what it did (it was something obviously harmless). I was rather annoyed for about two seconds before I realized what date it was. :)

> return String.format("%.1f %sB", bytes / Math.pow(unit, exp), pre);

As a human, the first thing that I hate about this interpretation of "human readable" format is the inconsistency in the number of significant digits. One digit after the decimal separator is simply wrong: when you jump from 999.9 MB to 1.0 GB you go from 4 significant digits to 2. Instead it should be 1.000 GB, 10.00 GB, and so on. This annoys me enormously when I upload things to Google Drive from an Android phone and watch the amount of data transferred: as soon as it becomes bigger than 1 GB the digits stop changing, and I become anxious that the transfer has stalled. My Windows Phone nostalgia then goes through the roof, as WP was never infected with this problem (by virtue of not using Java), and OneDrive on WP explicitly showed the current connection speed, so a frozen connection never caused the strange problems with uploaded files that it does with Google Drive on Android.

As a human not from US, the second thing I hate here is lack of locale parameter to pass to formatter as decimal separator is different in different cultures, and in the world of cloud computing the locale of the machine where the code is run is often different from the one where the message is displayed.

As a human from a culture using a non-Latin alphabet, the third thing I hate here should be obvious to the reader.
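The constant-significant-digits scheme asked for above (1.000 GB, 10.00 GB, ...) can be sketched like this in Python; the function name and the choice of four significant digits are illustrative, not from any real API:

```python
def format_bytes_sig(n, sig=4):
    # Keep a constant number of significant digits across unit boundaries
    # by deriving the decimal precision from the integer-part width.
    units = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
    exp = 0
    while n >= 1000 and exp < len(units) - 1:
        n /= 1000
        exp += 1
    digits = len(str(int(n)))            # digits before the decimal point
    return f"{n:.{max(sig - digits, 0)}f} {units[exp]}"
```

A locale-aware version would additionally take a locale parameter and format the decimal separator accordingly, per the second complaint above.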

  • I don't think it makes sense to talk about significant digits here. And while you are correct that you should not go from 999.9 MB to 1.0 GB, your reasoning is incorrect and so is your correction. Significant digits signify the reliability of the numbers. So if your measurement is accurate to ±50 kB, as indicated by 999.9 MB, you should then move to 1.0000 GB (5 significant digits). So it should be 10.00 GB and 1.00 GB, not 1.000 GB, because the reliability should not change between your measurements.

  • > instead it should be 1.000 GB, 10.00 GB and so on

    I had a hard time mentally parsing that sequence even when I knew what your point was so imagine regular users seeing that.

  • As a bonus, the thing I no longer care about here is that there is no option to output binary (IEC) prefixes.

> Key Takeaways:

> [...]

> Floating-point arithmetic is hard.

I have successfully avoided FP code for most of my career. At this point, I consider the domain sophisticated enough to be an independent skill on someone's resume.

  • There are libraries that offer more appropriate ways of dealing with it, but last time I ran into a FP-related bug (something to do with parsing xlsx into MySQL) I fixed it quickly by converting everything to strings and doing some unholy procedure on them. It worked but it wasn’t my proudest moment as a programmer.

  • As long as you're using it to represent what could be physical measurements of real-valued quantities, it's nearly impossible to go wrong. Problems happen when you want stupendous precision or human readability.

    Numerically unstable algorithms are a problem too but again, intuitively so if you think of the numbers as physical measurements.

    • I am regularly reminded of William Kahan's (the godfather of IEEE-754 floating point) admonition: A floating-point calculation should usually carry twice as many bits in intermediate results as the input and output deserve. He makes this observation on the basis of having seen many real world numerical bugs which are corrupt in half of the carried digits.

      These bugs are so subtle and so pervasive that it's almost always cheaper to throw more hardware at the problem than it is to hire a numerical analyst. Chances are that you aren't clever enough to unit test your way out of them, either.

    • Yep, floating point numbers are intended for scientific computation on measured values; however many gotchas they have when used as intended, there are even MORE if you start using them for numbers that are NOT that: money, or any kind of "count" rather than measurement (like, say, a number of bytes).

      The trouble is that people end up using them for any non-integer ("real") numbers. It turns out that in modern times scientific calculations with measured values are not necessarily the bulk of calculations in actually written software.

      In the 21st century, I don't think there's any good reason for literals like `21.2` to represent IEEE floats instead of a non-integer data representation that works more how people expect for 'exact' numbers (i.e., based on decimal instead of binary arithmetic, and supporting more significant digits than an IEEE float: so-called "BigDecimal"), at the cost of some performance that you can usually afford.

      And yet, in every language I know, even newer ones, a decimal literal represents a float! It's just asking for trouble. IEEE float should be the 'special case' requiring special syntax or instantiation, a literal like `98.3` should get you a BigDecimal!

      IEEE floats are a really clever algorithm for a time when memory was much more constrained and scientific computing was a larger portion of the universe of software. But now they ought to be a specialty tool, not the go-to for representing non-integer numbers.

      3 replies →
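Python's standard-library decimal module behaves the way the comment above argues literals should; a two-line demonstration of the difference:

```python
from decimal import Decimal

# binary floats accumulate representation error...
print(0.1 + 0.2)                        # 0.30000000000000004
# ...while decimal arithmetic matches what people expect of 'exact' numbers
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```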

    • Notably, this is only true of 64-bit floats. Sticking to 32-bit floats saves memory and is sometimes faster to compute with, but you can absolutely run into precision problems with them. When tracking time, you'll only have millisecond precision for under 5 hours. When representing spatial coordinates, positions on the Earth will only be precise to a handful of meters.
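The five-hour figure follows from float32's 24-bit significand (2^24 ms is about 4.66 hours). A round-trip through 32-bit storage, simulated here with the stdlib struct module, shows the lost millisecond:

```python
import struct

def to_f32(x):
    # round-trip a number through 32-bit IEEE 754 storage
    return struct.unpack("f", struct.pack("f", x))[0]

five_hours_ms = 5 * 3600 * 1000       # 18_000_000, just past 2**24
print(to_f32(five_hours_ms + 1))      # 18000000.0: the extra ms is gone
```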

    • I do a lot of floating point math at work and constantly run into problems, either from someone else's misunderstanding, my own misunderstanding, or because we just moved to a new microarchitecture and CPU dispatch hits a little differently, manifesting itself as rounding error to write off (public safety industry).

      1 reply →

    • Unfortunately, that doesn't work when you have to do:

      1 - quantity2 / (quantity1 - quantity2)

      ... or some such thing. If quantity1 and quantity2 are similar, ouch!

      1 reply →
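The cancellation warned about above is easy to reproduce: when two nearly equal quantities are subtracted, the leading digits cancel and only the already-rounded tail survives. A tiny demonstration:

```python
q1 = 1.0 + 1e-12      # only ~4 decimal digits of the 1e-12 survive storage
q2 = 1.0
diff = q1 - q2        # the subtraction itself is exact, but q1 was rounded
print(diff == 1e-12)  # False: the small difference has lost precision
```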

    • So you have problems if you want a precise answer, you want to display your answer, or if you want to use any of a large number of useful algorithms? That sounds like it’s quite easy to go wrong.

      1 reply →

  • Just this week I watched someone discover that computing summary statistics in 32-bit on a large dataset is a bad idea. Computer science curricula need to incorporate more computational science. It's a shame to charge someone tens of thousands of USD and not warn them that floating point has some obvious footguns.

    • > Just this week I watched someone discover that computing summary statistics in 32-bit on a large dataset is a bad idea. Computer science curricula need to incorporate more computational science.

      Sadly, I suspect too many "computer science" courses have turned into "vocational coding" courses, and now those people are computing summary statistics on large datasets in Javascript...

> At the very least, the loop based code could be cleaned up significantly.

Seems like the loop based code wasn't so bad after all...

  • This! If I had to choose between the two snippets I would have taken the loop-based one without a second thought, because of its simplicity. The second snippet is what usually happens when people try to write "clever" code.

    • The loop by itself isn't entirely clear about what it's doing. Stuff like the direction of the > comparison, whether it should be >= instead, and the byteCount / magnitudes[i] at the end really does require you to pause and do some mental analysis to check correctness. I think the real solution here is to define an integer log (ilog()?) function based on division and use it in the same manner as log(). That way you only do the analysis the first time you write that function, and after that you just call it knowing that it's correct.
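A sketch of that ilog() idea (Python for brevity; note this reproduces the original snippet's near-boundary rounding quirk that the article discusses, e.g. 999,999 bytes printing as "1000.0 kB"):

```python
def ilog(n, base):
    # integer logarithm via repeated division: no floating point involved
    exp = 0
    while n >= base:
        n //= base
        exp += 1
    return exp

def human_readable(byte_count, si=True):
    # all the loop-and-comparison analysis now lives inside ilog()
    unit = 1000 if si else 1024
    if byte_count < unit:
        return f"{byte_count} B"
    exp = min(ilog(byte_count, unit), 6)
    prefix = ("kMGTPE" if si else "KMGTPE")[exp - 1] + ("" if si else "i")
    return f"{byte_count / unit ** exp:.1f} {prefix}B"
```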

    • I was reading this and thought it sounded familiar. A few months ago I needed a human readable bytes format, ended up on that stack overflow article and, plot twist, copied the while loop one.

    • In his defence, he did admit at the start of the blog post that he was code golfing.

There might be an opportunity somewhere around this area to combine the versioning, continuous improvement, and dependency management of package repositories with the Q&A format of StackOverflow.

Something like "cherry pick this answer, with attribution, and notifications when flaws and/or improvements are found".

Maybe that's a terrible idea (there's definitely risk involved, and the potential to spread and create bad software), but equally I don't know why it would be significantly worse than unattributed code snippets and trends towards single-function libraries.

  • Node.js did something a lot like this by having packages that are just short snippets, but half the ecosystem flipped out when someone messed up `leftpad`.

    • Well that and because having 20,000 packages in your project is a PITA in various ways.

      Mostly but not entirely because NPM handled things poorly in various ways.

  • Sadly, updates don't just remove bugs; sometimes they also add them. Silently adding a bug to previously working code is far worse than silently fixing a bug you didn't know you had is good, so I wouldn't want a load of self-updating code snippets in my codebase.

> Sebastian then reached out to me to straighten it out, which I did: I had not yet started at Oracle when that commit was merged, and I did not contribute that patch. Jokes on Oracle. Shortly after, an issue was filed and the code was removed.

Good thing it wasn't a range check function. I hear those are expensive.

> I wrote almost a decade ago was found to be the most copied snippet on Stack Overflow. Ironically it happens to be buggy.

I don’t find it ironic; I find it quite normal that even small snippets of code contain bugs (given the daily review requests I receive).

I think when copying code from StackOverflow, what’s more important is understanding what the code does, and why, rather than copying it verbatim into your production code.

I also often find on StackExchange et al that quite often the most upvoted is the one that ‘fixes it’ for ‘most people’ yet the correct answer is down at number 3 or 4. Again, understanding the answer and why it applies, helps give you the context to understand if this is actually the solution to your problem or just treats the symptom.

  • What I realized years ago is that the upvotes on Stack Overflow don't mean "I tried this and it works for me" or "I'm an expert and this is the answer". No, the upvotes on Stack Overflow are along the lines of the upvotes/likes one would find on Reddit or HN. More like "you sound confident" or "I was looking for this but I haven't tried it yet".

    • > No, the upvotes on Stack Overflow are along the line of the upvotes/likes one would find on Reddit or HN. More like "you sound confident"

      I think you're right that online scoring systems tend to incentivise false confidence. This happens with blog posts too, where a student of some topic writes a confident and subtly incorrect blog post, and it then ends up on the HN front-page. Only someone with a relatively deep knowledge of the topic can then call out the errors. Ideally it should always be made clear upfront that the author is new to the material.

      Somewhat related: Stack Overflow's unfortunate norm of calling out mistakes in answers in a way that goes beyond confidence and strays into condescension and borderline hostility. For a lot of people it seems it's not enough to be seen to be right, they also feel the need to paint someone else as clueless, while just about passing as acceptably polite by keeping the aggression passive. If challenged, they'll brush it off as 'directness'.

      3 replies →

    • The number of times I've seen the only correct answer being a terse explanation with a short code snippet and having zero upvotes astounds me.

      They may not have been the attention seekers like other posters. But they provided exactly what was asked for. And when I come across their post years later I upvote.

    • > More like "you sound confident"

      Meh, I'm usually there looking for how to do something, and if a response helps me do whatever I was looking to accomplish, or at least gets me on the right track, it was helpful and worth an upvote. I've never upvoted just because someone sounded confident... at least not on SO.

      1 reply →

    • It's a user experience issue and it's hard to solve. You can't possibly expect people to come back to one of their SO tabs AFTER they get code to work.

      2 replies →

  • One of the best tips I have gotten from the internet is to never copy and paste code you have not written yourself. Even rewriting it verbatim makes you think about what it is you are actually copying.

    It's a pretty neat rule to have in mind.

    • Ladislav Vagner, a legendary programming tutor at FIT CTU, is a known proponent of being extremely cautious when copying code, even your own. He gives programming proseminars where students guide him as he codes the solution to some problem, e.g. mathjax-like typesetting in C++. It is a common theme in the proseminars that a bug is introduced by copying code. Probably on purpose, like many of the other bugs that students are supposed to point out.

    • I think that’s true if you’re trying to learn a new tool or technology. You probably won’t learn as much following the Rails or Django tutorials if you’re just pasting all the code. But if you’re just looking for some esoteric workaround for some very specific tool and use case, I think it’s fine to paste. And the latter makes up the overwhelming majority of my Stack Overflow visits.

    • It's also good legal advice. It's now legally possible for you to copy and paste code directly from stack overflow because they made an effort to assert a compatible license over works published on their site. However, the same can't be said for most other code snippets flying around out there.

      1 reply →

    • When you get into this habit it also makes it easier to translate solutions from other languages too

    • Yes, and:

      I didn't really grok Test Driven Development until I worked thru the book, line-by-line, experiencing the workflow.

      Knowledge vs experience.

  • > I also often find on StackExchange et al that quite often the most upvoted is the one that ‘fixes it’ for ‘most people’

    And also first, or at least early, and subject to a reinforcing cycle of 'sufficiently good' or 'fixed it enough' that it achieves stratospherically more votes than an 'even more good' or 'fixes it properly' answer that came in too late for the same traction.

    • > also first, or at least early, and subject to a reinforcing cycle of 'sufficiently good' or 'fixed it enough'

      So exactly the solution most project managers are after? /s

      1 reply →

  • This works better when the problem does not 100% match the issue you are tackling. It makes you think about how you can reshape what you found into something useful.

IMHO any code that tries to perform floating-point arithmetic on integer values and then produce exact output should be considered suspect in and of itself...too many edge cases.

> The most copied StackOverflow snippet of all time

Maybe, but it sounds like it's merely the Java snippet from SO found most often on GitHub. Not sure why the blog author didn't include the word "Java" in his title or the first paragraph:

> an answer I wrote almost a decade ago was found to be the most copied snippet on Stack Overflow

There is no evidence for this claim in the blog post, just that it's the "most copied Java snippet". And that's just based on occurrences on GitHub. Maybe the most-copied snippet is an AWK or ffmpeg one-liner? Something that wouldn't find its way into a GitHub repo. Or maybe something undetectably vanilla, like answers to "How do you write loops in language X?" Is there a way of finding out what actually is the most-copied snippet?

  • I don’t know how you’d find it, but if I had to bet it would be something to do with git.

This is a bit of a tangent, but while it may be conventional to round to the value with the smallest difference, is that convention good? In a case such as this, where it's fine for the precision to vary with magnitude, I'd argue it makes sense to round to the value with the smallest ratio.

The thing that jumped out at me, as I've seen the same kind of thing on the job, is the assumption that, eg, log(1000)/log(10) is exactly 3. Does the standard guarantee that the rounded approximation of one transcendental number by the rounded approximation of a related transcendental number will give 3.0 and not 2.999999999?

  • Yeah that seems like a serious flaw to me too. On my Python:

      >>> math.log(1000)/math.log(10)
      2.9999999999999996
      >>> int(math.log(1000)/math.log(10))
      2
    

    But I don't know about the guarantees provided in the JavaScript standard (or more importantly those offered by actual browsers).

    • Floating point math is IEEE 754 in pretty much all cases, so you should see this result in most languages. `math.log(1000, 10)` gives the same result because it's implemented using natural logs internally as it is in most languages.

      In this case, there are only about six boundary cases to consider, so you can just manually verify it works as expected.
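      A quick sketch of that check in Python (results depend on the platform's libm, though IEEE 754 doubles make them fairly consistent):

```python
import math

# Which powers of ten get truncated to the wrong exponent when computed
# as log(x)/log(10)?  k = 3 is one of them, per the output quoted above.
mismatches = [k for k in range(1, 16)
              if int(math.log(10**k) / math.log(10)) != k]
print(mismatches)
```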

I don't think this works very well as a cautionary tale because, honestly, I would not even care about a bug like this. It's something very few people would even figure out exists at all; it's incredibly inconsequential.

  • You’ve made a judgement call that correctness isn’t a priority in this circumstance, and as long as your judgement is sound, this approach will serve you well.

    The weakness of your approach is that no one’s judgment is sound 100% of the time.

    Alternatively, folks who always prioritize correctness may occasionally “waste their time”, but two things to consider: 1) their judgement is no longer an issue, and 2) in the long run they have spent more time training their correctness muscles and are in better shape.

    • However this "bug" is still entirely within spec. The whole article hinges on "the 1,000 “significand” is out of range according to spec", but it really isn't. 999999 Bytes are indeed around 1000kB. Sure, calling it 1MB would be preferable, but saying 1000kB doesn't violate any rule, is perfectly acceptable according to SI (just like I can say 1000g or 1kg interchangeably), fulfills the task of being human readable, and in all the examples given in the task the code behaves as requested.

      Sure, the code doesn't do exactly what the programmer wanted. But that doesn't necessarily make it incorrect.

This reminds me of my favourite SO answer:

https://stackoverflow.com/a/40429822/864112

It boggles the mind that anyone could ever suggest this as a solution.

  • Wow. Nice comment...

    > This is akin to answering "how do I bake a cake?" with "open up a bakery, walk inside, and ask for a cake"

    Only it's more like answering "how much does this cake cost" by purchasing the cake and looking at the receipt.

  • You'd think that is so completely wrong that no competent programmer would ever do it...

    Except the java.net.URL.equals and java.net.URL.hashCode methods do almost the same thing: they issue DNS requests (!)

    "Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null."

    See https://docs.oracle.com/en/java/javase/11/docs/api/java.base...

    There is a bug raised[1], but it can't be fixed for backwards compatibility reasons.

    I'll never forget this now, after debugging a horrible, severe, and very intermittent performance issue in some code over 20 years ago. A (slow) DNS resolver occasionally caused 1000x performance degradation on remote sites. That was horrible to work out.

    [1] https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4434494

  • I'm intrigued, why? I find multiple places where this is the most basic implementation. https://realpython.com/python-requests/#query-string-paramet... https://www.geeksforgeeks.org/get-post-requests-using-python...

    etc

      The question was how to build a URL, not how to send off a request to it. The answer sends off a request and then inspects the response to see what URL was used. If you wanted to send off the request, inspecting the URL on the result is probably not useful. If you didn't want to send off the request, doing it this way is wasteful or even harmful.

      2 replies →

      Your parent poster makes a mistake that is rife among Python programmers: assuming that in a delightfully simple[0] language like Python, it's "obvious" what a given piece of code does even without explanation. For people who don't delight in the poor taste of Python's design and have limited exposure to its standard library, it's not always obvious what a given piece of code is doing, so there are plenty of people who aren't going to pick up on the fact that requests.get doesn't just construct a GET request for the caller; it constructs such a request and then goes out and performs it, too.

      Shame about the hostile reaction from others towards your question. Keep asking questions (especially things that are presented without comment), and don't be afraid that doing so will make you look stupid or that you should feel like you should be punished for it.

      0. https://blog.imgur.com/wp-content/uploads/2017/06/mocking.jp...

      2 replies →

Say what you want about the stability of the npm ecosystem, but if this were JS, a new SemVer patch release could be cut, and it would be fixed in thousands of code bases essentially instantly.

Seems that most comments here missed the end of the article, where he points to the "production ready" version of the solution, which is indeed very close to the original one, including a while loop.

  • > ..."production ready" version of the solution, that is indeed very close to the original one, including a while loop.

    What's missing still is a comprehensive set of test cases to check against.

    If such cases were spec'ed to go along with the original code, then at least one could have seen the applicability range, and perhaps other people would have added some challenging corner cases (just as mentioned in the OP).

  • It's especially ironic given that this is about a StackOverflow code snippet that many people probably also copied without reading.

  • Neither is production ready because they have no code comments. And if ever there was code requiring comments, this is it.

This is a good example of why you should keep code as simple as possible, and why even "simple" math needs special care when testing (because floats are not simple).

When I'm explaining logarithms, I find it helps to relate it to the number of digits. This code is a good example of the concept: you don't need log, just convert the int to a string and check its length. A string with 1-3 digits is bytes, 4-6 is kb, etc.
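That idea can be sketched as follows (Python; the function name is my own, and note that the final rounding step still has the article's boundary problem):

```python
# Digit-count approach: pick the unit from the length of the decimal
# string instead of computing a logarithm.
def by_digit_count(n: int) -> str:
    suffixes = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
    digits = len(str(abs(n)))           # 1-3 digits -> B, 4-6 -> kB, ...
    i = min((digits - 1) // 3, len(suffixes) - 1)
    # Rounding can still produce "1000.0 kB" for 999999 -- the article's bug.
    return f"{n / 1000**i:.1f} {suffixes[i]}"
```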

> almost no branches

I wonder whether the author is suggesting that (potentially) nine branches is a small number, or they overlooked ternary expressions and function calls and are just counting the if statement.

  • There are tons of branches in those log and pow calls. Programmers are lost in branch free religion.

Why are you writing code? The question was for a static method in Apache Commons, not your "I'm so clever" implementation. I think the reading comprehension is flawed.

(Of course, this static method exists in Apache Commons, going back at least 20 years. But the fellow "code golfers" of the author voted someone to the first answer who similarly had the irresistible urge to try to be very clever. It's a scourge on StackOverflow.)

  • I think that the answer is what I would expect based on the question title (which doesn’t mention Apache Commons, only Java), if I was another user searching for the solution to this problem. Maybe the question should have been renamed to indicate this, but as it stands, I do think a library-agnostic solution is more helpful to people finding the question than an answer which only works for Apache Commons.

Something about this comes off as amateurish. The obsession with minimization. Just use a switch statement. Now where is the bug going to hide? The solution doesn't need to generalize, there is only a small handful of different solutions. Just break them all out. It's more maintainable and readable and requires less thinking.

  • I don't know why you are being downvoted. The best solution here is for the author to admit using log was a bad idea and rewrite an entirely different version.

Coincidentally I had to write this today, and perplexingly none of the answers on that popular question seem to be as efficient and readable as what I wrote:

  function prettyPrintBytesSI(x: number): string {
    const magnitude = Math.abs(x)
    if (magnitude < 1e3) return `${x} B`
    else if (magnitude < 1e6) return `${x/1e3} kB`
    else if (magnitude < 1e9) return `${x/1e6} MB`
    else if (magnitude < 1e12) return `${x/1e9} GB`
    else if (magnitude < 1e15) return `${x/1e12} TB`
    else if (magnitude < 1e18) return `${x/1e15} PB`
    else return `${x} B`
  }

Honestly. The code is fine.

This is for presenting stuff in a user interface. Who cares if you can find some weird edge case using MAX_LONG and MAX_DOUBLE that will never occur in practice?

  • Until it does.

      The original doesn't have types, but the modified version of humanReadableByteCount() uses a "long bytes" parameter and as such will fail if the file size is (Long.MAX_VALUE+1), because it cannot even accept the correct size as its argument in the first place. Handling these edge cases adds one more working case to the 2^63 already covered (2^64 if negative file sizes are valid), when the bigger problem is using the type "long" at all when files of that size are possible on the target system.

My top Stack Overflow answer of all time is a now rather dated two lines of JavaScript: how to tell if a variable is undefined. I posted this in 2010 and have been getting points for it steadily for a decade, now exceeding the amount I garnered from all other activities while actively using the site. I don’t do js anymore but from what I understand the answer hasn’t been accurate since 2016 or something.

The proposed solution is crazy. Logarithms are extremely expensive to compute. No way is this code more efficient than the loop it "replaced".

That there then are numeric stability issues and a pretty gross fudge factor is used to fix them worsens the situation. I would have a quiet word with any programmer I worked with that came up with this "solution".

Fun fact: the reason Safari/Chrome/Firefox had to freeze the macOS version in the User-Agent string is a code snippet from StackOverflow that assumed the version started with "10" and snuck its way all over the place, including the Unity web player and a major WordPress template.

It disturbs me that the author’s answer became the top answer. It didn’t have a loop, but it was less efficient than the already accepted answer and, worse, much more complicated for a human to read and understand. It seems we are always drawn to be clever, when perfection is found in simplicity.

Overkill. KISS. n counts moving the decimal left 3 places until there are <=3 digits left. Keep one place to the right.

"BKMGTPE". n==0 => 'B'; n==4 => 'T'.

810 has 3 digits. n==0, 810B.

999950 has 6 digits. n==1, you've got 999.9K

1100000 has 7 digits. n==2, 1.1M

1234567890 has 10 digits. n==3, 1.2G
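A sketch of this scheme for non-negative sizes (Python; the name is mine). Truncating with integer math rather than rounding a float is what keeps 999950 at 999.9K instead of jumping a unit:

```python
def kiss_format(size: int) -> str:
    suffixes = "BKMGTPE"
    n = (len(str(size)) - 1) // 3   # how many times the decimal moves left by 3
    if n == 0:
        return f"{size}B"
    tenths = size * 10 // 1000**n   # integer division truncates, keeping one place
    return f"{tenths / 10:.1f}{suffixes[n]}"
```

The examples above come out as stated: 810B, 999.9K, 1.1M, 1.2G.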

What sticks out for me is that people don’t even rename functions when copy-pasting from SO.

  • I know everyone is on their high horse about copying code, but if you are going to copy code, keeping it the same is super valuable, because inevitably when someone comes along later, googling it will take them to where it was copied from originally.

    • When I'm "inspired" by code from somewhere, I go ahead and put the URL in there in a comment

I must admit I smiled at seeing that I edited the question, back in the day. :) Can't say I remember the question, and didn't know it had the epic distinction of being the most-copied. Cool!

You have to be a terrible programmer to even consider using a base 10 logarithm for this.

Their proposed improvement is also terrible, since it divides multiple times unnecessarily, and checks for negativity multiple times unnecessarily.

The proper simple solution is of course a handwritten binary search with if-else blocks that starts with the most likely range, annotated with "likely" annotations, and a single division.

If this is the main task of the program for a while, and thus a large fraction of the cache can be dedicated to it, then solutions with large lookup tables are worth trying (obviously optimizing string formatting is also essential in this case).

This is why software is so often broken, there's a lot of incompetent people programming.

  • "You have to be a terrible programmer to even consider using a base 10 logarithm for this."

    That's not a fair statement.

    Good programmers are programmers who deliver value - who build robust, maintainable features in reasonable time that address user needs.

    Whether or not you would quickly find the correct approach to this specific problem is a minuscule, pedantic detail in a giant ocean of programming skills and experiences.

  • You may think that, but I’m partial to the ‘for-loop version’ that everybody and their dog can understand.

  • About the "terrible" aspect of it, to quote: "Granted it’s not very readable and log / pow probably makes it less efficient than other solutions. But there were no loops and almost no branching which I thought was pretty neat." No need to insult the author over it.

    About your "proper simple solution": I don't think that's a good idea either. Based on your next paragraph, the version you suggest with the handwritten binary search and "likely" annotations is for the case where the code isn't performance critical: for where the code is performance critical, you suggest a different solution. If the code isn't performance critical, please do not turn it into an unreadable mess over what would become a negligible overall performance gain. Write it in a simple, obviously correct way, keep it boring, and you'll keep it stable; you can use the time you save on fixing bugs in your super optimised version on improving more critical parts of your program.

  • Wait. I always use a logarithm for this kind of task; how exactly does that make me incompetent and terrible? This is literally what a logarithm represents?

  • You're right, however I think they were going for a "branchless" version just to see if they could.

This is why it's a good idea to have a real integer type.

  • Isn't it impossible? Integers go arbitrarily large but computers don't.

    • It is possible within the limits of available memory on the computer. Rather than the usual limit of a fixed number of bits.

      I've ironically found that big integer libraries sometimes optimize math routines more than string conversion. This was quite annoying for me when I optimized the factorial function in Ruby, and found that generating the string was my bottleneck. I then optimized that as well. :-)

    • They can go large enough for anything that matters.

      A quick Google says there's an estimate of 10^78 to 10^82 atoms in the universe. That number would be able to be stored in well under 300 bits.

      2 replies →

Is Math.log really faster than a simple loop with just a couple of iterations, after JIT compilation?

When the author started introducing all the calculations to figure out if the printed version would round up, my first thought was, why not just look at the printed version? Print it as %.1f, then chop off anything past the period and re-parse it as an integer. Now you can trivially tell if it rounded up past the threshold and you need to bump the suffix.
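That might look something like this (a Python sketch; the names are my own, non-negative sizes only):

```python
# Format first, then inspect the printed string: if rounding pushed the
# value up to 1000.0, bump the suffix and format again.
def format_size(size: float) -> str:
    suffixes = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
    i = 0
    while size >= 1000 and i < len(suffixes) - 1:
        size /= 1000
        i += 1
    text = f"{size:.1f}"
    if int(text.split(".")[0]) >= 1000 and i < len(suffixes) - 1:
        size /= 1000
        i += 1
        text = f"{size:.1f}"
    return f"{text} {suffixes[i]}"
```

Inspecting the integer part of the already-formatted string is exactly the threshold information the article computes arithmetically.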

is it reasonable to assume a strong correlation between "copied code" and number of upvotes? #trolling #kiddingnotkidding

So it's not flawed (it does compute the correct result).

The author just thinks a completely unreadable (but supposedly faster) variant using logarithms is "better" than the simple loop used in the original snippet?

Write your code for junior devs in their first week at your company, not for academic journals.

  • I think you might have misread the post. His logarithm code became the most used snippet and had the bug.

  • His code snippet had rounding errors on the boundaries towards the next unit.

    However he notes:

    > FWIW, all 22 answers posted, including the ones using Apache Commons and Android libraries, had this bug (or a variation of it) at the time of writing this article.

  • Sibling commenters have already pointed out that you seem to have misread the post, but tbh I found it quite confusing to follow myself, so here's a summary:

    - the first answer posted on SO was a simple loop

    - the author posted a 2nd (supposedly faster but less readable) answer. The author didn't think this answer was better than the loop, but it seems the community did and it became accepted (and extremely popular). THIS is the version that was buggy.

    The author later went back and fixed their own buggy version.

    So yes there's an argument to be made that the very first simple loop was better, but that's orthogonal to the point of the story.

  • Can you please read the article all the way before commenting next time?

    The log approach _is_ the most copied snippet.

  • > Write your code for junior devs in their first week at your company, not for academic journals.

    Hard and fast rules about coding style are silly. There's a time and place for clever code, and there's a time and place for verbose and straightforward code.

    I write performance-critical code. Juniors shouldn't be mucking about there, because it's performance critical. I also write non-performance-critical code with some effort. I write that stuff for the juniors.

    When writing for academic journals, it looks like the stuff I write for juniors. I'll drop a hint here or there so experts can reproduce less-obvious optimizations.

  • He ends the blog post with this: "Personally I would not copy this snippet into production code."

    He isn't trying to get people to use the log version.

  • You should almost _always_ focus on code readability and simplicity over inventiveness and cleverness.

    Very few people I have encountered have complained about code being 'too simple' or 'too readable', but the opposite happens on a near daily/weekly basis.

    Write comments, use a for loop, avoid global state, keep your nesting limited to 2-3 levels, be kind to your junior devs.

Floating point is really really hard to get right, especially if you want the numbers to be stable. Which begs the question: why the heck does JavaScript, the most used language in the world, not have an integer type? Sure, there's BigInt, but that's quite clunky to use. I know it's virtually impossible to add by now, but I'd love an integer type for all my bit twiddling, byte munching needs.

  • I just feel that if you have bit twiddling, byte munching needs, JavaScript shouldn't be the language of choice. Doing that is a rather rare edge case, and if you're doing it for performance reasons, working in JavaScript is the much bigger performance problem.

    • If you're already using JavaScript for some other reasons but occasionally have bit twiddling, byte munching needs, then trying to do that in JavaScript makes perfect sense. Is it the fastest option? No. But according to https://benchmarksgame-team.pages.debian.net/benchmarksgame/... it is generally within a factor of 4-5 of C++.

      For an application area where this applies, consider a web-based game. Using JavaScript keeps you from shipping another application. But occasionally you may have bit twiddling and/or byte munching needs. Which you need to do in JavaScript.

      1 reply →

    • That may be true but another case where floating point should be avoided is with money. Now think about all the times a web developer innocuously used a JS number to represent a price. I wouldn’t be surprised if floating point errors affected billions of dollars of transactions.

The author's lookup table is incorrect.

The question being answered clearly wanted base-2 engineering prefix units, rather than the standard base-10 engineering prefix units.

suffixes = [ "EB", "PB", "TB", "GB", "MB", "KB", "B" ]

magnitudes = [ 2^60, 2^50, 2^40, 2^30, 2^20, 2^10, 2^0 ] // Pseudocode; 64-bit integers are required (some compilers treat a plain int as only 32 bits)

  • That is not the author's code. That is pseudocode for one of the example answers that he is improving on.

    The author's code gives an option for the units:

    int unit = si ? 1000 : 1024;

  • If you do this, you should add an 'i' to the prefixes to denote that you mean binary notation, e.g. KiB, MiB, GiB, TiB, etc.