I find it interesting that all the answers using hardcoded values / if statements (or while) are all doing up to five comparisons.
It goes B, KiB, MiB, GiB, TiB, EiB and no more than that (in all the answers), so that can be solved with three if statements at most, not five.
I mean: if it's greater or equal to GiB, you know it won't be B, KiB or MiB. Dichotomy search for the win!
Not a single one of the hardcoded solutions does it that way.
Now let's go up to ZiB and YiB: still only three if statements at most, vs up to seven for the hardcoded solutions.
I mention it because I'd personally definitely not go for the whole log/pow/floating-points if I had to write a solution myself (because I precisely know all too well the SNAFU potential).
I'd hardcode if statements... But while doing a dichotomy search. I must be an oddball.
P.S: no horse in this race, no hill to die on, and all the usual disclaimers
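The dichotomy described above can be sketched like this (my own illustration, not code from any of the answers under discussion); with eight units, three nested comparisons always pick the right one:

```python
def human_size_dichotomy(n: int) -> str:
    """Pick the unit with at most three comparisons (binary search over 8 units)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB"]
    if n < 1 << 40:              # below TiB: the unit is in the lower half
        if n < 1 << 20:
            i = 0 if n < 1 << 10 else 1
        else:
            i = 2 if n < 1 << 30 else 3
    else:                        # TiB and up: the unit is in the upper half
        if n < 1 << 60:
            i = 4 if n < 1 << 50 else 5
        else:
            i = 6 if n < 1 << 70 else 7
    return f"{n / (1 << (10 * i)):.1f} {units[i]}"
```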
I would expect your binary search solution to possibly be slower than just doing 6 checks, because the latter is only going to take 1 branch. Branching is very slow. You want to keep code going in a straight line as much as possible.
Yup, know your hardware and know your problem. Dichotomic search is wonderful when your data can't fit in RAM and it starts being more efficient to cut down on the number of nodes traversed.
For a problem space limited by your input size (a signed 64-bit number) to a 6-entry dictionary? At best you may want to optimize some in-lining or compiler hints if your language supports it, or maybe set up some batching operations if this is called hundreds of times a frame so you're not creating/destroying the stack frame every time (even then, the compiler can probably optimize that).
But otherwise, just throw that few-dozen-byte lookup table into the registers and let the hardware chew through it. Big-O notation isn't needed for data at this scale.
Your comment and mine are basically the same. This is what I call terrible engineering judgement. A random co-worker could review the simple solution without much effort. They could also see the corner cases clearly and verify the tests cover them. With this code, not so much. It seems like a lot of work to write slower, more complex, harder to test and harder to review code.
I don't understand. There are 7 suffixes, can't you pick the right one with binary search? That would be 3 comparisons. Or just do it the dumb way and have 6 comparisons. How are two log() calls, one pow() call and ceil() better than just doing it the dumb way? The bug being described is a perfect example of trying to be too clever.
The author says at the beginning that it’s not actually better than the loop.
Also, 6 comparisons only happens if you have the max value, which seems unlikely in actual usage. Linear could be better if most of the time values are in the B or KB ranges.
Shameless plug: another option to format sizes in a human readable format quickly and correctly (other than copying from S/O), you can use one of our open source PrettySize libraries, available for rust [0] and .NET [1]. They also make performing type-safe logical operations on file sizes safe and easy!
The snippet from S/O may be four lines but these are much more extensive, come with tests, output formatting options, conversion between sizes, and more.
I understand where you're coming from here, but the whole point of this article is that the 4-line solution is wrong (and the author specifically mentioned that every other answer on the Stack Overflow post was wrong in the same way as well). "Seemingly-simple problem where every naïve solution contains a subtle bug" is exactly the right use case for a well-designed library method.
Yeah, copying an incorrect answer from SO thousands of times is much better!
(The subject at hand isn't whether libraries are good or not, it's whether copying something off the internet is. In the post, it turns out it isn't. If it was a library, the author could have fixed and updated the library, and the issue would be fixed for everyone that uses it. left-pad isn't an issue with libraries per se, it's an issue with library management)
Out of curiosity, is there a sizable number of developers that just copy and paste untrusted code from StackOverflow into their applications?
The conjecture that people just copy from StackOverflow is obviously popular but I always thought this was just conjecture and humor until I saw someone do it. Don't get me wrong, I use StackOverflow to give me a head start on solving a problem in an area I'm not as familiar with yet, but I've never just straight copied code from there. I don't do that because rarely does the snippet do exactly and only exactly what I need. It requires me to look at the APIs and form my own solution from the explained approach. StackOverflow has pointed me in the direction of some niche APIs that are useful to me, especially in Python.
I once worked with a developer who wouldn’t let anything come between him seeing an answer and copying it into his code. He wasn’t even reading the question to make sure it was the same problem he was having, let alone the answer. He would literally go Google => follow the first link to Stack Overflow he saw => copy and paste the first code block he saw. Sometimes it wasn’t even the right language. People had to physically take the input away from him if they were pairing with him because there was nothing anybody could say to stop him, and if you tried to tell him it wasn’t right then he’d just be pasting the second code snippet on the page before you could get another word out. He was freakishly quick at it.
Now he was an extreme case, but yes, there are a lot of developers out there with the mindset of “I need code; Stack Overflow has code; problem solved!” that don’t put any thought at all into whether it’s an appropriate solution.
In a hiring round nearly two decades ago, we realised something was off with the answers to the usual pre-phone-interview screening questions. They were simple, and we asked people to only spend like 20 minutes on them. We knew people would "cheat", but they were only there to lighten our load a little bit, so it was ok if they let through some bad candidates.
But for whatever reason, in one hiring round the vast majority had cut and pasted answers from search results verbatim (we dealt with a new recruiter, and I frankly suspected this new recruiter was telling them this was ok despite the instructions we'd given).
These were not subtle. But the very worst one was one who did like the developer you described: He'd found a forum post about a problem pretty close to the question, had cut and pasted the code from the first answer he found.
He'd not even bothered to read a few comments further down in the replies where the answer in question was totally savaged by other commenters explaining why it was entirely wrong.
This was someone who was employed as a senior developer somewhere else, and it was clear in retrospect looking at his CV that he probably kept "fleeing the scene of the crime" on a regular basis before it was discovered he was a total fraud. We regularly got those people, but none that delivered such obviously messed up answers.
For every developer like this, you're probably right there will be a lot more that are less extreme about it, and more able to make things work well enough that they're not discovered.
If you're paying a developer by the hour, and want your app released in the app store using as few hours as possible, then this approach can be the most cost efficient one.
Sure, it isn't good practice. Sure, it probably isn't what NASA should be doing. But if you're literally building yet another uber-like app, you probably shouldn't be spending too long thinking about details.
> People had to physically take the input away from him if they were pairing with him because there was nothing anybody could say to stop him, and if you tried to tell him it wasn’t right then he’d just be pasting the second code snippet on the page before you could get another word out. He was freakishly quick at it.
Yes, and it happens more for things that feel out of scope for the part of the program that I'm interested in. After all, we import library code from random strangers into our programs all the time for the parts we consider "plumbing" and beneath notice. If I wanted to dig in and understand something, I would be more likely to write my own. But if I want this part over here to "just work" so I can get on with the project, it's compiler-error-driven development.
Same, and even more so if it's something that feels like it should be in the library code in the first place.
My most copy-pasted code is projecting a point onto a line segment. I end up needing it all the time, it's never in whatever standard library for vector math I'm using, and it's faster to find on SO than to find and translate the code out of whatever my last project that needed it is. Way faster than re-deriving it.
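For reference, the projection in question is only a few lines of vector math; here's a hedged 2D sketch (my own version, not any particular SO answer), including the clamp that turns a line projection into a segment projection:

```python
def project_point_onto_segment(p, a, b):
    """Closest point to p on the segment a-b; all arguments are (x, y) tuples."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    ab_len_sq = abx * abx + aby * aby
    if ab_len_sq == 0.0:
        return a                      # degenerate segment: a == b
    # Parameter t of the projection onto the infinite line through a and b...
    t = ((p[0] - a[0]) * abx + (p[1] - a[1]) * aby) / ab_len_sq
    t = max(0.0, min(1.0, t))         # ...clamped back onto the segment
    return (a[0] + t * abx, a[1] + t * aby)
```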
Your vector math library is probably already code imported from random strangers, likely even imported by random strangers, so adding one more function from a random stranger feels entirely appropriate.
I hardly ever just copy and paste, for the exact reason the author talks about. Instead, I try to make sense of the solution, and if I have to, I'll hand-copy it down line by line to make sure I properly understand it, and refactor from there. I also rename variables, since oftentimes there are so many foos and bars and bazes that it's completely unreadable by a human.
Also if I come across the problem a second time, I'll have better luck remembering what I did (as opposed to blindly copying).
Yes, people do that. After looking at a huge number of incorrect TLS related code and configuration at SO, I’m now pretty sure that most systems run without validating certificates properly.
This was more true when libraries and tooling defaulted to not checking.
Somewhere in my history is a recent HN (or maybe Reddit) post where somebody insists curl has been 100% compatible from day one, and, like, no: originally curl ignored certificates; today you need to specify that explicitly if it's what you want.
I think (but don't take my word for it) that Requests (the Python library) was the same. Initially it didn't check, then years back the authors were told that if you don't check you get what you didn't pay for (ie nothing) and they changed the defaults.
Python itself is trickier because it was really hard to convince Python people that DNS names, the names we actually care about in certificates, aren't Unicode. I mean, they can be (IDNs), but not in a way that's useful to a machine. If your job is "Present this DNS name to a user" then sure, here's a bunch of tricky and maybe flawed code to best efforts turn the bytes into human Unicode text, but your cert checking code isn't a human, it wants bytes and we deliberately designed the DNS records and the certificate bytes to be identical, so you're just doing a byte-for-byte comparison.
The Python people really wanted to convert everything messily to Unicode, which is - at best if you do it perfectly - slower with the same results and at worst a security hole for no reason.
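A rough illustration of the byte-comparison point (my own sketch, not the Python standard library's actual verification code): DNS names are case-insensitive ASCII, so a certificate name match is a byte-level comparison, and IDNs arrive already encoded as ASCII "xn--" A-labels:

```python
def dns_names_equal(a: bytes, b: bytes) -> bool:
    """Byte-for-byte DNS name match, case-folded over ASCII only."""
    # bytes.lower() touches only ASCII A-Z, which is exactly the
    # case-insensitivity DNS defines; no Unicode processing happens.
    # An IDN shows up as an A-label already, e.g. b"xn--bcher-kva.example".
    return a.lower() == b.lower()
```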
OpenSSL is at least partly to blame for terrible TLS APIs. OpenSSL is what I call a "stamp collector" library. It wants to collect all the obscure corner cases, because some of its authors are interested. Did the Belgian government standardise a 54-bit cipher called "Bingle Bongle" in 1997? Cool, let's add that to our library. Does anybody use it? No. Should anybody use it? No. But it exists so we added it. A huge waste of everybody's time.
The other reason people don't validate is that it was easier to turn it off and get their work done, which is a big problem that should be addressed systemically rather than by individually telling people "No".
So I'd guess that today out of a thousand pieces of software that ought to do TLS, maybe 750 of them don't validate certificates correctly, and maybe 400 of those deliberately don't do it correctly because the author knew it would fail and had other priorities.
To be fair that might be partly the fault of TLS libraries. There should be a single sane function that does the least surprising thing and then lower level APIs for everything else. Currently you need a checklist of things that must be checked before trusting a connection.
Well. You (collective you) start by copying and pasting a code snippet first, and then modifying it as needed. Does that count? If no modifications are needed, then it stays.
Plenty of developers paste arbitrary bash commands posted on sites like GitHub without thinking because they look "legit", I suppose. I see it similarly to how you do: StackOverflow (and Copilot) can be helpful to start, but that's it.
Had an exchange like this some time ago:
Me: Hey, I'm reviewing your PR. Looks pretty fine to me. Except for this function which looks like it was copy-pasted from SO: I literally found the same function in an answer on SO (it was written in pure JS while we were using TS in our project).
Dev: Yes, everyone copies from SO.
Me: Well, in that case I hope you always copy the right thing. Because this code might run but it is not good enough (e.g. the variable names are inexpressive, it creates DOM elements without removing them after they are not needed anymore).
I wouldn't do it in most professional settings due to licensing...
But for personal projects where I just want to get something running, then yes, I would copy paste and barely even read the code.
I don't really care about bugs like this either - I'm happy to make something that works 99% of the time, and only fix that last 1% if it turns out to be an issue.
> I wouldn't do it in most professional settings due to licensing...
Underrated comment. I think most tech companies' General Counsel would have a heart attack if they were aware of StackOverflow copy-pasting by their developers. I highly doubt some rando engineer who pastes bubblesort code into their company's code base gave even a passing thought to what license the SO code was under, what license his own company's code was under, and whether they were compatible.
The big (FAANG) tech companies I've worked at all have written policies about copying and pasting code from external sources (TLDR: Don't), but I've seen even medium-sized (~1000+) companies with zero guidance for their developers.
In the server side JavaScript world absolutely, it seems like it's standard practice, people are injecting entire dependencies without even remotely looking at the code. Bringing in an entire library for a single function that could be accomplished in a couple lines and usually is posted below the fold.
Not long ago I worked on a team that actively chose libraries and frameworks based on the likelihood they felt their questions would be answered on StackOverflow.
This is why PHP got such a bad reputation. A lot of new developers were copy-pasting quick example code from Stack Overflow, or code from other new developers who only kind of knew what they were doing.
I don't understand why you'd use floating point logarithms if you want log 2?
Unless I'm missing something, this gives you an accurate value of floor(log2(value)) for anything positive less than 2^63 bytes, and it's much faster too:
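A sketch of the integer approach (assuming the elided snippet was something like Java's `63 - Long.numberOfLeadingZeros(value)`; in Python the equivalent is `int.bit_length()`):

```python
def ilog2(value: int) -> int:
    """Exact floor(log2(value)) for positive integers, no floating point.
    (In Java: 63 - Long.numberOfLeadingZeros(value).)"""
    return value.bit_length() - 1

def human_readable(n: int) -> str:
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    # Each unit covers 10 powers of two, so the unit index is floor(log2(n)) // 10.
    i = 0 if n < 1024 else min(ilog2(n) // 10, len(units) - 1)
    return f"{n / (1 << (10 * i)):.1f} {units[i]}"
```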
The original SO question did actually state they wanted powers of two (kilobyte as 1024 bytes). Although, they should have used KiB, GiB, instead to be pedantic.
I took one look at the snippet, saw a floating-point log operation and divisions applied to integers, and mentally discarded the entire snippet as too clever by half and inherently bug-prone.
Knowledge cascades all the way down; it goes to show how difficult it is to 'holster' even the smallest piece of knowledge once it's drawn.
I wonder with the rate Stack Exchange is losing active contributors, what it would take for 'fastest gun' answers to be corrected that are later found to be off mark, and what it would mean for our collective knowledge once these 'slightly off' answers are further cemented in our annals of search and increasingly, LLM history.
This reminds me of when I was in basic training. The drill sgts would give us new recruits a task that none of us knew how to do, purposefully without guidance, and then leave. One guy would try and start doing it, always the incorrect way, and everyone else would just copy that person.
I wonder if this is exacerbated by human tendencies to not want to look bad relative to others, even if it leads to silly outcomes like intelligent people following a bad or rushed idea.
Something similar happens in public economic forecasts because those who get it wrong when others get it right are treated much more harshly than those who get it wrong when others get it wrong too.
"Don't jump off a cliff just because everyone else is doing it" basically
I guess the next logical exercise would be asking them to do something with instructions that are complete, but incorrect or at least inefficient, to teach the lesson of questioning superior orders rather than just peers. Actually, I'm honestly not sure if that's desired in military discipline or not (no direct experience here).
In a way, I don't even consider floating point errors to be "flaws" with an algorithm like this. If the code defines a logical, mathematically correct solution, then it's "right". Solving floating point errors is a step above this, and only done in certain circumstances where it actually matters.
You can imagine some perfect future programming language where floating point errors don't exist and don't have to be accounted for. That's the language I'm targeting with 99% of my algorithms.
This reminds me of a weirdness with some sat navs: the distance to your exit/destination is displayed as: 12 ... 11 ... 10 ... 10.0 ... 9.9 ... 9.8 ... with the value 10.0 shown only while the distance is between 9.95 and 10. It's not really a bug but it's strange seeing the display update from 10 to 10.0 as you pass the imaginary ten-mile milestone so perhaps it's a distraction worth avoiding.
Almost every top stack overflow answer is wrong. The correct one is usually at rank 3. The system promotes answers which the public believes to be correct (easy to read, resembles material they are familiar with, follows fads, etc).
Pay attention to comments and compare a few answers.
Years ago I tried to answer a comment on StackOverflow, but I didn’t have enough points to comment. So I tried to answer some questions so that I could get enough points to comment. But when looking at the new questions, it seemed to be mostly a pile of “I have a bug in my code please fix it” type stuff. Relatively simple answers to “What is the stack and the heap?” had thousands of points, but also already had tons of answers (though I suppose one of the reason why people keep answering is to harvest points). I was able to answer a question on an obscure issue that no one had answered yet, but received no points.
Then I saw that you could get points for editing answers. OK, I thought, I can get some points by fixing some bugs. I found a highly upvoted post that had code that didn’t work, found that it was because one section had used the wrong variable, and tried to fix it. Well, the variable name was too short to meet the necessary 6 characters to edit the code (something like changing “foo” to “bar”).
I went to see what other people did in these situations, and they suggested just adding unnecessary edits in order to reach the character limit.
At that point, I just left the bug in, and gave up on trying to contribute to Stack Overflow.
I was active on the statistics Stack Exchange for a while in grad school. There were generally plenty of interesting questions to answer, but the obsession some people (the most active people, generally) had with the points system became really unpleasant after a while.
My breaking point was when I saw a question with an incorrect answer. I posted a correct answer, explained why the other answer was incorrect, and downvoted the incorrect answer. The author of the incorrect answer then posted a rant as a comment on my answer about how I shouldn't have downvoted their answer because they were going to fix it, and a couple other people chimed in agreeing that it was inconsiderate or inappropriate of me to have downvoted the other answer.
I decided Stack Exchange was dumb and stopped spending time there, which was probably good for my PhD progress.
> I suppose one of the reason why people keep answering is to harvest points
It's interesting to see some of the top (5- or 6-digit SO scores) people's activity charts.
They usually have a 3-5-digit answer history, and a 1-digit question history, with the digit frequently being "0."
In my case, I have asked almost twice as many questions, as I have given answers[0].
For a long time, I had a very low SO score (I've been on the platform for many years), but some years ago, they decided to award questions the same score as answers (which pissed a lot of people off), and my score suddenly jumped up. It's still not a top score, but it's a bit less shabby.
Over the years, I did learn to ask questions well (which means they get ignored, as opposed to insulted -an improvement), but these days, I don't bother going there, anymore.
If you get enough points on one of the more niche and less toxic StackExchange sites, it'll also let you comment, vote, etc. network-wide.
I had gotten most of my points by asking and answering things about Blender workflow/API/development specifics, so I got to skip some of the dumb gatekeeping on StackOverflow.
Worldbuilding's fun, too— Codegolf's not bad either, if you can come up with an interesting way to do it— Arqade looks good, and so does Cooking— Literature, English, Scifi, etc look interesting— If you program software, I suppose CodeReview might be a safe bet.
Yeah ... the extra-critical nature of SO is why their lunch is being eaten by LLMs. I once asked a buddy, who is now super duper senior at Amazon working on the main site, to post his question on SO, and he flat out said no, because he'd had hostile interactions before when asking questions. Right or wrong, the reputation they've developed has hurt them a ton.
You need to focus on niche tags to find worthwhile unanswered questions. Browsing the $foolang tag is just for the OCD FOMO types who spend their day farming rep.
What if that was the goal all along? Time traveling freedom fighters set up SO so that the well for AI would be poisoned, freeing us from our future overlords!
A couple months ago, someone commented that one of my answers was wrong. Well, sure, in the years since answering, things changed. It was correct when I wrote it. Otherwise it wouldn't have taken so long for someone to point out that it's wrong. The public may have believed it to be the correct answer because it was at that time.
Mmm... no? StackOverflow is powered by voting. Not all forums work like that (it was a questionable choice at the time StackOverflow started).
I've been a moderator on a couple of ForumBB kind of forums and the idea of karma points was often brought up in moderator meetings. Those with more experience in this field would usually try to dissuade the less experienced mods from implementing any karma system.
Moderators used to have ways of promoting specific posts. In the context of ForumBB you had a way to mark a thread as important or to make it sticky. Also, a post by a moderator would stand out (or could be made to stand out), so that other forum users would know if someone speaks from a position of experience / authority or is this yet to be determined.
Social media went increasingly in the direction of automating moderator's work by extracting that information from the users... but this is definitely not the only (and probably not the best) way of approaching this problem. Moderators are just harder to make and are more expensive to keep.
I hold little hope that LLMs will help us to reason through "correctness." If these AIs scour the troves of idiocy on the internet, believing what they will according to patterns rather than applying critical reasoning skills, they too will pick up the bandwagon's opinions and perpetuate them. Ad populum will continue to be a persistent fallacy if we humans don't learn appropriate reasoning skills.
People are writing entire programs with ChatGPT. These are the same people that previously would copy & paste multiple SO answers cobbled together. Now it's just copy & paste the entire script from a single response.
Sounds like you're counting that as a negative. Obviously it depends on the use case, but more often than not I'll lean towards the easier to read code than the most optimal one.
Long time ago, when ActionScript was a thing, there was this one snippet in ActionScript documentation that illustrated how to deal with events dispatching, handling etc. In order to illustrate the concept the official documentation provided a code snippet that created a dummy object, attached handlers to it, and in those handlers defined some way of processing... I think it was XML loading and parsing, well, something very common.
The example implied that this object would be an instance of a class interested in handling events, but didn't want to blow up the size of this example with not so relevant bits of code.
There was a time when I very actively participated in various forums related to ActionScript. And, as you can imagine, loading of XML was paramount to success in that field. Invariably, I'd encounter code that copied the documentation example and had this useless dummy object with handlers defined (and subsequently struggled to extract information thus loaded).
It was simply amazing how regardless of the overall skill of the programmer or the purpose of the applet, the same exact useless object would appear in the same situation -- be it XML socket or XML loaded via HTTP, submitted and parsed by user... it was always there.
----
Today, I often encounter code like this in unit tests in various languages. Often programmers will copy some boilerplate code from an example in the manual and will create hundreds or even thousands of unit tests, all with some unnecessary code duplication / unnecessary objects. Not sure why in this specific area, but it looks like programmers treat these kinds of tests both as some sort of magic and as unimportant, worthless code that doesn't need attention.
----
Finally, specifically on the subject of human-readable encoding of byte sizes. Do you guys like parted? Because it's so fun to work with it because of this very issue! You should try it, if you have some spare time and don't feel misanthropic enough for today.
There is still the chance that the person that created the 4 line dependency also just copy pasted it from the flawed StackOverflow answer. Or is the same person or is also just a random person creating the package like the random person that created the SO answer. I'm not sure why random_person1 should be more trustworthy to produce non flawed code than random_person2.
OTOH: it's at least easily upgradeable, so it has an advantage.
The most impressive suggestion Copilot has given me was a solution to this that used a loop to divide and index further into an array of units.
It never dawned on me to approach it that way, and I had never seen that solution (not that I ever looked). Not sure where it got that from, but it was pretty cool. And... yeah, it gets simple stuff wrong all the time, haha.
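For what it's worth, that loop-and-divide shape is the standard bug-free answer; a sketch of what such a suggestion might look like (my own reconstruction, not Copilot's actual output):

```python
def format_size(n: float) -> str:
    """Divide by 1024 until the value fits, stepping through the unit list."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n /= 1024
        i += 1
    return f"{n:.1f} {units[i]}"
```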
When StackOverflow was new, it was an incredible resource. Unfortunately, so much cruft has accumulated that it is now nearly useless. Even if an answer was once correct (and many are not), it is likely years out of date and no longer applicable.
While reading, I was thinking: why isn't Stack Overflow "mandating" that solutions have tests, so that this problem isn't left to everyone else? Ref. the comment at the end of the article:
Test all edge cases, especially for code copied from Stack Overflow.
How does the author determine this is the "most copied snippet" on SO? The Question/Answer has only been Viewed 351k times. There are posts with many millions of views e.g: https://stackoverflow.com/questions/927358/how-do-i-undo-the... which have definitely been copy-pasted more times. Yes, there may be many instances of this Java function on GitHub. But only because the people doing the copying are too lazy to think about how it works never mind alter the function name. If there's a bug, just update the SO answer and fix the problem. No need to write a lengthy self-promoting post about it.
> A PhD student by the name Sebastian Baltes publishes a paper in the journal of Empirical Software Engineering. The title is Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects [...] As part of their analysis they extracted code snippets from the Stack Overflow data dump and matched them against code from public GitHub repos.
Well - I suppose it makes sense. SO isn't built for correctness, it's built for upvotes that just depend on whether the people upvoting like the answer or not (regardless of correctness).
Processors are inherently awesome at branching, adding, shifting, etc. And shifting to get powers of 2 (i.e., KB vs. GB) is a superpower of its own. They're a little less awesome when it comes to math.pow(), math.log(), and math.log() / math.log().
That 300K+ people copied this in the first place shows some basic level of ignorance about what's happening under the hood.[1]
As someone who's been at this for decades now and knows my own failings better than ever, it also shows how developers can be too attracted by shiny things (ooh look, you can solve it with logs instead, how clever!) at the expense of readable, maintainable code.
[1] But hey, maybe that's why we were all on StackOverflow in the first place
> Processors are inherently awesome at branching, adding, shifting, etc. And shifting to get powers of 2 (i.e., KB vs. GB) is a superpower of its own. They're a little less awesome when it comes to math.pow(), math.log(), and math.log() / math.log().
And here's something to consider -- if you're converting a number to human-readable format, it's more likely than not you're about to do I/O with the resulting string, which is probably going to be an order of magnitude more expensive than the little function here.
Great point, I wish I'd mentioned it. The expense of the printf dwarfs the log / log (double divided by a double then cast to an int), which itself is greater than some repeated comparisons in a for loop.
It's key to be able to recognize this when thinking about performant code.
In other words, the entire exercise is silliness because the eventual printf is going to blow away any nanoseconds of savings by a smarter/shorter routine.
It's not that we think it's arcane or that we are in our own "bubbles of thought", it's that we aren't doing math. We're programming a computer. And a competent programmer would know, or at least suspect, that doing it with logarithms will be slower and more complicated for a computer. The author even points out that even he wouldn't use his solution.
> what are you people even programming that you need to know so absolutely little about how anything else in the entire world works
Feoren, your comment takes an incredibly superior attitude and accuses its reader, every reader, of being stupid.
When taking the log of a number, the value in general requires an infinite number of digits to represent. Computing log(100) / log(10) should return 2.0 exactly, but since log(100) returns a fixed number of digits and log(10) returns a fixed number of digits, are you 100% confident that the ratio will be exactly 2.0?
Maybe you test it and it does return exactly 2.0 (to the degree floating point can be exactly any value). Are you confident that such a calculation will also work for any power of 10? Maybe they all work on this intel machine -- does it work on every Arm CPU? Every RISCV CPU? Etc. I wouldn't be, but if I wrote dumb "for" loop I'd be far more confident that I'd get the right result in every case.
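The concern is easy to demonstrate (an illustration in Python; the exact digits can vary by platform and libm, which is precisely the point):

```python
import math

# Mathematically log(1000)/log(10) is exactly 3, but both logs are rounded
# to 53-bit doubles first, so the quotient can land just below 3.0.
ratio = math.log(1000) / math.log(10)
print(ratio)              # may print 2.9999999999999996 rather than 3.0
print(math.floor(ratio))  # in that case floor() gives 2: off by one unit
```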
> You're all literally writing CRUD React front-end javascript by copy-pasting "for" loops from StackOverflow?
To an approximation, yes.
The underlying calculations at my bank were probably written once in 1970 in COBOL and haven't changed meaningfully since. But the front-end UI to access it has gone from teletypes and punch cards to glass terminals to networked DOS to Win32 to ActiveX to Web 2.0 to React and mobile apps. Lots and lots of churn and work on the CRUD part, zero churn and work on the "need to remember logarithms" part.
AI? You have core teams building ChatGPT, Midjourney, etc. Then huge numbers of people accessing those via API, building CRUD sites to aggregate midjourney results and prompts, etc etc. Even Apple has made a drag-and-drop UI to train an AI object classifier, the ratio of people who had to know the math to make that vs the people using it is probably way above 1:100,000
Well, maybe not exactly unmaintainable but I think most of us have learned that floating point operations are not to be trusted, especially if it needs to run on different processors.
Furthermore, calling such math operations is overkill most of the time. I would definitely never consider it for such a simple operation. I actually agree with you that it might look cleaner and easier to understand, but in my mind it would be such heavyweight overkill that I would never use it.
I didn't downvote, but I would guess it's due to the general idea that if you just approve or disapprove of a post you should simply vote that way instead of expressing it in a comment. Personally, while I agree there's a logic to that, I find it a little cold for positive sentiments. I couldn't find it, but I think there's a pg or dang comment to the effect that "I like this" as a comment is explicitly not discouraged on HN, but obviously that doesn't mean everyone agrees.
I find it interesting that all the answers using hardcoded values / if statements (or while) are all doing up to five comparisons.
It goes B, KiB, MiB, GiB, TiB, PiB, EiB and no further (in all the answers), so it can be solved with three if statements at most, not five.
I mean: if it's greater or equal to GiB, you know it won't be B, KiB or MiB. Dichotomy search for the win!
Not a single of the hardcoded solutions do it that way.
Now let's go up to ZiB and YiB: still only four if statements at most, vs up to eight for the hardcoded solutions.
I mention it because I'd personally definitely not go for the whole log/pow/floating-points if I had to write a solution myself (because I precisely know all too well the SNAFU potential).
I'd hardcode if statements... But while doing a dichotomy search. I must be an oddball.
P.S: no horse in this race, no hill to die on, and all the usual disclaimers
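For what it's worth, here's roughly what that hardcoded dichotomy could look like in Python (my sketch, not from any of the answers): three comparisons pick one of seven binary-prefix buckets.

```python
def unit_index(n):
    # Buckets: 0=B, 1=KiB, 2=MiB, 3=GiB, 4=TiB, 5=PiB, 6=EiB.
    # Every path below takes at most three comparisons.
    if n >= 1 << 30:                 # GiB or above: rules out B/KiB/MiB
        if n >= 1 << 50:             # PiB or EiB
            return 6 if n >= 1 << 60 else 5
        return 4 if n >= 1 << 40 else 3
    if n >= 1 << 10:                 # KiB or MiB
        return 2 if n >= 1 << 20 else 1
    return 0
```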
I would expect your binary search solution is possibly slower than just doing 6 checks because the latter is only going to take 1 branch. Branching is very slow. You want to keep code going in a straight line as much as possible.
Yup, know your hardware and know your problem. Dichotomic search is wonderful when your data can't fit in RAM and it starts being more efficient to cut down on the number of nodes traversed.
For a problem space limited by your input size (a signed 64-bit number) to a 6-entry dictionary? At best you may want some inlining or compiler hints if your language supports them. Maybe set up some batching if this is called hundreds of times a frame so you're not creating/destroying a stack frame every time (and even then, the compiler can probably optimize that away).
But otherwise, just throw that few-dozen-byte lookup table into the registers and let the hardware chew through it. Big-O notation isn't needed for data at this scale.
It depends on the input distribution. If it's very common to have smaller values, then the linear search could be superior.
Your comment and mine are basically the same. This is what I call terrible engineering judgement. A random co-worker could review the simple solution without much effort. They could also see the corner cases clearly and verify the tests cover them. With this code, not so much. It seems like a lot of work to write slower, more complex, harder to test and harder to review code.
(2019)
Past discussions:
https://news.ycombinator.com/item?id=27533684
Thanks! Macroexpanded:
The most copied StackOverflow snippet of all time is flawed (2019) - https://news.ycombinator.com/item?id=21693431 - Dec 2019 (3 comments)
I don't understand. There are 7 suffixes, can't you pick the right one with binary search? That would be 3 comparisons. Or just do it the dumb way and have 6 comparisons. How are two log() calls, one pow() call and ceil() better than just doing it the dumb way? The bug being described is a perfect example of trying to be too clever.
The author apparently went back to using a loop after recognizing that it's not readable: https://programming.guide/java/formatting-byte-size-to-human...
Notably, it's still slightly better than the first code example in the original article, as it takes the rounding bug into account.
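For reference, the loop version the author landed on looks roughly like this in Python (a nonnegative-input SI sketch of the idea, not the author's exact Java):

```python
def human_readable_si(num_bytes: int) -> str:
    # Loop-based SI formatter in the spirit of the fixed version:
    # keep dividing while the value would still round up to the next
    # unit; 999_950 is the threshold the fix uses, so nothing ever
    # prints as "1000.0 kB".
    if num_bytes < 1000:
        return f"{num_bytes} B"
    units = "kMGTPE"
    i = 0
    while num_bytes >= 999_950:
        num_bytes //= 1000
        i += 1
    return f"{num_bytes / 1000.0:.1f} {units[i]}B"
```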
The author says at the beginning that it’s not actually better than the loop.
Also, 6 comparisons only happen if you hit the max value, which seems unlikely in actual usage. Linear could be better if values are most often in the B or KB ranges.
Shameless plug: for another option to format sizes in a human-readable format quickly and correctly (other than copying from S/O), you can use one of our open source PrettySize libraries, available for Rust [0] and .NET [1]. They also make performing type-safe logical operations on file sizes safe and easy!
The snippet from S/O may be four lines but these are much more extensive, come with tests, output formatting options, conversion between sizes, and more.
[0]: https://github.com/neosmart/prettysize-rs
[1]: https://github.com/neosmart/PrettySize.net
Replacing 4 line solutions with extensive libraries is what caused left-pad.
I understand where you're coming from here, but the whole point of this article is that the 4-line solution is wrong (and the author specifically mentioned that every other answer on the Stack Overflow post was wrong in the same way as well). "Seemingly-simple problem where every naïve solution contains a subtle bug" is exactly the right use case for a well-designed library method.
Yeah, copying an incorrect answer from SO thousands of times is much better!
(The subject at hand isn't whether libraries are good or not, it's whether copying something off the internet is. In the post, it turns out it isn't. If it was a library, the author could have fixed and updated the library, and the issue would be fixed for everyone that uses it. left-pad isn't an issue with libraries per se, it's an issue with library management)
No. left-pad was placing a 4-line solution in a library. prettysize is well deserving of library status.
What caused left-pad is the ability to delete published code.
You should see the implementation of `std::midpoint`[1].
Accounting for correctness even in edge-cases is what large libraries do better than throwaway bits of code.
[1]: https://github.com/microsoft/STL/blob/6735beb0c2260e325c3a4c...
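To make the edge case concrete: the naive (a + b) / 2 overflows in fixed-width arithmetic, which is exactly the sort of thing std::midpoint guards against. Python integers don't overflow, so this sketch simulates 32-bit two's-complement wraparound to show the classic bug:

```python
INT_MAX = 2**31 - 1

def wrap32(x: int) -> int:
    # Reduce x to a signed 32-bit value, as C int addition would.
    return (x + 2**31) % 2**32 - 2**31

a, b = INT_MAX - 1, INT_MAX
naive = wrap32(a + b) // 2   # the sum wraps negative: nonsense result
safe = a + (b - a) // 2      # the textbook overflow-free midpoint
```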
Out of curiosity, is there a sizable number of developers that just copy and paste untrusted code from StackOverflow into their applications?
The conjecture that people just copy from StackOverflow is obviously popular but I always thought this was just conjecture and humor until I saw someone do it. Don't get me wrong, I use StackOverflow to give me a head start on solving a problem in an area I'm not as familiar with yet, but I've never just straight copied code from there. I don't do that because rarely does the snippet do exactly and only exactly what I need. It requires me to look at the APIs and form my own solution from the explained approach. StackOverflow has pointed me in the direction of some niche APIs that are useful to me, especially in Python.
I once worked with a developer who wouldn’t let anything come between him seeing an answer and copying it into his code. He wasn’t even reading the question to make sure it was the same problem he was having, let alone the answer. He would literally go Google => follow the first link to Stack Overflow he saw => copy and paste the first code block he saw. Sometimes it wasn’t even the right language. People had to physically take the input away from him if they were pairing with him because there was nothing anybody could say to stop him, and if you tried to tell him it wasn’t right then he’d just be pasting the second code snippet on the page before you could get another word out. He was freakishly quick at it.
Now he was an extreme case, but yes, there are a lot of developers out there with the mindset of “I need code; Stack Overflow has code; problem solved!” that don’t put any thought at all into whether it’s an appropriate solution.
During a hiring round nearly two decades ago, we realised something was off with the answers to the usual pre-phone-interview screening questions. They were simple, and we asked people to only spend about 20 minutes on them. We knew people would "cheat", but they were only there to lighten our load a little bit, so it was OK if they let through some bad candidates.
But for whatever reason, in one hiring round the vast majority had cut and pasted answers from search results verbatim (we dealt with a new recruiter, and I frankly suspected this new recruiter was telling them this was ok despite the instructions we'd given).
These were not subtle. But the very worst one was one who did like the developer you described: He'd found a forum post about a problem pretty close to the question, had cut and pasted the code from the first answer he found.
He'd not even bothered to read a few comments further down in the replies where the answer in question was totally savaged by other commenters explaining why it was entirely wrong.
This was someone who was employed as a senior developer somewhere else, and it was clear in retrospect looking at his CV that he probably kept "fleeing the scene of the crime" on a regular basis before it was discovered he was a total fraud. We regularly got those people, but none that delivered such obviously messed up answers.
For every developer like this, you're probably right that there will be a lot more who are less extreme about it, and more able to make things work well enough that they're not discovered.
If you're paying a developer by the hour, and want your app released in the app store using as few hours as possible, then this approach can be the most cost efficient one.
Sure, it isn't good practice. Sure, it probably isn't what NASA should be doing. But if you're literally building yet another uber-like app, you probably shouldn't be spending too long thinking about details.
That's not software development. That's wild guessing.
Just out of curiosity… what was his salary and how long did it take to fire him? Did they fire the HR manager as well?
this is basically how GitHub copilot works
> People had to physically take the input away from him if they were pairing with him because there was nothing anybody could say to stop him, and if you tried to tell him it wasn’t right then he’d just be pasting the second code snippet on the page before you could get another word out. He was freakishly quick at it.
Sounds like this guy understands concurrency. :)
Just wait til that guy discovers ChatGPT.
Yes, and it happens more for things that feel out of scope for the part of the program that I'm interested in. After all, we import library code from random strangers into our programs all the time for the parts we consider "plumbing" and beneath notice. If I wanted to dig in and understand something, I would be more likely to write my own. But if I want this part over here to "just work" so I can get on with the project, it's compiler-error-driven development.
Same, and even more so if it's something that feels like it should be in the library code in the first place.
My most copy-pasted code is projecting a point onto a line segment. I end up needing it all the time, it's never in whatever standard library for vector math I'm using, and it's faster to find on SO than to find and translate the code out of whatever my last project that needed it is. Way faster than re-deriving it.
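A typical version of the projection that comment keeps copy-pasting (a 2D sketch with names of my own choosing, clamped to the segment):

```python
def project_point_onto_segment(p, a, b):
    # Project point p onto the segment from a to b, returning the
    # closest point on the segment (not the infinite line).
    ax, ay = a
    bx, by = b
    px, py = p
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    if denom == 0.0:
        return a                      # degenerate segment: a == b
    # Parameter of the projection onto the line, clamped to [0, 1].
    t = ((px - ax) * abx + (py - ay) * aby) / denom
    t = max(0.0, min(1.0, t))
    return (ax + t * abx, ay + t * aby)
```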
Your vector math library is probably already code imported from random strangers, likely even imported by random strangers, so adding one more function from a random stranger feels entirely appropriate.
I hardly ever just copy and paste for the exact reason the author talks about. Instead, I try to make sense of the solution, and if I have to, I'll hand-copy it down line-by-line to make sure I properly understand and refactor from there. I also rename variables, since often times there are so many foos and bars and bazes that it's completely unreadable by a human.
Also if I come across the problem a second time, I'll have better luck remembering what I did (as opposed to blindly copying).
Yes, people do that. After looking at a huge number of incorrect TLS related code and configuration at SO, I’m now pretty sure that most systems run without validating certificates properly.
This was more true when libraries and tooling defaulted to not checking.
Somewhere in my history is a recent HN (or maybe Reddit) post where somebody insists curl has been 100% compatible from day one, and, well, no: originally curl ignored certificates; today you have to ask for that explicitly if it's what you want.
I think (but don't take my word for it) that Requests (the Python library) was the same. Initially it didn't check, then years back the authors were told that if you don't check you get what you didn't pay for (ie nothing) and they changed the defaults.
Python itself is trickier because it was really hard to convince Python people that DNS names, the names we actually care about in certificates, aren't Unicode. I mean, they can be (IDNs), but not in a way that's useful to a machine. If your job is "Present this DNS name to a user" then sure, here's a bunch of tricky and maybe flawed code to best efforts turn the bytes into human Unicode text, but your cert checking code isn't a human, it wants bytes and we deliberately designed the DNS records and the certificate bytes to be identical, so you're just doing a byte-for-byte comparison.
The Python people really wanted to convert everything messily to Unicode, which is - at best if you do it perfectly - slower with the same results and at worst a security hole for no reason.
OpenSSL is at least partly to blame for terrible TLS APIs. OpenSSL is what I call a "stamp collector" library. It wants to collect all the obscure corner cases, because some of its authors are interested. Did the Belgian government standardise a 54-bit cipher called "Bingle Bongle" in 1997? Cool, let's add that to our library. Does anybody use it? No. Should anybody use it? No. But it exists so we added it. A huge waste of everybody's time.
The other reason people don't validate is that it was easier to turn it off and get their work done, which is a big problem that should be addressed systemically rather than by individually telling people "No".
So I'd guess that today out of a thousand pieces of software that ought to do TLS, maybe 750 of them don't validate certificates correctly, and maybe 400 of those deliberately don't do it correctly because the author knew it would fail and had other priorities.
To be fair that might be partly the fault of TLS libraries. There should be a single sane function that does the least surprising thing and then lower level APIs for everything else. Currently you need a checklist of things that must be checked before trusting a connection.
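Python's ssl module is one data point for this design: ssl.create_default_context() is the single sane entry point, verifying both the certificate chain and the hostname by default, with the lower-level knobs still available underneath.

```python
import ssl

# The "least surprising" default: certificate and hostname checking
# are on; you have to opt out explicitly to get anything weaker.
ctx = ssl.create_default_context()
print(ctx.check_hostname)                    # True
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
```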
Oh boy, where to begin. You obviously haven't had the pleasure of working in a codebase written by Adderall-fueled 23-year-olds.
What about Adderall-fueled 35 year olds?
I think the section "A Study on Attribution" and the associated paper might be as good of an answer as you'll get to that.
Well. You (collective you) start by copying and pasting a code snippet first, and then modifying it as needed. Does that count? If no modifications are needed, then it stays.
That's what I do. I almost always rename things to match the coding style of the codebase I'm working on, though.
Plenty of developers paste arbitrary bash commands posted on sites like GitHub without thinking because they look "legit", I suppose. I see it similarly to you: StackOverflow (and Copilot) can be a helpful starting point, but that's about it.
Had an exchange like this some time ago:
Me: Hey, I'm reviewing your PR. Looks pretty fine to me. Except for this function which looks like it was copy-pasted from SO: I literally found the same function in an answer on SO (it was written in pure JS while we were using TS in our project).
Dev: Yes, everyone copies from SO.
Me: Well, in that case I hope you always copy the right thing. Because this code might run but it is not good enough (e.g. the variable names are inexpressive, it creates DOM elements without removing them after they are not needed anymore).
There really is, but people do give it a cursory read. See also: https://en.wikipedia.org/wiki/Underhanded_C_Contest
Yes. I was told from a reliable source that at one point they tried to log all the copy and paste events and it brought their systems to their knees.
I wouldn't do it in most professional settings due to licensing...
But for personal projects where I just want to get something running, then yes, I would copy paste and barely even read the code.
I don't really care about bugs like this either - I'm happy to make something that works 99% of the time, and only fix that last 1% if it turns out to be an issue.
> I wouldn't do it in most professional settings due to licensing...
Underrated comment. I think most tech companies' General Counsel would have a heart attack if they were aware of StackOverflow copy-pasting by their developers. I highly doubt some rando engineer who pastes bubblesort code into their company's code base gave even a passing thought to what license the SO code was under, what license his own company's code was under, and whether they were compatible.
The big (FAANG) tech companies I've worked at all have written policies about copying and pasting code from external sources (TLDR: Don't), but I've seen even medium-sized (~1000+) companies with zero guidance for their developers.
In the server side JavaScript world absolutely, it seems like it's standard practice, people are injecting entire dependencies without even remotely looking at the code. Bringing in an entire library for a single function that could be accomplished in a couple lines and usually is posted below the fold.
...you would not believe...
not long ago I worked on a team who actively chose libraries and frameworks based on the likelihood they felt their questions would be answered on StackOverflow.
Yes.
This is why PHP got such a bad reputation. A lot of new developers were copy-pasting quick example code from Stack Overflow, or code from other new developers who only sort of knew what they were doing.
> This is why PHP got such a bad reputation.
I don't think that's the only reason, lol.
What? SO launched in 2008 and PHP had a bad reputation prior to that.
Less and less every day. Now they are using ChatGPT.
When I had to use Python, I felt like copy-pasting anything was out of scope due to indentation errors.
Millions.
Wait til you find out about chatGPT
I don't understand why you'd use floating-point logarithms if you want log base 2.
Unless I'm missing something, this gives you an accurate value of floor(log2(value)) for anything positive less than 2^63 bytes, and it's much faster too:
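The code being referred to isn't shown above, but an integer-only floor(log2) needs no floating point at all; in Python it can be as small as this (my sketch):

```python
def ilog2(value: int) -> int:
    # floor(log2(value)) for value > 0, exact over the whole 63-bit
    # range: bit_length() is 1 + floor(log2(value)) for positive ints.
    if value <= 0:
        raise ValueError("value must be positive")
    return value.bit_length() - 1
```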
The “common” units are powers of 10 so this doesn’t work
The original SO question did actually state they wanted powers of two (kilobyte as 1024 bytes). Although, they should have used KiB, GiB, instead to be pedantic.
But you can avoid binary search because there is at most one power of ten between 2^k and 2^(k+1). So you can turn it into a lookup-table problem.
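Sketched in Python (the constants are mine): guess the decimal exponent from the bit length, then fix it up with a single table comparison, which works precisely because each interval [2^k, 2^(k+1)) contains at most one power of ten.

```python
POW10 = [10**i for i in range(20)]   # enough entries for 64-bit inputs

def ilog10(n: int) -> int:
    # floor(log10(n)) for n > 0 using only integer arithmetic.
    # 30103/100000 approximates log10(2); the guess overshoots by at
    # most one, so a single comparison against the table corrects it.
    if n <= 0:
        raise ValueError("n must be positive")
    guess = (n.bit_length() * 30103) // 100000
    return guess if n >= POW10[guess] else guess - 1
```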
I took one look at the snippet, saw a floating-point log operation and divisions applied to integers, and mentally discarded the entire snippet as too clever by half and inherently bug-prone.
That’s basically the point of the article
Knowledge cascades all the way down; it goes to show how difficult it is to 'holster' even the smallest piece of knowledge once it's drawn.
I wonder with the rate Stack Exchange is losing active contributors, what it would take for 'fastest gun' answers to be corrected that are later found to be off mark, and what it would mean for our collective knowledge once these 'slightly off' answers are further cemented in our annals of search and increasingly, LLM history.
This reminds me of when I was in basic training. The drill sgts would give us new recruits a task that none of us knew how to do, purposefully without guidance, and then leave. One guy would try and start doing it, always the incorrect way, and everyone else would just copy that person.
I wonder if this is exacerbated by human tendencies to not want to look bad relative to others, even if it leads to silly outcomes like intelligent people following a bad or rushed idea.
Something similar happens in public economic forecasts because those who get it wrong when others get it right are treated much more harshly than those who get it wrong when others get it wrong too.
What was the goal of this?
"Don't jump off a cliff just because everyone else is doing it" basically
I guess the next logical exercise would be asking them to do something with instructions that are complete, but incorrect or at least inefficient, to teach the lesson of questioning superior orders rather than just peers. Actually, I'm honestly not sure if that's desired in military discipline or not (no direct experience here).
The usual goal of anything in military training, being cruel to new recruits?
In a way, I don't even consider floating-point errors to be "flaws" with an algorithm like this. If the code defines a logical, mathematically correct solution, then it's "right". Solving floating-point errors is a step above this, and only done in certain circumstances where it actually matters.
You can imagine some perfect future programming language where floating-point errors don't exist and don't have to be accounted for. That's the language I'm targeting with 99% of my algorithms.
This reminds me of a weirdness with some sat navs: the distance to your exit/destination is displayed as: 12 ... 11 ... 10 ... 10.0 ... 9.9 ... 9.8 ... with the value 10.0 shown only while the distance is between 9.95 and 10. It's not really a bug but it's strange seeing the display update from 10 to 10.0 as you pass the imaginary ten-mile milestone so perhaps it's a distraction worth avoiding.
Mercedes for a while had a fuel gauge that showed 1/4, 1/2, 3/4, 1/1.
They had another one that went R 2/4 4/4
I'm still undecided which was more weird. You can see them both on eBay.
There's nothing weird here. Those are very common fractions used across several domains, including cooking.
But one thing that I would really love to see is actual liters or gallons (depending on the country where I am at the moment).
Almost every top stack overflow answer is wrong. The correct one is usually at rank 3. The system promotes answers which the public believes to be correct (easy to read, resembles material they are familiar with, follows fads, etc).
Pay attention to comments and compare a few answers.
Years ago I tried to answer a comment on StackOverflow, but I didn’t have enough points to comment. So I tried to answer some questions so that I could get enough points to comment. But when looking at the new questions, it seemed to be mostly a pile of “I have a bug in my code please fix it” type stuff. Relatively simple answers to “What is the stack and the heap?” had thousands of points, but also already had tons of answers (though I suppose one of the reasons people keep answering is to harvest points). I was able to answer a question on an obscure issue that no one had answered yet, but received no points.
Then I saw that you could get points for editing answers. OK, I thought, I can get some points by fixing some bugs. I found a highly upvoted post that had code that didn’t work, found that it was because one section had used the wrong variable, and tried to fix it. Well, the edit was too small to meet the required six-character minimum (something like changing “foo” to “bar”).
I went to see what other people did in these situations, and they suggested just adding unnecessary edits in order to reach the character limit.
At that point, I just left the bug in, and gave up on trying to contribute to Stack Overflow.
I was active on the statistics Stack Exchange for a while in grad school. There were generally plenty of interesting questions to answer, but the obsession some people (the most active people, generally) had with the points system became really unpleasant after a while.
My breaking point was when I saw a question with an incorrect answer. I posted a correct answer, explained why the other answer was incorrect, and downvoted the incorrect answer. The author of the incorrect answer then posted a rant as a comment on my answer about how I shouldn't have downvoted their answer because they were going to fix it, and a couple other people chimed in agreeing that it was inconsiderate or inappropriate of me to have downvoted the other answer.
I decided Stack Exchange was dumb and stopped spending time there, which was probably good for my PhD progress.
> I suppose one of the reason why people keep answering is to harvest points
It's interesting to see some of the top (5- or 6-digit SO scores) people's activity charts.
They usually have a 3-5-digit answer history, and a 1-digit question history, with the digit frequently being "0."
In my case, I have asked almost twice as many questions, as I have given answers[0].
For a long time, I had a very low SO score (I've been on the platform for many years), but some years ago, they decided to award questions the same score as answers (which pissed a lot of people off), and my score suddenly jumped up. It's still not a top score, but it's a bit less shabby.
Over the years, I did learn to ask questions well (which means they get ignored, as opposed to insulted -an improvement), but these days, I don't bother going there, anymore.
[0] https://stackoverflow.com/users/879365/chris-marshall
If you get enough points on one of the more niche and less toxic StackExchange sites, it'll also let you comment, vote, etc. network-wide.
I had gotten most of my points by asking and answering things about Blender workflow/API/development specifics, so I got to skip some of the dumb gatekeeping on StackOverflow.
Worldbuilding's fun, too— Codegolf's not bad either, if you can come up with an interesting way to do it— Arqade looks good, and so does Cooking— Literature, English, Scifi, etc look interesting— If you program software, I suppose CodeReview might be a safe bet.
Yeah ... the extra-critical nature of SO is why their lunch is being eaten by LLMs. I once asked a buddy, who is now super duper senior at Amazon working on the main site, to post his question on SO, and he flat out said no because he'd had hostile interactions there before when asking questions. Right or wrong, the reputation they've developed has hurt them a ton.
>it seemed to be mostly a pile of “I have a bug in my code please fix it” type stuff.
it's mostly people asking you to do their comp sci homework.
The edit queue was sitting at over 40k at one point.
Unfortunately people trying to game the system creates enormous work for those who can review.
(Not saying you were doing anything wrong just pointing out why there are automated guards)
You need to focus on niche tags to find worthwhile unanswered questions. Browsing the $foolang tag is just for the OCD FOMO types who spend their day farming rep.
Back in ye olden days, almost every answer involving a database contained a SQL injection vulnerability.
To their credit, a lot of people went back a decade later and fixed those. Although it doesn't stop people from repeating the mistakes.
I just got beaten up on HN for asking how the hell SQL injection is still a problem. People get defensive, apparently.
If you ever have an issue with the Requests library in Python, just try again with verify=false.
Good thing we trained all those AIs with these answers.
What if that was the goal all along? Time traveling freedom fighters set up SO so that the well for AI would be poisoned, freeing us from our future overlords!
StackOverflow and those AIs optimise for the same thing - something that looks correct regardless of how actually correct it is.
A couple months ago, someone commented that one of my answers was wrong. Well, sure, in the years since answering, things changed. It was correct when I wrote it. Otherwise it wouldn't have taken so long for someone to point out that it's wrong. The public may have believed it to be the correct answer because it was at that time.
> The system promotes answers which the public believes to be correct
Well.. duh?
Until AI takes over the world, this will be correct for everything. News, comments, everything.
Mmm... no? StackOverflow is powered by voting. Not all forums work like that (it was a questionable choice at the time StackOverflow started).
I've been a moderator on a couple of ForumBB kind of forums and the idea of karma points was often brought up in moderator meetings. Those with more experience in this field would usually try to dissuade the less experienced mods from implementing any karma system.
Moderators used to have ways of promoting specific posts. In the context of ForumBB you had a way to mark a thread as important or to make it sticky. Also, a post by a moderator would stand out (or could be made to stand out), so that other forum users would know if someone speaks from a position of experience / authority or is this yet to be determined.
Social media went increasingly in the direction of automating moderator's work by extracting that information from the users... but this is definitely not the only (and probably not the best) way of approaching this problem. Moderators are just harder to make and are more expensive to keep.
I hold little hope that LLMs will help us reason through "correctness." If these AIs scour the troves of idiocy on the internet, believing what they will according to patterns rather than applying critical reasoning skills, they too will pick up the bandwagon's opinions and perpetuate them. Ad populum will remain a persistent fallacy if we humans don't learn appropriate reasoning skills.
AI isn’t going to do better in current paradigms, it has exactly the same flaw.
Of course, consensus is a difficult philosophical topic. But not every system is based on public voting.
I sure hope people don’t copy stuff from SO before they understand what the code does.
people are writing entire programs with ChatGPT. these are the same people that previously would copy&paste multiple SO answers cobbled together. now, it's just a copy&paste the entire script from a single response.
ROFLMAO!
Please, tell me that was sarcastic.
Yeah, I never look at just the top comment. If it isn’t wrong, it’s suboptimal.
> easy to read
Sounds like you're counting that as a negative. Obviously it depends on the use case, but more often than not I'll lean towards the easier to read code than the most optimal one.
Easy to read is good, but it doesn’t trump correct.
> The correct one is usually at rank 3
This has generally been my experience.
[dead]
Long time ago, when ActionScript was a thing, there was this one snippet in ActionScript documentation that illustrated how to deal with events dispatching, handling etc. In order to illustrate the concept the official documentation provided a code snippet that created a dummy object, attached handlers to it, and in those handlers defined some way of processing... I think it was XML loading and parsing, well, something very common.
The example implied that this object would be an instance of a class interested in handling events, but didn't want to blow up the size of this example with not so relevant bits of code.
There was a time when I very actively participated in various forums related to ActionScript. And, as you can imagine, loading of XML was paramount to success in that field. Invariably, I'd encounter code that copied the documentation example and had this useless dummy object with handlers defined (and subsequently struggled to extract information thus loaded).
It was simply amazing how regardless of the overall skill of the programmer or the purpose of the applet, the same exact useless object would appear in the same situation -- be it XML socket or XML loaded via HTTP, submitted and parsed by user... it was always there.
----
Today, I often encounter code like this in unit tests in various languages. Programmers will copy some boilerplate from an example in the manual and create hundreds or even thousands of unit tests, all with unnecessary code duplication / unnecessary objects. I'm not sure why it happens in this specific area, but it looks like programmers treat these kinds of tests both as some sort of magic and as unimportant, worthless code that doesn't need attention.
----
Finally, specifically on the subject of human-readable encoding of byte sizes: do you guys like parted? It's so much fun to work with, thanks to this very issue! You should try it, if you have some spare time and don't feel misanthropic enough for today.
I feel like there ought to be a software analogue to that aphorism about models (if it doesn’t exist already) — maybe something like:
All code is wrong, but some is useful.
Agreed, but is code not a model?
Why do you need a 4-line dependency?
This is the reason.
There is still the chance that the person who created the 4-line dependency also just copy-pasted it from the flawed StackOverflow answer. They might even be the same person, or just as random a person creating the package as the random person who created the SO answer. I'm not sure why random_person1 should be more trustworthy to produce non-flawed code than random_person2.
OTOH: at least it's easily upgradeable, so it has that advantage.
> There is still the chance
There's no chance if you avoid random_person1 and use known_oss_provider’s package instead. At the very least, look at the tests.
Any package with tests is guaranteed to be more correct than a never-before-run SO answer.
3 replies →
The most impressive suggestion Copilot has given me was a solution to this that used a loop to divide and index further into an array of units.
It never dawned on me to approach it that way, and I had never seen that solution (not that I ever looked). Not sure where it got that from, but it was pretty cool and... yeah, it gets simple stuff wrong all the time haha.
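For what it's worth, the divide-and-index approach described above can be sketched roughly like this (a hypothetical reconstruction, not Copilot's actual output):

```python
def human_readable(num_bytes: int) -> str:
    """Divide by 1024 repeatedly; each division steps one slot
    further into the array of unit suffixes."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    value = float(num_bytes)
    i = 0
    while value >= 1024 and i < len(units) - 1:
        value /= 1024
        i += 1
    return f"{value:.1f} {units[i]}"
```

No logs, no pow, and the loop body runs at most six times for a 64-bit input.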
I was surprised to find log implementations are loopless. Cool.
https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...
It basically has the loop unrolled. But it looks like it’s evaluating a polynomial approximation, so I suppose it makes sense.
When StackOverflow was new, it was an incredible resource. Unfortunately, so much cruft has accumulated that it is now nearly useless. Even if an answer was once correct (and many are not), it is likely years out of date and no longer applicable.
While reading, I was wondering why Stack Overflow doesn’t “mandate” that solutions have tests, so that this problem isn’t left to everyone else; see the comment at the end of the article:
Test all edge cases, especially for code copied from Stack Overflow.
How does the author determine this is the "most copied snippet" on SO? The Question/Answer has only been Viewed 351k times. There are posts with many millions of views e.g: https://stackoverflow.com/questions/927358/how-do-i-undo-the... which have definitely been copy-pasted more times. Yes, there may be many instances of this Java function on GitHub. But only because the people doing the copying are too lazy to think about how it works never mind alter the function name. If there's a bug, just update the SO answer and fix the problem. No need to write a lengthy self-promoting post about it.
Third paragraph of the post:
It's according to this paper: https://link.springer.com/article/10.1007/s10664-018-9650-5
> How does the author determine this is the "most copied snippet" on SO?
According to [this paper](https://link.springer.com/article/10.1007/s10664-018-9650-5) it's the most copied of SO's *Java* answers.
It's mentioned in the article
> A PhD student by the name Sebastian Baltes publishes a paper in the journal of Empirical Software Engineering. The title is Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects [...] As part of their analysis they extracted code snippets from the Stack Overflow data dump and matched them against code from public GitHub repos.
It's described in the article...
Read the article. The methodology is flawed. It should say most copy-pasted Java function on GitHub.
3 replies →
Well - I suppose it makes sense. SO isn't built for correctness; it's built for upvotes, which just depend on whether the people upvoting like the answer or not, regardless of correctness.
Read: The most common answer to that question from LLMs is flawed.
Sounds like someone bumped into Zeno's paradox...
https://www.youtube.com/watch?v=VI6UdOUg0kg
Should have just stuck with the loop. You could change the thresholds to 95% of 10^whatever to accommodate the desired output rounding.
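Taking that idea literally (the exact cutoff depends on the output precision: with one decimal place, 999.95 is the first value that prints as 1000.0, so that's the threshold used in this sketch):

```python
def human_readable_si(num_bytes: int) -> str:
    """Promote to the next unit just before the formatted value
    would print as 1000.0, so output like '1000.0 MB' never appears."""
    units = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
    value = float(num_bytes)
    i = 0
    # 999.95 is the smallest value that rounds to "1000.0" at one decimal.
    while value >= 999.95 and i < len(units) - 1:
        value /= 1000
        i += 1
    return f"{value:.1f} {units[i]}"
```

So 999,950,000 bytes becomes "1.0 GB" instead of the "1000.0 MB" the naive threshold produces.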
Plot twist: they were hired by Oracle since they were the author of the most copied StackOverflow snippet (!)
Just divide by 1000 until x < 1000, then return int(x) plus a map from the number of divisions by 1,000 to the "MB", "GB", ... string.
It's an O(1) operation because of the limited size allowed for numeric types.
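Taken literally, that's something like this (note that returning int(x) truncates the fractional part, so "1.5 kB" comes out as "1 kB"):

```python
def human_readable_int(num_bytes: int) -> str:
    """Divide by 1000 until the value drops below 1000, then pair the
    integer part with a suffix chosen by the number of divisions."""
    suffixes = {0: "B", 1: "kB", 2: "MB", 3: "GB", 4: "TB", 5: "PB", 6: "EB"}
    x = num_bytes
    divisions = 0
    while x >= 1000 and divisions < max(suffixes):
        x //= 1000
        divisions += 1
    return f"{x} {suffixes[divisions]}"
```

And indeed it's O(1): a 64-bit input bounds the loop at six iterations.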
I'm curious what answer GPT will return.
Probably this one, as it's the most common in the corpus used to train it.
GPT-3.5 returns:
So the code from the dude in the blog post here
1 reply →
Given how unreliable it is, probably 418 - I'm a teapot.
Classic off by 1 :)
tl;dr When in the 999+ petabyte range, it gives inappropriately rounded results.
And the key takeaway is "Stack Overflow snippets can be buggy, even if they have thousands of upvotes."
I don't disagree, but is this really the example to prove it.....
Processors are inherently awesome at branching, adding, shifting, etc. And shifting to get powers of 2 (i.e., KB vs. GB) is a superpower of its own. They're a little less awesome when it comes to math.pow(), math.log(), and math.log() / math.log().
That 300K+ people copied this in the first place shows some basic level of ignorance about what's happening under the hood.[1]
As someone who's been at this for decades now and knows my own failings better than ever, it also shows how developers can be too attracted by shiny things (ooh look, you can solve it with logs instead, how clever!) at the expense of readable, maintainable code.
[1] But hey, maybe that's why we were all on StackOverflow in the first place
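The shifting superpower mentioned above can even replace the loop entirely: since each binary prefix covers exactly 10 bits, the unit index falls straight out of the bit length. A sketch (using Python's `int.bit_length`; in C/Java this would be a count-leading-zeros intrinsic):

```python
def human_readable_shift(num_bytes: int) -> str:
    """Pick the 1024-power directly from the bit length:
    each binary prefix spans 10 bits, so no loop or log is needed."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    if num_bytes < 1024:
        return f"{num_bytes} B"
    exp = (num_bytes.bit_length() - 1) // 10   # index into units
    exp = min(exp, len(units) - 1)
    value = num_bytes / (1 << (exp * 10))      # shift computes 1024**exp
    # Note: values just under the next unit (e.g. 1023.96) can still
    # print as "1024.0", the same rounding edge case as the article's bug.
    return f"{value:.1f} {units[exp]}"
```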
> Processors are inherently awesome at branching, adding, adding, shifting, etc. And shifting to get powers of 2 (i.e., KB vs. GB) is a superpower of its own. They're a little less awesome when it comes to math.pow(), math.log(), and math.log() / math.log().
And here's something to consider -- if you're converting a number to human-readable format, it's more likely than not you're about to do I/O with the resulting string, which is probably going to be an order of magnitude more expensive than the little function here.
Great point, I wish I'd mentioned it. The expense of the printf dwarfs the log / log (double divided by a double then cast to an int), which itself is greater than some repeated comparisons in a for loop.
It's key to be able to recognize this when thinking about performant code.
In other words, the entire exercise is silliness because the eventual printf is going to blow away any nanoseconds of savings by a smarter/shorter routine.
[flagged]
It's not that we think it's arcane or that we are in our own "bubbles of thought", it's that we aren't doing math. We're programming a computer. And a competent programmer would know, or at least suspect, that doing it with logarithms will be slower and more complicated for a computer. The author even points out that even he wouldn't use his solution.
P.S. Please look up the word literally.
2 replies →
> what are you people even programming that you need to know so absolutely little about how anything else in the entire world works
Feoren, your comment takes an incredibly superior attitude and accuses its reader, every reader, of being stupid.
When taking the log of a number, the value in general requires an infinite number of digits to represent. Computing log(100) / log(10) should return 2.0 exactly, but since log(100) returns a fixed number of digits and log(10) returns a fixed number of digits, are you 100% confident that the ratio will be exactly 2.0?
Maybe you test it and it does return exactly 2.0 (to the degree floating point can be exactly any value). Are you confident that such a calculation will also work for any power of 10? Maybe they all work on this Intel machine -- does it work on every Arm CPU? Every RISC-V CPU? Etc. I wouldn't be, but if I wrote a dumb "for" loop I'd be far more confident that I'd get the right result in every case.
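The concern is easy to demonstrate: on typical IEEE-754 doubles the ratio log(1000)/log(10) lands just below the mathematically exact 3, so truncating it picks the wrong power, while an integer divide loop can't go wrong:

```python
import math

# Mathematically log(1000)/log(10) is exactly 3, but with doubles the
# ratio may come out as 2.999...96, which int() truncates to 2.
ratio = math.log(1000) / math.log(10)
print(ratio)       # e.g. 2.9999999999999996 on many platforms
print(int(ratio))  # may be 2, not 3

def power_of_10(n: int) -> int:
    """Dumb integer loop: no rounding surprises on any CPU."""
    exp = 0
    while n >= 10:
        n //= 10
        exp += 1
    return exp

print(power_of_10(1000))  # 3, always
```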
1 reply →
> You're all literally writing CRUD React front-end javascript by copy-pasting "for" loops from StackOverflow?
To an approximation, yes.
The underlying calculations at my bank were probably written once in 1970 in COBOL and haven't changed meaningfully since. But the front-end UI to access it has gone from teletypes and punch cards to glass terminals to networked DOS to Win32 to ActiveX to Web 2.0 to React and mobile apps. Lots and lots of churn and work on the CRUD part, zero churn and work on the "need to remember logarithms" part.
AI? You have core teams building ChatGPT, Midjourney, etc. Then huge numbers of people accessing those via API, building CRUD sites to aggregate midjourney results and prompts, etc etc. Even Apple has made a drag-and-drop UI to train an AI object classifier, the ratio of people who had to know the math to make that vs the people using it is probably way above 1:100,000
Is this that surprising?
Well, maybe not exactly unmaintainable, but I think most of us have learned that floating-point operations are not to be trusted, especially if the code needs to run on different processors. Furthermore, calling such math operations is overkill most of the time; I would definitely never consider it for such a simple operation. I actually agree with you that it might look cleaner and easier to understand, but in my mind it would be such heavyweight overkill that I would never use it.
Obligatory, my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454
And yet it’s wrong like all the rest
How so?
1 reply →
Pretty awesome stuff. This is what Hacker News is for!
[flagged]
wtf, why would someone downvote this? This is prime Hacker News shit and why I come here!
I didn't downvote, but I would guess it's due to the general idea that if you just approve or disapprove of a post you should simply vote that way instead of expressing it in a comment. Personally, while I agree there's a logic to that, I find it a little cold for positive sentiments. I couldn't find it, but I think there's a PG or dang comment to the effect that "I like this" as a comment is explicitly not discouraged on HN, though obviously that doesn't mean everyone agrees.