Comment by vidarh
5 hours ago
> Most of the tests are BS too.
Why are you creating BS tests?
> And nobody is talking about verifying if the AI bubble sort is correct or not - but recognizing that if the AI is implementing its own bubble sort, you're waaaay out in left field.
Verifying time and space complexity is part of what your tests should cover.
But this is also a funny example - I'm willing to bet the average AI model today can write a far better sort than the vast majority of software developers, and is far more capable of analyzing time and space complexity than the average developer.
In fact, I just did a quick test with Claude, and asked for a simple sort that took into account time and space complexity, and "of course" it knows that it's well established that pure quicksort is suboptimal for a general-purpose sort, and gave me a simple hybrid sort based on insertion sort for small arrays, a heapsort fallback to stop pathological recursion, and a decently optimized quicksort. This won't beat e.g. timsort on typical data, but it's a good tradeoff between "simple" (quicksort can be written in 2-20 lines of code or so depending on language and how much performance you're willing to sacrifice for simplicity) and addressing the time/space complexity constraints. It's also close to a variant that incidentally was covered in an article in DDJ ca. 30 years ago, because most developers didn't know how to write one and were still writing stupidly bad sorts manually instead of relying on an optimized library. Fewer developers know how to write good sorts today. And that's not bad - it's a result of not needing to think at that level of abstraction most of the time any more.
And this is also a great illustration of the problem: Even great developers often have big blind spots, where AI will draw on results they aren't even aware of. Truly great developers will be aware of their blind spots and know when to research, but most developers are not great.
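For readers who haven't seen one, here is a minimal sketch of the kind of hybrid sort described above: quicksort with a median-of-three pivot, insertion sort for small ranges, and a heapsort fallback once recursion gets too deep. It is not the code Claude produced. The cutoff `SMALL`, the depth limit, and all function names are assumptions chosen for illustration.

```python
SMALL = 16  # assumed cutoff below which insertion sort takes over; tune by measurement


def _insertion_sort(a, lo, hi):
    # Sort a[lo:hi] in place; cheap for short or nearly sorted ranges.
    for i in range(lo + 1, hi):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x


def _sift_down(a, lo, root, end):
    # Restore the max-heap property for the heap stored in a[lo:lo + end].
    while True:
        child = 2 * root + 1
        if child >= end:
            return
        if child + 1 < end and a[lo + child] < a[lo + child + 1]:
            child += 1
        if not a[lo + root] < a[lo + child]:
            return
        a[lo + root], a[lo + child] = a[lo + child], a[lo + root]
        root = child


def _heapsort(a, lo, hi):
    # In-place heapsort of a[lo:hi]: O(n log n) worst case, O(1) extra space.
    n = hi - lo
    for root in range(n // 2 - 1, -1, -1):
        _sift_down(a, lo, root, n)
    for end in range(n - 1, 0, -1):
        a[lo], a[lo + end] = a[lo + end], a[lo]
        _sift_down(a, lo, 0, end)


def _partition(a, lo, hi):
    # Median-of-three pivot selection followed by a Lomuto partition of a[lo:hi].
    mid, last = (lo + hi - 1) // 2, hi - 1
    if a[mid] < a[lo]:
        a[lo], a[mid] = a[mid], a[lo]
    if a[last] < a[lo]:
        a[lo], a[last] = a[last], a[lo]
    if a[last] < a[mid]:
        a[mid], a[last] = a[last], a[mid]
    a[mid], a[last] = a[last], a[mid]  # move the median to the end as pivot
    pivot, store = a[last], lo
    for i in range(lo, last):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[last] = a[last], a[store]
    return store


def hybrid_sort(a):
    """Sort the list a in place and return it."""
    def sort(lo, hi, depth):
        while hi - lo > SMALL:
            if depth == 0:
                _heapsort(a, lo, hi)  # too many bad partitions: bail out
                return
            depth -= 1
            p = _partition(a, lo, hi)
            # Recurse into the smaller side and loop on the larger one,
            # which keeps the call stack at O(log n).
            if p - lo < hi - (p + 1):
                sort(lo, p, depth)
                lo = p + 1
            else:
                sort(p + 1, hi, depth)
                hi = p
        _insertion_sort(a, lo, hi)

    if len(a) > 1:
        sort(0, len(a), 2 * len(a).bit_length())
    return a
```

The depth limit is what caps the worst case at O(n log n): inputs that defeat the pivot choice (for example, many equal keys with this simple partition) fall through to heapsort instead of going quadratic.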
But a human developer, even a not so great one, might know something about the characteristics of the actual data a particular program is expected to encounter that is more efficient than this AI-coded hybrid sort for this particular application. This is assuming the AI can't deduce the characteristics of the expected data from the specs, even if a particular time and space complexity is mandated.
I encountered something like this recently. I had to replace an exact data comparison operation (using a simple memcmp) with a function that would compare data and allow differences within a specified tolerance. The AI generated beautiful code using chunking and all kinds of bit twiddling that I don't understand.
But what it couldn't know was that most of the time the two data ranges would match exactly, thus taking the slowest path through the comparison by comparing every chunk in the two ranges. I had to stick a memcmp early in the function to exit early for the most common case, because it only occurred to me during profiling that most of the time the data doesn't change. There was no way I could have figured this out early enough to put it in a spec for an AI.
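A rough sketch of the shape of that fix, in Python rather than C. The thread doesn't say what the data looks like, so this assumes buffers of unsigned 8-bit samples and a per-byte tolerance; the function name is made up. The `a == b` check plays the role of the early `memcmp`.

```python
def within_tolerance(a: bytes, b: bytes, tol: int) -> bool:
    """Return True if a and b have equal length and every byte pair differs by at most tol."""
    if len(a) != len(b):
        return False
    # Fast path for the common case found by profiling: the buffers are
    # identical. Equality on bytes objects is a single byte-wise comparison
    # done in C, so it is far cheaper than the element-by-element loop below.
    if a == b:
        return True
    # Slow path: only reached when the buffers actually differ somewhere.
    return all(abs(x - y) <= tol for x, y in zip(a, b))
```

The fast path adds one extra pass over the data when the buffers do differ, so it only pays off if exact matches really are the common case.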
> But a human developer, even a not so great one, might know something about the characteristics of the actual data a particular program is expected to encounter that is more efficient than this AI-coded hybrid sort for this particular application.
Sure. But then that belongs in a test case that 1) documents the assumptions, 2) demonstrates whether a specialised solution actually improves on the naive implementation, and 3) will catch regressions if/when those assumptions no longer hold.
My experience in that specific field is that odds are the human is making incorrect assumptions, and only very occasionally is not. Having a proper test harness to benchmark this is essential to validate the assumptions, whether a human or an AI does the implementation (and not least in case the characteristics of the data end up changing over time).
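As a sketch of what such a harness could look like, here it is applied to the tolerance-comparison example from earlier in the thread. Everything in it is an assumption for illustration: the function names, the byte-wise tolerance, and the "mostly identical" input generator that encodes the data assumption.

```python
import timeit


def naive_compare(a, b, tol):
    # Baseline: always walks every byte pair.
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def specialised_compare(a, b, tol):
    # Specialised: exact-match fast path, then fall back to the baseline.
    return a == b or naive_compare(a, b, tol)


def mostly_identical_input():
    # 1) Documented assumption: the two buffers are usually byte-for-byte equal.
    data = bytes(range(256)) * 64
    return data, bytes(bytearray(data)), 2  # second buffer is a distinct copy


def benchmark(candidates, make_input, repeat=5, number=20):
    """Return {name: best seconds per call} for each candidate on the same input."""
    args = make_input()
    results = {}
    for name, fn in candidates.items():
        timings = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


if __name__ == "__main__":
    candidates = {"naive": naive_compare, "specialised": specialised_compare}
    args = mostly_identical_input()
    # Correctness first: both implementations must agree on the same input.
    assert naive_compare(*args) == specialised_compare(*args)
    # 2) Measure whether the specialised version actually helps on this data.
    for name, seconds in benchmark(candidates, mostly_identical_input).items():
        print(f"{name}: {seconds * 1e6:.1f} us per call")
```

For point 3, a regression check could assert that the specialised timing stays below the naive one by some margin. Timing assertions are noisy, so the margin and the machine it runs on need to be chosen with care, and the input generator should be revisited if the real data changes.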
> There was no way I could have figured this out early enough to put it in a spec for an AI.
This is an odd statement to me. You act like the AI can only write the application once and can never look at any other data to improve the application again.
> only occurred to me during profiling
At least to me this seems like something that is at far more risk of being automated than general application design in the first place.
Have the AI design the app. Pass it off to CI/CD testing and compile it. Send to a profiling step. AI profile analysis. Hot point identification. Return to AI to reiterate. Repeat.
> At least to me this seems like something that is at far more risk of being automated than general application design in the first place.
This function is a small part of a larger application with research components that are not AI-solvable at the moment. Of course a standalone function could have been optimised with AI profiling, but that's not the context here.