Comment by kypro
1 day ago
In a few years hopefully the AI reviewers will be far more reliable than even the best human experts. This is generally how competency progresses in AI...
For example, at one point a human + computer was the strongest combo in chess; now you'd be insane to let a human critique a chess bot, because they're so unlikely to add value, and statistically a human in the loop would be far more likely to introduce error. Similar things can be said of fields like machine vision, etc.
Software is about to become much higher quality and be written at much, much lower cost.
My prediction is that for that to happen we’ll need to figure out a way to measure software quality in the way we can measure a chess game, so that we can use synthetic data to continue improving the models.
I don’t think we are anywhere close to doing that.
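To make the chess comparison concrete, here's a rough, purely illustrative Python sketch (all the names are made up): in chess the rules themselves act as a free oracle that scores every self-play game, so synthetic training data labels itself. That's exactly the thing software doesn't have.

    # Purely illustrative sketch with hypothetical stand-ins (not any real
    # engine's API): in chess, the environment verifies the outcome of every
    # self-play game, so synthetic data comes pre-labelled for free.
    import random

    def play_self_play_game(max_moves=40):
        """Stand-in for an engine's self-play: returns a move sequence plus
        a result that the rules of chess can verify exactly."""
        moves = [random.choice(["e4", "d4", "Nf3", "c4"]) for _ in range(max_moves)]
        result = random.choice([1, 0, -1])  # win / draw / loss, checkable from the rules
        return moves, result

    # Generate a million labelled training examples with zero human review:
    # the label IS the game outcome.
    dataset = [play_self_play_game() for _ in range(1_000_000)]

    # Software has no analogous oracle: there's no cheap verify(diff) that
    # returns "this change is correct and high quality", which is why the
    # same loop can't mass-produce trustworthy labels for code.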
Agreed. Chess is, for all intents and purposes, a 'solved' game; it's merely a matter of processing power, and even then we have shortcuts that are easily 'good enough' to play perfectly against human opponents.
But how do you reduce the requirements for software to something as simple and elegant as the rules of chess? Is it foolish to assume that if we could have, we already would have? Even for humans, the process of writing software involves a lot of guess-and-check most of the time. The idea that you could sit down and think through every aspect of a piece of software, then describe it immaculately, then translate that description into a working solution with no bugs, no review, and no need for course correction is just… it's a pipe dream.
Not really... If you're an average company you're not trying to produce perfect software; you're optimising for some balance between cost and quality. At some point, through ordinary capitalist forces, companies will naturally realise that it's more productive not to have humans in the loop.
A good analogy might be how machines gradually replaced textile workers in the 19th century. Were the machines better? Or was there a way to quantitatively measure the quality of their output? No. But at the end of the day, companies which embraced the technology were more productive than those which didn't, and the quality didn't decrease enough (if it decreased at all) for customers to stop doing business with them – so these companies won out.
The same will naturally happen in software over the next few years. You'd be a moron to hire a human expert for $200,000 to critique a cybersecurity-optimised model that costs maybe a hundredth of the cost of employing a human... And this would likely be true even if we assume the human will catch the odd thing the model wouldn't, because there's no such thing as perfect security – it's always a trade-off between cost and acceptable risk.
Bookmark this and come back in a few years. I made similar predictions when ChatGPT first came out: that within a few years agents would be picking up tickets and raising PRs. Everyone said LLMs were just stochastic parrots and this would never happen; well, now it has, and companies are increasingly writing more and more code with AI. At my company it's a little over 50% at the mo, but this is increasing every month.
Almost none of what you said about the past is true. Automated looms, and all the other automated machinery that replaced artisans over the course of the industrial revolution, produced items of much better quality than human craftsmen could by the time they were used commercially, because of precision and repeatability. There were quantitative measurements of quality for textiles and other goods, and the automated processes exceeded human craftsmen on those metrics.
Software is also not remotely similar to textiles. A subtle flaw in the textile output itself won't cause potentially millions of dollars in damages, the way a bug in the automated loom itself, or in software, can.
No current technology is anywhere close to being able to automate 50% of PRs on any non-trivial application (which is not the same as saying that 50% of PRs merged at your startup happen to have an agent as author). To assume that current models will get near 100% without massive model improvements is just that: an assumption.
My point about synthetic data is that, with current technology, we need orders of magnitude more data, and the only way we will get there is with synthetic data – which is much, much harder to generate for software applications than for chess games.
The point isn't that we need a quantitative measure of software quality for AI to be useful at all, but that we need one for synthetic data generation to work – to give us those orders of magnitude more training data.