Comment by hathawsh
2 months ago
Are you sure? The third section of each review lists the “Most prescient” and “Most wrong” comments. That sounds exactly like what you're looking for. For example, on the "Kickstarter is Debt" article, here is the LLM's analysis of the most prescient comment. The analysis seems accurate and helpful to me.
https://karpathy.ai/hncapsule/2015-12-03/index.html#article-...
phire
> “Oculus might end up being the most successful product/company to be kickstarted…
> Product wise, Pebble is the most successful so far… Right now they are up to major version 4 of their product. Long term, I don't think they will be more successful than Oculus.”
With hindsight:
- Oculus became the backbone of Meta’s VR push, spawning the Rift/Quest series and a multi‑billion‑dollar strategic bet.
- Pebble, despite early success, was shut down and absorbed by Fitbit barely a year after this thread.
That’s an excellent call on the relative trajectories of the two flagship Kickstarter hardware companies.
Until someone publishes a systematic quality assessment, we're grasping at anecdotes.
It is unfortunate that the questions of "how well did the LLM do?" and "how does 'grading' work in this app?" seem to have gone out the window when HN readers see something shiny.
Yes. And the article is a perfect example of the dangerous sort of automation bias that people will increasingly slide into when it comes to LLMs. I realize Karpathy is somewhat incentivized toward this bias given his career, but he doesn't spend a single sentence so much as suggesting that the results need further inspection, or that they might be inaccurate.
The LLM is consulted like a perfect oracle, flawless in its ability to perform a task, and it's left at that. Its results are presented totally uncritically.
For this project, of course, the stakes are nil. But how long until this unfounded trust in LLMs works its way into high-stakes problems? Centuries of deterministic machines have ingrained in us a trust in the reliability of machines that should be suspended when dealing with an inherently stochastic device like an LLM.
I get what you're saying, but looking at some examples, they look kind of right, yet there are misleading facts sprinkled throughout that make the grading wrong. It's useful, but I'd be careful about using it to make decisions.
Some of the issues could be resolved with better prompting (it was biased toward interpreting every comment through the lens of predictions) and LLM-as-a-judge, but still. For example, Anthropic's Deep Research prompts sub-agents to pass along original quotes instead of paraphrasing, because paraphrasing can degrade the original message.
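To make that concrete, here's a minimal sketch of the two guardrails mentioned above: a judge prompt that forces verbatim quoting, and a post-hoc check that every quoted span actually appears in the source comment. All names and the prompt wording are hypothetical illustrations, not Karpathy's or Anthropic's actual implementation.

```python
import re

# Hypothetical judge prompt that forbids paraphrasing (assumption, for illustration).
JUDGE_PROMPT = """You are grading a Hacker News comment for prescience.
Rules:
- Quote the comment VERBATIM between <quote></quote> tags; never paraphrase.
- If the comment expresses a wish rather than a prediction, say so and do not
  grade it as a prediction.
Comment:
{comment}
"""

def extract_quotes(judge_output: str) -> list[str]:
    """Pull every <quote>...</quote> span out of the judge's response."""
    return re.findall(r"<quote>(.*?)</quote>", judge_output, flags=re.DOTALL)

def quotes_are_verbatim(judge_output: str, source_comment: str) -> bool:
    """Reject the grade if any 'quote' is not an exact substring of the source."""
    quotes = extract_quotes(judge_output)
    return bool(quotes) and all(q in source_comment for q in quotes)

# Usage: a paraphrased "quote" is caught, a verbatim one passes.
source = "I wish Swift would become a serious competitor to JavaScript on the server."
paraphrased = "<quote>Swift will replace JavaScript</quote>"
verbatim = "<quote>a serious competitor to JavaScript</quote>"
print(quotes_are_verbatim(paraphrased, source))  # False
print(quotes_are_verbatim(verbatim, source))     # True
```

A check like this wouldn't fix misreadings (like grading a wish as a prediction), but it would at least prevent the grader from attributing words to a commenter that they never wrote.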
Some examples:
sebastiank123 got a C-, and was quoted by the LLM as saying:
Now, let's read his full comment:
I don't interpret it as a prediction but as a desire. The user is praising Swift: if it went the server route, perhaps it could replace JS, as the user wishes. To make it even clearer, if someone had asked the commenter right after, "Is that a prediction? Are you saying Swift is going to become a serious JavaScript competitor?", I don't think the answer would have been "yes" in this context.
Full quote:
"Any reasonable definition of 'significant' is satisfied"? That's not how I would interpret this. We clearly see a duopoly in North America. It's not wrong per se, but I'd say it's misleading. I know we could take this argument and look at other slices of the data (premium phones worldwide, for instance); I'm just saying it's not as clear-cut as the grading made it out to be.
That's not what the user was saying:
He was praising him, and he did indeed miss opportunities at first. The OC wasn't making a prediction about his later days.
Full quote:
Full quote:
I thought the debate was useful and so did pjbrunet, per his update.
I mean, we could go on, there are many others like these.