Comment by nisten
17 hours ago
While I don't want to discount the work of any physician-founded org, having worked with physicians and knowing the pain they go through after seeing 18 patients in a day's work, this still just looks like bad software: no testing, no internal bench.
Did you define some kind of zod schema, or compare the error rates of different models on this task? Did you bother setting up any kind of structured JSON output at all? Did you add a second validation step with a different model and then compare whether their numbers match?
It looks like no, they just deferred the whole thing to authority. Technically there's no difference between them saying gpt5-mini did this and saying llama2-7b did this.
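For reference, the scaffolding I'm asking about is not exotic. A rough TypeScript sketch, with a made-up extraction schema and a placeholder callModel function standing in for whatever client they actually use:

    import { z } from "zod";

    // Hypothetical extraction schema; the field names here are made up for illustration.
    const Extraction = z.object({
      diagnosis: z.string(),
      icd10: z.string().regex(/^[A-Z]\d{2}(\.\d{1,2})?$/),
      confidence: z.number().min(0).max(1),
    });
    type Extraction = z.infer<typeof Extraction>;

    // Stand-in for whatever model is actually being called (gpt5-mini, llama2-7b, whatever).
    // The point is the validation wrapped around it, not the call itself.
    async function callModel(prompt: string): Promise<string> {
      throw new Error("wire up your LLM client here");
    }

    // Reject anything that isn't valid JSON matching the schema, retry a couple of times,
    // and surface a hard error instead of silently trusting whatever came back.
    async function extractWithValidation(note: string, maxRetries = 2): Promise<Extraction> {
      for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const raw = await callModel(`Return ONLY a JSON object with the required fields.\n${note}`);
        try {
          const parsed = Extraction.safeParse(JSON.parse(raw));
          if (parsed.success) return parsed.data;
          console.warn(`attempt ${attempt}: schema violations`, parsed.error.issues);
        } catch {
          console.warn(`attempt ${attempt}: model returned non-JSON output`);
        }
      }
      throw new Error("model output never passed schema validation");
    }

That alone catches the obvious failure modes and gives you a log of how often each model breaks the schema.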
Literally every single LLM will make errors and hallucinate. It's your job to put all the scaffolding around it to make sure it doesn't, or at least that it errs a lot less than a skilled human would.
So have you measured the error rate, or at least tried to put some kind of error-catching mechanism in place, the way any professional software would?
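Measuring the error rate and adding the second-model check is the same order of effort. Another sketch, building on the snippet above (extractWithValidation and Extraction are reused; the second-model stub and the hand-labelled cases are placeholders):

    // Second-opinion path: same validation wrapper, backed by a different model/provider.
    // Stub here; in practice this would be a second, independent client.
    async function extractWithSecondModel(note: string): Promise<Extraction> {
      throw new Error("wire up a different LLM client here");
    }

    // Cross-check: run the same note through both models and flag disagreements for
    // human review instead of letting either answer through unexamined.
    async function extractWithCrossCheck(note: string) {
      const [a, b] = await Promise.all([
        extractWithValidation(note),
        extractWithSecondModel(note),
      ]);
      return { result: a, needsHumanReview: a.icd10 !== b.icd10 };
    }

    // Error rate against a small hand-labelled set: the "internal bench" that's missing.
    async function measureErrorRate(cases: { note: string; expectedIcd10: string }[]) {
      let wrong = 0;
      for (const c of cases) {
        const out = await extractWithValidation(c.note);
        if (out.icd10 !== c.expectedIcd10) wrong++;
      }
      return wrong / cases.length;
    }

Run measureErrorRate per model and you actually have numbers to compare instead of an appeal to authority.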