Comment by simianwords
13 hours ago
I remember trying to set something up with the ChatGPT equivalent, like "notify me only if there are traffic disruptions on my route every morning at 8am", and it would notify me every morning even when there was no disruption.
This is because for some reason all agentic systems think that slapping cron on it is enough, but that completely ignores decades of knowledge about prospective memory. Take a look at https://theredbeard.io/blog/the-missing-memory-type/ for a write-up on exactly that.
“A programmer is going to the store and his wife tells him to buy a gallon of milk, and if there are eggs, buy a dozen. So the programmer goes shopping, does as she says, and returns home to show his wife what he bought. But she gets angry and asks, ‘Why’d you buy 13 gallons of milk?’ The programmer replies, ‘There were eggs!’”
You need to write a clearer prompt.
"I need to fly to NY next weekend, make the necessary arrangement".
Your AI assistant orders an experimental jetpack from a random startup lab. Would you have honestly guessed that the prompt was "ambiguous" before you knew how the AI was going to act on it?
Did GP edit their comment? Or did you read the prompt they used somewhere else?
Why not set up your own evals with something like pi-mono for that? https://github.com/badlogic/pi-mono/
You'll define exactly what good looks like.
Me too. It doesn't have the ability to alert only on a true positive. It has to also alert on a true negative. So dumb
This doesn't seem too hard to solve, except for the ever-recurring LLM output validation problem: if the true positive is rare, you don't know whether the earthquake alert system works until there's an earthquake.
... just force the data into a structured format, then use "hard code" on the structure.
"Generate the following JSON formatted object array representing the interruptions in my daily traffic. If no results, emit []. Send this at 8am every morning. {some schema}. Then run jsonreporter.py"
Then just let jsonreporter.py discriminate however it likes. Keep the LLMs doing what they are good at, and keep hard code doing what it's good at.
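A minimal sketch of what that discriminator could look like. The name `jsonreporter.py` comes from the comment above; the schema (an array of objects with a `description` field) and the alert mechanism (just printing) are assumptions for illustration:

```python
import json

def report(raw: str) -> list[str]:
    """Turn the LLM's JSON output into alert lines.

    The key discriminator is hard-coded: an empty array means
    "no disruptions", so we stay silent instead of alerting.
    """
    try:
        disruptions = json.loads(raw)
    except json.JSONDecodeError:
        # The LLM broke the contract; surface that instead of guessing.
        return ["WARNING: malformed JSON from model, manual check needed"]
    if not isinstance(disruptions, list):
        return ["WARNING: expected a JSON array, manual check needed"]
    if not disruptions:
        return []  # true negative: no alert at all
    # 'description' is a hypothetical schema field for this example.
    return [
        f"Disruption: {d.get('description', d) if isinstance(d, dict) else d}"
        for d in disruptions
    ]

# Silent on the empty array, alerts otherwise:
print(report("[]"))
print(report('[{"description": "lane closure on I-95"}]'))
```

The point is that the "alert or not" decision never touches the LLM: it is a plain `if` on parsed data, so the false-positive failure mode described upthread can't happen unless the model fabricates nonempty results.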