Comment by bilsbie
3 days ago
I just thought of a good test. Anyone have feedback?
We completely remove a couple of simple, obvious inventions from the training data and then see if the AI can come up with them on its own. Perhaps a toothbrush, for example. Or a comb? But there could be better examples that would also have minimal effect on the final AI.
Training is expensive, so we wouldn't want to leave out anything important like the wheel.
It’s very, very hard to remove things from the training data and be sure there is zero leakage.
Another idea would be to use, for example, a 2024 state of the art model to try to predict discoveries or events from 2025.
Ilya Sutskever suggested the same basic idea but for testing for consciousness.
I have no idea why this is a PDF, but here's a transcript: https://ecorner.stanford.edu/wp-content/uploads/sites/2/2023...
LLM companies try to optimize their benchmark results, not to test the capabilities of their systems. This is why all the benchmarks are so utterly useless.
Ok, you do it. Here’s the internet: https://internet Make sure you don’t miss any references while you’re combing through, though.
I see your point, but off the top of my head: run a simple regex over each document for a list of dental-related words, and anything that matches gets earmarked for a small LLM to determine whether it actually describes the toothbrush concept.
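A minimal sketch of that two-stage filter. The word list and the LLM check are placeholders I made up; a real pipeline would need a proper word list per language and an actual model call instead of the stand-in heuristic:

```python
import re

# Hypothetical dental word list; a real filter would need far more
# terms, plus equivalents in every language.
DENTAL_WORDS = ["toothbrush", "tooth brush", "bristle", "dental hygiene"]
DENTAL_RE = re.compile("|".join(map(re.escape, DENTAL_WORDS)), re.IGNORECASE)

def needs_llm_review(doc: str) -> bool:
    """Cheap regex pre-filter: earmark documents for the small LLM."""
    return bool(DENTAL_RE.search(doc))

def llm_says_toothbrush(doc: str) -> bool:
    """Placeholder for the small-LLM check; swap in a real model call."""
    return "brush" in doc.lower()  # stand-in heuristic, not a real LLM

docs = [
    "He brushed the crumbs off the table.",
    "Clean the bristles of your toothbrush after use.",
]
flagged = [d for d in docs if needs_llm_review(d) and llm_says_toothbrush(d)]
print(len(flagged))  # prints 1: only the second document passes both stages
```

The regex stage exists only to keep the expensive LLM stage off the vast majority of documents; it should be tuned for recall, since anything it misses never reaches the second check.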
I forgot to mention you’ll have to do this for every language and every possible phrasing. Good luck.