Comment by crazylogger

7 days ago

I think next year's AI benchmarks are going to be like this project: https://www.anthropic.com/research/project-vend-1

Give the AI tools and let it do real stuff in the world:

"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.

Testing a plain LLM on whiteboard-style questions is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.

This sounds like a terrible idea to me: you're training intelligent computers to aim for power. It's fine as long as they're bad, but if they get good, then we have a problem.