Comment by abdullin
1 day ago
I’m working on a platform to run a friendly competition in “who builds the best reasoning AI Agent”.
Each participating team (got 300 signups so far) will get a set of text tasks and a set of simulated APIs to solve them.
For instance, a task (a typical chatbot task) could say something like: “Schedule a 30m knowledge exchange next week between the most experienced Python expert in the company and the 3-5 people most interested in learning it.”
The AI agent has to work through this by using a set of simulated APIs and playing a bit of calendar Tetris (in this case: Calendar API, Email API, SkillWill API).
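To give a rough idea of what an agent is working against, here is a sketch of such simulated API surfaces in Go; all names and signatures are illustrative, not the platform's actual ones.

```go
// Illustrative only: these interfaces just sketch the kind of surface
// an agent would call, not the real platform API.
package sim

import "time"

type Slot struct {
	Start time.Time
	End   time.Time
}

type CalendarAPI interface {
	FreeSlots(user string, from, to time.Time) ([]Slot, error)
	CreateEvent(title string, attendees []string, slot Slot) (eventID string, err error)
}

type EmailAPI interface {
	Send(to []string, subject, body string) error
}

type SkillWillAPI interface {
	// Skills returns self-reported skill levels, e.g. {"python": 5}.
	Skills(user string) (map[string]int, error)
	// Interests returns topics a user wants to learn.
	Interests(user string) ([]string, error)
}
```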
Since API instances are simulated and isolated (per team per task), it becomes fairly easy to automatically check the correctness of each solution and rank different agents on a global leaderboard.
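Because each team gets its own isolated instance, grading can boil down to inspecting the final simulated state. A rough sketch of such a check for the scheduling task above (types and fields invented for illustration):

```go
// Sketch of an automatic grader: inspect the final state of one team's
// isolated simulation and decide whether the scheduled event satisfies
// the task. All types and fields here are illustrative.
package grader

import "time"

type Event struct {
	Attendees []string
	Start     time.Time
	End       time.Time
}

type FinalState struct {
	Events []Event
}

// CheckKnowledgeExchange verifies that some event has the expected expert,
// 3-5 learners from the expected set, and a 30-minute duration.
func CheckKnowledgeExchange(s FinalState, expert string, learners map[string]bool) bool {
	for _, e := range s.Events {
		if e.End.Sub(e.Start) != 30*time.Minute {
			continue
		}
		hasExpert, learnerCount := false, 0
		for _, a := range e.Attendees {
			switch {
			case a == expert:
				hasExpert = true
			case learners[a]:
				learnerCount++
			}
		}
		if hasExpert && learnerCount >= 3 && learnerCount <= 5 {
			return true
		}
	}
	return false
}
```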
The agents' code stays external, but participants fill out and submit brief questionnaires about their architectures.
By benchmarking different agentic implementations on the same tasks, we get to see patterns in the performance, accuracy, and cost of various architectures.
The platform's codebase is written mostly in Go (to support thousands of concurrent simulations). I'm using coding agents (Claude Code and Codex) for exploration and easy coding tasks, but the core still has to be handcrafted.
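For the concurrency side, the standard Go pattern of bounding goroutines with a semaphore channel goes a long way; this is a generic sketch of that pattern, not the platform's actual code:

```go
// Generic pattern for running many isolated simulations concurrently
// with a bounded number of goroutines; purely illustrative.
package main

import (
	"fmt"
	"sync"
)

type Simulation struct {
	Team string
	Task string
}

func run(sim Simulation) string {
	// ...spin up isolated API state, let the agent interact, grade it...
	return fmt.Sprintf("%s/%s: ok", sim.Team, sim.Task)
}

func main() {
	sims := []Simulation{{"team-a", "task-1"}, {"team-b", "task-1"}}

	sem := make(chan struct{}, 256) // cap on concurrent simulations
	var wg sync.WaitGroup
	results := make(chan string, len(sims))

	for _, s := range sims {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot before launching
		go func(s Simulation) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results <- run(s)
		}(s)
	}
	wg.Wait()
	close(results)

	for r := range results {
		fmt.Println(r)
	}
}
```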
Ooooh, neat, I had a similar idea: like an AI olympics that could be live-streamed, where they have to do several multi-step tasks.
Yep, exactly the same concept. Except not live-streaming, but giving out a lot of multi-step tasks that require reasoning and adaptation.
Here is a screenshot of a test task: https://www.linkedin.com/posts/abdullin_ddd-ai-sgr-here-is-h...
Although… since I record all interactions, I could replay them all as if they were streamed.