Comment by deevus

21 days ago

This is fantastic. I haven't got any local inference as I can't afford it right now, but tool calling has been a concern for me with these smaller models through OpenRouter.

I've been working on a pytest-first acceptance testing framework called Dokimasia (do-kee-ma-see-ah) that I'd love to get your thoughts on: https://github.com/deevus/dokimasia

Acceptance testing might not be what you need for Forge, but since you're deep in AI tool building I thought you may have opinions.

Oh, interesting idea. Formalizing an abstraction layer for testing all the integration types out there in the AI ether, essentially? MCP, skills, etc.

I think this sits a level higher than Forge - maybe testing the workflow proper and the integration points it might surface (say, if some tools give access to an MCP server).

Could likely layer both together without much trouble.

Only thing I'd be curious about is how you handle the non-deterministic nature of these models. Sometimes they get the tool call right, sometimes they barf bad JSON. Does the suite run multiple trials?
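
To make the multiple-trials question concrete, here's a rough sketch of the kind of check I have in mind: run the same prompt N times and assert on a pass rate rather than on a single run. None of this is Dokimasia's or Forge's actual API. `is_valid_tool_call`, `pass_rate`, `scripted_model`, the `{"tool": ...}` payload shape, and the 0.7 threshold are all made up for illustration, and the "model" is just a scripted stand-in so the example runs without a network call.

```python
import itertools
import json
from typing import Callable, Sequence


def is_valid_tool_call(raw: str, expected_tool: str) -> bool:
    """True if raw parses as a JSON object naming the expected tool.

    The {"tool": ...} shape is an assumption for this sketch.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("tool") == expected_tool


def pass_rate(call_model: Callable[[], str], expected_tool: str, trials: int) -> float:
    """Call the model `trials` times and return the fraction of valid tool calls."""
    passed = sum(is_valid_tool_call(call_model(), expected_tool) for _ in range(trials))
    return passed / trials


def scripted_model(responses: Sequence[str]) -> Callable[[], str]:
    """Hypothetical stand-in for a real model: replays canned responses in a loop."""
    replay = itertools.cycle(responses)
    return lambda: next(replay)


GOOD = '{"tool": "get_weather", "args": {"city": "Oslo"}}'
BAD = '{"tool": "get_weather", "args": '  # truncated, so it is not valid JSON


def test_tool_call_is_reliable_enough():
    # Three good responses for every bad one, repeated over 8 trials.
    model = scripted_model([GOOD, GOOD, BAD, GOOD])
    assert pass_rate(model, "get_weather", trials=8) >= 0.7
```

With a real model you'd swap `scripted_model` for the actual call and tune the trial count and threshold to whatever flakiness you can tolerate (and afford, if you're paying per request).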