Comment by behnamoh

1 day ago

I couldn't care less, tbh. This speed is ridiculously high, to the point where tool calls, not inference, become the bottleneck.

Also, I'd rather run a large model at slower speeds than a smaller one at insanely high speeds.

Well, the thing is that usage of these models is only increasing, so getting the most out of your hardware becomes a particularly interesting optimization point for companies doing inference at massive scale.

You care enough to comment, so you could in fact have cared even less.

Also, the entire industry profits from the work that’s done at the bleeding edge. That’s the case in every industry.

Have you considered parallel processing? I always have 2-3 Cursor IDE windows open because I don't like waiting either.

  • Parallel tool calls do not work for my scenario. I can't ask a copy of my agent a question about something until a dependent call has resolved.

    Tool use that changes the mode of the environment is a good example where you cannot go parallel. I've built a recursive agent that can run a Unity editor and I can't just blindly run whatever it wants in parallel or combos like SwitchScene -> GetSceneOverview won't interleave correctly. You'll wind up with 15 calls that loop over every scene and then you grab the overview from the last scene you switched to 15 times.

    There are ways to hack around it a bit, but at some level the underlying narrative does need to be serialized or you'll be wasting an incredible amount of resources.

    Depth-first search doesn't guarantee the best solution, but it typically reaches *a* solution faster than breadth-first search. It's worth waiting for those dependent calls and going super deep if you want some reasonable answer quickly.
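
    The SwitchScene -> GetSceneOverview failure mode above can be sketched in a few lines. This is a toy Python model, not actual Unity tooling; the editor class and function names are made up for illustration:

    ```python
    # Toy model of an environment with one piece of mutable mode state
    # (the "current scene"). Hypothetical names, not a real Unity API.
    class Editor:
        def __init__(self):
            self.current_scene = None

        def switch_scene(self, name):
            self.current_scene = name  # mutates shared environment state

        def get_scene_overview(self):
            return f"overview of {self.current_scene}"

    def serial(scenes):
        """Each overview waits for its dependent SwitchScene call."""
        ed = Editor()
        results = []
        for s in scenes:
            ed.switch_scene(s)                       # dependent call resolves first...
            results.append(ed.get_scene_overview())  # ...then we read the overview
        return results

    def naive_parallel(scenes):
        """A naive batcher fires all SwitchScene calls, then all overviews."""
        ed = Editor()
        for s in scenes:
            ed.switch_scene(s)
        return [ed.get_scene_overview() for _ in scenes]

    print(serial(["A", "B", "C"]))
    # ['overview of A', 'overview of B', 'overview of C']
    print(naive_parallel(["A", "B", "C"]))
    # ['overview of C', 'overview of C', 'overview of C']
    ```

    The parallel version reads the last scene you switched to N times, which is exactly why the call narrative has to stay serialized when tool calls change the mode of the environment.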