Comment by russdill

14 hours ago

I've experimented with having different-sized LLMs cooperate: the smaller LLM starts a response while the larger LLM is spinning up, and the larger model is then fed that initial response so it can continue it.

The other idea is to have an LLM follow along and continuously predict what the speaker will say. That would allow a response to be generated continuously; if the prediction turns out to be correct, the response can be started with zero latency.
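As a rough sketch of that zero-latency path (the function names here are placeholders for whatever prediction and generation calls you'd actually use, not any real API):

```python
def prefetch(partial_utterance, predict_completion, generate):
    """While the user is still speaking, predict how the utterance
    will end and pre-generate a reply for that predicted utterance."""
    guess = predict_completion(partial_utterance)
    return guess, generate(guess)

def respond(final_utterance, guess, cached_reply, generate):
    """If the prediction was right, the cached reply can start
    streaming with zero added latency; otherwise fall back to
    generating a reply from the actual utterance."""
    if final_utterance == guess:
        return cached_reply
    return generate(final_utterance)
```

In practice you'd re-run the prefetch as each new chunk of speech arrives, keeping only the most recent cached reply.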

Google seems to be experimenting with this with their AI Mode. They used to be more likely to send 10 blue links in response to complex queries, but now they may instead start you off with slop.

(Meanwhile at OpenAI: testing out the free ChatGPT, it feels like they prompted GPT 3.5 to write at length based on the last one or maybe two prompts)

  • This is more of a "Are all the windows closed upstairs?"

    "The windows upstairs..."

    "...are all closed except for the bedroom window"

    The first portion of the response takes a couple of seconds to play back, but only a few tens of milliseconds to start streaming from a small model. Currently I just break the small model's response off at whatever point buys about enough time to spin up the larger model.

    But all responses spin up both models.

    • Whoa, that thing's fast. Very nice! Bet that's fun to play with; at least it was probably fun the first time you saw it working :)
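The spin-up handoff described above could be sketched something like this (a toy illustration, assuming both models expose a token-streaming generator; the token budget standing in for "enough time to spin up the larger model" is a made-up parameter):

```python
def handoff(prompt, small_model, large_model, spinup_tokens):
    """Stream a prefix from the fast small model, cutting it off
    after enough tokens to cover the large model's spin-up time,
    then let the large model continue from that prefix."""
    prefix = []
    for tok in small_model(prompt):
        prefix.append(tok)
        # Stop once the prefix should play long enough to hide
        # the large model's startup latency.
        if len(prefix) >= spinup_tokens:
            break
    # The large model is fed the prefix so it can continue it.
    return prefix + list(large_model(prompt, prefix))
```

A real version would pick the cutoff from estimated playback time rather than a fixed token count, and would run both models concurrently instead of sequentially as shown here.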