
Comment by SpaceManNabs

7 days ago

Not affiliated with them, and I might be a little out of date, but here are my guesses:

1. Prompt caching (reusing computation for prompts or prompt prefixes seen before)

2. Some RAG (retrieval-augmented generation) to save resources

3. Of course, lots of model optimizations and CUDA optimizations

4. Lots of throttling / rate limiting

5. Offloading parts of the answer that are better served by other approaches (e.g., if asked to add numbers, make a system call to a calculator instead of using the LLM)

6. A lot of sharding (splitting the model and/or the request traffic across many machines)
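Item 1 can be sketched in miniature. Production systems cache attention KV states per token prefix; this toy version (all names are my own, not any real API) just caches whole response strings keyed by prompt hash, which is enough to show why repeated prompts get cheap:

```python
import hashlib

class PromptCache:
    """Toy prompt cache: reuse a stored response when the exact
    prompt has been seen before. Real serving stacks cache the
    model's KV state per prefix; this only caches strings."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str):
        self._store[self._key(prompt)] = response

def answer(prompt: str, cache: PromptCache, model_call) -> str:
    cached = cache.get(prompt)
    if cached is not None:
        return cached  # cache hit: skip the expensive model call
    response = model_call(prompt)
    cache.put(prompt, response)
    return response
```

The second identical request never touches `model_call`, which is exactly the resource saving the list item is pointing at.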
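And for item 6, one common flavor of sharding is routing each conversation to a consistent replica so any per-conversation state or cache stays warm. A minimal sketch, assuming simple hash-based routing (the function name and scheme are illustrative):

```python
import hashlib

def pick_shard(conversation_id: str, num_shards: int) -> int:
    """Hash the conversation id so the same conversation always
    lands on the same replica out of num_shards."""
    h = int(hashlib.md5(conversation_id.encode()).hexdigest(), 16)
    return h % num_shards
```

Real deployments also shard the model itself (tensor/pipeline parallelism across GPUs), which is a different axis from this request-level routing.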

One thing you should ask is: what does it actually mean to "handle a request" with ChatGPT? It might not be what you think it is.

Source: random workshops over the past year.