Comment by reissbaker
1 month ago
Depends on how well the speculator predicts your prompts, assuming you're using speculative decoding — weird prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want to use a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults IME.
No comments yet
Contribute on Hacker News ↗