Comment by jedbrooke

13 hours ago

I’d been thinking about if something like this would be possible for https://chatjimmy.ai/ . The underlying model is only llama 3 8B but I’m curious what coding harnesses would be like at 17k tok/s

11 comments

jedbrooke

tomashubelbauer 11 hours ago

If you're on macOS you can try the built in LLM which I think is similar in size. There's a project called Apfel that wraps it in a CLI. Also Chrome ships with a web API called Prompt API that gives you offline access to Gemini Nano which can do both text and images at the input. Also tiny. I've integrated these into my workflows where a tiny but non zero amount of reasoning is needed in between the otherwise fully deterministic steps.

jedbrooke 2 hours ago

looks like the macOS one is Tahoe only. I’ve been putting of upgrading to tahoe but this might be enough to tempt me
stogot 4 hours ago
What kind of reasoning makes this worthwhile?
- tomashubelbauer 3 hours ago
  
  I have a personal, fully offline and local version of Windows Recall basically, but good, made using macOS built-in OCR and LLM. The reasoning requirements are tiny (just interpret the screen based on the OCR, do rolling de-duplication and summarization), but they are non-zero. The tool is valuable to me and it being dep-free and fully offline and local just gives me a good feeling.
  
  2 replies →

golph 10 hours ago

I actually tried building a harness around their constraints, just to find out if it was possible, but the combination of small context window, no tool calls and just small model, made me understand, that it’s not going to work.

If you find a way to do it, I’d love to hear it!

haellsigh 6 hours ago

I added it in my oh-my-pi configuration before (it's OpenAI compatible), but Llama 3 8B is just absolutely unusable for anything coding related. It is very fast and the latency is very good however.

venusenvy47 3 hours ago

I tried the site and can't find any information about what it is. What is it?

npilk 3 hours ago

They make custom chips with a model's weights and parameters "hard-coded" which allows for much, much faster inference.

rbinv 7 hours ago

Codex offers a -spark model that runs on Cerebras. Not quite 17k tok/s, but _very_ fast nonetheless. Worth a look.