Comment by freakynit
7 hours ago
Open access for the next 5 hours (Ternary-Bonsai-8B-Q2_0.gguf, running on RTX 3090) or until the server crashes or this spot instance gets taken away :) =>
https://uklkyvetsjf7qt-80.proxy.runpod.net
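For anyone poking at it: llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so a plain `curl` works. A minimal client sketch (the live call is left commented out since the spot instance may already be gone; the payload is only validated locally):

```shell
# Sketch of a chat request against the endpoint above. llama-server speaks an
# OpenAI-compatible API, so any OpenAI client pointed at this base URL works too.
BASE_URL="https://uklkyvetsjf7qt-80.proxy.runpod.net"
BODY='{"messages":[{"role":"user","content":"How many Rs are in raspberry?"}],"max_tokens":128}'
# Live call (server may be down by the time you try):
# curl -s "$BASE_URL/v1/chat/completions" -H "Content-Type: application/json" -d "$BODY"
# Sanity-check the payload locally without hitting the server:
echo "$BODY" | python3 -c 'import json,sys; json.load(sys.stdin); print("payload ok")'
```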
./build/bin/llama-server \
-m ../Ternary-Bonsai-8B-Q2_0.gguf \
-ngl 999 \
--flash-attn on \
--host 0.0.0.0 \
--port 80 \
--ctx-size 65500 \
--batch-size 512 \
--ubatch-size 512 \
--parallel 5 \
--cont-batching \
--threads 8 \
--threads-batch 8 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--log-colors on
# llama.cpp is their fork: https://github.com/PrismML-Eng/llama.cpp.git
# The server can serve 5 parallel requests, with each request capped at around `13K` tokens...
# A few benchmarks I did:
# 1. Input: 1001 tokens, ttft: 0.3 s, output: 1618 tokens at ~140 t/s
# 2. Input: 9708 tokens, ttft: 2.4 s, output: 2562 tokens at ~106 t/s
# VRAM usage was consistently at ~7 GiB.
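For anyone wondering where the ~13K per-request cap comes from: llama-server splits `--ctx-size` evenly across the `--parallel` slots (assuming the even slot split of mainline llama.cpp; their fork may differ). A quick sanity check:

```shell
# --ctx-size 65500 shared evenly across --parallel 5 slots:
per_slot=$(( 65500 / 5 ))
echo "context per slot: $per_slot tokens"  # 13100, i.e. the ~13K per-request cap
# Benchmark 2 implies roughly 2562 tokens / 106 t/s ~= 24 s of decode
# on top of the 2.4 s prefill.
```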
> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...
Thanks a lot, I was about to clone their llama.cpp branch and do the same.
Some more interesting tidbits from my go-to tests:
* Fails the car wash test (basic logic seems to be weak in general)
* Fails simple watch face generation in html/css.
* Fails the "how many Rs in raspberry" test (not enough cross-token training data), but will funnily assume you may be talking about Indian Rupees and tell you a lot about raspberry prices in India without being asked. Possible Indian training data imbalance?
* Flat out refuses to talk about Tiananmen square when pushed directly - despite being from a US company. Again, perhaps they are exposed to some censored training data? Anyways, when slowly built up along the conversation by asking about locations and histories, it will eventually tell you about the massacre, so the censorship bias seems weak in general. Also has no problem immediately talking about anything Gaza/Israel/US or other sensitive topics.
* Happily tells you how to synthesize RDX, with a list of ingredients and the chemical process step by step. At least it warns you that it is highly dangerous and legally controlled in the US.
The 1-bit Bonsai and Ternary Bonsai models are all based on the corresponding Qwen3 model: https://raw.githubusercontent.com/PrismML-Eng/Bonsai-demo/re... (page 4)
Thanks, already suspected as much. Also gives context to the other comment here that says it is basically equivalent in accuracy to Qwen3.5-4B. Essentially seems to be a very good quantization of that model, not a new BitNet.
update: Well, the spot instance survived... and since a lot of folks are still using it, I'm keeping it alive for 2 more hours.
Fyi, I believe `--flash-attn on` doesn't do anything; you should use `--flash-attn 1` instead. I'm getting ~150 t/s on an RTX 3080 10GB as well, with the f16 cache type.
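For reference, the relevant lines of the command above would then become (flag spelling varies across llama.cpp revisions, so check `llama-server --help` on your build first):

```shell
--flash-attn 1 \
--cache-type-k f16 \
--cache-type-v f16 \
```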
Thanks.. updated my local docs :)
Played around with it a little. Works great for:

* basic conversational tasks
* basic reasoning
* creating little UI snippets

Not good for one-shot coding tasks, though. Once it starts hallucinating about something, you can't make it rectify that: during a coding task it wrote some made-up library function, and despite pointing it out and giving it docs, it kept producing the same code with the made-up function, all while acknowledging the problem. Btw, thanks for putting it together!