Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

2 days ago (github.com)

I've been building ZSE (Z Server Engine) for the past few weeks — an open-source LLM inference engine focused on two things nobody has fully solved together: memory efficiency and fast cold starts.

The problem I was trying to solve: Running a 32B model normally requires ~64 GB VRAM. Most developers don't have that. And even when quantization helps with memory, cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts — which kills serverless and autoscaling use cases.

What ZSE does differently:

Fits 32B in 19.3 GB VRAM (70% reduction vs FP16) — runs on a single A100-40GB

Fits 7B in 5.2 GB VRAM (63% reduction) — runs on consumer GPUs

Native .zse pre-quantized format with memory-mapped weights: 3.9s cold start for 7B, 21.4s for 32B — vs 45s and 120s with bitsandbytes, ~30s for vLLM

All benchmarks verified on Modal A100-80GB (Feb 2026)
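The reductions line up with roughly 4-bit weight storage plus overhead. A quick sanity check of the quoted numbers (my arithmetic, not project code):

```python
# Back-of-envelope check of the quoted VRAM figures, assuming ~4-bit
# weight quantization; "overhead" covers scales, KV cache, activations.
params_32b = 32e9
fp16_gb = params_32b * 2 / 1e9    # 2 bytes/param  -> 64.0 GB
int4_gb = params_32b * 0.5 / 1e9  # 0.5 bytes/param -> 16.0 GB

print(fp16_gb, int4_gb)  # 64.0 16.0
# Quoted 19.3 GB ≈ 16 GB of 4-bit weights + ~3 GB overhead,
# and 19.3 / 64 gives the claimed ~70% reduction:
print(round(1 - 19.3 / 64, 2))  # 0.7
```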

It ships with:

OpenAI-compatible API server (drop-in replacement)

Interactive CLI (zse serve, zse chat, zse convert, zse hardware)

Web dashboard with real-time GPU monitoring

Continuous batching (3.45× throughput)

GGUF support via llama.cpp

CPU fallback — works without a GPU

Rate limiting, audit logging, API key auth
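Since the server is OpenAI-compatible, a request should look like a standard chat-completions call. A minimal stdlib sketch — the base URL, port, and whether a key is required are my assumptions, so check the repo for the actual defaults:

```python
import json
import urllib.request

# Assumed endpoint: OpenAI-style path on localhost. Port 8000 is a guess;
# replace with whatever `zse serve` actually binds to.
BASE_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    BASE_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",  # only if API key auth is enabled
    },
)
# resp = urllib.request.urlopen(req)  # uncomment with a running server
```

Because the request shape matches OpenAI's, existing SDKs should also work by pointing their base URL at the local server.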

Install:

    pip install zllm-zse
    zse serve Qwen/Qwen2.5-7B-Instruct

For fast cold starts (one-time conversion):


    zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse
    zse serve qwen-7b.zse  # 3.9s every time

The cold start improvement comes from the .zse format storing pre-quantized weights as memory-mapped safetensors — no quantization step at load time, no weight conversion, just mmap + GPU transfer. On NVMe SSDs this gets under 4 seconds for 7B. On spinning HDDs it'll be slower.
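The mmap part of this is easy to demonstrate with the standard library. This is an illustration of the general technique, not the actual .zse loader — the dummy file below stands in for a pre-quantized weights file:

```python
import mmap
import os
import tempfile

# Create a dummy "pre-quantized weights" file (a stand-in for a .zse file).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(b"\x07" * 1_000_000)  # 1 MB of fake int8 weights

# "Cold start": map the file instead of reading it. The OS pages bytes in
# lazily, so mapping is near-instant regardless of file size; only the
# pages actually touched (e.g. while streaming tensors to the GPU) are
# faulted in from disk.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
first_block = mm[:4096]  # a slice pages in ~one page, no full-file copy
mm.close()
```

Because nothing is dequantized or repacked at load time, the cost is dominated by disk page-in and the host-to-device copy — which is consistent with the NVMe-vs-HDD caveat above.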

All code is real — no mock implementations. Built at Zyora Labs. Apache 2.0.

Happy to answer questions about the quantization approach, the .zse format design, or the memory efficiency techniques.

Discussion on reddit: https://www.reddit.com/r/LocalLLaMA/comments/1rewis9/removed...

  • Sorry, this post has been removed by the moderators of r/LocalLLaMA.

    Classic reddit..

    • > Classic reddit..

      That sub used to be the absolute best place to get the latest in LLM developments. The worst thing that happened to the sub was karpathy making it popular with a tweet. Since then it's been overrun by a whole bunch of drama, toxic behaviour and useless bots, and the quality content has cratered.

      There was a mod crisis and new mods came in with really weird stuff (integrations with Discord and such), lots of bots became active with useless posts and "engagement" bait, the Chinese labs are all fighting each other over whose release is better every time there's one, Claude-induced manias of "papers" this and "zenodo" that (everyone is a researcher now, everyone is inventing a subquadratic attention, led by Claude-hallucinated stuff), they have an obsession with "local only" that leads to removing any discussion of SotA (which is entirely counterproductive), and so on.

This seems excellent if not revolutionary, just what I've been looking for, but GPU support didn't work on my M1 and M1 Max. Is there a way to support Apple M series processors? That would be greatly appreciated. I don't have experience with this kind of programming and didn't get very far with ChatGPT.

On M1 Max, it says 14.8 GB free / 32.0 GB total, but "No GPU detected" and "What Can You Run? (ZSE Ultra Mode)" only says "7B GPU + CPU Hybrid", nothing else.

If you don't mind a stupid question, is this essentially dynamic quantization? I'm trying to understand how this is different from using a regular quantized model to squeeze more parameters into less RAM.

This is so freaking awesome. I am working on a project trying to run 10 models on two GPUs, and loading/offloading is the only solution I have in mind.

Will try getting this deployed.

Are the advertised cold start timings for the case where no other model is loaded on the GPUs?