
Comment by se4u

1 month ago

Building VizPy, a prompt optimizer we've been working on for a while now.

The problem it's solving is one most people building with LLMs know well. Your prompt fails on some inputs, you don't really know why, and you end up just tweaking and re-running until something sticks. We kept hitting this ourselves and it felt like there had to be a better way than guessing.

What we figured out after a lot of research is that prompt failures almost always follow a pattern. The model isn't failing randomly; it's consistently failing on a particular type of input or reasoning step. VizPy finds that pattern, distills it into a plain-English rule you can actually read, and then rewrites your prompt around it. You also get the rule itself, so you can review it, tweak it, or just drop it into your existing prompt directly. It's DSPy-compatible, so no pipeline rewrite is needed.
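To make the "drop it into your existing prompt" part concrete, here's a rough sketch of what that can look like in a DSPy program. The rule text, signature, and model name below are made up for illustration; they are not VizPy output or VizPy's actual API, just standard DSPy calls showing where a distilled rule would slot in:

    import dspy

    # A distilled, human-readable rule of the kind described above.
    # Illustrative wording only: not actual VizPy output.
    rule = ("When the question contains multiple dates, convert them to a "
            "common format before comparing them.")

    # Existing instructions plus the rule, passed to a string-form signature;
    # the rest of the DSPy program is left untouched.
    instructions = "Answer the question concisely.\n" + rule
    qa = dspy.Predict(dspy.Signature("question -> answer", instructions))

    # Example call (assumes an LM has already been configured):
    # dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
    # print(qa(question="Which is earlier, 2021-03-04 or March 1, 2021?").answer)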

We've compared it extensively against baselines such as GEPA on benchmarks including BBH, HotPotQA, GPQA Diamond, and GDPR-Bench, and VizPy wins on all of them. We'll have more benchmarks on cybersecurity and chip design coming out soon.

Free to try, 10 runs, no card required: https://vizpy.vizops.ai/