Show HN: I benchmarked our AI tool from 30% to 100% success


Hi HN, I'm a Senior SDET at Plotly. My company just launched Plotly Studio, a new tool that uses AI to build data visualization and analytics apps. My job was to answer the big question: does it actually work with real, messy data? When I first started testing it against our collection of 100+ diverse datasets, our success rate was around 30%.

The problem I faced was that you can't just unit-test an AI that generates code for a desktop app. You have to test the full, end-to-end user experience. So I led the effort to build our own internal benchmark system to validate performance at scale.

Every day, our CI (GitHub Actions) kicks off a job that:

- Generates a full data app from each of our 100+ test datasets
- Launches each app in a real browser using Playwright (a rough sketch of this check is at the end of this post)
- Asserts that the app loads without any Python or JavaScript errors
- Takes screenshots to verify the visual output
- Runs each test 3 times to detect "flakiness" (inconsistent results)

This gave me and the rest of the team a clear, actionable metric. The dev team used the failure reports to improve the backend, and we just hit a 100% success rate on our latest test run.

I wrote an article about the architecture of this benchmarking system. We're now expanding it with user-donated datasets to make it even more challenging. I'd love to hear your feedback.

You can read my full technical write-up here: https://plotly.com/blog/chasing-nines-on-ai-reliability-benc...
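For anyone curious what the per-app browser check looks like, here's a minimal sketch of the idea using Playwright's sync Python API. This isn't our actual harness: the URL, dataset name, and screenshot directory below are placeholders, it only catches browser-side (JavaScript) errors, and the Python-side checks and reporting live elsewhere in the real system.

    # Minimal sketch of the per-app check. APP_URL, SCREENSHOT_DIR and the
    # dataset name are placeholders; the real harness wires these up per dataset.
    from pathlib import Path
    from playwright.sync_api import sync_playwright

    APP_URL = "http://localhost:8050"      # hypothetical address of one generated app
    SCREENSHOT_DIR = Path("screenshots")   # hypothetical output directory

    def check_app(url: str, name: str) -> list[str]:
        """Load the app in a real browser and return any errors it produced."""
        errors: list[str] = []
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Capture anything the app logs to the browser console as an error.
            page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)
            # Capture uncaught exceptions thrown on the page itself.
            page.on("pageerror", lambda exc: errors.append(str(exc)))
            page.goto(url, wait_until="networkidle")
            # Screenshot so a human can verify the visual output later.
            SCREENSHOT_DIR.mkdir(exist_ok=True)
            page.screenshot(path=SCREENSHOT_DIR / f"{name}.png", full_page=True)
            browser.close()
        return errors

    if __name__ == "__main__":
        # Run the same check 3 times; an app that only fails on some runs is flagged as flaky.
        runs = [check_app(APP_URL, f"example-dataset-run{i}") for i in range(3)]
        failures = [errs for errs in runs if errs]
        assert not failures, f"{len(failures)}/3 runs raised errors: {failures}"

In the real pipeline something like this runs once per dataset, and the pass/fail and flakiness results roll up into the failure reports the dev team works from.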