Experiments, findings, and things we learned while building Promptzilla 3000.
We ran a 4-agent data analysis system through adversarial testing across three iterations. One failure mode scored 42/100. Here's what we found, and what almost shipped.