Comment by vunderba
14 hours ago
Thanks for the feedback.
> The bigger models are way more capable than you make them out to be.
No test suite is ever going to be perfect. GenAI Showdown was started with the goal of focusing on a very narrow spectrum of testing (prompt adherence) because, as a creator, that's the aspect of most interest to me.
> Pure t2i is not a good benchmark anymore
Just FYI Image Editing is already a separate benchmark (see the navbar at the top).
> Your testing suite is the perfect example that structured data implies false confidence
Again - the headline is "Specific prompts and challenges with a strong emphasis placed on adherence". If I tried to capture every possible aspect of GenAI models (multimodal, texture maps, periodic motion, tiling, etc) - I'd be at it until the heat death of the universe.
Incidentally - which model (specifically) do you think is ranked unfairly? While Flux.2 [dev] did only score a single point above ZiT, its weighted score is much higher (1442 points vs 911 points).