← Back to context

Comment by dragonwriter

2 years ago

Measuring LLM quality is problematic (and may not even be meaningful in a general sense, the idea that there is a measurable strict ordering of general quality that is applicable to all use cases, or even strongly predictive of utiity for particular uses, may be erroneous.)

If you trust Winogrande scores (one of the few where I could find GPT3.5 and GPT4 [0] ratings that is also on the HuggingFace leaderboard [1]), there are a lot of models between GPT3.5 and GPT4 with some of them being 34B parameter models (Yi-34b and its derivatives), and una_cybertron_7b comes close to GPT3.5.

[0] https://llm-leaderboard.streamlit.app/

[1] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...