Comment by nsagent

2 years ago

If this does indeed beat all the closed source models, then I'm flabbergasted. The amount of time and resources Google, OpenAI, and Anthropic have put into improving the models to only be beaten in a couple weeks by two people (who as far as I know do not have PhDs and years of research experience) would be a pretty crazy feat.

That said, I'm withholding judgment on how likely the claims are. A friend who developed NoCha [1] is running the model on that benchmark, which will really stress-test its ability to reason over full novels. I'll reserve judgment until then.

[1]: https://novelchallenge.github.io/

9 comments

winddude 2 years ago

PhDs aren't relevant. A PhD is more a certificate that you can learn to learn and stay committed to hard, challenging things. It does earn bonus points with VCs, because it seems to be easier to market to other VCs; the same applies to hedge funds.

And with fine-tuning, there's zero math needed; it's a bit of common sense and a lot of data optimization.

phs318u 2 years ago

I wouldn't say that PhDs aren't relevant. Remember, a lot of this subsequent "bumps, steps and leaps" advancement has come _after_ the initial work by the OpenAIs etc. "Standing on the shoulders of giants" is a thing.

sabbaticaldev 2 years ago

And these PhDs used some tools developed by teenage hackers. Standing on the shoulders of giants, indeed.

moralestapia 2 years ago

>A friend who developed NoCha [1] is running the model on that benchmark [...]

Please do update us on the result.

nsagent 2 years ago

Not looking good. Apparently the model was broken when they released it yesterday. The version they uploaded 8 hours ago only has an 8k context length, so we can't test it on the novels.
Here are the updates to the model config on Hugging Face:
https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B/c...

yaj54 2 years ago

Anyone have or know of a list of LLM challenges like this? Targeted use cases with unpublished test data?

polotics 2 years ago

One question about the Novels challenge: as there are two true/false questions, a random pick of answers will give a 25% success rate, right? How do some models manage to score below 25%?

JustAndy 2 years ago

They know which answer is correct, they just don't want to say it.
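
For anyone who wants the arithmetic behind the 25% figure, here is a rough sketch. It assumes each pair holds one true claim and one false claim and that a pair only counts when both are labeled correctly; that scoring rule is an assumption here, not something confirmed in this thread. The function names are made up for illustration. Under those assumptions, a model that leans toward one label falls below the coin-flip baseline.

```python
from itertools import product


def uniform_guess_baseline() -> float:
    """Enumerate all four label combinations for a pair of claims.

    Assumption: the first claim is true and the second is false,
    so exactly one of the four combinations is fully correct.
    """
    outcomes = list(product([True, False], repeat=2))
    correct = (True, False)
    return sum(outcome == correct for outcome in outcomes) / len(outcomes)


def pair_accuracy(p_true: float) -> float:
    """Chance of labeling both claims correctly when the model answers
    'true' with probability p_true, independently for each claim.

    The true claim is right with probability p_true and the false claim
    is right with probability 1 - p_true.
    """
    return p_true * (1.0 - p_true)


if __name__ == "__main__":
    print("uniform guessing:", uniform_guess_baseline())
    for p in (0.5, 0.7, 0.9, 1.0):
        print(f"answers 'true' {p:.0%} of the time:", pair_accuracy(p))
```

The product `p_true * (1 - p_true)` peaks at 0.25 when `p_true` is 0.5, so under these assumptions any consistent bias toward "true" or "false" lands below 25%, and a model that always gives the same label scores zero.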

m3kw9 2 years ago

Fine-tuning needs $$$ and knowledge of how fine-tuning works.