Comment by anticensor (6 months ago):

-pro models appear to be a best-of-10 sampling of the original full-size model.

Szpadel (6 months ago):
How do you sample it behind the scenes? Usually best-of-X means you generate X outputs and choose the best result.

If you could do this automatically, it would be a game changer: you could run the top 5 models in parallel and select the best answer every time.

But it's not practical, because you are the bottleneck: you have to read all 5 solutions and compare them yourself.
anticensor (6 months ago):

> If you could do this automatically, it would be a game changer: you could run the top 5 models in parallel and select the best answer every time.

Remember that they have access to the RLHF reward model, against which they can evaluate all N outputs and return the most "rewarded" answer.
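In code, that kind of reward-model reranking could look something like the sketch below. This is a minimal illustration, not any provider's actual pipeline: generate_fn and reward_fn are hypothetical stand-ins for the policy model and the RLHF reward model.

    import random
    from typing import Callable

    def best_of_n(generate_fn: Callable[[str], str],
                  reward_fn: Callable[[str, str], float],
                  prompt: str, n: int = 10) -> str:
        # Draw n independent samples from the policy model, then return
        # the completion the reward model scores highest. No human has
        # to read and compare the candidates.
        candidates = [generate_fn(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: reward_fn(prompt, c))

    # Toy usage with stand-in functions (a real system would call the
    # policy model and the reward model here):
    answer = best_of_n(
        generate_fn=lambda p: p + "!" * random.randint(1, 5),
        reward_fn=lambda p, c: float(len(c)),  # pretend longer is better
        prompt="hello",
    )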
joshstrange (6 months ago):

I think the idea is they use another (or the same) model to judge all the results and only return the best one to the user.
anticensor (6 months ago):

I think the idea is they just feed each output to the RLHF reward model that was used to train the model and return the most rewarded answer.
spott (6 months ago):

I believe it is a majority-vote kind of thing, rather than a single best result.
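The majority-vote variant (often called self-consistency) is even simpler to sketch; no reward model is needed, since agreement among samples serves as the quality signal. Again, generate_fn is a hypothetical stand-in for the model, and the snippet assumes answers can be normalized so that exact matches are meaningful:

    from collections import Counter
    from typing import Callable

    def majority_vote(generate_fn: Callable[[str], str],
                      prompt: str, n: int = 10) -> str:
        # Sample n answers independently and return the most common one.
        # Works best when answers reduce to a comparable form (e.g. a
        # final number), so counting duplicates is meaningful.
        answers = [generate_fn(prompt).strip() for _ in range(n)]
        winner, _count = Counter(answers).most_common(1)[0]
        return winner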