anticensor 6 days ago

-pro models appear to be a best-of-10 sampling of the original full-size model.
Szpadel 6 days ago

how do you sample it behind the scenes? usually best-of-X means you generate X outputs and choose the best result.

if you could do this automatically, it would be a game changer: you could run the 5 best models in parallel and select the best answer every time

but it's not practical, because you are the bottleneck: you have to read all 5 solutions and compare them yourself
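A minimal sketch of what that automated loop could look like. Here generate() and score() are hypothetical placeholders, not any provider's API: one sampled completion from a model, and whatever automatic judge you trust (a reward model, a verifier, unit tests):

```python
# Best-of-N sampling: draw N independent completions, keep the highest
# scoring one. generate() and score() are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    """Hypothetical: one sampled completion (temperature > 0)."""
    raise NotImplementedError

def score(prompt: str, completion: str) -> float:
    """Hypothetical: any automatic judge; higher is better."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 10) -> str:
    # Samples are independent, so they can be drawn in parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, [prompt] * n))
    # The human bottleneck goes away only if score() is automatic.
    return max(candidates, key=lambda c: score(prompt, c))
```

The rest of the thread is essentially a debate about what score() should be.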
anticensor 6 days ago

> if you could do this automatically, it would be a game changer: you could run the 5 best models in parallel and select the best answer every time

remember they have access to the RLHF reward model, against which they can evaluate all N outputs and have the most "rewarded" answer picked and sent back
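We can't see the provider's internal reward model, but the mechanics are easy to sketch with a public one. A sketch, assuming the OpenAssistant DeBERTa reward model on Hugging Face as a stand-in (the model choice is an illustration, not the provider's actual RM):

```python
# Reranking N candidate answers with a reward model. The public
# OpenAssistant model below is a stand-in: the provider's actual RLHF
# reward model is internal and unknown.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed stand-in
tokenizer = AutoTokenizer.from_pretrained(RM)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM)

def reward(prompt: str, answer: str) -> float:
    # The model scores (prompt, answer) pairs; a higher logit means
    # the answer is more "rewarded".
    inputs = tokenizer(prompt, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

def most_rewarded(prompt: str, answers: list[str]) -> str:
    return max(answers, key=lambda a: reward(prompt, a))
```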
joshstrange 6 days ago

I think the idea is they use another (or the same) model to judge all the results and only return the best one to the user.
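That LLM-as-judge variant might look like the sketch below; the judge model name and prompt format are assumptions, and a production version would need more robust parsing of the judge's reply:

```python
# LLM-as-judge selection: show all candidates to a judge model and ask it
# to pick one. Judge model name and prompt format are assumptions.
from openai import OpenAI

client = OpenAI()

def judge_best(question: str, answers: list[str]) -> str:
    listing = "\n\n".join(f"[{i}] {a}" for i, a in enumerate(answers))
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{
            "role": "user",
            "content": (
                f"Question:\n{question}\n\nCandidate answers:\n{listing}\n\n"
                "Reply with only the index of the best answer."
            ),
        }],
    )
    # Fragile on purpose: a real judge loop would validate this parse.
    return answers[int(reply.choices[0].message.content.strip())]
```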
anticensor 5 days ago

I think the idea is they just feed each answer to the RLHF reward model that was used to train the model, and return the most rewarded one.
spott 6 days ago

I believe it is a majority-vote kind of thing, rather than picking the best single result.
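That majority-vote approach, sometimes called self-consistency, needs no scorer at all: agreement between samples is the signal. A sketch, reusing the hypothetical generate() from the best-of-N example above and adding a hypothetical answer parser:

```python
# Majority voting over N sampled completions: parse each final answer and
# return the most common one. extract_final_answer() is a hypothetical
# parser; generate() is the hypothetical sampler from the earlier sketch.
from collections import Counter

def extract_final_answer(completion: str) -> str:
    """Hypothetical: pull the model's final answer out of a completion."""
    return completion.strip().splitlines()[-1]

def majority_vote(prompt: str, n: int = 10) -> str:
    answers = [extract_final_answer(generate(prompt)) for _ in range(n)]
    # No judge needed: agreement between independent samples is the signal.
    return Counter(answers).most_common(1)[0][0]
```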