Comment by thorum

14 days ago

This seems like a problem that will quickly fall to the new reinforcement learning methods introduced by DeepSeek. Just build a system to synthetically render a few million pages of insanely complex, hard-to-parse documents with different layouts along with a JSON description of what the correct OCR should be, mix in some human annotated datasets, then do RL against a verifier that insists on 100% accuracy.

I still don't get the reinforcement part here. Wouldn't that be normal training against the data set? Like how would you modify the normal MNIST training to be reinforcement learning

  • Not an expert, but yes: what would usually just be called training is, with LLMs, often called RL here. You do end up writing a sort of reward function (the verifier), so I suppose that's why it counts as RL.

  • You are right; the advance in DeepSeek-R1 used RL mainly because of the chain-of-thought sequences the model was generating and being trained on, where there is no fixed per-token label to supervise against.
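The distinction the replies are drawing can be shown concretely. Below is a toy sketch, not anyone's actual training setup: a one-parameter classifier trained two ways on the same data. The supervised update uses the label directly in the gradient (the MNIST-style case); the RL update samples an "answer," scores it with a verifier that only rewards exact matches (as the top comment proposes), and applies a REINFORCE-style policy-gradient step. All names and hyperparameters are illustrative.

```python
# Toy contrast: supervised learning vs. REINFORCE-style RL on a
# 1-D binary task (label = 1 iff x > 0). Pure stdlib, illustrative only.
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic data: the "correct OCR" is just the sign of x.
data = [(x, 1 if x > 0 else 0)
        for x in (random.uniform(-2, 2) for _ in range(200))]

# --- Supervised: the label appears directly in the gradient ---
w = 0.0
for _ in range(5):
    for x, y in data:
        p = sigmoid(w * x)
        w += 0.5 * (y - p) * x  # cross-entropy gradient step toward the label

# --- RL: sample an action, score it with a verifier, REINFORCE update ---
v = 0.0
for _ in range(20):
    for x, y in data:
        p = sigmoid(v * x)
        a = 1 if random.random() < p else 0  # model "commits" to an answer
        reward = 1.0 if a == y else 0.0      # verifier: exact match or nothing
        # d/dv log pi(a): (1 - p) * x if a == 1, else -p * x, i.e. (a - p) * x.
        # Reward scales the step, so only verified answers get reinforced.
        v += 0.5 * reward * (a - p) * x

def accuracy(weight):
    return sum(((sigmoid(weight * x) > 0.5) == (y == 1))
               for x, y in data) / len(data)
```

The practical difference is visible in the update rules: the supervised step needs `y` itself, while the RL step only needs a scalar `reward`, so it works even when you can only *check* an answer (e.g. against a rendered ground-truth JSON) rather than supervise every output token.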