Comment by thorum
14 days ago
This seems like a problem that will quickly fall to the new reinforcement learning methods introduced by DeepSeek. Just build a system to synthetically render a few million pages of insanely complex, hard-to-parse documents with different layouts along with a JSON description of what the correct OCR should be, mix in some human annotated datasets, then do RL against a verifier that insists on 100% accuracy.
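The "verifier that insists on 100% accuracy" above amounts to a binary reward function. A minimal sketch of what that might look like, assuming the synthetic renderer emits a ground-truth JSON description per page (the function name and schema here are hypothetical, just for illustration):

```python
import json

def verifier_reward(predicted_json: str, ground_truth: dict) -> float:
    """Binary reward: 1.0 only if the model's OCR output both parses
    as JSON and matches the ground-truth description exactly."""
    try:
        predicted = json.loads(predicted_json)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no reward
    return 1.0 if predicted == ground_truth else 0.0

# Hypothetical example: a rendered page's known contents vs. model samples
truth = {"title": "Invoice 42", "total": "19.99"}
print(verifier_reward('{"title": "Invoice 42", "total": "19.99"}', truth))
print(verifier_reward('{"title": "Invoice 42", "total": "19.90"}', truth))
```

In practice one might soften this to a partial-credit score (e.g. field-level accuracy), since an all-or-nothing reward can be too sparse to learn from early in training.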
I still don't get the reinforcement part here. Wouldn't that just be normal supervised training against the dataset? For example, how would you modify normal MNIST training to make it reinforcement learning?
Not an expert, but yes: what would usually just be called training is, in the LLM context, often called RL. You do end up writing a sort of reward function, so I suppose it counts as RL.
You are right; the advances in DeepSeek-R1 used RL chiefly because of the chain-of-thought sequences the model was generating and being trained on.