Comment by thorum

14 days ago

This seems like a problem that will quickly fall to the new reinforcement learning methods introduced by DeepSeek. Just build a system to synthetically render a few million pages of insanely complex, hard-to-parse documents with different layouts along with a JSON description of what the correct OCR should be, mix in some human annotated datasets, then do RL against a verifier that insists on 100% accuracy.

I still don't get the reinforcement part here. Wouldn't that be normal training against the data set? Like how would you modify the normal MNIST training to be reinforcement learning

  • Not an expert, but yes: what would usually just be called training is, with LLMs, often called RL here. You do end up writing a sort of reward function (the verifier), so I suppose that's why it counts as RL.

  • You are right; the advance in DeepSeek-R1 used RL mainly because of the chain-of-thought sequences the model was generating and being trained on, where there is no fixed per-token label to supervise against.
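The distinction the replies are drawing can be shown concretely. Below is a toy sketch, not anyone's actual training setup: a one-parameter classifier trained two ways on the same data. The supervised update uses the label directly in the gradient (the MNIST-style case); the RL update samples an "answer," scores it with a verifier that only rewards exact matches (as the top comment proposes), and applies a REINFORCE-style policy-gradient step. All names and hyperparameters are illustrative.

```python
# Toy contrast: supervised learning vs. REINFORCE-style RL on a
# 1-D binary task (label = 1 iff x > 0). Pure stdlib, illustrative only.
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic data: the "correct OCR" is just the sign of x.
data = [(x, 1 if x > 0 else 0)
        for x in (random.uniform(-2, 2) for _ in range(200))]

# --- Supervised: the label appears directly in the gradient ---
w = 0.0
for _ in range(5):
    for x, y in data:
        p = sigmoid(w * x)
        w += 0.5 * (y - p) * x  # cross-entropy gradient step toward the label

# --- RL: sample an action, score it with a verifier, REINFORCE update ---
v = 0.0
for _ in range(20):
    for x, y in data:
        p = sigmoid(v * x)
        a = 1 if random.random() < p else 0  # model "commits" to an answer
        reward = 1.0 if a == y else 0.0      # verifier: exact match or nothing
        # d/dv log pi(a): (1 - p) * x if a == 1, else -p * x, i.e. (a - p) * x.
        # Reward scales the step, so only verified answers get reinforced.
        v += 0.5 * reward * (a - p) * x

def accuracy(weight):
    return sum(((sigmoid(weight * x) > 0.5) == (y == 1))
               for x, y in data) / len(data)
```

The practical difference is visible in the update rules: the supervised step needs `y` itself, while the RL step only needs a scalar `reward`, so it works even when you can only *check* an answer (e.g. against a rendered ground-truth JSON) rather than supervise every output token.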