Comment by srean

3 days ago

In the realm of data science, linear models and SAT solvers, used cleverly, will get you a surprisingly long way.

I thought OCR was one of the obvious examples where we have a classical technology that already works very well, but in the long run I don't see it surviving. _Generic_ AI models can already do OCR reasonably well, and they aren't even trained for that purpose; it's almost incidental. They were never trained to extract, say, a name or surname from a document with a completely unfamiliar structure, but the crazy thing is that it somehow works! I think that once somebody fine-tunes an AI model for just this purpose, there's a good chance it will outperform the classical approach in terms of precision and scalability.

  • Yes, that's correct about OCR. I work as a machine vision engineer in the semiconductor industry, where each wafer usually has both OCR text and machine-readable codes such as barcodes, QR codes, or data matrix codes. The OCR typically uses the SEMI font standard.

    To achieve accurate OCR results, I need to preprocess the image by isolating each character, sorting them from left to right, and using regular expressions (regex) to verify the output. However, I prefer machine-readable codes because they are simpler to use, feature built-in error detection, and are much more reliable. While deep-learning OCR solutions often perform well, they cannot guarantee the 100 percent accuracy required in our applications.

    This approach is similar to how e-wallet payments use cameras to scan QR codes instead of OCR text, as QR codes provide greater reliability and accuracy.
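The verification stage described above (sort isolated characters left to right, then gate the assembled string with a regex) can be sketched roughly as follows. This is a toy illustration, not the commenter's actual pipeline; the function name, the `(x, char)` input format, and the wafer-ID pattern are all assumptions for the example.

```python
import re

def assemble_and_verify(chars, pattern=r"^[A-Z0-9]{6}$"):
    """chars: list of (x_position, recognized_character) pairs from segmentation."""
    # Sort character detections left to right before concatenating,
    # since segmentation may emit them in arbitrary order.
    ordered = sorted(chars, key=lambda c: c[0])
    text = "".join(ch for _, ch in ordered)
    # Regex gate: reject reads that do not match the expected ID format,
    # rather than passing a plausible-looking but invalid string downstream.
    if re.fullmatch(pattern, text):
        return text
    return None
```

The regex acts as the final sanity check: a read that survives it is still not guaranteed correct, which is why machine-readable codes with built-in error detection remain preferable.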

In general I agree. For OCR I agree vehemently. Part of the reason is that the structure of the solution (convolutions) matches the space so well.

The failure cases are those where AI solutions have to stay in a continuous debug-train-update mode. Then you have to think about the resources, both people and compute, needed to maintain such a solution.

Because of the way the world works, with its endemic nonstationarity, the debug-retrain-update cycle is a common state of affairs even in traditional stats and ML.

I see. Let me take another example, and I hope I understood you: imagine you have an AI model connected to all of your company's in-house data sources, such as the wiki, chat, Jira, emails, merge requests, Excel sheets, etc. Basically everything that can be deemed useful to query or to build business intelligence on top of. These data sources generate more and more data every day, and given their nature the data is more or less unstructured.

Yet we have such systems in place where we don't have to retrain the model on the ever-growing data. This is just one example, but it suggests that models, at least for some purposes, don't have to be retrained continuously to keep working well.

I also use a technique of explaining something to the AI model that it has not seen before (based on a wrong answer it gave me previously), and it manages to evolve the steps, whatever they are, so that it eventually gives me the correct answer. This also suggests that the capacity of these models is larger than what they were trained on.


I've seen a lot of uses for SAT solvers, but what do you use them for in data science? I can't find many references to people using them in that context.

Root-causing from symptoms is one case where SAT solvers, or their ML analogue, graphical models, are quite useful.
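To make the root-causing idea concrete, here is a toy sketch of diagnosis as satisfiability: each observed symptom contributes a clause saying "at least one of these candidate faults is present," and we look for the smallest fault set that explains every symptom. The brute-force search below is only viable for a handful of variables; a real deployment would encode the clauses for an actual SAT solver. The function and fault names are made up for the example.

```python
from itertools import product

def diagnose(faults, clauses):
    """faults: list of fault names.
    clauses: list of sets, each the candidate faults for one symptom."""
    best = None
    # Enumerate every on/off assignment over the fault variables.
    for bits in product([False, True], repeat=len(faults)):
        active = {f for f, on in zip(faults, bits) if on}
        # An assignment is satisfying if every symptom clause contains
        # at least one active fault (non-empty intersection).
        if all(active & clause for clause in clauses):
            if best is None or len(active) < len(best):
                best = active
    return best  # minimal explaining fault set, or None if unsatisfiable
```

For example, if one symptom implicates {disk, net} and another implicates {net, dns}, the single fault "net" explains both, and the search returns it as the minimal diagnosis.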