Comment by LegionMammal978 · 11 hours ago

One thing that I spent a particularly long time trying to get working was learning near-minimum-size DFAs for exact languages from positive and negative samples. DFAMiner [0] has a relatively concise formulation of this in terms of variables and clauses, though I have no way of knowing whether some other formulation would suit SAT solvers better (it uses CaDiCaL by default).
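
For a sense of what these encodings look like, here's a minimal sketch in the spirit of the classic Heule–Verwer-style encoding of "is there a consistent DFA with n states". To be clear, this is not DFAMiner's actual formulation: the variable scheme, helper names, and the use of the python-sat (pysat) package with MiniSat are all my own choices.

    # Sketch of a Heule-Verwer-style "consistent DFA with n states" encoding.
    # NOT DFAMiner's actual formulation; the variable scheme is my own guess.
    from itertools import product
    from pysat.formula import IDPool
    from pysat.solvers import Minisat22

    def encode(positive, negative, n, alphabet):
        pool = IDPool()
        prefixes = {w[:i] for w in positive + negative for i in range(len(w) + 1)}
        x = lambda w, q: pool.id(("x", w, q))        # run is in state q after prefix w
        d = lambda q, a, r: pool.id(("d", q, a, r))  # transition q --a--> r exists
        f = lambda q: pool.id(("f", q))              # state q is accepting
        clauses = [[x("", 0)]]                       # the empty prefix maps to state 0
        for w in prefixes:
            clauses.append([x(w, q) for q in range(n)])  # at least one state per prefix
            clauses += [[-x(w, q), -x(w, r)]             # at most one state per prefix
                        for q in range(n) for r in range(q + 1, n)]
        for q, a in product(range(n), alphabet):         # determinism per (state, symbol)
            clauses += [[-d(q, a, r), -d(q, a, s)]
                        for r in range(n) for s in range(r + 1, n)]
        for w, a in product(prefixes, alphabet):         # runs must follow transitions
            if w + a in prefixes:
                clauses += [[-x(w, q), -x(w + a, r), d(q, a, r)]
                            for q in range(n) for r in range(n)]
        for w, q in product(positive, range(n)):         # accept all positive samples
            clauses.append([-x(w, q), f(q)])
        for w, q in product(negative, range(n)):         # reject all negative samples
            clauses.append([-x(w, q), -f(q)])
        return clauses

    def min_dfa_size(positive, negative, alphabet, max_n=20):
        # Increase n until the formula becomes satisfiable.
        for n in range(1, max_n + 1):
            with Minisat22(bootstrap_with=encode(positive, negative, n, alphabet)) as s:
                if s.solve():
                    return n
        return None

    print(min_dfa_size(["ab", "abab"], ["", "a", "b", "ba"], "ab"))  # -> 3

The at-most-one groups (state-per-prefix and determinism) are quadratic, so even small instances pile up clauses quickly relative to the variable count.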

Solving usually starts taking a few seconds around the 150-variable mark, and hits the absolute limit of practicality somewhere between 350 and 800 variables; the number of clauses is only an order of magnitude higher. Perhaps something about the many interdependencies in a DFA graph puts this problem near the worst case.

The annoying thing is, there do seem to be heuristics people have written for this stuff (e.g., in FlexFringe [1]), but they're all geared toward probabilistic automata for anomaly detection and similar fuzzy ML stuff, and I could never figure out how to get them to work for ordinary automata.

In any case, I eventually figured out that I could get a rough lower bound on the minimum solution size by building a graph with an edge between every pair of indistinguishable strings, generating a bunch of random maximal independent sets, and taking the largest. Any independent set in that graph is pairwise distinguishable, so each of its strings needs its own DFA state, and its size bounds the minimum from below. That gave me an easy way to filter out the totally hopeless instances, which turned out to be most of them.
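
Here's a rough sketch of that trick as I'd reconstruct it; the helper names are made up, and checking distinguishability against only the suffixes of the given samples is my own simplifying assumption:

    # Hypothetical sketch of the independent-set lower bound. Two prefixes are
    # distinguishable if some suffix sends them to oppositely labeled samples;
    # an independent set in the *indistinguishability* graph is pairwise
    # distinguishable, so its size lower-bounds the DFA state count.
    import random

    def distinguishable(u, v, lab, suffixes):
        for s in suffixes:
            lu, lv = lab.get(u + s), lab.get(v + s)
            if lu is not None and lv is not None and lu != lv:
                return True
        return False

    def lower_bound(positive, negative, tries=200, seed=0):
        rng = random.Random(seed)
        samples = positive + negative
        lab = {w: True for w in positive} | {w: False for w in negative}
        prefixes = sorted({w[:i] for w in samples for i in range(len(w) + 1)})
        suffixes = {w[i:] for w in samples for i in range(len(w) + 1)}
        # edge <=> the pair is NOT distinguishable by any suffix we know about
        edge = {(u, v) for i, u in enumerate(prefixes) for v in prefixes[i + 1:]
                if not distinguishable(u, v, lab, suffixes)}
        best = 0
        for _ in range(tries):
            order = prefixes[:]
            rng.shuffle(order)
            indep = []
            for u in order:  # greedy maximal independent set in random order
                if all((min(u, v), max(u, v)) not in edge for v in indep):
                    indep.append(u)
            best = max(best, len(indep))
        return best

    print(lower_bound(["ab", "abab"], ["", "a", "b", "ba"]))  # e.g. 3

Since the bound only ever gets better with more tries, it's cheap to tune the number of random orderings against however long the SAT attempts would have taken.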

[0] https://github.com/liyong31/DFAMiner

[1] https://github.com/tudelft-cda-lab/FlexFringe