← Back to context

Comment by dalke

7 months ago

Here is a problem I've been noodling with. If you are a decent programmer, how does your LLM help you solve this problem?

Given a cheminformatics fingerprint definition based on SMARTS substructure patterns, come up with a screening filter, likely using a decision tree, which uses intermediate feature tests to prune search space faster than simply testing each pattern one-by-one.

For example, the Klekota-Roth patterns defined in their supplemental data (and also available from CDK at https://github.com/cdk/cdk/blob/main/descriptor/fingerprint/...) contain patterns like:

    "CC(=NNC=O)C",
    "CC(=NNC=O)C(=O)O",
    "CC(=NNC=O)C=C",
    "CC(=NNC=O)C=Cc1ccccc1",

Clearly if 'CC(=NNC=O)C' does not exist in the molecule to fingerprint then there is no reason to test for the subsequent three patterns.

Similarly, there are patterns like:

    "FC(F)(C=O)C1(F)OC(F)(F)C(F)(F)C1(F)F",
    "FC(F)(F)C(F)(F)C(F)(F)OC(F)(C=O)C(F)(F)F",
    "FC(F)(F)C(F)(F)C(F)(F)S",

which could be improved by an element count test - count the number of fluorines, and only do the test if there are enough atoms in the molecule to fingerprint.

So one stage might be to construct a list of element counts;

   ele_counts = [0]*200
   seen = set()
   for atom in mol.GetAtoms():
      ele_counts[eleno:=atom.GetAtomicNum()] += 1
      seen.add(eleno)

then have a lookup table for each element, based on the patterns which have at least that count of the given element type;

   ele_patterns = [
     # max known count, list of set of matching patterns
     (0, [set()]), # element 0
     (0, [set()]), # hydrogen
     ..
     (20, [{all patterns which contain no carbon},
           {all patterns which require at most 1 carbon}, ...
           {all patterns which require at most 19 carbons}],
     (10, [{all patterns which contain no fluorine}, ..
           {all patterns which contain at most 9 fluorines}], 
      ...]

so one reduction can be

   def get_possible_patterns(seen, ele_counts):
     for eleno in seen:
        max_count, match_list = ele_patterns[eleno]
        count = min(ele_counts[eleno], max_count)
        yield match_list[count]
   patterns = set.intersect(*get_possible_patterns(seen, ele_counts))

and only test that subset of patterns.

However, this is not sophisticated enough to identify which other tests, like the "CC(=NNC=O)C" example I gave before, or "S(=O)(=O)", which might be good tests at a higher level than the element.

And clearly if there isn't a sulphur, aren't two oxygens, and aren't two double bonds then there's no need to test "S(=O)(=O)", suggesting a tree structure would be useful.