Comment by dalke
7 months ago
Here is a problem I've been noodling with. If you are a decent programmer, how does your LLM help you solve this problem?
Given a cheminformatics fingerprint definition based on SMARTS substructure patterns, come up with a screening filter, likely using a decision tree, which uses intermediate feature tests to prune search space faster than simply testing each pattern one-by-one.
For example, the Klekota-Roth patterns defined in their supplemental data (and also available from CDK at https://github.com/cdk/cdk/blob/main/descriptor/fingerprint/...) contain patterns like:
"CC(=NNC=O)C",
"CC(=NNC=O)C(=O)O",
"CC(=NNC=O)C=C",
"CC(=NNC=O)C=Cc1ccccc1",
Clearly if 'CC(=NNC=O)C' does not exist in the molecule to fingerprint then there is no reason to test for the subsequent three patterns.
Similarly, there are patterns like:
"FC(F)(C=O)C1(F)OC(F)(F)C(F)(F)C1(F)F",
"FC(F)(F)C(F)(F)C(F)(F)OC(F)(C=O)C(F)(F)F",
"FC(F)(F)C(F)(F)C(F)(F)S",
which could be improved by an element count test - count the number of fluorines, and only do the test if there are enough atoms in the molecule to fingerprint.
So one stage might be to construct a list of element counts;
ele_counts = [0]*200
seen = set()
for atom in mol.GetAtoms():
ele_counts[eleno:=atom.GetAtomicNum()] += 1
seen.add(eleno)
then have a lookup table for each element, based on the patterns which have at least that count of the given element type;
ele_patterns = [
# max known count, list of set of matching patterns
(0, [set()]), # element 0
(0, [set()]), # hydrogen
..
(20, [{all patterns which contain no carbon},
{all patterns which require at most 1 carbon}, ...
{all patterns which require at most 19 carbons}],
(10, [{all patterns which contain no fluorine}, ..
{all patterns which contain at most 9 fluorines}],
...]
so one reduction can be
def get_possible_patterns(seen, ele_counts):
for eleno in seen:
max_count, match_list = ele_patterns[eleno]
count = min(ele_counts[eleno], max_count)
yield match_list[count]
patterns = set.intersect(*get_possible_patterns(seen, ele_counts))
and only test that subset of patterns.
However, this is not sophisticated enough to identify which other tests, like the "CC(=NNC=O)C" example I gave before, or "S(=O)(=O)", which might be good tests at a higher level than the element.
And clearly if there isn't a sulphur, aren't two oxygens, and aren't two double bonds then there's no need to test "S(=O)(=O)", suggesting a tree structure would be useful.
No comments yet
Contribute on Hacker News ↗