Comment by orph

1 year ago

Everyone is now a decent programmer who can solve nearly any problem with help from an LLM.

3 comments

orph

Everyone? Most people are incapable of expressing a problem in reasonably clear terms. They often don't even know the right questions to ask.

LLMs are pretty good at giving you what you ask for. Not so good at telling you that you're asking for the wrong thing.

drewcoo 1 year ago

> LLMs are pretty good at giving you what you ask for. Not so good at telling you that you're asking for the wrong thing.
So they're comparable to rubber ducks. I would like to see data from a comparative study with rubber ducks, LLMs, and a control group.

dalke 1 year ago

Here is a problem I've been noodling with. If you are a decent programmer, how does your LLM help you solve this problem?

Given a cheminformatics fingerprint definition based on SMARTS substructure patterns, come up with a screening filter, likely using a decision tree, which uses intermediate feature tests to prune the search space faster than simply testing each pattern one-by-one.

For example, the Klekota-Roth patterns defined in their supplemental data (and also available from CDK at https://github.com/cdk/cdk/blob/main/descriptor/fingerprint/...) contain patterns like:

    "CC(=NNC=O)C",
    "CC(=NNC=O)C(=O)O",
    "CC(=NNC=O)C=C",
    "CC(=NNC=O)C=Cc1ccccc1",

Clearly if 'CC(=NNC=O)C' does not exist in the molecule to fingerprint then there is no reason to test for the subsequent three patterns.
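A minimal sketch of that pruning, assuming string prefixes are an acceptable stand-in for "is a substructure of". That assumption holds for the four patterns above but is not safe in general ("C" is a string prefix of "Cl", for example). The names group_by_prefix, screen, and has_match are hypothetical; has_match is whatever substructure test your toolkit provides.

```python
from typing import Callable, Dict, List, Set

PATTERNS = [
    "CC(=NNC=O)C",
    "CC(=NNC=O)C(=O)O",
    "CC(=NNC=O)C=C",
    "CC(=NNC=O)C=Cc1ccccc1",
]

def group_by_prefix(patterns: List[str]) -> Dict[str, List[str]]:
    """Map each pattern to the longer patterns which start with it."""
    children: Dict[str, List[str]] = {}
    for parent in patterns:
        longer = [p for p in patterns if p != parent and p.startswith(parent)]
        if longer:
            children[parent] = longer
    return children

def screen(patterns: List[str], has_match: Callable[[str], bool]) -> Set[str]:
    """Return the matching patterns, skipping any whose prefix failed."""
    children = group_by_prefix(patterns)
    skipped: Set[str] = set()
    hits: Set[str] = set()
    # Shortest first, so a prefix is always tested before its extensions.
    for pattern in sorted(patterns, key=len):
        if pattern in skipped:
            continue
        if has_match(pattern):
            hits.add(pattern)
        else:
            skipped.update(children.get(pattern, []))
    return hits
```

If has_match fails on "CC(=NNC=O)C", the other three patterns are never tested at all.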

Similarly, there are patterns like:

    "FC(F)(C=O)C1(F)OC(F)(F)C(F)(F)C1(F)F",
    "FC(F)(F)C(F)(F)C(F)(F)OC(F)(C=O)C(F)(F)F",
    "FC(F)(F)C(F)(F)C(F)(F)S",

which could be improved by an element count test: count the number of fluorines, and only test the pattern if the molecule to fingerprint contains enough fluorine atoms.
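To get the required counts out of the patterns themselves, something like the following might do for simple cases. It is only a sketch: required_elements is a hypothetical helper, and it handles bare atom symbols like the ones in the examples above, not bracket atoms, wildcards, or SMARTS logic operators.

```python
import re
from collections import Counter

# Bare atom symbols only; two-letter symbols must be tried first.
_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def required_elements(pattern: str) -> Counter:
    """Count the atoms of each element which a simple pattern requires."""
    if "[" in pattern:
        raise ValueError("bracket atoms are not handled by this sketch")
    counts: Counter = Counter()
    for symbol in _ATOM.findall(pattern):
        # Aromatic atoms are lowercase; fold them into the element symbol.
        counts[symbol.capitalize()] += 1
    return counts

needs = required_elements("FC(F)(F)C(F)(F)C(F)(F)S")
```

Here needs["F"] is 7, so a molecule with fewer than seven fluorines cannot match that pattern.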

So one stage might be to construct a list of element counts;

   ele_counts = [0]*200
   seen = set()
   for atom in mol.GetAtoms():
      eleno = atom.GetAtomicNum()
      ele_counts[eleno] += 1
      seen.add(eleno)

then have a lookup table for each element, giving for each count the patterns which require at most that many atoms of the element;

   ele_patterns = [
     # max known count, list of sets of matching patterns, indexed by count
     (0, [{all patterns}]), # element 0: max count is 0, so nothing is excluded
     (0, [{all patterns}]), # hydrogen
     ..
     (20, [{all patterns which contain no carbon},
           {all patterns which require at most 1 carbon}, ...
           {all patterns which require at most 20 carbons}]),
     (10, [{all patterns which contain no fluorine}, ..
           {all patterns which require at most 10 fluorines}]),
      ...]

so one reduction can be

   def get_possible_patterns(ele_counts):
     # Check every element, not only those seen in the molecule:
     # a count of 0 must exclude the patterns which require that element.
     for eleno, (max_count, match_list) in enumerate(ele_patterns):
        count = min(ele_counts[eleno], max_count)
        yield match_list[count]
   patterns = set.intersection(*get_possible_patterns(ele_counts))

and only test that subset of patterns.
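For concreteness, here is a small self-contained version of that element-count screen. Treat it as a sketch under assumptions: build_table and possible_patterns are hypothetical names, the table is a dict keyed by atomic number rather than a 200-entry list, and the per-pattern requirements are written out by hand.

```python
from collections import Counter
from typing import Dict, List, Mapping, Set, Tuple

Table = Dict[int, Tuple[int, List[Set[str]]]]

# pattern -> {atomic number: atoms required}, counted by hand
REQUIREMENTS = {
    "FC(F)(F)C(F)(F)C(F)(F)S": {9: 7, 6: 3, 16: 1},
    "CC(=NNC=O)C": {6: 4, 7: 2, 8: 1},
    "S(=O)(=O)": {16: 1, 8: 2},
}

def build_table(requirements: Mapping[str, Mapping[int, int]]) -> Table:
    """For each element, list by count the patterns needing at most that many."""
    table: Table = {}
    elements = {e for req in requirements.values() for e in req}
    for eleno in elements:
        max_count = max(req.get(eleno, 0) for req in requirements.values())
        table[eleno] = (max_count, [
            {p for p, req in requirements.items() if req.get(eleno, 0) <= c}
            for c in range(max_count + 1)
        ])
    return table

def possible_patterns(table: Table, all_patterns, mol_counts) -> Set[str]:
    """Intersect the per-element candidate sets for one molecule."""
    result = set(all_patterns)
    # Every element in the table is checked, so an element which is absent
    # from the molecule (count 0) still removes the patterns requiring it.
    for eleno, (max_count, match_list) in table.items():
        result &= match_list[min(mol_counts.get(eleno, 0), max_count)]
    return result

TABLE = build_table(REQUIREMENTS)
# A molecule with 5 carbons, 2 nitrogens, and 1 oxygen
mol_counts = Counter({6: 5, 7: 2, 8: 1})
survivors = possible_patterns(TABLE, REQUIREMENTS, mol_counts)
```

With no sulphur or fluorine in the molecule, only "CC(=NNC=O)C" is left to test with a real substructure search.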

However, this is not sophisticated enough to identify other tests, like the "CC(=NNC=O)C" example I gave before, or "S(=O)(=O)", which might be good tests at a higher level than the element counts.

And clearly if there isn't a sulphur, or there aren't two oxygens, or there aren't two double bonds, then there's no need to test "S(=O)(=O)", suggesting a tree structure would be useful.
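The kind of tree I have in mind might look like the sketch below. It is hand-built and the names (Node, candidates, the feature keys) are hypothetical. The open problem is how to choose the tests and their order automatically, which this does not attempt.

```python
from dataclasses import dataclass, field
from typing import List, Mapping

@dataclass
class Node:
    feature: str   # key in the molecule's precomputed feature dict
    minimum: int   # the test passes when features[feature] >= minimum
    patterns: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

def candidates(nodes: List[Node], features: Mapping[str, int]) -> List[str]:
    """Collect patterns whose every test on the path from the root passes."""
    found: List[str] = []
    for node in nodes:
        if features.get(node.feature, 0) >= node.minimum:
            found.extend(node.patterns)
            found.extend(candidates(node.children, features))
    return found

TREE = [
    Node("S", 1, children=[
        Node("O", 2, children=[
            Node("double_bonds", 2, patterns=["S(=O)(=O)"]),
        ]),
        Node("F", 7, patterns=["FC(F)(F)C(F)(F)C(F)(F)S"]),
    ]),
]

to_test = candidates(TREE, {"S": 1, "O": 2, "double_bonds": 2})
```

A molecule without sulphur fails the root test, so nothing under that node is considered.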