Parsing Chemistry

12 days ago (re.factorcode.org)

18 comments

kencausey

Does this do structural formulae too?

Was thinking of InChI[0] but on Googling SMILES and SELFIES I found this[1] talk, this[2] paper and my goodness I've been down a few rabbit holes since...

[0] https://en.wikipedia.org/wiki/International_Chemical_Identif...
[1] https://www.inchi-trust.org/wp/wp-content/uploads/2019/12/18...
[2] https://pubs.rsc.org/en/content/articlehtml/2022/dd/d1dd0001...

jugoetz 4 hours ago

No, in Python you can use rdkit (https://github.com/rdkit/rdkit) for that

mwt 4 hours ago

This code is gibberish to me, but it appears the target is just parsing how many atoms are in a molecule string of some representation. That's cool, but to do just about anything useful in chemistry we need the bond graph (and often more: bond orders, stereochemistry, plus much more for biopolymers).
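For readers who find the Factor version hard to follow, here is a minimal Python sketch of the same idea: counting atoms in a flat formula string. It is not a port of the linked code; the names `TOKEN` and `count_atoms` are made up for illustration, and it deliberately ignores parentheses, charges, and hydrates, and does not check that symbols are real elements.

```python
import re
from collections import Counter

# One element symbol (uppercase letter, optional lowercase letter)
# followed by an optional count. Hypothetical helper, not from the article.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C6H12O6'."""
    counts = Counter()
    pos = 0
    for match in TOKEN.finditer(formula):
        # Reject anything the token pattern had to skip over.
        if match.start() != pos:
            raise ValueError(f"unexpected text at position {pos} in {formula!r}")
        symbol, digits = match.groups()
        counts[symbol] += int(digits) if digits else 1
        pos = match.end()
    if pos != len(formula):
        raise ValueError(f"unexpected text at position {pos} in {formula!r}")
    return counts


if __name__ == "__main__":
    print(dict(count_atoms("C6H12O6")))
    print(dict(count_atoms("CH3COOH")))
```

The result is only a multiset of element counts, which is exactly the limitation described above: nothing about bonds or connectivity survives.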

the__alchemist 3 hours ago
That was my initial reaction too, but I suspect this has utility in applications other than what you and I are looking for. From context, I gather this may be for thermodynamic arithmetic, or reaction product arithmetic.
- mwt 3 hours ago
  
  I'd be really interested to know of anybody who is making money with those topics (and who doesn't already have their own domain-specific practice for the problem)
  

brilee 4 hours ago

Does this handle, e.g., water of hydration (CaSO4 . 2H2O)? States of matter (H2O(g))? Does it preserve subunit information, as in (C6H5)CH2COOH? Writing a parser for basic formulae is such a tiny, tiny part of the actual problem. Deciding the scope of what you want to handle, and how, is the real problem.
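To make the scoping question concrete, here is a rough Python sketch that accepts the three examples above. Everything in it is an assumption made for illustration: the names `parse_formula` and `_parse_part`, the choice of `.` or `·` as the hydrate separator, and the set of state suffixes `(s)`, `(l)`, `(g)`, `(aq)`. Note that it flattens `(C6H5)` into plain counts, so subunit information is not preserved.

```python
import re
from collections import Counter

# Assumed state-of-matter suffixes; a real tool would have to pick its own set.
STATE = re.compile(r"\((s|l|g|aq)\)$")
# Element with optional count, an opening paren, or a closing paren with optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")


def _parse_part(part: str) -> Counter:
    """Count atoms in one dot-free part, e.g. '(C6H5)CH2COOH' or '2H2O'."""
    lead = re.match(r"\d+", part)
    multiplier = int(lead.group()) if lead else 1
    body = part[lead.end():] if lead else part
    if not body:
        raise ValueError(f"empty formula part in {part!r}")
    stack = [Counter()]  # one counter per open parenthesis level
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"cannot parse {body!r} at position {pos}")
        if m.group(1):                      # element symbol
            stack[-1][m.group(1)] += int(m.group(2) or 1)
        elif m.group(3):                    # '(' opens a group
            stack.append(Counter())
        else:                               # ')' closes a group
            if len(stack) == 1:
                raise ValueError(f"unbalanced ')' in {body!r}")
            inner, n = stack.pop(), int(m.group(5) or 1)
            for element, count in inner.items():
                stack[-1][element] += count * n
        pos = m.end()
    if len(stack) != 1:
        raise ValueError(f"unbalanced '(' in {body!r}")
    return Counter({el: c * multiplier for el, c in stack[0].items()})


def parse_formula(text: str):
    """Return (atom counts, state or None) for a formula string."""
    text = text.replace(" ", "")
    state = None
    m = STATE.search(text)
    if m:
        state, text = m.group(1), text[:m.start()]
    total = Counter()
    for part in re.split(r"[.·]", text):    # hydrate separator
        total.update(_parse_part(part))
    return total, state


if __name__ == "__main__":
    for example in ["CaSO4 . 2H2O", "H2O(g)", "(C6H5)CH2COOH"]:
        print(example, parse_formula(example))
```

Even this small sketch forces decisions (which separators, which state labels, whether to keep groups), which is the point being made about scope.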

the__alchemist 6 hours ago

Note: There are two standardized formats for this called SMILES and SELFIES. SMILES is much better supported, but SELFIES is more robust. I'm integrating them into some bio and chem software I'm working on.

You can do things like use PubChem's API to look up molecules similar to a given SMILES string.

I believe most molecule editors can load and save SMILES.

dachrillz 6 hours ago
What about InChI? Isn’t that a common way of describing molecules as well?
- fred_tandemai 2 hours ago
  
  InChI isn't really meant to be used as a format to store 2D molecules, say for rendering, but rather serves as a unique descriptive chemical identifier. InChI has many flavors, but the Standard InChI yields one unique identifier for multiple forms (tautomers) of the same molecule.
- the__alchemist 5 hours ago
  
  Good point!
jugoetz 4 hours ago
SMILES and SELFIES are molecular graph representations and aren't meant to solve the "parse this sum formula" problem.
SELFIES are for genAI. If you ask a VAE to generate SMILES, it will spit out some strings that are invalid. That can't happen with SELFIES; that is the one application where they are robust.
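A tiny illustration of what "invalid" can mean for a generated SMILES string: unbalanced branch parentheses or a ring-closure label that is opened but never closed. The function name `looks_wellformed` is made up, and this is only a cheap syntactic check, not a SMILES validator: it ignores atom symbols, valence, and aromaticity. In practice you would hand the string to a toolkit such as RDKit, where `Chem.MolFromSmiles` returns `None` for input it cannot parse.

```python
def looks_wellformed(smiles: str) -> bool:
    """Check only branch balance and ring-closure pairing in a SMILES string."""
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside are isotopes, H counts, or charges.
            end = smiles.find("]", i)
            if end == -1:
                return False
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or not (label.isascii() and label.isdigit()):
                return False
            open_rings ^= {int(label)}   # toggle: open on first use, close on second
            i += 3
            continue
        elif ch in "0123456789":
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not open_rings


if __name__ == "__main__":
    for s in ["c1ccccc1", "c1ccccc", "CC(=O)O", "CC(C)(C"]:
        print(s, looks_wellformed(s))
```

A character-level generator has to learn these long-range pairings on its own, which is where the invalid outputs come from.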
- dekhn 3 hours ago
  
  It's still being argued if you really need SELFIES, or if SMILES autoencoders can be trained to only generate valid molecules, or if generating invalid molecules is useful (I'm in camp SELFIES, but I also want better ways to represent and learn on graphical chemical structures, rather than serialized strings).

whitten 6 hours ago

Does the SMILES (Simplified Molecular Input Line Entry System) code have an EBNF definition? https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Lin... claims there is a context-free grammar.

fred_tandemai 2 hours ago

[dead]

toast_x 5 hours ago

this is insanely cool

Jaxan 33 minutes ago

… It is just a parser? Sure the parser is written very succinctly and that’s neat. But parser generators for other languages can do it similarly.
