Does this do structural formulae too?
Was thinking of InChI[0] but on Googling SMILES and SELFIES I found this[1] talk, this[2] paper and my goodness I've been down a few rabbit holes since...
[0] https://en.wikipedia.org/wiki/International_Chemical_Identif...
[1] https://www.inchi-trust.org/wp/wp-content/uploads/2019/12/18...
[2] https://pubs.rsc.org/en/content/articlehtml/2022/dd/d1dd0001...
No. In Python you can use RDKit (https://github.com/rdkit/rdkit) for that.
This code is gibberish to me, but it appears the target is just parsing how many atoms are in a molecule string of some representation. That's cool, but to do just about anything useful in chemistry we need the bond graph (and often more: bond orders, stereochemistry, plus much more for biopolymers).
That was my initial reaction too, but I suspect this has utility in applications other than what you and I are looking for. From context, I gather this may be for thermodynamic arithmetic, or reaction product arithmetic.
I'd be really interested to know of anybody who is making money with those topics (and who doesn't already have their own domain-specific practice for the problem).
Does this handle, e.g., water of hydration, as in CaSO4 . 2H2O? States of matter, as in H2O(g)? Does it preserve subunit information, as in (C6H5)CH2COOH? Writing a parser for basic formulae is such a tiny, tiny part of the actual problem. Deciding the scope of what you want to handle, and how, is the real problem.
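To make the scope question concrete, here is a rough Python sketch (mine, not the code from the linked project) that handles the three cases above: a hydrate separator, a trailing state label, and parenthesised groups. The names `parse_formula` and `_count_part` are made up for illustration. It flattens groups into plain atom counts, so it does not preserve subunit information, which is exactly the kind of scope decision being discussed.

```python
import re
from collections import Counter
from typing import Optional, Tuple

STATE_RE = re.compile(r"\((g|l|s|aq)\)$")      # trailing state label, e.g. H2O(g)
TOKEN_RE = re.compile(r"([A-Z][a-z]?)|([()])")  # element symbol or a parenthesis
COUNT_RE = re.compile(r"\d+")


def _count_part(part: str) -> Counter:
    """Count atoms in one formula part such as '(C6H5)CH2COOH'."""
    stack = [Counter()]
    pos = 0
    while pos < len(part):
        m = TOKEN_RE.match(part, pos)
        if m is None:
            raise ValueError(f"unexpected text at {part[pos:]!r}")
        pos = m.end()
        element, paren = m.groups()
        if paren == "(":
            stack.append(Counter())
            continue
        # An element or a closing parenthesis may be followed by a count.
        cm = COUNT_RE.match(part, pos)
        count = int(cm.group()) if cm else 1
        if cm:
            pos = cm.end()
        if element:
            stack[-1][element] += count
        else:  # closing parenthesis: fold the group into its parent
            if len(stack) == 1:
                raise ValueError("')' without matching '('")
            group = stack.pop()
            for el, n in group.items():
                stack[-1][el] += n * count
    if len(stack) != 1:
        raise ValueError("unclosed '('")
    return stack[0]


def parse_formula(text: str) -> Tuple[Counter, Optional[str]]:
    """Return (atom counts, state label or None) for a sum formula."""
    text = text.strip()
    state = None
    m = STATE_RE.search(text)
    if m:
        state, text = m.group(1), text[: m.start()]
    total = Counter()
    # '.' or the middle dot separates hydrate parts, e.g. CaSO4 . 2H2O
    for part in re.split(r"[.·]", text):
        part = part.strip()
        cm = COUNT_RE.match(part)
        multiplier = int(cm.group()) if cm else 1
        if cm:
            part = part[cm.end():]
        for el, n in _count_part(part).items():
            total[el] += n * multiplier
    return total, state
```

Calling `parse_formula("CaSO4 . 2H2O")` folds the two waters into the totals, and `parse_formula("H2O(g)")` returns the state label separately. Charges, isotopes, and fractional hydrates are all out of scope here.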
Note: There are two standardized formats for this called SMILES and SELFIES. SMILES is much better supported, but SELFIES is more robust. I'm integrating them into some bio and chem software I'm working on.
You can do things like use PubChem's API to look up molecules similar to a given SMILES string.
I believe most molecule editors can load and save SMILES.
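For the PubChem lookup, something along these lines should work. It is a sketch from memory of the PUG REST `fastsimilarity_2d` operation and its `Threshold` / `MaxRecords` options, so check the PubChem docs before relying on it. The helper names `similarity_url` and `similar_compounds` are my own, and SMILES containing characters like `/` may need to be sent via POST instead of in the URL path.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def similarity_url(smiles: str, threshold: int = 90, max_records: int = 5) -> str:
    """Build a 2D-similarity search URL for a SMILES string (assumed endpoint)."""
    query = urlencode({"Threshold": threshold, "MaxRecords": max_records})
    encoded = quote(smiles, safe="")  # percent-encode '#', '=', '/', etc.
    return f"{PUG_REST}/compound/fastsimilarity_2d/smiles/{encoded}/cids/JSON?{query}"


def similar_compounds(smiles: str, **options) -> dict:
    """Fetch the raw JSON response; needs network access."""
    with urlopen(similarity_url(smiles, **options), timeout=30) as response:
        return json.load(response)
```

`similar_compounds("c1ccccc1O")` would then return whatever JSON PubChem sends back for compounds similar to phenol; I have left the response unparsed since I don't want to guess at its exact shape.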
What about InChI? Isn’t that a common way of describing molecules as well?
InChI isn't really meant to be used as a format for storing 2D molecules, say for rendering, but rather serves as a unique descriptive chemical identifier. InChI has many flavors, but the Standard InChI yields one unique identifier for multiple forms (tautomers) of the same molecule.
Good point!
SMILES and SELFIES are molecular graph representations and aren't meant to solve the "parse this sum formula" problem.
SELFIES are for genAI. If you ask a VAE to generate SMILES, it will spit out some strings that are invalid. That can't happen with SELFIES, and that is the one application where they are robust.
It's still being argued if you really need SELFIES, or if SMILES autoencoders can be trained to only generate valid molecules, or if generating invalid molecules is useful (I'm in camp SELFIES, but I also want better ways to represent and learn on graphical chemical structures, rather than serialized strings).
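To illustrate what "invalid" means for generated SMILES, here is a toy Python check (my own sketch, not from any library) for two purely syntactic failure modes: branch parentheses and ring-closure labels both have to pair up. The function name `smiles_syntax_problems` is made up. It ignores valence, aromaticity, and most of the real grammar, so passing it does not mean a string is valid SMILES.

```python
def smiles_syntax_problems(smiles: str) -> list:
    """Return a list of pairing problems found in a SMILES-like string."""
    problems = []
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms such as [13CH4]; digits inside are not ring labels.
            end = smiles.find("]", i)
            if end == -1:
                problems.append("unclosed bracket atom")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                problems.append("')' without matching '('")
            else:
                depth -= 1
        elif ch in "0123456789" or ch == "%":
            if ch == "%":          # two-digit ring label, e.g. %12
                label = smiles[i + 1:i + 3]
                i += 2
            else:
                label = ch
            # A label opens a ring the first time and closes it the second.
            if label in open_rings:
                open_rings.remove(label)
            else:
                open_rings.add(label)
        i += 1
    if depth > 0:
        problems.append("unclosed '('")
    if open_rings:
        problems.append("unclosed ring bond(s): " + ", ".join(sorted(open_rings)))
    return problems
```

A string like `c1ccccc` (benzene with the closing `1` dropped) fails this check, which is the kind of near miss a character-level generator can easily produce. SELFIES sidesteps the issue by construction rather than by checking afterwards.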
Does SMILES (Simplified Molecular Input Line Entry System) have an EBNF definition? https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Lin... claims there is a context-free grammar.
this is insanely cool
… It is just a parser? Sure the parser is written very succinctly and that’s neat. But parser generators for other languages can do it similarly.