Comment by the__alchemist

9 hours ago

Note: There are two standardized formats for this called SMILES and SELFIES. SMILES is much better supported, but SELFIES is more robust. I'm integrating them into some bio and chem software I'm working on.

Given a SMILES string, you can do things like look up similar molecules using PubChem's API.
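A minimal sketch of that kind of lookup, assuming PubChem's PUG REST `fastsimilarity_2d` endpoint with its `Threshold` and `MaxRecords` options. The function names are made up for illustration, and putting the SMILES in the URL path only works reliably for simple strings (ones containing `/`, `\` or `#` are better sent another way):

```python
import json
import urllib.parse
import urllib.request

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def similarity_url(smiles: str, threshold: int = 90, max_records: int = 10) -> str:
    """Build a PUG REST URL for a 2D similarity search on a SMILES string."""
    # Percent-encode the SMILES so characters like '=' and '(' survive in the path.
    encoded = urllib.parse.quote(smiles, safe="")
    query = urllib.parse.urlencode({"Threshold": threshold, "MaxRecords": max_records})
    return f"{PUG_BASE}/compound/fastsimilarity_2d/smiles/{encoded}/cids/JSON?{query}"

def similar_cids(smiles: str, threshold: int = 90, max_records: int = 10) -> list:
    """Fetch PubChem compound IDs (CIDs) of molecules similar to `smiles`."""
    with urllib.request.urlopen(similarity_url(smiles, threshold, max_records)) as resp:
        data = json.load(resp)
    # Assumed response shape: {"IdentifierList": {"CID": [...]}}
    return data.get("IdentifierList", {}).get("CID", [])
```

Calling `similar_cids("CCO")` would then return a list of CIDs for ethanol-like compounds, which can be fed into further PubChem requests for names or properties.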

I believe most molecule editors can load and save SMILES.


dachrillz 9 hours ago

What about InChI? Isn’t that a common way of describing molecules as well?

fred_tandemai 5 hours ago

InChI isn't really meant to be used as a format for storing 2D molecules, say for rendering, but rather serves as a unique descriptive chemical identifier. InChI has many flavors, but the Standard InChI yields one unique identifier for multiple forms (tautomers) of the same molecule.

the__alchemist 9 hours ago

Good point!

jugoetz 7 hours ago

SMILES and SELFIES are molecular graph representations and aren't meant to solve the "parse this sum formula" problem.

SELFIES are for genAI. If you ask a VAE to generate SMILES, it will spit out some invalid strings. That can't happen with SELFIES, and that is the one application where they are robust.
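To make the failure mode concrete, here is a toy sketch (not a real SMILES parser, and the function name is hypothetical). It checks only two of the syntax rules a generated SMILES string has to satisfy: branch parentheses must balance and every ring-closure label must appear in pairs. A model that emits characters one at a time can easily break either rule, whereas SELFIES is designed so that any sequence of its symbols decodes to some valid molecule.

```python
def smiles_syntax_ok(s: str) -> bool:
    """Toy check: balanced branches and paired ring-closure labels only.

    Ignores valence, aromaticity, and everything else a real parser checks.
    """
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "[":
            # Skip bracket atoms such as [13C] or [NH4+]; digits inside
            # them are isotopes/counts, not ring-closure labels.
            end = s.find("]", i)
            if end == -1:
                return False
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12
            label = s[i + 1:i + 3]
            if len(label) != 2 or not label.isdigit():
                return False
            open_rings ^= {int(label)}
            i += 3
            continue
        elif ch.isdigit():
            open_rings ^= {int(ch)}  # toggle: first occurrence opens, second closes
        i += 1
    return depth == 0 and not open_rings
```

With this, `smiles_syntax_ok("c1ccccc1")` (benzene) passes, while `"c1ccccc"` (ring never closed) and `"CC(C"` (branch never closed) fail, which is the kind of output a SMILES generator has to be trained or filtered away from.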

dekhn 6 hours ago

It's still being argued whether you really need SELFIES, or whether SMILES autoencoders can be trained to only generate valid molecules, or whether generating invalid molecules is useful (I'm in camp SELFIES, but I also want better ways to represent and learn on graphical chemical structures, rather than serialized strings).

chermi 1 hour ago

can you guys explain what makes SELFIES robust? I'd only heard of SMILES until this thread, but I have been out of this space for 10 years.