Comment by whitten
3 months ago
Does the SMILE (or Simplified Molecular Input Line Entry System) code have an EBNF definition ? https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Lin... Claims there is a context free grammar.
3 months ago
Does the SMILE (or Simplified Molecular Input Line Entry System) code have an EBNF definition ? https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Lin... Claims there is a context free grammar.
That's "SMILES".
Yes. Here is the yacc grammar for the SMILES parser in the RDKit. https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Smi...
There's also one from OpenSMILES at http://opensmiles.org/opensmiles.html#_grammar . It has a shift/reduce error (as I recall) that I was not competent enough to fix.
I prefer to parser almost completely in the lexer, with a small amount of lexer state to handle balanced parens, bracket atoms, and matching ring closures. See https://hg.sr.ht/~dalke/opensmiles-ragel and more specifically https://hg.sr.ht/~dalke/opensmiles-ragel/browse/opensmiles.r... .
Oh, I should have pointed out my Python lexer-driven parser at https://hg.sr.ht/~dalke/smiview/browse/smiview.py
The lexer: https://hg.sr.ht/~dalke/smiview/browse/smiview.py?rev=tip#L3...
The lexer state transitions: https://hg.sr.ht/~dalke/smiview/browse/smiview.py?rev=tip#L3...
I wrote a very simple SMILES parser using pyparsing https://github.com/dakoner/smilesparser/tree/master I wouldn't say it's intended for production work, but it has been useful in situations where I didn't want to pull in rdkit.
I see you include the dot disconnect "." as part of the Bond definition.
You also define Chain as:
I believe this means your grammar allows the invalid SMILES C=.N
[dead]