
Comment by DrNosferatu

6 days ago

For some peace of mind, we can perform the search:

  OUTPUT=$(find .cursor/rules/ -name '*.mdc' -print0 2>/dev/null | xargs -0 perl -CSD -wnE '
    # -CSD decodes the files as UTF-8 so the \x{....} code points actually match
    BEGIN { $re = qr/[\x{200B}-\x{200D}\x{202A}-\x{202E}\x{2066}-\x{2069}]/ }
    print "$ARGV:$.:$_" if /$re/;
    close ARGV if eof;   # reset $. so the reported line numbers are per-file
  ' 2>/dev/null)

  FILES_FOUND=$(find .cursor/rules/ -name '*.mdc' -print 2>/dev/null)

  if [[ -z "$FILES_FOUND" ]]; then
    echo "Error: No .mdc files found in the directory."
  elif [[ -z "$OUTPUT" ]]; then
    echo "No suspicious Unicode characters found."
  else
    echo "Found suspicious characters:"
    echo "$OUTPUT"
  fi

- Can this be improved?

Now, my toy programming languages all share the same "ensureCharLegal" function in their lexers. It is called on every single character in the input (including characters inside string literals) and rejects all of those characters, plus all control characters (except LF), and also something else that I can't remember right now... some weird space-like characters, I think?
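
For illustration only, here is a minimal Python sketch of what such a check could look like; the function names, the line/column tracking, and the exact character set (especially the "space-like" ones) are my assumptions, not the actual toy-language implementation:

  # Sketch only: reject bidi/zero-width controls, every control char except LF,
  # and a few unusual space-like characters (this last group is a guess).
  ILLEGAL = set(
      "\u200B\u200C\u200D"                # zero-width space / non-joiner / joiner
      "\u202A\u202B\u202C\u202D\u202E"    # bidi embeddings and overrides
      "\u2066\u2067\u2068\u2069"          # bidi isolates
      "\u00A0\u2028\u2029\uFEFF"          # odd space-like characters (assumed)
  )

  def ensure_char_legal(ch: str, line: int, col: int) -> None:
      """Reject a character that should never appear in source text."""
      if ch == "\n":                       # LF is the only allowed control char
          return
      if ch in ILLEGAL or ord(ch) < 0x20 or ord(ch) == 0x7F:
          raise ValueError(f"illegal char U+{ord(ch):04X} at {line}:{col}")

  def scan(source: str) -> None:
      line, col = 1, 1
      for ch in source:
          ensure_char_legal(ch, line, col)
          line, col = (line + 1, 1) if ch == "\n" else (line, col + 1)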

Nothing really stops non-toy programming and configuration languages from adopting the same approach, except for the fact that someone has to think about it and then implement it.