
Comment by DrNosferatu

6 days ago

For some peace of mind, we can perform the search:

  OUTPUT=$(find .cursor/rules/ -name '*.mdc' -print0 2>/dev/null | xargs -0 perl -CSD -wnE '
    # -CSD decodes the files as UTF-8 so the \x{....} code points actually match
    BEGIN { $re = qr/[\x{200B}-\x{200D}\x{202A}-\x{202E}\x{2066}-\x{2069}]/ }
    print "$ARGV:$.:$_" if /$re/;
    close ARGV if eof;   # reset $. so the reported line numbers are per-file
  ' 2>/dev/null)

  FILES_FOUND=$(find .cursor/rules/ -name '*.mdc' -print 2>/dev/null)

  if [[ -z "$FILES_FOUND" ]]; then
    echo "Error: No .mdc files found in the directory."
  elif [[ -z "$OUTPUT" ]]; then
    echo "No suspicious Unicode characters found."
  else
    echo "Found suspicious characters:"
    echo "$OUTPUT"
  fi

- Can this be improved?

Now, my toy programming languages all share the same "ensureCharLegal" function in their lexers. It is called on every single character in the input (including characters inside string literals) and rejects all of those characters, plus all control characters (except LF), and also something else that I can't remember right now... some weird space-like characters, I think?
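
For illustration only, here is a minimal Python sketch of what such a check could look like; the function names, the line/column tracking, and the exact character set (especially the "space-like" ones) are my assumptions, not the actual toy-language implementation:

  # Sketch only: reject bidi/zero-width controls, every control char except LF,
  # and a few unusual space-like characters (this last group is a guess).
  ILLEGAL = set(
      "\u200B\u200C\u200D"                # zero-width space / non-joiner / joiner
      "\u202A\u202B\u202C\u202D\u202E"    # bidi embeddings and overrides
      "\u2066\u2067\u2068\u2069"          # bidi isolates
      "\u00A0\u2028\u2029\uFEFF"          # odd space-like characters (assumed)
  )

  def ensure_char_legal(ch: str, line: int, col: int) -> None:
      """Reject a character that should never appear in source text."""
      if ch == "\n":                       # LF is the only allowed control char
          return
      if ch in ILLEGAL or ord(ch) < 0x20 or ord(ch) == 0x7F:
          raise ValueError(f"illegal char U+{ord(ch):04X} at {line}:{col}")

  def scan(source: str) -> None:
      line, col = 1, 1
      for ch in source:
          ensure_char_legal(ch, line, col)
          line, col = (line + 1, 1) if ch == "\n" else (line, col + 1)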

Nothing really stops non-toy programming and configuration languages from adopting the same approach, except for the fact that someone has to think about it and then implement it.