Comment by koakuma-chan

1 day ago

> What if the file name is not valid UTF-8

Nothing? Neither Go nor the OS requires file names to be UTF-8, I believe

> Nothing?

It breaks. Which is weird because you can create a string which isn't valid UTF-8 (e.g. "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98") and print it out with no trouble; you just can't pass it to e.g. `os.Create` or `os.Open`.

(Bash and a variety of other utils will also complain about the name not being valid UTF-8; neovim won't save a file under that name; etc.)
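
The "print it out with no trouble" half is easy to check; a minimal sketch (the byte string is just the same arbitrary invalid-UTF-8 example):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        // Go strings are arbitrary byte sequences; UTF-8 validity is not enforced.
        s := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
        fmt.Println(utf8.ValidString(s)) // false
        fmt.Println(s)                   // prints the raw bytes; rendering is up to the terminal
    }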

  • That sounds like your kernel refusing to create that file, nothing to do with Go.

      $ cat main.go
      package main
    
      import (
       "log"
       "os"
      )
    
      func main() {
       f, err := os.Create("\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98")
       if err != nil {
        log.Fatalf("create: %v", err)
       }
       _ = f
      }
      $ go run .
      $ ls -1
      ''$'\275\262''='$'\274'' ⌘'
      go.mod
      main.go

  • It sounds like you found a bug in your filesystem, not in Golang's API, because you totally can pass that string to those functions and open the file successfully.

Well, Windows is an odd beast when 8-bit file names are used. Done naively, you can’t express every valid Windows filename even as broken UTF-8: filenames that aren’t valid Unicode can’t be encoded to UTF-8 without loss or some weird convention.

You can do something like WTF-8 (not a misspelling, alas) to make the conversion bidirectional. Rust does this under the hood but doesn’t expose the internal representation.
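
Roughly what the encoding half of WTF-8 looks like, sketched in Go (hand-rolled here because the standard utf8/utf16 packages replace lone surrogates with U+FFFD; not meant as production code):

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    // wtf8Encode turns potentially ill-formed UTF-16 (say, a Windows filename)
    // into WTF-8: valid surrogate pairs become ordinary 4-byte UTF-8 sequences,
    // while unpaired surrogates are kept and encoded in the generalized 3-byte
    // form instead of being replaced with U+FFFD, so the conversion round-trips.
    func wtf8Encode(u []uint16) []byte {
        var out []byte
        for i := 0; i < len(u); i++ {
            r := rune(u[i])
            // A high surrogate followed by a low surrogate is a valid pair.
            if r >= 0xD800 && r <= 0xDBFF && i+1 < len(u) {
                if lo := rune(u[i+1]); lo >= 0xDC00 && lo <= 0xDFFF {
                    r = utf16.DecodeRune(r, lo)
                    i++
                }
            }
            switch {
            case r < 0x80:
                out = append(out, byte(r))
            case r < 0x800:
                out = append(out, 0xC0|byte(r>>6), 0x80|byte(r)&0x3F)
            case r < 0x10000: // includes lone surrogates: the WTF-8 extension
                out = append(out, 0xE0|byte(r>>12), 0x80|byte(r>>6)&0x3F, 0x80|byte(r)&0x3F)
            default:
                out = append(out, 0xF0|byte(r>>18), 0x80|byte(r>>12)&0x3F, 0x80|byte(r>>6)&0x3F, 0x80|byte(r)&0x3F)
            }
        }
        return out
    }

    func main() {
        // "a", an unpaired high surrogate, then "€" -- not well-formed UTF-16.
        name := []uint16{'a', 0xD800, 0x20AC}
        fmt.Printf("% x\n", wtf8Encode(name)) // 61 ed a0 80 e2 82 ac
    }

Decoding is the mirror image: parse the generalized UTF-8 and re-split anything above U+FFFF back into a surrogate pair.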

  • What do you mean by "when 8-bit filenames are used"? Do you mean the -A APIs, like CreateFileA()? Those do not take UTF-8, mind you -- unless you are using a relatively recent version of Windows that allows you to run your process with a UTF-8 codepage.

    In general, Windows filenames are Unicode and you can always express those filenames by using the -W APIs (like CreateFileW()).
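
    A hedged sketch of what the -W route looks like from Go's stdlib syscall package (Windows-only, the path is just an example, and error handling is trimmed; Go's own os.Open ends up in CreateFileW the same way):

      //go:build windows

      package main

      import (
          "log"
          "syscall"
      )

      func main() {
          // Convert the Go (UTF-8) path to the UTF-16 string CreateFileW expects.
          name, err := syscall.UTF16PtrFromString(`C:\temp\résumé ⌘.txt`)
          if err != nil {
              log.Fatal(err)
          }
          h, err := syscall.CreateFile(name,
              syscall.GENERIC_WRITE, 0, nil,
              syscall.CREATE_ALWAYS, syscall.FILE_ATTRIBUTE_NORMAL, 0)
          if err != nil {
              log.Fatalf("CreateFileW: %v", err)
          }
          syscall.CloseHandle(h)
      }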

    • Windows filenames in the W APIs are 16-bit (which the A APIs essentially wrap with conversions to the active old-school codepage), and are normally well formed UTF-16. But they aren’t required to be - NTFS itself only cares about 0x0000 and 0x005C (backslash) I believe, and all layers of the stack accept invalid UTF-16 surrogates. Don’t get me started on the normal Win32 path processing (Unicode normalization, “COM” is still a special file, etc.), some of which can be bypassed with the “\\?\” prefix when in NTFS.

      The upshot is that since the values aren’t always UTF-16, there’s no canonical way to convert them to single byte strings such that valid UTF-16 gets turned into valid UTF-8 but the rest can still be roundtripped. That’s what bastardized encodings like WTF-8 solve. The Rust Path API is the best take on this I’ve seen that doesn’t choke on bad Unicode.
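
      The loss is easy to see with the standard Go conversions, which follow the replace-with-U+FFFD convention (just a sketch; the []uint16 stands in for a name as the filesystem would hand it back):

        package main

        import (
            "fmt"
            "unicode/utf16"
        )

        func main() {
            // Ill-formed UTF-16: "a" followed by an unpaired high surrogate.
            bad := []uint16{'a', 0xD800}
            s := string(utf16.Decode(bad))  // the lone surrogate becomes U+FFFD...
            back := utf16.Encode([]rune(s)) // ...so the original can't be recovered
            fmt.Printf("original: %04x\n", bad)  // [0061 d800]
            fmt.Printf("restored: %04x\n", back) // [0061 fffd]
        }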

    • I think it depends on the underlying filesystem. Unicode (UTF-16) is first-class on NTFS. But Windows still supports FAT, I guess, where multiple 8-bit encodings are possible: the so-called "OEM" code pages (437, 850 etc.) or "ANSI" code pages (1250, 1251 etc.). I haven't checked how recent Windows versions cope with FAT file names that cannot be represented as Unicode.

  • I believe the same is true on Linux, which only cares about 0x2f (i.e. /) and 0x00 (NUL) bytes

    • Windows paths are not necessarily well-formed UTF-16 (UCS-2 by some people’s definition) down to the filesystem level. If they were always well formed, you could convert to a single-byte representation by straightforward Unicode re-encoding. But since they aren’t, you have to decide what to do with malformed UTF-16 if you want to round-trip it to single-byte strings that still match the UTF-8 encoding whenever the input is well formed.

      On Linux, they’re 8-bit, almost-arbitrary strings like you noted, and usually UTF-8. So they always have a convenient 8-bit encoding (i.e. leave them alone). If you hated yourself and wanted to convert them to UTF-16, however, you’d have the same problem Windows does, but in reverse.
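
      The same sketch in the other direction, taking the invalid-UTF-8 byte string from the snippet upthread through UTF-16 and back:

        package main

        import (
            "fmt"
            "unicode/utf16"
        )

        func main() {
            // A Linux-style filename: arbitrary bytes, not valid UTF-8.
            name := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
            u16 := utf16.Encode([]rune(name)) // the invalid bytes decode to U+FFFD...
            back := string(utf16.Decode(u16)) // ...so the round trip loses them
            fmt.Printf("original: %+q\n", name) // "\xbd\xb2=\xbc \u2318"
            fmt.Printf("restored: %+q\n", back) // "\ufffd\ufffd=\ufffd \u2318"
        }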