Comment by koakuma-chan

1 day ago

> What if the file name is not valid UTF-8

Nothing? Neither Go nor the OS requires file names to be UTF-8, I believe

> Nothing?

It breaks. Which is weird because you can create a string which isn't valid UTF-8 (e.g. "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98") and print it out with no trouble; you just can't pass it to e.g. `os.Create` or `os.Open`.

(Bash and a variety of other utils will also complain about the name not being valid UTF-8; neovim won't save a file under that name; etc.)
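
The "print it out with no trouble" half is easy to check; a minimal sketch (the byte string is just the same arbitrary invalid-UTF-8 example):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        // Go strings are arbitrary byte sequences; UTF-8 validity is not enforced.
        s := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
        fmt.Println(utf8.ValidString(s)) // false
        fmt.Println(s)                   // prints the raw bytes; rendering is up to the terminal
    }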

  • That sounds like your kernel refusing to create that file, nothing to do with Go.

      $ cat main.go
      package main
    
      import (
       "log"
       "os"
      )
    
      func main() {
       f, err := os.Create("\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98")
       if err != nil {
        log.Fatalf("create: %v", err)
       }
       _ = f
      }
      $ go run .
      $ ls -1
      ''$'\275\262''='$'\274'' ⌘'
      go.mod
      main.go

  • It sounds like you found a bug in your filesystem, not in Golang's API, because you totally can pass that string to those functions and open the file successfully.

Well, Windows is an odd beast when 8-bit file names are used. Done naively, you can’t express every valid Windows filename even as broken UTF-8: filenames that aren’t valid Unicode can’t be encoded to UTF-8 without loss or some weird convention.

You can do something like WTF-8 (not a misspelling, alas) to make the conversion bidirectional. Rust does this under the hood but doesn’t expose the internal representation.
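
Roughly what the encoding half of WTF-8 looks like, sketched in Go (hand-rolled here because the standard utf8/utf16 packages replace lone surrogates with U+FFFD; not meant as production code):

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    // wtf8Encode turns potentially ill-formed UTF-16 (say, a Windows filename)
    // into WTF-8: valid surrogate pairs become ordinary 4-byte UTF-8 sequences,
    // while unpaired surrogates are kept and encoded in the generalized 3-byte
    // form instead of being replaced with U+FFFD, so the conversion round-trips.
    func wtf8Encode(u []uint16) []byte {
        var out []byte
        for i := 0; i < len(u); i++ {
            r := rune(u[i])
            // A high surrogate followed by a low surrogate is a valid pair.
            if r >= 0xD800 && r <= 0xDBFF && i+1 < len(u) {
                if lo := rune(u[i+1]); lo >= 0xDC00 && lo <= 0xDFFF {
                    r = utf16.DecodeRune(r, lo)
                    i++
                }
            }
            switch {
            case r < 0x80:
                out = append(out, byte(r))
            case r < 0x800:
                out = append(out, 0xC0|byte(r>>6), 0x80|byte(r)&0x3F)
            case r < 0x10000: // includes lone surrogates: the WTF-8 extension
                out = append(out, 0xE0|byte(r>>12), 0x80|byte(r>>6)&0x3F, 0x80|byte(r)&0x3F)
            default:
                out = append(out, 0xF0|byte(r>>18), 0x80|byte(r>>12)&0x3F, 0x80|byte(r>>6)&0x3F, 0x80|byte(r)&0x3F)
            }
        }
        return out
    }

    func main() {
        // "a", an unpaired high surrogate, then "€" -- not well-formed UTF-16.
        name := []uint16{'a', 0xD800, 0x20AC}
        fmt.Printf("% x\n", wtf8Encode(name)) // 61 ed a0 80 e2 82 ac
    }

Decoding is the mirror image: parse the generalized UTF-8 and re-split anything above U+FFFF back into a surrogate pair.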

  • What do you mean by "when 8-bit filenames are used"? Do you mean the -A APIs, like CreateFileA()? Those do not take UTF-8, mind you -- unless you are using a relatively recent version of Windows that allows you to run your process with a UTF-8 codepage.

    In general, Windows filenames are Unicode and you can always express those filenames by using the -W APIs (like CreateFileW()).
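
    A hedged sketch of what the -W route looks like from Go's stdlib syscall package (Windows-only, the path is just an example, and error handling is trimmed; Go's own os.Open ends up in CreateFileW the same way):

      //go:build windows

      package main

      import (
          "log"
          "syscall"
      )

      func main() {
          // Convert the Go (UTF-8) path to the UTF-16 string CreateFileW expects.
          name, err := syscall.UTF16PtrFromString(`C:\temp\résumé ⌘.txt`)
          if err != nil {
              log.Fatal(err)
          }
          h, err := syscall.CreateFile(name,
              syscall.GENERIC_WRITE, 0, nil,
              syscall.CREATE_ALWAYS, syscall.FILE_ATTRIBUTE_NORMAL, 0)
          if err != nil {
              log.Fatalf("CreateFileW: %v", err)
          }
          syscall.CloseHandle(h)
      }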

    • Windows filenames in the W APIs are 16-bit (which the A APIs essentially wrap with conversions to the active old-school codepage), and are normally well formed UTF-16. But they aren’t required to be - NTFS itself only cares about 0x0000 and 0x005C (backslash) I believe, and all layers of the stack accept invalid UTF-16 surrogates. Don’t get me started on the normal Win32 path processing (Unicode normalization, “COM” is still a special file, etc.), some of which can be bypassed with the “\\?\” prefix when in NTFS.

      The upshot is that since the values aren’t always UTF-16, there’s no canonical way to convert them to single byte strings such that valid UTF-16 gets turned into valid UTF-8 but the rest can still be roundtripped. That’s what bastardized encodings like WTF-8 solve. The Rust Path API is the best take on this I’ve seen that doesn’t choke on bad Unicode.
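
      The loss is easy to see with the standard Go conversions, which follow the replace-with-U+FFFD convention (just a sketch; the []uint16 stands in for a name as the filesystem would hand it back):

        package main

        import (
            "fmt"
            "unicode/utf16"
        )

        func main() {
            // Ill-formed UTF-16: "a" followed by an unpaired high surrogate.
            bad := []uint16{'a', 0xD800}
            s := string(utf16.Decode(bad))  // the lone surrogate becomes U+FFFD...
            back := utf16.Encode([]rune(s)) // ...so the original can't be recovered
            fmt.Printf("original: %04x\n", bad)  // [0061 d800]
            fmt.Printf("restored: %04x\n", back) // [0061 fffd]
        }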

    • I think it depends on the underlying filesystem. Unicode (UTF-16) is first-class on NTFS. But Windows still supports FAT, I guess, where multiple 8-bit encodings are possible: the so-called "OEM" code pages (437, 850 etc.) or "ANSI" code pages (1250, 1251 etc.). I haven't checked how recent Windows versions cope with FAT file names that cannot be represented as Unicode.

  • I believe the same is true on Linux, which only cares about 0x2f (i.e. /) and 0x00 (NUL) bytes

    • Windows paths are not necessarily well-formed UTF-16 (UCS-2 by some people’s definition) down to the filesystem level. If they were always well formed, you could convert to a single-byte representation by straightforward Unicode re-encoding. But since they aren’t, you have to decide what to do with malformed UTF-16 if you want to round-trip it to single-byte strings that still match the UTF-8 encoding whenever the input is well formed.

      On Linux, they’re 8-bit, almost-arbitrary strings like you noted, and usually UTF-8. So they always have a convenient 8-bit encoding (i.e. leave them alone). If you hated yourself and wanted to convert them to UTF-16, however, you’d have the same problem Windows does, but in reverse.
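
      The same sketch in the other direction, taking the invalid-UTF-8 byte string from the snippet upthread through UTF-16 and back:

        package main

        import (
            "fmt"
            "unicode/utf16"
        )

        func main() {
            // A Linux-style filename: arbitrary bytes, not valid UTF-8.
            name := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
            u16 := utf16.Encode([]rune(name)) // the invalid bytes decode to U+FFFD...
            back := string(utf16.Decode(u16)) // ...so the round trip loses them
            fmt.Printf("original: %+q\n", name) // "\xbd\xb2=\xbc \u2318"
            fmt.Printf("restored: %+q\n", back) // "\ufffd\ufffd=\ufffd \u2318"
        }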