Comment by thomashabets2

1 day ago

Author here.

What I intended to say with this is that ignoring the problem of invalid UTF-8 (which could be valid ISO-8859-1) with no error handling, or the other way around, has lost me data in the past.

Compare this to Rust, where a path name is a different type from a mere string. If you need to treat it like a string and you don't care if it's "a bit wrong" (because it's only being shown to the user), then you can call `.to_string_lossy()`. But it's much harder to accidentally fail to handle that case when an exact name match does matter.

When exactness matters, `.to_str()` returns `Option<&str>`, so the caller is forced to deal with the situation that the file name may not be UTF-8.
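A minimal sketch of that distinction (Unix-only, because it builds an `OsStr` from raw bytes; the file name is a made-up example):

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: lets us build an OsStr from arbitrary bytes
use std::path::Path;

fn main() {
    // 0xE9 is 'é' in ISO-8859-1 but not a valid UTF-8 sequence.
    let raw = b"r\xe9sum\xe9.txt";
    let path = Path::new(OsStr::from_bytes(raw));

    // For display only: lossy conversion substitutes U+FFFD and never fails.
    println!("showing user: {}", path.to_string_lossy());

    // For exact matching: the caller is forced to handle the non-UTF-8 case.
    match path.to_str() {
        Some(s) => println!("valid UTF-8: {s}"),
        None => eprintln!("not UTF-8; must be handled explicitly"),
    }
}
```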

Being sloppy with file name encodings is how data is lost. Go is sloppy with strings of all kinds, file names included.

Thanks for your reply. I understand that encoding the character set in the type system is more explicit and can help find bugs.

But forcing all strings to be UTF-8 does not magically help with the issue you described. In practice I've often seen the opposite: now you have to write two code paths, one for UTF-8 and one for everything else. And the second one is ignored in practice because it is annoying to write. For example, I built the web server project in your other submission (very cool!) and gave it a tar file containing a non-UTF-8 file name. There is no special handling: I simply get "error: invalid UTF-8 was detected in one or more arguments" and the application exits. It just refuses to work with non-UTF-8 files at all -- is this less sloppy?

Forcing UTF-8 does not "fix" compatibility in strange edge cases, it just breaks them all. The best approach is to treat data as opaque bytes unless there is a good reason not to. Which is what Go does, so I think it is unfair to blame Go for this particular reason instead of the backup applications.

  • > It just refuses to work with non-UTF-8 files at all -- is this less sloppy?

    You can debate whether it is sloppy but I think an error is much better than silently corrupting data.

    > The best approach is to treat data as opaque bytes unless there is a good reason not to

    This doesn't seem like a good approach when dealing with strings which are not just blobs of bytes. They have an encoding and generally you want ways to, for instance, convert a string to upper/lowercase.
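    A contrived sketch of why the encoding matters for case conversion (the byte value is an assumption for illustration):

    ```rust
    fn main() {
        // The byte 0xE9 is 'é' in ISO-8859-1 but an invalid sequence in UTF-8,
        // so an uppercase operation on opaque bytes has no defined answer.
        let byte = 0xE9u8;
        // ISO-8859-1 maps 1:1 onto the first 256 Unicode code points.
        let as_latin1 = byte as char;
        let upper: String = as_latin1.to_uppercase().collect();
        println!("{as_latin1} uppercases to {upper}"); // é -> É
    }
    ```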

  • Can't say I know the best way here. But Rust does this better than anything I've seen.

    I don't think you need two code paths. Maybe your program can live its entire life never converting away from the original form. Say you read from disk, pick out just the filename, and give it to an archive library.

    There's no need to ever convert that to a "string". Yes, it could have been a byte array, but taking out the file name (or maybe the final directory plus file name) is a string operation, just not necessarily on a UTF-8 string.
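    A sketch of that flow, with a hypothetical `add_to_archive` standing in for the archive library (Unix-only because of `OsStrExt`):

    ```rust
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt; // Unix-only
    use std::path::Path;

    // Hypothetical archive API: takes the name as-is, no UTF-8 required.
    fn add_to_archive(name: &OsStr) {
        println!("archiving a {}-byte name", name.as_bytes().len());
    }

    fn main() {
        // A path whose final component is not valid UTF-8.
        let full = Path::new(OsStr::from_bytes(b"/backups/2024/r\xe9sum\xe9.txt"));

        // file_name() is a string-like operation that never demands UTF-8.
        if let Some(name) = full.file_name() {
            add_to_archive(name);
        }
    }
    ```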

    And like I said, for all use cases where it just needs to be shown to users, the "lossy" version is fine.

    > I simply get "error: invalid UTF-8 was detected in one or more arguments" and the application exits. It just refuses to work with non-UTF-8 files at all -- is this less sloppy?

    Haha, touché. But yes, it's less sloppy. Would you prefer that the files were silently skipped? You've created your archive, you've started the webserver, but you just can't get it to deliver the page you want.

    In order for tarweb to support non-UTF-8 in filenames, the programmer has to actually think about what that means. I don't think it means doing a lossy conversion, because that's not what the file name was, and it's not merely for human display. And it should probably not be the raw bytes either, because tools will likely want to send their requests UTF-8 encoded.

    Or maybe they don't. In either case, unless that's designed, implemented, and tested, non-UTF-8 in filenames should probably be seen as malformed input. For something that uses a tar file for the duration of the process's life, that probably means rejecting it and asking the user to roll back to a previous working version or something.

    > Forcing UTF-8 does not "fix" compatibility in strange edge cases

    Yup. Still better than silently corrupting.

    Compare this to how, for Rust kernel work, they apparently had to implement a new Vec equivalent, because dealing with allocation failures is a different thing in user space and kernel space[1], and `Vec::push` can't return an error.
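    Stable user-space Rust does offer a taste of fallible allocation via `try_reserve` (a sketch only; the kernel's actual types differ):

    ```rust
    use std::collections::TryReserveError;

    // A push that reports allocation failure instead of aborting.
    fn push_fallible<T>(v: &mut Vec<T>, x: T) -> Result<(), TryReserveError> {
        v.try_reserve(1)?; // fallible allocation up front
        v.push(x);         // capacity is now guaranteed, so this cannot allocate
        Ok(())
    }

    fn main() {
        let mut v = Vec::new();
        push_fallible(&mut v, 42).expect("allocation failed");
        assert_eq!(v, vec![42]);
    }
    ```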

    Similarly, Go string operations cannot fail. And infallible memory allocation has justifications that infallible string operations don't.

    [1] A big separate topic. Almost nobody runs with overcommit off.

    • An error is better than silent corruption, sure.

      But there is no silent corruption when you pass the data along as opaque bytes; you just get some placeholder symbols when it is displayed. That's how I see the file in my terminal, and I can `rm` it just fine.

      And yes, question marks in the terminal are way better than applications not working at all.

      Non-UTF-8 files being skipped is usually a characteristic of applications written in languages that don't use plain bytes as their default string type, not the other way around. This has bitten me multiple times with Python 2/3 libraries.

    • > There's no need to ever convert that to a "string".

      Until you run into a crate that wants the filename in `String` form.