Comment by thomashabets2

1 day ago

Can't say I know the best way here. But Rust does this better than anything I've seen.

I don't think you need two code paths. Maybe your program can live its entire life never converting away from the original form. Say you read from disk, pick out just the filename, and give it to an archive library.

There's no need to ever convert that to a "string". Yes, it could have been a byte array, but taking out the file name (or maybe final dir plus file name) is a string operation, just not necessarily on a UTF-8 string.
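To make that concrete, here is a rough sketch of picking out the file name (or final dir plus file name) in Rust without ever going through `String`. The helper names `file_name_only` and `last_dir_and_file` are made up for illustration; only `Path::file_name`, `Path::parent` and `Path::join` are real std calls.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Hypothetical helper: the final path component, still in its original
/// (possibly non-UTF-8) form.
fn file_name_only(path: &Path) -> Option<&OsStr> {
    path.file_name()
}

/// Hypothetical helper: final directory plus file name, e.g. "data/report.txt"
/// for "/srv/data/report.txt". Returns None if either part is missing.
fn last_dir_and_file(path: &Path) -> Option<PathBuf> {
    let file = path.file_name()?;
    let dir = path.parent()?.file_name()?;
    Some(Path::new(dir).join(file))
}
```

Both results can be handed on to anything that accepts `&OsStr` or `&Path`, so the bytes read from disk are never reinterpreted along the way.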

And like I said, for all use cases where it just needs to be shown to users, the "lossy" version is fine.
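For the display-only case, a minimal sketch could look like this. `display_name` is a hypothetical helper; `OsStr::to_string_lossy` is the real std call, and it swaps invalid sequences for U+FFFD, so the result is only good for showing to a human, not for opening the file again.

```rust
use std::path::Path;

/// Hypothetical helper: a human-readable version of the file name.
/// Invalid UTF-8 is replaced with U+FFFD, so this is for display only.
fn display_name(path: &Path) -> String {
    path.file_name()
        .map(|name| name.to_string_lossy().into_owned())
        .unwrap_or_default()
}
```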

> I simply get "error: invalid UTF-8 was detected in one or more arguments" and the application exits. It just refuses to work with non-UTF-8 files at all -- is this less sloppy?

Haha, touché. But yes, it's less sloppy. Would you prefer that the files were silently skipped? You've created your archive, you started the webserver, but you just can't get it to deliver the page you want.

In order for tarweb to support non-UTF-8 in filenames, the programmer has to actually think about what that means. I don't think it means doing a lossy conversion, because that's not what the file name was, and it's not merely for human display. And it should probably not be the bytes either, because tools will likely want to send UTF-8-encoded names.

Or they don't. In either case unless that's designed, implemented, and tested, non-UTF-8 in filenames should probably be seen as malformed input. For something that uses a tarfile for the duration of the process's life, that probably means rejecting it, and asking the user to roll back to a previous working version or something.
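As a sketch of what "rejecting it" could look like: check every name up front and fail on the first one that isn't UTF-8, instead of skipping or lossily converting it. This is not how tarweb does it; `NonUtf8Name` and `validate_names` are invented for the example, and `OsStr::to_str` is the only std call doing the real work.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical error type carrying the offending name unchanged.
#[derive(Debug)]
struct NonUtf8Name(OsString);

/// Hypothetical up-front check: either every name is valid UTF-8,
/// or the whole input is rejected with the first bad name.
fn validate_names<'a, I>(names: I) -> Result<Vec<&'a str>, NonUtf8Name>
where
    I: IntoIterator<Item = &'a OsStr>,
{
    names
        .into_iter()
        .map(|name| {
            name.to_str()
                .ok_or_else(|| NonUtf8Name(name.to_os_string()))
        })
        .collect()
}
```

The error keeps the original `OsString`, so the message shown to the user can still be produced with a lossy conversion while the decision to reject is made on the real bytes.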

> Forcing UTF-8 does not "fix" compatibility in strange edge cases

Yup. Still better than silently corrupting.

Compare this to how for Rust kernel work they apparently had to implement a new Vec equivalent, because dealing with allocation failures is a different thing in user and kernel space[1], and Vec push can't fail.
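To illustrate the contrast (this is plain user-space std, not the kernel's Vec replacement): `Vec::push` has no error return, but std does offer `Vec::try_reserve`, which reports allocation failure as a `Result`. A fallible push can be sketched on top of it; `try_push` here is a made-up helper name.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: push that surfaces allocation failure as an error
/// instead of aborting the process.
fn try_push<T>(v: &mut Vec<T>, value: T) -> Result<(), TryReserveError> {
    // Ask for room for one more element; this is the step that can fail.
    v.try_reserve(1)?;
    // Capacity is already reserved, so this push does not need to allocate.
    v.push(value);
    Ok(())
}
```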

Similarly, Go string operations cannot fail. And memory allocation has reasons to fail that string operations don't.

[1] A big separate topic. Almost nobody runs with overcommit off.

An error is better than silent corruption, sure.

But there is no silent corruption when you pass the data as opaque bytes, you just get some placeholder symbols when displayed. This is how I see the file in my terminal and I can rm it just fine.

And yes, question marks in the terminal are way better than applications not working at all.

The case of non-UTF-8 being skipped is usually a characteristic of applications written in languages that don't use bytes for their default string type, not the other way around. This has bitten me multiple times with Python2/3 libraries.

> There's no need to ever convert that to a "string".

Until you run into a crate that wants the filename in String form.