Comment by modeless

11 hours ago

UTF-8 is great and I wish everything used it (looking at you, JavaScript). But it does have a wart: there are byte sequences that are invalid UTF-8, and how to interpret them is left undefined. I think a perfect design would define exactly how to interpret every possible byte sequence, even if nominally "invalid". This is how the HTML5 spec works and it's been phenomenally successful.

> This is how the HTML5 spec works and it's been phenomenally successful.

Unicode does have a completely defined way to interpret invalid UTF-8 byte sequences: replace them with U+FFFD (the "replacement character"). You'll see it used (for example) in browsers all the time.
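
As a small illustrative sketch (Rust here, and the 0xFF byte is just an arbitrary example of a sequence that can never appear in well-formed UTF-8), lossy decoding substitutes U+FFFD and keeps going:

    fn main() {
        // 0xFF can never appear in well-formed UTF-8, so this slice is invalid.
        let bytes: &[u8] = b"abc\xFFdef";

        // Lossy decoding replaces each invalid sequence with U+FFFD and continues.
        let text = String::from_utf8_lossy(bytes);
        println!("{}", text); // prints "abc\u{FFFD}def"
    }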

Mandating acceptance of every invalid input works well for HTML because it's meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors instead of making an automatic correction that can't be detected after the fact.

For security reasons, the correct answer for how to process invalid UTF-8 is (and needs to be) "throw away the data like it's radioactive, and return an error." Otherwise you leave yourself wide open to validation bypass attacks at many layers of your stack.
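
A sketch of what that looks like in practice (same language and same arbitrary invalid byte as above): strict validation either accepts the whole input as well-formed or refuses to produce a string at all.

    fn main() {
        let bytes: &[u8] = b"abc\xFFdef";

        // Strict validation: either the entire input is well-formed UTF-8, or we
        // reject it with an error instead of silently "repairing" it.
        match std::str::from_utf8(bytes) {
            Ok(s) => println!("valid: {}", s),
            Err(e) => eprintln!(
                "invalid UTF-8 after {} valid bytes; rejecting input",
                e.valid_up_to()
            ),
        }
    }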

  • This is only true because the interpretation is not defined, so different implementations do different things.