
Comment by moefh


> This is how the HTML5 spec works and it's been phenomenally successful.

Unicode does have a completely defined way to handle invalid UTF-8 byte sequences: replace them with U+FFFD (the "replacement character"). Browsers, for example, use it all the time.
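
For concreteness, here's a minimal sketch of that lossy-but-total decoding strategy (Python is chosen only for illustration, and the byte string is made up):

```python
# An invalid UTF-8 sequence: \xc3 starts a 2-byte sequence,
# but \x28 ('(') is not a valid continuation byte.
data = b"valid text \xc3\x28 more text"

# Decode the way browsers do: replace the bad bytes with U+FFFD.
decoded = data.decode("utf-8", errors="replace")
print(decoded)               # 'valid text \ufffd( more text'
print("\ufffd" in decoded)   # True -- the damage is visible, but decoding never fails
```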

Mandating acceptance of every invalid input works well for HTML because HTML is meant to be consumed (primarily) by humans. It's not done for UTF-8 because in some situations it's much more useful to detect and report errors than to apply an automatic correction that can't be detected after the fact.
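
A small sketch of that other choice, again in Python with a made-up byte string, showing why strict decoding can matter:

```python
# A truncated 3-byte sequence (\xe2\x82 is the start of U+20AC but is cut off).
data = b"length-prefixed field: \xe2\x82"

# Strict decoding (the default) reports the problem to the caller.
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(f"invalid UTF-8 at byte {e.start}: {e.reason}")

# With errors="replace" the same input "succeeds", and nothing downstream
# can tell the original bytes were ever malformed.
print(data.decode("utf-8", errors="replace"))
```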