kafsemo.org

Name and labels

2022-09-19

Firefox 56 included a new character encoding implementation (“written in Rust”!) that follows the WHATWG Encoding Standard. The spec includes names and labels for the character encodings. It’s exhaustive — “User agents must not support any other encodings or labels.” Here’s an interesting bit:

Name            Labels
windows-1252    "iso-8859-1", "us-ascii", "windows-1252"

All three subtly different character encodings are now merged into one, named for the one with the largest repertoire. This made me kind of nostalgic: getting these encodings mixed up was a classic interoperability problem that required a certain amount of knowledge or tooling to identify and resolve. However, folding them together essentially solves that problem.
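
To make the merge concrete, here is a minimal Python sketch. The `LABEL_TO_NAME` dictionary and `resolve_label` function are hypothetical names, and the table holds only the three labels quoted above, not the spec's full list. The second half uses Python's own codecs (which keep the three encodings separate and are not the WHATWG decoders) to show how the same byte used to mean three different things.

```python
from typing import Optional

# Hypothetical excerpt of the label table: only the three labels quoted above.
LABEL_TO_NAME = {
    "iso-8859-1": "windows-1252",
    "us-ascii": "windows-1252",
    "windows-1252": "windows-1252",
}

def resolve_label(label: str) -> Optional[str]:
    """Map a label to an encoding name, ignoring case and surrounding
    ASCII whitespace. Returns None for labels missing from the excerpt."""
    return LABEL_TO_NAME.get(label.strip("\t\n\x0c\r ").lower())

# The "subtly different" part, using Python's separate legacy codecs.
raw = b"\x80"
as_cp1252 = raw.decode("cp1252")    # the euro sign, U+20AC
as_latin1 = raw.decode("latin-1")   # a C1 control character, U+0080
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:          # 0x80 is outside the 7-bit ASCII range
    ascii_ok = False
```

With the labels folded together, `resolve_label("ISO-8859-1")` and `resolve_label("us-ascii")` both give `"windows-1252"`, so a consumer following the table would decode byte 0x80 the same way whichever label the document declared.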

text/xml gotcha

Still, as long as your document’s published as text/xml, there’s another gotcha to be aware of: omitting charset doesn’t mean “autodetect,” it means US-ASCII (the default that RFC 3023 specified). But! RFC 6657 says:

Each subtype of the "text" media type that uses the "charset" parameter can define its own default value for the "charset" parameter, including the absence of any default.

And then RFC 7303 says:

If an XML MIME entity is received where the charset parameter is omitted, no information is being provided about the character encoding by the MIME Content-Type header. XML-aware consumers MUST follow the requirements in section 4.3.3 of [XML] that directly address this case.
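
As a rough illustration of the difference, the Python sketch below picks an encoding for an XML body given its Content-Type header. The function names and the `legacy_default` flag are made up for this example. The sniffing is a simplified reading of XML section 4.3.3 (byte order mark, then encoding declaration, then UTF-8), and it does not try to model every precedence rule in RFC 7303, such as how a byte order mark interacts with an explicit charset.

```python
import codecs
import re
from email.message import Message
from typing import Optional

def charset_param(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter, or None if it is omitted."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def sniff_xml_encoding(body: bytes) -> str:
    """Simplified sniffing: BOM first, then the encoding declaration,
    then the UTF-8 fallback. UTF-32 and BOM-less UTF-16 are not handled."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        body,
    )
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"

def effective_xml_charset(content_type: str, body: bytes,
                          legacy_default: bool = False) -> str:
    """Choose an encoding for an XML entity (hypothetical helper)."""
    declared = charset_param(content_type)
    if declared:
        return declared
    if legacy_default:
        # The old gotcha: a missing charset on text/xml meant US-ASCII.
        return "us-ascii"
    # The newer rule: the header says nothing, so look inside the document.
    return sniff_xml_encoding(body)
```

For a body starting with `<?xml version="1.0" encoding="windows-1252"?>` and a bare `text/xml` header, this returns `"windows-1252"`; with `legacy_default=True` the same input comes back as `"us-ascii"`, which is the trap described above.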

It’s easy to value knowing how to work around historic mistakes so much that you forget to push for fixing them. It’s good to see a couple of examples where that hasn’t happened. It’s nice to be able to move on from edge cases.

(Music: Santigold, “L.E.S. Artistes”)