kafsemo.org: 2005-02-28: URI Normalisation

As of draft-ietf-atompub-format-02.txt, the Atom syndication format strongly encourages that URIs used as identifiers be normalised (section 3.5, Identity Constructs). Although this isn’t yet a published specification, I’ve added a preparatory warning to the Feed Validator.

Since this is likely to be part of the spec, I thought I’d post a few additional notes that might be helpful for anyone writing code to produce or normalise identifiers for feeds.

Normalisation is easy, apart from the edge cases; however, there are a lot of edge cases. This is a great opportunity for test-driven development – by far the most useful artifact of my code is a test suite. I’m sure there are still bugs, but I have concrete documentation of the cases that are covered and immediate notice of any regressions. That simple list of URIs (along with the normalised form, if that differs) makes it easy to add cases, both initially and as issues are discovered. The Atom wiki examples cover most real-world cases (case of scheme, only percent-encoding where essential, UTF-8 normalisation, etc.), so here are a few other, less obvious examples to add to that test list.
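For anyone starting from scratch, this is roughly the shape I mean by a simple list of URIs. It's a minimal sketch rather than the validator's actual test code: the names CASES and run_cases are made up for illustration, and the entries are a handful of the examples from later in this post.

```python
# Each entry is (input, expected); None means the input is already normalised.
# CASES and run_cases are hypothetical names, not part of any library.
CASES = [
    ("http://www.w3.org/2000/01/rdf-schema#", None),
    ("http://example.com/?", None),
    ("http://USER:PASSWORD@EXAMPLE.COM/", "http://USER:PASSWORD@example.com/"),
    ("TAG:Example.Com,2004:Test", "tag:Example.Com,2004:Test"),
]


def run_cases(normalise, cases=CASES):
    """Return a list of (input, expected, actual) for every failing case."""
    failures = []
    for uri, expected in cases:
        if expected is None:
            expected = uri
        actual = normalise(uri)
        if actual != expected:
            failures.append((uri, expected, actual))
        # A normalised URI should survive a second pass unchanged.
        elif normalise(expected) != expected:
            failures.append((expected, expected, normalise(expected)))
    return failures


if __name__ == "__main__":
    # The identity function stands in for a real normaliser here, so only
    # the cases that actually need changing are reported.
    for uri, expected, actual in run_cases(lambda uri: uri):
        print(f"FAIL {uri}: expected {expected}, got {actual}")
```

Adding a newly discovered edge case is then a one-line change to the list, which is most of the appeal.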

Case Normalisation

While the host name should be in lower case, any user info should stay unaffected:

http://USER:PASSWORD@EXAMPLE.COM/ → http://USER:PASSWORD@example.com/

Although the tag URI spec “[RECOMMENDS] that the domain name should be in lowercase form,” normalisation is not at liberty to enforce this:

TAG:Example.Com,2004:Test → tag:Example.Com,2004:Test
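Both cases above fall out naturally if the code only ever lower-cases the scheme and the host part of an authority, and nothing else. Here is a rough sketch of that idea; normalise_case is a hypothetical helper, and it assumes percent-encoded octets in the host are dealt with in a separate step (lower-casing would affect their hex digits too).

```python
import re

# Split off the scheme, then an optional "//authority" (RFC 3986 section 3).
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(.*)$", re.DOTALL)
_AUTHORITY = re.compile(r"^//([^/?#]*)(.*)$", re.DOTALL)


def normalise_case(uri):
    """Lower-case the scheme and host; leave everything else alone."""
    m = _SCHEME.match(uri)
    if not m:
        return uri  # relative reference: no scheme or host to lower-case
    scheme, rest = m.group(1).lower(), m.group(2)
    a = _AUTHORITY.match(rest)
    if a:
        authority, tail = a.group(1), a.group(2)
        # User info is everything before the "@" and is case-sensitive.
        userinfo, at, hostport = authority.rpartition("@")
        rest = "//" + userinfo + at + hostport.lower() + tail
    return scheme + ":" + rest


if __name__ == "__main__":
    print(normalise_case("http://USER:PASSWORD@EXAMPLE.COM/"))
    print(normalise_case("TAG:Example.Com,2004:Test"))
```

A tag URI has no "//" authority, so the domain name in it is just part of an opaque path as far as generic normalisation is concerned, and is left as written.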

Empty Components

Empty queries and fragments are significant:

http://www.w3.org/2000/01/rdf-schema#
http://example.com/?

Both URIs are normalised, and distinct from:

http://www.w3.org/2000/01/rdf-schema
http://example.com/
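The practical consequence is that a parser needs to distinguish an empty component from an absent one, and it's worth checking whether your URI library does. One way to keep the distinction is to split with the regular expression from RFC 3986, Appendix B, and use None for anything missing. This is only a sketch; split_uri and join_uri are names I've invented for it.

```python
import re

# The component-splitting regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


def split_uri(uri):
    """Split into (scheme, authority, path, query, fragment).

    An absent component is None; an empty one is the empty string.
    """
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return m.group(2, 4, 5, 7, 9)


def join_uri(scheme, authority, path, query, fragment):
    """Recompose the components (RFC 3986 section 5.3), keeping empty ones."""
    out = ""
    if scheme is not None:
        out += scheme + ":"
    if authority is not None:
        out += "//" + authority
    out += path
    if query is not None:
        out += "?" + query
    if fragment is not None:
        out += "#" + fragment
    return out


if __name__ == "__main__":
    for uri in ("http://example.com/?", "http://example.com/"):
        print(uri, split_uri(uri))
```

With this representation, splitting and rejoining any of the four URIs above gives back exactly the original string.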

Reserved Characters

In my first iteration, I completely missed section 2.2 (‘Reserved Characters’) of RFC 3986; normalisation must leave reserved characters unaffected, whether or not they are percent-encoded in the original URI. These two are very similar, but not equivalent:

http://example.com/?q=1%2F2
http://example.com/?q=1/2

Thanks to Mark Carrington for reporting this case:

http://xxx/read?id=abc%26x%3Dz&x=y
http://xxx/read?id=abc&x=z&x=y

Of course, the parameters [id: ‘abc&x=z’, x: ‘y’] are completely different from [id: ‘abc’, x: ‘z’, x: ‘y’]. ‘&’ is a reserved character and changing ‘%26’ to ‘&’ doesn’t invalidate the URI, but does change how the URI is interpreted.
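In code, the rule comes down to decoding a percent-escape only when it stands for an unreserved character (RFC 3986, section 2.3) and otherwise leaving it in place, with the hex digits upper-cased. A minimal sketch of that, under the assumption that it is applied to one component at a time; normalise_escapes is a hypothetical name:

```python
import re
from urllib.parse import parse_qsl

# Unreserved characters from RFC 3986 section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalise_escapes(component):
    """Decode escapes of unreserved characters; upper-case all the others."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved, non-ASCII or otherwise unsafe: keep it percent-encoded.
        return "%" + match.group(1).upper()
    return _PCT.sub(fix, component)


if __name__ == "__main__":
    query = "id=abc%26x%3Dz&x=y"
    print(normalise_escapes(query))          # the escapes must survive
    print(parse_qsl(query))                  # two parameters
    print(parse_qsl("id=abc&x=z&x=y"))       # three parameters
```

The two parse_qsl calls show why the escapes matter: the over-eagerly decoded query parses to a different set of parameters.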

Here’s a case I’m still not sure about:

http://example.com/test#test#test → http://example.com/test#test%23test

‘#’ isn’t valid in a fragment; does percent-encoding it change how the URI is interpreted, or simply correct an error?

Path Segments

Empty path segments are significant and, as long as it’s percent-encoded, a path segment of ‘/’ is perfectly valid (as is ‘\’).

http://www.example.com//a//
http://example.com/%2F/
http://example.com/\/ → http://example.com/%5C/
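A sketch of how these three might be handled: split the path on literal slashes before touching any escapes, then percent-encode whatever isn't allowed in a segment. The helper names are made up, and the sketch assumes any ‘%’ already in the path begins a valid escape.

```python
import re

# Anything outside RFC 3986 "pchar"; "%" is allowed through so that
# existing escapes are left alone (an assumption of this sketch).
_NOT_PCHAR = re.compile(r"[^A-Za-z0-9\-._~!$&'()*+,;=:@%]")


def escape_segment(segment):
    """Percent-encode any character that may not appear in a segment."""
    return _NOT_PCHAR.sub(
        lambda m: "".join("%%%02X" % b for b in m.group().encode("utf-8")),
        segment,
    )


def normalise_path(path):
    """Escape each segment separately; never merge or drop segments."""
    return "/".join(escape_segment(s) for s in path.split("/"))


def segments(path):
    """List the segments of an absolute path without decoding them."""
    return path.split("/")[1:]


if __name__ == "__main__":
    for path in ("//a//", "/%2F/", "/\\/"):
        print(path, "->", normalise_path(path), segments(path))
```

Because the split happens on the raw string, ‘%2F’ stays a single segment and the empty segments in the first path are preserved.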

Lastly, a strong candidate for the least-readable URI possible.

aa1+-.:///?a1-._~!$&'()*+,;=:@/?#a1-._~!$&'()*+,;=:@/?

Trying it in Firefox under Linux highlights a misinteraction – URI schemes are looked up in GConf, which doesn’t allow the plus sign in paths. A more legitimate scheme where this may be a problem is svn+ssh.

(Music: Throwing Muses, “Teller”)