Hullo!
Last modified
Saturday, 21 August 2010 (#)
Modification timestamps are, along with filenames, pretty much the only piece of metadata that carries between systems. Sure, there’ll be a timezone screwup here and there, but I have files that were last changed several machines ago and have carried that stamp across.
The semantics are fairly simple: when you save a file it takes the current time. When you copy a file it keeps the modification time.
Web servers can carry this across the web in the Last-Modified header field. For example, at time of writing, this early web documentation was last changed in early 01995.
So here’s a question: if you download that file with your browser, are you saving or copying? Was the file on your local machine last modified in 1995 or just now?
Mozilla’s bug 178506 discusses exactly that question, with some passionate arguments about correctness, user expectation, interoperability and ease of testing.
It’s always illuminating to see the different levels of argument. Is this an obvious bug or a personal preference? It’s like an optical illusion. Once you’ve seen it one way, your brain really doesn’t want to flip back.
With so many files travelling through browsers, does throwing away metadata imply that one copy is different to the other? Is the user’s act of saving the creative act that should be recorded by their system? Why are times never simple?
How I read the web
Monday, 16 August 2010 (#)
The first time I tried Readability was in its Safari Reader incarnation in Safari 5. It’s invaluable, especially on sites following the “content banner” school of design.
There’s a post, and discussion, about the relationship between readers’ and producers’ control over formatting, which made me think about how their model matches how I read the web.
I use my own basic feed reader, which runs batched and produces a single page of content to read. It’s similar to river of news, similar to Planet, but not identical.
All styling is removed, and the browser’s scrollbars are a great indicator of how much is left, which in turn lets me judge how much to skim.
The river approach works well, especially for skimming. There’s the usual tension. Some feeds are for browsing - news, especially, or busy blogs. Others should work more like email: I really do want to read the backlog on some feeds, even if it’s months out of date. I’m still missing a trick for an elegant way to reconcile those modes.
BK-tree performance notes
Tuesday, 3 August 2010 (#)
The core of Mivvi is a fuzzy string search over a collection of known titles. Fuzzy means that typos, spelling mistakes and even minor changes don’t prevent a successful match. The Levenshtein distance measures the number of simple edits to transform one string into another. Mivvi’s algorithm is a simple linear search: each title is compared to the query and the closest one, within a reasonable distance, is used.
This isn’t quick and runtime increases linearly with the dataset: not great when the plan is to grow.
When I found a great write-up of BK-trees they seemed an ideal alternative. The concept and implementation are reassuringly simple.
My first step was benchmarking, rather than analysis. I measured the time taken to find a randomly-chosen title in a list of N titles, varying N to see how the algorithm scaled. I compared the original linear search to a BK-tree. Initial results were disappointing. The tree is faster, but not significantly.

Following the classic try-then-read cycle, I paid more attention to the article. I saw that the maximum distance affects the BK-tree’s performance. For a linear search this makes no difference.
By default Mivvi allows a large edit distance to catch major changes. Restricting the maximum edit distance to 1 makes a dramatic difference.

So dramatic that a change of scale is in order:

The BK-tree’s efficiency comes from restricting the search at each node to the surrounding ring of width D, the maximum distance. The wider the ring, the more of the search space we cover. An infinite maximum distance would result in an exhaustive search.
Different values for D show the change in performance:

The lower values offer a significant improvement over a linear search. Given that the BK-tree never performed worse, I’ll adopt it. There’s a tunable parameter that I can wind down to improve results after looking at the impact on accuracy.
In future, combining a low maximum distance with a separate word-based algorithm for more distant searches may offer the best performance.


![[Feed]](feed.png)