Scraping Wikipedia


RDF’s loosely-bound model makes it easy to mix and aggregate statements from diverse sources. As a concrete example, one episode listing for Spooks omits the (unofficial) episode titles, while another includes the titles but has no broadcast dates. As long as the identifiers are common, we can create a combination that has both.
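Because N-Triples is line-oriented, that combination can be as crude as concatenating the two serialisations and dropping duplicates. A minimal sketch, where the episode URI and literal values are invented for illustration:

```shell
# Two partial listings sharing a common (invented) episode identifier.
cat > titles.nt <<'EOF'
<http://example.org/spooks/1x01> <http://purl.org/dc/elements/1.1/title> "Thou Shalt Not Kill" .
EOF
cat > dates.nt <<'EOF'
<http://example.org/spooks/1x01> <http://purl.org/dc/elements/1.1/date> "2002-05-13" .
EOF

# Concatenate and de-duplicate: the merged graph has both statements.
cat titles.nt dates.nt | sort -u > merged.nt
grep -c 'spooks/1x01' merged.nt   # prints 2
```

This only works because the sources use the same URI for the episode and there are no blank nodes; blank-node labels aren't stable across files, so a graph with them needs a proper RDF tool to merge.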

Wikipedia’s page includes the titles, so all I need to do is turn that HTML into statements. Unfortunately, Wikipedia’s basic, presentational HTML doesn’t include much structure, so it’s time for scraping. As long as the markup is reasonably rich, XSLT over tidy’d output is often a good route. It’s not pretty or robust, but it doesn’t need to be.
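The pipeline is the usual one: tidy normalises the page into well-formed XML, and a stylesheet picks values out of it. A hedged sketch, using a tiny stand-in fragment and invented table structure rather than real Wikipedia markup (for a real page you'd first run something like tidy -q -asxml page.html > page.xhtml):

```shell
# A tiny, already-well-formed stand-in for a tidy'd episode table.
cat > episodes.xhtml <<'EOF'
<table>
  <tr><td>1.01</td><td>Thou Shalt Not Kill</td></tr>
  <tr><td>1.02</td><td>Looking After Our Own</td></tr>
</table>
EOF

# A stylesheet that pulls the title cell out of each row. The XPaths
# here are invented; real Wikipedia markup needs real XPaths.
cat > scrape.xsl <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//tr">
      <xsl:value-of select="td[2]"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

xsltproc scrape.xsl episodes.xhtml
```

One gotcha: real tidy output lives in the XHTML namespace, so the stylesheet needs to declare it (say, xmlns:h="http://www.w3.org/1999/xhtml") and qualify the paths as //h:tr, or the selects silently match nothing.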

(I see the structure’s recently changed, so I need to update my code. That’s public data for you.)

The result is hardly maintainable, but it does the job: this XSLT turns this data (when tidied) into this RDF. I’ve been using diff -u <(rapper - spooks-before.rdf) <(rapper - spooks-after.rdf) to confirm that, after the changes, I’m getting the same results as before. It’s merged in with the other data during scraping, and I get my proper titles.
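rapper (from Raptor) is doing the canonicalisation in that command; if the graphs are already in N-Triples, the same regression check works with nothing but coreutils, since statement order doesn’t matter and sorting makes the serialisations comparable line by line. A sketch with invented file names and statements:

```shell
# The same two statements, serialised in a different order.
printf '%s\n' '<http://example.org/a> <http://example.org/p> "x" .' \
              '<http://example.org/b> <http://example.org/p> "y" .' > before.nt
printf '%s\n' '<http://example.org/b> <http://example.org/p> "y" .' \
              '<http://example.org/a> <http://example.org/p> "x" .' > after.nt

# Order-insensitive comparison: no output means the graphs match.
# (<(...) is a bashism, as in the rapper command above.)
diff -u <(sort before.nt) <(sort after.nt)
```

As with the concatenation trick, this falls over on blank nodes, whose labels aren’t comparable across serialisations.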

Ideally I wouldn’t have to jump through these hoops. I presume that Template:coor is what Google Earth is using for some pretty amazing results, but I haven’t noticed much use of structured data from other templates.

Recent pushes towards free text and console input have made real progress in UI design; things like NSTokenField seem to represent an effort to provide more structure. Something that blends the free-form editing of a wiki with structured, extensible metadata and exports the whole thing as RDF would be enormously significant for the semantic web. The concept of a semantic wiki is already established, and there are some interesting references in the comments to this Bob DuCharme post. A real move to make entry as simple as Backpack can only improve take-up.

(Music: Blonde Redhead, “23”)