Tuesday, July 24, 2007

OSCON: Data-mining from Open APIs

My afternoon tutorial is Data-mining from Open APIs given by Toby Segaran. I'm currently stuck on the end of a row without access to a power socket, so we'll have to see how my battery holds out.

Toby Segaran talking about Data-mining

Having paged through the book, I'm a bit unsure I really want to be in this tutorial. Toby has started out with a fairly dry discussion of what data mining actually is, which isn't really that reassuring. I don't think this is going to be the fun, fast-paced tutorial on mashups that I was expecting. I think next year I should probably read the tutorial descriptions before picking what I'm going to, rather than waiting till I arrive at the conference.

Apparently he's got a book, "Programming Collective Intelligence", which covers a lot of the same ground as this tutorial, due out in August from O'Reilly.

Update: Oh, he's just started to talk about regression trees and the CART algorithm. Moving on from supervised regression trees, he's talking about unsupervised methods.

Update: He's basing his unsupervised data mining example on grouping blogs into hierarchical clusters. He's using Mark Pilgrim's Universal Feed Parser to harvest the data from the Technorati top one hundred blogs,

I'm sure you've all seen blogs...

with lots of Python code flipping across the screen he's building up a matrix of word occurrences and determining the distance between two blogs using the Euclidean distance between their word counts. I guess that's okay for a simple example, but there are other distance metrics: Manhattan, Tanimoto, Pearson correlation, Chebyshev and Spearman.
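Since I couldn't copy his code down, here's a minimal sketch of my own of the idea as I understood it. It is not Toby's code: the three "blogs" are made-up strings standing in for harvested feed text, and the function names are mine. It builds the word-count matrix and compares Euclidean distance with two of the alternatives (Manhattan, and a distance derived from the Pearson correlation).

```python
import math
import re
from collections import Counter

# Made-up stand-ins for blog text; the tutorial harvested real feeds.
blogs = {
    "gadgets": "phone camera phone battery review camera",
    "devices": "phone battery review laptop battery",
    "cooking": "recipe oven butter recipe flour",
}

def word_counts(text):
    """Count lowercase alphabetic words in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

counts = {name: word_counts(text) for name, text in blogs.items()}

# Shared vocabulary, so every blog gets a vector of the same length.
vocab = sorted(set().union(*counts.values()))

# One row per blog, one column per word in the vocabulary.
matrix = {name: [c[word] for word in vocab] for name, c in counts.items()}

def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences between two count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 minus the Pearson correlation, so similar vectors score near 0."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # correlation is undefined for a constant vector
    return 1.0 - cov / math.sqrt(var_a * var_b)

for other in ("devices", "cooking"):
    a, b = matrix["gadgets"], matrix[other]
    print(other, euclidean(a, b), manhattan(a, b), pearson_distance(a, b))
```

The point of trying more than one metric is that raw Euclidean distance is sensitive to how much a blog writes, whereas a correlation-based distance only cares whether the same words go up and down together.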

Update: He's showing how you can use dendrograms and k-means clustering to show how the blogs cluster up. I'm a bit frustrated here. This is good stuff, but his code is written in an eight-point font, black on white, and is almost totally unreadable.
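As I couldn't read the slides, here's a rough sketch of my own of the hierarchical half of this, which is what a dendrogram is drawn from. It is not the code from the tutorial: the four word-count vectors are invented, and merging clusters by the distance between their centroids is my assumption about one reasonable way to do it. It repeatedly merges the two closest clusters until only one tree is left.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hierarchical_cluster(rows):
    """Agglomerative clustering; returns nested tuples describing the merges.

    rows maps a name to its vector. Each working cluster is a pair of
    (tree so far, list of member vectors).
    """
    clusters = [(name, [vec]) for name, vec in rows.items()]
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters by centroid distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(centroid(clusters[i][1]), centroid(clusters[j][1]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        # Delete the higher index first so the lower one stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters[0][0]

def print_tree(tree, depth=0):
    """Crude text dendrogram: indentation shows nesting of merges."""
    if isinstance(tree, tuple):
        print("  " * depth + "+")
        for child in tree:
            print_tree(child, depth + 1)
    else:
        print("  " * depth + tree)

# Invented word-count vectors for four hypothetical blogs.
rows = {
    "gadgets": [1, 0, 2, 0],
    "devices": [2, 0, 1, 0],
    "cooking": [0, 2, 0, 1],
    "baking": [0, 1, 0, 2],
}

tree = hierarchical_cluster(rows)
print_tree(tree)
```

K-means works differently, in that you pick the number of clusters up front and shuffle points between them, so you get flat groups rather than a tree.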

Update: Okay, we've just broken for fifteen minutes for an unscheduled break...

Update: We just had about three quarters of an hour of network downtime. The network crash happened just as I was posting an update, which appears to be toast. I hate it when that happens...

Update: ...okay, he's just finished two hours early. Erm? What on Earth!?