The often deranged postings of yet another hacker, pretending to be an Astronomer, pretending to be a hacker who has written a book on iPhone Programming for O'Reilly Media.
It's the MAKE:Hardware Innovation Workshop this week, as well as Maker Faire Bay Area this weekend, and I'll be hanging around both talking about this and other things as well as doing some live demos. So come and talk to me if you see me, and you want to know more about connecting your iPhone to the open hardware world.
Back in November last year I spoke at Øredev in Malmö, Sweden, about location enabled sensors, and the video of the talk has just been put up onto the web by the organisers.
If you're interested in the topics discussed at Øredev, all the talks will eventually make it onto the website including another talk I did at the conference on visualisation which hasn't made it online quite yet.
I forgot to post a pointer to this at the time, but while I was out at O'Reilly's Where conference I was interviewed by Mike Hendrickson about location privacy, data leakage and data exhaust.
While I was out at the O'Reilly offices in Sebastopol earlier in the month I sat down with Dale Dougherty to talk about how to make iPhones and iPads talk to the open source world.
This article was originally published on the O'Reilly Radar.
Big data isn't just about multi-terabyte datasets hidden inside eventually-concurrent distributed databases in the cloud, or enterprise-scale data warehousing, or even the emerging market in data. It's also about the hidden data you carry with you all the time; the slowly growing datasets on your movements, contacts and social interactions.
Until recently, most people's understanding of what can actually be done with the data collected about us by our own cell phones was theoretical. There were few real-world examples. But over the last couple of years, this has changed dramatically. Courting hubris perhaps, but I must admit it's possible some of that was my fault, though I haven't been alone.
You probably think you know how much data you carry around with you on your cell phone. You'll certainly be aware of it if you've ever lost your phone, or had it stolen, or it's just plain stopped working. But there is a large amount of data in the background that isn't surfaced in the user interface.
We know about what I generally call primary data: calendars, address books, photographs, SMS messages and browser bookmarks. These are usually user generated, and we'd be pretty unhappy if we lost them. There is also the secondary data that the phone generates about us: call history, voice mail, usage information and records of our current and past locations. Most of what I'd call secondary data is surfaced to us in our phone's user interface. We generally can't change this sort of information without resetting the phone to a factory fresh condition; it's generated by the device for us, it's not something we generate ourselves.
But there is also what I refer to as tertiary data. This is data that, similar to the examples I mentioned above, is generated about us, rather than by us. Mostly, this data consists of cache files — data that is entirely necessary to you using the device, or significantly improves your user experience, but you don't necessarily know is there. At least until some hole is found in the operating system to expose that data layer to you. That's happened before, after all.
An obvious example is tucked in your photographs. Every picture you take is geotagged and date stamped, and if you publish your pictures to a photo-sharing site without stripping that information, you're leaking data. Back in 2007, when geotagged photographs of newly arrived helicopters at a U.S. Army base in Iraq were published to the Internet, they allowed insurgents to determine the exact location of the helicopters inside the compound and conduct a mortar attack. Four of the AH-64 Apaches on the flight line were destroyed in the attack.
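To make "stripping that information" concrete, here is a minimal sketch in Python that removes the metadata block from a JPEG before it is published. It assumes a well-formed baseline JPEG in which the Exif data (including any GPS tags) lives in APP1 segments ahead of the image data. The function name `strip_exif` is made up for illustration, and a real tool would handle more edge cases than this does.

```python
import struct

SOI = b"\xff\xd8"   # start-of-image marker that opens every JPEG
APP1 = 0xE1         # segment type that carries Exif (and XMP) metadata
SOS = 0xDA          # start-of-scan: compressed image data follows


def strip_exif(jpeg: bytes) -> bytes:
    """Return a copy of a JPEG with every APP1 metadata segment removed."""
    if jpeg[:2] != SOI:
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("malformed JPEG: expected a segment marker")
        marker = jpeg[pos + 1]
        if marker == SOS:
            # Everything from here on is image data; copy it untouched.
            out += jpeg[pos:]
            return bytes(out)
        # The two-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if marker != APP1:
            out += jpeg[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("malformed JPEG: no start-of-scan marker found")
```

Dropping the whole APP1 segment also discards harmless tags such as orientation, so this is the blunt version of the idea rather than a selective removal of only the GPS fields.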
Recently, there have been a number of high-profile cases of data leakage, and the one that has raised the most controversy, at least until the next time, is the social network Path.
Upon opening the Path application on your phone, it automatically uploaded your address book to Path's servers so it could find "friends" that you might want to connect to without asking for explicit permission to do so, or even implicit permission for that matter. Path has since apologized and updated its application so that it now asks permission before pushing your address book to its servers.
This was not data theft, but data leakage. You asked the application to accomplish something and didn't really ask yourself how it was doing it. While there are technical solutions that don't involve uploading your address book, the laziest solution is probably what you should have expected. I can almost hear the developers, "... we'll just upload the address book for now and switch to hashing later on when we have time."
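For the curious, the "switch to hashing later" option might look something like the sketch below. It is not what Path or anyone else actually shipped. The function names are hypothetical, and it assumes contacts are matched on e-mail address alone. The phone uploads only SHA-256 digests, and the server intersects them with digests of its existing members.

```python
import hashlib


def normalise_email(address: str) -> str:
    """Trim and lower-case an address so the same contact always hashes the same way."""
    return address.strip().lower()


def hash_contact(address: str) -> str:
    """Return the hex SHA-256 digest of a normalised e-mail address."""
    return hashlib.sha256(normalise_email(address).encode("utf-8")).hexdigest()


def hash_address_book(addresses):
    """On the phone: hash every entry, so only digests ever leave the device."""
    return {hash_contact(address) for address in addresses}


def find_friends(uploaded_hashes, member_hashes):
    """On the server: match uploaded digests against digests of existing members."""
    return uploaded_hashes & member_hashes
```

Plain hashing is only a partial fix. Identifiers drawn from a small space, phone numbers in particular, can be recovered by hashing every candidate and comparing, so treat this as the minimum rather than a guarantee of privacy.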
There has been a lot of comment that somehow the whole Path thing was unexpected. Realistically, that's not the case. It's not an isolated circumstance, either. To the best of my knowledge Hipster and other apps also tapped your address book behind the scenes without asking permission. Interestingly, there are other, less obvious, culprits. Applications that make use of Chillingo's "Crystal" game service, like Angry Birds, will in some circumstances also upload your address book. While there is a button to push, it is, at least for me, misleadingly labelled and doesn't suggest what's going to happen next.
Data leakage like this is not really a solvable problem at the user level, at least not in real-time. Having multiple permission boxes pop up at regular intervals is a bad design choice; users stop reading them, they lose importance and become ineffective. Just try using Microsoft Windows and you'll understand exactly what I mean. Modal interrupts should be reserved for vital time-critical issues. They're already used far too prolifically in iOS. Run the Mail application with multiple mail accounts configured when you're not connected to the network and that'll become instantly obvious. You'll be bombarded by error messages.
I did have a thought that you might be able to deploy a customized web proxy directly onto your mobile device and have all web requests directed through it. The proxy would sift through the outgoing network connections in a (semi-)Bayesian manner looking for data that you don't want transmitted and stop the application cold before it sends it to the remote server. Basically, it's acting as a reverse spam filter, or a smart firewall, depending on how you want to think about it.
I think that something like this could well be far more effective at stopping data leakage than the current solution, which Google has used on Android: Permissions pages when you initially install an application are all very well, but most people don't read them, and when you're installing an application you're not really thinking about why it might need certain permissions. However, you can be very clear about what data you're interested in not leaving your phone. One configuration page for the proxy, rather than multiple ones, every time you install an application. Like modal dialogs on the iPhone, you subconsciously start to ignore them, to your peril.
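A very rough sketch of the core check such a proxy would make is below. It is deliberately much simpler than the (semi-)Bayesian filter described above: it just measures what fraction of the values you've marked as sensitive turn up in an outgoing request body. The names `leak_score` and `should_block` and the 0.5 threshold are invented for illustration, and it assumes the proxy can see the request body in the clear.

```python
from urllib.parse import unquote_plus


def leak_score(payload: bytes, sensitive_values) -> float:
    """Fraction of the user's sensitive values that appear in an outgoing payload."""
    # Undo form/URL encoding so "alice%40example.com" matches "alice@example.com".
    text = unquote_plus(payload.decode("utf-8", errors="replace")).lower()
    values = [value.lower() for value in sensitive_values if value]
    if not values:
        return 0.0
    hits = sum(1 for value in values if value in text)
    return hits / len(values)


def should_block(payload: bytes, sensitive_values, threshold: float = 0.5) -> bool:
    """Block the request when too much of the protected data is about to leave."""
    return leak_score(payload, sensitive_values) >= threshold
```

The single list of sensitive values stands in for the one configuration page: you say once what shouldn't leave the phone, and every application's traffic is checked against it.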
Location, location, location
Of course, I can't really talk about data leakage without mentioning the kerfuffle surrounding location and data privacy that happened just about this time last year. Unsurprisingly, the file in question still exists, despite some of the press stories; the existence of the file was never the problem. A cache of that nature is fairly necessary if you want to have reliable and timely location services on your phone. However, the file is now actually just that, a cache, and it is regularly swept clean by the operating system. It's also not included in your usually unencrypted backups to your laptop, which was perhaps more of a problem than the fact it wasn't being cleared out in the first place.
What Apple was doing was taking a piece of tertiary data, generated about you by the device, and then exposing it on a platform (laptop or desktop) where accessing that data was easier. There are a lot of people who know how to navigate a file system on a computer, but a lot fewer who would know how to get the same data directly from the phone itself. It was a classic case of data leakage: data moved from a secure(ish) environment on the phone to a less secure one on the computer.
Data exhaust
Back in the days of floppy disks, the lines of ownership were pretty clear. If you had the disk, the data was yours. If someone else had it, it was theirs. Things these days are much blurrier. That tertiary data — data that's generated about us but not by us — doesn't just build up on your mobile devices of course. Other people are building datasets about our patterns of movement, buying decisions, credit worthiness and other things. The ability to compile these sorts of datasets left the realm of major governments with the invention of the computer.
We're all aware of this, and there's even a provocative buzzword to describe it: data exhaust. It's the data we leave behind us, rather than carry with us.
In the U.S., data from grocery store loyalty schemes has been used by security services to search for terrorist suspects. Turns out the number of toilet rolls you buy can be quite telling.
Which does make me think, instead of being afraid of the data exhaust, perhaps we should embrace it. In the U.K., the biggest retailer is the supermarket Tesco. Like many, I spend a good fraction of my income there, and like almost everyone I know, I have a Tesco Clubcard. This is a loyalty card that has a record of (almost) every purchase I make, from toilet rolls to roast chicken.
I'd actually pay good money for a copy of my own Clubcard data, so long as it was actually in a machine-readable format, not on paper. Although for Tesco, the data is only really interesting in aggregate; it's the fact that they have millions of Clubcard records that makes the dataset useful to the company. To me, a history of my purchases would be useful data.
Of course, people have already started selling our data exhaust back to us. Think about your credit report, for instance.
Keep your friends close, and your enemies closer
It's not just your own data exhaust that you have to worry about. There was an interesting paper recently by Adam Sadilek of the Department of Computer Science at the University of Rochester. It talked about how geotagged tweets could be used to locate individuals, even if they themselves didn't geotag their tweets — it was enough that their friends did so.
Geotagged messages on Twitter during a typical weekday afternoon in New York City.
The paper found that only a couple of weeks' worth of location data on an individual, combined with location data from their two most-sharing friends, was enough to place that person within a 100-meter radius with 77% accuracy. That rises to nearly 85% when you combine information from nine friends.
Even someone who has never shared their location at all can be pinpointed with 47% accuracy from information available from two friends. That goes up to 57% when you include nine friends.
Data sharing
There is a great debate going on right now, which is really only starting to surface into the mainstream press, about how we share data. Despite social networks becoming mainstream, the recent privacy debacles in the mobile space say a lot about how users perceive information privacy. I think Sadilek's paper presents even more compelling evidence.
For instance, I'm finding Google's new Instant Upload feature, where photos taken on my phone are automatically uploaded to Google+ behind the scenes, a lot spookier and more worrying than I thought I would. It's especially interesting that I'm feeling that way, as I'm using Apple's Photo Stream without thinking or worrying about it that much.
I'm trying to figure out whether it's because the privacy trade-off — in Apple's case sharing my photos between all my devices, and in Google's case making my photos more-or-less instantly available for sharing in Google+ — is more obviously in my favor with Photo Stream, or it's for other reasons.
The interesting thing here is that Photo Stream and Instant Upload are, at least behind the scenes, effectively identical. Both are cloud services and your photos are stored in a data center somewhere. The master copies of your photos have essentially been moved to the cloud, rather than residing on your device.
However, because of the context these two services operate in, I have no problems with one, and I'm finding the other an uncomfortable fit. I think there's a big lesson there for people dealing with personal information. When you're sharing someone's information, even with their informed consent, the context shapes how they think about the implications of that sharing.
Building platforms
So, all of this got me thinking. There are large personal datasets about me, and you, and everyone, being built up by large companies. But we're also building up datasets about ourselves, in our own control. What happens if we mash them together? Can we actually do something productive?
I'm currently running an interesting experiment with my credit card and my iPhone. I'm scraping my bank's website to grab transaction data in near real-time onto one of my servers. Each transaction comes with a postcode. This is like a U.S. zip code, but it normally specifies a much smaller neighborhood, perhaps down to a single street or smaller in a major urban area.
Watching my credit card transactions in real-time.
On my iPhone, I'm running an application that continually monitors my location using the Significant Location Change service, so my phone knows my location to better than 1km (perhaps much better in a crowded city) more or less all of the time.
Every time a new transaction occurs, I forward it via push notification from the back-end server to my iPhone. Now, my iPhone knows both the location where the transaction took place and where I actually was at the time. If those locations don't match, then this indicates there might have been a fraudulent transaction and it flags it for me with a notification.
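The comparison at the heart of that check can be sketched in a few lines of Python. This is not the code running on my phone or server. It assumes the transaction's postcode has already been turned into a latitude and longitude by some lookup that isn't shown, and the 5km tolerance is an arbitrary placeholder chosen to sit comfortably above the roughly 1km accuracy of the phone's location fix.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_suspicious(transaction_pos, phone_pos, tolerance_km=5.0):
    """Flag a transaction made further from the phone than the tolerance allows."""
    return haversine_km(*transaction_pos, *phone_pos) > tolerance_km
```

Each position is a `(latitude, longitude)` pair in degrees. The tolerance needs to absorb both the size of the postcode area and the uncertainty in the phone's position, so in practice it would want tuning rather than hard-coding.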
The interesting thing here is that I'm using data that my credit card company doesn't have, and hopefully will never have: my actual physical location when the transaction took place. They couldn't possibly provide this service to me because they simply don't have the data I have.
Of course, there are false positives. Online transactions in particular stand out. Most of these are tagged with a postcode of the headquarters of the company I'm dealing with. However, my next development step will be to give my back-end server code access to my inbox and allow it to scrape for online transaction receipts. This should reduce the false-positive rate down to something vanishingly small, and I should be able to deal with those left over with some sort of machine learning. After all, there's a human-readable string attached to each transaction that details the retailer and sometimes other useful information.
A thought to ponder
A thought to ponder in the dead of night: In the near future, the absence of data is going to be increasingly unusual. If you think the data exhaust you leave behind yourself is wide and varied, then just you wait, because we're at the banging-the-rocks-together stage right now.
If your data exhaust becomes assumed, what happens if you turn your phone off for an hour or two one night? What if you're accused of a murder during that time period, and you can't prove where you were? Perhaps in the future that's going to be sufficiently unusual that it's automatically suspicious. Innocent until proven guilty may underlie our current legal system, but that's because our current legal system was codified in a very different era, one that was data poor rather than data rich. Perhaps in the future, the absence of data will imply guilt.
I spent most of last week at O'Reilly Media's Strata Conference on Big Data in Santa Clara, where I was talking about the data you carry with you. Something I've started to call migratory data.
Big data isn’t just about multi-terabyte data sets hidden inside eventually-concurrent distributed databases in the cloud, or enterprise-scale data warehousing, or even the emerging market in data. It’s also about the hidden data you carry with you all the time, about the slowly growing data sets on your movements, contacts and social interactions.
Until recently most people’s understanding of what can actually be done with the data collected about us by our own cell phones was theoretical; there were few real-world examples. But over the last couple of years this has changed dramatically.
Interview with Mac Slocum at Strata
Mac Slocum caught up with me about half way through the week to talk about the data on your cell phone, and other mobile devices, along with the possibilities for making use of that hidden data to reveal things about our lives that we might not realise ourselves, about data privacy and data leakage, and about where all this might lead.
A couple of days ago I received a BiKN for iPhone case and tags. With just over a week between ordering and it arriving on my desk, and considering the big body of water in the way, I was pretty impressed by the rapid delivery.
BiKN is advertised as a smart case that allows you to "Find your stuff". Effectively it's a wireless based sensor network which makes use of tags and uses your iPhone as the hub of the network.
BiKN for iPhone.
I was also fairly impressed with the packaging, which had been designed so that the instructions on how to charge and use the tags and case were just there. You didn't have to go looking. Everything also seemed to come fully charged, which is always nice.
The inside of the box cover has the instructions.
Plugging my iPhone into the case, the standard iOS prompt to download the app associated with the accessory from the App Store popped up. Clicking on it took me directly to the BiKN app as you'd expect. It all worked as advertised.
However there is very little information about how the system works on the BiKN site, so I took one of my tags apart to find out. I was curious about whether they were using an 802.15.4 mesh network, active RFID, or something else. It turned out my hunch was correct and BiKN are using an 802.15.4 network between the tags and the phone case. For those of you in the Maker world, that's the same underlying protocol as the Digi XBee chips use. However BiKN is using a Jennic chipset instead of the almost ubiquitous Digi one.
Taking a BiKN tag apart.
There are really only two identifiable ICs on the tag's board. The first is a 4Mbit MXIC MX25V4006E serial flash package, which offers a standard serial interface with single or dual I/O and runs from a single 2.5V or 3V supply.
One side of the BiKN board.
However the really interesting find is visible when you lift the battery. The tag is running from a Jennic JN5148-001 wireless micro-controller. It is a 32-bit RISC processor, with 128kB each of ROM and RAM onboard to support the 802.15.4 networking stack and on-chip user applications. It looks like the BiKN uses a JenNet-IP based networking solution. This is essentially an enhanced 6LoWPAN network so it's extremely generic. I'm going to be surprised if tags are the only hardware to get BiKN-enabled in the future. This is a serious bid to build an (at least semi-) standards-based Internet of Things from Treehouse Labs.
The other side of the BiKN board.
It's a nicely set out board and interestingly, if you look at the side holding the flash package, there seem to be exposed Rx and Tx lines on the board, which sort of hints that some sort of serial access might be available. I'm presuming these are going to be a serial pass-through for the micro-controller, but I could be wrong.
The only real criticism I have at this point is the build quality. In fact I'm very disappointed with the quality of the materials that they've used. The case feels cheap and plastic, and the tags are bulky and likewise have a cheaply-made plastic feel. These don't measure up to the iPhone build quality at all. I was expecting a much nicer rubberised look-and-feel.
I don't normally use a case with my iPhone; the phone itself is very hard to scratch and lives in my pocket with my loose change without any (well, much to speak of) damage. I was hopeful I could live with the BiKN case well enough that I could leave it on my phone all the time, but its bulky and cheap plastic feel has ruled that out. The case will have to live on my desk and that means I'll have to swap the phone in and out of it when I want to use BiKN. Which probably means I'll use it a lot less than I otherwise would have done.
My only other problem at this point is that, about two times in five, when I open the BiKN app it fails to register the presence of the hardware case. Those times it does register, the case battery shows as fairly healthy, so despite the suggestion to charge the case to rectify this, I don't think that's the problem. I've been in touch with the company and they're suggesting this might be a general problem that they're going to address with a forthcoming firmware update, so hopefully that's just an initial teething problem that's going to go away.
Their choice of micro-USB connector for the case, rather than a 30-pin pass-through, is also a bit irritating. I can see why they did it: having been involved around the edges of Apple's MFi programme in the past, I'm presuming using the patent-laden 30-pin connection could seriously reduce their margins. But I've got lots of spare 30-pin cables around, and not many micro-USB ones. Charging the phone, and the case, is therefore somewhat problematic.
Finally then, I'm going to be really interested to see if they come out with a publicly available SDK, as I think that'll be the make-or-break for the product. It's moderately interesting without it, but with a decent SDK so developers can integrate it into their own apps? That's much more interesting. At this stage of the product they should be looking for network effects, but they should be looking for them from the developer community, not end-users. A public SDK would go a long way towards filling what seems to be a gaping hole in the product right now.
One of the things I've never talked about in this blog, mainly because I talked about it enough elsewhere, is the iPhone Tracking scandal and the bizarre reality distortion effect that went on during those couple of weeks back in April.
Mac Slocum finally managed to corner me at OSCON last week and got me to talk about things now everything has settled down.
Interview with Mac Slocum at OSCON
One of the things I mentioned during the interview as one of the really positive things to come out of the iPhone Tracking debacle is the visualisations from the crowdflow.net group. If you haven't already seen them, I'd encourage you to take a look at what they've managed to pull out of the data.
I seem to have spent a lot of time in front of the camera while I was at OSCON last week. Amongst other things I talked about connecting iOS devices to the real world using the new Redpark serial cable for iOS.
Demonstration of the cable
Interview with Mac Slocum talking about the implications