Showing posts with label Academia. Show all posts

Tuesday, April 02, 2013

Leaving the ivory tower

I've been planning to leave academia for some time, but kept on putting it off. Unlike in the U.S., where tenure is a thing pursued vigorously by the great and the good, here in the U.K. at least it has long gone to dust. But my job was as permanent as they get, and actually left me a lot of time to do other things outside it that interested me...

Looking out over the Pacific
However, recently I took a look around and discovered that everything that was getting me up in the morning had nothing to do with my day job, and everything to do with what I was doing outside it. That just isn't any way to live. So, I've just pushed the big red switch. I now have a long rope and will be using it to leave the ivory tower real soon. My last day here at the University of Exeter is later this week, Thursday the 4th of April.

I was originally planning to take a couple of months off to look around, mainly because I'm in the fortunate position that I can do that, and such opportunities shouldn't be wasted. However, some Tesla-driving individuals said "Yes!" and I'm now working on something that's going to swallow my life for the next couple of months.

However, I'm not complaining; it's just the sort of getting-out-of-bed project that I'm quitting academia to do in the first place. You'll be hearing more about it shortly, just as soon as I can talk about it...

In the short to medium term I'm planning on staying freelance, and doing consulting, contracting, writing or anything else that'll pay the bills and keep the wolves from the door. Although I'm not opposed to the idea of joining a (large) company, I've just spent thirteen years working for someone else, and it'll be nice to work for myself for a while. Or at least be nearer the top of the tree, as you can generally see the rest of the forest much better from there. That said, it doesn't mean I'm not open to offers; they'd just have to be interesting offers.

So, while I've got a large number of things that might come off, I'm interested in work. Preferably work of substance, but beggars can't be choosers.

I've done a number of (some quite infamous) things with iOS, and have a lot of experience on the app side of things. I have done a number of things that are now generally being lumped into the "Big Data" camp. While I'm not a Hadoop and NoSQL guy, I've done some interesting work with machine learning and agent architectures, mostly to do with distributed sensor networks. I'm a hardware guy, or at least I'm an Arduino guy, and have done a number of other things to do with that increasingly ubiquitous hardware platform.

I like playing with mobile platforms, hardware, software, sensors, 3D printers and data visualisation. Or preferably all of the above at the same time. A good example of this is the work on the Data Sensing Lab I've been doing for O'Reilly.

Basically I'm an emerging technology guy. If it's new and a lot of people know nothing about it, I probably know something or am learning about it right now. Then I generally write a book about it and move on to the next emerging technology. I like being on the cutting edge. It's interesting out here. Oh yes, I also helped discover the most distant astronomical object yet found: a gamma-ray burster at a redshift of 8.2. However, I'm not so sure that's a useful skill outside of the ivory tower.

In summary, then: I write, I code, I speak, and I'm always willing to offer advice on things I know about.

Update: It has just been pointed out to me that I foresaw my own exit from academia some seven or eight years ago, back when I was still having fun in my day job, "...so what happens when I stop having fun? I'll probably have to sit down and make enough license plates so I don't have to worry about that stuff again."

Wednesday, August 15, 2012

Mining the astronomical literature

This post was originally published on the O'Reilly Radar.

There is a huge debate right now about making academic literature freely accessible and moving toward open access. But what would be possible if people stopped talking about it and just dug in and got on with it?

NASA's Astrophysics Data System (ADS), hosted by the Smithsonian Astrophysical Observatory (SAO), has quietly been working away since the mid-'90s. Without much, if any, fanfare amongst the other disciplines, it has moved astronomers into a world where access to the literature is just a given. It's something they don't have to think about all that much.

The ADS service provides access to abstracts for virtually all of the astronomical literature. But it also provides access to the full text of more than half a million papers, going right back to the start of peer-reviewed journals in the 1800s. The service has links to online data archives, along with reference and citation information for each of the papers, and it's all searchable and downloadable.
Number of papers published in the three main astronomy journals each year. CREDIT: Robert Simpson.
The existence of the ADS, along with the arXiv pre-print server, has meant that most astronomers haven't seen the inside of a brick-built library since the late 1990s.

It also makes astronomy almost uniquely well placed for interesting data mining experiments, experiments that hint at what the rest of academia could do if they followed astronomy's lead. The fact that the discipline's literature has been scanned, archived, indexed and catalogued, and placed behind a RESTful API makes it a treasure trove, both for hypothesis generation and sociological research.

For example, the .Astronomy series of conferences is a small workshop that brings together the best and the brightest of the technical community: researchers, developers, educators and communicators. Billed as "20% time for astronomers," it gives these people space to think about how new technologies affect both how research is done and how it is communicated to their peers and to the public.

It should perhaps come as little surprise that one of the more interesting projects to come out of a hack day held as part of this year's .Astronomy meeting in Heidelberg was work by Robert Simpson, Karen Masters and Sarah Kendrew that focused on data mining the astronomical literature.

The team grabbed and processed the titles and abstracts of all the papers from the Astrophysical Journal (ApJ), Astronomy & Astrophysics (A&A), and the Monthly Notices of the Royal Astronomical Society (MNRAS) since each of those journals started publication, and that's 1827 in the case of MNRAS.

By the end of the day, they'd found some interesting results showing how various terms have trended over time. The results were similar to what's found in Google Books' Ngram Viewer.
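To give a feel for what this sort of trend analysis involves, here is a minimal sketch in Python. It is not the team's code: it assumes the titles and abstracts have already been fetched into a list of (year, text) pairs, and the `papers` sample, the function name `term_fraction_by_year` and the plain case-insensitive substring match are all illustrative assumptions.

```python
from collections import defaultdict


def term_fraction_by_year(papers, term):
    """Return {year: fraction of that year's papers whose text mentions term}.

    `papers` is an iterable of (year, text) pairs. Matching is a simple
    case-insensitive substring test, which is an assumption of this sketch.
    """
    totals = defaultdict(int)  # papers seen per year
    hits = defaultdict(int)    # papers mentioning the term per year
    needle = term.lower()
    for year, text in papers:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    # Normalise by the number of papers so busy years don't dominate.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical toy corpus standing in for real titles and abstracts.
papers = [
    (1995, "Hubble observations of a distant cluster"),
    (1995, "A survey of nearby dwarf galaxies"),
    (2005, "Spitzer and Hubble imaging of dusty galaxies"),
    (2005, "Hubble constraints on galaxy evolution"),
]

trend = term_fraction_by_year(papers, "hubble")
print(trend)
```

Dividing by the number of papers published each year matters here, because the raw number of papers has grown enormously since the 1800s.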
The relative popularity of the names of telescopes in the literature. Hubble, Chandra and Spitzer seem to have taken turns in hogging the limelight, much as COBE, WMAP and Planck have each contributed to our knowledge of the cosmic microwave background in successive decades. References to Planck are still on the rise. CREDIT: Robert Simpson.
Since the meeting, however, Robert has taken his initial results further and continued to explore his new corpus of data on the astronomical literature. He's explored various visualisations of the data, including word matrices for related terms and for various astrochemistry terms.
Correlation between terms related to Active Galactic Nuclei (AGN). The opacity of each square represents the strength of the correlation between the terms. CREDIT: Robert Simpson.
He's also taken a look at authorship in astronomy and is starting to find some interesting trends.
Fraction of astronomical papers published with one, two, three, four or more authors. CREDIT: Robert Simpson.
You can see that single-author papers dominated for most of the 20th century. Around 1960, we see the decline begin, as two- and three-author papers begin to become a significant chunk of the whole. In 1978, multi-author papers become more prevalent than single-author papers.
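The numbers behind a plot like this are straightforward to compute once you have an author list for each paper. The following Python sketch shows one way to do it, assuming a list of (year, authors) pairs. The sample data, the function name `author_count_fractions` and the choice to lump everything at or above four authors into one bucket are assumptions made for illustration, not a description of Robert's code.

```python
from collections import Counter, defaultdict


def author_count_fractions(papers, max_bucket=4):
    """Return {year: {bucket: fraction}} for author counts 1..max_bucket.

    Papers with max_bucket or more authors share the last bucket.
    Papers with an empty author list are skipped.
    """
    counts = defaultdict(Counter)
    for year, authors in papers:
        if not authors:
            continue
        bucket = min(len(authors), max_bucket)
        counts[year][bucket] += 1

    fractions = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        fractions[year] = {
            bucket: counter[bucket] / total
            for bucket in range(1, max_bucket + 1)
        }
    return fractions


# Hypothetical toy data: (year, list of author names).
papers = [
    (1950, ["A"]),
    (1950, ["B"]),
    (1950, ["C", "D"]),
    (1950, ["E"]),
    (2012, ["A", "B", "C", "D", "E"]),
    (2012, ["F"]),
    (2012, ["G", "H"]),
    (2012, ["I", "J", "K"]),
]

fractions = author_count_fractions(papers)
print(fractions[1950])
print(fractions[2012])
```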
Compare the number of "active" research astronomers to the number of papers published each year (across all the major journals). CREDIT: Robert Simpson.
Here we see that people begin to outpace papers in the 1960s. This may reflect the fact that as we get more technical as a field, and more specialised, it takes more people to write the same number of papers, which is a sort of interesting result all by itself.


Behind the project and what lies ahead


I recently talked with Rob about the work he, Karen Masters, and Sarah Kendrew did at the meeting, and the work he's been doing since with the newly gathered data.

What made you think about data mining the ADS?

Robert Simpson: At the .Astronomy 4 Hack Day in July, Sarah Kendrew had the idea to try to do an astronomy version of BrainSCANr, a project that generates new hypotheses in the neuroscience literature. I've had a go at mining ADS and arXiv before, so it seemed like a great excuse to dive back in.

Do you think there might be actual science that could be done here?

Robert Simpson: Yes, in the form of finding questions that were unexpected. With such large volumes of peer-reviewed papers being produced daily in astronomy, there is a lot being said. Most researchers can only try to keep up with it all — my daily RSS feed from arXiv is next to useless, it's so bloated. In amongst all that text, there must be connections and relationships that are being missed by the community at large, hidden in the chatter. Maybe we can develop simple techniques to highlight potential missed links, i.e. generate new hypotheses from the mass of words and data.

Are the results coming out of the work useful for auditing academics?

Robert Simpson: Well, perhaps, but that would be tricky territory in my opinion. I've only just begun to explore the data around authorship in astronomy. One thing that is clear is that we can see a big trend toward collaborative work. In 2012, only 6% of papers were single-author efforts, compared with 70+% in the 1950s.
The above plot shows the average number of authors per paper since 1827. CREDIT: Robert Simpson.
We can measure how large groups are becoming, and who is part of which groups. In that sense, we can audit research groups, and maybe individual people. The big issue is keeping track of people through variations in their names and affiliations. Identifying authors is probably a solved problem if we look at ORCID.

What about citations? Can you draw any comparisons with h-index data?

Robert Simpson: I haven't looked at h-index stuff specifically, at least not yet, but citations are fun. I looked at the trends surrounding the term "dark matter" and saw something interesting. Mentions of dark matter rise steadily after it first appears in the late '70s.
Compare the term "dark matter" with a few other related terms: "cosmology," "big bang," "dark energy," and "wmap." You can see cosmology has been getting more popular since the 1990s, and dark energy is a recent addition. CREDIT: Robert Simpson.
In the data, astronomy becomes more and more obsessed with dark matter: the term appears in 1% of all papers by the end of the '80s and 6% today. Looking at citations changes the picture. The community is writing papers about dark matter more and more each year, but they are getting fewer citations than they used to (the peak for this was in the late '90s). These trends are normalised, so the only recency effect I can think of is that dark matter papers take more than 10 years to become citable. Either that or dark matter studies are currently in a trough for impact.

Can you see where work is dropped by parts of the community and picked up again?

Robert Simpson: Not yet, but I see what you mean. I need to build a better picture of the community and its components.

Can you build a social graph of astronomers out of this data? What about (academic) family trees?

Robert Simpson: Identifying unique authors is my next step, followed by creating fingerprints of individuals at a given point in time. When do people create their first-author papers, when do they have the most impact in their careers, stuff like that.

What tools did you use? In hindsight, would you do it differently?

Robert Simpson: I'm using Ruby and Perl to grab the data, MySQL to store and query it, JavaScript to display it (Google Charts and D3.js). I may still move the database part to MongoDB because it was designed to store documents. Similarly, I may switch from ADS to arXiv as the data source. Using arXiv would allow me to grab the full text in many cases, even if it does introduce a peer-review issue.

What's next?

Robert Simpson: My aim is still to attempt real hypothesis generation. I've begun the process by investigating correlations between terms in the literature, but I think the power will be in being able to compare all terms with all terms and looking for the unexpected. Terms may correlate indirectly (via a third term, for example), so the entire corpus needs to be processed and optimised to make it work comprehensively.
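
As a rough illustration of the first step Robert describes, here is a minimal Python sketch of pairwise term correlation. The interview doesn't say which correlation measure he uses, so this sketch assumes a Pearson correlation between per-abstract presence indicators (1 if the term appears, 0 if not). The sample abstracts, the term list and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def presence_vector(abstracts, term):
    """Return a 0/1 list marking which abstracts mention the term."""
    needle = term.lower()
    return [1 if needle in text.lower() else 0 for text in abstracts]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A term in every abstract, or in none, has no defined correlation;
        # this sketch reports 0.0 in that case.
        return 0.0
    return cov / sqrt(var_x * var_y)


def term_correlations(abstracts, terms):
    """Return {(term_a, term_b): correlation} for every pair of terms."""
    vectors = {term: presence_vector(abstracts, term) for term in terms}
    return {
        (a, b): pearson(vectors[a], vectors[b])
        for a, b in combinations(terms, 2)
    }


# Hypothetical toy abstracts.
abstracts = [
    "Accretion onto the black hole powers the quasar",
    "A quasar jet launched near a black hole",
    "Dust lanes in a nearby spiral galaxy",
    "Star formation in a spiral galaxy",
]

corr = term_correlations(abstracts, ["quasar", "black hole", "spiral"])
for pair, value in corr.items():
    print(pair, round(value, 2))
```

This only covers direct pairwise correlations. Finding terms that are linked indirectly, via a third term, would mean computing the full all-against-all matrix and then looking for pairs that correlate weakly with each other but strongly with a shared neighbour.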

Science between the cracks


I'm really looking forward to seeing more results coming out of Robert's work. This sort of analysis hasn't really been possible before. It's showing a lot of promise both from a sociological angle, with the ability to do research into how science is done and how that has changed, and ultimately as a hypothesis engine, something that can generate new science in and of itself. This is just a hack day experiment. Imagine what could be done if the literature were more open and this sort of analysis could be done across fields.

Right now, a lot of the most interesting science is being done in the cracks between disciplines, but the hardest part of that sort of work is often trying to understand the literature of the discipline that isn't your own. Robert's project offers a lot of hope that this may soon become easier.