Saturday 8 June 2013

Data Scraping Wikipedia

I often use datasets found on Wikipedia as illustrative examples when I’m teaching or playing around with new statistical techniques. It is probably one of the easiest places to find varied, real-world data. The major downside is actually getting the data into a format that I can manipulate in R. Generally this results in a lot of copy-and-pasting and manual translation of the data. I've never been happy with this method. However, I recently discovered two extremely simple ways around this problem.

Method 1: Google Docs (or any other spreadsheet program)

The first and simplest method is a one-line solution using Google Docs spreadsheets. Specifically, it relies on the ImportHTML function.

Let’s say that I want to import the breakdown of per capita alcohol consumption by country (link). This is a fairly well formatted table, so I could probably just copy and paste it into my spreadsheet program du jour, but this is a cleaner solution. Open up a Google Docs spreadsheet and type the following function call into one of the cells:

=ImportHTML("http://en.wikipedia.org/wiki/List_of_countries_by_alcohol_consumption","table",2)

The syntax of this function is easy enough to understand. The first argument is the address of the HTML page you want to pull the table from. The second argument tells the function to look for a table (as opposed to a list object). The third argument specifies which table you want. On Wikipedia pages, table 1 is typically the table of contents (if there is one), which is why the call above asks for table 2. From here it is a simple matter of exporting the spreadsheet in your favorite format and loading it into R.
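
As a rough sketch of that last step, assuming the spreadsheet was exported as a CSV file, loading it into R could look something like the following. The temporary file here only stands in for the exported file, and its column names and values are placeholders rather than real figures from the Wikipedia table.

```r
# Stand-in for the CSV exported from the spreadsheet.
# The column names and values are placeholders, not real Wikipedia data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Country,Total",
             "Country A,1.5",
             "Country B,2.5"), csv_path)

# Read the file, keeping text columns as character rather than factors
alcohol <- read.csv(csv_path, stringsAsFactors = FALSE)

# Check that the columns came through with sensible types
str(alcohol)
```

With a real export you would replace csv_path with the path to the downloaded file.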

I personally use Google Docs but I know that most spreadsheet programs have some sort of HTML/XML scraping function that works well with the predictably formatted tables of Wikipedia.

Method 2: XML Library in R

The XML library in R replicates this functionality in only a few lines of code. The readHTMLTable function is the analog of ImportHTML from Google Docs. This thread at Stack Overflow provides a good breakdown of this function.
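
Here is a minimal sketch of how readHTMLTable can be used. To keep it runnable without a network connection, it parses a small inline HTML string that stands in for a Wikipedia page; the page contents, column names, and values are made up for illustration. The which argument plays the same role as the third argument of ImportHTML.

```r
library(XML)

# A tiny inline page standing in for a Wikipedia article.
# Table 1 mimics a table of contents; table 2 holds the (placeholder) data.
html_text <- '
<html><body>
<table><tr><td>Contents</td></tr></table>
<table>
  <tr><th>Country</th><th>Total</th></tr>
  <tr><td>Country A</td><td>1.5</td></tr>
  <tr><td>Country B</td><td>2.5</td></tr>
</table>
</body></html>'

# Parse the HTML text into a document object
doc <- htmlParse(html_text, asText = TRUE)

# which = 2 selects the second table; header = TRUE uses the first row as names
alcohol <- readHTMLTable(doc, which = 2, header = TRUE,
                         stringsAsFactors = FALSE)

# Scraped cells arrive as text, so convert numeric columns explicitly
alcohol$Total <- as.numeric(alcohol$Total)

str(alcohol)
```

Against a live page you would pass the page address (or a downloaded copy of the page) instead of the inline string. Depending on your setup, the XML package may not be able to fetch https addresses directly, in which case downloading the page first is a reasonable workaround.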

Example

Below is an example graphic I made comparing per capita alcohol consumption to each country's gross national income (GNI) per capita as defined by the World Bank. Even though there isn't much of a pattern in this data, being able to throw examples like this together in about 10 minutes could definitely help add some flavor to a statistics class.


Source: http://bugbee.me/blog/2013/4/5/data-scraping-wikipedia
