Sunday, 26 May 2013

Data scraping with YQL and jQuery

For a project that I’m currently working on I need a list of all the US National Parks in XML format. Google didn’t come up with anything, so I decided that I would need to somehow grab the data from this list on Wikipedia. The problem is that the list is in messy HTML, but I want some nice clean XML ready for parsing with E4X in Flash.

There are a number of ways I could parse the data. If I knew Ruby and had an environment set up I’d probably use hpricot. Or I could get my hands dirty again with PHP and DOMDocument. Or if the Wikipedia page were XML, or could easily be converted into XML, then I could use an XSL transform. Or I’m sure there are hundreds of other approaches… But in this case I just wanted a quick and easy script to grab and translate the data so I could get on with the rest of the project.

That’s when I thought of using jQuery to parse the data – it is the perfect tool for navigating an HTML DOM and extracting information from it. So I wrote a script which would use AJAX to load the page from Wikipedia. And that’s where I hit the first hurdle: “Access to restricted URI denied” – you can’t make cross-domain AJAX calls because of security restrictions in the browser :(
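My first attempt looked something along these lines (just a sketch of the failing call, not code from the final example); it’s this request that the browser refuses to make:

    // Naive first attempt: fetch the Wikipedia page directly with AJAX
    $.get(
        'http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state',
        function (html) {
            // We never get here - the browser's same-origin policy blocks the
            // cross-domain request and throws the error above instead
            console.log('Loaded ' + html.length + ' characters of HTML');
        }
    );
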
At this point I had at least a couple of ways to proceed with my jQuery approach:

    Copy the HTML file from Wikipedia to my server, thus avoiding the cross-domain issues.
    Write a quick server-side redirect script to live on my server which would grab the page from Wikipedia and echo it back out.

I didn’t like the idea of either of those options, but luckily at this point I remembered reading about YQL:

    The YQL platform provides a single endpoint service that enables developers to query, filter and combine data across Yahoo! and beyond.

After a quick flick through the documentation and some testing in the YQL Console I put together a query which grabs the relevant page from Wikipedia and exposes the result via a JSONP call, which lets us get around the cross-domain AJAX issues. As an added extra, it’s really easy to add some XPath to your YQL, so I’m grabbing only the relevant table from the Wikipedia document, which cuts down on the complexity of my JavaScript. Here’s what I ended up with:
SELECT * FROM html WHERE url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state" AND xpath="//table[@class='wikitable sortable']"
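
Outside of the console the same query is run against YQL’s public REST endpoint, with the query URL-encoded into the q parameter. Roughly (the angle-bracketed parts below are placeholders, not literal values):

    http://query.yahooapis.com/v1/public/yql
        ?q=<the query above, URL-encoded>
        &format=xml          (or format=json)
        &callback=<JSONP callback name, only needed for the JSON/JSONP case>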

If you run this query in the console you’ll see that it grabs the relevant table from Wikipedia and returns it as XML or JSON. From here it’s easy to make the AJAX call from jQuery and then loop over the returned JSON, creating an XML document. If you are interested in the details you can check out the complete example.
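
If you’d rather not click through, the jQuery side looks roughly like the sketch below. The endpoint and parameters are the public YQL ones; the property names in the loop (results.table.tr, row.td, a.content) are my guesses at how YQL serialises this particular table into JSON, so check them against the real response:

    // Build the YQL query and send it to the public endpoint as JSONP
    // ("callback=?" tells jQuery to generate a JSONP callback for us)
    var query = 'SELECT * FROM html' +
        ' WHERE url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state"' +
        ' AND xpath="//table[@class=\'wikitable sortable\']"';
    var url = 'http://query.yahooapis.com/v1/public/yql' +
        '?q=' + encodeURIComponent(query) + '&format=json&callback=?';

    $.getJSON(url, function (data) {
        // YQL wraps whatever the XPath matched inside query.results;
        // here I'm assuming a single table whose rows arrive as an array of tr objects
        var rows = data.query.results.table.tr,
            xml = '<parks>\n';
        $.each(rows, function (i, row) {
            // Header rows only contain th cells so they're skipped; the park name
            // is read from the link in the first td - adjust to the real structure
            if (row.td && row.td[0] && row.td[0].a) {
                xml += '  <park name="' + row.td[0].a.content + '"/>\n';
            }
        });
        xml += '</parks>';
        console.log(xml);
    });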

I was really impressed with how easy it was to quickly figure out YQL and I think it’s a really useful service. Even if you just use it to convert an HTML page into a valid XML document, it is still invaluable for all sorts of screen-scraping purposes (it’s always much easier to parse XML than HTML tag soup). One improvement I’d love to see is the addition of a CSS selector engine as well as the XPath one. And the documentation could maybe be clearer (I figured out the above query by checking examples on other blogs rather than by reading the docs). But overall I give Yahoo! a big thumbs up for YQL and look forward to using it again soon…


Source: http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/
