Wednesday 29 May 2013

Usefulness of Web Scraping Services

For any business or organization, surveys and market research play an important role in the strategic decision-making process. Data extraction and web scraping techniques are important tools for finding relevant data and information for your personal or business use. Many companies employ people to copy and paste data manually from web pages. This process is reliable but very costly, as it wastes time and effort: the amount of data collected is small compared with the resources and time spent gathering it.

Nowadays, various data mining companies have developed effective web scraping techniques that can crawl thousands of websites and pages to harvest particular information. The extracted information is then stored in a CSV file, database, XML file, or any other required format. After the data has been collected and stored, data mining can be used to extract the hidden patterns and trends it contains. By understanding the correlations and patterns in the data, policies can be formulated, thereby aiding the decision-making process. The information can also be stored for future reference.

The following are some common examples of the data extraction process:

• Scraping a government portal to extract the names of citizens eligible for a given survey
• Scraping competitor websites for feature data and product pricing
• Using web scraping to download videos and images for a stock photography site or for website design

Automated Data Collection
It is important to note that web scraping allows a company to monitor changes to website data over a given time frame and to collect that data routinely. Automated data collection techniques are quite important as they help companies to discover customer trends and market trends. By determining market trends, it is possible to understand customer behavior and predict how the data is likely to change.

The following are some examples of automated data collection:

• Monitoring price information for particular stocks on an hourly basis
• Collecting mortgage rates from various financial institutions on a daily basis
• Checking weather reports regularly, as required

By using web scraping services it is possible to extract any data that is related to your business. The data can then be downloaded into a spreadsheet or a database to be analyzed and compared. Storing the data in a database, or in another required format, makes it easier to interpret the data, understand the correlations, and identify the hidden patterns.
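As a rough illustration, a short script along the following lines could fetch a page, pull out the values you care about, and write them to a CSV file ready for a spreadsheet (the URL, table id, and column names here are placeholders rather than a real service):

import csv
import urllib
import lxml.html

# Placeholder URL and XPath -- substitute the page and elements you actually need
url = "http://www.example.com/products"
doc = lxml.html.parse(urllib.urlopen(url)).getroot()

rows = []
for row in doc.xpath("//table[@id='products']//tr"):
    cells = [cell.text_content().strip() for cell in row.xpath("td")]
    if cells:
        rows.append(cells)

# Write the extracted rows to a CSV file that can be opened in a spreadsheet
writer = csv.writer(open("products.csv", "wb"))
writer.writerow(["name", "price"])
writer.writerows(rows)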

Through web scraping it is possible to get quicker and more accurate results, saving considerable money and time. With data extraction services it is possible to fetch pricing, mailing, database, profile, and competitor data on a consistent basis. With the emergence of professional data mining companies, outsourcing these services greatly reduces your costs while assuring you of high-quality service.


Source: http://ezinearticles.com/?Usefulness-of-Web-Scraping-Services&id=7181014

Sunday 26 May 2013

Data scraping with YQL and jQuery

For a project that I’m currently working on I need a list of all the US National Parks in XML format. Google didn’t come up with anything so I decided that I would need to somehow grab the data from this list on Wikipedia. The problem is that the list is in messy HTML but I want some nice clean XML ready for parsing with E4X in Flash.

There are a number of ways I could parse the data. If I knew Ruby and had an environment set up I’d probably use hpricot. Or I could get my hands dirty again with PHP and DOMDocument. Or if the Wikipedia page was XML or could be converted into XML easily then I could use an XSL transform. Or I’m sure there are hundreds of other approaches… But in this case I just wanted to very quickly and easily write a script which would grab and translate the data so I could get on with the rest of the project.

That’s when I thought of using jQuery to parse the data – it is the perfect tool for navigating an HTML DOM and extracting information from it. So I wrote a script which would use AJAX to load the page from Wikipedia. And that’s where I hit the first hurdle: “Access to restricted URI denied” – you can’t make cross-domain AJAX calls because of security restrictions in the browser :(
At this point I had at least a couple of ways to proceed with my jQuery approach:

    1. Copy the HTML file from Wikipedia to my server, thus avoiding the cross-domain issues.
    2. Write a quick server-side redirect script to live on my server, grab the page from Wikipedia and echo it back out.

I didn’t like the idea of either of those options but luckily at this point I remembered reading about YQL:

    The YQL platform provides a single endpoint service that enables developers to query, filter and combine data across Yahoo! and beyond.

After a quick flick through the documentation and some testing in the YQL Console I put together a script which would grab the relevant page from Wikipedia and convert it into a JSONP call, which allows us to get around the cross-domain AJAX issues. As an added extra it’s really easy to add some XPath to your YQL, so I’m grabbing only the relevant table from the Wikipedia document, which cuts down on the complexity of my JavaScript. Here’s what I ended up with:
SELECT * FROM html WHERE url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state" AND xpath="//table[@class='wikitable sortable']"

If you run this code in the console you’ll see that it grabs the relevant table from Wikipedia and returns it as XML or JSON. From here it’s easy to make the AJAX call from jQuery and then loop over the JSON returned, creating an XML document. If you are interested in the details of that you can check out the complete example.
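As an aside, you can also test the query outside the browser by sending it straight to YQL’s public REST endpoint. Here’s a minimal Python sketch of that, assuming the query.yahooapis.com endpoint and the query/results layout shown in the console (this is just a way of checking the query, not part of the jQuery approach described above):

import json
import urllib

yql = ('SELECT * FROM html WHERE '
       'url="http://en.wikipedia.org/wiki/List_of_United_States_National_Parks_by_state" '
       'AND xpath="//table[@class=\'wikitable sortable\']"')

# Build the request against the public YQL endpoint, asking for JSON back
params = urllib.urlencode({"q": yql, "format": "json"})
url = "http://query.yahooapis.com/v1/public/yql?%s" % params

response = json.loads(urllib.urlopen(url).read())
print response["query"]["results"]   # the scraped table, ready to loop over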

I was really impressed with how easy it was to quickly figure out YQL and I think it’s a really useful service. Even if you just use it to convert an HTML page to a valid XML document it is still invaluable for all sorts of screen scraping purposes (it’s always much easier to parse XML than HTML tag soup). One improvement I’d love to see is the addition of a CSS-style selection engine alongside the XPath one. And the documentation could maybe be clearer (I figured out the above script by checking examples on other blogs rather than by reading the docs). But overall I give Yahoo! a big thumbs up for YQL and look forward to using it again soon…


Source: http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/

Saturday 18 May 2013

Wikipedia Scraper API Script Copy Content Data Easily

Easy Wikipedia API is a top-notch Wikipedia scraper API script which copies content from the Wikipedia site when the user enters key phrases in the search box. The API displays the result pages neatly in tooltips, making it easy for your visitors to get the data without opening the Wikipedia site in the browser. Access speed is comparatively good since it uses very advanced coding.

The API is used not only to access the Wikipedia site; it also fetches data from movie sites, information sites and image sites. So you will get data for every type of keyword you enter in the search box.
Another special feature is that you can add a group to acquire more specific results from third-party websites. All you need to do is add the following markup around the word on your website that you want to fetch data for:

<span class="easy_wiki"> and </span>
The API always uses cached data to speed up the process, and the search results are also kept on the local system for a certain period of time.
Another special feature of this software is that you can choose the language of the local wiki, so that whatever you type in the input box, your results will be in your local language, such as German or French.

Advanced CSS classes are used to make the tooltip results colorful, so that customers are impressed with the searched results.
Whenever this script fetches data, it also fetches the logo, images and other digital assets available on that wiki page.
This software can fetch Wikipedia content in languages such as English, Français, Deutsch, Español and Português.

Source: http://www.scriptsdump.com/wikipedia-scraper-api/

Wednesday 15 May 2013

How to scrape and parse Wikipedia

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (that panel on the right hand side of the page).

The first job was for me to design a Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if you’ve got a spare hour you can check out the discussion considering its deletion between the immovable object (American cavers who believe cave locations are secret) and the immovable force (Wikipedian editors who believe that you can’t have two templates for the same thing, except when they are in different languages).

But let’s get on with some Wikipedia parsing. Here’s what doesn’t work:

import urllib
print urllib.urlopen("http://en.wikipedia.org/wiki/Aquamole_Pot").read()

because it returns a rather ugly error, which at the moment is: “Our servers are currently experiencing a technical problem.”

What they would much rather you do is go through the Wikipedia API and get the raw source code in XML form without overloading their servers.

To get the text from a single page requires the following code:

import lxml.etree
import urllib

title = "Aquamole Pot"

params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v)  for k, v in params.items())
url = "http://en.wikipedia.org/w/api.php?%s" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')

print "The Wikipedia text for", title, "is"
print revs[-1].text

Note how I am not using urllib.urlencode to convert params into a query string. This is because the standard function converts all the ‘|’ symbols into ‘%7C’, which the Wikipedia API doesn’t accept.
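A quick illustration of the difference (the exact ordering of the pairs may vary, since dicts are unordered):

import urllib

title = "Aquamole Pot"
params = {"format": "xml", "titles": "API|%s" % urllib.quote(title)}

# urllib.urlencode percent-escapes the pipe (and re-escapes the %20 we already added) ...
print urllib.urlencode(params)
# e.g. titles=API%7CAquamole%2520Pot&format=xml

# ... whereas joining the pairs by hand leaves the values untouched
print "&".join("%s=%s" % (k, v) for k, v in params.items())
# e.g. titles=API|Aquamole%20Pot&format=xml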

Running the script above gives the following result:

{{Infobox ukcave
| name = Aquamole Pot
| photo =
| caption =
| location = [[West Kingsdale]], [[North Yorkshire]], England
| depth_metres = 113
| length_metres = 142
| coordinates =
| discovery = 1974
| geology = [[Limestone]]
| bcra_grade = 4b
| gridref = SD 698 784
| location_area = United Kingdom Yorkshire Dales
| location_lat = 54.19082
| location_lon = -2.50149
| number of entrances = 1
| access = Free
| survey = [http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]
}}
'''Aquamole Pot''' is a cave on [[West Kingsdale]], [[North Yorkshire]],
England which was first discovered from the
bottom by cave diving through 550 feet of
sump from [[Rowten Pot]] in 1974....

This looks pretty structured. All ready for parsing. I’ve written a nice complicated recursive template parser that I use in wikipedia_utils, which makes it easy to extract all the templates from the page in the following way:

import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")

title = "Aquamole Pot"

val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res               # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave    # prints just the ukcave infobox

This now produces the following Python data structure that is almost ready to push into our database — after we have converted the length and depths from strings into numbers:

{0: 'Infobox ukcave', 'number of entrances': '1',
 'location_lon': '-2.50149',
 'name': 'Aquamole Pot', 'location_area': 'United Kingdom Yorkshire Dales',
 'geology': '[[Limestone]]', 'gridref': 'SD 698 784', 'photo': '',
 'coordinates': '', 'location_lat': '54.19082', 'access': 'Free',
 'caption': '', 'survey': '[http://cavemaps.org/cavePages/West%20Kingsdale__Aquamole%20Pot.htm cavemaps.org]',
 'location': '[[West Kingsdale]], [[North Yorkshire]], England',
 'depth_metres': '113', 'length_metres': '142', 'bcra_grade': '4b', 'discovery': '1974'}
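That conversion is only a few lines. Here’s a minimal sketch, assuming the field names above and a caveinfo table keyed on the page title (caveinfo is the table queried further down; the save call mirrors the one used for cavepages):

cave = dict(infobox_ukcave)
cave.pop(0, None)        # drop the positional entry holding the template name
cave["title"] = title

# convert the length and depth fields from strings into numbers
for field in ("length_metres", "depth_metres"):
    try:
        cave[field] = float(cave[field])
    except (KeyError, ValueError):
        cave[field] = None

scraperwiki.sqlite.save(["title"], cave, "caveinfo")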

Right. Now to deal with the other end of the problem. Where do we get the list of pages with the data?

Wikipedia is, unfortunately, radically categorized, so Aquamole_Pot is inside Category:Caves_of_North_Yorkshire, which is in turn inside Category:Caves_of_Yorkshire, which is then inside Category:Caves_of_England, which is finally inside Category:Caves_of_the_United_Kingdom.

So, in order to get all of the caves in the UK, I have to iterate through all the subcategories and all the pages in each category and save them to my database.

Luckily, this can be done with:

lcavepages = wikipedia_utils.GetWikipediaCategoryRecurse("Caves_of_the_United_Kingdom")
scraperwiki.sqlite.save(["title"], lcavepages, "cavepages")

All of this adds up to my current scraper wikipedia_longest_caves, which extracts those infobox tables from caves in the UK and puts them into a form where I can sort them by length, producing a table based on the query SELECT name, location_area, length_metres, depth_metres, link FROM caveinfo ORDER BY length_metres DESC.

If I were being smart I could make the scraping adaptive, that is, only updating the pages that have changed since the last scrape by using all the data returned by GetWikipediaCategoryRecurse(), but it’s small enough at the moment.

So, why not use DBpedia?

I know what you’re saying: Surely the whole of DBpedia does exactly this, with their parser?

And that’s fine if you don’t mind your updates coming no more often than every 6 months, which prevents you from getting any prompt feedback when adding new caves into Wikipedia, like Aquamole_Pot.

And it’s also fine if you don’t want to be stuck with the naïve semantic web notion that the boundaries between entities are a simple, straightforward and general concept, rather than what they really are: probably the one deep and fundamental question within any specific domain of knowledge.

I mean, what is the definition of a singular cave, really? Is it one hole in the ground, or is it the vast network of passages which link up into one connected system? How good do those connections have to be? Are they defined hydrologically by dye tracing, or is a connection defined as the passage of one human body getting itself from one set of passages to the next? In the extreme cases this can be done by cave diving through an atrocious sump which no one else is ever going to do again, or by digging and blasting through a loose boulder choke that collapses in days after one nutcase has crawled through. There can be no tangible physical definition. So we invent the rules for the definition. And break them.

So while theoretically all the caves on Leck Fell and Easgill have been connected into the Three Counties System, we’re probably going to agree to continue to list them as separate historic caves, as well as in some sort of combined listing. And that’s why you’ll get further by treating knowledge domains as special cases.

Source: http://blog.scraperwiki.com/2011/12/07/how-to-scrape-and-parse-wikipedia/