Saturday, 23 May 2015

5 tips for scraping big websites

Scraping bigger websites can be a challenge if done the wrong way.

Bigger websites would have more data, more security and more pages. We’ve learned a lot from our years of crawling such large complex websites, and these tips could help solve some of your challenges

1. Cache pages visited for scraping

When scraping big websites, its always a good idea to cache the data you have already downloaded. So you don’t have to put load on the website again, in case you have start over again or that page is required again during scraping. Its effortless to cache to a key value store like redis.But Databases and filesystem caches are also good.

2. Don’t flood the website with large number of parallel requests , take it slow

Big websites posses algorithms to detect webscraping, large number of parallel requests from the same ip address will identify you as a Denial Of Service Attack on their website, and blacklist your IPs immediately. A better idea is to time your requests properly one after the other, giving it some human behavior. Oh!.. but scraping like that will take you ages. So balance requests using the average response time of the websites, and play around with the number of parallel requests to the website to get the right number.

3. Store the URLs that you have already fetched

You may want to keep a list of URLs you have already fetched, in a database or a key value store. What would you do if you scraper crashes after scraping 70% of the website. If you need to complete the remaining 30%, with out this list of URLs, you’ll waste a lot of time and bandwidth. Make sure you store this list of URLs

somewhere permanent, till you have all the required data. This could also be combined with the cache. This way, you’ll be able to resume scraping.

4. Split scraping into different phases

Its easier and safer if you split the scraping into multiple smaller phases. For example, you could split scraping a huge site into two. One for gathering links to the pages from which you need to extract data and another for downloading these pages to scrape content.

5. Take only whats required

Don’t grab or follow every link unless its required. You can define a proper navigation scheme to make the scraper visit only the pages required. Its always tempting to grab everything, but its just a waste of bandwidth, time and storage.

Source: http://learn.scrapehero.com/5-tips-for-scraping-big-websites/

No comments:

Post a Comment