The Sssssssscripting blog wrote yesterday about writing your own web crawler (in Ruby), and I immediately remembered that I have done similar projects in the past. The only difference is that I developed web scrapers that were used to monitor the e-commerce market. Believe me, when you run an online shop, being able to look at your rivals’ prices in a simple way is priceless.
The web scrapers I wrote were not very dynamic: each of them was made for one specific website. It may not seem like very good practice, but having different spiders for different sites helps a lot: the source code is dramatically simpler, and tweaking one spider doesn’t affect the others.
Currently, I have two spiders running every week (they have been running for half a year). They have one task: go to website X and get all the information about the products it sells. The product price is the most valuable piece of information, so I cache it for later use (for features like “price in other stores”, etc.).
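To make the caching idea concrete, here is a minimal sketch of what such a price cache could look like. It assumes Python and SQLite, neither of which is stated above, and the table layout and function names (open_cache, save_prices, latest_price) are hypothetical, chosen only for illustration.

```python
import sqlite3
from datetime import date

def open_cache(path=":memory:"):
    """Open (or create) the price cache. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        " store TEXT, product TEXT, price REAL, scraped_on TEXT,"
        " PRIMARY KEY (store, product, scraped_on))"
    )
    return conn

def save_prices(conn, store, products, scraped_on=None):
    """Store one scraping run; products is a list of (name, price) pairs."""
    scraped_on = scraped_on or date.today().isoformat()
    conn.executemany(
        "INSERT OR REPLACE INTO prices VALUES (?, ?, ?, ?)",
        [(store, name, price, scraped_on) for name, price in products],
    )
    conn.commit()

def latest_price(conn, store, product):
    """Return the most recently scraped price, or None if unknown."""
    row = conn.execute(
        "SELECT price FROM prices WHERE store = ? AND product = ?"
        " ORDER BY scraped_on DESC LIMIT 1",
        (store, product),
    ).fetchone()
    return row[0] if row else None
```

Keeping the scrape date in the key means a weekly run adds a new row per product instead of overwriting the old one, so the price history stays available for later comparisons.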
Generally, writing a web scraper is not very hard if you can write regular expressions quickly (the web scrapers I currently run are only 100 lines each). Getting information from a website is extremely easy if you know how to use regular expressions: website structure doesn’t change frequently, so correctly written expressions can last for years. Or until the server administrator blocks you, but if you really need this information, you can use N anonymous proxies.
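As an illustration of the regular-expression approach, here is a small sketch in Python. The HTML snippet, the class names and the extract_products helper are all made up for the example; a real shop’s markup will differ, and the pattern has to be written against whatever the target site actually serves.

```python
import re

# Hypothetical markup, standing in for a real product listing page.
SAMPLE_HTML = """
<div class="product"><span class="name">Blue Mug</span><span class="price">4.99</span></div>
<div class="product"><span class="name">Red Kettle</span><span class="price">24.50</span></div>
"""

# One pattern per site: a product name followed directly by its price.
PRODUCT_RE = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">(?P<price>\d+(?:\.\d+)?)</span>'
)

def extract_products(html):
    """Return a list of (name, price) tuples found in the HTML."""
    return [
        (match.group("name").strip(), float(match.group("price")))
        for match in PRODUCT_RE.finditer(html)
    ]

products = extract_products(SAMPLE_HTML)
```

The pattern is deliberately tied to one page layout. That is the trade-off described above: it breaks if the site changes its markup, but it stays short and easy to adjust.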
Creating such a spider involves analysing the website’s structure and HTML source code. A web scraper is basically “automatic copy and paste”, so if you know where to look for information on a website, you can simulate that in code. For example, pseudo-code for a price-getting spider could look like this:
foreach category in categories
    goto {category} page
    pages = get {category} pages list
    foreach page in pages
        information = extract products from {page}
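The loop above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the pagination and product patterns describe a hypothetical page layout, and the names fetch and scrape are invented for the example. The download function is passed in as a parameter so the loop can be tried against canned pages without touching a real site.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# Hypothetical patterns; each real site needs its own.
PAGE_LINK_RE = re.compile(r'<a class="page" href="([^"]+)"')
PRODUCT_RE = re.compile(r'<li class="product">([^<]+) - (\d+(?:\.\d+)?)</li>')

def scrape(category_urls, fetch=fetch):
    """Visit every page of every category and collect (name, price) pairs."""
    products = []
    for category_url in category_urls:
        first_page = fetch(category_url)
        # The category page itself, plus every pagination link it lists.
        page_urls = [category_url] + [
            urljoin(category_url, href)
            for href in PAGE_LINK_RE.findall(first_page)
        ]
        for page_url in page_urls:
            # Reuse the already downloaded first page instead of refetching it.
            html = first_page if page_url == category_url else fetch(page_url)
            for name, price in PRODUCT_RE.findall(html):
                products.append((name.strip(), float(price)))
    return products
```

Calling scrape with a list of category URLs returns a flat list of (name, price) pairs, ready to be stored for later comparison.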
Some people may think that this is not legal. It probably is. If you look at the Legal issues section in this article, you will find that:
Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear.
In my opinion, web scraping should be treated as legal if and only if you don’t directly use the scraped information on your website. Scraping news, blog entries, etc. and showing them on your site is a bad thing, but scraping information and using it somewhere in your back-end script is perfectly legal in my book. What do you think?







