Web scraping: an easy way to monitor the market

Posted February 14th, 2009 by Juozas

Yesterday the Sssssssscripting blog wrote about writing your own web crawler (in Ruby), and I immediately remembered that I have done similar projects in the past. The only difference is that I developed web scrapers to monitor the e-commerce market. Believe me, when you run an online shop, having a simple way to look at your rivals’ prices is priceless.

The web scrapers I wrote were not very dynamic: each one was made for a specific website. It may not seem like good practice, but having different spiders for different sites helps a lot, because the source code is dramatically simpler and tweaking one spider doesn’t affect the others.

Currently, I have two spiders running every week (they have been running for half a year). They have one task: go to website X and get all the information about the products it sells. The product price is the most valuable piece of information, so I cache it for later use (for a “price in other stores” display, etc.).

Generally, writing a web scraper is not very hard if you can write regular expressions quickly (the scrapers I currently run are only 100 lines each). Getting information from a website is extremely easy with regular expressions: website structure doesn’t change frequently, so correctly written expressions can last for years. Or until the server administrator blocks you, but if you really need the information, you can use N anonymous proxies.
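
To make the regular-expression approach concrete, here is a minimal sketch in Python (chosen for illustration only; it is not the code the spiders above actually run). The HTML fragment, the class names and the `extract_products` helper are all made up for the example, and a real site would need its own pattern.

```python
import re

# Hypothetical HTML fragment; real markup differs from site to site.
SAMPLE_HTML = """
<div class="product"><span class="name">Blue Widget</span><span class="price">19.99</span></div>
<div class="product"><span class="name">Red Widget</span><span class="price">24.50</span></div>
"""

# One pattern per site: a product name followed by its price.
PRODUCT_RE = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">(?P<price>\d+(?:\.\d+)?)</span>'
)


def extract_products(html):
    """Return a list of (name, price) tuples found in the HTML text."""
    return [
        (match.group("name").strip(), float(match.group("price")))
        for match in PRODUCT_RE.finditer(html)
    ]


if __name__ == "__main__":
    for name, price in extract_products(SAMPLE_HTML):
        print(name, price)
```

The pattern is tied to the exact markup, which is why it keeps working only as long as the site’s structure stays the same.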

Creating such a spider involves analysing the website’s structure and HTML source code. A web scraper is basically “automatic copy-paste”, so if you know where to look for the information on a website, you can simulate that in code. For example, pseudo-code for a price-collecting spider could look like this:

foreach category in categories
    goto category page
    pages = get {category} pages list
    foreach page in pages
        information = extract products from {page}
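
The pseudo-code above could be fleshed out roughly as follows. This is a sketch in Python under several assumptions, not the actual spider: the two patterns, the `class="page"` pager links and the `scrape` and `fetch` names are hypothetical, and the category page is assumed to double as the first page of results.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical patterns; each site needs its own.
PAGE_LINK_RE = re.compile(r'<a class="page" href="([^"]+)"')
PRODUCT_RE = re.compile(
    r'<span class="name">([^<]+)</span>\s*'
    r'<span class="price">(\d+(?:\.\d+)?)</span>'
)


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(category_urls, fetch=fetch):
    """Visit every page of every category and collect (name, price) pairs."""
    products = []
    for category_url in category_urls:
        # "goto category page"
        category_html = fetch(category_url)

        # "pages = get {category} pages list": the category page itself
        # plus every pager link on it, without duplicates.
        page_urls = [category_url]
        for href in PAGE_LINK_RE.findall(category_html):
            page_url = urljoin(category_url, href)
            if page_url not in page_urls:
                page_urls.append(page_url)

        # "foreach page in pages: extract products from {page}"
        for page_url in page_urls:
            # Reuse the already downloaded category page instead of refetching it.
            html = category_html if page_url == category_url else fetch(page_url)
            for name, price in PRODUCT_RE.findall(html):
                products.append((name.strip(), float(price)))
    return products
```

Passing `fetch` as a parameter makes it possible to run the loop against saved pages while the patterns are being tuned, without hitting the real site.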

Some people may think that this is not legal. It probably is legal. The “Legal issues” section of this article says:

Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear.

In my opinion, web scraping should be treated as legal if and only if you don’t directly use the scraped information on your website. Scraping news, blog entries, etc. and showing them on your site is a bad thing, but scraping information and using it somewhere in a back-end script is perfectly legal in my book. What do you think?

Comments (6)

  1. Geo

    I find it easier to scrape using XPath lately. Firefox has some great add-ons for retrieving the path of an element in a page. Used together with an XPath library, development is blazing fast. There is a Ruby module, Scrubyt, built on top of Mechanize, which simplifies that even more. As for the legal issues, I think that if users respect robots.txt, everything should be okay.

  2. Juozas (author)

    Yes, XPath is a wonderful thing, but don’t you always worry about document validity? What if the document is badly formed?

    Also, do you use XPath like “html/body/table/tr/td/span…”? If so, it can break more easily.

    I haven’t used it much (only for well-structured XML), so regular expressions do the job quite well for me :)

  3. Geo

    I forgot to mention that. You should pass the source’s content through tidy, or BeautifulSoup, or a parser that can work with malformed documents. I know that XPath can become useless if the site’s structure changes, but it’s very easy to replace.

  4. Juozas (author)

    Today I found another article here (http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html).

    It seems that the PHP DOM libraries can work with malformed documents. Still, both sites I crawl are very badly made (they were built 5-10 years ago), so XPath can have problems.

    However, when I write another spider, I will definitely use XPath. I’ll probably try it next week.

  5. Geo

    I also wrote an article related to web scraping. You can find it here: http://ssscripting.wordpress.com/2009/02/15/web-scraping-techniques/

  6. Xpath

    XPath is overrated shit.

    Actually, any language with a solid parsing library will be easier to use, and faster.
