Web scraping with PHP and XPath

Posted February 17th, 2009 by Juozas

When I was writing about how I use web scraping, I still hadn’t tried using XPath (shame on me). The sssscripting blog responded to my article with a very good and rich post about all sorts of different scraping techniques (with Ruby examples), and after reading this post on Kore Nordmann’s blog I finally decided to try making something with XPath.

It turned out that using XPath is extremely easy, really. Once you master it, you can do everything in seconds. Yes, you need to know how XML works and how to write correct XPath queries (a brief explanation of XPath syntax is available at W3Schools), but hey, these topics are covered in the first year of university.

Also, there are good tools like XPath Checker for Firefox, which allow you to debug and test your queries without writing any code. It may sound silly, but XPath queries look a lot like CSS selectors, only with much more power and flexibility. Without further ado, let’s look at an example (idea from Kore’s article):

<?php 
 
// Buffer libxml parse warnings internally instead of printing them
$oldSetting = libxml_use_internal_errors( true ); 
libxml_clear_errors(); 
 
$html = new DOMDocument(); 
$html->loadHtmlFile(
    'http://www.bhphotovideo.com/c/shop/6222/SLR_Digital_Cameras.html'); 
 
$xpath = new DOMXPath( $html ); 
$links = $xpath->query( "//div[@class='productBlock clearfix']" ); 
 
$return = array();
 
foreach ( $links as $item ) {
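	// Copy this product block into its own document, so the
	// absolute (//) queries below only see this one product
	$newDom = new DOMDocument;
	$newDom->appendChild($newDom->importNode($item,true));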
 
	$xpath = new DOMXPath( $newDom ); 
	$title = trim($xpath->query("//div[@id='productTitle']")
                   ->item(0)->nodeValue);
	$price = trim($xpath
                   ->query("//li[@class='price']/span[@class='value']")
                   ->item(0)->nodeValue);
 
	$return[] = array(
		'title' => $title,
		'price' => $price,
		);
} 
 
// Products array with title and price
print_r($return);
 
// Discard the buffered warnings and restore the previous error handling
libxml_clear_errors(); 
libxml_use_internal_errors( $oldSetting ); 
 
?>

This code gets all product titles and prices from a Bhphotovideo.com category page (see the note below). You must have noticed that the XPath queries are really simple: one selects all products, the others select only specific elements of each product. How fast could you get the same results with plain regular expressions? I made this in 3 minutes (downloading the XPath extension for Firefox, reading the PHP manual, etc. included).
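
To try the same two-level pattern without hitting a live site, here is a minimal sketch that runs against an inline string. The markup is invented for illustration (only the class and id names are borrowed from the queries above), and the variable names `$markup` and `$products` are my own. Unlike the listing above, it passes each product block as the context node (the second argument of `DOMXPath::query()`) and starts the inner queries with `.//`, instead of copying the block into a new `DOMDocument`. It also guards against a missing node, since `item(0)` returns null when nothing matches.

```php
<?php
// Hypothetical markup, loosely shaped like a product listing page.
$markup = <<<'HTML'
<html><body>
<div class="productBlock clearfix">
  <div id="productTitle"> Camera A </div>
  <ul><li class="price"><span class="value">$499.00</span></li></ul>
</div>
<div class="productBlock clearfix">
  <div id="productTitle"> Camera B </div>
  <ul><li class="price"><span class="value">$899.00</span></li></ul>
</div>
</body></html>
HTML;

// Buffer parser warnings (e.g. about the repeated id) instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$products = array();

// Outer query: one node per product block.
foreach ($xpath->query("//div[@class='productBlock clearfix']") as $block) {
    // Inner queries: relative (.//) to the current block.
    $titleNode = $xpath->query(".//div[@id='productTitle']", $block)->item(0);
    $priceNode = $xpath->query(".//li[@class='price']/span[@class='value']", $block)->item(0);

    $products[] = array(
        'title' => $titleNode ? trim($titleNode->nodeValue) : null,
        'price' => $priceNode ? trim($priceNode->nodeValue) : null,
    );
}

libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($products);
```

Swapping `loadHTML($markup)` for `loadHTMLFile()` with a URL should give the same flow against a real page, assuming that page actually uses this structure.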

I used to write queries as regular expressions, but now I see that I’ve just been wasting time: using XPath is much easier. I don’t know why I haven’t tried it sooner (maybe because I believed that XPath only works with correctly structured documents), but now I see that this technology is just awesome. I don’t know what more to say: web scraping with XPath is super easy.
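
On the worry about correctly structured documents, here is a small sketch, assuming some deliberately broken markup that I made up (an unknown tag and a paragraph that is never closed). `DOMDocument::loadHTML()` uses libxml’s lenient HTML parser, so it should still build a tree that XPath can query; any complaints are buffered and can be read with `libxml_get_errors()`. Which messages get buffered, if any, depends on the libxml version, so no output is shown here.

```php
<?php
// Deliberately broken, hypothetical markup: unknown tag, unclosed <p>.
$markup = '<html><body><foo>bar</foo><p>Unclosed paragraph</body></html>';

// Buffer parser complaints instead of emitting PHP warnings.
$oldSetting = libxml_use_internal_errors(true);
libxml_clear_errors();

$doc = new DOMDocument();
$loaded = $doc->loadHTML($markup);

// Print whatever the parser buffered (the list may be empty).
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);

// The repaired tree can be queried as usual.
$xpath = new DOMXPath($doc);
$paragraphs = $xpath->query('//p');

echo $paragraphs->length, " paragraph(s) found\n";
```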

Bhphotovideo.com is a really good shop, and I chose it as an example for scraping. I don’t encourage you to steal their information, and it’s your responsibility to scrape only those sites that allow it. My code is just an example and shouldn’t be used to affect Bhphotovideo.com sales.

Comments (19)

  1. fosron

    Can’t disagree with you. Xpath is awesome, just like SQL, but for HTML :)

  2. Juozas (author)

    Also, there is Linq for variables ;) I just started using it with C#, but there must be an implementation of it in PHP too.

  3. Kore

    You can also query “local” parts of the document, using the second argument of DOMXpath::query(). So there is no need to reconstruct a new DOMDocument from the already selected parts of the document.

    This would then look something like:

    $title = $xpath->query( "//div[@id='productTitle']", $item )
                   ->item( 0 )->nodeValue
    
  4. Pablo

    Great post! I’ll have to try XPath the next time I have to do some web scraping…

  5. Geo

    When I wrote the article about scraping techniques, initially I wanted to show how to scrape Mozilla’s add-on page :) Keep up the good work!

  6. fosron

    I’ve run into some problems. I try to load my webpage using your code ... $html->loadHtmlFile(
    'http://fosron.lt'); ...
    and I get a bunch of validation errors, and then XPath won’t work for me :|

  7. Juozas (author)

    HTML + XPath = Warnings :) Unless your HTML is XHTML (Very) Strict, I think.

    You need to use (to avoid these warnings):

    $oldSetting = libxml_use_internal_errors( true );
    libxml_clear_errors();

    .. code ..

    libxml_clear_errors();
    libxml_use_internal_errors( $oldSetting );

  8. fosron

    Ow lol, fixed it. :D

  9. Jani Hartikainen

    Nice post. It’s good to see XPath articles, as it’s definitely an under-used technology amongst PHP people.

    However, I think there’s a small problem with XPath in regards to scraping websites: the site HAS TO BE valid XHTML! (fosron, perhaps this is your problem?)

    I don’t know if there are any libraries for PHP which attempt to repair broken HTML. Until there is one, which then turns the document into a valid XML DOM, I don’t think XPath will be very useful.

    There’s a really good library for Python, called BeautifulSoup, which can parse even very messed up HTML into a document structure. If you want to find an easy way to scrape sites, I think it’s possibly the best library you can use. Definitely worth checking out at least.

  10. Juozas (author)

    By definition it works only with valid XHTML’s but somehow it accepts bad ones too (sometimes). I guess you only need to have good structure and not all other rules.

    There is a thing called Tidy (http://uk3.php.net/tidy). It works as an extension for PHP, and its description says:

    Tidy is a binding for the Tidy HTML clean and repair utility which allows you to not only clean and otherwise manipulate HTML documents, but also traverse the document tree.

    I need to try it – maybe it can help with really bad code. But this package is not in default PHP, so shared hosting users would need to find another solution.

  11. fosron

    Yes, Tidy is a good solution. But think how your server will bleed when you load thousands of websites, and all of them need to be tidied and then scraped in pieces with XPath. That’s a big performance problem…

  12. Juozas (author)

    If you want performance and thousands of pages, use Python and BeautifulSoup :)

    Also, you won’t scrape websites on every request, so for a cron job Tidy+XPath would work just fine. Or maybe you can spend some time writing regexps.

  13. Jason Bartholme

    Nice explanation, Juozas. I do a fair amount of scraping with ColdFusion, MySQL, and RegEx. My most recent was grabbing the images from Fark.com. I’ve been learning PHP and your method looks like a good starting point for trying to use PHP for upcoming projects.

  14. Stephen Cronin

    Hi, looks good, but I’ve been very happy using PHP Simple HTML DOM Parser, which uses jQuery-like selectors. Have you ever come across this and, if so, what’s your take on it?

  15. Brian @ MGoBlog

    A note on Kore’s correction:
    $title = $xpath->query( "//div[@id='productTitle']", $item )
    ->item( 0 )->nodeValue

    That won’t work relative to the passed item. The // will go to the doc root even if you pass in the relative parameter. You need to put a period first:
    $title = $xpath->query( ".//div[@id='productTitle']", $item )
    ->item( 0 )->nodeValue

  16. Ma'moon

    Thanks a lot for this awesome post about HTMLPurifier. I have been using it for some time now and it has been really awesome and feature rich, but I had some issues related to XSS protection: when I test it using the XSS injection tool named scanmus, it fails in some cases. Are there any suggested solutions for passing this test using the same wonderful package?

  17. boxoft

    Nice tutorial. \(^o^)/~

    Using @ before the $html->loadHtmlFile(…) line can mask the warnings. So the code can be modified like this:

    <?php

    $html = new DOMDocument();
    @$html->loadHtmlFile('http://www.bhphotovideo.com/c/shop/6222/SLR_Digital_Cameras.html');
    
    $xpath = new DOMXPath( $html );
    $links = $xpath->query( "//div[@class='productBlock clearfix']" ); 
    
    $result = array();
    
    foreach ( $links as $item )
    {
    	$title = trim($xpath->query(".//div[@id='productTitle']", $item)->item(0)->nodeValue);
    	$price = trim($xpath->query(".//li[@class='price']/span[@class='value']", $item)->item(0)->nodeValue);
    
    	$result[] = array('title' => $title, 'price' => $price);
    } 
    
    print_r($result);
    
    ?>
    
  18. Alex

    Great post!
    You saved me a lot lot of time!
    Many thanks!

  19. Goha

    Cool stuff to learn.. thanks for sharing..
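
A minimal sketch of the point Kore and Brian make in comments 3 and 15, assuming invented markup (it uses a class for the titles rather than a repeated id, and all variable names are my own). It runs the same query with and without the leading period against the second product block, to show which nodes each form can see.

```php
<?php
// Hypothetical markup: two product blocks, each with its own title.
$markup = <<<'HTML'
<html><body>
<div class="productBlock"><div class="title">Camera A</div></div>
<div class="productBlock"><div class="title">Camera B</div></div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Use the second product block as the context node.
$blocks = $xpath->query("//div[@class='productBlock']");
$second = $blocks->item(1);

// Absolute path: starts at the document root, so the context node
// does not narrow the search.
$absolute = $xpath->query("//div[@class='title']", $second);

// Relative path: the leading period anchors the search at the context node.
$relative = $xpath->query(".//div[@class='title']", $second);

echo $absolute->length, "\n";               // 2
echo $absolute->item(0)->nodeValue, "\n";   // Camera A
echo $relative->length, "\n";               // 1
echo $relative->item(0)->nodeValue, "\n";   // Camera B
```

With the absolute form, `item(0)` returns the first title in the whole document for every block, which is why the period matters inside a loop over products.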
