<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Juozas devBlog &#187; crawler</title>
	<atom:link href="http://dev.juokaz.com/tag/crawler/feed" rel="self" type="application/rss+xml" />
	<link>http://dev.juokaz.com</link>
	<description>Random ideas, scripts and facts</description>
	<lastBuildDate>Mon, 22 Mar 2010 10:48:42 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Web scraping with PHP and XPath</title>
		<link>http://dev.juokaz.com/php/web-scraping-with-php-and-xpath</link>
		<comments>http://dev.juokaz.com/php/web-scraping-with-php-and-xpath#comments</comments>
		<pubDate>Tue, 17 Feb 2009 20:23:04 +0000</pubDate>
		<dc:creator>Juozas</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Websites]]></category>
		<category><![CDATA[crawler]]></category>
		<category><![CDATA[easy]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[legal]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[spider]]></category>
		<category><![CDATA[ssssscpripting]]></category>
		<category><![CDATA[w3schools]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web scraping]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xpath]]></category>

		<guid isPermaLink="false">http://dev.juokaz.com/?p=176</guid>
		<description><![CDATA[When I was writing about how I use web scraping, I still hadn&#8217;t tried using XPath (shame on me). The ssscripting blog responded to my article with a very good and rich post about all sorts of different scraping techniques (with Ruby examples), and after reading this post on Kore Nordmann&#8217;s blog I finally decided [...]]]></description>
			<content:encoded><![CDATA[<p>When I was writing about <a href="http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market">how I use web scraping</a>, I still hadn&#8217;t tried using XPath (shame on me). The <a href="http://ssscripting.wordpress.com/">ssscripting blog</a> responded to my article with a very good and rich <a href="http://ssscripting.wordpress.com/2009/02/15/web-scraping-techniques/">post</a> about all sorts of different scraping techniques (with Ruby examples), and after reading this <a href="http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html">post on Kore Nordmann&#8217;s blog</a> I finally decided to try making something with XPath.</p>
<p>It turned out that using XPath is <strong>extremely easy</strong>. Once you have mastered it, you can do almost anything in seconds. Yes, you need to know how XML works and how to write correct XPath queries (a brief explanation of XPath syntax is available at <a href="http://www.w3schools.com/XPath/xpath_syntax.asp">W3Schools</a>), but hey, these topics are covered in the first year of university.</p>
<p>There are also good tools like <a href="https://addons.mozilla.org/en-US/firefox/addon/1095">XPath checker</a> for Firefox, which lets you debug and test your queries without writing any code. It may sound silly, but XPath queries look a lot like CSS selectors, only with much more power and flexibility. Without further ado, let&#8217;s look at an example (the idea comes from Kore&#8217;s article):</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">&lt;?php</span> 
&nbsp;
<span style="color: #000088;">$oldSetting</span> <span style="color: #339933;">=</span> <span style="color: #990000;">libxml_use_internal_errors</span><span style="color: #009900;">&#40;</span> <span style="color: #009900; font-weight: bold;">true</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
<span style="color: #990000;">libxml_clear_errors</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #000088;">$html</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DOMDocument<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
<span style="color: #000088;">$html</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">loadHtmlFile</span><span style="color: #009900;">&#40;</span>
    <span style="color: #0000ff;">'http://www.bhphotovideo.com/c/shop/6222/SLR_Digital_Cameras.html'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #000088;">$xpath</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DOMXPath<span style="color: #009900;">&#40;</span> <span style="color: #000088;">$html</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
<span style="color: #000088;">$links</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$xpath</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span> <span style="color: #0000ff;">&quot;//div[@class='productBlock clearfix']&quot;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #000088;">$return</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span> <span style="color: #000088;">$links</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$item</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #000088;">$newDom</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DOMDocument<span style="color: #339933;">;</span>
	<span style="color: #000088;">$newDom</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">appendChild</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$newDom</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">importNode</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$item</span><span style="color: #339933;">,</span><span style="color: #009900; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #000088;">$xpath</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DOMXPath<span style="color: #009900;">&#40;</span> <span style="color: #000088;">$newDom</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
	<span style="color: #000088;">$title</span> <span style="color: #339933;">=</span> <span style="color: #990000;">trim</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$xpath</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;//div[@id='productTitle']&quot;</span><span style="color: #009900;">&#41;</span>
                   <span style="color: #339933;">-&gt;</span><span style="color: #004000;">item</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">nodeValue</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #000088;">$price</span> <span style="color: #339933;">=</span> <span style="color: #990000;">trim</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$xpath</span>
                   <span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;//li[@class='price']/span[@class='value']&quot;</span><span style="color: #009900;">&#41;</span>
                   <span style="color: #339933;">-&gt;</span><span style="color: #004000;">item</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">nodeValue</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #000088;">$return</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span>
		<span style="color: #0000ff;">'title'</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$title</span><span style="color: #339933;">,</span>
		<span style="color: #0000ff;">'price'</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$price</span><span style="color: #339933;">,</span>
		<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span> 
&nbsp;
<span style="color: #666666; font-style: italic;">// Products array with title and price</span>
<span style="color: #990000;">print_r</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$return</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #990000;">libxml_clear_errors</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
<span style="color: #990000;">libxml_use_internal_errors</span><span style="color: #009900;">&#40;</span> <span style="color: #000088;">$oldSetting</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #000000; font-weight: bold;">?&gt;</span></pre></div></div>

<p>This code gets all product titles and prices from a <a href="http://www.bhphotovideo.com/">Bhphotovideo.com</a> category page (see the note below). You must have noticed that the XPath queries are really simple: one selects all products, and the others select specific elements of each product. How fast can you get the same results with plain regular expressions? I wrote this in 3 minutes (including downloading the XPath extension for Firefox, reading the PHP manual, etc.).</p>
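<p>The per-product queries can also be run against the original document by passing each product node as the context node, which avoids copying it into a new <code>DOMDocument</code>. Here is a minimal sketch of that variation. The sample markup is made up to mimic the class names used above, so it is an assumption about the page structure, not the shop&#8217;s real HTML:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php

// Hypothetical markup that mimics the class names used in the example above.
$sample = &lt;&lt;&lt;'HTML'
&lt;html&gt;&lt;body&gt;
&lt;div class="productBlock clearfix"&gt;
  &lt;div id="productTitle"&gt; Camera A &lt;/div&gt;
  &lt;ul&gt;&lt;li class="price"&gt;&lt;span class="value"&gt; $499.00 &lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="productBlock clearfix"&gt;
  &lt;div id="productTitle"&gt; Camera B &lt;/div&gt;
  &lt;ul&gt;&lt;li class="price"&gt;&lt;span class="value"&gt; $899.00 &lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;/div&gt;
&lt;/body&gt;&lt;/html&gt;
HTML;

// Collect parser complaints (such as the repeated id) instead of printing warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom-&gt;loadHTML($sample);
$xpath = new DOMXPath($dom);

$products = array();

foreach ($xpath-&gt;query("//div[@class='productBlock clearfix']") as $block) {
    // The leading "." makes the query relative to $block, the context node.
    $titleNode = $xpath-&gt;query(".//div[@id='productTitle']", $block)-&gt;item(0);
    $priceNode = $xpath-&gt;query(".//li[@class='price']/span[@class='value']", $block)-&gt;item(0);

    // item(0) returns null when nothing matched, so guard before reading nodeValue.
    $products[] = array(
        'title' =&gt; $titleNode ? trim($titleNode-&gt;nodeValue) : null,
        'price' =&gt; $priceNode ? trim($priceNode-&gt;nodeValue) : null,
    );
}

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Products array with title and price
print_r($products);</pre></div></div>

<p>Without the leading dot, a query starting with <code>//</code> searches the whole document again, even when a context node is passed, so every product would get the first title on the page.</p>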
<p>I used to write my queries as regular expressions, but now I see that I was just wasting time: using XPath is much easier. I don&#8217;t know why I hadn&#8217;t tried it sooner (maybe because I believed that XPath only works with correctly structured documents), but now I see that this technology is just awesome. I don&#8217;t know what more to say: <strong>web scraping with XPath is super easy</strong>.</p>
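<p>On the worry about badly structured documents: <code>DOMDocument::loadHTML()</code> uses libxml&#8217;s HTML parser, which repairs tag soup before XPath ever sees it. A small sketch, using a deliberately broken snippet I made up for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php

// Hypothetical broken markup: no html or body, unclosed li tags, a stray closing div.
$broken = '&lt;ul&gt;&lt;li&gt;One&lt;li&gt;Two&lt;li&gt;Three&lt;/ul&gt;&lt;/div&gt;';

// Keep parser errors internal so they do not show up as PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom-&gt;loadHTML($broken);

libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// The parser closes each li implicitly, so the query sees separate list items.
$items = array();
foreach ($xpath-&gt;query('//li') as $li) {
    $items[] = trim($li-&gt;nodeValue);
}

print_r($items);</pre></div></div>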
<p><em><a href="http://www.bhphotovideo.com">Bhphotovideo.com</a> is a really good shop, and I chose it only as an example for scraping. I don&#8217;t encourage you to steal their information, and it&#8217;s your responsibility to scrape only those sites that allow it. My code is just an example and shouldn&#8217;t be used to affect Bhphotovideo.com sales.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://dev.juokaz.com/php/web-scraping-with-php-and-xpath/feed</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
	</channel>
</rss>
