<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Juozas devBlog &#187; ssssscpripting</title>
	<atom:link href="http://dev.juokaz.com/tag/ssssscpripting/feed" rel="self" type="application/rss+xml" />
	<link>http://dev.juokaz.com</link>
	<description>Random ideas, scripts and facts</description>
	<lastBuildDate>Mon, 22 Mar 2010 10:48:42 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Web scraping with PHP and XPath</title>
		<link>http://dev.juokaz.com/php/web-scraping-with-php-and-xpath</link>
		<comments>http://dev.juokaz.com/php/web-scraping-with-php-and-xpath#comments</comments>
		<pubDate>Tue, 17 Feb 2009 20:23:04 +0000</pubDate>
		<dc:creator>Juozas</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Websites]]></category>
		<category><![CDATA[crawler]]></category>
		<category><![CDATA[easy]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[legal]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[spider]]></category>
		<category><![CDATA[ssssscpripting]]></category>
		<category><![CDATA[w3schools]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web scraping]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xpath]]></category>

		<guid isPermaLink="false">http://dev.juokaz.com/?p=176</guid>
		<description><![CDATA[When I wrote about how I use web scraping, I still hadn&#8217;t tried XPath (shame on me). The sssscripting blog responded to my article with a very good and detailed post about all sorts of scraping techniques (with Ruby examples), and after reading this post on Kore Nordmann&#8217;s blog I finally decided [...]]]></description>
			<content:encoded><![CDATA[<p>When I wrote about <a href="http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market">how I use web scraping</a>, I still hadn&#8217;t tried XPath (shame on me). The <a href="http://ssscripting.wordpress.com/">sssscripting blog</a> responded to my article with a very good and detailed <a href="http://ssscripting.wordpress.com/2009/02/15/web-scraping-techniques/">post</a> about all sorts of scraping techniques (with Ruby examples), and after reading this <a href="http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html">post on Kore Nordmann&#8217;s blog</a> I finally decided to try building something with XPath.</p>
<p>It turned out that using XPath is <strong>extremely easy</strong>, really. Once you master it, you can write a working query in seconds. Yes, you need to know how XML works and how to write correct XPath queries (a brief explanation of XPath syntax is available at <a href="http://www.w3schools.com/XPath/xpath_syntax.asp">W3Schools</a>), but hey &#8211; these topics are covered in the first year of university. </p>
<p>Also, there are good tools like <a href="https://addons.mozilla.org/en-US/firefox/addon/1095">XPath checker</a> for Firefox, which let you debug and test your queries without writing any code. It may sound silly, but XPath queries look a lot like CSS selectors, only with much more power and flexibility. Without further ado, let&#8217;s look at an example (the idea is from Kore&#8217;s article):</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">&lt;?php</span> 
&nbsp;
<span style="color: #000088;">$oldSetting</span> <span style="color: #339933;">=</span> <span style="color: #990000;">libxml_use_internal_errors</span><span style="color: #009900;">&#40;</span> <span style="color: #009900; font-weight: bold;">true</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
<span style="color: #990000;">libxml_clear_errors</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #000088;">$html</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DOMDocument<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
<span style="color: #000088;">$html</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">loadHtmlFile</span><span style="color: #009900;">&#40;</span>
    <span style="color: #0000ff;">'http://www.bhphotovideo.com/c/shop/6222/SLR_Digital_Cameras.html'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #000088;">$xpath</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DOMXPath<span style="color: #009900;">&#40;</span> <span style="color: #000088;">$html</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
<span style="color: #000088;">$links</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$xpath</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span> <span style="color: #0000ff;">&quot;//div[@class='productBlock clearfix']&quot;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #000088;">$return</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span> <span style="color: #000088;">$links</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$item</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
	<span style="color: #000088;">$newDom</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DOMDocument<span style="color: #339933;">;</span>
	<span style="color: #000088;">$newDom</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">appendChild</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$newDom</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">importNode</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$item</span><span style="color: #339933;">,</span><span style="color: #009900; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #000088;">$xpath</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> DOMXPath<span style="color: #009900;">&#40;</span> <span style="color: #000088;">$newDom</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
	<span style="color: #000088;">$title</span> <span style="color: #339933;">=</span> <span style="color: #990000;">trim</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$xpath</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;//div[@id='productTitle']&quot;</span><span style="color: #009900;">&#41;</span>
                   <span style="color: #339933;">-&gt;</span><span style="color: #004000;">item</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">nodeValue</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #000088;">$price</span> <span style="color: #339933;">=</span> <span style="color: #990000;">trim</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$xpath</span>
                   <span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;//li[@class='price']/span[@class='value']&quot;</span><span style="color: #009900;">&#41;</span>
                   <span style="color: #339933;">-&gt;</span><span style="color: #004000;">item</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">nodeValue</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #000088;">$return</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span>
		<span style="color: #0000ff;">'title'</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$title</span><span style="color: #339933;">,</span>
		<span style="color: #0000ff;">'price'</span> <span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$price</span><span style="color: #339933;">,</span>
		<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span> 
&nbsp;
<span style="color: #666666; font-style: italic;">// Products array with title and price</span>
<span style="color: #990000;">print_r</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$return</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #990000;">libxml_clear_errors</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
<span style="color: #990000;">libxml_use_internal_errors</span><span style="color: #009900;">&#40;</span> <span style="color: #000088;">$oldSetting</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
&nbsp;
<span style="color: #000000; font-weight: bold;">?&gt;</span></pre></div></div>

<p>This code gets all product titles and prices from a <a href="http://www.bhphotovideo.com/">Bhphotovideo.com</a> category page (see the note below). You must have noticed that the XPath queries are really simple: one selects all products, and the others select specific elements of each product. How fast could you get the same results with plain regexps? I made this in 3 minutes (including downloading the XPath extension for Firefox, reading the PHP manual, etc.).</p>
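<p>One possible simplification, sketched here as an assumption rather than as part of the script above: <code>DOMXPath::query()</code> accepts a context node as its second argument, so the per-product queries can run relative to each product node instead of importing it into a new <code>DOMDocument</code>. The HTML fragment and its class names are made up for illustration, and the sketch also skips products where an element is missing.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php

// Made-up markup for illustration; the real page structure will differ.
$source = '&lt;html&gt;&lt;body&gt;
&lt;div class="productBlock clearfix"&gt;
  &lt;div class="title"&gt; Camera A &lt;/div&gt;
  &lt;ul&gt;&lt;li class="price"&gt;&lt;span class="value"&gt; $499.00 &lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="productBlock clearfix"&gt;
  &lt;div class="title"&gt; Camera B &lt;/div&gt;
  &lt;ul&gt;&lt;li class="price"&gt;&lt;span class="value"&gt; $649.00 &lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;/div&gt;
&lt;/body&gt;&lt;/html&gt;';

$oldSetting = libxml_use_internal_errors(true);

$html = new DOMDocument();
$html-&gt;loadHTML($source);

$xpath = new DOMXPath($html);
$products = array();

foreach ($xpath-&gt;query("//div[@class='productBlock clearfix']") as $item) {
    // A leading "." makes the query relative to $item, the context node.
    $title = $xpath-&gt;query(".//div[@class='title']", $item)-&gt;item(0);
    $price = $xpath-&gt;query(".//li[@class='price']/span[@class='value']", $item)-&gt;item(0);

    // Skip blocks where either element is missing instead of failing on null.
    if ($title === null || $price === null) {
        continue;
    }

    $products[] = array(
        'title' =&gt; trim($title-&gt;nodeValue),
        'price' =&gt; trim($price-&gt;nodeValue),
    );
}

print_r($products);

libxml_clear_errors();
libxml_use_internal_errors($oldSetting);</pre></div></div>

<p>Reusing one <code>DOMXPath</code> object this way avoids copying every product node, which may matter on long category pages.</p>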
<p>I used to write my extraction rules as regular expressions, but now I see that I&#8217;ve just been wasting time &#8211; using XPath is much easier. I don&#8217;t know why I didn&#8217;t try it sooner (maybe because I believed that XPath only works with well-formed documents), but now I see that this technology is just awesome. I don&#8217;t know what more to say &#8211; <strong>web scraping with XPath is super easy</strong>.</p>
<p><em><a href="http://www.bhphotovideo.com">Bhphotovideo.com</a> is a really good shop, and I chose it only as an example for scraping. I don&#8217;t encourage you to steal their information, and it&#8217;s your responsibility to scrape only those sites that allow it. My code is just an example and shouldn&#8217;t be used to affect Bhphotovideo.com sales.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://dev.juokaz.com/php/web-scraping-with-php-and-xpath/feed</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Web scraping &#8211; easy way to monitor market</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market</link>
		<comments>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market#comments</comments>
		<pubDate>Sat, 14 Feb 2009 15:54:52 +0000</pubDate>
		<dc:creator>Juozas</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Websites]]></category>
		<category><![CDATA[blog]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[legal iseus]]></category>
		<category><![CDATA[prices]]></category>
		<category><![CDATA[rivals]]></category>
		<category><![CDATA[spider]]></category>
		<category><![CDATA[ssssscpripting]]></category>
		<category><![CDATA[web crawler]]></category>
		<category><![CDATA[web scraper]]></category>

		<guid isPermaLink="false">http://dev.juokaz.com/?p=145</guid>
		<description><![CDATA[Yesterday the Sssssssscripting blog wrote about writing your own web crawler (in Ruby), and I immediately remembered that I have done similar projects in the past. The only difference is that I developed web scrapers that were used to monitor the e-commerce market. Believe me, when you run an online shop, being able to look at your rivals&#8217; prices in [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday the <a href="http://ssscripting.wordpress.com/">Sssssssscripting</a> blog wrote about <a href="http://ssscripting.wordpress.com/2009/02/13/how-to-write-a-spider/">writing your own web crawler</a> (in Ruby), and I immediately remembered that I have done similar projects in the past. The only difference is that I developed web scrapers that were used to monitor the e-commerce market. Believe me, when you run an online shop, being able to look at your rivals&#8217; prices in a simple way is priceless.</p>
<p>The <a href="http://en.wikipedia.org/wiki/Web_scraping">web scrapers</a> I wrote were not very generic &#8211; each of them was made for a specific website. It may not seem like very good practice, but having different spiders for different sites helps a lot &#8211; the source code is dramatically simpler, and tweaking one doesn&#8217;t affect the others.</p>
<p>Currently, I have two spiders running every week (they have been running for half a year). They have one task: go to website X and get all the information about the products it sells. The product price is the most valuable information, so I cache it for later use (for a &#8220;price in other stores&#8221; display, etc.).</p>
<p>Generally, writing a web scraper is not very hard if you can write <a href="http://uk2.php.net/manual/en/book.pcre.php">regexps</a> quickly (the web scrapers I currently run are only 100 lines each). Getting information from a website is extremely easy if you know how to use regular expressions &#8211; website structure doesn&#8217;t change frequently, so correctly written expressions can last for years. Or until the server administrator blocks you, but if you really need this information, you can use N anonymous proxies.</p>
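<p>To illustrate the regular-expression approach, here is a minimal sketch using <code>preg_match_all()</code>. The markup and the pattern are invented for this example and are not taken from my actual spiders; a real site will need its own pattern.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php

// Invented markup; a real shop page will look different.
$source = '
&lt;div class="product"&gt;&lt;h2&gt;Camera A&lt;/h2&gt;&lt;span class="price"&gt;499.00&lt;/span&gt;&lt;/div&gt;
&lt;div class="product"&gt;&lt;h2&gt;Camera B&lt;/h2&gt;&lt;span class="price"&gt;649.00&lt;/span&gt;&lt;/div&gt;';

// Named groups capture the title and the price of every product block.
// The "s" modifier lets "." match newlines; ".*?" keeps the match non-greedy.
$pattern = '#&lt;h2&gt;(?P&lt;title&gt;.*?)&lt;/h2&gt;\s*&lt;span class="price"&gt;(?P&lt;price&gt;[\d.]+)&lt;/span&gt;#s';

$products = array();
if (preg_match_all($pattern, $source, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $products[] = array(
            'title' =&gt; trim($match['title']),
            'price' =&gt; (float) $match['price'],
        );
    }
}

print_r($products);</pre></div></div>

<p>A pattern like this depends on the exact tag order, so it has to be updated whenever the site changes its markup.</p>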
<p>Creating such a spider involves analysing the website structure and HTML source code &#8211; a web scraper is basically &#8220;automatic copy-paste&#8221;, so if you know where to look for the information on a website, you can simulate that in code. For example, pseudo-code for a price-collecting spider could look like this:</p>
<pre>foreach category in categories
    goto {category} page
    pages = get {category} pages list
    foreach page in pages
        information = extract products from {page}</pre>
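<p>In PHP, the loop above might be sketched like this. The functions <code>getCategoryPages()</code> and <code>extractProducts()</code> are hypothetical placeholders, and they read from a hard-coded array so that the sketch runs without touching any real website.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php

// Stand-in for a real site: category =&gt; list of pages =&gt; products on each page.
$site = array(
    'cameras' =&gt; array(
        array(array('title' =&gt; 'Camera A', 'price' =&gt; 499.00)),
        array(array('title' =&gt; 'Camera B', 'price' =&gt; 649.00)),
    ),
    'lenses' =&gt; array(
        array(array('title' =&gt; 'Lens C', 'price' =&gt; 199.00)),
    ),
);

// Hypothetical helper: a real spider would download the category page
// and collect its pagination links.
function getCategoryPages(array $site, $category)
{
    return $site[$category];
}

// Hypothetical helper: a real spider would run regexps or XPath queries here.
function extractProducts(array $page)
{
    return $page;
}

$information = array();
foreach (array_keys($site) as $category) {
    foreach (getCategoryPages($site, $category) as $page) {
        foreach (extractProducts($page) as $product) {
            // Group the extracted products by category.
            $information[$category][] = $product;
        }
    }
}

print_r($information);</pre></div></div>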
<p>Some people may think that this is not legal. It probably is legal. If you look at the Legal issues section of <a href="http://en.wikipedia.org/wiki/Web_scraping">this</a> article, you will find that:</p>
<blockquote><p>Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear.</p></blockquote>
<p>In my opinion, web scraping should be treated as legal if and only if you don&#8217;t directly use the scraped information on your website. Scraping news, blog entries, etc. and showing them on your site is a bad thing, but scraping information and using it somewhere in a back-end script is perfectly legal in my book. What do you think?</p>
]]></content:encoded>
			<wfw:commentRss>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
