<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Web scraping &#8211; easy way to monitor market</title>
	<atom:link href="http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/feed" rel="self" type="application/rss+xml" />
	<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market</link>
	<description>Random ideas, scripts and facts</description>
	<lastBuildDate>Mon, 29 Mar 2010 18:47:16 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Xpath</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/comment-page-1#comment-1208</link>
		<dc:creator>Xpath</dc:creator>
		<pubDate>Fri, 17 Apr 2009 22:09:53 +0000</pubDate>
		<guid isPermaLink="false">http://dev.juokaz.com/?p=145#comment-1208</guid>
		<description>XPath is overrated shit.

Actually, any language with a solid parsing library will be easier to use, and faster.</description>
		<content:encoded><![CDATA[<p>XPath is overrated shit.</p>
<p>Actually, any language with a solid parsing library will be easier to use, and faster.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daily Digest for 2009-02-18 &#124; Pedro Trindade</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/comment-page-1#comment-69</link>
		<dc:creator>Daily Digest for 2009-02-18 &#124; Pedro Trindade</dc:creator>
		<pubDate>Thu, 19 Feb 2009 08:12:51 +0000</pubDate>
		<guid isPermaLink="false">http://dev.juokaz.com/?p=145#comment-69</guid>
		<description>[...] Web scraping - easy way to monitor market &#124; Juozas devBlog [...]</description>
		<content:encoded><![CDATA[<p>[...] Web scraping &#8211; easy way to monitor market | Juozas devBlog [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Web scraping with PHP and XPath &#124; Juozas devBlog</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/comment-page-1#comment-49</link>
		<dc:creator>Web scraping with PHP and XPath &#124; Juozas devBlog</dc:creator>
		<pubDate>Tue, 17 Feb 2009 20:23:37 +0000</pubDate>
		<guid isPermaLink="false">http://dev.juokaz.com/?p=145#comment-49</guid>
		<description>[...] I was writing about how I use web scraping, I still hadn&#8217;t tried using XPath (shame on me). The ssscripting blog responded to my [...]</description>
		<content:encoded><![CDATA[<p>[...] I was writing about how I use web scraping, I still hadn&#8217;t tried using XPath (shame on me). The ssscripting blog responded to my [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Geo</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/comment-page-1#comment-35</link>
		<dc:creator>Geo</dc:creator>
		<pubDate>Sun, 15 Feb 2009 15:18:09 +0000</pubDate>
		<guid isPermaLink="false">http://dev.juokaz.com/?p=145#comment-35</guid>
		<description>I also wrote an article related to web scraping. You can find it here: http://ssscripting.wordpress.com/2009/02/15/web-scraping-techniques/</description>
		<content:encoded><![CDATA[<p>I also wrote an article related to web scraping. You can find it here: <a href="http://ssscripting.wordpress.com/2009/02/15/web-scraping-techniques/" rel="nofollow">http://ssscripting.wordpress.com/2009/02/15/web-scraping-techniques/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Juozas</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/comment-page-1#comment-33</link>
		<dc:creator>Juozas</dc:creator>
		<pubDate>Sat, 14 Feb 2009 17:36:39 +0000</pubDate>
		<guid isPermaLink="false">http://dev.juokaz.com/?p=145#comment-33</guid>
		<description>Today I found another article here (http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html).

It seems that PHP&#039;s DOM libraries can work with malformed documents. Still, both sites I crawl are very badly made (they were built 5-10 years ago), so XPath may have problems with them.

However, when I write another spider, I will definitely use XPath. I&#039;ll probably try it next week.</description>
		<content:encoded><![CDATA[<p>Today I found another article here (<a href="http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html" rel="nofollow">http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html</a>).</p>
<p>It seems that PHP&#8217;s DOM libraries can work with malformed documents. Still, both sites I crawl are very badly made (they were built 5-10 years ago), so XPath may have problems with them.</p>
<p>However, when I write another spider, I will definitely use XPath. I&#8217;ll probably try it next week.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Geo</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/comment-page-1#comment-32</link>
		<dc:creator>Geo</dc:creator>
		<pubDate>Sat, 14 Feb 2009 17:18:17 +0000</pubDate>
		<guid isPermaLink="false">http://dev.juokaz.com/?p=145#comment-32</guid>
		<description>I forgot to mention that. You should pass the source&#039;s content through tidy, BeautifulSoup, or another parser that can work with malformed documents. I know that an XPath expression can become useless if the site&#039;s structure changes, but it&#039;s very easy to replace.</description>
		<content:encoded><![CDATA[<p>I forgot to mention that. You should pass the source&#8217;s content through tidy, BeautifulSoup, or another parser that can work with malformed documents. I know that an XPath expression can become useless if the site&#8217;s structure changes, but it&#8217;s very easy to replace.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Juozas</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/comment-page-1#comment-30</link>
		<dc:creator>Juozas</dc:creator>
		<pubDate>Sat, 14 Feb 2009 16:49:49 +0000</pubDate>
		<guid isPermaLink="false">http://dev.juokaz.com/?p=145#comment-30</guid>
		<description>Yes, XPath is a wonderful thing, but don&#039;t you always have to worry about document validity? What if the document is badly formed?

Also, do you use XPath expressions like &quot;html/body/table/tr/td/span...&quot;? If so, they can break more easily.

I haven&#039;t used it much (only for well-structured XML), so regular expressions do the job quite well for me :)</description>
		<content:encoded><![CDATA[<p>Yes, XPath is a wonderful thing, but don&#8217;t you always have to worry about document validity? What if the document is badly formed?</p>
<p>Also, do you use XPath expressions like &#8220;html/body/table/tr/td/span&#8230;&#8221;? If so, they can break more easily.</p>
<p>I haven&#8217;t used it much (only for well-structured XML), so regular expressions do the job quite well for me :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Geo</title>
		<link>http://dev.juokaz.com/php/web-scraping-easy-way-to-monitor-market/comment-page-1#comment-28</link>
		<dc:creator>Geo</dc:creator>
		<pubDate>Sat, 14 Feb 2009 16:10:46 +0000</pubDate>
		<guid isPermaLink="false">http://dev.juokaz.com/?p=145#comment-28</guid>
		<description>I find it easier to scrape using XPath lately. Firefox has some great add-ons for retrieving the path to an element in a page. Combine that with an XPath library and development is blazing fast. There is a Ruby module, &lt;a href=&quot;http://scrubyt.org/&quot; rel=&quot;nofollow&quot;&gt;Scrubyt&lt;/a&gt;, built on top of Mechanize, which simplifies that even more. As for the legal issues, I think that if users respect robots.txt, everything should be okay.</description>
		<content:encoded><![CDATA[<p>I find it easier to scrape using XPath lately. Firefox has some great add-ons for retrieving the path to an element in a page. Combine that with an XPath library and development is blazing fast. There is a Ruby module, <a href="http://scrubyt.org/" rel="nofollow">Scrubyt</a>, built on top of Mechanize, which simplifies that even more. As for the legal issues, I think that if users respect robots.txt, everything should be okay.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
