<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Juozas devBlog &#187; cron</title>
	<atom:link href="http://dev.juokaz.com/tag/cron/feed" rel="self" type="application/rss+xml" />
	<link>http://dev.juokaz.com</link>
	<description>Random ideas, scripts and facts</description>
	<lastBuildDate>Mon, 22 Mar 2010 10:48:42 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Prevent scripts from being killed</title>
		<link>http://dev.juokaz.com/php/prevent-scripts-from-being-killed</link>
		<comments>http://dev.juokaz.com/php/prevent-scripts-from-being-killed#comments</comments>
		<pubDate>Wed, 25 Mar 2009 21:10:06 +0000</pubDate>
		<dc:creator>Juozas</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Servers]]></category>
		<category><![CDATA[buffer]]></category>
		<category><![CDATA[cron]]></category>
		<category><![CDATA[firefox]]></category>
		<category><![CDATA[gzip]]></category>
		<category><![CDATA[hosting]]></category>
		<category><![CDATA[jobs]]></category>
		<category><![CDATA[limit]]></category>
		<category><![CDATA[max-execution]]></category>
		<category><![CDATA[output]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[shared]]></category>
		<category><![CDATA[time]]></category>
		<category><![CDATA[user abort]]></category>
		<category><![CDATA[web scraping]]></category>
		<category><![CDATA[wget]]></category>

		<guid isPermaLink="false">http://dev.juokaz.com/?p=417</guid>
		<description><![CDATA[Getting back to shared server problems. I have some very time-consuming scripts running through CRON &#8211; some nice web scraping jobs. They are not processing-intensive, but rather slow because the websites they fetch are slow. All these jobs are really hard to divide into separate scripts (another article), so one script should have no limits to [...]]]></description>
			<content:encoded><![CDATA[<p>Getting back to <a href="http://dev.juokaz.com/?s=shared">shared server problems</a>. I have some very time-consuming scripts running through <a href="http://en.wikipedia.org/wiki/Cron">CRON</a> &#8211; some nice <a href="http://dev.juokaz.com/php/web-scraping-with-php-and-xpath">web scraping</a> jobs. They are not processing-intensive, but rather slow because the websites they fetch are slow. All these jobs are really hard to divide into separate scripts (another article), so a single script has to be able to run for hours without hitting any limit. However, web servers don&#8217;t allow that by default.</p>
<p>To start with, <a href="http://uk.php.net/set_time_limit">max-execution time</a> is the first problem. By default, PHP terminates a script once it has been running for more than 30 seconds (the <em>max_execution_time</em> setting). The actual limit depends on configuration, but on shared hosting it&#8217;s nowhere near several hours. So the first thing to do is remove the time limit:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #990000;">set_time_limit</span><span style="color: #009900;">&#40;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Zero means no time limit at all. However, the problem is not solved yet. If you are calling your script through <a href="http://en.wikipedia.org/wiki/Apache_HTTP_Server">Apache</a>, it&#8217;s most likely that a script without any output in about 5 will be killed too. I believe this depends on web server settings, but it can be easily tested &#8211; just create an infinite loop and try to load it in Firefox.</p>
<p>After some time Firefox will display a &#8220;Download&#8221; window with your script name &#8211; this means that your process has just been killed. I haven&#8217;t spent much time analyzing this behaviour, but the easiest workaround is to print some text on every iteration, for example:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$i</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> <span style="color: #000088;">$i</span> <span style="color: #339933;">&lt;</span> <span style="color: #000088;">$pageCount</span><span style="color: #339933;">;</span> <span style="color: #000088;">$i</span><span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
   <span style="color: #b1b100;">print</span> <span style="color: #000088;">$i</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">' out of  '</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$pageCount</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">', working with: '</span><span style="color: #339933;">.</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$i</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
   <span style="color: #990000;">flush</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
   hardWork <span style="color: #009900;">&#40;</span><span style="color: #000088;">$pages</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$i</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>This not only prevents the script from being killed, but also displays progress (x of N). It&#8217;s very useful when the code may have bugs, because it shows the exact unit of work on which your code got stuck. You also need to make sure that there is enough memory. I use this:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #990000;">ini_set</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'memory_limit'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'128M'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Not all scripts require that much memory, but since this is used only for CRON tasks, raising the limit is safe enough.</p>
<p>Furthermore, when scripts are called by <a href="http://www.gnu.org/software/wget/">wget</a> or just a browser, they will be killed as soon as the user aborts the request. So if you click &#8220;Stop loading this page&#8221; in Firefox, execution stops. That&#8217;s normally a good thing, but in my experience wget (or another similar tool) sometimes decides not to wait any longer and simply stops loading. The process gets killed again.</p>
<p>I didn&#8217;t know why, but I spent a whole day trying to make a script complete its execution. Memory wasn&#8217;t an issue and there were no bugs, but it still kept being killed. Fortunately, there is a solution for this problem too:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #990000;">ignore_user_abort</span><span style="color: #009900;">&#40;</span><span style="color: #009900; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>Ignore user abort &#8211; it does what it says: the script keeps running even after the client disconnects.</p>
<p>One last thing to take care of &#8211; disable output compression and buffering. When running CRON jobs, <a href="http://httpd.apache.org/docs/2.0/mod/mod_deflate.html">gzip</a>&#8217;ing content is absolutely useless; it also uses memory and creates more problems with buffer flushing. I disable it like this:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #990000;">apache_setenv</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'no-gzip'</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #990000;">ini_set</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'zlib.output_compression'</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #990000;">ini_set</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'implicit_flush'</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #990000;">header</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Content-Encoding: none&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>My server uses gzip by default, so these settings make sure that the output is not compressed.</p>
<p>That&#8217;s all. I use all these lines at the start of my CRON jobs&#8217; <a href="http://java.sun.com/blueprints/corej2eepatterns/Patterns/FrontController.html">front controller</a> and everything works fine. Please don&#8217;t use them in user-facing scripts, because they can create problems &#8211; if you have no access to the running processes, stuck processes with no time limit will probably bring your web server down.</p>
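<p>The settings above can be collected into one helper that is called at the top of the front controller. This is only a minimal sketch: the function name <em>prepareLongRunningJob()</em> is made up for illustration, and the check for the command-line SAPI is an assumption about how the job might be started, not something my own setup requires.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Hypothetical helper; the name is not part of PHP or of any library.
function prepareLongRunningJob($memoryLimit = '128M')
{
    set_time_limit(0);          // remove the execution time limit
    ignore_user_abort(true);    // keep running if the client disconnects
    ini_set('memory_limit', $memoryLimit);

    // The remaining settings only matter when the job is requested over HTTP.
    if (PHP_SAPI !== 'cli') {
        if (function_exists('apache_setenv')) {
            apache_setenv('no-gzip', '1');   // only defined under the Apache module
        }
        ini_set('zlib.output_compression', '0');
        ini_set('implicit_flush', '1');
        if (!headers_sent()) {
            header('Content-Encoding: none');
        }
    }
}

prepareLongRunningJob();</pre></div></div>

<p>Keeping everything in one function makes it harder to forget a line, and the <em>function_exists()</em> check lets the same script run on servers where PHP is not loaded as an Apache module.</p>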
]]></content:encoded>
			<wfw:commentRss>http://dev.juokaz.com/php/prevent-scripts-from-being-killed/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>HTML filtering and XSS protection</title>
		<link>http://dev.juokaz.com/php/html-filtering-and-xss-protection</link>
		<comments>http://dev.juokaz.com/php/html-filtering-and-xss-protection#comments</comments>
		<pubDate>Sat, 21 Mar 2009 20:40:24 +0000</pubDate>
		<dc:creator>Juozas</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Websites]]></category>
		<category><![CDATA[autoloader]]></category>
		<category><![CDATA[cleanup]]></category>
		<category><![CDATA[cron]]></category>
		<category><![CDATA[dom]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[how-to]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[htmlpurifier]]></category>
		<category><![CDATA[library]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[tidy]]></category>
		<category><![CDATA[tinymce]]></category>
		<category><![CDATA[validate]]></category>
		<category><![CDATA[web scraper]]></category>
		<category><![CDATA[xss]]></category>

		<guid isPermaLink="false">http://dev.juokaz.com/?p=396</guid>
		<description><![CDATA[If you have been programming websites long enough, you know that user input is the first thing to worry about when thinking about security. It&#8217;s really hard to decide what data is acceptable, especially when the user is allowed to insert HTML content through a form.
For example, if you are developing a CMS you need to make sure [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://htmlpurifier.org/"><img class="size-thumbnail wp-image-402" style="float: left;" title="HTML Purifier" src="http://dev.juokaz.com/wp-content/uploads/2009/03/logo-large-150x150.png" alt="HTML Purifier" width="150" height="150" /></a>If you have been programming websites long enough, you know that user input is the first thing to worry about when thinking about security. It&#8217;s really hard to decide what data is acceptable, especially when the user is allowed to insert HTML content through a form.</p>
<p>For example, if you are developing a CMS you need to make sure that user input doesn&#8217;t break the whole template. That&#8217;s not so easy, because you need very clever HTML validation: even one missing closing tag for &lt;div&gt; or &lt;p&gt; can break a website&#8217;s layout completely. Editors like <a href="http://tinymce.moxiecode.com/">TinyMCE</a> can check and try to fix errors, but in my experience they sometimes create more of them.</p>
<p>However, the problem can be solved, and quite easily. Almost a year ago I was reading some random blog when I found out about <a href="http://htmlpurifier.org/">HTML Purifier</a>. Basically, it&#8217;s a library which can filter and fix <strong>any</strong> HTML. <a href="http://htmlpurifier.org/comparison.html">Compared</a> to other libraries, it looked very promising, but until now I hadn&#8217;t had a chance to test it &#8211; other libraries had been working fine.</p>
<p>Today I was working with a <a href="http://dev.juokaz.com/php/web-scraping-with-php-and-xpath">web scraper</a> again and got stuck because of very badly formatted HTML. When regular expressions are used, markup validity isn&#8217;t (or shouldn&#8217;t be) an issue at all, but XPath fails immediately. I tried simplifying queries and hard-coding fixes for the source, but all that required so much effort that I put a Purifier filter between fetching the source and constructing the <a href="http://en.wikipedia.org/wiki/Document_Object_Model">DOM</a>. It worked!</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #b1b100;">require_once</span> <span style="color: #0000ff;">'HTMLPurifier.includes.php'</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$config</span> <span style="color: #339933;">=</span> HTMLPurifier_Config<span style="color: #339933;">::</span><span style="color: #004000;">createDefault</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$config</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">set</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'HTML'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'Doctype'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'XHTML 1.0 Transitional'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #000088;">$config</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">set</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'HTML'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'TidyLevel'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'heavy'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// Don't remove IDs (&lt;div id=&quot;first&quot; /&gt;)</span>
<span style="color: #000088;">$config</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">set</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'Attr'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'EnableID'</span><span style="color: #339933;">,</span> <span style="color: #009900; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$obj</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> HTMLPurifier<span style="color: #009900;">&#40;</span><span style="color: #000088;">$config</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$clean_html</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$obj</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">purify</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$html</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>In this example I chose the worst way &#8211; including all files. The library uses a <a href="http://pear.php.net">PEAR</a>-like directory structure, so a simple autoloader can include the required files in the background, but for simplicity it&#8217;s not used here. This sample code filters the <em>$html</em> variable using the XHTML 1.0 Transitional doctype and does heavy-level <a href="http://en.wikipedia.org/wiki/HTML_Tidy">tidying</a> (quite clear from the source code itself).</p>
<p><a href="http://en.wikipedia.org/wiki/Cross-site_scripting">XSS</a>? Purifier protects against it too &#8211; see the <a href="http://htmlpurifier.org/live/smoketests/xssAttacks.php">full list</a> of tests. The library is also highly customizable (<a href="http://htmlpurifier.org/live/configdoc/plain.html">configuration manual</a>), but the documentation is not very clear &#8211; I spent more than an hour trying to make it return HTML with the <em>&lt;head&gt;</em> part. I haven&#8217;t found any nice solution (maybe because the library is not made for such things).</p>
<p>HTML Purifier contains about 350 files, so it&#8217;s a relatively big library; however, it performs well and shouldn&#8217;t kill your web server. Today I purified more than 1000 pages and extracted information from them using XPath, and it worked reliably &#8211; none of the results were filtered unexpectedly. I definitely recommend it for filtering HTML input because it does a wonderful job &#8211; you can try it online <a href="http://htmlpurifier.org/demo.php">here</a>.</p>
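<p>For illustration, a PEAR-style autoloader can be sketched as below. It simply maps a class name such as <em>HTMLPurifier_Config</em> to <em>HTMLPurifier/Config.php</em>. The <em>$libraryDir</em> path is an assumption and has to point at the library&#8217;s own directory, and the sketch needs PHP 5.3 or newer because of the closure. The library&#8217;s bundled loader (<em>HTMLPurifier.auto.php</em>) may handle special cases that this simplified version does not, so it is probably the safer choice in real code.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Sketch of a PEAR-style autoloader: HTMLPurifier_Config -&gt; HTMLPurifier/Config.php
// $libraryDir is a hypothetical location; adjust it to where the library lives.
$libraryDir = __DIR__ . '/htmlpurifier/library';

spl_autoload_register(function ($class) use ($libraryDir) {
    // Only handle classes that belong to the library.
    if (strpos($class, 'HTMLPurifier') !== 0) {
        return;
    }

    // Underscores in the class name become directory separators.
    $file = $libraryDir . '/' . str_replace('_', '/', $class) . '.php';
    if (is_file($file)) {
        require $file;
    }
});</pre></div></div>

<p>With a loader like this registered, only the files for classes that are actually used get included, instead of the whole library on every request.</p>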
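<p>The step after purifying, building the DOM and querying it with XPath, can be sketched roughly like this. The <em>$cleanHtml</em> string is made-up sample markup standing in for the output of <em>purify()</em>, and the XPath expression is only an example, not the one from my scraper.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// $cleanHtml stands in for the result of $obj-&gt;purify($html); the markup is invented.
$cleanHtml = '&lt;div id="first"&gt;&lt;a href="/one"&gt;One&lt;/a&gt; &lt;a href="/two"&gt;Two&lt;/a&gt;&lt;/div&gt;';

$dom = new DOMDocument();
// Collect parser warnings instead of printing them, just in case.
libxml_use_internal_errors(true);
$dom-&gt;loadHTML($cleanHtml);
libxml_clear_errors();

// Query the cleaned document: every link inside the element with id "first".
$xpath = new DOMXPath($dom);
foreach ($xpath-&gt;query('//div[@id="first"]/a') as $link) {
    echo $link-&gt;getAttribute('href') . ' =&gt; ' . $link-&gt;textContent . "\n";
}</pre></div></div>

<p>This is also why the <em>EnableID</em> setting matters for scraping: without it the <em>id</em> attributes would be stripped and a query like the one above would match nothing.</p>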
]]></content:encoded>
			<wfw:commentRss>http://dev.juokaz.com/php/html-filtering-and-xss-protection/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Parallel processes in PHP</title>
		<link>http://dev.juokaz.com/php/parallel-processes-in-php</link>
		<comments>http://dev.juokaz.com/php/parallel-processes-in-php#comments</comments>
		<pubDate>Thu, 12 Feb 2009 23:36:18 +0000</pubDate>
		<dc:creator>Juozas</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[computing]]></category>
		<category><![CDATA[cron]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[Haskell]]></category>
		<category><![CDATA[http]]></category>
		<category><![CDATA[parallel]]></category>
		<category><![CDATA[rayracer]]></category>
		<category><![CDATA[university]]></category>

		<guid isPermaLink="false">http://dev.juokaz.com/?p=108</guid>
		<description><![CDATA[When I was coding a Ray-Tracer project for my Computer Science studies at university, I came across Haskell&#8217;s parallel map function (map calls a function for every list element). A Ray-Tracer detects reflections, shadows, ray-casts, etc. for every single pixel in the scene, and since everything is mathematical calculation, parallelizing it is almost trivial.
The parallel map function does [...]]]></description>
			<content:encoded><![CDATA[<p>When I was coding a Ray-Tracer project for my Computer Science studies at university, I came across the <a href="http://www.haskell.org/ghc/docs/latest/html/libraries/parallel/Control-Parallel-Strategies.html">Haskell parallel map function</a> (map calls a function for every list element). A <a href="http://en.wikipedia.org/wiki/Ray_tracing_(graphics)">Ray-Tracer</a> detects reflections, shadows, ray-casts, etc. for every single pixel in the scene, and since everything is mathematical calculation, parallelizing it is almost trivial.</p>
<p>The parallel map function does all the work for you &#8211; you don&#8217;t even need to know when it&#8217;s the right time to fork another thread. Implementing parallel computation in Haskell was very easy, and it came close to the theoretical N-fold decrease in processing time (where N is the number of processor cores). Today I will talk a little bit about possible parallelism with PHP&#8217;s built-in <a href="http://uk2.php.net/manual/en/ref.posix.php">POSIX</a> functions.</p>
<p>PHP is neither a functional language nor built with threading support (correct me if I&#8217;m wrong). Still, it&#8217;s possible to create parallel processes by using the <a href="http://uk2.php.net/manual/en/ref.pcntl.php">pcntl_*</a> family of functions. Some days ago I tried to get something working that uses parallel processes and could possibly be extended to deal with time-consuming mathematical algorithms.</p>
<p>Download my sample code <a href="http://dev.juokaz.com/examples/parallel/run.phps" target="_blank">here</a> (rename the extension to <em>php</em>) and run it in a terminal (it probably won&#8217;t run under Apache). This code computes x^3 for the integers from 0 to 1M twice: the parallel version runs both computations at the same time, the normal version runs them one after another. Output should look similar to this:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">juokaz<span style="color: #000000; font-weight: bold;">@</span>thinkpad:~<span style="color: #000000; font-weight: bold;">/</span>Desktop<span style="color: #000000; font-weight: bold;">/</span>php<span style="color: #000000; font-weight: bold;">/</span>parallel$ php run.php
We start
I am the child, pid = <span style="color: #000000;">0</span>
I am the parent, pid = <span style="color: #000000;">2004</span>
yep, finished, I have <span style="color: #000000;">2004</span> ID
yep, finished, I have <span style="color: #000000;">0</span> ID
Ran parallel <span style="color: #000000;">5.81009602547</span> seconds
yep, finished, I have <span style="color: #000000;">1</span> ID
yep, finished, I have <span style="color: #000000;">2</span> ID
Ran normal <span style="color: #000000;">9.76671719551</span> seconds</pre></div></div>

<p>As you can see, on a dual-core processor the parallel computation ran about 1.7x faster (5.8 versus 9.8 seconds). And yes, it does use both cores simultaneously.</p>
<p>If you want to look more deeply, there is a great article called <a href="http://www.van-steenbeek.net/?q=php_pcntl_fork">Thorough look at PHP&#8217;s pcntl_fork()</a>. You will quite quickly find that there are pitfalls when using pcntl_fork(), because it basically clones the running process and then continues. Everything you define before forking will be copied into the child &#8211; that&#8217;s not always what you really want.</p>
<p>Parallel processes can be easily simulated by using asynchronous calls (over HTTP, for example) from one script to others, but for something more calculation-heavy, using pcntl_fork() can be much more practical. Still, I will probably choose asynchronous calls over forking because they are much more flexible.</p>
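<p>To make the forking pattern concrete, here is a minimal sketch. It is not the downloadable sample itself: the <em>cubes()</em> function is a made-up stand-in for the real work, and the script assumes the command-line PHP binary with the pcntl extension enabled.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Minimal sketch; requires the PHP CLI with the pcntl extension enabled.
// cubes() is a hypothetical stand-in for the real work.
function cubes($limit)
{
    $last = 0;
    for ($x = 0; $x &lt; $limit; $x++) {
        $last = $x * $x * $x;
    }
    return $last;
}

$start = microtime(true);

$pid = pcntl_fork();
if ($pid === -1) {
    exit("Could not fork\n");
}

if ($pid === 0) {
    // Child process: pcntl_fork() returned 0 here.
    cubes(1000000);
    exit(0);
}

// Parent process: $pid holds the child's process ID.
cubes(1000000);
pcntl_waitpid($pid, $status);   // wait for the child to finish
printf("Ran parallel %.2f seconds\n", microtime(true) - $start);</pre></div></div>

<p>Note that anything the child changes after the fork is not visible to the parent, because each process works on its own copy of the variables. Results have to be passed back through a file, a pipe or something similar.</p>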
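<p>The asynchronous-calls alternative could look roughly like the following sketch, which uses PHP&#8217;s curl_multi_* functions to request several worker scripts at once. The worker URLs are hypothetical, and the curl extension is assumed to be installed.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">&lt;?php
// Hypothetical worker URLs; each one would process a part of the job.
$urls = array(
    'http://localhost/worker.php?part=1',
    'http://localhost/worker.php?part=2',
);

$multi = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_multi_add_handle($multi, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until none of them is still running.
do {
    $status = curl_multi_exec($multi, $running);
    if ($running) {
        curl_multi_select($multi, 1.0);   // wait for activity on any connection
    }
} while ($running &amp;&amp; ($status === CURLM_OK || $status === CURLM_CALL_MULTI_PERFORM));

foreach ($handles as $url =&gt; $ch) {
    echo $url . ': ' . strlen((string) curl_multi_getcontent($ch)) . " bytes\n";
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);</pre></div></div>

<p>Here the parallelism comes from the web server handling each request in its own process, so the workers can even live on different machines, which is a large part of why this approach is more flexible.</p>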
]]></content:encoded>
			<wfw:commentRss>http://dev.juokaz.com/php/parallel-processes-in-php/feed</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
	</channel>
</rss>
