If you have been building websites for long enough, you know that user input is the first thing to worry about when thinking about security. It's really hard to decide what data is acceptable, especially when users are allowed to submit HTML content through a form.
For example, if you are developing a CMS, you need to make sure that user input doesn't break the whole template. That's not easy, because you need very clever HTML validation: even one missing closing tag for a <div> or <p> can break a website's layout completely. Editors like TinyMCE can check for errors and try to fix them, but in my experience they sometimes create more of them.
However, the problem can be solved, and quite easily. Almost a year ago I was reading some random blog when I found out about HTML Purifier. Basically, it's a library that can filter and fix any HTML. Compared to other libraries it looked very promising, but until now I hadn't had a chance to test it, because the other libraries had been working fine.
Today I was working on a web scraper again and got stuck on some very badly formatted HTML. When regular expressions are used, code validity isn't (and shouldn't be) an issue at all, but XPath fails immediately. I tried simplifying the queries and hard-coding fixes for the source, but all of that required so much effort that I put a Purifier filter between fetching the source and constructing the DOM. It worked!
require_once 'HTMLPurifier.includes.php';

$config = HTMLPurifier_Config::createDefault();
// Dotted directive names are the HTML Purifier 4.x form;
// older releases used set('HTML', 'Doctype', ...) instead.
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML.TidyLevel', 'heavy');
// Don't remove IDs (<div id="first" />)
$config->set('Attr.EnableID', true);

$obj = new HTMLPurifier($config);
$clean_html = $obj->purify($html);
In this example I chose the worst way: including all files. The library uses a PEAR-like directory structure, so a simple autoloader could include the required files in the background, but for simplicity it's not used here. The sample code filters the $html variable as XHTML 1.0 Transitional and does heavy tidying (which is quite clear from the source code itself).
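For reference, the autoloader idea can be sketched in a few lines. This assumes the plain PEAR naming convention (underscores in the class name map to directory separators) and a hypothetical `library/` directory next to the script; the helper name `pear_class_to_path` is made up for illustration. Some classes in the real library may not follow the plain mapping, and the distribution ships its own `HTMLPurifier.auto.php`, which is probably the better choice in practice.

```php
<?php
/**
 * Map a PEAR-style class name to a relative file path,
 * e.g. HTMLPurifier_Config -> HTMLPurifier/Config.php.
 * (Hypothetical helper, not part of HTML Purifier.)
 */
function pear_class_to_path($class)
{
    return str_replace('_', '/', $class) . '.php';
}

// Assumption: the library was unpacked into ./library next to this script.
$libraryDir = __DIR__ . '/library';

spl_autoload_register(function ($class) use ($libraryDir) {
    // Only handle HTML Purifier classes; leave everything else to other loaders.
    if (strpos($class, 'HTMLPurifier') !== 0) {
        return;
    }
    $file = $libraryDir . '/' . pear_class_to_path($class);
    if (is_file($file)) {
        require_once $file;
    }
});

// prints HTMLPurifier/AttrDef/URI.php
echo pear_class_to_path('HTMLPurifier_AttrDef_URI'), "\n";
```

With something like this registered, files are loaded only when a class is first used, instead of all of them up front.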
XSS? Purifier protects against that as well (full list of tests). The library is also highly customizable (configuration manual), but the documentation is not very clear: I spent more than an hour trying to make it return HTML with the <head> part. I haven't found a nice solution (maybe because the library is not made for such things).
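To give an idea of what the filtering and configuration look like for user-submitted HTML, here is a minimal sketch. The whitelist passed to `HTML.Allowed` and the sample input are my own assumptions, and the include assumes HTML Purifier's library directory is on the include path.

```php
<?php
// Assumption: HTML Purifier's "library" directory is on the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Whitelist of elements/attributes a comment form might accept (example values).
$config->set('HTML.Allowed', 'p,b,i,a[href]');
$purifier = new HTMLPurifier($config);

// Untrusted input with an inline handler, a javascript: link and a <script> block.
$dirty = '<p onclick="steal()">Hello <b>world</b> '
       . '<a href="javascript:alert(1)">click</a></p>'
       . '<script>alert(2)</script>';

$clean = $purifier->purify($dirty);
echo $clean, "\n";
```

The result should contain only the whitelisted tags; the event handler, the script block and the javascript: URL are expected to be dropped.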
HTML Purifier contains about 350 files, so it's a relatively big library; however, it performs well and shouldn't kill your web server. Today I purified more than 1000 pages and extracted information from them using XPath, and it was really stable: none of the results were filtered unexpectedly. I definitely recommend it for filtering HTML input because it just does a wonderful job. You can try it online here.
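The scraping setup (purify first, then build the DOM and query it with XPath) might look roughly like this. It is only a sketch: the function name `extract_text`, the sample markup and the query are made up for illustration, and it assumes HTML Purifier's library directory is on the include path.

```php
<?php
// Assumption: HTML Purifier's "library" directory is on the include path.
require_once 'HTMLPurifier.auto.php';

/**
 * Purify a raw HTML fragment, load it into a DOM and return the trimmed
 * text of every node matched by the XPath query.
 * (Hypothetical helper, not part of HTML Purifier.)
 */
function extract_text($rawHtml, $query)
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
    $config->set('Attr.EnableID', true); // keep id attributes for XPath lookups
    $purifier = new HTMLPurifier($config);

    // purify() returns a cleaned-up body fragment, not a full document.
    $clean = $purifier->purify($rawHtml);

    // Wrap the fragment so DOMDocument treats the content as UTF-8.
    $dom = new DOMDocument();
    $dom->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>'
        . '<body>' . $clean . '</body></html>'
    );

    $xpath = new DOMXPath($dom);
    $texts = array();
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Broken input: neither <p> is closed.
$raw = '<div id="first"><p>Hello<p>World</div>';
print_r(extract_text($raw, '//div[@id="first"]/p'));
```

In a real scraper you would probably create the purifier once and reuse it for every page rather than rebuilding it on each call.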







