HTML filtering and XSS protection

Posted March 21st, 2009 by Juozas

If you have been programming websites long enough, you know that user input is the first thing to worry about when it comes to security. It’s really hard to decide what data is acceptable, especially when users are allowed to submit HTML content through a form.

For example, if you are developing a CMS, you need to make sure that user input doesn’t break the whole template. That’s not so easy, because it takes very clever HTML validation: even one missing closing tag for a <div> or <p> can break a website’s layout completely. Editors like TinyMCE can check for errors and try to fix them, but in my experience they sometimes create more of them.

However, the problem can be solved, and quite easily. Almost a year ago I was reading some random blog when I found out about HTML Purifier. Basically, it’s a library that can filter and fix any HTML. Compared to other libraries it looked very promising, but until now I hadn’t had a chance to test it, because the other libraries had been working fine.

Today I was working on a web scraper again and got stuck because of very badly formatted HTML. When regular expressions are used, code validity isn’t (and shouldn’t be) an issue at all, but XPath fails immediately. I tried simplifying the queries and hard-coding fixes for the source, but all that required so much effort that I put a Purifier filter between fetching the source and constructing the DOM. It worked!
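The flow described above (fetch, clean, build a DOM, query with XPath) could look roughly like the sketch below. It assumes PHP’s bundled DOM extension, since the post does not say which DOM API was actually used. The function name extractText and the $clean callable are hypothetical; in a real setup $clean would wrap a call to HTML Purifier’s purify() method, while here an identity function stands in so the sketch runs without the library.

```php
<?php
/**
 * Run an XPath query against HTML that was first passed through a cleaner.
 *
 * $clean is any callable that takes and returns an HTML string
 * (hypothetical parameter; it stands in for the Purifier step).
 *
 * @return string[] trimmed text content of every matching node
 */
function extractText(string $html, string $query, callable $clean): array
{
    $cleanHtml = $clean($html);

    $doc = new DOMDocument();
    // Collect libxml parse warnings instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($cleanHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        // The XPath expression itself was invalid.
        return [];
    }

    $results = [];
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Identity cleaner as a stand-in for the Purifier step.
$titles = extractText(
    '<div id="first"><h2>One</h2></div><div><h2>Two</h2></div>',
    '//h2',
    fn (string $h): string => $h
);
```

Passing the cleaner in as a callable keeps the XPath part independent of the filtering library, so the same function works with or without the Purifier step.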

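// Simplest setup: pull in every library file at once
require_once 'HTMLPurifier.includes.php';

$config = HTMLPurifier_Config::createDefault();

// Namespace/key form used by HTML Purifier 3.x; releases from 4.0 on
// prefer a single dotted key, e.g. $config->set('HTML.Doctype', ...)
$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML', 'TidyLevel', 'heavy');
// Don't remove IDs (<div id="first" />)
$config->set('Attr', 'EnableID', true);

$purifier = new HTMLPurifier($config);

// $html holds the raw markup to be cleaned
$clean_html = $purifier->purify($html);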

In this example I chose the worst way: including all the files. The library uses a PEAR-like directory structure, so a simple auto-loader can include all required files in the background, but for simplicity it’s not used here. This sample code filters the $html variable using the XHTML 1.0 Transitional doctype and does heavy-level tidying (which is quite clear from the source code itself).
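As a rough illustration of the auto-loader idea, here is a minimal sketch of a PEAR-style loader, where underscores in a class name become directory separators. The helper pearClassToPath and the $libraryDir location are hypothetical, and this naive mapping may not cover every class in the library. The distribution also ships its own autoload includes, which are probably the better choice in practice.

```php
<?php
/**
 * Map a PEAR-style class name to a relative file path,
 * e.g. HTMLPurifier_Config -> HTMLPurifier/Config.php.
 */
function pearClassToPath(string $class): string
{
    return str_replace('_', '/', $class) . '.php';
}

// Hypothetical location of the unpacked "library" folder.
$libraryDir = __DIR__ . '/htmlpurifier/library';

spl_autoload_register(function (string $class) use ($libraryDir): void {
    // Only handle the library's own classes; leave the rest to other loaders.
    if (strncmp($class, 'HTMLPurifier', 12) !== 0) {
        return;
    }
    $file = $libraryDir . '/' . pearClassToPath($class);
    if (is_file($file)) {
        require_once $file;
    }
});
```

With a loader like this registered, each class file is read only when the class is first used, instead of all the files being loaded on every request.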

XSS? Purifier protects against that too (full list of tests). The library is also highly customizable (configuration manual), but the documentation is not very clear: I spent more than an hour trying to make it return HTML with the <head> part. I haven’t found any nice solution (maybe because the library is not made for such things).

HTML Purifier contains about 350 files, so it’s a relatively big library; however, it performs well and shouldn’t kill your web server. Today I purified more than 1000 pages and extracted information from them using XPath, and it was really stable: none of the results were filtered unexpectedly. I definitely recommend it for filtering HTML input because it just does a wonderful job. You can try it online here.

Comments (6)

  1. Edward Z. Yang

    Hello!

    Thanks for your blog post about HTML Purifier. You are right: HTML Purifier isn’t currently able to return HTML with the head tag; it’s just not what HTML Purifier is made for. Maybe in a future version it will have that functionality (probably when we build in HTML5 parsing).

  2. Juozas (author)

    Hi, thanks for your comment.

    What about returning:

    doctype return

    ? It shouldn’t be that hard to implement :)

  3. Jim R. Wilson

    You’re absolutely right that purifying potentially malicious HTML is a pain. One solution that I particularly like is using a light markup language in lieu of allowing full-blown HTML.

    A variety of suitable light markup languages exist, such as Markdown, Textile and those included with wiki systems (MediaWiki’s wikitext comes to mind). Of course, this generally comes as a tradeoff since most WYSIWYG editors focus on creating HTML to return to the server.

    If your users can tolerate learning a light markup language, IMO that’s a good way to go.

  4. Jamie Krasnoo

    HTML Purifier is meant to scrub user input for use in a site, so it won’t return the <head> if it’s included. It will scrub it out.

    HTML Purifier is a bit of a pig but if you take the time to set up its cache you’ll be rewarded with a performance increase.

  5. Sara

    Interesting point on web scrapers. For web scrapers I use Python for simple things, but for larger projects I have used the extractingdata.com web scraper, which builds custom web scrapers and data extraction programs simply and fast.

  6. star config web design sydney

    I agree with you: HTML validation is very important when you are building a CMS. I have heard about the TinyMCE editor; a similar editor is used in the Joomla and Mambo content management systems, and they are using it in Moodle 2. It is quite good.

    Thank you for your article it is really good, i liked it.
