[langsec-discuss] stackoverflow's HTML sanitizer bypassed

Sven Kieske svenkieske at gmail.com
Thu Feb 26 10:43:48 UTC 2015


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 23.02.2015 04:27, travis+ml-langsec at subspacefield.org wrote:
> Quote: This post is part of a series 
> <http://danlec.com/blog/hacking-stackoverflow-com> describing the
> 33 security vulnerabilities I reported to
> stackoverflow.com <http://stackoverflow.com/> from 2009-2013.
> This particular exploit was reported and fixed in 2009. 
> http://danlec.com/blog/hacking-stackoverflow-com-s-html-sanitizer
> 
> Funny: 
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
>
>  Accurate: I think the flaw here is that HTML is a Chomsky Type 2
> grammar (context free 
> grammar)<http://en.wikipedia.org/wiki/Context-free_grammar> and
> RegEx is a Chomsky Type 3 grammar (regular 
> grammar)<http://en.wikipedia.org/wiki/Regular_grammar>. Since a
> Type 2 grammar is fundamentally more complex than a Type 3 grammar
> (see the Chomsky
> hierarchy<http://en.wikipedia.org/wiki/Chomsky_hierarchy>), you
> can't possibly make this work. But many will try, some will claim 
> success and others will find the fault and totally mess you up.
> 
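
To make the quoted Type 2 vs. Type 3 argument concrete, here is a minimal sketch. It assumes a toy input containing only bare <div> tags, and the names naive_inner and balanced_inner are made up for illustration. A single non-greedy pattern stops at the first closing tag, while a depth counter (something a plain regular expression cannot keep) finds the matching one.

```python
import re

# Naive pattern: pairs the first <div> with the *first* </div> it sees.
NAIVE_DIV = re.compile(r"<div>(.*?)</div>", re.S)


def naive_inner(markup):
    """Return what the naive regex believes is inside the outer <div>."""
    match = NAIVE_DIV.search(markup)
    return match.group(1) if match else None


def balanced_inner(markup):
    """Return the content of the outer <div> by counting nesting depth."""
    depth = 0
    start = None
    for match in re.finditer(r"</?div>", markup):
        if match.group(0) == "<div>":
            if depth == 0:
                start = match.end()  # content begins after the outer tag
            depth += 1
        else:
            depth -= 1
            if depth < 0:
                return None  # closing tag without an opening tag
            if depth == 0:
                return markup[start:match.start()]
    return None  # unbalanced input


sample = "<div>a<div>b</div>c</div>"
print(naive_inner(sample))     # a<div>b
print(balanced_inner(sample))  # a<div>b</div>c
```

The counter version still uses a regex, but only to find individual tags; the nesting is tracked outside of it.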

Hi,

thanks for this information.
While reading up on this topic, I stumbled over this
article:

http://blog.codinghorror.com/parsing-html-the-cthulhu-way/

It claims that Perl has "solved" this problem with
"HTML::Sanitizer":
http://search.cpan.org/~nesting/HTML-Sanitizer-0.04/Sanitizer.pm (dead
link)

That module doesn't seem to be around anymore, so I searched and found this:
http://search.cpan.org/~nigelm/HTML-Scrubber-0.11/lib/HTML/Scrubber.pm

Its documentation says what it actually uses to parse the HTML:
"I wasn't satisfied with HTML::Sanitizer because it is based on
HTML::TreeBuilder, so I thought I'd write something similar that works
directly with HTML::Parser."

OK, let's look at "HTML::Parser":
http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.71/Parser.pm

It requires "HTML::Entities", so let's look there first:
http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.71/lib/HTML/Entities.pm

and there it is:

sub encode_entities
{
    return undef unless defined $_[0];
    my $ref;
    if (defined wantarray) {
	my $x = $_[0];
	$ref = \$x;     # copy
    } else {
	$ref = \$_[0];  # modify in-place
    }
    if (defined $_[1] and length $_[1]) {
	unless (exists $subst{$_[1]}) {
	    # Because we can't compile regex we fake it with a cached sub
	    my $chars = $_[1];
	    $chars =~ s,(?<!\\)([]/]),\\$1,g;
	    $chars =~ s,(?<!\\)\\\z,\\\\,;
	    my $code = "sub {\$_[0] =~ s/([$chars])/\$char2entity{\$1} || num_entity(\$1)/ge; }";
	    $subst{$_[1]} = eval $code;
	    die( $@ . " while trying to turn range: \"$_[1]\"\n "
	      . "into code: $code\n "
	    ) if $@;
	}
	&{$subst{$_[1]}}($$ref);
    } else {
	# Encode control chars, high bit chars and '<', '&', '>', ''' and '"'
	$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
    }
    $$ref;
}


I don't really know much about Perl, but I think in the end this also
uses regex to try to parse HTML, doesn't it?
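
For readers who, like me, don't read Perl fluently, here is a rough Python analogue of the default branch of the sub quoted above. It is only a sketch: the function name, the use of html.entities.codepoint2name, and the decimal fallback format are my assumptions, not what the Perl module does byte for byte. As far as I can tell, the regex here matches single characters to escape, not tag structure.

```python
import re
from html.entities import codepoint2name

# Same character class as the default branch of the Perl sub: anything that
# is NOT newline, CR, tab, space, or "safe" printable ASCII gets replaced.
UNSAFE = re.compile(r"[^\n\r\t !#$%(-;=?-~]")


def encode_entities(text):
    """Replace each unsafe character with a named or numeric entity."""
    def replace(match):
        codepoint = ord(match.group(0))
        name = codepoint2name.get(codepoint)
        # Assumption: fall back to a decimal numeric entity when no
        # HTML 4 name exists for the character.
        return f"&{name};" if name else f"&#{codepoint};"

    return UNSAFE.sub(replace, text)


print(encode_entities('<a href="x">&</a>'))
# &lt;a href=&quot;x&quot;&gt;&amp;&lt;/a&gt;
```

Each match is one character, so no nesting has to be tracked for this step.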

I asked myself how Python does this (if you've got a problem, it's
almost guaranteed that Python has a module which solves it).

It turns out it does it like this:
Python 2:
https://hg.python.org/cpython/file/2.7/Lib/HTMLParser.py

Python 3:
https://hg.python.org/cpython/file/3.4/Lib/html/parser.py

I just had a quick look at the code, so I might be wrong, but
it looks to me like all those modules do use regex in the end
to parse HTML?
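
To show how the Python 3 module is driven from the outside, here is a small sketch using the documented html.parser.HTMLParser callbacks. The TagCollector class and collect_tags function are hypothetical helpers I made up for illustration. From my quick reading, the module seems to use regexes to recognise individual tokens and then hands each one to a callback, so keeping track of nesting is left to the subclass.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start and end tags in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def collect_tags(markup):
    """Feed markup to the parser and return the recorded tag events."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


for event in collect_tags('<p class="x">hi <b>there</b></p>'):
    print(event)
```

A sanitizer built on this would decide per event whether to keep or drop a tag, instead of matching whole documents against one pattern.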

What do you think?

Maybe I should inform these programmers?

kind regards

Sven
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQGcBAEBAgAGBQJU7vjkAAoJEAq0kGAWDrql0swMAJ5sdbWGawjG/sO+suUxguD0
py/ASDhpeiESUtRhri9B5cZbNB/29t0v62MqXIt9Lez1a2v6A2NqK/6fyk4YgGKf
cjXRrGJaXERFE2HWK+G0MStZK3drY6T22YMFtIImdfQZ3ERx4gz+UwRI+z0aXD6i
1Ew9Nu6ftU2fmJoDek+SqEJm0Q2vviWoH0drzjMU55VfNhhfeOQLjCV+9ocXoAQo
apDG5v4jGV0AOW6vyCtwKfsxTaK7bcBwFwdHz+sGVQ6LFNTRV6K5yN/hxY0gw/s1
k5cw4/7SpmDOft1HLzdqOnBSxn0EjIj3MELyj0zbsCiler76ytiTDFjGWkrYsZce
LHC/OYiH6EoCoNb8cImR6XNw7/4mtHMwpT/PuD+bvYUMKSrUD/VL+tDhdeETUtMA
k7o2ZS3SHqRn1Dhzg1K/EVOQIOMnKm+768M0y4rwnH9xorKFE9wQopBJs+Kx2Igq
3LxzTPJST3xI1HQmnn2mcaBP+bfphQydJHKL4Vxfvw==
=W3nR
-----END PGP SIGNATURE-----

