[langsec-discuss] stackoverflow's HTML sanitizer bypassed

Allan Wegan allanwegan at allanwegan.de
Thu Feb 26 17:55:43 UTC 2015


> sub encode_entities
> {
>     return undef unless defined $_[0];
>     my $ref;
>     if (defined wantarray) {
> 	my $x = $_[0];
> 	$ref = \$x;     # copy
>     } else {
> 	$ref = \$_[0];  # modify in-place
>     }
>     if (defined $_[1] and length $_[1]) {
> 	unless (exists $subst{$_[1]}) {
> 	    # Because we can't compile regex we fake it with a cached sub
> 	    my $chars = $_[1];
> 	    $chars =~ s,(?<!\\)([]/]),\\$1,g;
> 	    $chars =~ s,(?<!\\)\\\z,\\\\,;
> 	    my $code = "sub {\$_[0] =~ s/([$chars])/\$char2entity{\$1} || num_entity(\$1)/ge; }";
> 	    $subst{$_[1]} = eval $code;
> 	    die( $@ . " while trying to turn range: \"$_[1]\"\n "
> 	      . "into code: $code\n "
> 	    ) if $@;
> 	}
> 	&{$subst{$_[1]}}($$ref);
>     } else {
> 	# Encode control chars, high bit chars and '<', '&', '>', ''' and '"'
> 	$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
>     }
>     $$ref;
> }
>
>
> I don't know really much about perl, but I think in the end this also
> uses regex to try to parse regex, doesn't it?

I am not into Perl myself. But the function name strongly suggests that
the code does not parse HTML but encodes a given string as HTML entities.
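
To illustrate the difference between encoding and parsing, here is a
minimal Python sketch of what such an entity encoder roughly does. The
NAMED table, the set of characters treated as unsafe, and the hexadecimal
fallback are my assumptions for illustration, not the behaviour of the
quoted Perl module.

```python
import re

# Hypothetical, deliberately tiny table; a real encoder knows many more names.
NAMED = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;"}

# Assumed "unsafe" set: the markup characters, the apostrophe, control
# characters other than tab/newline/carriage return, and everything non-ASCII.
UNSAFE = re.compile(r'''([<>&"'\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\U0010ffff])''')


def encode_entities(text):
    """Replace each unsafe character with an entity. No parsing involved."""
    def replace(match):
        ch = match.group(1)
        # Named entity if we know one, numeric character reference otherwise.
        return NAMED.get(ch) or "&#x%X;" % ord(ch)
    return UNSAFE.sub(replace, text)
```

The regular expression only selects single characters to replace. It never
has to understand tags or nesting, so using one here says nothing about
how the HTML itself gets parsed.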


> I just had a quick look at the code, so I might be wrong, but
> it looks to me all those modules do use regex in the end
> to parse html?

Of course they use regular expressions for input tokenization. That is
most often the right tool for the job. But they do not, for example, rely
on regular expressions to parse the tag structure.
The linked Python code resembles HTML parser skeletons meant to be derived
from. They neither return a parse tree nor handle nesting themselves. The
user of the classes is expected to do that by deriving from them and
filling in the blanks intentionally left open (search for "pass"
statements). But even that skeleton handles comments, HTML entities, and
some other stuff with non-trivial code that is not regular expressions.
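
As a sketch of what "deriving and filling in the blanks" can look like, I
assume here that the standard library's html.parser.HTMLParser is
representative of such a skeleton. The NestingTracker class, its VOID set,
and its error recovery are hypothetical and far simpler than what a real
HTML consumer needs.

```python
from html.parser import HTMLParser


class NestingTracker(HTMLParser):
    """Hypothetical subclass that adds the nesting the base class leaves out."""

    # Assumed short list of elements that never get an end tag.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.paths = []   # path of every element seen, e.g. "div/p/b"

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        # Naive recovery: close everything up to the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


tracker = NestingTracker()
tracker.feed("<div><p>one<br><b>two</b></p></div>")
tracker.close()
```

The base class hands over one token at a time. The stack, which is the
part no regular expression can replace, lives entirely in the subclass.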


> What do you think?

The tokens of most artificial languages can be described by regular
languages. That is why regular expressions are very useful for parsing
them, even if the language formed by those tokens is not regular itself.
Don't let that confuse you. Always look at the non-regular-expression
code too. That is where the non-regular properties of the language get
parsed.
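
A small Python sketch of that division of labour, using a made-up toy
language of numbers, names, and parentheses (the TOKEN pattern and the
parse function are my own illustration, not taken from any of the linked
modules):

```python
import re

# The regular part: one token is a number, a name, or a single parenthesis.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|([()]))")


def parse(text):
    """Turn "(add 1 (mul 2 3))" into nested Python lists."""
    text = text.rstrip()
    stack = [[]]          # the non-regular part: one list per open "("
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("bad token at offset %d" % pos)
        pos = match.end()
        number, name, paren = match.groups()
        if number is not None:
            stack[-1].append(int(number))
        elif name is not None:
            stack[-1].append(name)
        elif paren == "(":
            stack.append([])
        else:
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]
```

The regular expression never sees more than one token. Whether the
parentheses balance is decided by the stack, and that is exactly the
property a regular expression alone cannot check.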



-- 
Allan Wegan
Jabber: allanwegan at jabber.ccc.de
 OTR-Fingerprint: A1AAA1B9C067F9884A424D339834346929164587
ICQ: 209459114
 OTR-Fingerprint: 71DE5B5E67D6D758A93BF1CE7DA06625205AC6EC

