[langsec-discuss] RegEx BBCode Parser

Fabian Fäßler fabi at fabif.de
Sat Jun 2 22:33:58 UTC 2012


Thanks again for all the good information and help in my last thread. Now I have some questions about something I have done and am trying to improve.

Some months ago I played a lot with BBCodes and XSS. At the time I didn't know that, from a langsec point of view, I was playing with bad parsing.
I found some XSS vulnerabilities in my roommate's browser game by nesting BBCodes like this:

[img] [color=#ff0000 onerror='eval(...)']red[/color][/img]

This worked in his implementation. With another implementation it didn't work, but this did:

[img] [color=#ff0000 onerror='alert(1)']red[/img][/color]

Then I looked at the implementation: he used some regular expressions to replace the BBCode strings.
There are two things I want to clarify.

He used regular expressions to match a recursive language. In terms of the Chomsky hierarchy, he used a Type-3 (regular) language to match something that nests recursively.
He should have used at least a Type-2 (context-free) language. Is that right?
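
A minimal sketch of the difference, in Python. The tokenizer below is still a regular expression, but the nesting is checked with a stack, which is the part a plain regex cannot do for arbitrary depth. The names `TAG` and `is_well_nested` and the tag syntax accepted here are assumptions for illustration, not the code from the browser game.

```python
import re

# Hypothetical tag syntax: [name], [name=value] or [/name]; no spaces in the value.
TAG = re.compile(r"\[(/?)(\w+)(?:=[^\]\s]*)?\]")

def is_well_nested(text: str) -> bool:
    """Return True only if every opening tag is closed in reverse order."""
    stack = []
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append(name)          # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                # closing tag does not match the innermost open tag
    return not stack                    # leftover open tags are also an error

print(is_well_nested("[img][color=#ff0000]red[/color][/img]"))
print(is_well_nested("[img][color=#ff0000]red[/img][/color]"))
```

The first call accepts the properly nested input and the second rejects the crossed closing tags. A renderer built on such a check could refuse the crossed input instead of rewriting it.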

He built a weird machine, because every regex rewrites the input of the next one, and this leads to unpredictable or unwanted behavior.

input = html_escape(real_input)
output1 = regex(input)
output2 = regex(output1)
output3 = regex(output2)

Is this really a weird machine, because each parsing step reinterprets (and changes) the output of the previous one, or have I misunderstood this?
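
To make the chained rewriting concrete, here is a small Python sketch. The two patterns and the function name `naive_render` are my own assumptions about what such an implementation might look like, not the actual code, and the payload is written without quotes to keep the trace simple.

```python
import html
import re

def naive_render(real_input: str) -> str:
    """Hypothetical renderer: each regex pass rewrites the output of the previous one."""
    text = html.escape(real_input)
    # Pass 1: [img]...[/img] becomes an <img> tag with a quoted src attribute.
    text = re.sub(r"\[img\](.*?)\[/img\]", r'<img src="\1">', text)
    # Pass 2: [color=...]...[/color] becomes a <span>; it also runs inside the src value.
    text = re.sub(r"\[color=(.*?)\](.*?)\[/color\]",
                  r'<span style="color:\1">\2</span>', text)
    return text

payload = "[img][color=#ff0000 onerror=alert(1)]red[/color][/img]"
print(naive_render(payload))
# <img src="<span style="color:#ff0000 onerror=alert(1)">red</span>">
```

The input contains no HTML special characters, so the escaping step changes nothing. The double quote emitted by the second pass ends the src value opened by the first pass, which leaves onerror=... sitting in attribute position inside the <img> tag.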

So I tried to automate the process of finding such bad BBCode constellations. I called this project nazo (Japanese for puzzle).
https://github.com/Samuirai/python/tree/master/nazo (it doesn't work - I gave up)
The scanner has three steps:
1. Test which BBCodes are implemented.
2. Create every permutation of the available BBCodes.
3. Test whether the permuted BBCodes lead to an escape that allows XSS.

But as you can imagine, permuting all the tags to find something is extremely inefficient.
I want to rewrite the code into something more intelligent at finding such things. I have been thinking a lot about how to automate this.
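
A rough Python sketch of the brute-force step, to show how fast the candidate list grows. This is not the nazo code; `nested_candidates` and the tag list are made up for illustration, and it only tries two closing orders per combination and no attributes at all.

```python
from itertools import permutations

def nested_candidates(tags, depth):
    """Yield nested test strings for every ordered selection of `depth` distinct tags."""
    for combo in permutations(tags, depth):
        opening = "".join(f"[{t}]" for t in combo)
        proper = "".join(f"[/{t}]" for t in reversed(combo))   # correct closing order
        crossed = "".join(f"[/{t}]" for t in combo)            # deliberately crossed
        yield opening + "x" + proper
        if depth > 1:
            yield opening + "x" + crossed

tags = ["b", "i", "img", "color", "url"]   # hypothetical list of supported tags
for depth in range(1, 5):
    print(depth, sum(1 for _ in nested_candidates(tags, depth)))
```

With n tags and depth k the number of ordered selections is n!/(n-k)!, and every attribute value or extra closing order multiplies that again, so each additional tag or nesting level makes the scan noticeably slower.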

Now that I've started to understand more about langsec, maybe I can find an algorithm or an idea here for how to do this better.
Also, please correct me if I use the wrong terms or have misunderstood something.

So what do we know about the "target"?
We know, or assume, that he uses a more restricted class of language in the Chomsky hierarchy (Type-3, regular) to handle something that needs a less restricted one (Type-2, context-free).
We can recover parts of the grammar he tries to implement by sending basic combinations and comparing the input with the output.
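
The input/output comparison could look roughly like this in Python. `toy_render` is only a stand-in for the real target (in a scanner it would be an HTTP round trip), and `detect_supported` and the marker string are names I invented for the sketch.

```python
import re

def toy_render(text: str) -> str:
    """Stand-in target that only implements [b] and [i]."""
    text = re.sub(r"\[b\](.*?)\[/b\]", r"<b>\1</b>", text)
    text = re.sub(r"\[i\](.*?)\[/i\]", r"<i>\1</i>", text)
    return text

def detect_supported(render, tags, marker="nazoprobe"):
    """Treat a tag as implemented if its literal markup does not survive rendering."""
    supported = []
    for tag in tags:
        probe = f"[{tag}]{marker}[/{tag}]"
        if probe not in render(probe):
            supported.append(tag)
    return supported

print(detect_supported(toy_render, ["b", "i", "u", "img"]))
```

The same comparison can be repeated with nested probes to learn how the target treats one tag inside another, which narrows down which combinations are worth testing at all.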

Are there theories about how to find loopholes in an implemented grammar?
While writing this I realize that what I am trying to achieve is basically what fuzzers do. So it seems that there is no "perfect" solution(?). The only way to find such things is trial and error(?), because you can't tell whether a grammar is correct or not. That's basically the problem you try to address with language-theoretical security, right?

So any suggestions, papers, comments, ...?

kind regards,