[langsec-discuss] CVEs from inappropriate use of regular expressions

Jeffrey Goldberg jeffrey at goldmark.org
Wed Nov 29 20:14:29 UTC 2017

I do not know if these resulted in CVEs, but a number of password managers had security vulnerabilities that were due to misparsing URIs with regular expressions. URIs comprise a regular language, so these don’t strictly meet what you are asking for; but the regular expressions were not built around the specification of the URL language.

(Also note that most developers use perl-compatible regular expressions which can describe some non-regular languages.)

The exploits were typically of the form of tricking a password manager into filling secrets for legitimate site A into malicious site B. Password managers store a URL along with the username and password, and those that assist with filling web login forms will not fill in forms on pages whose location does not “match” what the password manager has stored. That matching requires the password manager to parse the web page location and also to parse what it has stored internally. Misparsing of either can lead to a failure that would allow the kind of attack I described.
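A minimal sketch of that failure mode (in Python, with made-up URLs; not our actual matching code): a substring-style regex “match” fills the form on an attacker-controlled host, while comparing the hosts of properly parsed URLs does not.

```python
import re
from urllib.parse import urlparse

stored = "https://www.facebook.com/login"
malicious = "https://www.facebook.com.evil.example/login"

# Naive regex "matching": substring search on the raw location string.
naive = re.compile(r"facebook\.com")
print(bool(naive.search(malicious)))  # True -> would fill the form!

# Parse both locations and compare the actual hosts instead.
same_host = urlparse(stored).hostname == urlparse(malicious).hostname
print(same_host)  # False -> refuse to fill
```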

The sort of thing that was appearing in phishing email that could fool several password managers was of the form


The fragment would create the malicious page, but I don’t have that at hand. (What I’ve quoted is in our tests, but I’d have to dig through a lot of history to get the complete example of a malicious URI.) The point is that the password manager and the browser may interpret that URI very very differently.

In our case (I work for the makers of 1Password), we had known not to use regexes for this ever since Sergey explained the basic principles of langsec to me many years ago at a party in a Las Vegas penthouse. However, knowing not to do X and not doing X are two different things. One reason it took us time to move to proper parsing of URLs is that many users had lots of data with alleged URLs like “www.facebook.com”. It turns out that that is a syntactically valid URL, but when parsed according to the specification, it is a URL with only a path component.
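For illustration, Python’s `urllib.parse` follows the specification on this point and shows what such an entry actually parses as: no scheme, no host, just a path.

```python
from urllib.parse import urlparse

# "www.facebook.com" is syntactically valid, but per the spec it is
# a bare path: there is no scheme and no host component at all.
u = urlparse("www.facebook.com")
print(repr(u.scheme))  # ''
print(repr(u.netloc))  # ''
print(repr(u.path))    # 'www.facebook.com'
```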

Because for years we had accepted and allowed users to accumulate such malformed data, such as “www.facebook.com”, we couldn’t simply switch over to proper URL parsing. Since there wasn’t an easy fix and the threat was seen as “theoretical” by many developers, we were slower than we should have been to address this. However, I believe that because I’d learned about langsec, we had a head start when the motivation finally arrived.

When actual exploits started to be reported for these sorts of problems in our competitors, the internal incentives changed rapidly. I also had the very dubious pleasure of saying “I told you so.”[1] By this time we had already moved to proper URL parsing on some platforms and had already identified the challenges and other things that would need to be addressed.

Anyway, our parsing of URLs still has to make an exception for things like “www.facebook.com”. (We also strip leading and trailing whitespace.) And there are a few other quirks. Here are a couple of excerpts from our internal documentation on URI parsing:

> Apple’s `NSURL URLWithString` incorrectly rejects URI strings that have unescaped "/" and "?" within the query portion of a URI. These should be allowed according to RFC 3986.
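As a cross-check (illustrative URL, Python rather than Foundation), `urllib.parse` follows RFC 3986 here and accepts unescaped "/" and "?" inside the query:

```python
from urllib.parse import urlparse

# RFC 3986 permits unencoded "/" and "?" within the query component;
# the query runs from the first "?" to the end (or to a "#").
u = urlparse("https://example.com/search?redirect=/home?tab=2")
print(u.path)   # /search
print(u.query)  # redirect=/home?tab=2
```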


> There are two distinct classes available in the Android SDK. There is the good `java.net.URI` and the bad `android.net.Uri`. Do _not_ use the one from the android package, as this implementation performs "little to no validation".



Jeffrey Goldberg
Chief Defender Against the Dark Arts @ AgileBits

[1] A note on “I told you so”

One thing I learned during the long process of “telling so” is that many extremely smart developers either never took a Formal Language Theory course or forgot its contents within weeks of the final exam. So what Sergey was able to explain to me in a few minutes (my background is in Linguistics) is something that I have struggled to explain to my colleagues. I was asked to construct a malicious candidate URL that the regex parsing would mishandle. I tried to explain that while I may not be able to do so, that doesn’t mean that others can’t; and that if we use parsers built from the language specification, then we preclude a whole category of attacks, whether or not I can construct an instance of that category. People nodded their heads and worked on things that were more immediate priorities.
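For what it’s worth, here is one classic member of that category (an illustrative confusion case, not the original exploit string): everything before an “@” in the authority is userinfo, so a naive host-extracting regex and a spec-conforming parser disagree about which site the URL points to.

```python
import re
from urllib.parse import urlparse

# Per RFC 3986, "www.facebook.com" here is userinfo; the host is evil.example.
url = "https://www.facebook.com@evil.example/login"

# A hand-rolled regex that grabs "the host" right after the scheme:
m = re.match(r"https?://([^/]*)", url)
print(m.group(1))              # www.facebook.com@evil.example

# What a spec-conforming parser (and the browser) treats as the host:
print(urlparse(url).hostname)  # evil.example
```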

I’m pleased to say that while not everyone is a convert to my religion of langsec, the notion of precluding whole categories of yet-to-be-discovered attacks through certain design principles has been gaining ground. Even if not all of our input validators are based on the formal specs of expected input, we have single-purpose, strict(-ish), isolated validators for everything coming into our servers.

One thing that I’ve learned during this process is that I can’t simply tell developers “don’t do it that way, and here is some math that should guide you on how to do it”. I have to give them usable tools for doing it right. It would be really nice if there were some simple examples of using Nail. (No, the DNS example is not simple for people who don’t already know what is going on.)

> On Nov 27, 2017, at 5:01 PM, Frithjof Schulze <fs at ciphron.de> wrote:
> Hi all,
> is anybody aware of some recent CVEs that are the direct result of attempting to parse a non-regular grammar with regular expressions? I expected to find something like this on cve.mitre.org/find, but didn’t. I expected at least a case where regexes were used to do „input sanitization“, but found nothing good.
> Why am I looking for such a CVE? When talking about LangSec-ideas with (mostly web) developers I regularly have the problem that I either have to explain a lot of theory (that few people are really interested in) or have to go „thou shall not ….“ to argue against „but this is easy and works in practice!“.
> The best solution for me so far is similar to the approach suggested in “The Seven Turrets of Babel”: show people examples of the bugs they are up against if they use certain antipatterns. I am now compiling a list of educational and „realistic“ bugs, in the sense that the more popular bugs, like string terminators in X.509/ASN.1, Heartbleed, and the Android Master Key, are great examples for LangSec in general but are not the kind of bugs many developers actually have to deal with.
> Most people I am talking to actually know that they „shouldn’t“ use regex to do certain things, because of the Lovecraftian post on Stack Overflow[1], but that post also just repeatedly mentions the impossibility of a suggested solution without giving any examples of negative consequences of trying.
> [1] https://stackoverflow.com/a/1732454
> Cheers,
> Frithjof
> _______________________________________________
> langsec-discuss mailing list
> langsec-discuss at mail.langsec.org
> https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
