[langsec-discuss] Need advice on practical tools
geo.couprie at gmail.com
Wed Dec 16 09:16:32 UTC 2015
On Tue, Dec 15, 2015 at 9:03 PM, Jeffrey Goldberg <jeffrey at goldmark.org> wrote:
> On 2015-12-15, at 11:21 AM, Geoffroy Couprie <geo.couprie at gmail.com> wrote:
>> I tend to favor the parser combinator approach,
> I’m not familiar with “parser combinator”, but a quick look does make
> it seem much more like what I am actually looking for.
>> like Hammer, or my own
>> Rust library, nom ( https://github.com/Geal/nom ),
>> Usually, when you talk about lex and yacc, developers will
>> either think these tools are too complex, because they never used
>> them, or they will remember with dread their compilers 101 course.
> Yep. I think that whatever we end up using, I will probably be the one
> to write the grammar, so we might be able to get around that. (I’m
> not really a developer, but I grew up writing grammars.)
>> It's a shame, but as an argument against handwritten parsers, parser
>> combinators subjectively work better than lex and others.
> I will need to learn more about parser combinators.
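For anyone who has not met the idea: a parser combinator is an ordinary
function that takes small parsers and returns a bigger one. The sketch
below is hand-rolled Rust using only the standard library. It does not
use nom's or Hammer's actual API, and the names take_while1, literal,
pair and simple_addr are made up for illustration. The toy grammar is
far narrower than a real email address grammar.

```rust
// A parser takes the input and returns (remaining input, parsed value),
// or None on failure.
type PResult<'a, T> = Option<(&'a str, T)>;

// Primitive parser: consume one or more characters satisfying `pred`.
fn take_while1<'a>(pred: fn(char) -> bool) -> impl Fn(&'a str) -> PResult<'a, &'a str> {
    move |input: &'a str| {
        let end = input
            .char_indices()
            .find(|&(_, c)| !pred(c))
            .map(|(i, _)| i)
            .unwrap_or(input.len());
        if end == 0 {
            None
        } else {
            Some((&input[end..], &input[..end]))
        }
    }
}

// Primitive parser: consume exactly one expected character.
fn literal<'a>(expected: char) -> impl Fn(&'a str) -> PResult<'a, char> {
    move |input: &'a str| {
        let mut chars = input.chars();
        match chars.next() {
            Some(c) if c == expected => Some((chars.as_str(), c)),
            _ => None,
        }
    }
}

// Combinator: run `first`, then `second` on what is left, keep both results.
fn pair<'a, A, B>(
    first: impl Fn(&'a str) -> PResult<'a, A>,
    second: impl Fn(&'a str) -> PResult<'a, B>,
) -> impl Fn(&'a str) -> PResult<'a, (A, B)> {
    move |input: &'a str| {
        let (rest, a) = first(input)?;
        let (rest, b) = second(rest)?;
        Some((rest, (a, b)))
    }
}

fn is_atom_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '.' || c == '-' || c == '_'
}

// Toy grammar (an assumption of this sketch): atom '@' atom, nothing left over.
fn simple_addr(input: &str) -> Option<(&str, &str)> {
    let parser = pair(
        take_while1(is_atom_char),
        pair(literal('@'), take_while1(is_atom_char)),
    );
    match parser(input) {
        Some(("", (local, (_, domain)))) => Some((local, domain)),
        _ => None,
    }
}

fn main() {
    println!("{:?}", simple_addr("user@example.org"));
    println!("{:?}", simple_addr("no-at-sign"));
}
```

The point of the style is that the grammar rule in simple_addr reads
almost like the grammar itself, while each piece stays a plain function
that can be tested on its own.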
>> On the specific subject of email addresses, most attempts to validate
>> failed, since there are a lot of use cases people ignore. This is the
>> process I usually follow:
>> - parse as UTF-8 characters
> You do realize that the domain part needs to follow domain name specs, so
> UTF8 is overkill? Furthermore, there appear to be no standards on UTF even
> for the local part.
RFCs 6530 and 6531 indicate that the local part can contain almost any
character, multibyte characters included. The domain part must be
normalized before sending, but that's where it gets interesting: users
won't enter the address in normalized form but in UTF-8, and apps will
tend to store (and validate) the complete address in UTF-8, leaving the
normalization to the mail or DNS library just before sending. So if
someone abuses Unicode control characters, the boundary between the
local and domain parts could differ, depending on which code validates
the address.
> Because email addresses are IDs that users will see of one another for
> who to share data with, we’ve decided to not allow Unicode even in the
> local part, so as to avoid problems with homoglyphs. But for other uses,
> allowing full unicode local parts makes sense, but then you still need
> to restrict it to alpha numerics somehow.
This will solve a lot of issues.
>> - find the @, as Nils said
> Assuming that there is only one and that it isn’t escaped or quoted.
>> - look for common typo errors in the domain part, like gnail.com, and
>> warn the user (I assume you want to warn the user before sending the
>> email, since you require Go or JS)
> This is a different problem. So while that is a good thing, that is separate
> from the question of whether the thing is a grammatical email address.
>> - send the email
>> Whatever the case, the last step is the only one that really works for
>> validation, since an email you deem valid might not work for the
>> remote server.
> Of course. I should have made it clear that I mean valid in a syntactic
> sense only. Checking whether the address is actually deliverable is a
> separate problem.
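The syntactic part of the steps quoted above (decode as UTF-8, find the
@, flag likely domain typos) could look roughly like the following. This
is a sketch in plain Rust, not a full address grammar: it rejects quoted
or escaped local parts instead of parsing them, the names precheck,
Precheck and KNOWN_TYPOS are hypothetical, and the typo table holds only
the one example mentioned in this thread. The final step, sending the
mail, is not something this code can replace.

```rust
// Hypothetical typo table; only the entry mentioned in the thread.
const KNOWN_TYPOS: &[(&str, &str)] = &[("gnail.com", "gmail.com")];

#[derive(Debug, PartialEq)]
enum Precheck<'a> {
    // Passed the checks; still unverified until a mail is delivered.
    Plausible { local: &'a str, domain: &'a str },
    // Passed, but the domain looks like a common typo.
    Suspicious { local: &'a str, domain: &'a str, suggestion: &'static str },
    Rejected(&'static str),
}

fn precheck<'a>(raw: &'a [u8]) -> Precheck<'a> {
    // Step 1: the input must be well-formed UTF-8.
    let text = match std::str::from_utf8(raw) {
        Ok(t) => t,
        Err(_) => return Precheck::Rejected("not valid UTF-8"),
    };
    // Control characters (Unicode category Cc) could make two validators
    // disagree about where the local part ends, so refuse them outright.
    if text.chars().any(|c| c.is_control()) {
        return Precheck::Rejected("contains control characters");
    }
    // Quoted or escaped local parts may contain '@'; this sketch does not
    // parse them and rejects them rather than guess.
    if text.contains('"') || text.contains('\\') {
        return Precheck::Rejected("quoted or escaped parts not supported");
    }
    // Step 2: exactly one '@', with something on both sides.
    let mut parts = text.split('@');
    let (local, domain) = match (parts.next(), parts.next(), parts.next()) {
        (Some(l), Some(d), None) if !l.is_empty() && !d.is_empty() => (l, d),
        _ => return Precheck::Rejected("expected exactly one '@'"),
    };
    // Step 3: warn about likely typos in the domain (case-insensitive).
    let lowered = domain.to_ascii_lowercase();
    for &(typo, suggestion) in KNOWN_TYPOS {
        if lowered == typo {
            return Precheck::Suspicious { local, domain, suggestion };
        }
    }
    Precheck::Plausible { local, domain }
}

fn main() {
    for input in &["user@example.org", "user@gnail.com", "a@b@c"] {
        println!("{:?}", precheck(input.as_bytes()));
    }
}
```

Returning a "suspicious" result rather than a rejection keeps the typo
warning separate from the question of whether the address is grammatical.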
>> If you need to completely validate an email (or anything else), I can
>> still recommend nom, since Rust is easy to call from Go (cf
>> https://github.com/medimatrix/rust-plus-golang ) or any other
>> language. Would that work for you?
> Possibly. But I expect that as none of our development is in Rust this is
> less likely to be the solution to “write the grammar once, parse anywhere”
> sort of thing that I’m after.
Well, Rust can do whatever C does, so you could write the grammar in
Rust and give everybody else a library with a C header; this usually
works well. For JS on the client side, you would need another
solution, of course.
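As a rough illustration of the "library with a C header" idea, a Rust
function can be exported with the C calling convention and then declared
in a header for Go (via cgo) or any other caller. The function name
email_is_plausible, its return convention and the check inside it are
placeholders made up for this sketch; a real library would call the
actual grammar (for instance one written with nom) at that point.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int};

// Matching declaration for the C header (hypothetical name):
//     int email_is_plausible(const char *addr);
// Returns 1 if the address passes the check, 0 if not, and -1 on a null
// pointer or input that is not valid UTF-8.
// Note: the 2024 edition spells this attribute #[unsafe(no_mangle)].
#[no_mangle]
pub extern "C" fn email_is_plausible(addr: *const c_char) -> c_int {
    if addr.is_null() {
        return -1;
    }
    // Safety: the caller must pass a valid NUL-terminated string.
    let c_str = unsafe { CStr::from_ptr(addr) };
    match c_str.to_str() {
        Ok(text) => placeholder_check(text) as c_int,
        Err(_) => -1,
    }
}

// Placeholder for the real grammar: exactly one '@' with non-empty sides.
fn placeholder_check(text: &str) -> bool {
    let mut parts = text.split('@');
    match (parts.next(), parts.next(), parts.next()) {
        (Some(l), Some(d), None) => !l.is_empty() && !d.is_empty(),
        _ => false,
    }
}

fn main() {
    // Calling the exported function from Rust itself, as a C caller would.
    let addr = std::ffi::CString::new("user@example.org").unwrap();
    println!("{}", email_is_plausible(addr.as_ptr()));
}
```

Built as a staticlib or cdylib crate, the same function is what the Go
side would link against; only the header and the build setup differ per
language.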
If you would like to attempt that solution, I can help a bit with the
setup. I would enjoy an email parsing library written in nom.