[langsec-discuss] The "grammar" approach...

Derick Winkworth ccie15672 at gmail.com
Wed Feb 24 15:56:37 UTC 2016


All:

I actually didn't receive Sven's reply, and it's not in my spam folder.
:-(  So I'll elaborate a little more.

I've been working in the ML field for about 9 months (so still fresh),
specifically doing feature extraction/engineering/data representation.
We are trying to apply data science/ML tools and techniques to
communications network infrastructure telemetry.

We are pursuing several angles in this regard, and one of the
possibilities is leveraging existing NLP work by building a sort of
grammar for the raw telemetry coming in.  I'll give an example.

In an IP network (like the internet), nodes are able to export flow
telemetry.  Usually this is sub-sampled, as it's CPU and bandwidth
intensive to generate.  However, on firewalls and IPSs, where individual
flows are tracked, you get a much more complete picture of what is going
on.  A "flow" record usually contains a source address, destination
address, source port, destination port, and some other fields.

Suppose we have three hosts, "A", "B", and "C".  Host A wants to talk to
Host C, but must first resolve a DNS name to the address of C by sending a
DNS request to Host B, which is a DNS server.  When Host B responds, Host A
starts communicating with Host C.  The records would look like this:

t0 - src:A, srcPort: 11435, dst:B, dstPort: 53
t1 - src:B, srcPort: 53,    dst:A, dstPort: 11435
t2 - src:A, srcPort: 22987, dst:C, dstPort: 443
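
As a rough illustration, records in the textual form shown above could be
parsed into structured objects before any further processing.  This is only
a sketch: the FlowRecord class, the parse_flow helper, and the regular
expression are hypothetical names invented for this example, and real flow
exports (NetFlow, IPFIX, firewall logs) carry more fields in other encodings.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One flow record, limited to the fields used in the example above."""
    t: int          # relative timestamp index (t0, t1, ...)
    src: str
    src_port: int
    dst: str
    dst_port: int


# Hypothetical pattern for the illustrative text format in this email only.
_FLOW_RE = re.compile(
    r"t(?P<t>\d+)\s*-\s*src:\s*(?P<src>\w+),\s*srcPort:\s*(?P<sport>\d+),"
    r"\s*dst:\s*(?P<dst>\w+),\s*dstPort:\s*(?P<dport>\d+)"
)


def parse_flow(line: str) -> FlowRecord:
    """Parse one example line; raise ValueError if it does not match."""
    m = _FLOW_RE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not a flow record: {line!r}")
    return FlowRecord(int(m["t"]), m["src"], int(m["sport"]),
                      m["dst"], int(m["dport"]))


records = [parse_flow(line) for line in (
    "t0 - src:A, srcPort: 11435, dst:B, dstPort: 53",
    "t1 - src:B, srcPort: 53,    dst:A, dstPort: 11435",
    "t2 - src:A, srcPort: 22987, dst:C, dstPort: 443",
)]
```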


Unfortunately, this telemetry data does not tell us which name Host A was
resolving, and in the real world Host C is hosted somewhere out on the
internet, so it wouldn't be possible to reverse-resolve the IP back to the
name or URL.

However, Host B keeps DNS logs.  Assuming that all devices are running NTP
and have their system times synced, we might be able to correlate a log
message from Host B with these flow records.  We might also gather logs
from a third source, a web proxy, that tell us the exact URI that Host A
sent to Host C.  This is where a constructed grammar would be useful...

noun   verb   noun                    noun         noun
-------------------------------------------------------------------------
"A"  - sent - DNS request      - to - "B" - with - "drink.more.beer.com"
"B"  - sent - DNS response     - to - "A" - with - "C"
"A"  - sent - HTTP/SSL request - to - "C" - with - "http://drink.more.beer.com/sendFreeBeerNow.html"
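
A minimal sketch of how a flow record might be matched to a DNS log entry
by timestamp and then rendered as one of the "sentences" above.  Everything
here is an assumption made for illustration: the Flow and DnsLog classes,
the one-second max_skew tolerance, the numeric timestamps, and the extra
host "D" do not come from any real log format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Flow:
    ts: float        # seconds, assumed NTP-synchronized across devices
    src: str
    src_port: int
    dst: str
    dst_port: int


@dataclass
class DnsLog:
    ts: float
    client: str      # host that sent the query
    qname: str       # name that was asked for
    answer: str      # address (here: host label) returned


def match_dns(flow: Flow, logs: List[DnsLog],
              max_skew: float = 1.0) -> Optional[DnsLog]:
    """Return the log entry from the same client closest in time to the
    flow, or None if nothing falls within max_skew seconds."""
    candidates = [e for e in logs
                  if e.client == flow.src and abs(e.ts - flow.ts) <= max_skew]
    return min(candidates, key=lambda e: abs(e.ts - flow.ts), default=None)


def sentence(src: str, kind: str, dst: str, payload: str) -> str:
    """Render one event in the artificial grammar used in the table."""
    return f'"{src}" - sent - {kind} - to - "{dst}" - with - "{payload}"'


flow = Flow(ts=100.0, src="A", src_port=11435, dst="B", dst_port=53)
logs = [
    DnsLog(ts=100.2, client="A", qname="drink.more.beer.com", answer="C"),
    DnsLog(ts=250.0, client="A", qname="example.org", answer="D"),
]

hit = match_dns(flow, logs)
if hit is not None:
    request = sentence(flow.src, "DNS request", flow.dst, hit.qname)
    response = sentence(flow.dst, "DNS response", flow.src, hit.answer)
    print(request)
    print(response)
```

Matching on the nearest timestamp within a tolerance is the simplest
possible choice; on a busy resolver it would likely need extra keys such as
the source port or DNS query ID to avoid pairing the wrong entries.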




If you're looking at this and thinking "that's terrible," then you now
understand why I sent the original email to the group.  The thing is,
getting an understanding of what has actually transpired in IT
infrastructure (hardware and software) requires correlating logs,
information from protocols like SNMP, information received from message
buses like RabbitMQ, information retrieved from databases and APIs, etc.
And it's all structured differently, and even atoms of data are
represented differently from one source to the next.  If we take this raw
information in the preprocessing phase and construct a dialogue such as
the one above using some artificial grammar, we might be able to leverage
existing NLP tools like word2vec and other algorithms to identify specific
events.  Or we might be able to train them on what is "normal" so that
they can identify when some piece of "dialogue" built with this grammar
does not make sense or is otherwise irregular.
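
To make the "train on normal, flag the irregular" idea concrete, here is a
deliberately tiny sketch.  It is not word2vec: it swaps in a plain bigram
count model from the standard library as a stand-in, and the sentences,
the underscore-joined tokens, and the novelty score are all assumptions
made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def bigrams(sentence: str) -> List[Tuple[str, str]]:
    """Split on whitespace and pair adjacent tokens, with boundary markers."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return list(zip(toks, toks[1:]))


def train(normal_sentences: List[str]) -> Counter:
    """Count every adjacent token pair seen in 'normal' dialogue."""
    model: Counter = Counter()
    for s in normal_sentences:
        model.update(bigrams(s))
    return model


def novelty(model: Counter, sentence: str) -> float:
    """Fraction of the sentence's bigrams never seen in training
    (0.0 means fully familiar, 1.0 means entirely new)."""
    pairs = bigrams(sentence)
    unseen = sum(1 for p in pairs if p not in model)
    return unseen / len(pairs)


normal = [
    "A sent DNS_request to B",
    "B sent DNS_response to A",
    "A sent HTTP_request to C",
]
model = train(normal)

familiar = novelty(model, "A sent DNS_request to B")
odd = novelty(model, "C sent DNS_response to A")  # C is not a DNS server
print(familiar, odd)
```

A sentence whose score exceeds some threshold would be a candidate for
"does not make sense"; an embedding-based model would presumably generalize
better than exact bigram lookup, at the cost of needing far more data.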

Derick

On Sun, Feb 21, 2016 at 3:58 PM, Jeffrey Goldberg <jeffrey at goldmark.org>
wrote:

>
> > On 2016-02-17, at 10:50 AM, Sven M. Hallberg <pesco at khjk.org> wrote:
> >
> > Derick Winkworth <ccie15672 at gmail.com> writes:
> >> In this case, they went through the process of defining a grammar for an
> >> existing protocol.  This process might actually have another application
> >> in the realm of machine learning and language processing.
>
> > Interesting, could you elaborate? I would believe natural language
> > processing includes a lot of grammar construction; Meredith should be
> > able to tell.
>
> For some approaches to natural language processing (and much else in
> Linguistics) trying to work out an explicit grammar from just having
> instances of what is (and if you are lucky, what isn’t) in a language is
> the fun part.
>
> Now when linguists do this, they typically have the ability to check
> whether something is or isn’t in the language. For example, I can check
> whether (1) is English by consulting my intuitions.
>
> (1) *language the in isn’t or is something whether check to
>
> But when you don’t have the opportunity to interrogate a parser that
> knows the language, you are stuck with working from (mostly) positive
> examples of what is in the language. Note that to a substantial extent
> children learning their native language are confronted with the same
> problem.
>
> So when presented with a bunch of grammatical sentences made up from a
> set, w, of words in a language, the simplest grammar would be
>
>   w*
>
> This, of course, is not what people should do. We have expectations of
> what a natural language grammar should look like and of the kinds of
> things that the target language actually does.
>
> Anyway, there is a whole bunch of research on what sorts of assumptions
> about the nature of the grammar need to be in place to be able to “learn”
> the grammar from positive instances of it.
>
> One thing to keep in mind is that natural language grammars allow for
> ambiguity.
>
> (2) She saw the boy with the binoculars.
>
> There are clearly two distinct parse trees available for this.
>
>    She [saw [the boy] [with [the binoculars]]]
>
>    She [saw [the boy with the binoculars]]
>
> And see how many you can get for
>
> (3) We saw her duck
>
>
> Anyway, trying to figure out what a grammar is from instances of the
> language and various expectations about what the language is supposed to
> do is loads of fun. But the languages that Linguists look at are
> enormously more complex than the tiny languages of these protocols, so I
> don’t really think that many of the specific techniques are useful.
>
> Now, as no two Linguists ever agree on things, I wait for Meredith to
> explain what I am wrong about.
>
> Cheers,
>
> -j