[langsec-discuss] unambiguous encapsulation
mike at ossmann.com
Tue Mar 5 01:21:26 UTC 2013
Sorry to be that guy who lurks on a mailing list and then suddenly drops
a book, but I think this is the best place for me to solicit feedback on
Inspired by LANGSEC and Packet in Packet, I have been thinking about
methods of encapsulating data within other data such that the
encapsulated data cannot be mistaken for other data. My primary goal is
to find new classes of error correcting codes that support unambiguous
Most methods of channel coding encapsulate data ambiguously. For
example, a packet's payload in a wireless communication protocol is
encoded and modulated in the same manner as the packet's header. This
ambiguity enables a class of attacks exemplified by Packet in Packet.
Similarly, data at rest are typically encapsulated in a manner that can
be confused with surrounding data or executable code. This ambiguity
enables many types of attack in which user supplied data are interpreted
as executable code.
I hope to identify new ways to encapsulate data in other data using
encodings that unambiguously differentiate inner (encapsulated) data
from outer data.
a simple example: delimited base64 files
The csv file format is simpler and safer to parse if each data element
in the file is encoded with characters that are never used as (outer)
A traditional csv file might look as follows:
A comma is used as a field delimiter, and a newline is used as a record
delimiter. Quotation marks are optionally used as text delimiters and
in this case serve to prevent the inner comma from being interpreted as
an outer comma.
If each field is base64 encoded, the same file looks like this:
Because the comma and newline do not appear in base64, the encapsulation
of inner data elements is unambiguous. Outer data consist only of
commas and newlines. Inner data are represented only by characters that
appear in base64 encoding. This encapsulation avoids the problems
associated with escaping and quotation. Any commas or newlines present
within a field are unambiguously differentiated from outer commas and
newlines because they are base64 encoded.
(For a practical standard, I suggest choosing delimiters other than
commas and newlines. Because the base64 encoded data are not human
readable, there is no reason to invite the application of historical csv
parsing variations such as newline encoding. I would like to develop a
clear specification for such a file format and am interested in any
thoughts you may have regarding delimiter selection and whether or not
historical variations in either csv or base64 should be tolerated.)
Apart from implementing base64, it wouldn't take much code to parse
delimited base64 files. I notice that base64.py on my system is shorter
than csv.py despite including base32 and base16 implementations, more
lines of comments, main(), and test(). It also has fewer deeply
indented lines. There are two occurrences of the word "guess" in
function names in csv.py and none in base64.py. I strongly suspect that
a more thorough investigation of the relative complexity would agree.
Delimited base64 files can encode any 8-bit data while csv files
typically may include only plain text. They are nestable because the
inner encoding can encode all characters found in both the (outer)
delimiters and the (inner) encoded data. Nestability is an interesting
property of this scheme that is not found in every scheme for
Probably the best example of a similar solution is Dan's Interpolique.
However, the outer character set (the delimiters) for a delimited base64
file format may be chosen so that it does not overlap at all with the
character set of the inner encoding. This makes it easier to prove that
malicious inner data can't be interpreted as outer data.
the main show: error correcting codes
I think that a specification for a base64 delimited file format would be
helpful for introducing people to the concept of unambiguous
encapsulation, but my main area of interest is in ways that error
correcting codes can be used to unambiguously encapsulate data within
When data are stored or transmitted over an analog medium, they are
typically encoded with an error control code providing some degree of
error detection or correction. These codes use codewords of several
bits to represent a smaller number of data bits. The additional bits of
transmitted information may be used by the receiver of the data to
detect or correct bit errors introduced by the medium.
It may be beneficial to select codes that include distinct codewords to
represent encapsulated data.
Consider two codes with 3bit codewords:
The first code encodes an encapsulation flag and one data bit into a
3bit codeword. The first bit in a codeword is an encapsulation flag.
The second bit is the data bit. The third bit is a parity bit (even
codeword encapsulation bit data bit parity bit
000 0 (outer) 0 0
011 0 (outer) 1 1
101 1 (inner) 0 1
110 1 (inner) 1 0
Using this code, the data bits:
outer 0, outer 1, outer 1, inner 1, inner 1, inner 0, outer 0
would be encoded as the binary sequence:
This code provides 1 bit error detection. Any one flipped bit results
in an invalid code word at the receiver.
The second code also encodes an encapsulation flag and one data bit into
a 3bit codeword. The first bit in a codeword is an encapsulation flag.
The second bit is the data bit. The third bit provides isolation:
codeword encapsulation bit data bit isolation bit
000 0 (outer) 0 0
010 0 (outer) 1 0
101 1 (inner) 0 1
111 1 (inner) 1 1
This code does not provide error detection. A single flipped bit can
result in a valid codeword at the receiver. However, it provides 1 bit
isolation. Any one flipped bit is guaranteed to not break
encapsulation. In this simple case, the isolation bit is a repetition
of the encapsulation bit.
Comparing these two codes, it is best to choose the code with error
detection. By detecting any one bit error, the receiver is just as able
to prevent encapsulation breakage as it would be with the second code.
In other words, detection provides isolation, but isolation does not
I am embarking on a project to identify encoding schemes that are useful
for unambiguous encapsulation. In particular, I expect to find longer
codes (with codewords of length greater than five bits) that have
interesting encapsulation properties.
The first class of codes that my project will seek consists of linear
block codes that provide more bits of isolation than detection (without
squandering an opportunity to provide more detection as above). I also
hope to identify additional classes of error correcting codes with
encapsulation properties. For each class of codes, I will either prove
by construction that they exist, or I will prove by exhaustive search
that they do not exist up to a certain codeword length.
As far as I know, the property of isolation (guaranteeing encapsulation
in the presence of a certain number of bit errors) is a novel way to
judge the effectiveness of error control codes. Codes are selected
today primarily by the probability of uncorrectable or undetectable bit
errors and the efficiency of the code (the number of data bits vs.
transmitted symbols). I propose that future selection of error control
codes should also consider the probability of bit errors breaking
Error control codes are traditionally generated by analytic methods, but
they can also be found by brute force search. By implementing brute
force methods, my project will produce error control codes with certain
properties even if it does not succeed in producing analytic methods for
the generation of codes.
As a preliminary test, I wrote a Python program to enumerate all
possible codes of a given codeword length that have specified
properties. For example, it produced the following list of all binary
linear block codes including 00000 with fivebit codewords, two data
bits (meaning there are four codewords in the code), and a minimum
Hamming distance of three (meaning that each codeword differs from each
other codeword by at least three bits):
00000, 11001, 10110, 01111
00000, 11010, 10111, 01101
00000, 11011, 10101, 01110
00000, 11011, 10110, 01101
00000, 11100, 10011, 01111
00000, 11100, 10111, 01011
00000, 11100, 11011, 00111
00000, 11101, 10011, 01110
00000, 11101, 10110, 01011
00000, 11101, 11010, 00111
00000, 11110, 10011, 01101
00000, 11110, 10101, 01011
00000, 11110, 11001, 00111
Each of the above lines would be called a (5,2,3) linear block code in
the traditional notation of coding theory.
One way to think of a code that has the property of isolation is as a
pair of complementary codes.
Consider the following pair of codes found using the Python program:
00000, 00011, 00101, 01001
01110, 10110, 11010, 11100
Each of these is a (5,2,2) code (fivebit codewords, two data bits,
minimum Hamming distance of two), but the codes are complementary in
that the minimum Hamming distance from one code to the other code is
three. One of the pair could be used to encode inner (encapsulated)
data and the other to encode outer data.
A single bit error can be detected by the receiver of a message so
encoded. For example, if 00001 is received, the receiver would find
that the received codeword is not valid (in either code). A pair of bit
errors could result in an invalid codeword (e.g. a transmitted 00000
could be received as 11000) but could also result in a valid codeword
from the same code (e.g. a transmitted 00000 could be received as
00011). However, it is impossible for a pair of bit errors to result in
a codeword that belongs to the other code. A minimum of three bit
errors would be required for the receiver to interpret a codeword as
belonging to the wrong code, thus breaking encapsulation. In most
systems, the likelihood of three out of five bits being flipped is
considerably less than the likelihood of one or two flipped bits.
Because my preliminary program has already identified such complementary
codes with codewords up to five bits long, it is virtually certain that
longer codes can be found by a longer running search performed by more
efficient software. Longer codes have greater error correction and
detection capabilities, and I expect them to have more interesting
isolation capabilities as well.
Unambiguous encapsulation is a useful design pattern that can (should?)
be applied in any case where data are encapsulated.
I hope that my project will produce codes that are useful in practical
applications. At the very least, I hope to promote discussion of
unambiguous encapsulation in general.
If you have any thoughts about my project, I would love to hear from
More information about the langsec-discuss