[langsec-discuss] The String Type is Broken

Allan Wegan allanwegan at allanwegan.de
Fri Nov 29 07:28:22 UTC 2013


> To me, the crucial difference is that strings need *parsing*
> -- however simple, e.g., special handling for escape characters and
> separators -- to become interpretable text.
> And this parsing is what's broken.

Parsing is of course needed to detect grapheme cluster boundaries at
least once for each given string of code points, when converting it to a
string of grapheme clusters (text).
But whether escape characters need handling depends on the source
format. Length-prefixed formats need no escaping, whether in transit or
at rest, and can therefore skip the parsing needed for (un)escaping.
In a lot of cases the resulting string of grapheme clusters is itself
escaped, or is expected to carry a message to be understood by the
software processing it. Then of course further rounds of parsing may be
needed for unescaping and transformation to a symbol list or tree...
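
To illustrate the point about length-prefixed formats, here is a minimal
Python sketch. The 4-byte big-endian length prefix and the names
pack_fields and unpack_fields are assumptions made for this example, not
any particular wire format. The decoder only reads lengths and never
inspects payload bytes, so separators, quotes, and backslashes pass
through without escaping.

```python
import struct

def pack_fields(fields):
    """Frame each field as a 4-byte big-endian length plus the raw bytes."""
    out = bytearray()
    for field in fields:
        out += struct.pack(">I", len(field))
        out += field
    return bytes(out)

def unpack_fields(data):
    """Recover the fields by reading lengths; payloads are never scanned."""
    fields = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if pos + length > len(data):
            raise ValueError("truncated field")
        fields.append(data[pos:pos + length])
        pos += length
    return fields

# Payloads containing separators, quotes, and backslashes need no escaping.
fields = [s.encode("utf-8") for s in ['a,b', 'say "hi"\\n', "e\u0301"]]
roundtrip = unpack_fields(pack_fields(fields))
```

The price is that the sender has to know each length up front, which is
the usual trade-off against delimiter-based formats.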

The lack of a grapheme-cluster-based string type is not necessarily
harmful for (un)escaping or for parsing programming languages. The
evil lies in the subtlety of the failure modes.
There may be security implications in some places when it comes to
sanitizing (or to annotating irregularities so that a human can
actually see them, as there are a lot of "invisible" or
human-"indistinguishable" code points in the Unicode standard).
There may be security implications when slicing text, where the meaning
of a glyph, word, or sentence changes because of a split inside it.
In most cases the failures just annoy users, who do not find what they
searched for or who look at strangely crippled, truncated text.
Most often it is just a "GUI issue" and not security relevant. I guess
that is why most software designers do not really care.

And it is one of those problems where one has to choose an imperfect
solution, because human language processing is just not there yet.
I surely would choose the implementation for truncating article teaser
texts that detects the exact sweet spot where most readers have read
just enough to be teased, if one were readily available.

But instead I have to choose how much processing is enough.
The traditional answer seems to be "well, let's just take that string,
truncate after N whatever-is-stored-in-that-thing and live with the
result" for an arbitrarily chosen N. I go for a slightly better approach,
falling back from primitive sentence and word detection to grapheme
clusters as the smallest unit of text. I still use that arbitrarily
chosen N as the maximum length. I would go for full text recognition
instead, but cannot afford it.
A lot of software designers cannot even afford to use grapheme clusters
as the atomic unit, because the programming language of choice
(regardless of whose choice that was) lacks support for them.
So the real solution is to make it easy to do it right (grapheme cluster
level) or at least better (word and sentence level).
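
The fallback chain described above could look roughly like the following
Python sketch. It is not the implementation referred to in this mail.
Python's standard library has no grapheme cluster segmentation, so
rough_clusters is a deliberately crude stand-in that only glues combining
marks (general category M*) to their base character. It does not
implement the Unicode boundary rules (no Hangul, emoji ZWJ sequences, or
regional indicators). The sentence and word checks are equally primitive,
and all names are made up for this example.

```python
import unicodedata

def rough_clusters(text):
    """Approximate grapheme clusters: glue combining marks to their base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def truncate(text, max_clusters):
    """Cut at a sentence end, else a word gap, else a cluster boundary."""
    clusters = rough_clusters(text)
    if len(clusters) <= max_clusters:
        return text
    if max_clusters <= 0:
        return ""
    # 1. Primitive sentence detection: terminator followed by whitespace.
    for i in range(max_clusters, 0, -1):
        if clusters[i - 1] in (".", "!", "?") and clusters[i].isspace():
            return "".join(clusters[:i])
    # 2. Primitive word detection: cut just before a whitespace cluster.
    for i in range(max_clusters, 0, -1):
        if clusters[i].isspace():
            return "".join(clusters[:i]).rstrip()
    # 3. Last resort: keep whole clusters, never half of one.
    return "".join(clusters[:max_clusters])

teaser = truncate("Short one. Then a much longer second sentence.", 20)
accented = truncate("cafe\u0301s", 4)   # keeps the accent with its 'e'
```

With a real segmenter available, only rough_clusters would need to be
swapped out; the fallback order stays the same.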

>    A classification of such kinds of "string to text" parsing might
> help properly frame and resolve this issue.

Hmm, the sort of parsing I mean is only the set of boundary
detections defined in the Unicode standard.
(Un)escaping is another huge problem domain containing a lot of
opportunities for failure. But I think it is a well understood field,
and the big languages already have the tools for it almost right.
(Un)escaping for processing or transportation is perfectly possible at
the code point or even byte level. It therefore does not get better by
introducing a type that deals with text at the grapheme cluster or a
more abstract level.
Human language processing is of course needed for better word and
sentence boundary detection, and I guess there is a lot of parsing
going on there. But I do not think that matters for the discussion
about the need for a better base type representing text. And human
language processing is a huge field where one could get lost really
fast. That sort of parsing would have to be pluggable, to augment or
replace the default Unicode standard algorithms.
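
As a small illustration of (un)escaping at the byte level, here is a
Python sketch of a backslash-escaped, comma-separated format operating
directly on UTF-8 bytes. The format and the function names are
assumptions for this example. It works without any knowledge of grapheme
clusters because in UTF-8 the ASCII bytes used as separator and escape
never occur inside a multi-byte sequence.

```python
def escape(raw, sep=b",", esc=b"\\"):
    """Prefix every separator or escape byte with the escape byte."""
    out = bytearray()
    for b in raw:
        byte = bytes([b])
        if byte in (sep, esc):
            out += esc
        out += byte
    return bytes(out)

def split_unescape(data, sep=b",", esc=b"\\"):
    """Split on unescaped separators and drop the escape bytes."""
    fields, current, i = [], bytearray(), 0
    while i < len(data):
        byte = data[i:i + 1]
        if byte == esc and i + 1 < len(data):
            current += data[i + 1:i + 2]   # take the escaped byte literally
            i += 2
        elif byte == sep:
            fields.append(bytes(current))
            current = bytearray()
            i += 1
        else:
            current += byte
            i += 1
    fields.append(bytes(current))
    return fields

values = ["a,b", "cafe\u0301", "back\\slash"]
wire = b",".join(escape(v.encode("utf-8")) for v in values)
decoded = [f.decode("utf-8") for f in split_unescape(wire)]
```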

>    I suppose that the reason may be that the required parsing is
> considered elementary, and elementary almost always means "dealt with
> in an ad-hoc way".

Yes, that has to be the cause. ;)



-- 
Allan Wegan
Jabber: allanwegan at erdor.de
 OTR-Fingerprint: 97ED4E4FA9CEFAFC0EF783F8D010154829529E9E
Jabber: allanwegan at jabber.ccc.de
 OTR-Fingerprint: A1AAA1B9C067F9884A424D339834346929164587
ICQ: 209459114
 OTR-Fingerprint: 71DE5B5E67D6D758A93BF1CE7DA06625205AC6EC
