[scheme-reports-wg2] Re: [Scheme-reports] DISCUSSION/VOTE: The character tower John Cowan 06 May 2014 22:04 UTC

Bear scripsit:

> That ship has sailed, I fear.  When R6 and R7-small were being
> discussed, I advocated a model of characters as Unicode base character
> plus nondefective combining sequence, which would solve this problem
> and unambiguously make this string length 1, regardless of which
> normalization form were used to represent the value in memory.

In short, what Unicode calls "legacy grapheme clusters" nowadays.
But there's a reason for that word "legacy": the notion (which is somewhat
more general than you state) turned out not to be general enough, and
so "extended grapheme clusters" were devised, with the recognition that
grapheme clusters really need to be tailorable to a given locale as well,
though we don't support tailoring of case conversion.

    "I toyed with anarchy once, but on reading into the subject
    found that there were as many kinds of anarchy as there are of
    democracy. There are plain anarchists and syndicalist anarchists
    and deviationist anarchists and, for all I know, syndicalist
    deviationist anarchists. There’s as much anarchy in anarchy
    as in any political philosophy."  --Tully Bascomb

> It would also make 'indexing' and locations of characters in strings
> unambiguous, mostly eliminate length changes on case operations,
> preserve the string-as-sequence-of-characters semantics we had, and
> yield a character semantics cleaner and less ambiguous than Unicode's
> and capable of being mapped cleanly onto other character repertoires
> or representations, all of which I thought of as good things.

Indeed they are, and indeed it would -- almost.
See <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>

> But it was without any widespread implementation support.

Alas, yes, and I tread very carefully to avoid standardizing what does
not exist.

    Papa Hegel he say that all we learn from history is that we learn
    nothing from history. I know people who can't even learn from what
    happened this morning. Hegel must have been taking the long view.
    --Chad C. Mulligan

> That rejection means we now have a model in which a character is a
> Unicode codepoint.  Because there are multiple ways to express given
> strings as sequences of Unicode codepoints, the identity of a string
> is now unglued from the identities of the characters from which it was
> built and we no longer have clean string-as-sequence-of-characters
> semantics.

Well, arguably we didn't have them before because of the mutability
of strings (which I don't like but we are stuck with).  "AB" and "AB"
may or may not be identical in Scheme; the most we can say is that
they are component-wise equal.
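
A minimal sketch of that distinction, assuming an R7RS-small implementation
(the names s1 and s2 are purely illustrative).  It uses freshly allocated
strings rather than literals, because whether two equal literals are the
same object is left unspecified:

```scheme
(import (scheme base) (scheme write))

;; Two freshly allocated strings with the same contents.
(define s1 (string #\A #\B))
(define s2 (string #\A #\B))

(display (eq? s1 s2))      (newline) ; identity: distinct objects
(display (equal? s1 s2))   (newline) ; component-wise equality
(display (string=? s1 s2)) (newline) ; component-wise equality

;; Mutating s1 leaves s2 untouched, so the two were never identical.
(string-set! s1 0 #\X)
(display s1) (newline)
(display s2) (newline)
```

This should print #f, #t, #t, XB, AB.  With the literals "AB" and "AB" in
place of the two (string ...) calls, the eq? result could go either way.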

> (string-length (string #\A #\x301))
>
> [is] therefore now irrevocably dependent on Unicode, and specifically
> on Unicode normalization form.

I don't see where the normalization form comes into this.  As things are,
(string-length (string #\A #\x301)) is 2 and (string-length (string
#\xC1)) is 1, even though the two strings normalize to the same thing
under any one of the normalization forms.  So the current semantics don't
depend on an NF; rather they depend on not automatically applying any
particular NF.
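
As a sketch, assuming an R7RS-small implementation whose characters cover
more than ASCII (the report does not require that), and with the names
decomposed and precomposed chosen only for illustration:

```scheme
(import (scheme base) (scheme write))

;; LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT.
(define decomposed (string #\A #\x301))
;; LATIN CAPITAL LETTER A WITH ACUTE as a single codepoint.
(define precomposed (string #\xC1))

(display (string-length decomposed))        (newline)
(display (string-length precomposed))       (newline)
;; string=? compares character by character; no normalization is applied.
(display (string=? decomposed precomposed)) (newline)
```

This should print 2, 1, #f.  R7RS-small itself provides no normalization
procedure, so nothing in the example can normalize behind the caller's back.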

> So we now have this choice; either the standard dictates what
> normalization form to use in strings, or string-length etc. have
> implementation-defined semantics.

I don't think so, no, unless "don't normalize" counts as a normalization
form, which it normally :-) doesn't.

--
John Cowan          http://www.ccil.org/~cowan        cowan@ccil.org
They do not preach that their God will rouse them
A little before the nuts work loose.
They do not teach that His Pity allows them
to drop their job when they damn-well choose.
                --Rudyard Kipling, "The Sons of Martha"
