Re: [Scheme-reports] DISCUSSION/VOTE: The character tower
Bear 06 May 2014 20:45 UTC
On Tue, 2014-05-06 at 14:45 -0400, John Cowan wrote:
> Bear scripsit:
> > If we presume that the standard should leave a choice of internal
> > normalization form to the implementations,
>
> <chair hat="off">
> Now we get into the tight and the bad and the crazy. Do you really
> think that it should be up to the implementation to choose whether
> (string-length (string #\A #\x0301)) returns 1 or 2? I submit
> that for the (scheme base) versions of these procedures the result
> must be 2. If you want auto-normalizing, you need to create a
> parallel string library that provides it.
> </chair>
>
> (#\x0301 is COMBINING ACUTE ACCENT, which following A makes it Á.)

That ship has sailed, I fear. When R6RS and R7RS-small were being
discussed, I advocated a model of characters as Unicode base
character plus nondefective combining sequence, which would
solve this problem and unambiguously make this string length 1,
regardless of which normalization form were used to represent the
value in memory. It would also make 'indexing' and locations
of characters in strings unambiguous, mostly eliminate length
changes on case operations, preserve the
string-as-sequence-of-characters semantics we had, and yield a
character semantics
cleaner and less ambiguous than Unicode's and capable of being
mapped cleanly onto other character repertoires or representations,
all of which I thought of as good things.

But it had no widespread implementation support.
Further, it would make the char library you referred to in the
last question impossible to implement, because in that model
there are a literally infinite number of possible characters,
and then you have a halting-problem issue when trying to
iterate over them all. So it was rejected.
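
To make the rejected model concrete, here is a rough sketch in Python rather than Scheme, since Python's standard unicodedata module is widely available. The helper name combining_sequences is hypothetical, and treating "nonzero canonical combining class" as the test for a combining mark is a simplifying assumption; proper segmentation is defined by Unicode's grapheme cluster rules and is more involved.

```python
import unicodedata

def combining_sequences(text):
    """Group text into clusters: a base code point plus the combining marks after it.

    Simplifying assumption: a code point is a combining mark if and only if its
    canonical combining class is nonzero. This is an approximation, not the
    full Unicode segmentation algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

decomposed = "A\u0301"   # A followed by COMBINING ACUTE ACCENT
composed = "\u00C1"      # precomposed LATIN CAPITAL LETTER A WITH ACUTE

# Under this model both spellings count as a single "character".
print(len(combining_sequences(decomposed)))  # 1
print(len(combining_sequences(composed)))    # 1
```

Because any number of marks may follow a base, the set of such clusters is unbounded, which is the enumeration problem described above.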

That rejection means we now have a model in which a character is
a Unicode code point. Because there are multiple ways to express
a given string as a sequence of Unicode code points, the identity
of a string is now unglued from the identities of the characters
from which it was built, and we no longer have clean
string-as-sequence-of-characters semantics. Questions such as
    (string-length (string #\A #\x301))
are therefore now irrevocably dependent on Unicode, and
specifically on Unicode normalization form.
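
The dependence on normalization form is easy to see outside Scheme. The following is only an illustration in Python, using its standard unicodedata module. Python's len counts code points, which matches the character-is-a-code-point model, but nothing here says how any particular Scheme implementation behaves.

```python
import unicodedata

s = "A\u0301"  # the same two code points as (string #\A #\x301)

nfc = unicodedata.normalize("NFC", s)  # composes to U+00C1
nfd = unicodedata.normalize("NFD", s)  # stays as A + U+0301

lengths = {"raw": len(s), "NFC": len(nfc), "NFD": len(nfd)}
print(lengths)  # {'raw': 2, 'NFC': 1, 'NFD': 2}
```

An implementation that silently stored strings in NFC would report 1 here, and one that stored NFD or left the input untouched would report 2.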

So we now have this choice: either the standard dictates
what normalization form to use in strings, or string-length
etc. have implementation-defined semantics.
Bear