[scheme-reports-wg2] Re: [Scheme-reports] DISCUSSION/VOTE: The character tower John Cowan 06 May 2014 18:45 UTC

Bear scripsit:

> I was referring to (a) and (b),

Well, (a) is out of the question.  The reasons why noncharacters,
tag characters, and other deprecated/discouraged characters
should be supported by string libraries are given at
<http://www.unicode.org/faq/private_use.html#noncharacters>, especially
the question "So how should libraries and tools handle noncharacters?"
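For readers who want to see what "supporting" noncharacters amounts to, here is a minimal sketch. It is written in Python rather than Scheme purely so that it can be run as-is. The helper name is_noncharacter is hypothetical. The ranges it tests are the ones Unicode designates as noncharacters: U+FDD0 through U+FDEF, plus the last two code points of each plane.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the code points Unicode calls noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# A library that supports noncharacters just stores and round-trips
# them like any other code point.
s = "a\uFDD0b\uFFFE"
round_tripped = s.encode("utf-8").decode("utf-8")

# Count them across the whole code space: 32 + 17 * 2.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
```

Here round_tripped compares equal to s, so nothing is filtered or replaced on the way through UTF-8.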

> plus things like 'tag characters' which are irrevocably mapped in the
> standard but which the current standard says not to use because that
> was a design mistake.

Tag characters were introduced in order to stave off an attempt by the
late Mark Crispin to encode language tagging in plain text by hiding
the information in otherwise-unsupported UTF-8 code unit sequences.
Needless to say, they have never been used for this or (as far as anyone
knows) any other purpose.  However, that doesn't make them much different
from the notorious t-with-cedilla pair, the Afrikaans n with preceding
apostrophe, the ohm sign, and other mistakes.

> I am ambivalent about (c); on the one hand I don't want nonsense points
> to be a possibility in strings, but on the other given that at some
> point people may be handling Unicode characters defined by a standard
> newer than the implementation, there is at least a possibility that
> requiring implementations to handle them is rational.

I think that's pretty much inevitable, especially since Unicode is now
switching to annual republication.  Trying to exclude unassigned but
valid codepoints is very much a moving target.  It's true that the set
of folks for whom this is a headache will likewise dwindle from year to
year, but minority-language users and scholars of archaic languages have
their needs too.  But why should implementers work hard to exclude such
characters anyway?
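The moving-target point can be illustrated with a small sketch, again in Python rather than Scheme. It assumes only the standard unicodedata module. The helper name is_unassigned is hypothetical. Whether a code point counts as assigned depends on which Unicode version the runtime's database was built from, so a filter based on it would change behaviour with every upgrade.

```python
import unicodedata


def is_unassigned(ch: str) -> bool:
    # General category "Cn" means "not assigned" in the Unicode version
    # bundled with this Python build (noncharacters are also reported as Cn).
    return unicodedata.category(ch) == "Cn"


# U+40000 is in plane 4, which has no assignments at the time of writing.
ch = chr(0x40000)
text = "x" + ch + "y"

# The string holds the code point regardless of its assignment status.
print(unicodedata.unidata_version, is_unassigned(ch), len(text))
```

Accepting the code point costs the implementation nothing. Rejecting it would require consulting a table that goes stale with each new Unicode release.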

> Over 90% of [non-BMP] codepoints are nonsense not mapped to
> any character,

True, and it's unlikely that planes 4 through D will ever be used
by Unicode.  (Plane 3 is not yet in use, but almost certainly will be used
for archaic Han characters.)

> and I have not yet encountered any need for the few remaining.

It's by no means a few: as of Unicode 7.0, there are 10883 non-Han
characters outside the BMP and 47082 Han characters, for 57965 altogether.
Nor are they unused in practice.  The most common non-BMP *script* is
Gothic.  The most common use of non-BMP characters is the mathematical
alphanumeric symbols, which are required in order to do math in plain
text (see <http://www.unicode.org/reports/tr27/tr27-4.html> and search
on the page for "Hamiltonian").
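As a concrete illustration of the two kinds of non-BMP usage mentioned above, the following Python sketch looks at one mathematical alphanumeric symbol and one Gothic letter. The helper name utf16_code_units is hypothetical, and the two sample characters are my own choice, not taken from the cited report.

```python
import unicodedata


def utf16_code_units(s: str) -> int:
    # Each UTF-16 code unit is two bytes; a non-BMP character needs two units.
    return len(s.encode("utf-16-le")) // 2


bold_a = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A
ahsa = "\U00010330"    # GOTHIC LETTER AHSA

for ch in (bold_a, ahsa):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), utf16_code_units(ch))

# The Unicode 7.0 totals quoted above are consistent with each other.
non_bmp_total = 10883 + 47082
```

Both characters are a single code point each but occupy a surrogate pair in UTF-16, which is why an implementation limited to the BMP cannot represent them as characters.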

> I fear that if programmers are guaranteed that many nonsense code points
> and code points they're not personally using, they're going to start
> abusing them for some semantics-breaking purpose like encoding floats,
> or otherwise treating strings as blobs.

I understand that concern, but I think the lumpy shape of the available
code point space makes it mostly nugatory.  If we were supporting 16-
or 32-bit code units there would be a real concern.
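One way to read the "lumpy shape" remark, sketched in Python under the assumption that a string element must be a Unicode scalar value: the valid values do not form a full 16-, 21- or 32-bit range, so an arbitrary bit pattern such as the bits of a float will usually not be a legal character. The helper name is_scalar_value is hypothetical.

```python
import struct


def is_scalar_value(n: int) -> bool:
    # Unicode scalar values: 0..0x10FFFF with the surrogate range removed.
    return 0 <= n <= 0x10FFFF and not 0xD800 <= n <= 0xDFFF


# 0x110000 code points minus 0x800 surrogates.
scalar_count = sum(is_scalar_value(n) for n in range(0x110000))

# The 32-bit pattern of the float 1.0 is 0x3F800000, far outside the range.
float_bits = struct.unpack("<I", struct.pack("<f", 1.0))[0]
fits_in_a_char = is_scalar_value(float_bits)
```

With raw 32-bit code units the float's bits could be stored directly. With scalar values they cannot, which is what makes that kind of abuse awkward in practice.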

> Is there ever a reason for a string to contain NUL?

Semantically, no.  But there is no reason for a string to contain
any of the C0 control characters except TAB, CR, and LF these days.
Nobody thinks of excluding ^P just because classical synchronous modems
aren't used much any more.

> If we presume that the standard should leave a choice of internal
> normalization form to the implementations,

<chair hat="off">
Now we get into the tight and the bad and the crazy.  Do you really
think that it should be up to the implementation to choose whether
(string-length (string #\A #\x0301)) returns 1 or 2?  I submit
that for the (scheme base) versions of these procedures the result
must be 2.  If you want auto-normalizing, you need to create a
parallel string library that provides it.
</chair>

(#\x0301 is COMBINING ACUTE ACCENT, which, when it follows A, produces Á.)
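The length question can be checked directly in any language whose strings are sequences of code points. The sketch below uses Python and its standard unicodedata module rather than Scheme, because R7RS-small does not provide normalization procedures. It illustrates the two possible answers and does not describe what any Scheme implementation does.

```python
import unicodedata

# A followed by COMBINING ACUTE ACCENT, as in (string #\A #\x0301).
decomposed = "A\u0301"

# NFC composes the pair into the single code point U+00C1.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))

# NFD takes the composed form back to the original two code points.
round_trip = unicodedata.normalize("NFD", composed)
```

A length procedure that counts code points without normalizing sees two elements in the decomposed string. One that silently applied NFC would see one, and that is the implementation-dependent divergence objected to above.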

--
John Cowan          http://www.ccil.org/~cowan        cowan@ccil.org
        Is it not written, "That which is written, is written"?
