Re: [Scheme-reports] DISCUSSION/VOTE: The character tower Bear 06 May 2014 17:29 UTC

On Tue, 2014-05-06 at 02:22 -0400, John Cowan wrote:
> Bear scripsit:
>
> > Yes, with the exception of code points which are not actually mapped to
> > any character by the Unicode standard.
>
> For clarification, which of these do you mean?
>
> (a) Code points which will never correspond to any character, namely
> the surrogates?  (These are already excluded by -small.)
>
> (b) Code points for reserved noncharacters (there are 66 of these;
> they are not to be used in interchange, but may be useful internally to
> a program)?
>
> (c) Codepoints that will (or at least may) be assigned to characters in
> future versions of Unicode?

I was referring to (a) and (b), plus things like the 'tag
characters', which are irrevocably mapped in the standard but
which the current standard says not to use because they were
a design mistake.  I am ambivalent about (c).  On the one hand,
I don't want nonsense points to be a possibility in strings.
On the other, people may at some point be handling Unicode
characters defined by a standard newer than the implementation,
so requiring implementations to handle them may well be
rational.
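
To make the categories above concrete, here is a minimal sketch in
Python, assuming its standard unicodedata module as the source of
character data.  The names classify and is_noncharacter are
hypothetical, invented for this illustration.  Unicode reserves 66
noncharacters: U+FDD0..U+FDEF plus the last two code points of each
of the 17 planes.  Whether a point counts as "unassigned" depends on
the Unicode version the Python build ships with, which is exactly the
difficulty with case (c).

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the code points Unicode permanently reserves as noncharacters."""
    # U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF at the end of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Sort a code point into the buckets discussed in this thread."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"      # case (a)
    if is_noncharacter(cp):
        return "noncharacter"   # case (b)
    if cp == 0xE0001 or 0xE0020 <= cp <= 0xE007F:
        return "tag"            # the tag characters in plane 14
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"     # case (c), relative to this Unicode version
    return "assigned"


# Count the noncharacters across the whole code space.
NONCHARACTER_COUNT = sum(is_noncharacter(cp) for cp in range(0x110000))

for cp in (0x41, 0xD800, 0xFFFE, 0xE0001, 0x40000):
    print(f"U+{cp:04X}: {classify(cp)}")
```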

> > > 7) Should R7RS-large implementations be required to
> > > provide the characters from #\x10000 to #\x10FFFF?
> >
> > No.
>
> I'm curious why you reject these, seemingly out of hand.  They are
> required by a lot of scripts, though mostly archaic and minority-use ones.
> You similarly reject #11 without explanation.

Over 90% of these code points are nonsense, not mapped to
any character, and I have not yet encountered any need
for the few remaining.  I fear that if programmers are
guaranteed that many nonsense code points, plus code points
they're not personally using, they're going to start
abusing them for some semantics-breaking purpose like
encoding floats, or otherwise treating strings as
blobs.
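
The proportion can be checked rather than estimated.  The sketch
below, again assuming Python's unicodedata module, tallies the range
U+10000..U+10FFFF by general category; supplementary_census is a
hypothetical name.  The result depends on the Unicode version of the
Python build, and on whether the two private-use planes are counted
as "mapped", so no figures are quoted here.

```python
import unicodedata
from collections import Counter


def supplementary_census() -> Counter:
    """Tally U+10000..U+10FFFF into unassigned, private use, and assigned."""
    counts = Counter()
    for cp in range(0x10000, 0x110000):
        cat = unicodedata.category(chr(cp))
        if cat == "Cn":
            counts["unassigned"] += 1    # includes the plane-end noncharacters
        elif cat == "Co":
            counts["private use"] += 1   # planes 15 and 16
        else:
            counts["assigned"] += 1
    return counts


census = supplementary_census()
total = sum(census.values())
for label, n in census.most_common():
    print(f"{label:12} {n:8} {100 * n / total:5.1f}%")
```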

That said, implementers should definitely be *allowed* to
support these characters, and I assume that most will.

> > > 8) Should R7RS-large implementations be required to allow #\x0 in strings?
> >
> > Abstention.  If an implementation is serious enough about Unicode
> > support to keep its strings in a Unicode normalized form, which ought
> > not be forbidden, then NUL can never appear in any string.
>
> I don't understand this remark at all.  The normalized form of the U+0000
> character under any normalization form is quite simply itself.  The
> internal encoding of the characters with or without 0 bytes is not
> relevant here.

Is there ever a reason for a string to contain NUL?  NUL has no
semantics.  It is a nonsense point.  There is no concatenation,
substring, insertion, case operation, etc., of any linguistically
meaningful string in normal form which can result in a normalized
string with a NUL in it.  NUL is a concession to using strings as
blobs; now that we actually have blobs, we don't need it in strings.
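
Both halves of this exchange can be tested mechanically.  The sketch
below, assuming Python's unicodedata module, checks that U+0000 is
left unchanged by all four normalization forms, and separately that
no character's decomposition mapping contains U+0000, so that
normalization alone never introduces a NUL.  The function names are
hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def nul_is_fixed_point() -> bool:
    """True if every normalization form maps U+0000 to itself."""
    return all(unicodedata.normalize(f, "\x00") == "\x00" for f in FORMS)


def decompositions_yielding_nul() -> list[int]:
    """Code points whose decomposition mapping contains U+0000."""
    hits = []
    for cp in range(1, 0x110000):
        mapping = unicodedata.decomposition(chr(cp)).split()
        # Drop a leading compatibility tag such as "<compat>".
        if mapping and mapping[0].startswith("<"):
            mapping = mapping[1:]
        if any(int(tok, 16) == 0 for tok in mapping):
            hits.append(cp)
    return hits


print("NUL unchanged by normalization:", nul_is_fixed_point())
print("decompositions containing NUL:", decompositions_yielding_nul())
```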

> > Yes, with the exception of code points which are not actually mapped to
> > any character by the Unicode standard and code points which have a
> > canonical decomposition (i.e., the standard ought to allow an
> > implementation to implement strings as Unicode normalized strings).
>
> That is, in normalization form D, I assume you mean.  (Normalization form
> C is more commonly used, and actually encourages the use of characters
> with a canonical decomposition.)

If we presume that the standard should leave the choice of internal
normalization form to the implementations, then there are many
characters which the standard cannot require an implementation to
allow in strings: those with a canonical or compatibility
decomposition, and those which are for whatever reason nonsense
points.  There are many characters (singletons such as U+212B
ANGSTROM SIGN, and the composition exclusions) which have
canonical decompositions but are not themselves the result of
canonical composition.  These cannot appear even in NFC, and the
standard therefore cannot require implementations to allow them
in strings.  Ligatures such as U+FB01 have only compatibility
decompositions, so they are ruled out by NFKC and NFKD but not
by NFC.
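
Which characters a given normalization form rules out can be checked
directly.  The sketch below, assuming Python's unicodedata module,
reports the forms under which a character normalizes to itself;
forms_preserving and SAMPLES are hypothetical names.  In this data,
singletons such as U+212B and composition exclusions such as U+0958
survive no form at all, while a ligature such as U+FB01 carries only
a compatibility decomposition and so survives NFC and NFD.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def forms_preserving(ch: str) -> set[str]:
    """Normalization forms under which ch normalizes to itself."""
    return {f for f in FORMS if unicodedata.normalize(f, ch) == ch}


SAMPLES = {
    "\u00E9": "LATIN SMALL LETTER E WITH ACUTE (canonical composite)",
    "\u212B": "ANGSTROM SIGN (singleton)",
    "\u0958": "DEVANAGARI LETTER QA (composition exclusion)",
    "\uFB01": "LATIN SMALL LIGATURE FI (compatibility only)",
}

for ch, note in SAMPLES.items():
    print(f"U+{ord(ch):04X} {note}: {sorted(forms_preserving(ch))}")
```

An implementation that stores strings in some form X could accept
exactly the characters for which X is in this set, and normalize or
reject the rest.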

That was what I had in mind, but on the other hand there may
actually be a good reason for the standard to pick an internal
string normalization form for all implementations.  If there
really is a good reason, and it doesn't create an incompatibility
with R7RS-small, then the standard *should* pick a normalization
form.

Bear
