On Tue, 2014-05-06 at 02:22 -0400, John Cowan wrote:
> Bear scripsit:
>
> > Yes, with the exception of code points which are not actually mapped to
> > any character by the Unicode standard.
>
> For clarification, which of these do you mean?
>
> (a) Code points which will never correspond to any character, namely
> the surrogates?  (These are already excluded by -small.)
>
> (b) Code points for reserved noncharacters (there are 66 of these;
> they are not to be used in interchange, but may be useful internally to
> a program)?
>
> (c) Codepoints that will (or at least may) be assigned to characters in
> future versions of Unicode?

I was referring to (a) and (b), plus things like 'tag characters' which are irrevocably mapped in the standard but which the current standard says not to use because that was a design mistake.

I am ambivalent about (c); on the one hand I don't want nonsense points to be a possibility in strings, but on the other, given that at some point people may be handling Unicode characters defined by a standard newer than the implementation, there is at least a possibility that requiring implementations to handle them is rational.

> > > 7) Should R7RS-large implementations be required to
> > > provide the characters from #\x10000 to #\x10FFFF?
> >
> > No.
>
> I'm curious why you reject these, seemingly out of hand.  They are
> required by a lot of scripts, though mostly archaic and minority-use ones.
> You similarly reject #11 without explanation.

Over 90% of these codepoints are nonsense not mapped to any character, and I have not yet encountered any need for the few remaining. I fear that if programmers are guaranteed that many nonsense code points and code points they're not personally using, they're going to start abusing them for some semantics-breaking purpose like encoding floats, or otherwise treating strings as blobs.

That said, implementers should definitely be *allowed* to support these characters, and I assume that most will.
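
To make the categories above concrete, here is a minimal sketch in Python rather than Scheme, assuming the standard `unicodedata` module as a stand-in for whatever Unicode tables a Scheme implementation would carry. The `classify` function and its bucket names are hypothetical, chosen only for illustration; the "unassigned" bucket depends on the Unicode version bundled with the Python in use.

```python
import unicodedata
from collections import Counter

def classify(cp: int) -> str:
    """Bucket a code point along the lines discussed above (names are illustrative)."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"        # case (a)
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"     # case (b): U+FDD0..U+FDEF plus the last two of each plane
    if cp == 0xE0001 or 0xE0020 <= cp <= 0xE007F:
        return "tag"              # the 'tag characters' mentioned above
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        return "unassigned"       # case (c), as of this Python's Unicode tables
    if category == "Co":
        return "private-use"
    return "assigned"

# Tally the whole code space, then only #\x10000 through #\x10FFFF.
counts = Counter(classify(cp) for cp in range(0x110000))
supplementary = Counter(classify(cp) for cp in range(0x10000, 0x110000))

print("surrogates:", counts["surrogate"])
print("noncharacters:", counts["noncharacter"])
print("Unicode", unicodedata.unidata_version, "supplementary planes:", dict(supplementary))
```

The surrogate and noncharacter counts (2048 and 66) are fixed by the standard. The split of the supplementary planes between assigned, private-use and unassigned code points shifts with each Unicode release, so the last line is best read as a snapshot.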
> > > 8) Should R7RS-large implementations be required to allow #\x0 in strings?
> >
> > Abstention.  If an implementation is serious enough about Unicode
> > support to keep its strings in a Unicode normalized form, which ought
> > not be forbidden, then NUL can never appear in any string.
>
> I don't understand this remark at all.  The normalized form of the U+0000
> character under any normalization form is quite simply itself.  The
> internal encoding of the characters with or without 0 bytes is not
> relevant here.

Is there ever a reason for a string to contain NUL? NUL has no semantics. It is a nonsense point. There is no concatenation, substring, insertion, case operation, etc., of any linguistically meaningful string in normal form which can result in a normalized string with a NUL in it. NUL is a concession to using strings as blobs; now that we actually have blobs, we don't need it in strings.

> > Yes, with the exception of code points which are not actually mapped to
> > any character by the unicode standard and code points which have a
> > canonical decomposition (ie, the standard ought to allow an
> > implementation to implement strings as unicode normalized strings).
>
> That is, in normalization form D, I assume you mean.  (Normalization form
> C is more commonly used, and actually encourages the use of characters
> with a canonical decomposition.)

If we presume that the standard should leave a choice of internal normalization form to the implementations, then there are many characters which the standard cannot require an implementation to allow in strings: those with a canonical or compatibility decomposition, and those which are for whatever reason nonsense points. There are many characters (such as ligatures) which have canonical decompositions, but which are not themselves the result of canonical compositions, which cannot appear even in NFC, and which the standard therefore cannot require implementations to allow in strings.
That was what I had in mind, but on the other hand there may actually be a good reason for the standard to pick an internal string normalization form for all implementations. If there really is a good reason, and it doesn't create an incompatibility with R7RS-small, then the standard *should* pick a normalization form.

Bear
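
The question of which characters can survive in a normalized string is easy to probe experimentally. The following is a minimal sketch in Python, assuming its standard `unicodedata.normalize`; the helper `survives` and the sample set are hypothetical and chosen only for illustration. One caveat when reading the ligature example: as far as the Unicode data goes, the "fi" ligature U+FB01 carries only a compatibility decomposition, so it is stable under NFC and NFD, whereas a canonical singleton such as U+212B ANGSTROM SIGN is the kind of character that cannot appear in any normalized string.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def survives(ch: str, form: str) -> bool:
    """True if ch is left unchanged by the given normalization form."""
    return unicodedata.normalize(form, ch) == ch

# Illustrative samples only; any other characters can be substituted.
samples = {
    "NUL": "\u0000",
    "e-acute (precomposed)": "\u00e9",
    "angstrom sign (canonical singleton)": "\u212b",
    "fi ligature (compatibility only)": "\ufb01",
}

for name, ch in samples.items():
    stable = [form for form in FORMS if survives(ch, form)]
    print(f"U+{ord(ch):04X} {name}: stable under {stable or 'no form'}")
```

An implementation that stores strings in one fixed normalization form could only guarantee the characters that are stable under that form, which is the constraint at issue in this thread.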