Bear scripsit:

> I was referring to (a) and (b),

Well, (a) is out of the case.  The reasons why noncharacters, tag
characters, and other deprecated/discouraged characters should be
supported by string libraries are given at
<http://www.unicode.org/faq/private_use.html#noncharacters>, especially
the question "So how should libraries and tools handle noncharacters?"

> plus things like 'tag characters' which are irrevocably mapped in the
> standard but which the current standard says not to use because that
> was a design mistake.

Tag characters were introduced in order to stave off an attempt by the
late Mark Crispin to encode language tagging in plain text by hiding the
information in otherwise-unsupported UTF-8 code unit sequences.  Needless
to say, they have never been used for this or (as far as anyone knows)
any other purpose.  However, that doesn't make them much different from
the notorious t-with-cedilla pair, the Afrikaans n with preceding
apostrophe, the ohm sign, and other mistakes.

> I am ambivalent about (c); on the one hand I don't want nonsense points
> to be a possibility in strings, but on the other given that at some
> point people may be handling Unicode characters defined by a standard
> newer than the implementation, there is at least a possibility that
> requiring implementations to handle them is rational.

I think that's pretty much inevitable, especially since Unicode is now
switching to annual republication.  Trying to exclude unassigned but
valid codepoints is very much a moving target.  It's true that the set
of folks for whom this is a headache will likewise dwindle from year to
year, but minority-language users and scholars of archaic languages have
their needs too.  But why should implementers work hard to exclude such
characters anyway?

> Over 90% of [non-BMP] codepoints are nonsense not mapped to
> any character,

True, and it's unlikely that planes 4 through D will ever be used by
Unicode.
(Plane 3 is not yet in use, but almost certainly will be used for
archaic Han characters.)

> and I have not yet encountered any need for the few remaining.

It's by no means a few: as of Unicode 7.0, there are 10883 non-Han
characters outside the BMP and 47082 Han characters, for 57965
altogether.  Nor are they unused in practice.  The most common non-BMP
*script* is Gothic.  The most common use of non-BMP characters is the
mathematical alphanumeric symbols, which are required in order to do math
in plain text (see <http://www.unicode.org/reports/tr27/tr27-4.html> and
search on the page for "Hamiltonian").

> I fear that if programmers are guaranteed that many nonsense code points
> and code points they're not personally using, they're going to start
> abusing them for some semantics-breaking purpose like encoding floats,
> or otherwise treating strings as blobs.

I understand that concern, but I think the lumpy shape of the available
code point space makes it mostly nugatory.  If we were supporting 16- or
32-bit code units there would be a real concern.

> Is there ever a reason for a string to contain NUL?

Semantically, no.  But there is no reason for a string to contain any of
the C0 control characters except TAB, CR, and LF these days.  Nobody
thinks of excluding ^P just because classical synchronous modems aren't
used much any more.

> If we presume that the standard should leave a choice of internal
> normalization form to the implementations,

<chair hat="off">

Now we get into the tight and the bad and the crazy.  Do you really
think that it should be up to the implementation to choose whether
(string-length (string #\A #\x0301)) returns 1 or 2?  I submit that for
the (scheme base) versions of these procedures the result must be 2.
If you want auto-normalizing, you need to create a parallel string
library that provides it.

</chair>

(#\x0301 is COMBINING ACUTE ACCENT, which following A makes it Á.)

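For anyone who wants to see the 1-versus-2 distinction concretely, here
is a minimal sketch.  It uses Python rather than Scheme, purely as a
stand-in, on the assumption that Python's built-in `len` counts code
points and that normalization is an explicit call to the standard
`unicodedata` module.  The variable names are illustrative only.

```python
import unicodedata

# "A" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points that render as a single glyph.
decomposed = "A\u0301"

# Normalization happens only when asked for; NFC composes the pair
# into the single code point U+00C1.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))        # 2
print(len(composed))          # 1
print(composed == "\u00c1")   # True
```

The unnormalized string keeps its length of 2 unless the caller opts in
to normalization, which is the behavior argued for above for the base
string procedures.
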
-- 
John Cowan          http://www.ccil.org/~cowan        cowan@ccil.org
Is it not written, "That which is written, is written"?