Bear scripsit:

> That ship has sailed, I fear. When R6 and R7-small were being
> discussed, I advocated a model of characters as Unicode base character
> plus nondefective combining sequence, which would solve this problem
> and unambiguously make this string length 1, regardless of which
> normalization form were used to represent the value in memory.

In short, what Unicode calls "legacy grapheme clusters" nowadays. But there's a reason for that word "legacy": the notion (which is somewhat more general than you state) turned out not to be general enough, and so "extended grapheme clusters" were devised, with the recognition that grapheme clusters really need to be tailorable to a given locale as well, though we don't support tailoring of case conversion.

"I toyed with anarchy once, but on reading into the subject found that there were as many kinds of anarchy as there are of democracy. There are plain anarchists and syndicalist anarchists and deviationist anarchists and, for all I know, syndicalist deviationist anarchists. There's as much anarchy in anarchy as in any political philosophy."
        --Tully Bascomb

> It would also make 'indexing' and locations of characters in strings
> unambiguous, mostly eliminate length changes on case operations,
> preserve the string-as-sequence-of-characters semantics we had, and
> yield a character semantics cleaner and less ambiguous than Unicode's,
> capable of being mapped cleanly onto other character repertoires
> or representations, all of which I thought of as good things.

Indeed they are, and indeed it would -- almost. See <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>

> But it was without any widespread implementation support.

Alas, yes, and I tread very carefully to avoid standardizing what does not exist.

Papa Hegel he say that all we learn from history is that we learn nothing from history. I know people who can't even learn from what happened this morning. Hegel must have been taking the long view.
        --Chad C. Mulligan

> That rejection means we now have a model in which a character is a
> Unicode codepoint. Because there are multiple ways to express given
> strings as sequences of Unicode codepoints, the identity of a string
> is now unglued from the identities of the characters from which it was
> built, and we no longer have clean string-as-sequence-of-characters
> semantics.

Well, arguably we didn't have them before, because of the mutability of strings (which I don't like, but we are stuck with). "AB" and "AB" may or may not be identical in Scheme; the most we can say is that they are component-wise equal.

> (string-length (string #\A #\x301))
>
> [is] therefore now irrevocably dependent on Unicode, and specifically
> on Unicode normalization form.

I don't see where the normalization form comes into this. As things are, (string-length (string #\A #\x301)) is 2 and (string-length (string #\xC1)) is 1, even though they normalize to the same thing in either normalization form. So the current semantics don't depend on a NF; rather, they depend on not automatically applying any particular NF.

> So we now have this choice: either the standard dictates what
> normalization form to use in strings, or string-length etc. have
> implementation-defined semantics.

I don't think so, no, unless "don't normalize" counts as a normalization form, which it normally :-) doesn't.

-- 
John Cowan          http://www.ccil.org/~cowan        cowan@ccil.org
They do not preach that their God will rouse them
A little before the nuts work loose.
They do not teach that His Pity allows them
to drop their job when they damn-well choose.
        --Rudyard Kipling, "The Sons of Martha"
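[Editorial illustration: the point above -- that the two spellings have different lengths yet normalize identically under every normalization form -- is easy to check. Python is used here purely for illustration, since its `unicodedata` module exposes the Unicode normalization forms directly; the Scheme discussion itself does not depend on Python.]

```python
import unicodedata

decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
precomposed = "\u00C1"   # LATIN CAPITAL LETTER A WITH ACUTE

# Lengths differ because no normalization form is applied automatically:
# this is the analogue of (string-length (string #\A #\x301)) => 2
# versus (string-length (string #\xC1)) => 1.
assert len(decomposed) == 2
assert len(precomposed) == 1

# Yet both spellings normalize to the same sequence under either form.
assert unicodedata.normalize("NFC", decomposed) == unicodedata.normalize("NFC", precomposed)
assert unicodedata.normalize("NFD", decomposed) == unicodedata.normalize("NFD", precomposed)

# NFC composes to the single precomposed codepoint; NFD decomposes to two.
assert len(unicodedata.normalize("NFC", decomposed)) == 1
assert len(unicodedata.normalize("NFD", precomposed)) == 2
```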