On 05/16/2014 02:01 PM, John Cowan wrote:
> Here's the current list of Schemes that support the full numeric and
> character towers: Racket, Gauche, MIT, Gambit, Chicken (with eggs),
> Scheme48/scsh, Kawa, Chibi, Guile, Chez, Vicare, Larceny, Ypsilon, Mosh,
> IronScheme, STklos, KSi. In addition, the Java/CLR based Schemes (other
> than Kawa) almost do: they support characters up to U+FFFF.

You're giving Kawa *slightly* more credit than it deserves when it comes
to handling non-BMP characters: it's somewhat inconsistent in how it
deals with surrogates. The character type handles non-BMP characters,
and read-char, peek-char, and write-char convert these properly.
However, string-ref and string-set! just work on 16-bit code units,
including raw surrogates. The substring operations use code unit
offsets too. This is IMO a bug.

Fixing this while maintaining Java compatibility isn't easy. string-ref
is easy if you don't mind giving up O(1) performance. Since Kawa's
string type is the java.lang.CharSequence interface, one could define
new string type(s) that remain compatible with CharSequence, but have
the extra tables needed to allow O(1) indexing of non-BMP strings.
However, many Java APIs assume or return java.lang.String (which does
implement CharSequence), and you don't know a priori whether these
contain non-BMP characters. Also, Kawa's "immutable string" type is
just java.lang.String, and I'd hate to give that up.

One idea is to accept O(N) string-ref (at least on java.lang.String),
but have the compiler optimize iteration using string-ref, i.e. the
case where the string is loop-invariant but the index is an iteration
variable. In that case the compiler can add a parallel index variable
that tracks offsets in the char array. This isn't trivial, and it
doesn't help with substring operations.

Another idea is to have a small cache that maps codepoint indexes to
16-bit code unit offsets, for immutable strings. This should perhaps be
thread-local.

With string-set! we have the further complication that it can change
the length of the underlying char array. The solution is to just drop
the type of mutable fixed-length strings, and make all mutable strings
be variable-length, perhaps using a gap-buffer.

Kawa's XQuery implementation correctly indexes strings in terms of
codepoints, but of course performance is hurt - and it doesn't have to
deal with updates.
-- 
	--Per Bothner
per@bothner.com   http://per.bothner.com/

_______________________________________________
Scheme-reports mailing list
Scheme-reports@scheme-reports.org
http://lists.scheme-reports.org/cgi-bin/mailman/listinfo/scheme-reports
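
For readers who want to see the code-unit versus codepoint distinction
from the message above in concrete terms, here is a minimal Java sketch.
It is not Kawa's implementation: the class name CodePointIndexing and the
methods codePointRef, allCodePoints, and substringByCodePoints are
hypothetical names chosen for illustration. It only uses standard
java.lang.Character and java.lang.String methods, and it shows an O(N)
codepoint-indexed lookup, a loop that carries a parallel code unit offset
(roughly what the suggested compiler optimization would generate), and a
substring taken by codepoint offsets.

```java
// Illustrative sketch only; names are hypothetical, not part of Kawa.
final class CodePointIndexing {

    // O(N) "string-ref": walk from the start to find the code unit offset
    // of the index-th codepoint, then decode the codepoint found there.
    static int codePointRef(CharSequence s, int index) {
        int offset = Character.offsetByCodePoints(s, 0, index);
        return Character.codePointAt(s, offset);
    }

    // Iteration with a parallel index: i counts codepoints, while unit
    // tracks the matching offset in 16-bit code units, so each step is O(1).
    static int[] allCodePoints(CharSequence s) {
        int[] out = new int[Character.codePointCount(s, 0, s.length())];
        int unit = 0;
        for (int i = 0; i < out.length; i++) {
            int cp = Character.codePointAt(s, unit);
            out[i] = cp;
            unit += Character.charCount(cp); // 1 for BMP, 2 for non-BMP
        }
        return out;
    }

    // Substring whose start and end are codepoint offsets, not code units.
    static String substringByCodePoints(String s, int start, int end) {
        int from = s.offsetByCodePoints(0, start);
        int to = s.offsetByCodePoints(from, end - start);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        // "a", U+1F600 (encoded as a surrogate pair), "b"
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                          // code units
        System.out.println(allCodePoints(s).length);             // codepoints
        System.out.println(Integer.toHexString(s.charAt(1)));    // raw surrogate
        System.out.println(Integer.toHexString(codePointRef(s, 1)));
    }
}
```

With the sample string, charAt(1) yields the raw high surrogate, while
codePointRef(s, 1) yields the full codepoint U+1F600; that gap is the
behaviour the message describes for string-ref.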