[Scheme-reports] dealing with non-BMP characters [was: Reformulated numeric-tower ballot]

[Scheme-reports] dealing with non-BMP characters [was: Reformulated numeric-tower ballot] Per Bothner 16 May 2014 21:50 UTC

On 05/16/2014 02:01 PM, John Cowan wrote:
> Here's the current list of Schemes that support the full numeric and
> character towers:  Racket, Gauche, MIT, Gambit, Chicken (with eggs),
> Scheme48/scsh, Kawa, Chibi, Guile, Chez, Vicare, Larceny, Ypsilon, Mosh,
> IronScheme, STklos, KSi.  In addition, the Java/CLR based Schemes (other
> than Kawa) almost do: they support characters up to U+FFFF.

You're giving Kawa *slightly* more credit than it deserves when it comes
to handling non-BMP characters: It's somewhat schizophrenic when it comes
to dealing with surrogates.  The character type handles non-BMP characters,
and read-char, peek-char, and write-char convert these properly.

However, string-ref and string-set! just work on 16-bit code units -
including raw surrogates.  The substring operations use code unit
offsets too.  This is IMO a bug.
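
To make the code unit behaviour concrete, here is a minimal Java sketch.
The class and method names are hypothetical stand-ins for a string-ref and
a substring that index 16-bit code units; they are not Kawa's actual
implementation.  Only standard java.lang calls are used.

```java
final class CodeUnitIndexingDemo {
    // Hypothetical stand-in for a string-ref that indexes 16-bit code units.
    static char codeUnitRef(String s, int index) {
        return s.charAt(index);
    }

    // Hypothetical stand-in for a substring that takes code unit offsets.
    static String codeUnitSubstring(String s, int start, int end) {
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // "a", U+1F600 (encoded as the surrogate pair D83D DE00), "b"
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 codepoints
        // Index 1 yields half of a pair, not the character U+1F600.
        System.out.println(Character.isSurrogate(codeUnitRef(s, 1))); // true
        // A code unit substring can cut a surrogate pair in half.
        String half = codeUnitSubstring(s, 0, 2);
        System.out.println(Character.isHighSurrogate(half.charAt(1))); // true
    }
}
```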

Fixing this while maintaining Java compatibility isn't easy.  string-ref
is easy if you don't mind giving up O(1) performance.  Since Kawa's string
type is the java.lang.CharSequence interface, one could define new string
type(s) that remain compatible with CharSequence, but have the needed extra
tables to allow O(1) indexing of non-BMP strings.  However, many Java APIs
assume or return java.lang.String (which does implement CharSequence),
and you don't know a priori if these contain non-BMP characters. Also,
Kawa's "immutable string" type is just java.lang.String, and I'd hate to
give that up.
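
As a rough sketch of the "new string type with extra tables" idea: the
class below is hypothetical (IndexedString, codePointRef and
codePointLength are names invented for illustration), and it uses the
simplest possible table, one int per codepoint.  A real implementation
would presumably want something more compact.

```java
final class IndexedString implements CharSequence {
    private final String text;
    // offsets[i] is the code unit offset of codepoint i.
    // null means the text has no surrogate pairs, so both indexes coincide.
    private final int[] offsets;

    IndexedString(String text) {
        this.text = text;
        int count = text.codePointCount(0, text.length());
        if (count == text.length()) {
            offsets = null;
        } else {
            offsets = new int[count];
            int unit = 0;
            for (int cp = 0; cp < count; cp++) {
                offsets[cp] = unit;
                unit += Character.charCount(text.codePointAt(unit));
            }
        }
    }

    // Number of codepoints, as opposed to length() in code units.
    int codePointLength() {
        return offsets == null ? text.length() : offsets.length;
    }

    // O(1) codepoint-indexed reference via the table.
    int codePointRef(int index) {
        int unit = offsets == null ? index : offsets[index];
        return text.codePointAt(unit);
    }

    // The CharSequence methods keep code unit semantics for Java callers.
    @Override public int length() { return text.length(); }
    @Override public char charAt(int i) { return text.charAt(i); }
    @Override public CharSequence subSequence(int start, int end) {
        return text.subSequence(start, end);
    }
    @Override public String toString() { return text; }
}
```

This only helps for strings created through the wrapper; a plain
java.lang.String coming back from a Java API still has no table.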

One idea is to accept O(N) string-ref (at least on java.lang.String), but have
the compiler optimize iteration using string-ref: I.e. when the string is
loop-invariant, but the index is an iteration variable.  In that case the compiler
can add a parallel index variable that uses offsets in the char array.  This isn't
trivial, and it doesn't help with substring operations.
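
Roughly, the rewrite the compiler would perform looks like the following,
written out by hand in Java.  The class and method names are hypothetical;
the first loop is what a naive O(N) string-ref gives you, the second is the
same loop with a parallel code unit index added.

```java
final class ParallelIndexLoop {
    // Naive codepoint-indexed ref: rescans from the start on every call.
    static int refByCodePoint(String s, int index) {
        return s.codePointAt(s.offsetByCodePoints(0, index));
    }

    // Source-level loop: O(N) per ref, so O(N^2) overall.
    static long sumNaive(String s) {
        long sum = 0;
        int n = s.codePointCount(0, s.length());
        for (int i = 0; i < n; i++) {
            sum += refByCodePoint(s, i);
        }
        return sum;
    }

    // Transformed loop: `i` is still the codepoint index the program sees,
    // while `unit` is the parallel index into the underlying char array.
    static long sumWithParallelIndex(String s) {
        long sum = 0;
        int unit = 0;
        for (int i = 0; unit < s.length(); i++) {
            int cp = s.codePointAt(unit);
            sum += cp;
            unit += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return sum;
    }
}
```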

Another idea is to have a small cache that maps codepoint indexes to 16-bit
code unit offsets, for immutable strings.  This should perhaps be thread-local.
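
One possible shape for such a cache, as a sketch only: a single remembered
position per thread.  The class name and the one-entry policy are
assumptions made for illustration; the cache size and eviction policy are
left open above.

```java
final class CodePointOffsetCache {
    // One remembered (string, codepoint index, code unit offset) per thread.
    private static final class Entry {
        String text;
        int codePointIndex;
        int unitOffset;
    }

    private static final ThreadLocal<Entry> LAST =
        ThreadLocal.withInitial(Entry::new);

    static int codePointRef(String s, int index) {
        Entry e = LAST.get();
        int unit;
        // Identity comparison is deliberate: the cache is keyed on the
        // immutable String object itself, not on its contents.
        if (e.text == s && index >= e.codePointIndex) {
            // Resume scanning from the cached position.
            unit = s.offsetByCodePoints(e.unitOffset, index - e.codePointIndex);
        } else {
            // Cache miss or backwards access: scan from the start.
            unit = s.offsetByCodePoints(0, index);
        }
        int cp = s.codePointAt(unit);
        e.text = s;
        e.codePointIndex = index;
        e.unitOffset = unit;
        return cp;
    }
}
```

Sequential forward access then costs O(1) amortized per ref.  Note that the
entry holds a strong reference to the last string used on each thread; a
real implementation might want a weak reference instead.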

With string-set! we have the further complication that it can change the length
of the underlying char array.  The solution is to just drop the type of mutable
fixed-length strings, and make all mutable strings be variable-length, perhaps
using a gap-buffer.
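
The length change is easy to see with java.lang.StringBuilder standing in
for a variable-length mutable string (it is not a gap buffer, just the
nearest standard class).  setCodePoint is a hypothetical helper playing the
role of a codepoint-indexed string-set!.

```java
final class CodePointStringSet {
    // Replace the codepoint at codepoint index `index` with `newCodePoint`.
    // The underlying char sequence grows or shrinks by one code unit when a
    // BMP character is swapped for a non-BMP one, or the reverse.
    static void setCodePoint(StringBuilder sb, int index, int newCodePoint) {
        int unit = sb.offsetByCodePoints(0, index);
        int oldUnits = Character.charCount(sb.codePointAt(unit));
        String replacement = new String(Character.toChars(newCodePoint));
        sb.replace(unit, unit + oldUnits, replacement);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("abc");
        setCodePoint(sb, 1, 0x1F600);    // BMP -> non-BMP
        System.out.println(sb.length()); // 4
        setCodePoint(sb, 1, 'x');        // non-BMP -> BMP
        System.out.println(sb.length()); // 3
    }
}
```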

Kawa's XQuery implementation correctly indexes strings in terms of codepoints,
but of course performance is hurt - and it doesn't have to deal with updates.
--
	--Per Bothner
per@bothner.com   http://per.bothner.com/
