Re: [Scheme-reports] Proposing amending char-numeric? definition

Alex Shinn 28 Apr 2011 01:26 UTC
On Wed, Apr 27, 2011 at 5:40 PM, Shiro Kawai <shiro.kawai@gmail.com> wrote:
> OK, let me rehash the argument and turn it into more of a proposal.
>
> The draft's wording of char-numeric? is confusing, because Unicode doesn't
> define a "Numeric" property explicitly, the way it defines the "Alphabetic"
> and "Uppercase" properties.  So I propose to change it.
>
> There are a few possible resolutions.
>
> (1) Define char-numeric? to return #t if the character's Numeric_Type
> property value is anything other than 'None'.  This seems a natural
> interpretation of the current wording.  However, I think it is
> practically useless, since it *can't* be used to pick numbers out of
> a string.  Characters whose Numeric_Type isn't 'None' include
> ordinary alphabetic characters (category Lo) that happen to have
> meanings related to numbers.  For example, '幺' (U+5E7A) has
> Numeric_Type = 'Numeric': the character means small or young, so
> it can mean 1 in some specific contexts (in Japanese, probably the
> only place it means '1' is in some Mah-jong terms).  So, when I'm
> scanning a string and find that char-numeric? returns #t for a
> character, and that character happens to be '幺' (U+5E7A), what do
> I do?  It is probably part of another word, so I should treat it as an
> alphabetic character.  And even if I want to make use of it, I need a
> separate database to look up which number '幺' represents.
>
> (2) Drop char-numeric?, and add char-numeric-type and
> char-numeric-value.  The former returns the value of the Numeric_Type
> property, and the latter returns the value of the Numeric_Value property.
> This would be the way to provide access to a character's Unicode
> "Numeric" properties.
>
> (3) Define char-numeric? to return #t only for 0,1,2,3,4,5,6,7,8 and
> 9.  This retains compatibility with R5RS: we can still use
> char-numeric? to parse numbers, and safely use (- (char->integer c)
> (char->integer #\0)) to obtain the digit value the character
> represents.  (Note: R5RS programs that use char-numeric? to parse
> numbers will break if we adopt the current draft's definition of
> char-numeric?.)
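
To make the breakage noted under (3) concrete, here is a rough sketch in
Python (chosen only because its standard unicodedata module exposes the
relevant Unicode data; it is not part of any Scheme proposal).  The
helper name naive_digit_value is made up for illustration and mirrors
the (- (char->integer c) (char->integer #\0)) idiom.

```python
import unicodedata

def naive_digit_value(c: str) -> int:
    # Hypothetical helper: the code-point subtraction used by R5RS-style
    # parsers. Only meaningful when c is one of ASCII '0'..'9'.
    return ord(c) - ord("0")

ascii_three = "3"
arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, category Nd

# For ASCII digits the subtraction gives the digit value.
print(naive_digit_value(ascii_three))

# For a digit from another script the subtraction gives a meaningless
# number, while the Unicode database still knows the digit value.
print(naive_digit_value(arabic_indic_three))
print(unicodedata.digit(arabic_indic_three))
```

So a parser that trusts a wider char-numeric? but keeps the subtraction
would silently compute wrong values rather than fail.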

I'll have more to say about this when I get back
from my vacation, but will make a quick comment
now.

We're unlikely to remove char-numeric?, since that
would break R5RS compatibility.  We could add
char-numeric-type and char-numeric-value in addition
to it, though.
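
As a rough idea of what such a pair of procedures might return, here is
a sketch in Python using its standard unicodedata module.  The names
char_numeric_type and char_numeric_value are just transliterations of
the proposed Scheme names, and deriving Numeric_Type from which lookup
succeeds is an approximation assumed here, not something the draft
specifies.

```python
import unicodedata

def char_numeric_type(c: str) -> str:
    # Approximate Numeric_Type by probing from the narrowest lookup to
    # the widest: decimal() covers Decimal, digit() adds Digit, and
    # numeric() adds everything else that has a numeric value.
    if unicodedata.decimal(c, None) is not None:
        return "Decimal"
    if unicodedata.digit(c, None) is not None:
        return "Digit"
    if unicodedata.numeric(c, None) is not None:
        return "Numeric"
    return "None"

def char_numeric_value(c: str):
    # Numeric_Value as a float, or None when the character has none.
    return unicodedata.numeric(c, None)

for ch in ["7", "\u00b2", "\u00bd", "\u5e7a", "a"]:
    print(repr(ch), char_numeric_type(ch), char_numeric_value(ch))
```

With these, a caller who wants to use '幺' (U+5E7A) as a number no
longer needs a separate database, and a caller who only wants digits
can filter on the type.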

At my work we recently had a case of an application
written for English which detected numbers in text
by looking for ASCII '0'..'9'.  It turns out that we probably
want this to apply to all digits in all scripts.  That would
include the standard ideographic numbers as well as
the old accounting numbers (壱 U+58F1, etc.), but
probably not characters whose only Unihan numeric value
is kOtherNumeric (as with 幺).  So a middle ground between
(1) and (3) may be desirable.
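
One possible shape for that middle ground, sketched in Python with its
standard unicodedata module.  Everything here is an assumption for
illustration: the predicate name char_numeric_middle is made up, and
the ideograph set is a tiny hand-picked excerpt, since unicodedata does
not say which Unihan field a numeric value came from.  A real version
would need the full kPrimaryNumeric and kAccountingNumeric lists.

```python
import unicodedata

# Hand-picked excerpt, NOT a complete list (assumption for illustration).
IDEOGRAPHIC_NUMBERS = {
    "\u4e00",  # 一, kPrimaryNumeric 1
    "\u4e8c",  # 二, kPrimaryNumeric 2
    "\u4e09",  # 三, kPrimaryNumeric 3
    "\u58f1",  # 壱, kAccountingNumeric 1
}

def char_numeric_middle(c: str) -> bool:
    # Accept decimal digits of any script (general category Nd) ...
    if unicodedata.category(c) == "Nd":
        return True
    # ... plus the listed ideographic numbers, but not characters such
    # as 幺 (U+5E7A) whose numeric value is only kOtherNumeric.
    return c in IDEOGRAPHIC_NUMBERS

for ch in ["5", "\u0663", "\u58f1", "\u5e7a", "a"]:
    print(repr(ch), char_numeric_middle(ch))
```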

--
Alex
