Re: [Scheme-reports] Seeking review of sets and hash tables proposals

Show/hide message thread
[scheme-reports-wg2] Seeking review of sets and hash tables proposals John Cowan (20 May 2013 05:29 UTC)
Re: [Scheme-reports] Seeking review of sets and hash tables proposals Alaric Snell-Pym (24 May 2013 14:00 UTC)
Re: [Scheme-reports] Seeking review of sets and hash tables proposals Per Bothner (25 May 2013 00:59 UTC)
[scheme-reports-wg2] Re: [Scheme-reports] Seeking review of sets and hash tables proposals Kevin Wortman (25 May 2013 20:27 UTC)
Re: [Scheme-reports] Seeking review of sets and hash tables proposals Evan Hanson (25 May 2013 06:50 UTC)
Re: [Scheme-reports] Seeking review of sets and hash tables proposals Per Bothner (25 May 2013 18:48 UTC)
Re: [Scheme-reports] Seeking review of sets and hash tables proposals Per Bothner (25 May 2013 19:38 UTC)
Re: [Scheme-reports] Seeking review of sets and hash tables proposals Alexey Radul (25 May 2013 22:50 UTC)
Re: [Scheme-reports] Seeking review of sets and hash tables proposals Alexey Radul 25 May 2013 22:49 UTC
Having read the proposals and the discussion thread, I have the
following comments:

0) +1 for portable hash tables

1) Functional hash tables?

I am not sure of the value of specifying names for functional update
procedures that are permitted to take O(n) time.  I fully support
eventually spelling out some kind of story for some sort of functional
association structure that offers both reasonably fast lookups and
reasonably fast updates; but I do not know of any way to get said
"reasonably fast lookups" to be O(1) in this case.  If hash tables are
advertised in the preamble as being inherently mutable structures for
the purposes of this proposal, why bother with a specification for
very slow functional update functions?

2) Universal lookup procedure

I am fond of the following universal lookup procedure [1]:
  (hash-table-search table key
    (lambda (value update remove)
      ...)
    (lambda (insert)
      ...))
Here the first continuation argument is called iff the element is
present and the second iff not.  The "results" are:
- `value' is the value the element maps to;
- `update' is a unary procedure that updates the value of the element
  to a new value;
- `remove' is a nullary procedure that removes the element; and
- `insert' is a unary procedure that inserts the element with the
  given value.
The whole `search' call returns whatever the continuation it called
returns.

The advantage of thinking about point access to collections in terms
of `search' is that all other procedures that operate on one key can
be specified as simple patterns of invocation of `search'; and, if
one's compiler is polite enough to inline literal procedures, can even
be efficiently implemented that way.

Furthermore, if the procedural arguments to `search' do not mutate
the hash table too badly, the common pattern
  (if (hash-table-contains? table key)
      (let ((value (hash-table-ref table key)))
        (do-something value))
      (do-something-else))
can be implemented without having to repeat the actual lookup, because
the `update', `remove', and `insert' procedures can close over the
location to operate on.

To the extent this item contains a suggestion, it is to consider
specifying a `search' procedure for hash tables, sets, and bags, [2]
[3] and to consider whether using it would simplify the
implementation.  Specifying a `search' that is suitable for external
use is a little tricky, because, in order for `search' to be able to
offer a performance advantage, the procedural arguments to `search'
have to be carefully restricted.

[1] Taylor Campbell suggested this to me years ago.

[2] Both mutative and functional versions of `search' are possible,
depending on the behavior of `update', `remove', and `insert'.  In the
functional case, the programmer would need to arrange to return the
new hash table from the continuation procedures if they wanted to use
it later.

[3] set-search could be simpler, because `value' is always #t and
`update' makes no sense, so those could go away.  For bag-search,
`update' still makes no sense, but `remove' and `insert' could accept
an exact non-negative integer to mean "how many times".

3) Hash functions for sets?

If a set is to be implemented as a hash table, how is an appropriate
hash function to be derived from the equivalence predicate?  There
doesn't seem to be any room for the client programmer to supply a hash
function that would be appropriate for the desired equivalence
predicate.

4) Weak vs Ephemeral hash tables

The hash table proposal leaves a hook for requesting "hash tables with
weak keys" and "hash tables with weak values".  Herein lurks a
subtlety that will need to be addressed in a final proposal (even a
final "intermediate" proposal, because it would be appropriate to
specify what, exactly, the hooks are being left for).

The subtlety is this: What should happen if the only strong-reference
path to a key in a key-weak hash table goes through that key's value
in that hash table, and the value is not accessible any other way [4]?
If the programmer's intent is that the hash table's values can be
accessed by traversing the hash table, then the entry should be kept.
If the programmer's intent is that the only way to use the hash table
to get one of the values is by looking up the key, then the entry
should be removed and the key and value garbage collected.  The latter
semantics is considerably more difficult to implement (see
"ephemerons"), and is probably inevitably less efficient.

There are seven possible distinct weakness behaviors for associations,
[5] and I am under the impression that they are all useful, albeit
some in more abstruse situations than others (my apologies for the
nomenclature):
a strong (entries hold keys and values strongly)
b key-weak (entries hold keys weakly)
c value-weak (entries hold values weakly)
d key-weak-and-value-weak (entries hold keys and values weakly)
e key-and-value-weak (an entry is dropped if both the key and the
  value are inaccessible through strong pointers, but not otherwise)
f key-ephemeral (an entry is dropped if the key cannot be
  accessed except through the entry)
g value-ephemeral (an entry is dropped if the value cannot be
  accessed except through the entry)

I suggest adjusting the proposal to make clear which of these
semantics a user is permitted to request, and how; or to make clear
that this is not going to be made clear.  The reason I suggest this is
that the phrase "key weak" could mean either (b) or (f) to different
users and implementers in the absence of a more thorough discussion.

For what it's worth, MIT Scheme contains an undocumented
implementation of hash tables with all of these weakness semantics.

[4] This actually happened to me when I tried to use MIT Scheme's weak
hash tables to memoize a function whose return value was a data
structure that contained its argument.  The same problem arises if the
path goes through the value of some other key that can only be
accessed through the value of that key, or if there are longer cycles.

[5] Assuming that entries whose keys or values have been reclaimed are
themselves removed from the hash table, but I cannot image why anyone
would want any other behavior.

5) In-place key change?

It seems as though the in-place mutation of hash-table-map! would be
considerably more useful (for efficiency) if the proc were not allowed
to change the keys to inequivalent ones.  In fact, since an
implementation of hash-table-map! would need to prevent keys returned
by proc from being fed back into proc, it couldn't do an in-place loop
at all, and would effectively need to build a new hash table and then
mutate its argument to become it.  Is this behavior useful for
anything?  Even if it is, should a variant be specified that requires
proc not to change the key (and that can therefore actually do an
in-place traversal)?

It also seems that leaving the key alone would be a common use case
for hash-table-map; is it worth specifying a variant that accepts
this?  Or specifying that the procedure given to hash-table-map may
return either one value or two, with the second being assumed to be
the key if absent?

6) Mergers changing keys

I am surprised by the proposal to allow the merger argument to the
various hash table manipulation procedures to return an arbitrary new
key as a result of a collision.  What purpose does this serve?  Its
cost strikes me as high: what if the merger returns a key that is not
equivalent to either of its input keys?  What if that key conflicts
with a different key already present in the result being built?  This
is also inconsistent with the mergers specified for hash-table-union!
and hash-table-intersection!.

7) Small nits

Why is the failure argument to hash-table-replace! mandatory?  If not
a typo, that seems like a useless irregularity.

hash-table-push! should probably say "may be implemented more
efficiently than" to be consistent

Presumably all the hash table traversal operations should forbid any
procedural arguments from mutating the hash table underfoot (with the
possible exception of changing the binding of the key being
inspected).

Clarification in "hash tables as functions": presumably
hash-table-mutator should accept a key and a value, rather than just a
key; and presumably hash-table-replacer should return the result of
the underlying hash-table-replace! rather than nothing.

MIT Scheme supports hash tables with arbitrary equivalence predicates:
see strong-hash-table/constructor and weak-hash-table/constructor.

Best,
~Alexey

On Mon, May 20, 2013 at 1:29 AM, John Cowan <cowan@mercury.ccil.org> wrote:
> I'm writing this as an individual contributor to WG2 and the Scheme
> community, not as the chair of WG2.
>
> I would like to ask people to review and send criticisms of two
> proto-SRFIs that I intend to propose for R7RS-large.  These will both
> become SRFIs (the first is submitted already, the second will be soon)
> and then will be voted on by the WG after SRFI finalization.  The more
> early commentary from the community the better.
>
> The first is on sets, bags, integer sets, and enumeration
> sets.  The current editor's draft is in SRFI format at
> <http://ccil.org/~cowan/temp/srfi-sets.html>.  Send feedback to either
> list or directly to me.
>
> The second is on intermediate hash tables -- "intermediate"
> because it does not yet give a serious account of weak hash
> tables, a WG2 goal.  The current editor's draft is on the wiki at
> <http://trac.sacrideo.us/wg/wiki/HashTablesCowan>.
>
> Send feedback to either list or directly to me.
>
> --
> John Cowan
>         cowan@ccil.org
>                 I am a member of a civilization. --David Brin
>
> _______________________________________________
> Scheme-reports mailing list
> Scheme-reports@scheme-reports.org
> http://lists.scheme-reports.org/cgi-bin/mailman/listinfo/scheme-reports

_______________________________________________
Scheme-reports mailing list
Scheme-reports@scheme-reports.org
http://lists.scheme-reports.org/cgi-bin/mailman/listinfo/scheme-reports