Implement union, intersection, union_counts, and intersection_counts #22

HarrisonMc555 · 2019-10-16T02:43:18Z

The union and union_counts methods return iterators that yield elements the maximum number of times they appear in either set. This is different from the previously implemented Add trait, which yields each element the combined number of times it appears in both sets.

The intersection and intersection_counts methods return iterators that yield elements the minimum number of times they appear in either set. This is distinct from the previously implemented Sub trait, which yields each element the number of times it appears in the first set less the number of times it appears in the second.

The difference between the plain variants and the _counts variants is that the former yields only the elements (repeated the correct number of times) while the latter yields a tuple of an element and the count of that element.

This closes #17.

I'm totally willing to take feedback on implementation, naming, or even whether you want to include all of these changes or not.

Let me know what you think!

The `union` and `union_counts` methods return iterators that yield elements the maximum number of times they appear in either set. This is different from the previously implemented `Add` trait, which yields each element the combined number of times it appears in both sets. The `intersection` and `intersection_counts` methods return iterators that yield elements the minimum number of times they appear in either set. This is distinct from the previously implemented `Sub` trait, which yields each element the number of times it appears in the first set less the number of times it appears in the second. The different between the plain variants and the `_counts` variants is that the former yields only the elements (repeated the correct number of times) while the latter yields a tuple of an element and the count of that element.

HarrisonMc555 · 2019-10-16T02:51:52Z

I just realized that I didn't add any documentation to the new structs. I also didn't implement/derive Clone. Should I add those things?

mashedcode · 2019-10-19T21:10:00Z

I feel like it might be confusing to have union and union_counts. Can you name a use-case for when one would want to use union over union_counts?

HarrisonMc555 · 2019-10-20T17:40:37Z

Honestly, I implemented union_counts first (named union), but created the method now named union to match more closely with the current implementation of iter.

I also think that you should be able to construct a collection from the result of an iterator over it. An iterator over key/count pairs cannot be used to construct a HashMultiSet...

...unless it could. We could add (I wrote one earlier to test it out) another implementation of FromIterator<T: IntoIterator<Item = (A, usize)>>.

The only issue with this is that now you have overlapping implementations of FromIterator. This doesn't prevent the code from compiling, but it does mean that if you try to run this code:

let set1: HashMultiSet<_> = [1, 2, 2, 3, 3, 3].iter().cloned().collect();
let set2: HashMultiSet<_> = [1, 1, 1, 3, 4].iter().cloned().collect();
let union: HashMultiSet<_> = set1.union(&set2).collect();

You'll get this error:

error[E0282]: type annotations needed
    --> src/multiset.rs:1087:20
     |
1087 |         let union: HashMultiSet<_> = set1.union_counts(&set2).collect();
     |             -----  ^^^^^^^^^^^^^^^ cannot infer type
     |             |
     |             consider giving `union` a type

You have to be more specific, and write this:

let set1: HashMultiSet<_> = [1, 2, 2, 3, 3, 3].iter().cloned().collect();
let set2: HashMultiSet<_> = [1, 1, 1, 3, 4].iter().cloned().collect();
let union: HashMultiSet<&u32> = set1.union(&set2).collect();

This also demonstrates that it can be obnoxious to use, since a call to cloned won't help transform a (&K, usize) to a (K, usize)...

You'd have to write something like this:

let union: HashMultiSet<u32> = set1
    .union_counts(&set2)
    .map(|(&key, count)| (key, count))
    .collect();

Which is a little bit annoying.

As a consolation, union_counts is still more efficient than union, and is O(keys) instead of O(size)...and "size" may be much larger than "keys"!

So I'm totally up for feedback or ideas!

P.S. I totally forgot that I left the second implementation for FromIterator in there, whoops! But you can try it out and see for yourself.

mashedcode · 2019-10-20T23:42:15Z

Re-reading your implementation both union_counts and union don't do an actual union but a union_update (as python calls it). I think it should do count a + count b instead of max(count a, count b).

Looking at the python multiset and the C++ std::multiset implementation. They both don't have an extra _counts method equivalent. Obviously they mutate in-place instead of returning an iterator.
I don't think that having the _counts is a good idea. IMHO one should just do .union(&multiset2).collect().iter_counts() instead. Because otherwise we'd need a _counts for every counterpart. Let's say for example you'd choose to implement union_update you'll also have to add union_update_counts. That's a lot of bloat even if generated by a macro.

src/multiset.rs

HarrisonMc555 · 2019-10-21T07:34:39Z

Looks like there are two different concerns here. Let's address each of them.

1. Definition of "union"

I think we have different ideas of what a "union" should mean in the context of a multiset. Let's explore it a little bit.

Set

Since sets do not contain duplicates, a union of two sets contains every value that appears in either set, without duplicates.

Map (Dictionary, etc.)

Since maps contain keys and values, it's unclear which value would "win" if a key is present in both maps. The Rust HashMap implementation does not contain an implementation for union, presumably because it seems to be an ill-defined operation on maps.

Multiset (Bags, etc.)

A multiset is conceptually a set that can contain duplicates. In my mind, a union is thought of as the "overlap" between the two sets.

In my mind, a union between two multisets would contain each value the maximum number of times it appears in either set. If the union contained the sum of the two sets, it would be more than the overlap.

The implementation you suggested, where each value appears the sum of the times it appears in both sets is a useful concept. It is the current implementation of the Add trait.

I would prefer to implement union as I have proposed and keep the implementation of the Add trait for the "summation" of the two sets, i.e. each value is contain the sum of the times it appears in both sets.

2. Should we keep the "`_counts`" methods?

I totally understand the aversion to having *_counts methods for every possible method. I'm not a huge fan of having two different versions either.

The biggest argument I can think of is that of efficiency. I try to avoid premature optimization, but there is a large difference between union and union_counts. The current implementation of union is O(keys×counts), while union_counts is O(keys). If your counts were extremely large, the difference would be staggering.

If we're only going to keep one, I would propose keeping union_counts over union (and probably renaming it to be union).

Alternative?

As I was writing this, it occurred to me that there is another alternative. We could forego the iterator pattern altogether and only provide methods that mutate an existing multiset and/or create an entirely new multiset. This would be less efficient if the user was going to create a different data structure and doesn't need the intermediate HashMulitset, but that is probably the exception rather than the rule.

As a counter-point, it is fairly idiomatic to provide this kind of functionality with iterators. This is how Vec, HashSet, and HashMap all work. There are a few methods on each struct that modify them in place, but that mostly is for methods like insert, remove, etc. The most similar methods to our discussion, HashSet's union and intersection, return an iterator.

mashedcode · 2019-10-21T12:14:22Z

Yeah sorry, you seem to be right about union! That's also what the python multiset union does. Not sure why I thought otherwise.

Yes union_counts is more efficient. Yes most people probably just want to do .union_counts(b).collect() or union(b).collect(). In that use-case they would be equal.

Looking at the Collection Conventions RFC we could do it in-place and call it union_with or use the bit-wise operators.

No I don't know the best solution. Just thinking out loud.

Add documentation comments and export Iterators

fdc8166

Fix typos in documentation comments

490c857

mashedcode reviewed Oct 21, 2019

View reviewed changes

src/multiset.rs Show resolved Hide resolved

HarrisonMc555 mentioned this pull request Oct 21, 2019

Add new iter_counts method #23

Open

jmitchell mentioned this pull request Jan 6, 2020

RFC: project maintainership #28

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement union, intersection, union_counts, and intersection_counts #22

Implement union, intersection, union_counts, and intersection_counts #22

Uh oh!

HarrisonMc555 commented Oct 16, 2019

Uh oh!

HarrisonMc555 commented Oct 16, 2019

Uh oh!

mashedcode commented Oct 19, 2019

Uh oh!

HarrisonMc555 commented Oct 20, 2019 •

edited

Loading

Uh oh!

mashedcode commented Oct 20, 2019 •

edited

Loading

Uh oh!

Uh oh!

HarrisonMc555 commented Oct 21, 2019

Uh oh!

mashedcode commented Oct 21, 2019 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Implement union, intersection, union_counts, and intersection_counts #22

Are you sure you want to change the base?

Implement union, intersection, union_counts, and intersection_counts #22

Uh oh!

Conversation

HarrisonMc555 commented Oct 16, 2019

Uh oh!

HarrisonMc555 commented Oct 16, 2019

Uh oh!

mashedcode commented Oct 19, 2019

Uh oh!

HarrisonMc555 commented Oct 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mashedcode commented Oct 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

HarrisonMc555 commented Oct 21, 2019

1. Definition of "union"

Set

Map (Dictionary, etc.)

Multiset (Bags, etc.)

2. Should we keep the "_counts" methods?

Alternative?

Uh oh!

mashedcode commented Oct 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HarrisonMc555 commented Oct 20, 2019 •

edited

Loading

mashedcode commented Oct 20, 2019 •

edited

Loading

2. Should we keep the "`_counts`" methods?

mashedcode commented Oct 21, 2019 •

edited

Loading