Skip to content

Conversation

@HarrisonMc555
Copy link
Contributor

The union and union_counts methods return iterators that yield elements the maximum number of times they appear in either set. This is different from the previously implemented Add trait, which yields each element the combined number of times it appears in both sets.

The intersection and intersection_counts methods return iterators that yield elements the minimum number of times they appear in either set. This is distinct from the previously implemented Sub trait, which yields each element the number of times it appears in the first set less the number of times it appears in the second.

The difference between the plain variants and the _counts variants is that the former yields only the elements (repeated the correct number of times) while the latter yields a tuple of an element and the count of that element.


This closes #17.

I'm totally willing to take feedback on implementation, naming, or even whether you want to include all of these changes or not.

Let me know what you think!

The `union` and `union_counts` methods return iterators that yield elements the
maximum number of times they appear in either set. This is different from the
previously implemented `Add` trait, which yields each element the combined
number of times it appears in both sets.

The `intersection` and `intersection_counts` methods return iterators that yield
elements the minimum number of times they appear in either set. This is distinct
from the previously implemented `Sub` trait, which yields each element the
number of times it appears in the first set less the number of times it appears
in the second.

The different between the plain variants and the `_counts` variants is that the
former yields only the elements (repeated the correct number of times) while the
latter yields a tuple of an element and the count of that element.
@HarrisonMc555
Copy link
Contributor Author

I just realized that I didn't add any documentation to the new structs. I also didn't implement/derive Clone. Should I add those things?

@mashedcode
Copy link
Collaborator

I feel like it might be confusing to have union and union_counts. Can you name a use-case for when one would want to use union over union_counts?

@HarrisonMc555
Copy link
Contributor Author

HarrisonMc555 commented Oct 20, 2019

Honestly, I implemented union_counts first (named union), but created the method now named union to match more closely with the current implementation of iter.

I also think that you should be able to construct a collection from the result of an iterator over it. An iterator over key/count pairs cannot be used to construct a HashMultiSet...

...unless it could. We could add (I wrote one earlier to test it out) another implementation of FromIterator<T: IntoIterator<Item = (A, usize)>>.

The only issue with this is that now you have overlapping implementations of FromIterator. This doesn't prevent the code from compiling, but it does mean that if you try to run this code:

let set1: HashMultiSet<_> = [1, 2, 2, 3, 3, 3].iter().cloned().collect();
let set2: HashMultiSet<_> = [1, 1, 1, 3, 4].iter().cloned().collect();
let union: HashMultiSet<_> = set1.union(&set2).collect();

You'll get this error:

error[E0282]: type annotations needed
    --> src/multiset.rs:1087:20
     |
1087 |         let union: HashMultiSet<_> = set1.union_counts(&set2).collect();
     |             -----  ^^^^^^^^^^^^^^^ cannot infer type
     |             |
     |             consider giving `union` a type

You have to be more specific, and write this:

let set1: HashMultiSet<_> = [1, 2, 2, 3, 3, 3].iter().cloned().collect();
let set2: HashMultiSet<_> = [1, 1, 1, 3, 4].iter().cloned().collect();
let union: HashMultiSet<&u32> = set1.union(&set2).collect();

This also demonstrates that it can be obnoxious to use, since a call to cloned won't help transform a (&K, usize) to a (K, usize)...

You'd have to write something like this:

let union: HashMultiSet<u32> = set1
    .union_counts(&set2)
    .map(|(&key, count)| (key, count))
    .collect();

Which is a little bit annoying.

As a consolation, union_counts is still more efficient than union, and is O(keys) instead of O(size)...and "size" may be much larger than "keys"!

So I'm totally up for feedback or ideas!

P.S. I totally forgot that I left the second implementation for FromIterator in there, whoops! But you can try it out and see for yourself.

@mashedcode
Copy link
Collaborator

mashedcode commented Oct 20, 2019

Re-reading your implementation both union_counts and union don't do an actual union but a union_update (as python calls it). I think it should do count a + count b instead of max(count a, count b).

Looking at the python multiset and the C++ std::multiset implementation. They both don't have an extra _counts method equivalent. Obviously they mutate in-place instead of returning an iterator.
I don't think that having the _counts is a good idea. IMHO one should just do .union(&multiset2).collect().iter_counts() instead. Because otherwise we'd need a _counts for every counterpart. Let's say for example you'd choose to implement union_update you'll also have to add union_update_counts. That's a lot of bloat even if generated by a macro.

@HarrisonMc555
Copy link
Contributor Author

Looks like there are two different concerns here. Let's address each of them.

1. Definition of "union"

I think we have different ideas of what a "union" should mean in the context of a multiset. Let's explore it a little bit.

Set

Since sets do not contain duplicates, a union of two sets contains every value that appears in either set, without duplicates.

Map (Dictionary, etc.)

Since maps contain keys and values, it's unclear which value would "win" if a key is present in both maps. The Rust HashMap implementation does not contain an implementation for union, presumably because it seems to be an ill-defined operation on maps.

Multiset (Bags, etc.)

A multiset is conceptually a set that can contain duplicates. In my mind, a union is thought of as the "overlap" between the two sets.

In my mind, a union between two multisets would contain each value the maximum number of times it appears in either set. If the union contained the sum of the two sets, it would be more than the overlap.

The implementation you suggested, where each value appears the sum of the times it appears in both sets is a useful concept. It is the current implementation of the Add trait.

I would prefer to implement union as I have proposed and keep the implementation of the Add trait for the "summation" of the two sets, i.e. each value is contain the sum of the times it appears in both sets.

2. Should we keep the "_counts" methods?

I totally understand the aversion to having *_counts methods for every possible method. I'm not a huge fan of having two different versions either.

The biggest argument I can think of is that of efficiency. I try to avoid premature optimization, but there is a large difference between union and union_counts. The current implementation of union is O(keys×counts), while union_counts is O(keys). If your counts were extremely large, the difference would be staggering.

If we're only going to keep one, I would propose keeping union_counts over union (and probably renaming it to be union).

Alternative?

As I was writing this, it occurred to me that there is another alternative. We could forego the iterator pattern altogether and only provide methods that mutate an existing multiset and/or create an entirely new multiset. This would be less efficient if the user was going to create a different data structure and doesn't need the intermediate HashMulitset, but that is probably the exception rather than the rule.

As a counter-point, it is fairly idiomatic to provide this kind of functionality with iterators. This is how Vec, HashSet, and HashMap all work. There are a few methods on each struct that modify them in place, but that mostly is for methods like insert, remove, etc. The most similar methods to our discussion, HashSet's union and intersection, return an iterator.

@mashedcode
Copy link
Collaborator

mashedcode commented Oct 21, 2019

Yeah sorry, you seem to be right about union! That's also what the python multiset union does. Not sure why I thought otherwise.

Yes union_counts is more efficient. Yes most people probably just want to do .union_counts(b).collect() or union(b).collect(). In that use-case they would be equal.

Looking at the Collection Conventions RFC we could do it in-place and call it union_with or use the bit-wise operators.

No I don't know the best solution. Just thinking out loud.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding union, iter_counts, etc.

2 participants