-
Notifications
You must be signed in to change notification settings - Fork 27
#1017 Fix is_column_of_iterables to check all values instead of just first row #1446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
resources/schema/Operator.md
Outdated
|
|
||
| Will return True if the value in `value` is contained within the collection/iterable in the target column, or if there's an exact match for non-iterable data. | ||
|
|
||
| The operator determines if a column contains iterables by checking that **all** values in the column are iterables (lists or sets), not `None`, and not `NaN`. If all values are iterables, the operator performs row-by-row comparison. If the column has mixed types (e.g., `None` and lists), the column is not treated as iterables and the operator uses value vs all column values logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit verbose. A lot of the rule authors are non-technical and this maybe leans too technical so I would avoid the term iterables. The operator checks if every value in a column is a list or set. If yes, it compares row-by-row. If any value is blank or a different type (like a string or number), it compares each value against the entire column instead.
resources/schema/Operator.md
Outdated
|
|
||
| Complement of `contains`. Returns True when the value is NOT contained within the target collection. | ||
|
|
||
| The operator uses the same iterable detection logic as `contains`: all values must be iterables for the column to be treated as iterables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is necessary since the docs already state it is the complement of contains.
resources/schema/Operator.md
Outdated
|
|
||
| Value in `name` compared against a list in `value`. The list can have literal values or be a reference to a `$variable`. | ||
|
|
||
| When the `value` parameter references a column variable (e.g., `$variable`), the operator determines if that column contains iterables by checking that **all** values in the column are iterables (lists or sets), not `None`, and not `NaN`. If all values are iterables, the operator performs row-by-row comparison. If the column has mixed types, the operator uses value vs all column values logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont know if this needs to be restated. I would just reference what you already wrote above. 'this operator behaves similarly to contains. The key distinction: contains checks if comparator ∈ target, while is_contained_by checks if target ∈ comparator'
| {"target": "target", "comparator": comparator} | ||
| ) | ||
| assert result.equals(df.convert_to_series(expected_result)) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these tests look good
SFJohnson24
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments
SFJohnson24
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work, looks good.
This PR fixes a bug in
is_column_of_iterableswhere it only checked the first row (iloc[0]) to determine if a column contained iterables, which could cause misclassification when the first row wasNonebut later rows were lists or sets. The fix now checks all values in the column, returningTrueonly when all values are iterables (lists or sets), notNone, and notNaN. This prevents incorrect behavior in containment operators (contains,does_not_contain,is_contained_by) when used with groupeddistinctoperations that produce columns with mixed types (e.g.,Nonefor groups with no matches, lists for groups with matches). The PR includes tests for the bug scenario (None-first-row cases) and updates the operator documentation inOperator.mdto explain the iterable detection behavior.