Skip to content

Conversation

@RakeshBobba03
Copy link
Collaborator

This PR fixes a bug in is_column_of_iterables where it only checked the first row (iloc[0]) to determine if a column contained iterables, which could cause misclassification when the first row was None but later rows were lists or sets. The fix now checks all values in the column, returning True only when all values are iterables (lists or sets), not None, and not NaN. This prevents incorrect behavior in containment operators (contains, does_not_contain, is_contained_by) when used with grouped distinct operations that produce columns with mixed types (e.g., None for groups with no matches, lists for groups with matches). The PR includes tests for the bug scenario (None-first-row cases) and updates the operator documentation in Operator.md to explain the iterable detection behavior.

@RakeshBobba03 RakeshBobba03 marked this pull request as ready for review November 20, 2025 23:38

Will return True if the value in `value` is contained within the collection/iterable in the target column, or if there's an exact match for non-iterable data.

The operator determines if a column contains iterables by checking that **all** values in the column are iterables (lists or sets), not `None`, and not `NaN`. If all values are iterables, the operator performs row-by-row comparison. If the column has mixed types (e.g., `None` and lists), the column is not treated as iterables and the operator uses value vs all column values logic.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit verbose. A lot of the rule authors are non-technical and this maybe leans too technical so I would avoid the term iterables. The operator checks if every value in a column is a list or set. If yes, it compares row-by-row. If any value is blank or a different type (like a string or number), it compares each value against the entire column instead.


Complement of `contains`. Returns True when the value is NOT contained within the target collection.

The operator uses the same iterable detection logic as `contains`: all values must be iterables for the column to be treated as iterables.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is necessary since the docs already state it is the complement of contains.


Value in `name` compared against a list in `value`. The list can have literal values or be a reference to a `$variable`.

When the `value` parameter references a column variable (e.g., `$variable`), the operator determines if that column contains iterables by checking that **all** values in the column are iterables (lists or sets), not `None`, and not `NaN`. If all values are iterables, the operator performs row-by-row comparison. If the column has mixed types, the operator uses value vs all column values logic.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont know if this needs to be restated. I would just reference what you already wrote above. 'this operator behaves similarly to contains. The key distinction: contains checks if comparator ∈ target, while is_contained_by checks if target ∈ comparator'

{"target": "target", "comparator": comparator}
)
assert result.equals(df.convert_to_series(expected_result))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests look good

Copy link
Collaborator

@SFJohnson24 SFJohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

@RakeshBobba03 RakeshBobba03 changed the title Fix is_column_of_iterables to check all values instead of just first row #1017 Fix is_column_of_iterables to check all values instead of just first row Nov 23, 2025
Copy link
Collaborator

@SFJohnson24 SFJohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, looks good.

@SFJohnson24 SFJohnson24 merged commit 89f2653 into main Nov 24, 2025
11 checks passed
@SFJohnson24 SFJohnson24 deleted the 1017-DDF00102 branch November 24, 2025 19:03
@RakeshBobba03 RakeshBobba03 restored the 1017-DDF00102 branch December 4, 2025 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CORERULES-9645 - is_column_of_iterables method only checks first row

3 participants