CG0562 #1454
    "data, expected, regex",
    [
        (
            PandasDataset.from_dict(
All the test cases here use PandasDataset only. Could you please add some cases using Dask?
added
RamilCDISC left a comment
The PR adds a regex option to the record count operation so that only a specific part of a value is selected for comparison. The PR was validated by:
- Validating the PR for any unwanted code or comments.
- Validating the PR logic in context with the AC.
- Ensuring all the unit and regression testing pass.
- Ensuring all related testing is updated.
- Ensuring the updated testing covers cases for both pandas and Dask implementations.
- Running manual testing using dev editor for positive dataset.
- Running manual testing using dev editor for negative dataset.
- Ensuring test cases for the regex matching.
This PR resolves the issues currently reported with CG0562. Mainly, the authors need to use only the date portion of a value, not the full datetime, for the record count. I added regex handling to the record count operator to support this; the rule attached below implements it. I also added logic for grouping on a wildcard column so that the -- in the column name is processed correctly.
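As a rough illustration of the idea (not the engine's actual code), the sketch below counts records per subject and date by reducing a datetime string to its date portion before grouping. The sample data, the `AE` domain, the `--DTC` column, and the date pattern are all assumptions made for this example, as is the reading of `--` as a placeholder for the two-letter domain code.

```python
import pandas as pd

# Hypothetical AE records; the column names and values are illustrative only.
records = pd.DataFrame(
    {
        "USUBJID": ["S1", "S1", "S1", "S2"],
        "AEDTC": [
            "2025-01-01T08:00",
            "2025-01-01T17:30",
            "2025-01-02T09:15",
            "2025-01-01T10:00",
        ],
    }
)

# Assumption: a leading "--" wildcard stands for the two-letter domain code.
domain = "AE"
group_columns = [name.replace("--", domain, 1) for name in ["USUBJID", "--DTC"]]

# Capture only the date portion (YYYY-MM-DD) of each datetime string.
date_pattern = r"^(\d{4}-\d{2}-\d{2})"
date_only = records["AEDTC"].str.extract(date_pattern, expand=False)

# Fall back to the original value where the pattern does not match.
grouped = records.assign(AEDTC=date_only.fillna(records["AEDTC"]))

# Count records per (subject, date) group.
counts = grouped.groupby(group_columns).size().to_dict()
```

Without the regex step, each of the four datetime strings would form its own group of size one; with it, the two records for subject S1 on 2025-01-01 fall into the same group.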
Datasets.json
Datasets.xlsx
Rule_underscores.json
CORE-Report-2025-12-01T15-16-44.xlsx
This pull request adds support for applying regex transformations to grouping columns in the `record_count` operation, allowing users to group records based on extracted patterns (such as dates from datetime strings). The changes include updates to the operation logic, schema, documentation, and comprehensive unit tests to ensure correct behavior for regex-based grouping, including support for grouping aliases and filters.

Record Count Operation Enhancements
- Added support for a new `regex` parameter in the `record_count` operation, enabling transformation of grouping column values using a regex pattern before grouping. This allows, for example, grouping by just the date portion of a datetime string. (`cdisc_rules_engine/operations/record_count.py`, `cdisc_rules_engine/models/operation_params.py`, `cdisc_rules_engine/utilities/rule_processor.py`, `resources/schema/Operations.json`)
- Implemented helper methods `_get_grouping_for_operations`, `_get_regex_grouped_counts`, and `_apply_regex_to_grouping_columns` in `record_count.py` to handle regex transformation and grouping logic robustly, including proper handling of grouping aliases and filters. (`cdisc_rules_engine/operations/record_count.py`)

Schema and Documentation Updates
- Updated the schema (`Operations.json`) and documentation (`Operations.md`) to describe the new `regex` parameter and provide examples of how to use regex-based grouping in YAML operation definitions. (`resources/schema/Operations.json`, `resources/schema/Operations.md`)

Unit Test Coverage
- Added unit tests for regex-based grouping, including grouping aliases and filters. (`tests/unit/test_operations/test_record_count.py`)

Codebase Consistency
- Related changes in `cdisc_rules_engine/operations/base_operation.py`.

These changes collectively make the `record_count` operation more flexible and powerful for data analysis involving grouped record counts with transformed grouping keys.
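The transform-then-group flow summarized above could look roughly like the following sketch. The function names, signatures, and the choice to return one count per input row are hypothetical and only loosely mirror the helpers named in the summary; the PR's actual implementation in `record_count.py` may differ.

```python
import re

import pandas as pd


def apply_regex_to_grouping_columns(df, grouping, pattern):
    """Return a copy of df whose grouping columns hold the regex-matched part."""
    compiled = re.compile(pattern)

    def extract(value):
        # Leave non-string values and non-matching strings unchanged.
        if not isinstance(value, str):
            return value
        match = compiled.search(value)
        if match is None:
            return value
        # Prefer the first capture group; otherwise use the whole match.
        return match.group(1) if compiled.groups else match.group(0)

    transformed = df.copy()
    for column in grouping:
        transformed[column] = transformed[column].map(extract)
    return transformed


def regex_grouped_counts(df, grouping, pattern):
    """Return, for each row, the size of its group after the regex transform."""
    transformed = apply_regex_to_grouping_columns(df, grouping, pattern)
    return transformed.groupby(grouping)[grouping[0]].transform("size")


# Illustrative data only.
example = pd.DataFrame(
    {
        "USUBJID": ["S1", "S1", "S1", "S2"],
        "AEDTC": [
            "2025-01-01T08:00",
            "2025-01-01T17:30",
            "2025-01-02T09:15",
            "2025-01-01T10:00",
        ],
    }
)
row_counts = regex_grouped_counts(
    example, ["USUBJID", "AEDTC"], r"^\d{4}-\d{2}-\d{2}"
)
```

Working on a copy keeps the original column values intact, so the transformed keys are used only for counting and never leak into the dataset being validated.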