@SFJohnson24 SFJohnson24 commented Nov 25, 2025

This PR resolves the issues currently reported with CG0562. Mainly: the authors need to filter on only the date portion, not the full datetime, for the record count. I added regex handling to the record_count operator to support this; the rule attached below implements it. I also added logic for grouping on a wildcard column so that the -- in the column name is processed correctly.
Datasets.json
Datasets.xlsx
Rule_underscores.json
CORE-Report-2025-12-01T15-16-44.xlsx

This pull request adds support for applying regex transformations to grouping columns in the record_count operation, allowing users to group records based on extracted patterns (such as dates from datetime strings). The changes include updates to the operation logic, schema, documentation, and comprehensive unit tests to ensure correct behavior for regex-based grouping, including support for grouping aliases and filters.

Record Count Operation Enhancements

  • Added support for a new regex parameter in the record_count operation, enabling transformation of grouping column values using a regex pattern before grouping. This allows, for example, grouping by just the date portion of a datetime string. (cdisc_rules_engine/operations/record_count.py, cdisc_rules_engine/models/operation_params.py, cdisc_rules_engine/utilities/rule_processor.py, resources/schema/Operations.json) [1] [2] [3] [4] [5]

  • Implemented helper methods _get_grouping_for_operations, _get_regex_grouped_counts, and _apply_regex_to_grouping_columns in record_count.py to handle regex transformation and grouping logic robustly, including proper handling of grouping aliases and filters. (cdisc_rules_engine/operations/record_count.py) [1] [2]
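
To illustrate the idea behind regex-based grouping, here is a minimal standalone sketch. It is not the engine's implementation: the function name `regex_group_counts`, the sample records, and the choice to fall back to the original value when the pattern does not match are all assumptions made for illustration. It uses only the standard library rather than the pandas or Dask dataset classes.

```python
import re
from collections import Counter


def regex_group_counts(records, group_column, pattern):
    """Count records per group after reducing each grouping value to its regex match.

    Hypothetical helper for illustration; not part of cdisc_rules_engine.
    """
    compiled = re.compile(pattern)
    counts = Counter()
    for record in records:
        value = record.get(group_column)
        match = compiled.match(value) if isinstance(value, str) else None
        # Assumption: keep the original value when the pattern does not match.
        key = match.group(0) if match else value
        counts[key] += 1
    return dict(counts)


# Sample data (made up): three records, two of them on the same calendar date.
records = [
    {"USUBJID": "001", "AEDTC": "2025-01-01T08:30"},
    {"USUBJID": "001", "AEDTC": "2025-01-01T17:45"},
    {"USUBJID": "001", "AEDTC": "2025-01-02T09:00"},
]

# Group by the date portion only, ignoring the time component.
date_counts = regex_group_counts(records, "AEDTC", r"^\d{4}-\d{2}-\d{2}")
print(date_counts)
```

Without the regex step, each of the three datetime strings would form its own group of one record; with it, the two records on 2025-01-01 are counted together.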

Schema and Documentation Updates

  • Updated the operations schema (Operations.json) and documentation (Operations.md) to describe the new regex parameter and provide examples of how to use regex-based grouping in YAML operation definitions. (resources/schema/Operations.json, resources/schema/Operations.md) [1] [2] [3]

Unit Test Coverage

  • Added extensive unit tests for the new regex grouping feature, covering scenarios with and without grouping aliases, and with filters, ensuring correctness and robustness of the implementation. (tests/unit/test_operations/test_record_count.py)

Codebase Consistency

  • Refactored and updated related code to consistently resolve variable names with domain wildcards and ensure correct merging and grouping behavior in the base operation logic. (cdisc_rules_engine/operations/base_operation.py) [1] [2]
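
As a rough sketch of what resolving a domain wildcard means, the snippet below replaces a leading `--` in a column name with the domain code before the name is used for grouping. The helper name `resolve_wildcard` and the sample column list are hypothetical, and the actual logic in base_operation.py may differ.

```python
def resolve_wildcard(column_name, domain):
    """Replace a leading '--' domain wildcard with the domain code.

    Hypothetical helper for illustration; not the engine's own function.
    """
    if column_name.startswith("--"):
        return domain + column_name[2:]
    return column_name


# Example grouping list (made up) containing one wildcard column.
group = ["USUBJID", "--DTC"]
resolved = [resolve_wildcard(name, "AE") for name in group]
print(resolved)
```

Resolving the names up front means the grouping and merging steps only ever see concrete column names such as AEDTC.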

These changes collectively make the record_count operation more flexible and powerful for data analysis involving grouped record counts with transformed grouping keys.

@SFJohnson24 SFJohnson24 marked this pull request as ready for review December 1, 2025 22:25
@SFJohnson24 SFJohnson24 changed the title wildcard logic CG)562 Dec 2, 2025
@SFJohnson24 SFJohnson24 changed the title CG)562 CG0562 Dec 2, 2025
"data, expected, regex",
[
(
PandasDataset.from_dict(

All the test cases here use PandasDataset only. Could you please add some cases using Dask?

added

@SFJohnson24 SFJohnson24 removed their assignment Dec 4, 2025
@SFJohnson24 SFJohnson24 self-assigned this Dec 5, 2025
@RamilCDISC RamilCDISC left a comment

The PR adds a regex option to the record_count operation, which selects a specific part of a value to compare with. The PR was validated by:

  1. Validating the PR for any unwanted code or comments.
  2. Validating the PR logic in context with the AC.
  3. Ensuring all the unit and regression tests pass.
  4. Ensuring all related testing is updated.
  5. Ensuring the updated testing covers cases for both pandas and DASK implementations.
  6. Running manual testing using dev editor for positive dataset.
  7. Running manual testing using dev editor for negative dataset.
  8. Ensuring there are test cases for the regex matching.

@RamilCDISC RamilCDISC merged commit c76f2f2 into main Dec 5, 2025
11 checks passed
@RamilCDISC RamilCDISC deleted the cg0562 branch December 5, 2025 21:54