record count optimize #1472
Conversation
    )
    result = dataframe[grouping_columns].copy()
    result["size"] = transformed_with_counts["size"].values
    result = result.groupby(grouping_columns, as_index=False, dropna=False).first()
The new logic breaks row alignment and is not equivalent to the previous logic. This can cause silent bugs. Could you please confirm whether you believe this is correct and that regex transformations will not reorder rows?
@RamilCDISC why do you believe it will break row alignment? df_for_grouping is transformed by apply_regex_to_grouping_columns, but that performs only element-wise operations on the selected columns; no rows are removed, added, or reordered by it. The left merge keeps all rows in their original order. I don't see where a change in row alignment would be introduced.
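To illustrate the element-wise point, here is a minimal sketch with hypothetical data and column names; `apply_regex_to_grouping_columns` is assumed to reduce to per-element string replacements like this one:

```python
import pandas as pd

# Hypothetical data; the real transform runs over the selected grouping columns
df = pd.DataFrame({"USUBJID": ["S-001", "S-002", "S-003"], "AVAL": [1, 2, 3]})

transformed = df.copy()
# Element-wise regex replacement: strips the numeric suffix from each value
transformed["USUBJID"] = transformed["USUBJID"].str.replace(r"-\d+$", "", regex=True)

# No rows are added, dropped, or reordered, so row alignment is preserved
assert transformed.index.equals(df.index)
assert len(transformed) == len(df)
```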
Sorry, I actually wanted to tag line 105. The merge operation can change the ordering. The previous code had 'idx', which could be used to restore the original order. This may not be a big issue. Please let me know your thoughts.
@RamilCDISC
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
"left: use only keys from left frame, similar to a SQL left outer join; preserve key order "
It is a left merge so the key order is preserved.
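This is easy to check directly (a small sketch with made-up data):

```python
import pandas as pd

left = pd.DataFrame({"key": ["b", "a", "b", "c"], "val": [1, 2, 3, 4]})
counts = left.groupby("key", as_index=False).size()  # columns: key, size

# how="left" keeps every left-frame row; with unique right keys the
# original row order of the left frame is preserved
merged = left.merge(counts, on="key", how="left")
assert merged["key"].tolist() == ["b", "a", "b", "c"]
assert merged["size"].tolist() == [2, 1, 2, 1]
```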
         else effective_grouping
     )
    -if self.params.regex:
    +if self.params.regex and not filtered:
filtered here would be either None or a DataFrame. Using `not` on a DataFrame raises:
```
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
This would need different handling logic here.
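For reference, a minimal reproduction together with the explicit check that avoids the ambiguity (the variable name is hypothetical):

```python
import pandas as pd

filtered = pd.DataFrame({"a": [1, 2]})  # in the real code: a DataFrame or None

# pandas refuses to coerce a DataFrame to a single boolean
try:
    if not filtered:
        pass
except ValueError as exc:
    print(type(exc).__name__)  # ValueError

# The safe pattern is an explicit identity check against None
if filtered is None:
    print("no pre-filtering happened")
```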
I corrected this: I changed it to an explicit None check, since I want to avoid applying the regex to the original frame when filtering has taken place and the regex will be applied to the filtered frame lower down.
RamilCDISC left a comment
The PR optimizes the record count operation. Validation was done by:
- Reviewing the PR for any unwanted code or comments.
- Reviewing the PR code against the AC.
- Comparing the new logic with the old logic.
- Checking for potential edge cases to ensure the updated code is bug- and error-free.
- Ensuring all unit and regression tests pass.
- Running manual testing with positive and negative datasets.
In reviewing the #1454 changes for sprint review, I noticed an optimization: instead of copying the frame and adding an index column to merge the grouped result back onto the original, we can rely on the positional alignment of `.values`. This adds a memory optimization while maintaining the functionality.
This pull request refines the logic for handling regex-based grouping and counting in the `record_count.py` operation. The main improvements focus on ensuring that regex grouping is only applied when necessary and on simplifying the process of merging grouped counts back to the original dataframe.

Logic improvements for regex grouping:
- `_execute_operation` now only performs regex grouping when a regex is provided and no pre-filtering has occurred, preventing unnecessary operations.

Simplification and efficiency in merging grouped counts:
- `_get_regex_grouped_counts` now streamlines merging grouped counts back to the original dataframe, removing the auxiliary `_idx` column and simplifying the assignment of the `"size"` column.
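A sketch of how the merge-back can work without an auxiliary index column (hypothetical data and column names; the real logic lives in `_get_regex_grouped_counts`): group sizes are computed on the regex-transformed columns and assigned back positionally via `.values`, which is safe because the element-wise transform preserves row order.

```python
import pandas as pd

# Hypothetical stand-ins for the operation's inputs
dataframe = pd.DataFrame({"GROUP": ["x1", "x2", "y1"], "AVAL": [10, 20, 30]})
grouping_columns = ["GROUP"]

# Element-wise regex transform of the grouping columns (order-preserving)
transformed = dataframe[grouping_columns].copy()
transformed["GROUP"] = transformed["GROUP"].str.replace(r"\d+$", "", regex=True)

# Per-row group sizes, aligned by index -- no auxiliary `_idx` column needed
transformed["size"] = transformed.groupby(
    grouping_columns, dropna=False
)["GROUP"].transform("size")

# Assign positionally via .values, then collapse to one row per original group
result = dataframe[grouping_columns].copy()
result["size"] = transformed["size"].values
result = result.groupby(grouping_columns, as_index=False, dropna=False).first()
assert result["size"].tolist() == [2, 2, 1]
```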