Skip to content

Conversation

@SFJohnson24
Copy link
Collaborator

in reviewing #1454 changes for sprint review--i noticed an optimization. Instead of copying the frame and adding index to merge back to the original from the grouped, we can rely on the indexes in values(). This adds a memory optimization while maintaining the functionality.

This pull request refines the logic for handling regex-based grouping and counting in the record_count.py operation. The main improvements are focused on ensuring that regex grouping is only applied when necessary and simplifying the process of merging grouped counts back to the original dataframe.

Logic improvements for regex grouping:

  • Updated the condition in _execute_operation to only perform regex grouping when a regex is provided and no pre-filtering has occurred, preventing unnecessary operations.

Simplification and efficiency in merging grouped counts:

  • Refactored _get_regex_grouped_counts to streamline the process of merging grouped counts back to the original dataframe, removing the use of auxiliary _idx columns and simplifying the assignment of the "size" column.

)
result = dataframe[grouping_columns].copy()
result["size"] = transformed_with_counts["size"].values
result = result.groupby(grouping_columns, as_index=False, dropna=False).first()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new logic breaks row alignments and i snot equivalent to the previous logic. This can cause silent bugs. Could you please mention if you think this is correct and that regex transformations will not reorder rows?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RamilCDISC why do you believe it will break row alignment? df_for_grouping is transformed by the apply_regex_to_grouping_columns but this performs only element-wise operations on selected columns. No rows are removed/added/reordered by this. The left merge keep all rows in the original order. I dont know where the row alignment not changing would be introduced

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I actually wanted to tag line 105. The merge operation can change the ordering. The previous code had 'idx' which could be used for reorder to original. This may not be a big issue. Please let me know your thoughts.

Copy link
Collaborator Author

@SFJohnson24 SFJohnson24 Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RamilCDISC
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
"left: use only keys from left frame, similar to a SQL left outer join; preserve key order "

It is a left merge so the key order is preserved.

else effective_grouping
)
if self.params.regex:
if self.params.regex and not filtered:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filtered here would be either none or a dataframe. Using not on a dataframe raises ```
ValueError(
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This would need a different logic for handling here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I corrected this--changed it to an explicit None check as I want to avoid doing regex on the original frame if filtering has taken place and the regex will be applied to it lower down

Copy link
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR optimizes the record count operation. The validation was done by

  1. Reviewing the PR for any unwanted code or comments.
  2. Reviewing the PR code in reference to the AC.
  3. Reviewing the PR new logic in comparison with old logic.
  4. Ensuring the updated code is bug and error free by looking for potential edge cases.
  5. Ensuring all unit and regression testing pass.
  6. Running manual testing using positive and negative datasets.

@RamilCDISC RamilCDISC merged commit f4693ec into main Dec 12, 2025
11 checks passed
@RamilCDISC RamilCDISC deleted the record_opt branch December 12, 2025 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants