record count optimize #1472
Conversation
    )
    result = dataframe[grouping_columns].copy()
    result["size"] = transformed_with_counts["size"].values
    result = result.groupby(grouping_columns, as_index=False, dropna=False).first()
The new logic breaks row alignment and is not equivalent to the previous logic. This can cause silent bugs. Could you please confirm whether you believe this is correct and that regex transformations will not reorder rows?
@RamilCDISC why do you believe it will break row alignment? df_for_grouping is transformed by apply_regex_to_grouping_columns, but that performs only element-wise operations on the selected columns; no rows are removed, added, or reordered by it. The left merge keeps all rows in their original order. I don't see where a change in row alignment would be introduced.
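To illustrate the element-wise point, here is a minimal sketch with hypothetical data and column names; `apply_regex_to_grouping_columns` is assumed to reduce to per-element string replacements like this one:

```python
import pandas as pd

# Hypothetical data; the real transform runs over the selected grouping columns
df = pd.DataFrame({"USUBJID": ["S-001", "S-002", "S-003"], "AVAL": [1, 2, 3]})

transformed = df.copy()
# Element-wise regex replacement: strips the numeric suffix from each value
transformed["USUBJID"] = transformed["USUBJID"].str.replace(r"-\d+$", "", regex=True)

# No rows are added, dropped, or reordered, so row alignment is preserved
assert transformed.index.equals(df.index)
assert len(transformed) == len(df)
```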
Sorry, I actually wanted to tag line 105. The merge operation can change the ordering. The previous code had 'idx', which could be used to restore the original order. This may not be a big issue. Please let me know your thoughts.
@RamilCDISC
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
"left: use only keys from left frame, similar to a SQL left outer join; preserve key order "
It is a left merge so the key order is preserved.
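This is easy to check directly (a small sketch with made-up data):

```python
import pandas as pd

left = pd.DataFrame({"key": ["b", "a", "b", "c"], "val": [1, 2, 3, 4]})
counts = left.groupby("key", as_index=False).size()  # columns: key, size

# how="left" keeps every left-frame row; with unique right keys the
# original row order of the left frame is preserved
merged = left.merge(counts, on="key", how="left")
assert merged["key"].tolist() == ["b", "a", "b", "c"]
assert merged["size"].tolist() == [2, 1, 2, 1]
```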
         else effective_grouping
     )
    -if self.params.regex:
    +if self.params.regex and not filtered:
filtered here would be either None or a DataFrame. Using `not` on a DataFrame raises:
```
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
This would need different handling logic here.
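For reference, a minimal reproduction together with the explicit check that avoids the ambiguity (the variable name is hypothetical):

```python
import pandas as pd

filtered = pd.DataFrame({"a": [1, 2]})  # in the real code: a DataFrame or None

# pandas refuses to coerce a DataFrame to a single boolean
try:
    if not filtered:
        pass
except ValueError as exc:
    print(type(exc).__name__)  # ValueError

# The safe pattern is an explicit identity check against None
if filtered is None:
    print("no pre-filtering happened")
```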
I corrected this: I changed it to an explicit None check, since I want to avoid applying the regex to the original frame when filtering has taken place and the regex will be applied to the filtered frame lower down.
RamilCDISC left a comment
The PR optimizes the record count operation. Validation was done by:
- Reviewing the PR for any unwanted code or comments.
- Reviewing the PR code against the AC.
- Comparing the new logic with the old logic.
- Checking for potential edge cases to ensure the updated code is bug- and error-free.
- Ensuring all unit and regression tests pass.
- Running manual testing with positive and negative datasets.
In reviewing the #1454 changes for sprint review, I noticed an optimization: instead of copying the frame and adding an index column to merge the grouped result back onto the original, we can rely on the positional alignment of `.values`. This adds a memory optimization while maintaining the functionality.
This pull request refines the logic for handling regex-based grouping and counting in the `record_count.py` operation. The main improvements focus on ensuring that regex grouping is only applied when necessary and on simplifying the process of merging grouped counts back to the original dataframe.

Logic improvements for regex grouping:
- `_execute_operation` now only performs regex grouping when a regex is provided and no pre-filtering has occurred, preventing unnecessary operations.

Simplification and efficiency in merging grouped counts:
- `_get_regex_grouped_counts` now streamlines merging grouped counts back to the original dataframe, removing the auxiliary `_idx` column and simplifying the assignment of the `"size"` column.
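A sketch of how the merge-back can work without an auxiliary index column (hypothetical data and column names; the real logic lives in `_get_regex_grouped_counts`): group sizes are computed on the regex-transformed columns and assigned back positionally via `.values`, which is safe because the element-wise transform preserves row order.

```python
import pandas as pd

# Hypothetical stand-ins for the operation's inputs
dataframe = pd.DataFrame({"GROUP": ["x1", "x2", "y1"], "AVAL": [10, 20, 30]})
grouping_columns = ["GROUP"]

# Element-wise regex transform of the grouping columns (order-preserving)
transformed = dataframe[grouping_columns].copy()
transformed["GROUP"] = transformed["GROUP"].str.replace(r"\d+$", "", regex=True)

# Per-row group sizes, aligned by index -- no auxiliary `_idx` column needed
transformed["size"] = transformed.groupby(
    grouping_columns, dropna=False
)["GROUP"].transform("size")

# Assign positionally via .values, then collapse to one row per original group
result = dataframe[grouping_columns].copy()
result["size"] = transformed["size"].values
result = result.groupby(grouping_columns, as_index=False, dropna=False).first()
assert result["size"].tolist() == [2, 2, 1]
```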