Multiidvar #1437
Conversation
    ]
    current_supp = right_dataset.drop(columns=columns_to_drop)
    if isinstance(left_dataset, DaskDataset):
        left_pandas = PandasDataset(left_dataset.data.compute())
Instead of computing the whole dataset, can we compute a subset of the dataset with only the required columns? This would help keep memory use low for large datasets.
I was able to port the single-IDVAR logic for Dask to just converting the supp dataset to pandas and then back to Dask for the merge. The supp should be smaller than the parent and therefore manageable. Unfortunately, the way multi-IDVAR pivot merges need to take place (multiple sequential merges for each IDVAR while dropping extra columns) doesn't play well with Dask's lazy evaluation, so I kept that scenario as a pandas conversion (especially with the potential refactor on the horizon).
RamilCDISC
left a comment
The PR updates the merge logic for SUPP with multiple IDVAR. The PR was validated by:
- Reviewing the PR for any unwanted changes, code or comments.
- Reviewing the PR logic in accordance with the AC.
- Validating the logic for both pandas and Dask use cases.
- Validating all unit tests and regression tests pass.
- Validating all related testing has been updated.
- Running manual validation using dev rule editor with positive dataset.
- Running the manual validations using the dev editor with negative datasets.
- Covering edge cases of missing, duplicate, single and multiple values.
Datasets_single.json
Rule_underscores.json
CG0019_multi.xlsx
CG0019_split_and_supp.xlsx
Datasets.json
This PR adds merge logic for SUPP datasets where multiple IDVARs are present. It takes the parent, iterates through the unique IDVAR values, grabs the data for each IDVAR, pivots QNAM/QVAL into columns, cleans up the supp metadata columns, aggregates multiple rows that share an IDVAR value, and drops duplicate columns. It then reuses the left-merge logic from single-IDVAR supps, including the QNAM validation loop. Unfortunately, much of the row-by-row logic needed for the pivot merge is something Dask is not particularly good at, so a conversion to pandas is needed during this step.
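The steps above can be sketched in pandas (the datasets, IDVAR values, and QNAM names here are invented examples, not data from this PR): iterate over the unique IDVARs, pivot each slice's QNAM/QVAL wide, and sequentially left-merge onto the parent.

```python
import pandas as pd

# hypothetical parent and supp datasets following SDTM conventions
parent = pd.DataFrame({
    "USUBJID": ["01", "01", "02"],
    "DMSEQ":   [1, 2, 1],
    "DMGRPID": ["A", "B", "A"],
})
supp = pd.DataFrame({
    "USUBJID":  ["01", "01", "02"],
    "IDVAR":    ["DMSEQ", "DMSEQ", "DMGRPID"],
    "IDVARVAL": ["1", "2", "A"],
    "QNAM":     ["RACEOTH", "RACEOTH", "VISITDSC"],
    "QVAL":     ["OTHER1", "OTHER2", "UNSCHED"],
})

merged = parent.copy()
for idvar in supp["IDVAR"].unique():
    chunk = supp[supp["IDVAR"] == idvar]
    # pivot QNAM/QVAL into columns, aggregating duplicate keys
    wide = (chunk.pivot_table(index=["USUBJID", "IDVARVAL"],
                              columns="QNAM", values="QVAL",
                              aggfunc="first")
                 .reset_index())
    # align IDVARVAL's dtype with the parent column before merging
    wide["IDVARVAL"] = wide["IDVARVAL"].astype(parent[idvar].dtype)
    # one sequential left merge per IDVAR
    merged = merged.merge(wide.rename(columns={"IDVARVAL": idvar}),
                          on=["USUBJID", idvar], how="left")
```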
This also removes the old is_relationship logic, an artifact from before the SDTM Metadata container was created, which has since been removed. I also updated the tests: they were looking for QNAM/QVAL, which was not being properly dropped, and the pivot was not occurring correctly with some SUPP datasets. This resolves that.