Conversation

@SFJohnson24 SFJohnson24 (Collaborator) commented Nov 14, 2025

Datasets_single.json
Rule_underscores.json
CG0019_multi.xlsx
CG0019_split_and_supp.xlsx
Datasets.json

This PR adds merge logic for SUPP datasets where multiple IDVARs are present. It takes the parent dataset, iterates through the unique IDVAR values, grabs the data for each IDVAR, pivots the QNAM/QVAL pairs into columns, cleans up the supp metadata columns, aggregates multiple rows with the same IDVAR, and drops duplicate columns. It then reuses the left-merge logic from single-IDVAR supps, including the QNAM validation loop. Unfortunately, much of the row-by-row logic needed for the merge pivot is something Dask is not particularly good at, so a conversion to pandas is needed during this step.
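In pandas terms, the per-IDVAR pivot-and-merge loop looks roughly like this. This is a sketch, not the PR's actual code: the helper name merge_multi_idvar_supp is made up, and the column names simply follow the standard SDTM SUPP-- layout (USUBJID, IDVAR, IDVARVAL, QNAM, QVAL).

```python
import pandas as pd

def merge_multi_idvar_supp(parent: pd.DataFrame, supp: pd.DataFrame) -> pd.DataFrame:
    # Sketch only -- column names follow the standard SDTM SUPP-- layout;
    # the PR's actual implementation differs in details.
    merged = parent
    for idvar in supp["IDVAR"].dropna().unique():
        subset = supp[supp["IDVAR"] == idvar]
        # Pivot QNAM/QVAL into columns, aggregating duplicate rows per key.
        pivoted = subset.pivot_table(
            index=["USUBJID", "IDVARVAL"],
            columns="QNAM",
            values="QVAL",
            aggfunc="first",
        ).reset_index()
        # IDVARVAL is character in SUPP while the parent's IDVAR column
        # (e.g. AESEQ) is often numeric, so compare both keys as strings.
        pivoted = pivoted.rename(columns={"IDVARVAL": "_idvar_key"})
        pivoted["_idvar_key"] = pivoted["_idvar_key"].astype(str)
        merged = merged.assign(_idvar_key=merged[idvar].astype(str)).merge(
            pivoted, how="left", on=["USUBJID", "_idvar_key"]
        )
        # Drop the helper key and any duplicated columns from the merge.
        merged = merged.drop(columns=["_idvar_key"])
        merged = merged.loc[:, ~merged.columns.duplicated()]
    return merged
```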

This also removes the old is_relationship logic, an artifact from before the SDTM Metadata container was created, which has since been removed. I also updated the tests: they were checking for QNAM/QVAL, which was not being properly dropped, and the pivot was not occurring correctly for some SUPP datasets. This resolves that.

@SFJohnson24 SFJohnson24 marked this pull request as ready for review November 17, 2025 21:20
]
current_supp = right_dataset.drop(columns=columns_to_drop)
if isinstance(left_dataset, DaskDataset):
    left_pandas = PandasDataset(left_dataset.data.compute())
Collaborator

Instead of computing the whole dataset, can we compute a subset with only the required columns? This would help keep memory usage low for large datasets.
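For illustration, the suggestion would look roughly like this (required_columns and the key names are assumptions, not from the PR):

```python
# Select only the columns needed for the merge before calling compute(),
# so Dask materializes just that subset instead of the full dataset.
if isinstance(left_dataset, DaskDataset):
    required_columns = ["USUBJID", idvar]  # assumed merge keys
    left_pandas = PandasDataset(
        left_dataset.data[required_columns].compute()
    )
```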

Collaborator Author

I was able to port the single-IDVAR logic for Dask so that it just converts the supp dataset to pandas and then back to Dask for the merge; the supp should be smaller than the parent and manageable. Unfortunately, the way the multi-IDVAR pivot merges need to take place (multiple sequential merges, one per IDVAR, while dropping extra columns) doesn't play well with Dask's lazy evaluation, so I kept that scenario as a pandas conversion (especially with the potential refactor on the horizon). See the sketch below.
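Roughly, the single-IDVAR path described above works like this (a sketch; pivot_supp and merge_keys are placeholders, not names from the PR):

```python
import dask.dataframe as dd

# Materialize only the (small) SUPP dataset in pandas, pivot it, wrap it
# back into Dask, and let the left merge against the parent stay lazy.
supp_pandas = right_dataset.data.compute()
supp_pivoted = pivot_supp(supp_pandas)          # placeholder for the QNAM/QVAL pivot
supp_dask = dd.from_pandas(supp_pivoted, npartitions=1)
merged = left_dataset.data.merge(supp_dask, how="left", on=merge_keys)  # merge_keys assumed
```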

Collaborator

@RamilCDISC RamilCDISC left a comment

The PR updates the merge logic for SUPP with multiple IDVARs. The PR was validated by:

  1. Reviewing the PR for any unwanted changes, code, or comments.
  2. Reviewing the PR logic in accordance with the AC.
  3. Validating the logic for both pandas and Dask use cases.
  4. Validating that all unit tests and regression tests pass.
  5. Validating that all related testing has been updated.
  6. Running manual validation using the dev rule editor with a positive dataset.
  7. Running manual validation using the dev rule editor with negative datasets.
  8. Covering edge cases of missing, duplicate, single, and multiple values.

@RamilCDISC RamilCDISC merged commit d7e80d5 into main Nov 21, 2025
13 checks passed
@RamilCDISC RamilCDISC deleted the multiidvar branch November 21, 2025 21:56
