Skip to content

Fix get_dataset#1691

Merged
SFJohnson24 merged 25 commits into
mainfrom
1672-get_dataset-incorrect-calls
Apr 28, 2026
Merged

Fix get_dataset#1691
SFJohnson24 merged 25 commits into
mainfrom
1672-get_dataset-incorrect-calls

Conversation

@gerrycampion
Copy link
Copy Markdown
Collaborator

@gerrycampion gerrycampion commented Apr 10, 2026

This pull request refactors how dataset references are handled throughout the codebase, replacing the use of file paths (e.g., dataset_path and .xpt filenames) with dataset names and metadata objects. Additionally, the pull request cleans up unused parameters and updates test data to match the new conventions.

Reports have also been updated to use the dataset names instead of filenames when it make sense. This triggers a required test suite update: https://github.com/cdisc-org/CORE_Test_Suite/pull/91

@gerrycampion gerrycampion linked an issue Apr 10, 2026 that may be closed by this pull request
@gerrycampion gerrycampion marked this pull request as ready for review April 17, 2026 02:55
@gerrycampion gerrycampion marked this pull request as draft April 17, 2026 04:36
@gerrycampion gerrycampion marked this pull request as ready for review April 17, 2026 17:48
def _execute_operation(self):
# get metadata
dataframe = self.data_service.get_dataset(dataset_name=self.params.dataset_path)
dataframe = self.data_service.get_dataset(dataset_name=self.params.domain)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get dataset now expects a name of the dataset which only true when domain == metadata.name but for split datasets this is not true. This will not work for split dataset then. Please let me know if I am misunderstanding something.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. I removed use of self.params.domain where not necessary because we are already creating self.params.dataframe_metadata in the operation params

dataset_name=dataset_name, **params
)
metadata_to_return: dict = {
"dataset_size": [dataset_metadata.file_size],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a possible regression here. The LocalDataService.get_raw_dataset_metadata() rebuilds metadata per call and can apply size unit. In this PR that override is removed and metadata is initialized upfront.

So now the dataset_metadata.file_size comes from a stored object so calls like get_dataset_metadata(...size_unit=size_unit) in metadata builders will not have proper size unit and will return only the stored value. Please let me know if I am confusing it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point, except that size_unit is a dead variable that wasn't/isn't being used in get_dataset_metadata. Can you create a new issue to fix or remove the use of size_unit?

Copy link
Copy Markdown
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR fixes the get_dataset() function calls from different dataset builder to remove the reported bug in connected ticket. The validation was done by:

  1. Reviewing the PR for any unwanted code or comments.
  2. Reviewing the PR in accordance to AC.
  3. Reviewing the PR updated tests for logic and coverage.
  4. Reviewing the updated builders for consistency.
  5. Reviewing the updated builders to ensure they preserve intended functionality.
  6. Ensuring all relevant builders are updated.
  7. Ensuring all unit and integration testing pass.
  8. Ensuring proper execution of negative and positive dataset using dev editor.
  9. Ensuring execution of all dataset builders to confirm functionality.

Copy link
Copy Markdown
Collaborator

@SFJohnson24 SFJohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks great. The PR fixes get_dataset() calls across dataset builders to resolve the bug in the connected ticket, including refactoring build_split_datasets to swap self.dataset_metadata via get_raw_dataset_metadata() and simplifying the get_dataset_class signature. All relevant builders were reviewed for consistency, correctness, and preserved functionality. Tests were updated as well as cert data run to confirm preservation and consistency

@SFJohnson24 SFJohnson24 merged commit 2159870 into main Apr 28, 2026
5 of 8 checks passed
@SFJohnson24 SFJohnson24 deleted the 1672-get_dataset-incorrect-calls branch April 28, 2026 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get_dataset() incorrect calls

3 participants