Fix get_dataset#1691
Conversation
| def _execute_operation(self): | ||
| # get metadata | ||
| dataframe = self.data_service.get_dataset(dataset_name=self.params.dataset_path) | ||
| dataframe = self.data_service.get_dataset(dataset_name=self.params.domain) |
There was a problem hiding this comment.
The get dataset now expects a name of the dataset which only true when domain == metadata.name but for split datasets this is not true. This will not work for split dataset then. Please let me know if I am misunderstanding something.
There was a problem hiding this comment.
Great catch. I removed use of self.params.domain where not necessary because we are already creating self.params.dataframe_metadata in the operation params
| dataset_name=dataset_name, **params | ||
| ) | ||
| metadata_to_return: dict = { | ||
| "dataset_size": [dataset_metadata.file_size], |
There was a problem hiding this comment.
I see a possible regression here. The LocalDataService.get_raw_dataset_metadata() rebuilds metadata per call and can apply size unit. In this PR that override is removed and metadata is initialized upfront.
So now the dataset_metadata.file_size comes from a stored object so calls like get_dataset_metadata(...size_unit=size_unit) in metadata builders will not have proper size unit and will return only the stored value. Please let me know if I am confusing it.
There was a problem hiding this comment.
It's a good point, except that size_unit is a dead variable that wasn't/isn't being used in get_dataset_metadata. Can you create a new issue to fix or remove the use of size_unit?
RamilCDISC
left a comment
There was a problem hiding this comment.
The PR fixes the get_dataset() function calls from different dataset builder to remove the reported bug in connected ticket. The validation was done by:
- Reviewing the PR for any unwanted code or comments.
- Reviewing the PR in accordance to AC.
- Reviewing the PR updated tests for logic and coverage.
- Reviewing the updated builders for consistency.
- Reviewing the updated builders to ensure they preserve intended functionality.
- Ensuring all relevant builders are updated.
- Ensuring all unit and integration testing pass.
- Ensuring proper execution of negative and positive dataset using dev editor.
- Ensuring execution of all dataset builders to confirm functionality.
SFJohnson24
left a comment
There was a problem hiding this comment.
this looks great. The PR fixes get_dataset() calls across dataset builders to resolve the bug in the connected ticket, including refactoring build_split_datasets to swap self.dataset_metadata via get_raw_dataset_metadata() and simplifying the get_dataset_class signature. All relevant builders were reviewed for consistency, correctness, and preserved functionality. Tests were updated as well as cert data run to confirm preservation and consistency
This pull request refactors how dataset references are handled throughout the codebase, replacing the use of file paths (e.g.,
dataset_pathand.xptfilenames) with dataset names and metadata objects. Additionally, the pull request cleans up unused parameters and updates test data to match the new conventions.Reports have also been updated to use the dataset names instead of filenames when it make sense. This triggers a required test suite update: https://github.com/cdisc-org/CORE_Test_Suite/pull/91