Skip to content

1681: added @cached decorator to get_dataset#1687

Merged
SFJohnson24 merged 24 commits into
mainfrom
1681-cache-dataset-build
Apr 30, 2026
Merged

1681: added @cached decorator to get_dataset#1687
SFJohnson24 merged 24 commits into
mainfrom
1681-cache-dataset-build

Conversation

@alexfurmenkov
Copy link
Copy Markdown
Collaborator

@alexfurmenkov alexfurmenkov commented Apr 9, 2026

added caching to BaseDatasetBuilder.get_dataset function. added new fields to BaseDatasetBuilder to support caching without
added corresponging unit tests and a fixture to clean cache between tests

@alexfurmenkov alexfurmenkov changed the title added cached_dataset decorator to get_dataset added @cached decorator to get_dataset Apr 14, 2026
@alexfurmenkov alexfurmenkov marked this pull request as ready for review April 14, 2026 13:27
@alexfurmenkov alexfurmenkov linked an issue Apr 14, 2026 that may be closed by this pull request
self.dataset_path = original_path

@cached("get_dataset")
def get_dataset(self, **kwargs):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get dataset still accepts kwargs which are then passed to concat_split_dataset function. Either pass kwargs to cached instance or could we please have a test to confirm kwargs cannot affect returned dataset

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify - should kwargs affect the resulting dataset in this case?

Currently, kwargs are passed to concat_split_datasets and used in DatasetBuilder.build function, but they are not included in the cache key (the decorator only considers args). I've added test to check that kwargs are ignored.

I've noticed, that we can't pass any kwargs to this function -- it will cause issues later in JsonSchemaCheckDatasetBuilder -- it uses same kwargs for dataset_implementation.from_records(filtered, **kwargs)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop_duplicates is an artifact and It is no longer used at all-- @alexfurmenkov can you delete drop_duplicates from the concat_split_datasets logic (as well as all the calls to get get_variables_metadata() as an aside)
we can remove kwargs from get_dataset/concat_split_dataset.

Caching does appear that is will work fine as kwargs is not used--we can eliminate the test showing kwargs does not impact get_dataset as kwargs is not used at all

Copy link
Copy Markdown
Collaborator

@SFJohnson24 SFJohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment; otherwise looks good

@alexfurmenkov alexfurmenkov changed the title added @cached decorator to get_dataset 1681: added @cached decorator to get_dataset Apr 22, 2026
# Conflicts:
#	cdisc_rules_engine/services/data_services/base_data_service.py
Copy link
Copy Markdown
Collaborator

@SFJohnson24 SFJohnson24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR correctly resolves the caching issue with get_dataset()'
it also removes artifacts from the test/validate separate command which has been resolved recently through other PRs.

@SFJohnson24 SFJohnson24 merged commit 21ac8e6 into main Apr 30, 2026
12 checks passed
@SFJohnson24 SFJohnson24 deleted the 1681-cache-dataset-build branch April 30, 2026 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get_dataset() is not cached.

3 participants