Skip to content

691: Tests for CoW behavior in pandas#1690

Open
alexfurmenkov wants to merge 12 commits intomainfrom
691-remove-deepcopy
Open

691: Tests for CoW behavior in pandas#1690
alexfurmenkov wants to merge 12 commits intomainfrom
691-remove-deepcopy

Conversation

@alexfurmenkov
Copy link
Copy Markdown
Collaborator

No description provided.

@alexfurmenkov alexfurmenkov marked this pull request as ready for review April 15, 2026 14:44
@alexfurmenkov alexfurmenkov linked an issue Apr 22, 2026 that may be closed by this pull request
@alexfurmenkov alexfurmenkov changed the title tests for CoW behavior in pandas 691: Tests for CoW behavior in pandas Apr 22, 2026
# Adding DaskDataset will cause downstream issues since Dask does not support copy-on-write.
if type(cached) is PandasDataset:
cached.data = cached.data.copy(deep=False)
return cached
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the same cached wrapper is returned. Pandas CoW protects separate pandas objects sharing same underlying data. Here the change to cached.data mutates the wrapper but in end returns the same object.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test only checks the local CacheService created in this test not the production InMemoryCacheService or PandasDataset. The changes will not be tested properly.

return PandasDataset(cached.data.copy(deep=False))
return cached

def get_all(self, cache_keys: List[str]):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other get functions like get-all and get_all_by_prefix etc still return directly which can mutate the cached dataset. We need to cover these too as for get() function.

from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata
from cdisc_rules_engine.enums.sensitivity import Sensitivity

pd.options.mode.copy_on_write = True
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will turn pandas CoW process wide so any one integrating rules engine in their workflow will have this enabled. We should document it by adding to the repository readme.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a little more testing that includes operations which will affect the length of the dataset for example add drop rows. The engine will perform these operations on the dataset. Confirming proper caching will be helpful.


def filter_cache(self, prefix: str) -> dict:
return {k: self.cache[k] for k in self.cache.keys() if k.startswith(prefix)}
return {k: self.cache.get(k) for k in self.cache.keys() if k.startswith(prefix)}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will still return raw cached object for PandasDataset. Please update this too and I think we will be good to merge.

Copy link
Copy Markdown
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR removes the expensive deepcopy() requirements on PandasDataset by utilizing pandas CoW. The validation was done by:

  1. Reviewing the PR for any unwanted code or comments.
  2. Reviewing the PR in accordance with AC.
  3. Reviewing the updated test to confirm coverage and cases.
  4. Ensuring all unit, regression and integration testing pass.
  5. Ensured the pandas CoW is turned on before cache service use in normal Validation path.
  6. Validated that all access to cached dataset will prevent mutation.
  7. Reviewed the added test and requested changes so they cover the operation on dataset that may change the size length of the dataset.
  8. Ensured the CoW behavior is documented because it turns it on process wide.
  9. Validated the Dask path is unaffected.
  10. Validated against any chance of regression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rules_engine.execute_rules does a deepcopy of the dataset

2 participants