691: Tests for CoW behavior in pandas#1690
Conversation
| # Adding DaskDataset will cause downstream issues since Dask does not support copy-on-write. | ||
| if type(cached) is PandasDataset: | ||
| cached.data = cached.data.copy(deep=False) | ||
| return cached |
There was a problem hiding this comment.
Here the same cached wrapper is returned. Pandas CoW protects separate pandas objects sharing same underlying data. Here the change to cached.data mutates the wrapper but in end returns the same object.
There was a problem hiding this comment.
This test only checks the local CacheService created in this test not the production InMemoryCacheService or PandasDataset. The changes will not be tested properly.
| return PandasDataset(cached.data.copy(deep=False)) | ||
| return cached | ||
|
|
||
| def get_all(self, cache_keys: List[str]): |
There was a problem hiding this comment.
The other get functions like get-all and get_all_by_prefix etc still return directly which can mutate the cached dataset. We need to cover these too as for get() function.
| from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata | ||
| from cdisc_rules_engine.enums.sensitivity import Sensitivity | ||
|
|
||
| pd.options.mode.copy_on_write = True |
There was a problem hiding this comment.
This will turn pandas CoW process wide so any one integrating rules engine in their workflow will have this enabled. We should document it by adding to the repository readme.
There was a problem hiding this comment.
Could you please add a little more testing that includes operations which will affect the length of the dataset for example add drop rows. The engine will perform these operations on the dataset. Confirming proper caching will be helpful.
|
|
||
| def filter_cache(self, prefix: str) -> dict: | ||
| return {k: self.cache[k] for k in self.cache.keys() if k.startswith(prefix)} | ||
| return {k: self.cache.get(k) for k in self.cache.keys() if k.startswith(prefix)} |
There was a problem hiding this comment.
This will still return raw cached object for PandasDataset. Please update this too and I think we will be good to merge.
RamilCDISC
left a comment
There was a problem hiding this comment.
The PR removes the expensive deepcopy() requirements on PandasDataset by utilizing pandas CoW. The validation was done by:
- Reviewing the PR for any unwanted code or comments.
- Reviewing the PR in accordance with AC.
- Reviewing the updated test to confirm coverage and cases.
- Ensuring all unit, regression and integration testing pass.
- Ensured the pandas CoW is turned on before cache service use in normal Validation path.
- Validated that all access to cached dataset will prevent mutation.
- Reviewed the added test and requested changes so they cover the operation on dataset that may change the size length of the dataset.
- Ensured the CoW behavior is documented because it turns it on process wide.
- Validated the Dask path is unaffected.
- Validated against any chance of regression.
No description provided.