Conversation
@ehanson8 - Correct to assume
Didn't want to edit the PR description myself, but I'm not seeing the former while I am seeing the latter.

Ugh, my bad, fixed!
* Add list of previously digitized bag UUIDs to replace an env var containing the bag names. This allows stakeholders to manage any changes to the list while preventing exposure of potentially sensitive data.
* Add create_dataframe_for_date function to simplify the flow of the notebook
* Add get_parquet_uris_from_symlinks, create_parquet_dataframes, process_inventory_data, and transform_cdps_data submethods to be called by create_dataframe_for_date
* Add status spinner during collection of S3 inventory data
* Update comments for accuracy
Why these changes are being introduced:
* Stakeholders requested the ability to compare the inventory data from a date range for reporting purposes

How this addresses that need:
* Add file_count_change_by_field and storage_change_by_field functions
* Update labels for brevity and consistency
* Add comparison date range selector, inventory date verification, and dataframe creation
* Add stakeholder-requested statistics and tables for changes associated with the selected date range

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1730
* Update change > difference for variables, labels, and comments
* Add _calculate_difference_by_field function and update file_count_difference_by_field and storage_difference_by_field
* Add mo.stop if the selected start date is after the selected end date
* Add start and end dates to summary stats
f8a935c to 312fac8
Coverage Report for CI Build 25398164906
Coverage decreased (-2.2%) to 14.141%
Uncovered Changes
Coverage Regressions: 6 previously-covered lines in 1 file lost coverage.
Coverage Stats
💛 - Coveralls
@ghukill Double ugh, I had that typo in the commit message too, so I fixed that as well and pushed the updated commit

Thanks! Honestly, it's pretty easy to figure out if needed... but for reviewing specifically it threw me for a minute.
ghukill left a comment
Opting for a first-pass, commenting review.
Overall, good-looking refactorings here! I like the consolidation of logic into reusable functions in some spots, which clearly pays off when you pluck other dates for comparison later.
A couple thoughts...
First, see the comment about trying to establish a start_df and end_df variable to use for all the comparison logic. Seems nit-picky, but in addition to being less verbose, I think it begins to set you up for reusing a dataframe for a given date if it's already been created. This would require some shuffling of things.
It seems possible that for comparison a user might say "Date A and Date B"... do some reading... then think "Now Date A and Date C", where only one date changes. Instead of crunching both, you could reuse Date A.
Now... this might all not be worth it, and I totally understand. Just observing that caching data could be done if desired.
Feel free to fully ignore the caching bit, but I stand by the start_df and end_df part, as I think it simplifies the code and allows for the latitude of even thinking about those kinds of optimizations.
```python
# Verify start date is before end date and that data exists for the selected dates
mo.stop(
    start_date_selector.value > end_date_selector.value,
    mo.md("Start date must be before end date, please select a valid date range"),
)
```
Similar to this block, it feels like you might want to avoid a situation where the start and end dates are the same.
It does run, but then returns 0% change across the board, which, while not wrong... is not helpful.
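One way to tighten that guard is to reject equal dates alongside inverted ranges. A minimal sketch of the validation logic, kept separate from marimo so it can stand alone — the function name `validate_date_range` and its messages are illustrative, not from the PR:

```python
from datetime import date
from typing import Optional


def validate_date_range(start: date, end: date) -> Optional[str]:
    """Return an error message if the range is invalid, else None.

    Equal dates are rejected too, since comparing a date to itself
    yields 0% change everywhere, which is valid but unhelpful.
    """
    if start > end:
        return "Start date must be before end date, please select a valid date range"
    if start == end:
        return "Start and end dates must differ to produce a meaningful comparison"
    return None
```

In the notebook, the returned message could presumably feed the existing `mo.stop(...)` call (e.g. `mo.stop(msg is not None, mo.md(msg or ""))`), though exactly how to wire it is up to the author.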
```python
comparison_dfs = {}
for date_key, date_value in {"start": start_date, "end": end_date}.items():
    dataframe, comparison_cache = create_dataframe_for_date(
        s3,
        date_value,
        symlink_dict[date_value],
        parquet_file_uri_cache,
    )
    # Update the cache in-place so it persists for future date selections
    parquet_file_uri_cache.update(comparison_cache)
    comparison_dfs[date_key] = dataframe
```
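The caching idea being floated here could be sketched as a simple memoizing wrapper keyed by date (note the thread later backs away from caching full dataframes for memory reasons). Everything below is illustrative: `get_dataframe_for_date`, `dataframe_cache`, and `build_fn` are hypothetical names standing in for a call to the PR's `create_dataframe_for_date`:

```python
# Hypothetical per-date memoization: build the dataframe for a date at most
# once, then reuse it when the same date is re-selected in a comparison.
dataframe_cache: dict = {}


def get_dataframe_for_date(date_value, build_fn, cache=dataframe_cache):
    """Return the dataframe for date_value, building it once via build_fn."""
    if date_value not in cache:
        cache[date_value] = build_fn(date_value)
    return cache[date_value]
```

With this shape, switching from "Date A and Date B" to "Date A and Date C" would only crunch Date C.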
I spent some time here for a bit.
I see that the parquet_file_uri_cache is used to remember parquet file URIs for a given date, which is great. Got me wondering if you could cache the resulting dataframe for a given date. That would allow, as you click through dates, to reuse dataframes already crunched.
This would require some reworking in the cells below where `comparison_dfs["start"]` and `comparison_dfs["end"]` are used quite often. But I think that could be a benefit as well. For all comparisons, you might benefit from a dedicated cell that sets `start_df` and `end_df` and uses them throughout.
If you do that reworking, then you could also move this into a dedicated cell before the comparison date picker to avoid clobbering it each time:
```python
# create dict that caches calculated dataframes for a given date
comparison_dfs = {}
```

It's tricky to explain in a single comment, and might require some (minimal) shuffling of where things happen, but it feels like you get two nice things from it:
- Not using the verbose `comparison_dfs["start"]` and `comparison_dfs["end"]` so often, when you could be using `start_df` and `end_df`
- As you reuse dates in comparison, if the data is already crunched, reuse it
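The `start_df` / `end_df` suggestion amounts to a tiny dedicated cell. A minimal sketch, with a stand-in dict since the real `comparison_dfs` holds large pandas dataframes:

```python
# Illustrative stand-in for the notebook's comparison_dfs dict
comparison_dfs = {"start": "start-dataframe", "end": "end-dataframe"}

# Dedicated cell (sketch): bind the two dataframes once so every
# downstream cell can use start_df / end_df instead of dict lookups
start_df = comparison_dfs["start"]
end_df = comparison_dfs["end"]
```

In marimo, putting these bindings in their own cell also means downstream cells re-run automatically whenever the dict is rebuilt.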
The plot thickens.
As-is, caching the dataframes for each date selected is not feasible, as they can be 800-900 MB in memory:

```python
logger.info(f"Memory usage for {date_key}")
logger.info(dataframe.info(memory_usage="deep"))
```

I had assumed that for comparison the dataframe saved was crunched already, but I see it's the same dataframe used for the detailed analysis above (clever unto itself!).
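For reference, the deep measurement matters because `deep=True` walks object (string) columns and counts the underlying Python string objects, which is where most of the 800-900 MB presumably lives. A small self-contained sketch with a toy dataframe (the column names are made up):

```python
import pandas as pd

# Toy inventory-like frame: one string column, one numeric column
df = pd.DataFrame({"bag_uuid": ["a" * 36] * 1000, "size": range(1000)})

# deep=True includes the bytes held by the Python string objects themselves,
# not just the 8-byte pointers in the object column's backing array
total_bytes = int(df.memory_usage(deep=True).sum())
```

Note also that `DataFrame.info()` prints to stdout and returns `None`, so the `logger.info(dataframe.info(...))` line above logs `None`; `df.memory_usage(deep=True).sum()` gives a loggable number instead.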
Updated recommendation:
- Do make the change to avoid verbose `comparison_dfs["start"]` and `comparison_dfs["end"]` usage by establishing a `start_df` and `end_df`
- Skip caching... unless you want to think more deeply about caching only the summary numbers needed for comparison
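The "cache only the numbers" alternative could look something like the sketch below: compute small per-date aggregates once and cache those tiny dicts instead of the 800-900 MB dataframes. All names here (`get_summary_for_date`, `summary_cache`, the `records` shape) are hypothetical illustrations, not code from the PR:

```python
# Hypothetical cache of small per-date aggregates instead of full dataframes
summary_cache: dict = {}


def get_summary_for_date(date_value, records, cache=summary_cache):
    """Cache only the comparison stats (counts, total bytes) for a date.

    `records` stands in for the inventory rows that would otherwise be
    loaded into a large dataframe.
    """
    if date_value not in cache:
        cache[date_value] = {
            "file_count": len(records),
            "total_bytes": sum(r["size"] for r in records),
        }
    return cache[date_value]
```

The trade-off: the comparison cells would consume these summaries rather than the dataframes, so this only works if every comparison statistic can be derived from the cached numbers.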
Purpose and background context
@cmhosale has approved the changes after some local testing. @ghukill I would like your review to focus on whether the code is as efficient as it could be. The major changes are:
- `digitized_bag_ids` list of UUIDs to replace a deprecated env var. This list will be managed by stakeholders going forward rather than DataEng
- `create_dataframe_for_date` function and subfunctions to simplify the flow of the notebook and ease the path to the date range comparison functionality

How can a reviewer manually see the effects of these changes?
- `S3_INVENTORY_LOCATIONS` in `.env`
- `make edit-notebook` to view the updated dashboard

Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?