
In 1730 date comparison #12

Open
ehanson8 wants to merge 5 commits into main from IN-1730-date-comparison

Conversation


@ehanson8 ehanson8 commented May 4, 2026

Purpose and background context

@cmhosale has approved the changes after some local testing. @ghukill I would like your review to focus on whether the code is as efficient as it could be. The major changes are:

  • Update variable names, comments, and labels at stakeholders' request and for consistency and clarity
  • Add a digitized_bag_ids list of UUIDs to replace a deprecated env var. This list will be managed by stakeholders going forward rather than DataEng
  • Add a create_dataframe_for_date function and subfunctions to simplify the flow of the notebook and ease the path to the date range comparison functionality
  • Add date range comparison functionality for stakeholders' reporting needs
  • Add functions to generate difference data tables for the date range comparison

How can a reviewer manually see the effects of these changes?

  • Set S3_INVENTORY_LOCATIONS in .env
  • Run make edit-notebook to view the updated dashboard

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

@ehanson8 ehanson8 requested a review from ghukill May 4, 2026 14:50
@ehanson8 ehanson8 requested a review from a team as a code owner May 4, 2026 14:50

ghukill commented May 5, 2026

@ehanson8 - Correct to assume create_dataframe_from_date should be create_dataframe_for_date?

Add create_dataframe_from_date function and subfunctions to simplify the flow of the notebook and ease the path to the date range comparison functionality

Didn't want to edit the PR description myself, but I'm not seeing the former while I am seeing the latter.


ehanson8 commented May 5, 2026

@ehanson8 - Correct to assume create_dataframe_from_date should be create_dataframe_for_date?

Add create_dataframe_from_date function and subfunctions to simplify the flow of the notebook and ease the path to the date range comparison functionality

Didn't want to edit the PR description myself, but I'm not seeing the former while I am seeing the latter.

Ugh, my bad, fixed!

ehanson8 added 5 commits May 5, 2026 15:38
* Add list of previously digitized bag UUIDs to replace an env var containing the bagnames. This allows stakeholders to manage any changes to the list while preventing exposure of potentially sensitive data.
* Add create_dataframe_for_date function to simplify the flow of the notebook
* Add get_parquet_uris_from_symlinks, create_parquet_dataframes, process_inventory_data, and transform_cdps_data submethods to be called by create_dataframe_for_date
* Add status spinner during collection of S3 inventory data
* Update comments for accuracy
Why these changes are being introduced:
*  Stakeholders requested the ability to compare the inventory data from a date range for reporting purposes

How this addresses that need:
* Add file_count_change_by_field and storage_change_by_field functions
* Update labels for brevity and consistency
* Add comparison date range selector, inventory date verification, and dataframe creation
* Add stakeholder-requested statistics and tables for changes associated with the selected date range

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1730
* Update change > difference for variables, labels, and comments
* Add _calculate_difference_by_field function and update file_count_difference_by_field and storage_difference_by_field
* Add mo.stop if the selected start date is after the selected end date
* Add start and end dates to summary stats
@ehanson8 ehanson8 force-pushed the IN-1730-date-comparison branch from f8a935c to 312fac8 Compare May 5, 2026 19:39
@coveralls

Coverage Report for CI Build 25398164906

Coverage decreased (-2.2%) to 14.141%

Details

  • Coverage decreased (-2.2%) from the base build.
  • Patch coverage: 173 uncovered changes across 1 file (25 of 198 lines covered, 12.63%).
  • 6 coverage regressions across 1 file.

Uncovered Changes

File Changed Covered %
notebook.py 198 25 12.63%

Coverage Regressions

6 previously-covered lines in 1 file lost coverage.

File Lines Losing Coverage Coverage
notebook.py 6 13.71%

Coverage Stats

Coverage Status
Relevant Lines: 396
Covered Lines: 56
Line Coverage: 14.14%
Coverage Strength: 0.14 hits per line

💛 - Coveralls


ehanson8 commented May 5, 2026

@ghukill Double ugh, I had that typo in the commit message too, so I fixed that as well and pushed the updated commit


ghukill commented May 5, 2026

@ehanson8 - Correct to assume create_dataframe_from_date should be create_dataframe_for_date?

Add create_dataframe_from_date function and subfunctions to simplify the flow of the notebook and ease the path to the date range comparison functionality

Didn't want to edit the PR description myself, but I'm not seeing the former while I am seeing the latter.

Ugh, my bad, fixed!

Thanks! Honestly, it's pretty easy to figure out if needed... but for reviewing specifically it threw me for a minute.


@ghukill ghukill left a comment


Opting for a first-pass, comment-only review.

Overall, good-looking refactoring here! I like the consolidation of logic into reusable functions in some spots, which clearly pays off when you pluck other dates for comparison later.

A couple thoughts...

First, see my comment about establishing start_df and end_df variables to use for all the comparison logic. Seems nitpicky, but in addition to being less verbose, I think it begins to set you up for reusing a dataframe for a given date if one has already been created. This would require some shuffling of things.

It seems possible that for comparison a user might say "Date A and Date B"... do some reading... then think "Now Date A and Date C", where only one date changes. Instead of crunching both, you could reuse Date A.

Now... this might all not be worth it, and I totally understand. Just observing that caching data could be done if desired.

Feel free to fully ignore the caching bit, but I stand by the start_df and end_df part, as I think it simplifies the code and allows the latitude to even consider those kinds of optimizations.

Comment thread notebook.py
Comment on lines +1164 to +1168
# Verify start date is before end date and that data exists for the selected dates
mo.stop(
    start_date_selector.value > end_date_selector.value,
    mo.md("Start date must be before end date, please select a valid date range"),
)

Similar to this block, it feels like you might want to avoid a situation where the start and end dates are the same.

It does run, but then returns 0% changes across the board. Which, while not wrong... is not helpful.

Comment thread notebook.py
Comment on lines +1192 to +1202
comparison_dfs = {}
for date_key, date_value in {"start": start_date, "end": end_date}.items():
    dataframe, comparison_cache = create_dataframe_for_date(
        s3,
        date_value,
        symlink_dict[date_value],
        parquet_file_uri_cache,
    )
    # Update the cache in-place so it persists for future date selections
    parquet_file_uri_cache.update(comparison_cache)
    comparison_dfs[date_key] = dataframe

I spent some time here for a bit.

I see that the parquet_file_uri_cache is used to remember parquet file URIs for a given date, which is great. Got me wondering if you could cache the resulting dataframe for a given date. That would allow you, as you click through dates, to reuse dataframes already crunched.

This would require some reworking in cells below where comparison_dfs["start"] and comparison_dfs["end"] are used quite often. But I think that could be a benefit as well. For all comparisons, you might benefit from a dedicated cell that sets start_df and end_df and uses them throughout.

If you do that reworking, then you could also move this into a dedicated cell before the comparison date picker to avoid clobbering it each time:

# create dict that caches calculated dataframes for a given date
comparison_dfs = {}

It's tricky to explain in a single comment, and might require some (minimal) shuffling of where things happen, but it feels like you get two nice things from it:

  1. Not using the verbose comparison_dfs["start"] and comparison_dfs["end"] often, when you could be using start_df and end_df
  2. As you reuse dates in comparison, if the data is already crunched, reuse it


@ghukill ghukill May 5, 2026


The plot thickens.

As-is, caching the dataframes for each date selected is not feasible, as they can be 800-900 MB in memory:

# note: DataFrame.info() prints its report and returns None, so log the total directly
logger.info(
    f"Memory usage for {date_key}: "
    f"{dataframe.memory_usage(deep=True).sum() / 1024**2:.0f} MB"
)

I had assumed that for comparison the dataframe saved was crunched already, but I see it's the same dataframe used for detailed analysis above (clever unto itself!).

Updated recommendation:

  1. Do make the change to avoid verbose comparison_dfs["start"] and comparison_dfs["end"] usage by establishing a start_df and end_df
  2. Skip caching... unless you want to think more deeply about caching only the computed numbers needed for comparison
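Recommendation 2's "cached numbers" idea could look something like this minimal sketch: cache only the small per-date summary numbers, never the 800-900 MB dataframes. None of these names come from the notebook; `compute` stands in for whatever per-date aggregation it actually performs.

```python
from typing import Callable, Hashable


def make_summary_getter(compute: Callable) -> Callable:
    """Wrap a per-date aggregation so repeated lookups for the same date
    reuse the cached numbers instead of re-crunching the dataframe."""
    cache: dict[Hashable, dict] = {}

    def get_summary(date_key: Hashable, dataframe) -> dict:
        if date_key not in cache:
            # Only the (small) computed summary dict is retained per date
            cache[date_key] = compute(dataframe)
        return cache[date_key]

    return get_summary
```

Per recommendation 1, downstream cells would still unpack the two dataframes once (e.g. start_df = comparison_dfs["start"]) and pass those to the getter.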
