Conversation
@ehanson8 - Correct to assume
Didn't want to edit the PR description myself, but I'm not seeing the former while I am seeing the latter.

Ugh, my bad, fixed!
* Add list of previously digitized bag UUIDs to replace an env var containing the bag names. This allows stakeholders to manage any changes to the list while preventing exposure of potentially sensitive data.
* Add create_dataframe_for_date function to simplify the flow of the notebook
* Add get_parquet_uris_from_symlinks, create_parquet_dataframes, process_inventory_data, and transform_cdps_data submethods to be called by create_dataframe_for_date
* Add status spinner during collection of S3 inventory data
* Update comments for accuracy
Why these changes are being introduced:
* Stakeholders requested the ability to compare the inventory data from a date range for reporting purposes

How this addresses that need:
* Add file_count_change_by_field and storage_change_by_field functions
* Update labels for brevity and consistency
* Add comparison date range selector, inventory date verification, and dataframe creation
* Add stakeholder-requested statistics and tables for changes associated with the selected date range

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1730
* Update change > difference for variables, labels, and comments
* Add _calculate_difference_by_field function and update file_count_difference_by_field and storage_difference_by_field
* Add mo.stop if the selected start date is after the selected end date
* Add start and end dates to summary stats
f8a935c to 312fac8
Coverage Report for CI Build 25398164906
Coverage decreased (-2.2%) to 14.141%
Uncovered Changes
Coverage Regressions: 6 previously-covered lines in 1 file lost coverage.
Coverage Stats
💛 - Coveralls
@ghukill Double ugh, I had that typo in the commit message too, so I fixed that as well and pushed the updated commit

Thanks! Honestly, it's pretty easy to figure out if needed... but for reviewing specifically it threw me for a minute.
ghukill left a comment
Opting for a first-pass, commenting review.
Overall, good-looking refactorings here! I like the consolidation of logic into reusable functions in some spots, which clearly pays off when you pluck other dates for comparison later.
A couple thoughts...
First, see the comment about trying to establish a start_df and end_df variable to use for all the comparison logic. Seems nit-picky, but in addition to being less verbose, I think it begins to set you up for reusing a dataframe for a given date if it's already been created. This would require some shuffling of things.
It seems possible that for comparison a user might say "Date A and Date B"... do some reading... then think "Now Date A and Date C", where only one date changes. Instead of crunching both, you could reuse Date A.
Now... this might all not be worth it, and I totally understand. Just observing that caching data could be done if desired.
Feel free to fully ignore the caching bit, but I stand by the start_df and end_df part, as I think it simplifies the code and allows for the latitude of even thinking about those kinds of optimizations.
```python
# Verify start date is before end date and that data exists for the selected dates
mo.stop(
    start_date_selector.value > end_date_selector.value,
    mo.md("Start date must be before end date, please select a valid date range"),
)
```
Similar to this block, it feels like you might want to avoid a situation where the start and end dates are the same.
It does run, but then returns 0% change across the board, which, while not wrong... is not helpful.
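One way to tighten that guard is to reject equal dates alongside inverted ranges. A minimal sketch of the validation logic, kept separate from marimo so it can stand alone — the function name `validate_date_range` and its messages are illustrative, not from the PR:

```python
from datetime import date
from typing import Optional


def validate_date_range(start: date, end: date) -> Optional[str]:
    """Return an error message if the range is invalid, else None.

    Equal dates are rejected too, since comparing a date to itself
    yields 0% change everywhere, which is valid but unhelpful.
    """
    if start > end:
        return "Start date must be before end date, please select a valid date range"
    if start == end:
        return "Start and end dates must differ to produce a meaningful comparison"
    return None
```

In the notebook, the returned message could presumably feed the existing `mo.stop(...)` call (e.g. `mo.stop(msg is not None, mo.md(msg or ""))`), though exactly how to wire it is up to the author.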
```python
comparison_dfs = {}
for date_key, date_value in {"start": start_date, "end": end_date}.items():
    dataframe, comparison_cache = create_dataframe_for_date(
        s3,
        date_value,
        symlink_dict[date_value],
        parquet_file_uri_cache,
    )
    # Update the cache in-place so it persists for future date selections
    parquet_file_uri_cache.update(comparison_cache)
    comparison_dfs[date_key] = dataframe
```
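The caching idea being floated here could be sketched as a simple memoizing wrapper keyed by date (note the thread later backs away from caching full dataframes for memory reasons). Everything below is illustrative: `get_dataframe_for_date`, `dataframe_cache`, and `build_fn` are hypothetical names standing in for a call to the PR's `create_dataframe_for_date`:

```python
# Hypothetical per-date memoization: build the dataframe for a date at most
# once, then reuse it when the same date is re-selected in a comparison.
dataframe_cache: dict = {}


def get_dataframe_for_date(date_value, build_fn, cache=dataframe_cache):
    """Return the dataframe for date_value, building it once via build_fn."""
    if date_value not in cache:
        cache[date_value] = build_fn(date_value)
    return cache[date_value]
```

With this shape, switching from "Date A and Date B" to "Date A and Date C" would only crunch Date C.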
I spent some time here for a bit.
I see that the parquet_file_uri_cache is used to remember parquet file URIs for a given date, which is great. Got me wondering if you could cache the resulting dataframe for a given date. That would allow, as you click through dates, to reuse dataframes already crunched.
This would require some reworking in the cells below where `comparison_dfs["start"]` and `comparison_dfs["end"]` are used quite often. But I think that could be a benefit as well. For all comparisons, you might benefit from a dedicated cell that sets `start_df` and `end_df` and uses them throughout.
If you do that reworking, then you could also move this into a dedicated cell before the comparison date picker to avoid clobbering it each time:
```python
# create dict that caches calculated dataframes for a given date
comparison_dfs = {}
```

It's tricky to explain in a single comment, and might require some (minimal) shuffling of where things happen, but it feels like you get two nice things from it:
- Not using the verbose `comparison_dfs["start"]` and `comparison_dfs["end"]` so often, when you could be using `start_df` and `end_df`
- As you reuse dates in comparison, if the data is already crunched, reuse it
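The `start_df` / `end_df` suggestion amounts to a tiny dedicated cell. A minimal sketch, with a stand-in dict since the real `comparison_dfs` holds large pandas dataframes:

```python
# Illustrative stand-in for the notebook's comparison_dfs dict
comparison_dfs = {"start": "start-dataframe", "end": "end-dataframe"}

# Dedicated cell (sketch): bind the two dataframes once so every
# downstream cell can use start_df / end_df instead of dict lookups
start_df = comparison_dfs["start"]
end_df = comparison_dfs["end"]
```

In marimo, putting these bindings in their own cell also means downstream cells re-run automatically whenever the dict is rebuilt.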
The plot thickens.
As-is, caching the dataframes for each date selected is not feasible, as they can be 800-900 MB in memory:

```python
logger.info(f"Memory usage for {date_key}")
logger.info(dataframe.info(memory_usage="deep"))
```

I had assumed that for comparison the dataframe saved was crunched already, but I see it's the same dataframe used for the detailed analysis above (clever unto itself!).
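For reference, the deep measurement matters because `deep=True` walks object (string) columns and counts the underlying Python string objects, which is where most of the 800-900 MB presumably lives. A small self-contained sketch with a toy dataframe (the column names are made up):

```python
import pandas as pd

# Toy inventory-like frame: one string column, one numeric column
df = pd.DataFrame({"bag_uuid": ["a" * 36] * 1000, "size": range(1000)})

# deep=True includes the bytes held by the Python string objects themselves,
# not just the 8-byte pointers in the object column's backing array
total_bytes = int(df.memory_usage(deep=True).sum())
```

Note also that `DataFrame.info()` prints to stdout and returns `None`, so the `logger.info(dataframe.info(...))` line above logs `None`; `df.memory_usage(deep=True).sum()` gives a loggable number instead.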
Updated recommendation:
- Do make the change to avoid verbose `comparison_dfs["start"]` and `comparison_dfs["end"]` usage by establishing a `start_df` and `end_df`
- Skip caching... unless you want to think more deeply about caching only the summary numbers needed for comparison
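The "cache only the numbers" alternative could look something like the sketch below: compute small per-date aggregates once and cache those tiny dicts instead of the 800-900 MB dataframes. All names here (`get_summary_for_date`, `summary_cache`, the `records` shape) are hypothetical illustrations, not code from the PR:

```python
# Hypothetical cache of small per-date aggregates instead of full dataframes
summary_cache: dict = {}


def get_summary_for_date(date_value, records, cache=summary_cache):
    """Cache only the comparison stats (counts, total bytes) for a date.

    `records` stands in for the inventory rows that would otherwise be
    loaded into a large dataframe.
    """
    if date_value not in cache:
        cache[date_value] = {
            "file_count": len(records),
            "total_bytes": sum(r["size"] for r in records),
        }
    return cache[date_value]
```

The trade-off: the comparison cells would consume these summaries rather than the dataframes, so this only works if every comparison statistic can be derived from the cached numbers.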
Purpose and background context
@cmhosale has approved the changes after some local testing. @ghukill I would like your review to focus on whether the code is as efficient as it could be. The major changes are:
- `digitized_bag_ids` list of UUIDs to replace a deprecated env var. This list will be managed by stakeholders going forward rather than DataEng
- `create_dataframe_for_date` function and subfunctions to simplify the flow of the notebook and ease the path to the date range comparison functionality

How can a reviewer manually see the effects of these changes?
- `S3_INVENTORY_LOCATIONS` in `.env`
- `make edit-notebook` to view the updated dashboard

Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?