Skip to content

Feat/skip archive dive#41

Merged
thompsonmj merged 7 commits into
mainfrom
feat/skip-archive-dive
May 29, 2026
Merged

Feat/skip archive dive#41
thompsonmj merged 7 commits into
mainfrom
feat/skip-archive-dive

Conversation

@thompsonmj
Copy link
Copy Markdown
Collaborator

@thompsonmj thompsonmj commented May 18, 2026

New CLI option to skip descent into archive contents and instead only checksum the archive file itself.

Addresses #36

Specifically:

  • --archive-dive (default, does descend—matches previous behavior before flag addition) and --no-archive-dive (does not descend)
  • implemented as a boolean passed through get_checksums() and mapper.gather_file_paths(). When the dive flag is false, archive files are returned as regular files by the mapper for simple opaque hashing. When true, previous behavior is preserved. (Note that no accounting for nested archive files is done here—still a concern for Add TAR support and decide how nested archives are dealt with. #34)
  • Tests for:
    • Mapper when flag is false
    • main() behavior when flag is false
    • integration output when flag is false
    • Default behavior for main()
  • Update relevant existing tests for new archive_dive addition

gpt-5.4 was used to implement tests based on my specifications and iterative feedback as well as provide review of implementation.

@thompsonmj thompsonmj requested a review from egrace479 May 18, 2026 21:19
@thompsonmj
Copy link
Copy Markdown
Collaborator Author

It also occurs to me that we are starting to accumulate default literals for policy configurations (e.g. algorithm='md5', include_hidden=False, archive_dive=True).

I think I will open a follow-up Issue to centralize these policies into a single config.py from which these and any new default policies can be imported.

(These are distinct though from e.g. ignore_file=None sentinel values.)

@egrace479
Copy link
Copy Markdown
Member

Testing locally, I had a zip file ignored in the root level to which sum-buddy was pointed, but the zip inside a folder was hashed as a file (with the --no-archive-dive flag).

With archive dive, I did also get these Mac files: Documents/BioCLIP-project/bioclip-ecosystem/pages/Archive.zip/__MACOSX/._community.html,._community.html,8aef6d7c5513c74965a82652157b66ab.

@thompsonmj thompsonmj force-pushed the feat/skip-archive-dive branch from 7b99dea to 9bc6ec5 Compare May 29, 2026 19:15
@egrace479
Copy link
Copy Markdown
Member

egrace479 commented May 29, 2026

Testing locally, I had a zip file ignored in the root level to which sum-buddy was pointed, but the zip inside a folder was hashed as a file (with the --no-archive-dive flag).

With archive dive, I did also get these Mac files: Documents/BioCLIP-project/bioclip-ecosystem/pages/Archive.zip/__MACOSX/._community.html,._community.html,8aef6d7c5513c74965a82652157b66ab.

It did not miss the file, but the order was different (it moved the pages.zip hash away from the other pages/ file hashes). I think we might want to consider a means of keeping the sorting consistent with each approach; the current setting reads the files first, then folders and isn't alphabetical at the same directory level. It'd be nice if it could be alphabetical (it's currently not).

@thompsonmj
Copy link
Copy Markdown
Collaborator Author

Great, I think that makes sense. We can follow up on that using the mapper's response merged by path components to keep containers (directories and/or archives) and contents together.

@thompsonmj
Copy link
Copy Markdown
Collaborator Author

On this:

With archive dive, I did also get these Mac files: Documents/BioCLIP-project/bioclip-ecosystem/pages/Archive.zip/__MACOSX/._community.html,._community.html,8aef6d7c5513c74965a82652157b66ab.

That behavior is intended, for a couple reasons:

  1. It's simpler 😊
  2. It's more honest about what the contents of an archive are (which I think is most important) because it is more consistent with what I think someone should most likely be using an archive file for or would want to see of an archive file. If a zip has a .git/ or a .DS_Store or __MACOSX/ 'junk', I would want to know that so I could exclude it from the zip before putting it somewhere or sharing it. If someone gave me a zip with those, I would also want to know that there's extra stuff that probably wasn't meant to be there.

I think fundamentally, archives should be handled differently than a working directory because they more represent fixed artifacts whose contents have already been vetted in some way. So I believe it would not be wise to observe .sbignore rules when diving into archive files or to ignore contents by default in the same way as the typical defaults do.

@egrace479
Copy link
Copy Markdown
Member

On this:

With archive dive, I did also get these Mac files: Documents/BioCLIP-project/bioclip-ecosystem/pages/Archive.zip/__MACOSX/._community.html,._community.html,8aef6d7c5513c74965a82652157b66ab.

That behavior is intended, for a couple reasons:

  1. It's simpler 😊
  2. It's more honest about what the contents of an archive are (which I think is most important) because it is more consistent with what I think someone should most likely be using an archive file for or would want to see of an archive file. If a zip has a .git/ or a .DS_Store or __MACOSX/ 'junk', I would want to know that so I could exclude it from the zip before putting it somewhere or sharing it. If someone gave me a zip with those, I would also want to know that there's extra stuff that probably wasn't meant to be there.

I think fundamentally, archives should be handled differently than a working directory because they more represent fixed artifacts whose contents have already been vetted in some way. So I believe it would not be wise to observe .sbignore rules when diving into archive files or to ignore contents by default in the same way as the typical defaults do.

That makes perfect sense. It does feel like a design decision worth documenting :)

@thompsonmj
Copy link
Copy Markdown
Collaborator Author

@egrace479 added rationale to README on hashing all archive contents.

Look good to merge?

Copy link
Copy Markdown
Member

@egrace479 egrace479 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! 🚀

@thompsonmj thompsonmj merged commit 15671df into main May 29, 2026
9 checks passed
@thompsonmj thompsonmj deleted the feat/skip-archive-dive branch May 29, 2026 20:39
@thompsonmj thompsonmj mentioned this pull request May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants