3 changes: 3 additions & 0 deletions .gitattributes
@@ -0,0 +1,3 @@
# Test fixtures: output of sum-buddy's CSV writer, which emits CRLF line endings.
# Treat as binary so git never normalizes line endings across platforms.
examples/expected_outputs/*.csv -text
28 changes: 22 additions & 6 deletions README.md
@@ -17,7 +17,7 @@ pip install sum-buddy
### Command Line Usage

```
usage: sum-buddy [-h] [-V] [-o OUTPUT_FILE] [-i IGNORE_FILE | -H] [-a ALGORITHM] input_path
usage: sum-buddy [-h] [-V] [-o OUTPUT_FILE] [-i IGNORE_FILE | -H] [-a ALGORITHM] [-l LENGTH] [--archive-dive | --no-archive-dive] input_path

Generate CSV with filepath, filename, and checksums for all files in a given directory (or a single file)

@@ -36,6 +36,8 @@ options:
Hash algorithm to use (default: md5; available: ripemd160, sha3_224, sha512_224, blake2b, sha384, sha256, sm3, sha3_256, shake_256, sha512, sha1, sha224, md5, md5-sha1, sha3_384, sha3_512, sha512_256, shake_128, blake2s)
-l LENGTH, --length LENGTH
Length of the digest for SHAKE (required) or BLAKE (optional) algorithms in bytes
--archive-dive, --no-archive-dive
Descend into archive files and hash their members (default). Use --no-archive-dive to hash archives as opaque files.
```

> Note: The available algorithms are determined by those available to `hashlib` and may vary depending on your system and OpenSSL version, so the set shown on your system with `sum-buddy -h` may be different from above. At a minimum, it should include: `{blake2s, blake2b, md5, sha1, sha224, sha256, sha384, sha512, sha3_224, sha3_256, sha3_384, sha3_512, shake_128, shake_256}`, which is given by `hashlib.algorithms_guaranteed`.
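
As a rough illustration of the note above and of the `-l LENGTH` option, the following sketch calls `hashlib` directly. It does not show sum-buddy's own code, and the variable names are made up for the example.

```python
import hashlib

# Algorithms every Python build provides, independent of OpenSSL.
guaranteed = hashlib.algorithms_guaranteed
# Extras (for example ripemd160 or sm3) depend on the linked OpenSSL build.
openssl_only = hashlib.algorithms_available - guaranteed

data = b"sum-buddy"

# Fixed-length hashes take no length argument.
md5_hex = hashlib.md5(data).hexdigest()

# SHAKE is an extendable-output function: a digest length in bytes is required.
shake_hex = hashlib.shake_128(data).hexdigest(16)

# BLAKE2 accepts an optional digest size (1 to 64 bytes for blake2b).
blake_hex = hashlib.blake2b(data, digest_size=20).hexdigest()

print(sorted(openssl_only))  # varies by system
print(len(md5_hex), len(shake_hex), len(blake_hex))  # 32 32 40
```

The hex strings are twice as long as the byte lengths, which is why a 16-byte SHAKE digest prints as 32 characters.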
@@ -128,17 +130,31 @@ cat examples/checksums.csv
>```

- **ZIP Support:**
sum-buddy treats ZIP files as both a hashed artifact and a container. For each ZIP encountered during a walk, it:
- emits a row for the ZIP file itself, and
- emits a row for each non-directory member, with `filepath` of the form `path/to/archive.zip/inner/path`, computed via in-memory streaming (no extraction to disk).
By default, sum-buddy treats ZIP files as both a hashed artifact and a container. For each ZIP encountered during a walk, it emits a row for the ZIP itself and a row for each non-directory member, with `filepath` of the form `path/to/archive.zip/inner/path`, computed via in-memory streaming (no extraction to disk). Pass `--no-archive-dive` to hash each archive as a single file instead.

The basic-usage and include-hidden examples above include `examples/example_content/testzip.zip` to demonstrate this. Member ordering follows the archive's central directory.
Ignore patterns decide whether an archive file is included in the walk, but they do not apply *inside* an included archive: once an archive is expanded, all of its file members are hashed, including hidden and platform "junk" files such as `__MACOSX/`, `.DS_Store`, and `.git/`. This is deliberate. An archive is a fixed artifact rather than a live working directory, so the manifest reports exactly what it contains. If an archive carries files that probably were not meant to be there, that is precisely what you would want surfaced. To filter such contents, extract the archive and run sum-buddy on the resulting directory (where the hidden-file defaults and `.sbignore` rules apply), or pass `--no-archive-dive` to hash the archive as a single opaque file.

The basic-usage and include-hidden examples above include `examples/example_content/testzip.zip` to demonstrate the default behavior. Member ordering follows the archive's central directory.
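
A minimal sketch of what hashing members by in-memory streaming can look like, assuming only the standard-library `zipfile` and `hashlib` modules. `hash_zip_members` is a hypothetical helper written for this example, not sum-buddy's API, and it only covers fixed-length algorithms.

```python
import hashlib
import os
import tempfile
import zipfile


def hash_zip_members(zip_path, algorithm="md5", chunk_size=65536):
    """Yield (filepath, filename, hexdigest) for each non-directory member."""
    with zipfile.ZipFile(zip_path) as archive:
        # infolist() returns members in central-directory order.
        for info in archive.infolist():
            if info.is_dir():
                continue
            digest = hashlib.new(algorithm)
            # Read the member in chunks; nothing is extracted to disk.
            with archive.open(info) as member:
                for chunk in iter(lambda: member.read(chunk_size), b""):
                    digest.update(chunk)
            filepath = f"{zip_path}/{info.filename}"
            filename = info.filename.rsplit("/", 1)[-1]
            yield filepath, filename, digest.hexdigest()


# Demo archive (hypothetical contents): one directory entry, one file, one "junk" file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("inner/", "")
        zf.writestr("inner/a.txt", "hello")
        zf.writestr(".DS_Store", "junk")
    rows = list(hash_zip_members(demo))

for filepath, filename, checksum in rows:
    print(filename, checksum)
```

The directory entry is skipped, while `.DS_Store` is hashed like any other member, which matches the "no filtering inside an archive" behavior described above.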

Example with `--no-archive-dive`:
```bash
sum-buddy --no-archive-dive examples/example_content/
```
> Output
> ```console
> filepath,filename,md5
> examples/example_content/file.txt,file.txt,7d52c7437e9af58dac029dd11b1024df
> examples/example_content/dir/file.txt,file.txt,7d52c7437e9af58dac029dd11b1024df
> examples/example_content/testzip.zip,testzip.zip,504185ad294a15ca2f9aab27a3ac34d8
> ```

The flag is a no-op when `input_path` is a single file; only directory inputs descend by default.
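
The paired spelling comes from `argparse.BooleanOptionalAction` (Python 3.9+), which this PR uses in `main()`. The reduced parser below is only a sketch with that one flag and the positional argument, to show how the two spellings parse:

```python
import argparse

parser = argparse.ArgumentParser(prog="sum-buddy")
# One registration yields both --archive-dive and --no-archive-dive.
parser.add_argument("--archive-dive", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("input_path")

default_args = parser.parse_args(["examples/example_content"])
no_dive_args = parser.parse_args(["--no-archive-dive", "examples/example_content"])

print(default_args.archive_dive, no_dive_args.archive_dive)  # True False
```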

If only a target directory is passed, the default settings are to ignore hidden files and directories (those that begin with a `.`), use the `md5` algorithm, and print output to `stdout`, which can be piped (`|`).

To include all files and directories, including hidden ones, use the `--include-hidden` (or `-H`) option.

To ignore files based on patterns, use the `--ignore-file` (or `-i`) option with the path to a file containing patterns to ignore. The `--ignore-file` works identically to how `git` handles a `.gitignore` file using the implementation from [pathspec](https://github.com/cpburnz/python-pathspec).
To ignore files based on patterns, use the `--ignore-file` (or `-i`) option with the path to a file containing patterns to ignore. The `--ignore-file` works identically to how `git` handles a `.gitignore` file using the implementation from [pathspec](https://github.com/cpburnz/python-pathspec). Patterns apply to files in the directory tree, including archive files themselves, but not to members inside an included archive; use `--no-archive-dive` to skip archive members entirely.
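
The interaction between ignore patterns and archives can be sketched as follows. This is a simplified model, not sum-buddy's implementation: `gather` is a hypothetical function, `fnmatch` stands in for the gitignore-style matching that pathspec provides, and `zipfile.is_zipfile` stands in for the project's archive detection.

```python
import fnmatch
import os
import tempfile
import zipfile


def gather(root, ignore_patterns=(), archive_dive=True):
    """Split non-ignored files under root into (regular_files, archive_files)."""
    regular_files, archive_files = [], []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.normpath(os.path.join(dirpath, name))
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            # Patterns are checked against files on disk only (simplified matching).
            if any(fnmatch.fnmatch(rel, pat) for pat in ignore_patterns):
                continue
            if archive_dive and zipfile.is_zipfile(path):
                archive_files.append(path)  # caller expands these; members are unfiltered
            else:
                regular_files.append(path)  # with archive_dive=False, archives land here
    return regular_files, archive_files


def names(paths):
    return sorted(os.path.basename(p) for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("notes.txt", "skip.log"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("x")
    with zipfile.ZipFile(os.path.join(tmp, "bundle.zip"), "w") as zf:
        zf.writestr("inner.log", "never matched against the patterns")

    dive = gather(tmp, ignore_patterns=["*.log"])
    no_dive = gather(tmp, ignore_patterns=["*.log"], archive_dive=False)

print(names(dive[0]), names(dive[1]))        # ['notes.txt'] ['bundle.zip']
print(names(no_dive[0]), names(no_dive[1]))  # ['bundle.zip', 'notes.txt'] []
```

In both runs `skip.log` is dropped, while `inner.log` inside the archive is never compared against the patterns; whether it ends up in the output depends only on whether the archive is expanded.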

You may explore the filtering capabilities of the `--ignore-file` option by using the provided example files under `examples/` and pointing at `examples/example_content`. The expected CSV output files are provided in `examples/expected_outputs/`.

4 changes: 4 additions & 0 deletions examples/expected_outputs/no_archive_dive.csv
@@ -0,0 +1,4 @@
filepath,filename,md5
example_content/testzip.zip,testzip.zip,504185ad294a15ca2f9aab27a3ac34d8
example_content/file.txt,file.txt,7d52c7437e9af58dac029dd11b1024df
example_content/dir/file.txt,file.txt,7d52c7437e9af58dac029dd11b1024df
1 change: 1 addition & 0 deletions scripts/generate_fixtures.py
@@ -79,6 +79,7 @@ def _run(output_name: str, **kwargs) -> None:

_run("default.csv")
_run("include_hidden_true.csv", include_hidden=True)
_run("no_archive_dive.csv", archive_dive=False)


def main() -> int:
21 changes: 18 additions & 3 deletions src/sumbuddy/__main__.py
@@ -10,7 +10,7 @@
import sys
import os

def get_checksums(input_path, output_filepath=None, ignore_file=None, include_hidden=False, algorithm='md5', length=None):
def get_checksums(input_path, output_filepath=None, ignore_file=None, include_hidden=False, algorithm='md5', length=None, archive_dive=True):
"""
Generate a CSV file with the filepath, filename, and checksum of all files in the input directory according to patterns to ignore. Checksum column is labeled by the selected algorithm (e.g., 'md5' or 'sha256').

@@ -22,6 +22,7 @@ def get_checksums(input_path, output_filepath=None, ignore_file=None, include_hi
include_hidden - Boolean [optional]. Whether to include hidden files. Default is False.
algorithm - String. Algorithm to use for checksums. Default: 'md5', see options with 'hashlib.algorithms_available'.
length - Integer [conditionally optional]. Length of the digest for SHAKE (required) and BLAKE (optional) algorithms in bytes.
archive_dive - Boolean [optional]. Whether to descend into archive files and hash their members. When False, archives are hashed as opaque files. Default: True.
"""
mapper = Mapper()

@@ -33,7 +34,12 @@ def get_checksums(input_path, output_filepath=None, ignore_file=None, include_hi
if include_hidden:
print("Warning: --include-hidden (-H) flag is ignored when input is a single file.")
else:
regular_files, archive_files = mapper.gather_file_paths(input_path, ignore_file=ignore_file, include_hidden=include_hidden)
regular_files, archive_files = mapper.gather_file_paths(
input_path,
ignore_file=ignore_file,
include_hidden=include_hidden,
archive_dive=archive_dive,
)

# Exclude the output file from being hashed
if output_filepath:
@@ -88,6 +94,7 @@ def main():
group.add_argument("-H", "--include-hidden", action="store_true", help="Include hidden files")
parser.add_argument("-a", "--algorithm", default="md5", help=f"Hash algorithm to use (default: md5; available: {available_algorithms})")
parser.add_argument("-l", "--length", type=int, help="Length of the digest for SHAKE (required) or BLAKE (optional) algorithms in bytes")
parser.add_argument("--archive-dive", action=argparse.BooleanOptionalAction, default=True, help="Descend into archive files and hash their members (default). Use --no-archive-dive to hash archives as opaque files.")

args = parser.parse_args()

@@ -100,7 +107,15 @@ def main():
sys.exit("Exited without executing")

try:
get_checksums(args.input_path, args.output_file, args.ignore_file, args.include_hidden, args.algorithm, args.length)
get_checksums(
args.input_path,
output_filepath=args.output_file,
ignore_file=args.ignore_file,
include_hidden=args.include_hidden,
algorithm=args.algorithm,
length=args.length,
archive_dive=args.archive_dive,
)
except (EmptyInputDirectoryError, NoFilesAfterFilteringError, LengthUsedForFixedLengthHashError) as e:
sys.exit(str(e))

9 changes: 5 additions & 4 deletions src/sumbuddy/mapper.py
@@ -27,7 +27,7 @@ def reset_filter(self, ignore_file=None, include_hidden=False):
else:
self.filter_manager.read_ignore_patterns(include_hidden=False) # Default: ignore hidden files

def gather_file_paths(self, input_directory, ignore_file=None, include_hidden=False):
def gather_file_paths(self, input_directory, ignore_file=None, include_hidden=False, archive_dive=True):
"""
Generate list of file paths in the input directory based on ignore pattern rules.

@@ -36,11 +36,12 @@ def gather_file_paths(self, input_directory, ignore_file=None, include_hidden=Fa
input_directory - String. Directory to traverse for files.
ignore_file - String [optional]. Filepath for the ignore patterns file.
include_hidden - Boolean [optional]. Whether to include hidden files.
archive_dive - Boolean [optional]. Whether to classify supported archives separately so callers can descend into their members. When False, archive files are returned with regular_files. Default is True.

Returns:
---------
regular_files - List. Non-archive files in input_directory that are not ignored.
archive_files - List. Archive files in input_directory that are not ignored.
regular_files - List. Files in input_directory that are not ignored. When archive_dive is True, this excludes supported archives. When False, it includes them.
archive_files - List. Archive files in input_directory that are not ignored and should be expanded by the caller.
"""

if not os.path.isdir(input_directory):
@@ -59,7 +60,7 @@ def gather_file_paths(self, input_directory, ignore_file=None, include_hidden=Fa
for name in files:
file_path = os.path.normpath(os.path.join(root, name))
if self.filter_manager.should_include(file_path, root_directory):
if self.archive_handler.is_supported_archive(file_path):
if archive_dive and self.archive_handler.is_supported_archive(file_path):
archive_files.append(file_path)
else:
regular_files.append(file_path)
55 changes: 55 additions & 0 deletions tests/test_archive.py
@@ -1,7 +1,9 @@
import shutil
import sys
import tempfile
import zipfile
from pathlib import Path
from unittest.mock import patch

import pytest

@@ -80,6 +82,15 @@ def test_gather_file_paths_with_archive(self):
assert isinstance(regular_files, list)
assert isinstance(archive_files, list)

def test_gather_file_paths_with_archive_dive_disabled(self):
mapper = Mapper()
with tempfile.TemporaryDirectory() as temp_dir:
temp_zip_path = Path(temp_dir) / "test_archive.zip"
shutil.copy2(TEST_ZIP, temp_zip_path)
regular_files, archive_files = mapper.gather_file_paths(temp_dir, archive_dive=False)
assert str(temp_zip_path) in regular_files
assert archive_files == []

def test_gather_file_paths_with_archive_and_filter(self):
mapper = Mapper()
with tempfile.TemporaryDirectory() as temp_dir:
@@ -124,3 +135,47 @@ def test_integration_archive_support_matches_default_fixture(monkeypatch, tmp_pa
actual = output_file.read_text().splitlines()
expected = (examples_dir / "expected_outputs" / "default.csv").read_text().splitlines()
assert sorted(actual) == sorted(expected)


def test_integration_no_archive_dive_omits_members(monkeypatch, tmp_path):
    """End-to-end: get_checksums with archive_dive=False matches the no_archive_dive.csv fixture (archive present, members absent)."""
    from sumbuddy import get_checksums

    examples_dir = Path(__file__).parent.parent / "examples"
    monkeypatch.chdir(examples_dir)
    output_file = tmp_path / "checksums.csv"
    get_checksums("example_content", str(output_file), archive_dive=False)

    actual = output_file.read_text().splitlines()
    expected = (examples_dir / "expected_outputs" / "no_archive_dive.csv").read_text().splitlines()
    assert sorted(actual) == sorted(expected)


def test_main_passes_archive_dive_false_to_get_checksums(monkeypatch, tmp_path):
    """--no-archive-dive on the CLI threads archive_dive=False into get_checksums."""
    from sumbuddy import __main__ as sb_main

    examples_dir = Path(__file__).parent.parent / "examples"
    monkeypatch.chdir(examples_dir)
    output_file = tmp_path / "checksums.csv"

    monkeypatch.setattr(sys, "argv", ["sum-buddy", "--no-archive-dive", "-o", str(output_file), "example_content"])
    with patch("sumbuddy.__main__.get_checksums") as mock_gc:
        sb_main.main()
    mock_gc.assert_called_once()
    assert mock_gc.call_args.kwargs["archive_dive"] is False


def test_main_passes_archive_dive_true_by_default(monkeypatch, tmp_path):
    """Without the flag, get_checksums receives archive_dive=True (the default)."""
    from sumbuddy import __main__ as sb_main

    examples_dir = Path(__file__).parent.parent / "examples"
    monkeypatch.chdir(examples_dir)
    output_file = tmp_path / "checksums.csv"

    monkeypatch.setattr(sys, "argv", ["sum-buddy", "-o", str(output_file), "example_content"])
    with patch("sumbuddy.__main__.get_checksums") as mock_gc:
        sb_main.main()
    mock_gc.assert_called_once()
    assert mock_gc.call_args.kwargs["archive_dive"] is True
16 changes: 13 additions & 3 deletions tests/test_getChecksums.py
@@ -120,7 +120,12 @@ def test_get_checksums_to_stdout(self, mock_checksum, mock_gather, mock_open, mo
@patch('sumbuddy.Hasher.checksum_file', side_effect=lambda x, **kwargs: 'dummychecksum')
def test_get_checksums_with_ignore_file(self, mock_checksum, mock_gather, mock_open, mock_exists, mock_abspath):
get_checksums(self.input_path, output_filepath=None, ignore_file=self.ignore_file, include_hidden=False, algorithm=self.algorithm)
mock_gather.assert_called_with(self.input_path, ignore_file=self.ignore_file, include_hidden=False)
mock_gather.assert_called_with(
self.input_path,
ignore_file=self.ignore_file,
include_hidden=False,
archive_dive=True,
)

@patch('os.path.abspath', side_effect=lambda x: x)
@patch('os.path.exists', return_value=True)
@@ -129,8 +134,13 @@ def test_get_checksums_with_ignore_file(self, mock_checksum, mock_gather, mock_o
@patch('sumbuddy.Hasher.checksum_file', side_effect=lambda x, **kwargs: 'dummychecksum')
def test_get_checksums_include_hidden(self, mock_checksum, mock_gather, mock_open, mock_exists, mock_abspath):
get_checksums(self.input_path, output_filepath=None, ignore_file=None, include_hidden=True, algorithm=self.algorithm)
mock_gather.assert_called_with(self.input_path, ignore_file=None, include_hidden=True)

mock_gather.assert_called_with(
self.input_path,
ignore_file=None,
include_hidden=True,
archive_dive=True,
)

@patch('os.path.abspath', side_effect=lambda x: x)
@patch('os.path.exists', return_value=True)
@patch('builtins.open', new_callable=mock_open)