
Add backend.ls_dirs()#291

Open
hagenw wants to merge 17 commits into main from add-ls-dirs

@hagenw
Member

@hagenw hagenw commented Apr 1, 2026

Closes #252

Originally, we did not cover folders because some backends have no concept of folders. But the path syntax in audbackend clearly follows a sub-directory structure, e.g. /sub0/sub1/file.txt, so asking for all folders under /sub0/ is well defined.

The Versioned and Maven interfaces also use this folder-like structure to encode versions, e.g. /sub/version/file.zip. To get all available versions of file.zip, we currently use ls() to list all files stored under /sub and then look for potential versions. There can be a lot of files, which is why interface.versions() can be very slow.

This pull request improves this by adding backend.ls_dirs(), which returns the folders under a given path.
To handle backends that do not understand a folder structure, ls_dirs() defaults to using _ls() under the hood. Backends that support a faster folder lookup then need to implement a custom _ls_dirs() method, which I have done for all our current backends.
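
The fallback idea can be sketched as follows. This is a minimal illustration, not the code from this pull request: the function name `ls_dirs_from_files` is hypothetical, a plain list stands in for whatever `_ls()` returns, and returning an empty list for an unknown path is a simplification.

```python
def ls_dirs_from_files(file_paths, path):
    """Return the names of the immediate sub-folders of ``path``.

    ``file_paths`` is a stand-in for the flat file list
    a backend's ``_ls()`` might return (assumption).
    """
    if not path.endswith("/"):
        raise ValueError("path must end with '/'")
    dirs = set()
    for file_path in file_paths:
        if not file_path.startswith(path):
            continue
        remainder = file_path[len(path):]
        # Only entries with a further "/" live inside a sub-folder
        if "/" in remainder:
            dirs.add(remainder.split("/", 1)[0])
    return sorted(dirs)


files = [
    "/name/1.0.0/db.yaml",
    "/name/1.0.0/db.zip",
    "/name/2.0.0/db.yaml",
    "/name/media/1.0.0/a.wav",
    "/name/readme.txt",
]
dirs = ls_dirs_from_files(files, "/name/")
print(dirs)
```

Note that this fallback still has to touch every file under the path, which is why a backend-specific lookup is worth implementing where available.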


Example

Here is an example file structure from audb using audbackend.interface.Versioned:

name/1.0.0/db.yaml
name/1.0.0/db.zip
name/2.0.0/db.yaml
name/2.0.0/db.zip
name/media/1.0.0/...
name/media/2.0.0/...
name/meta/1.0.0/...
name/meta/2.0.0/...

Current implementation

Calling audbackend.interface.Versioned.versions("/name/db.yaml") on the main branch currently executes

paths = self.ls(path, suppress_backend_errors=suppress_backend_errors)

which would call

root, file = self.split(path)
paths = self.backend.ls(root, suppress_backend_errors=suppress_backend_errors)

root would be "/name/", which means ls() would need to list potentially millions of files.
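
To illustrate why this gets expensive, here is a rough sketch of version discovery from a full file listing. It is not the audbackend implementation: the helper `versions_via_full_listing` and the in-memory `listing` are assumptions made for the example.

```python
def versions_via_full_listing(all_paths, root, file):
    """Find versions of ``file`` by scanning every path under ``root``."""
    versions = []
    scanned = 0
    for path in all_paths:
        scanned += 1
        if not path.startswith(root):
            continue
        parts = path[len(root):].split("/")
        # Match exactly <root><version>/<file>
        if len(parts) == 2 and parts[1] == file:
            versions.append(parts[0])
    return sorted(versions), scanned


listing = [
    "/name/1.0.0/db.yaml",
    "/name/1.0.0/db.zip",
    "/name/2.0.0/db.yaml",
    "/name/2.0.0/db.zip",
]
# Media files dominate the listing, although they are irrelevant for the result
listing += [f"/name/media/1.0.0/{n:05d}.wav" for n in range(1000)]

found, scanned = versions_via_full_listing(listing, "/name/", "db.yaml")
print(found, scanned)
```

The number of scanned entries grows with the total number of stored files, not with the number of versions.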

Proposed implementation

Calling audbackend.interface.Versioned.versions("/name/db.yaml") on the add-ls-dirs branch executes

root, file = self.split(path)
root_dir = root if root.endswith("/") else root + "/"
dirs = self.backend.ls_dirs(root_dir, suppress_backend_errors=suppress_backend_errors)

where root_dir is "/name/". For the minio backend, this then runs

objects = list(
    self._client.list_objects(bucket_name=self.repository, prefix=path, recursive=False)
)

which only needs to return the files and folders directly under "/name/".
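
A sketch of how folder names could be extracted from such a non-recursive listing. It assumes the listed objects expose `object_name` and `is_dir`; to keep it runnable without a server, `SimpleNamespace` objects stand in for real listing results, and the helper name `dir_names` is hypothetical.

```python
from types import SimpleNamespace


def dir_names(objects):
    """Collect folder names from objects with ``object_name`` and ``is_dir``."""
    names = []
    for obj in objects:
        if obj.is_dir:
            # Folder entries end with "/", e.g. "name/1.0.0/"
            names.append(obj.object_name.rstrip("/").split("/")[-1])
    return sorted(names)


# Stand-ins for the entries of a non-recursive listing (assumption)
fake_objects = [
    SimpleNamespace(object_name="name/1.0.0/", is_dir=True),
    SimpleNamespace(object_name="name/2.0.0/", is_dir=True),
    SimpleNamespace(object_name="name/media/", is_dir=True),
    SimpleNamespace(object_name="name/readme.txt", is_dir=False),
]
result = dir_names(fake_objects)
print(result)
```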

Benchmark

Execution time for running audb.versions():

| Dataset | Backend | Current | Proposed |
|---|---|---|---|
| librispeech | minio | 103.469s | 1.773s |
| aisoundlab-covid-19 | artifactory | 3.544s | 2.303s |

Note: for the artifactory backend, audb.versions() does not rely on audbackend.interface.Maven.versions(), but on custom code at https://github.com/audeering/audb/blob/ccadb4424b8e808aa8a7061efdd1bdd513853a45/audb/core/api.py#L620-L648. With the changes proposed here, we no longer need that custom code for the artifactory backend in audb. The Proposed results in the table were obtained with a modified version of audb that has no special handling of the artifactory backend. Running that same code for Current would take 6.675s.
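
For orientation, the speed-up factors implied by these timings can be computed directly. The numbers are taken from the table and the note above; the dictionary layout and labels are only for illustration.

```python
# (current, proposed) execution times in seconds, as reported above
timings = {
    "librispeech (minio)": (103.469, 1.773),
    "aisoundlab-covid-19 (artifactory)": (3.544, 2.303),
    "aisoundlab-covid-19 (artifactory, without custom audb code)": (6.675, 2.303),
}

speedups = {
    label: round(current / proposed, 1)
    for label, (current, proposed) in timings.items()
}
for label, factor in speedups.items():
    print(f"{label}: {factor}x faster")
```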

Discussion

  • The main downside of the implementation in this pull request is the need to call exists() on the backend multiple times in versions(). But as backends can also be handled by other tools besides audbackend (e.g. web interfaces), I don't think we can avoid this.
  • An alternative to adding ls_dirs() would be to add a recursive parameter directly to ls() and let it return all folders and files within a folder when recursive=False.
  • We could add latest_version() in a follow-up pull request, as there we could first get a list of sub-folders and then check for existing files while traversing the sub-folders in reversed order.
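
The latest_version() idea from the last bullet could look roughly like this. It is only a sketch under assumptions: the function signature is hypothetical, versions are assumed to be purely numeric dotted strings, and `exists` is any callable that answers whether a path is stored.

```python
def parse_version(name):
    """Return a sortable tuple for numeric dotted versions, else None."""
    parts = name.split(".")
    if all(part.isdigit() for part in parts):
        return tuple(int(part) for part in parts)
    return None


def latest_version(dirs, exists, root, file):
    """Return the newest folder in ``dirs`` that contains ``file``."""
    candidates = [d for d in dirs if parse_version(d) is not None]
    # Start with the highest version and stop at the first hit
    for d in sorted(candidates, key=parse_version, reverse=True):
        if exists(f"{root}{d}/{file}"):
            return d
    raise FileNotFoundError(f"{root}{file}")


stored = {"/name/1.0.0/db.yaml", "/name/2.0.0/db.yaml", "/name/10.0.0/db.yaml"}
calls = []


def exists(path):
    calls.append(path)  # record how often the backend would be asked
    return path in stored


latest = latest_version(
    ["1.0.0", "10.0.0", "2.0.0", "media", "meta"], exists, "/name/", "db.yaml"
)
print(latest, len(calls))
```

In the common case only a single exists() call is needed, instead of one per version folder.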

@sourcery-ai
Contributor

sourcery-ai Bot commented Apr 1, 2026

Reviewer's Guide

Introduce a directory-listing abstraction on backends (ls_dirs/_ls_dirs) and refactor Versioned and Maven interfaces to use directory-based version discovery, with optimized implementations for each concrete backend and corresponding tests, significantly speeding up version lookups.

Sequence diagram for Versioned.versions() using backend.ls_dirs()

sequenceDiagram
    actor Client
    participant Versioned as VersionedInterface
    participant Backend as BackendBase

    Client->>Versioned: versions(path, suppress_backend_errors)
    Versioned->>Versioned: utils.check_path(path)
    Versioned->>Versioned: root, file = split(path)
    Versioned->>Versioned: root_dir = root + "/" if needed

    Versioned->>Backend: ls_dirs(root_dir, suppress_backend_errors)
    alt backend_error and not suppress_backend_errors
        Backend-->>Versioned: raise BackendError
        Versioned-->>Client: BackendError
    else success
        Backend-->>Versioned: dirs (list of version candidates)
        loop for each d in dirs
            Versioned->>Backend: exists(join(root, d, file))
            Backend-->>Versioned: bool
            alt exists
                Versioned->>Versioned: append d to vs
            end
        end
        alt vs is empty and not suppress_backend_errors
            Versioned->>Versioned: utils.raise_file_not_found_error(path)
            Versioned-->>Client: BackendError(FileNotFoundError)
        else
            Versioned->>Versioned: audeer.sort_versions(vs)
            Versioned-->>Client: sorted versions
        end
    end
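
The flow in the diagram can be sketched in a few lines of Python. This is an illustration only: `FakeBackend` is a hypothetical in-memory stand-in, a plain FileNotFoundError replaces the BackendError wrapping, and `sorted()` replaces audeer.sort_versions().

```python
class FakeBackend:
    """In-memory stand-in for a backend (assumption, not audbackend code)."""

    def __init__(self, files):
        self.files = set(files)

    def ls_dirs(self, path):
        dirs = set()
        for f in self.files:
            remainder = f[len(path):] if f.startswith(path) else ""
            if "/" in remainder:
                dirs.add(remainder.split("/", 1)[0])
        return sorted(dirs)

    def exists(self, path):
        return path in self.files


def versions(backend, path):
    root, file = path.rsplit("/", 1)
    root_dir = root + "/"
    # Every sub-folder is a version candidate; keep those containing the file
    vs = [
        d for d in backend.ls_dirs(root_dir)
        if backend.exists(root_dir + d + "/" + file)
    ]
    if not vs:
        raise FileNotFoundError(path)
    return sorted(vs)


backend = FakeBackend(
    ["/name/1.0.0/db.yaml", "/name/2.0.0/db.yaml", "/name/media/1.0.0/a.wav"]
)
result = versions(backend, "/name/db.yaml")
print(result)
```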

Sequence diagram for Maven.versions() using backend.ls_dirs()

sequenceDiagram
    actor Client
    participant Maven as MavenInterface
    participant Backend as BackendBase

    Client->>Maven: versions(path, suppress_backend_errors)
    Maven->>Maven: utils.check_path(path)
    Maven->>Maven: root, name = split(path)
    Maven->>Maven: base, ext = _split_ext(name)
    Maven->>Maven: base_dir = join(root, base) + "/"

    Maven->>Backend: ls_dirs(base_dir, suppress_backend_errors)
    Backend-->>Maven: dirs (version folder names)

    loop for each d in dirs
        Maven->>Maven: versioned_file = join(root, base, d, base + "-" + d + ext)
        Maven->>Backend: exists(versioned_file)
        Backend-->>Maven: bool
        alt exists
            Maven->>Maven: append d to vs
        end
    end

    alt vs is empty and not suppress_backend_errors
        Maven->>Maven: utils.raise_file_not_found_error(path)
        Maven-->>Client: BackendError(FileNotFoundError)
    else
        Maven->>Maven: audeer.sort_versions(vs)
        Maven-->>Client: sorted versions
    end
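
The path construction step in the diagram, join(root, base, d, base + "-" + d + ext), can be tried out in isolation. The sketch below uses `posixpath` from the standard library as a stand-in for the interface's own split(), join(), and _split_ext() helpers, which may treat extensions differently; the function name is hypothetical.

```python
import posixpath


def maven_versioned_path(path, version):
    """Build /root/base/version/base-version.ext for a given version."""
    root, name = posixpath.split(path)
    # Stand-in for _split_ext(); splits at the last dot only
    base, ext = posixpath.splitext(name)
    return posixpath.join(root, base, version, f"{base}-{version}{ext}")


versioned_file = maven_versioned_path("/sub/file.zip", "1.0.0")
print(versioned_file)
```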

Updated class diagram for BackendBase and concrete backends with ls_dirs

classDiagram
    class BackendBase {
        bool opened
        string sep
        +list~string~ ls(path="/", suppress_backend_errors=False)
        +list~string~ ls_dirs(path="/", suppress_backend_errors=False)
        -list~string~ _ls(path) *abstract*
        -list~string~ _ls_dirs(path)
    }

    class FilesystemBackend {
        -string root
        -list~string~ _ls(path)
        -list~string~ _ls_dirs(path)
    }

    class MinioBackend {
        -object _client
        -string repository
        -list~string~ _ls(path)
        -list~string~ _ls_dirs(path)
    }

    class ArtifactoryBackend {
        -object _client
        -string repository
        -list~string~ _ls(path)
        -list~string~ _ls_dirs(path)
    }

    BackendBase <|-- FilesystemBackend
    BackendBase <|-- MinioBackend
    BackendBase <|-- ArtifactoryBackend

    class Utils {
        +string check_path(path, allow_sub_path)
        +T call_function_on_backend(func, path, suppress_backend_errors, fallback_return_value)
    }

    BackendBase ..> Utils : uses
    FilesystemBackend ..> os : uses
    MinioBackend ..> MinioClient : uses
    ArtifactoryBackend ..> ArtifactoryPath : uses

Updated class diagram for Versioned and Maven interfaces using backend.ls_dirs

classDiagram
    class BackendBase {
        +list~string~ ls(path="/", suppress_backend_errors=False)
        +list~string~ ls_dirs(path="/", suppress_backend_errors=False)
        +bool exists(path)
    }

    class VersionedInterface {
        +tuple~string, string~ split(path)
        +string join(part1, part2, part3)
        +list~string~ versions(path, suppress_backend_errors=False)
        -BackendBase backend
    }

    class MavenInterface {
        +tuple~string, string~ split(path)
        +tuple~string, string~ _split_ext(name)
        +string join(part1, part2, part3, part4)
        +list~string~ versions(path, suppress_backend_errors=False)
        -BackendBase backend
    }

    class Utils {
        +string check_path(path)
        +void raise_file_not_found_error(path)
    }

    class Audeer {
        +list~string~ sort_versions(versions)
    }

    VersionedInterface --> BackendBase : uses backend
    MavenInterface --> BackendBase : uses backend
    VersionedInterface ..> Utils : uses
    MavenInterface ..> Utils : uses
    VersionedInterface ..> Audeer : uses
    MavenInterface ..> Audeer : uses

File-Level Changes

Change: Add a generic backend directory listing API (ls_dirs/_ls_dirs) with validation, error handling, and a default implementation based on existing ls() behavior.
Details:
  • Introduce _ls_dirs(path) on the base backend that derives immediate subdirectory names from _ls() results and raises FileNotFoundError when no paths are found.
  • Expose public ls_dirs(path, suppress_backend_errors=False) on the base backend, enforcing open-state, path validation, trailing slash requirement, and using utils.call_function_on_backend with a sorted result.
  • Extend backend error tests to assert that ls_dirs() also fails when the backend is not opened.
Files:
  • audbackend/core/backend/base.py
  • tests/test_backend_base.py

Change: Implement efficient backend-specific _ls_dirs() for concrete backends (filesystem, Artifactory, MinIO).
Details:
  • Filesystem backend: implement _ls_dirs() using os.scandir() over the expanded path and raise FileNotFoundError if the directory does not exist.
  • Artifactory backend: implement _ls_dirs() by wrapping ArtifactoryPath iteration to collect immediate child directories, raising FileNotFoundError when the path is missing.
  • MinIO backend: implement _ls_dirs() using list_objects with recursive=False, mapping directory objects (is_dir) to their final path component, and raising FileNotFoundError on empty results.
  • Add backend-specific tests ensuring ls_dirs() returns expected subdirectory names and raises BackendError for non-existent paths on Artifactory and MinIO.
Files:
  • audbackend/core/backend/filesystem.py
  • audbackend/core/backend/artifactory.py
  • audbackend/core/backend/minio.py
  • tests/test_backend_filesystem.py
  • tests/test_backend_artifactory.py
  • tests/test_backend_minio.py

Change: Refactor Versioned and Maven interfaces to compute versions via directory enumeration instead of listing all files, improving performance and aligning with folder-like storage layouts.
Details:
  • Change Versioned.versions() to derive the parent directory, call backend.ls_dirs() to get candidate version directories, and filter by existing versioned paths before returning sorted versions; raise BackendError via FileNotFoundError mapping when no versions exist and errors are not suppressed.
  • Add a Maven.versions() method that follows the Maven on-disk layout (/root/base/version/base-version.ext), uses backend.ls_dirs() to enumerate version directories, checks for the expected versioned file, sorts versions, and mirrors the same error-handling semantics as Versioned.
  • Add tests for Versioned and Maven versions() using ls_dirs(), covering non-existent files, mismatched extensions, version discovery, suppression of backend errors, and correct ordering of versions.
  • Adjust an existing Versioned interface archive test to use tmpdir as the source directory instead of '.', keeping tests aligned with the new behavior.
Files:
  • audbackend/core/interface/versioned.py
  • audbackend/core/interface/maven.py
  • tests/test_interface_versioned.py
  • tests/test_interface_maven.py
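
As a rough illustration of the filesystem variant described above, the following sketch lists immediate sub-folders with os.scandir(). It is not the code from audbackend/core/backend/filesystem.py: the function name is hypothetical and path expansion relative to a backend root is left out.

```python
import os
import tempfile


def ls_dirs_scandir(path):
    """Return sorted names of the immediate sub-folders of ``path``."""
    if not os.path.isdir(path):
        raise FileNotFoundError(path)
    with os.scandir(path) as entries:
        # Files are skipped, only folder entries are kept
        return sorted(entry.name for entry in entries if entry.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "name", "1.0.0"))
    os.makedirs(os.path.join(tmp, "name", "2.0.0"))
    with open(os.path.join(tmp, "name", "readme.txt"), "w") as fp:
        fp.write("not a folder")

    found = ls_dirs_scandir(os.path.join(tmp, "name"))
    try:
        ls_dirs_scandir(os.path.join(tmp, "missing"))
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(found, missing_raises)
```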

@codecov

codecov Bot commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (6a5ff4d) to head (f080616).

Additional details and impacted files
| Files with missing lines | Coverage Δ |
|---|---|
| audbackend/core/backend/artifactory.py | 100.0% <100.0%> (ø) |
| audbackend/core/backend/base.py | 100.0% <100.0%> (ø) |
| audbackend/core/backend/filesystem.py | 100.0% <100.0%> (ø) |
| audbackend/core/backend/minio.py | 100.0% <100.0%> (ø) |
| audbackend/core/interface/maven.py | 100.0% <100.0%> (ø) |
| audbackend/core/interface/versioned.py | 100.0% <100.0%> (ø) |

@hagenw hagenw marked this pull request as ready for review April 2, 2026 09:59
sourcery-ai[bot]

This comment was marked as resolved.

@hagenw hagenw self-assigned this Apr 2, 2026
@hagenw hagenw requested a review from frankenjoe April 2, 2026 13:13
@hagenw
Member Author

hagenw commented Apr 7, 2026

If we did not test for the existence of the actual files in versions() but just used the folders, we could speed this up further to 0.177s for librispeech on minio and 0.934s for aisoundlab-covid-19 on artifactory.

@hagenw
Member Author

hagenw commented Apr 17, 2026

> If we did not test for the existence of the actual files in versions() but just used the folders, we could speed this up further to 0.177s for librispeech on minio and 0.934s for aisoundlab-covid-19 on artifactory.

I would vote to keep testing whether the file exists in versions() in audbackend, as the following is a completely valid folder structure, but versions() would return the wrong versions for sub/file.txt if we did not check for its existence.

sub/1.0.0/file.txt
sub/2.0.0/folder/1.0.0/other.txt
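
A small sketch makes the difference concrete. It uses a plain set of paths as a stand-in for the backend content, so none of it is audbackend code.

```python
files = {"sub/1.0.0/file.txt", "sub/2.0.0/folder/1.0.0/other.txt"}

# Folders directly under sub/, as a folder-only lookup would report them
folders = sorted({f.split("/")[1] for f in files})

# Keep only folders that really contain file.txt
checked = [d for d in folders if f"sub/{d}/file.txt" in files]

print(folders)
print(checked)
```

Without the existence check, 2.0.0 would be reported as a version of sub/file.txt although no such file is stored there.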

@frankenjoe
Collaborator

The speed-up for librispeech on minio is impressive. So yes, we need it :)

> • An alternative to adding ls_dirs() would be to add a recursive parameter directly to ls() and let it return all folders and files within a folder when recursive=False

But I guess you prefer the current solution, correct?

@hagenw
Member Author

hagenw commented May 22, 2026

I don't have a strong opinion on that. But I guess I slightly preferred ls_dirs() when implementing it.
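
For comparison, the alternative with a recursive parameter might behave like the sketch below. This is purely hypothetical: the signature, the in-memory file list, and the convention of marking folders with a trailing "/" are assumptions, not an existing audbackend API.

```python
def ls(all_files, path, recursive=True):
    """List entries under ``path``; folders end with "/" if not recursive."""
    inside = [f for f in all_files if f.startswith(path)]
    if recursive:
        return sorted(inside)
    entries = set()
    for f in inside:
        head, sep, _ = f[len(path):].partition("/")
        # ``sep`` is "/" for folders and "" for files directly under path
        entries.add(path + head + sep)
    return sorted(entries)


all_files = [
    "/name/1.0.0/db.yaml",
    "/name/2.0.0/db.yaml",
    "/name/readme.txt",
]
flat = ls(all_files, "/name/", recursive=False)
print(flat)
```

With this variant, callers such as versions() would have to filter the folder entries themselves, which ls_dirs() does for them.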

Development

Successfully merging this pull request may close these issues.

Consider adding a recursive argument to ls()

2 participants