Add backend.ls_dirs()#291
Conversation
Reviewer's GuideIntroduce a directory-listing abstraction on backends (ls_dirs/_ls_dirs) and refactor Versioned and Maven interfaces to use directory-based version discovery, with optimized implementations for each concrete backend and corresponding tests, significantly speeding up version lookups. Sequence diagram for Versioned.versions() using backend.ls_dirs()sequenceDiagram
actor Client
participant Versioned as VersionedInterface
participant Backend as BackendBase
Client->>Versioned: versions(path, suppress_backend_errors)
Versioned->>Versioned: utils.check_path(path)
Versioned->>Versioned: root, file = split(path)
Versioned->>Versioned: root_dir = root + "/" if needed
Versioned->>Backend: ls_dirs(root_dir, suppress_backend_errors)
alt backend_error and not suppress_backend_errors
Backend-->>Versioned: raise BackendError
Versioned-->>Client: BackendError
else success
Backend-->>Versioned: dirs (list of version candidates)
loop for each d in dirs
Versioned->>Backend: exists(join(root, d, file))
Backend-->>Versioned: bool
alt exists
Versioned->>Versioned: append d to vs
end
end
alt vs is empty and not suppress_backend_errors
Versioned->>Versioned: utils.raise_file_not_found_error(path)
Versioned-->>Client: BackendError(FileNotFoundError)
else
Versioned->>Versioned: audeer.sort_versions(vs)
Versioned-->>Client: sorted versions
end
end
Sequence diagram for Maven.versions() using backend.ls_dirs()sequenceDiagram
actor Client
participant Maven as MavenInterface
participant Backend as BackendBase
Client->>Maven: versions(path, suppress_backend_errors)
Maven->>Maven: utils.check_path(path)
Maven->>Maven: root, name = split(path)
Maven->>Maven: base, ext = _split_ext(name)
Maven->>Maven: base_dir = join(root, base) + "/"
Maven->>Backend: ls_dirs(base_dir, suppress_backend_errors)
Backend-->>Maven: dirs (version folder names)
loop for each d in dirs
Maven->>Maven: versioned_file = join(root, base, d, base + "-" + d + ext)
Maven->>Backend: exists(versioned_file)
Backend-->>Maven: bool
alt exists
Maven->>Maven: append d to vs
end
end
alt vs is empty and not suppress_backend_errors
Maven->>Maven: utils.raise_file_not_found_error(path)
Maven-->>Client: BackendError(FileNotFoundError)
else
Maven->>Maven: audeer.sort_versions(vs)
Maven-->>Client: sorted versions
end
Updated class diagram for BackendBase and concrete backends with ls_dirsclassDiagram
class BackendBase {
bool opened
string sep
+list~string~ ls(path="/", suppress_backend_errors=False)
+list~string~ ls_dirs(path="/", suppress_backend_errors=False)
-list~string~ _ls(path) *abstract*
-list~string~ _ls_dirs(path)
}
class FilesystemBackend {
-string root
-list~string~ _ls(path)
-list~string~ _ls_dirs(path)
}
class MinioBackend {
-object _client
-string repository
-list~string~ _ls(path)
-list~string~ _ls_dirs(path)
}
class ArtifactoryBackend {
-object _client
-string repository
-list~string~ _ls(path)
-list~string~ _ls_dirs(path)
}
BackendBase <|-- FilesystemBackend
BackendBase <|-- MinioBackend
BackendBase <|-- ArtifactoryBackend
class Utils {
+string check_path(path, allow_sub_path)
+T call_function_on_backend(func, path, suppress_backend_errors, fallback_return_value)
}
BackendBase ..> Utils : uses
FilesystemBackend ..> os : uses
MinioBackend ..> MinioClient : uses
ArtifactoryBackend ..> ArtifactoryPath : uses
Updated class diagram for Versioned and Maven interfaces using backend.ls_dirsclassDiagram
class BackendBase {
+list~string~ ls(path="/", suppress_backend_errors=False)
+list~string~ ls_dirs(path="/", suppress_backend_errors=False)
+bool exists(path)
}
class VersionedInterface {
+tuple~string, string~ split(path)
+string join(part1, part2, part3)
+list~string~ versions(path, suppress_backend_errors=False)
-BackendBase backend
}
class MavenInterface {
+tuple~string, string~ split(path)
+tuple~string, string~ _split_ext(name)
+string join(part1, part2, part3, part4)
+list~string~ versions(path, suppress_backend_errors=False)
-BackendBase backend
}
class Utils {
+string check_path(path)
+void raise_file_not_found_error(path)
}
class Audeer {
+list~string~ sort_versions(versions)
}
VersionedInterface --> BackendBase : uses backend
MavenInterface --> BackendBase : uses backend
VersionedInterface ..> Utils : uses
MavenInterface ..> Utils : uses
VersionedInterface ..> Audeer : uses
MavenInterface ..> Audeer : uses
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
🚀 New features to boost your workflow:
|
|
If we would not test for the existence of the actual files in |
I would vote to stick for testing if a file exists in |
|
The speed-up for librispeech on minio is impressive. So yes, we need it :)
But I guess you prefer the current solution, correct? |
|
I don't have a strong opinion on that. But I guess I slightly preferred |
Closes #252
Originally, we did not cover folders as some backends do not have the concept of folders. But with our
pathsyntax inaudbackendwe clearly follow sub-directory structures, e.g./sub0/sub1/file.txt. This means if you ask to list all folders under/sub0/this is well defined.The
VersionedandMaveninterfaces further use this folder like structure to encode the versions, e.g./sub/version/file.zip. When we want to get all possible versions offile.zipwe usels()to first list all files that are stored under/suband then look for the potential version. There can be a lot of files, and that is the reason whyinterface.versions()can be very slow.This pull request improves this by adding
backend.ls_dirs()to return folders under a given path.To handle backends that do not understand a folder structure
ls_dirs()defaults to use_ls()under the hood. Backends that support a faster lookup of folders need then to implement a custom_ls_dirs()method, which I have done for all our current backends.Example
Here an example file structure from
audbusingaudbackend.interface.Versioned:Current implementation
When you would use
audbackend.interface.Versioned.versions("/name/db.yaml")it would currently (mainbranch) executewhich would call
rootwould be"/name/", which meansls()would need to list potentially millions of files.Proposed implementation
When you would use
audbackend.interface.Versioned.versions("/name/db.yaml")it would (add-ls-dirsbranch) executewhere
root_dirwould be"/name/", which would then run for the minio backendwhich only needs to return all files/folders under
"/name/".Benchmark
Execution time for running
audb.versions():Note: the artifactory backend in
audb.versions()is not relying onaudbackend.interface.Maven.versions(), but on custom code at https://github.com/audeering/audb/blob/ccadb4424b8e808aa8a7061efdd1bdd513853a45/audb/core/api.py#L620-L648. With the changes proposed here, we no longer need that custom code for the artifactory backend inaudb. The results reported for Proposed in the table are run with a modified version ofaudbthat has no special handling of the artifactory backend. If we run the same code for Current the execution time would be 6.675s.Discussion
exists()inversions()on the backend multiple times. But as backends can also be handled by other players besidesaudbackend(e.g. web interfaces), I don't think we can avoid thisls_dirs()would be to add arecursiveparameter directly tols()and let it return all folders and files within a folder whenrecursive=Falselatest_version()in a follow up pull request as there we could first get a list of sub-folders and then start checking for existing files traversing the sub-folders in reversed order