chore(deps): update dependency datasets to v4 #1138
Closed
This PR contains the following updates:
==2.21.0 -> ==4.0.0
Warning
Some dependencies could not be looked up. Check the Dependency Dashboard for more information.
Release Notes
huggingface/datasets (datasets)
v4.0.0 (Compare Source)
New Features
Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595
Build streaming data pipelines in a few lines of code!
from datasets import load_dataset
ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
New Column object
Syntax:
ds["column_name"] # datasets.Column([...]) or datasets.IterableColumn(...)
Iterate on a column:
for text in ds["text"]:
...
Load one cell without bringing the full column into memory:
first_text = ds["text"][0] # equivalent to ds[0]["text"]
torch>=2.7.0 and FFmpeg >= 4
datasets<4.0
AudioDecoder:
VideoDecoder:
Breaking changes
Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
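The release note for this change describes the legacy Sequence behavior as converting a list of dicts to a dict of lists. A minimal plain-Python sketch of that transposition is shown here for orientation only; the helper name list_of_dicts_to_dict_of_lists is hypothetical and is not part of the datasets API, and the library's own conversion may differ in detail.

```python
from typing import Any


def list_of_dicts_to_dict_of_lists(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Transpose row-oriented records into column-oriented lists.

    Hypothetical helper for illustration; not taken from the datasets library.
    """
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for key, value in row.items():
            # Create the column on first sight, then append in row order.
            columns.setdefault(key, []).append(value)
    return columns


rows = [{"text": "a", "label": 0}, {"text": "b", "label": 1}]
columns = list_of_dicts_to_dict_of_lists(rows)
```

Code that relied on the dict-of-lists shape produced by Sequence over dict subfeatures is the most likely place to check after this upgrade.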
List type
Sequence was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. It is no longer a type; it becomes a utility that returns a List or a dict depending on the subfeature.
Other improvements and bug fixes
Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
_dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
New Contributors
Full Changelog: huggingface/datasets@3.6.0...4.0.0
v3.6.0 (Compare Source)
Dataset Features
Other improvements and bug fixes
aiohttp from direct dependencies by @akx in https://github.com/huggingface/datasets/pull/7294
New Contributors
Full Changelog: huggingface/datasets@3.5.1...3.6.0
v3.5.1 (Compare Source)
Bug fixes
TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
Other improvements
New Contributors
Full Changelog: huggingface/datasets@3.5.0...3.5.1
v3.5.0 (Compare Source)
Datasets Features
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.4.1...3.5.0
v3.4.1 (Compare Source)
Bug Fixes
Full Changelog: huggingface/datasets@3.4.0...3.4.1
v3.4.0 (Compare Source)
Dataset Features
Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in https://github.com/huggingface/datasets/pull/7424
decord with torchvision to read videos, since decord is not maintained anymore and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.
metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450
Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368
General improvements and bug fixes
string_to_dict to return None if there is no match instead of raising ValueError by @ringohoffman in https://github.com/huggingface/datasets/pull/7435
ds.set_epoch(new_epoch) by @lhoestq in https://github.com/huggingface/datasets/pull/7451
New Contributors
Full Changelog: huggingface/datasets@3.3.2...3.4.0
v3.3.2 (Compare Source)
Bug fixes
Other general improvements
New Contributors
Full Changelog: huggingface/datasets@3.3.1...3.3.2
v3.3.1 (Compare Source)
Bug fixes
Full Changelog: huggingface/datasets@3.3.0...3.3.1
v3.3.0 (Compare Source)
Dataset Features
Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384
Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198
Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in https://github.com/huggingface/datasets/pull/7370
Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207
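To illustrate what async support in map() makes possible, the sketch below runs a coroutine over several examples concurrently using only asyncio. The names add_length and map_async are hypothetical, and this is not how datasets implements the feature; it only shows the general pattern of awaiting per-example work concurrently while preserving input order.

```python
import asyncio


async def add_length(example: dict) -> dict:
    """Hypothetical per-example coroutine; the sleep stands in for awaited I/O."""
    await asyncio.sleep(0)
    return {**example, "length": len(example["text"])}


async def map_async(examples: list[dict], fn) -> list[dict]:
    """Apply an async function to every example concurrently.

    asyncio.gather returns results in the order the awaitables were passed,
    so output order matches input order.
    """
    return await asyncio.gather(*(fn(example) for example in examples))


examples = [{"text": "hello"}, {"text": "hi"}]
mapped = asyncio.run(map_async(examples, add_length))
```

This pattern pays off when each example needs a network or disk call, since the waits overlap instead of running one after another.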
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.2.0...3.3.0
v3.2.0 (Compare Source)
Dataset Features
Other improvements and bug fixes
ClassLabel by @sergiopaniego in https://github.com/huggingface/datasets/pull/7293
New Contributors
Full Changelog: huggingface/datasets@3.1.0...3.2.0
v3.1.0 (Compare Source)
Dataset Features
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.0.2...3.1.0
v3.0.2 (Compare Source)
Main bug fixes
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.0.1...3.0.2
v3.0.1 (Compare Source)
What's Changed
New Contributors
Full Changelog: huggingface/datasets@3.0.0...3.0.1
v3.0.0 (Compare Source)
Dataset Features
.map()
Allow Polars as valid output type by @psmyth94 in https://github.com/huggingface/datasets/pull/6762
Example:
Cache Changes
huggingface_hub cache by @lhoestq in https://github.com/huggingface/datasets/pull/7105
huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
datasets cache, by default at ~/.cache/huggingface/datasets
Breaking changes
use_auth_token, fs or ignore_verifications
load_metric, please use the evaluate library instead
task argument in load_dataset()
.prepare_for_task() method, datasets.tasks module
General improvements and bug fixes
cache_dir from cache_file_name by @ringohoffman in https://github.com/huggingface/datasets/pull/7096
New Contributors
Full Changelog: huggingface/datasets@2.21.0...3.0.0
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Renovate Bot.