Skip to content

Change get / update dataset metadata to handle two more metadata fields#958

Merged
jmasukawa merged 7 commits intomainfrom
datasets-create-more-metadata
Apr 8, 2026
Merged

Change get / update dataset metadata to handle two more metadata fields#958
jmasukawa merged 7 commits intomainfrom
datasets-create-more-metadata

Conversation

@jmasukawa
Copy link
Copy Markdown
Contributor

@jmasukawa jmasukawa commented Apr 7, 2026

Leverages changes to DatasetApiService/V1/UpdateDatasetMetadataHandler and DatasetApiService/V1/GetDatasetMetadataHandler to accept expected update frequency and user specified sources.

Wait before merging:

Other changes:

  • Pin virtualenv version so ./docker-hatch works
  • Run lint / format before making any changes (there was drift)
  • Docs updates for datasets / datasets metadata

Local testing:

cc: @goeffthomas

http://b/500108129

@jmasukawa jmasukawa force-pushed the datasets-create-more-metadata branch from 2272478 to 334381e Compare April 7, 2026 09:06
@jmasukawa jmasukawa requested a review from jplotts April 7, 2026 09:10
Copy link
Copy Markdown
Contributor

@jplotts jplotts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a few questions / comments, but overall LGTM.

with tempfile.TemporaryDirectory() as tmpdir:
meta_file = os.path.join(tmpdir, "kernel-metadata.json")
(fd, code_file) = tempfile.mkstemp("code", "py", tmpdir, text=True)
fd, code_file = tempfile.mkstemp("code", "py", tmpdir, text=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

j/w - did you do these formatting changes manually? I know that @stevemessick has some autoformatting set up in this repo, so just want to make sure we're not messing with that.

Copy link
Copy Markdown
Contributor Author

@jmasukawa jmasukawa Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, this is from the tooling. it came from ./docker-hatch run lint:fmt which should be the correct tool path.

something i did notice is that if someone else uses hatch run lint:fmt then it will use their machine's python version, which might format differently.

i used the docker one because i assumed it was the way we keep it consistent. but, the docker python version is a bit outdated. so i'm guessing what happened is someone used the linter w/local python instead of docker one time, and it slipped in.

Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
]:
return expected_update_frequency

return "not specified"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we notify the user in any way?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, yeah that's a good point. i wrote this before i wrote the MT code, but i think instead, we should not try to sanitize their input and just let the MT fail with the nice user-facing error message from the MT (which shows the list of valid values)

Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
if not expected_update_frequency:
return "not specified"

if expected_update_frequency in [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this; now the canonical list is stored in 2 places. What about just sending the bad value off to the MT to be rejected?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i agree. i just had the same thought when responding to your comment above. i wrote the CLI code first before the MT code, and didn't think about it.

will change to your suggestion.

Comment thread docs/datasets_metadata.md
]
],
"expectedUpdateFrequency": "monthly",
"userSpecifiedSources": "World Bank and OECD ([link](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD))",
Copy link
Copy Markdown
Contributor Author

@jmasukawa jmasukawa Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the internal field name. open to better alternatives.

I had wanted to just call it sources, but given we follow the Data Package format as much as possible, it conflicts with their "sources". which are objects that have a Title (required) and Path (optional): https://specs.frictionlessdata.io//data-package/#sources

However, DatasetVersion.UserSpecifiedSources is a markdown string, so that's not feasible to turn into an array of objects and back, without some heavy assumptions that could be wrong (comma separation, etc).

It seemed like too much work atm to refactor how we store UserSpecifiedSources to use a collection of objects just for this, so i went with a name that was significantly deviated from the Data Package spec to make it clear it's something else.

Open to other suggestions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@goeffthomas for your thoughts here, given you mentioned Data Package adherence

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for laying all of that out. +1 to your assessment and naming. I think my earlier comment about adhering and using this property was before knowing how they recommend extending. So yeah, if we want something that more matches how we let users specify the sources, this all makes sense to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! SG, i'll continue on this way to match our UI

Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
"weekly",
"daily",
"hourly",
"unspecified", # Some datasets have this still, but it is not canonical
Copy link
Copy Markdown
Contributor Author

@jmasukawa jmasukawa Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's only 94 datasets using unspecified instead of not specified, but it felt wrong that you could theoretically pull metadata down via get, which you couldn't push back up in update: http://screen/AU3qjn9KnaWYyfk

We should probably do a data migration to get things to line up to the enum, then migrate that column to an enum. filed a ticket for it: http://b/500450622

with tempfile.TemporaryDirectory() as tmpdir:
meta_file = os.path.join(tmpdir, "kernel-metadata.json")
(fd, code_file) = tempfile.mkstemp("code", "py", tmpdir, text=True)
fd, code_file = tempfile.mkstemp("code", "py", tmpdir, text=True)
Copy link
Copy Markdown
Contributor Author

@jmasukawa jmasukawa Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, this is from the tooling. it came from ./docker-hatch run lint:fmt which should be the correct tool path.

something i did notice is that if someone else uses hatch run lint:fmt then it will use their machine's python version, which might format differently.

i used the docker one because i assumed it was the way we keep it consistent. but, the docker python version is a bit outdated. so i'm guessing what happened is someone used the linter w/local python instead of docker one time, and it slipped in.

Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
]:
return expected_update_frequency

return "not specified"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, yeah that's a good point. i wrote this before i wrote the MT code, but i think instead, we should not try to sanitize their input and just let the MT fail with the nice user-facing error message from the MT (which shows the list of valid values)

Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
if not expected_update_frequency:
return "not specified"

if expected_update_frequency in [
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i agree. i just had the same thought when responding to your comment above. i wrote the CLI code first before the MT code, and didn't think about it.

will change to your suggestion.

@jmasukawa jmasukawa merged commit 1dca025 into main Apr 8, 2026
5 checks passed
@jmasukawa jmasukawa deleted the datasets-create-more-metadata branch April 8, 2026 22:29
jmasukawa added a commit that referenced this pull request Apr 9, 2026
Created with: `pip-compile pyproject.toml -o requirements.lock`.
`kagglesdk` dep was bumped from
#958

(needed to `pip install piptools` w/ --break-system-packages but that
seems fine)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants