Change get / update dataset metadata to handle two more metadata fields#958
Change get / update dataset metadata to handle two more metadata fields#958
Conversation
2272478 to
334381e
Compare
jplotts
left a comment
There was a problem hiding this comment.
Thanks! Just a few questions / comments, but overall LGTM.
| with tempfile.TemporaryDirectory() as tmpdir: | ||
| meta_file = os.path.join(tmpdir, "kernel-metadata.json") | ||
| (fd, code_file) = tempfile.mkstemp("code", "py", tmpdir, text=True) | ||
| fd, code_file = tempfile.mkstemp("code", "py", tmpdir, text=True) |
There was a problem hiding this comment.
j/w - did you do these formatting changes manually? I know that @stevemessick has some autoformatting set up in this repo, so just want to make sure we're not messing with that.
There was a problem hiding this comment.
yep, this is from the tooling. it came from ./docker-hatch run lint:fmt which should be the correct tool path.
something i did notice is that if someone else uses hatch run lint:fmt then it will use their machine's python version, which might format differently.
i used the docker one because i assumed it was the way we keep it consistent. but, the docker python version is a bit outdated. so i'm guessing what happened is someone used the linter w/local python instead of docker one time, and it slipped in.
| ]: | ||
| return expected_update_frequency | ||
|
|
||
| return "not specified" |
There was a problem hiding this comment.
Should we notify the user in any way?
There was a problem hiding this comment.
hm, yeah that's a good point. i wrote this before i wrote the MT code, but i think instead, we should not try to sanitize their input and just let the MT fail with the nice user-facing error message from the MT (which shows the list of valid values)
| if not expected_update_frequency: | ||
| return "not specified" | ||
|
|
||
| if expected_update_frequency in [ |
There was a problem hiding this comment.
Not sure about this; now the canonical list is stored in 2 places. What about just sending the bad value off to the MT to be rejected?
There was a problem hiding this comment.
yeah i agree. i just had the same thought when responding to your comment above. i wrote the CLI code first before the MT code, and didn't think about it.
will change to your suggestion.
| ] | ||
| ], | ||
| "expectedUpdateFrequency": "monthly", | ||
| "userSpecifiedSources": "World Bank and OECD ([link](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD))", |
There was a problem hiding this comment.
This is the internal field name. open to better alternatives.
I had wanted to just call it sources, but given we follow the Data Package format as much as possible, it conflicts with their "sources". which are objects that have a Title (required) and Path (optional): https://specs.frictionlessdata.io//data-package/#sources
However, DatasetVersion.UserSpecifiedSources is a markdown string, so that's not feasible to turn into an array of objects and back, without some heavy assumptions that could be wrong (comma separation, etc).
It seemed like too much work atm to refactor how we store UserSpecifiedSources to use a collection of objects just for this, so i went with a name that was significantly deviated from the Data Package spec to make it clear it's something else.
Open to other suggestions.
There was a problem hiding this comment.
@goeffthomas for your thoughts here, given you mentioned Data Package adherence
There was a problem hiding this comment.
Thanks for laying all of that out. +1 to your assessment and naming. I think my earlier comment about adhering and using this property was before knowing how they recommend extending. So yeah, if we want something that more matches how we let users specify the sources, this all makes sense to me.
There was a problem hiding this comment.
Thanks for the feedback! SG, i'll continue on this way to match our UI
| "weekly", | ||
| "daily", | ||
| "hourly", | ||
| "unspecified", # Some datasets have this still, but it is not canonical |
There was a problem hiding this comment.
there's only 94 datasets using unspecified instead of not specified, but it felt wrong that you could theoretically pull metadata down via get, which you couldn't push back up in update: http://screen/AU3qjn9KnaWYyfk
We should probably do a data migration to get things to line up to the enum, then migrate that column to an enum. filed a ticket for it: http://b/500450622
| with tempfile.TemporaryDirectory() as tmpdir: | ||
| meta_file = os.path.join(tmpdir, "kernel-metadata.json") | ||
| (fd, code_file) = tempfile.mkstemp("code", "py", tmpdir, text=True) | ||
| fd, code_file = tempfile.mkstemp("code", "py", tmpdir, text=True) |
There was a problem hiding this comment.
yep, this is from the tooling. it came from ./docker-hatch run lint:fmt which should be the correct tool path.
something i did notice is that if someone else uses hatch run lint:fmt then it will use their machine's python version, which might format differently.
i used the docker one because i assumed it was the way we keep it consistent. but, the docker python version is a bit outdated. so i'm guessing what happened is someone used the linter w/local python instead of docker one time, and it slipped in.
| ]: | ||
| return expected_update_frequency | ||
|
|
||
| return "not specified" |
There was a problem hiding this comment.
hm, yeah that's a good point. i wrote this before i wrote the MT code, but i think instead, we should not try to sanitize their input and just let the MT fail with the nice user-facing error message from the MT (which shows the list of valid values)
| if not expected_update_frequency: | ||
| return "not specified" | ||
|
|
||
| if expected_update_frequency in [ |
There was a problem hiding this comment.
yeah i agree. i just had the same thought when responding to your comment above. i wrote the CLI code first before the MT code, and didn't think about it.
will change to your suggestion.
Created with: `pip-compile pyproject.toml -o requirements.lock`. `kagglesdk` dep was bumped from #958 (needed to `pip install piptools` w/ --break-system-packages but that seems fine)
Leverages changes to
DatasetApiService/V1/UpdateDatasetMetadataHandlerandDatasetApiService/V1/GetDatasetMetadataHandlerto accept expected update frequency and user specified sources.Wait before merging:
Other changes:
virtualenvversion so./docker-hatchworksLocal testing:
cc: @goeffthomas
http://b/500108129