-
Notifications
You must be signed in to change notification settings - Fork 4
Make XML metadata optional, extract from HDF if XML not found #97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make XML metadata optional, extract from HDF if XML not found #97
Conversation
|
@karthick-rn don't worry about the CI errors yet, this package has bit-rotted pretty hard, I'll clean it up. |
gadomski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a simple test function that exercises this functionality against https://github.com/stactools-packages/modis/blob/main/tests/data-files/MOD10A2.A2022033.h09v05.061.2022042050729.hdf? You can copy it to a temp directory to ensure there's no alongside XML asset.
Also needs a changelog entry.
@gadomski, Added the test function. Note: The MOD10A2 (snow cover) product doesn't have the QAPERCENTNOTPRODUCEDCLOUD tag in its HDF metadata, so I also made that field optional in from_cog_tags() to handle this case gracefully.
done |
This is a bit of a flag — do we know if there are other tags that will be missing for other products? We should probably test against every product type, even if they aren't automated (e.g. the "external data" tests). |
@gadomski, I manually verified all 53 XML test files (which contain the same metadata as HDF files):
The only missing tag is QAPERCENTNOTPRODUCEDCLOUD, which is already made optional. No other tags are missing. |
gadomski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting closer, a couple of small tweaks to the test setup and one question about a possibly-optional attribute.
tests/test_stac.py
Outdated
| item.validate() | ||
|
|
||
|
|
||
| def test_create_item_from_hdf_without_xml() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytest provides a fixture for a temporary directory, let's use that instead.
| def test_create_item_from_hdf_without_xml() -> None: | |
| def test_create_item_from_hdf_without_xml(tmp_path: Path) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
tests/test_stac.py
Outdated
| """Test that an item can be created from an HDF file when XML is not available. | ||
| This tests the fallback to extracting metadata directly from the HDF file | ||
| when the accompanying XML metadata file is not present. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally 👎🏼 on docstrings for tests, as I prefer to let the test function name and content speak for themselves.
| """Test that an item can be created from an HDF file when XML is not available. | |
| This tests the fallback to extracting metadata directly from the HDF file | |
| when the accompanying XML metadata file is not present. | |
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they're auto-generated comments and I have removed them.
tests/test_stac.py
Outdated
| # Copy only the HDF file (not the XML) to ensure XML is not available | ||
| temp_hdf_path = os.path.join(temporary_directory, hdf_file) | ||
| shutil.copyfile(source_hdf_path, temp_hdf_path) | ||
|
|
||
| # Verify XML does not exist in temp directory | ||
| temp_xml_path = f"{temp_hdf_path}.xml" | ||
| assert not os.path.exists(temp_xml_path), "XML file should not exist" | ||
|
|
||
| # Create item from HDF only - should extract metadata from HDF | ||
| item = stactools.modis.stac.create_item(temp_hdf_path) | ||
|
|
||
| # Verify item was created with correct metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look like auto-gen comments, again I generally prefer to let the code speak for itself.
| # Copy only the HDF file (not the XML) to ensure XML is not available | |
| temp_hdf_path = os.path.join(temporary_directory, hdf_file) | |
| shutil.copyfile(source_hdf_path, temp_hdf_path) | |
| # Verify XML does not exist in temp directory | |
| temp_xml_path = f"{temp_hdf_path}.xml" | |
| assert not os.path.exists(temp_xml_path), "XML file should not exist" | |
| # Create item from HDF only - should extract metadata from HDF | |
| item = stactools.modis.stac.create_item(temp_hdf_path) | |
| # Verify item was created with correct metadata | |
| temp_hdf_path = os.path.join(temporary_directory, hdf_file) | |
| shutil.copyfile(source_hdf_path, temp_hdf_path) | |
| temp_xml_path = f"{temp_hdf_path}.xml" | |
| assert not os.path.exists(temp_xml_path), "XML file should not exist" | |
| item = stactools.modis.stac.create_item(temp_hdf_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
tests/test_stac.py
Outdated
| assert item is not None | ||
| assert item.id.startswith("MOD10A2.A2022033.h09v05") | ||
| assert "hdf" in item.assets | ||
| assert "metadata" not in item.assets # XML asset should not be present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assert "metadata" not in item.assets # XML asset should not be present | |
| assert "metadata" not in item.assets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/stactools/modis/metadata.py
Outdated
| with warnings.catch_warnings(): | ||
| warnings.simplefilter("ignore", category=NotGeoreferencedWarning) | ||
| with rasterio.open(read_href) as dataset: | ||
| hdf_tags = dataset.tags() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to catch this warning? Feels like we do want to propagate this up to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the suppression. It is now visible in the output.
- Revert TileID to required (present in all products) - Remove NotGeoreferencedWarning suppression - Use pytest tmp_path fixture in test - Clean up test comments
@gadomski, Thanks for reviewing. I have updated the code based on your comments and also tested it, all looks good. When you get a moment, could you review and approve it? |
Description: There is no xml file along with .hdf file for modis dataset anymore. Updated the code accordingly, also for STAC item creation it expects the metadata which is now extracted from the .hdf file.
These changes have been tested by rebuilding the image with the new commit hash from the fork and it ingested the STAC item successfully. This is the link to the ingested item into Open PC - https://planetarycomputer.microsoft.com/explore?c=-134.4470%2C55.3130&z=4.32&v=2&d=modis-43A4-061&m=cql%3A5f2d6e100cd82009329bc4df3e51de56&r=Natural+color&s=false%3A%3A100%3A%3Atrue&sr=desc&ae=0
PR checklist:
scripts/format).scripts/lint).scripts/test).