Merged
2 changes: 1 addition & 1 deletion .github/workflows/pr.yml
@@ -272,7 +272,7 @@ jobs:
skip_mdx_pr_step: ${{ steps.skip_mdx_pr_step_setting.outputs.skip_mdx_pr_step }}
steps:
- name: Checkout code
- uses: actions/checkout@v4
+ uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

- name: Set SKIP_MDX_PR_STEP condition
id: skip_mdx_pr_step_setting
26 changes: 16 additions & 10 deletions README.md
@@ -1,10 +1,12 @@
# veda-data

This repository houses data and config used to create STAC records to be published to the veda STAC catalog.

# Repository layout
## Repository layout

The repo follows the following folder structure:

```
```plain
| ingestion-data
| collections
| archive
@@ -14,10 +16,10 @@ The repo follows the following folder structure:
. collection-n.json
| discovery-items
| archive
. archived-discovery-items-1.json
. archived-discovery-items-2.json
. archived-discovery-items-1.json
. archived-discovery-items-2.json
...
. archived-discovery-items-n.json
. archived-discovery-items-n.json
. discovery-items-1.json
. discovery-items-2.json
...
@@ -28,12 +30,15 @@ The repo follows the following folder structure:
## ingestion-data

### collections

These are STAC collection records for all the available datasets. They should conform to the [STAC specification for a collection](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md).

#### archive

These are the collections that we no longer update. However, we might still maintain them in the catalog.

### discovery-items

These are the item ingestion config files used by our data pipelines (Airflow), specifically the `veda_discover` DAG in [veda-data-airflow](https://github.com/NASA-IMPACT/veda-data-airflow). That DAG discovers all the specified files and triggers the `veda_ingest_raster` DAG, which creates the STAC items and publishes them.

The format looks like this:
@@ -63,10 +68,10 @@ The format looks like this:
```

### transfer-config
These are the configs used to transfer assets from the dev bucket (`ghgc-data-store-develop` - where the data was delivered) to the production bucket (`ghgc-data-store` - where the data is moved after it is finalized). The files from the production bucket is used to publish to the catalog. The transfer is done via triggering `veda_transfer` DAG in [veda-data-airflow](https://github.com/NASA-IMPACT/veda-data-airflow).

##### Description of each field:
These are the configs used to transfer assets from the dev bucket (`ghgc-data-store-develop` - where the data was delivered) to the production bucket (`ghgc-data-store` - where the data is moved after it is finalized). The files from the production bucket are used to publish to the catalog. The transfer is done by triggering the `veda_transfer` DAG in [veda-data-airflow](https://github.com/NASA-IMPACT/veda-data-airflow).

#### Description of each field

| Field | Description |
|--------------------|--------------------------------------------------|
@@ -75,16 +80,17 @@ These are the configs used to transfer assets from the dev bucket (`ghgc-data-st
| `prefix` | The s3 prefix under which to search for the files |
| `filename_regex` | The regex pattern that the files to be discovered should match |
| `id_regex` | Specifies in regex what part of the filename (usually the datetime) should be used to group assets into items. Example: if the filenames are `asset1_20151201.tif`, `asset2_20151201.tif`, `asset1_20161201.tif`, `asset2_20161201.tif`, the items should be based on the datetime part, hence it'd be `".*_(.*).tif$"`. The part should be specified using round brackets. This is also the part of the filenames that will be used to form the item id, together with the `id_template` field. |
| `id_template` | This is a python f-string formatted string that is used to define the `id` of the STAC item. It's used together with the value of `id_regex`. So, going off of the example above, if the `id_template` is `eccodarwin-{}`, then the two item `id`s would be `eccodarwin-20151201` and `eccodarwin-20161201` |
| `id_template` | This is a python f-string formatted string that is used to define the `id` of the STAC item. It's used together with the value of `id_regex`. So, going off of the example above, if the `id_template` is `eccodarwin-{}`, then the two item `id`s would be `eccodarwin-20151201` and `eccodarwin-20161201` |
| `datetime_range` | This is used to extract the datetime range from the filename. Valid values are `day`, `month` and `year`. Example: if the filename has `20160104` in it, and `datetime_range` is `day` - the `start_datetime` and `end_datetime` are the start and end of the day. For `month`, they are the start and end of the month and so on. |
| `<asset_name>` | An `id` for the asset |
| `assets.<asset_name>.title` | A title for the asset |
| `assets.<asset_name>.description` | A description for the asset |
| `assets.<asset_name>.regex` | The regex pattern that matches a filename to its respective asset |
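
As a rough illustration of how `id_regex`, `id_template` and `datetime_range` fit together, here is a minimal Python sketch. The helper names `group_assets` and `datetime_bounds` are hypothetical and are not taken from the pipeline code. The sketch assumes that the template is applied with `str.format`, that the captured part is a `YYYYMMDD` date, and that a range ends at the last second of the period; the actual DAGs may differ on any of these points.

```python
import calendar
import re
from collections import defaultdict
from datetime import datetime


def group_assets(filenames, id_regex, id_template):
    """Group filenames into items keyed by the id built from the captured part.

    Hypothetical helper: assumes the template is filled in with str.format.
    """
    pattern = re.compile(id_regex)
    items = defaultdict(list)
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            continue  # filename does not belong to any item
        # The round brackets in id_regex capture the part used for the id.
        item_id = id_template.format(match.group(1))
        items[item_id].append(name)
    return dict(items)


def datetime_bounds(stamp, datetime_range):
    """Return (start, end) for a YYYYMMDD stamp; the end-of-period is an assumption."""
    day = datetime.strptime(stamp, "%Y%m%d")
    if datetime_range == "day":
        start = day
        end = day.replace(hour=23, minute=59, second=59)
    elif datetime_range == "month":
        start = day.replace(day=1)
        last_day = calendar.monthrange(day.year, day.month)[1]
        end = start.replace(day=last_day, hour=23, minute=59, second=59)
    elif datetime_range == "year":
        start = day.replace(month=1, day=1)
        end = start.replace(month=12, day=31, hour=23, minute=59, second=59)
    else:
        raise ValueError(f"Unsupported datetime_range: {datetime_range!r}")
    return start, end


if __name__ == "__main__":
    files = [
        "asset1_20151201.tif",
        "asset2_20151201.tif",
        "asset1_20161201.tif",
        "asset2_20161201.tif",
    ]
    # Two items result: eccodarwin-20151201 and eccodarwin-20161201.
    print(group_assets(files, r".*_(.*).tif$", "eccodarwin-{}"))
    print(datetime_bounds("20160104", "month"))
```

With the filenames from the table, the four files collapse into the two item ids given in the `id_template` row, each holding its two assets.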

#### config archive

#### archive
These are the discovery-items configs for collections that we no longer update.

## notebooks
Sometimes, there are exceptional datasets that might require a one-off ingestion that is not supported by the current state of our data pipelines. In such cases, we create notebooks/python scripts that can be used to ingest those data. This is where those notebooks/python scripts live.

Sometimes, there are exceptional datasets that might require a one-off ingestion that is not supported by the current state of our data pipelines. In such cases, we create notebooks/python scripts that can be used to ingest those data. This is where those notebooks/python scripts live.
2 changes: 1 addition & 1 deletion requirements.in
@@ -3,4 +3,4 @@ pip-tools
pre-commit
pystac[validation]
pytest
- ruff
+ ruff
2 changes: 1 addition & 1 deletion scripts/generate_mdx.py
@@ -127,4 +127,4 @@ def safe_open_w(path):
ofile.write(new_content)

collection_id = input_data["collection"]
- print(collection_id)
+ print(collection_id)
2 changes: 1 addition & 1 deletion scripts/promote_collection.py
@@ -77,4 +77,4 @@ def trigger_collection_dag(payload: Dict[str, Any], stage: str):
except FileNotFoundError:
print(f"Error: File '{sys.argv[1]}' not found.")
except json.JSONDecodeError:
- raise ValueError(f"Invalid JSON content in file {sys.argv[1]}")
+ raise ValueError(f"Invalid JSON content in file {sys.argv[1]}")
2 changes: 1 addition & 1 deletion scripts/promote_dataset.py
@@ -131,4 +131,4 @@ def promote_to_production(payload):
except FileNotFoundError:
print(f"Error: File '{sys.argv[1]}' not found.")
except json.JSONDecodeError:
- raise ValueError(f"Invalid JSON content in file {sys.argv[1]}")
+ raise ValueError(f"Invalid JSON content in file {sys.argv[1]}")
2 changes: 1 addition & 1 deletion tests/test_collections.py
@@ -10,4 +10,4 @@
@pytest.mark.parametrize("path", COLLECTIONS_PATH.rglob("*.json"))
def test_validate(path: Path) -> None:
collection = Collection.from_file(str(path))
- collection.validate()
+ collection.validate()