Bug in chunk creation with multiple uris in asset_storage #199

@TomAugspurger

In #198, I'm attempting to use multiple URIs to collect assets under two different prefixes into the same collection:

collections:
  - id: cil-gdpcir-cc-by
    template: ${{ local.path(./collection/cil-gdpcir-cc-by) }}
    class: pctasks.dataset.collection:PremadeItemCollection
    asset_storage:
      - uri: blob://rhgeuwest/cil-gdpcir-stac
        token: ${{ pc.get_token(rhgeuwest, cil-gdpcir-stac) }}
        chunks:
          options:
            name_starts_with: CC-BY-4.0/

      - uri: blob://rhgeuwest/cil-gdpcir-stac
        token: ${{ pc.get_token(rhgeuwest, cil-gdpcir-stac) }}
        chunks:
          options:
            name_starts_with: CC-BY-SA-4.0/

When I ran this, I noticed that both entries write their chunks to the same file. Here are their outputs:

      {
        "uri": "blob://rhgeuwest/cil-gdpcir-etl-data/chunks/cc-by/2023-05-03-cc-by-fix/assets/all/rhgeuwest/cil-gdpcir-stac/0/uris-list.csv",
        "chunk_id": "rhgeuwest/cil-gdpcir-stac/0/uris-list.csv"
      }

and

      {
        "uri": "blob://rhgeuwest/cil-gdpcir-etl-data/chunks/cc-by/2023-05-03-cc-by-fix/assets/all/rhgeuwest/cil-gdpcir-stac/0/uris-list.csv",
        "chunk_id": "rhgeuwest/cil-gdpcir-stac/0/uris-list.csv"
      }

Perhaps we need to coordinate partition numbers between the two? Or add an extra level to the path, something like the index of the mapping under asset_storage?
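
A rough sketch of the second idea, to make the collision concrete. This is not the pctasks implementation: `chunk_id` and its `storage_index` parameter are hypothetical names, and the path layout is only inferred from the outputs above.

```python
from typing import Optional


def chunk_id(
    account: str,
    container: str,
    partition: int,
    storage_index: Optional[int] = None,
) -> str:
    """Build a chunk id (hypothetical helper, not pctasks code).

    When storage_index is given, it is added as an extra path level so
    that two asset_storage entries on the same container do not collide.
    """
    parts = [account, container]
    if storage_index is not None:
        parts.append(str(storage_index))
    parts += [str(partition), "uris-list.csv"]
    return "/".join(parts)


# Two asset_storage entries on the same container, as in the config above.
entries = [
    {"account": "rhgeuwest", "container": "cil-gdpcir-stac", "prefix": "CC-BY-4.0/"},
    {"account": "rhgeuwest", "container": "cil-gdpcir-stac", "prefix": "CC-BY-SA-4.0/"},
]

# Without the index, both entries produce the same id for partition 0.
current = [chunk_id(e["account"], e["container"], 0) for e in entries]

# With the index of the entry under asset_storage, the ids differ.
proposed = [
    chunk_id(e["account"], e["container"], 0, storage_index=i)
    for i, e in enumerate(entries)
]
```

The other option, coordinating partition numbers, would instead keep the path layout and start the second entry's partition counter where the first one ended.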

Labels: bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions