Feature proposal: IdSetSparse for Ids spanning a large value range #406

@danpat

We use libosmium and osmium-tool for many things. One of them is working with OSM-format data that has been modified or augmented: new ways and nodes, merged additional data, edits, and so on.

To avoid conflicts with the OSM ID space, we typically give new or modified entities IDs in a much higher part of the 64-bit ID space. Very large IDs, like 3847193284910278, are not uncommon. Sometimes IDs are generated spatially by casting integer lon/lat values into masked parts of the 64-bit ID; sometimes we use hashes of various properties to create semi-stable IDs.
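To make the spatial scheme concrete, here is a minimal sketch of one way integer lon/lat values could be packed into a 64-bit ID. The bit layout, the 1e-7 degree units, and the names `pack_spatial_id`, `unpack_spatial_id` and `LonLat` are all hypothetical; the actual masking used in practice is not described here and may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout (an assumption, not the scheme from the text above):
// lon and lat are integers in units of 1e-7 degrees, shifted to be
// non-negative, with lon in the upper 32 bits and lat in the lower 31 bits.
constexpr std::int64_t kLonOffset = 1800000000; // 180 degrees
constexpr std::int64_t kLatOffset = 900000000;  // 90 degrees
constexpr unsigned kLatBits = 31;

struct LonLat {
    std::int32_t lon;
    std::int32_t lat;
};

// Pack a coordinate into a non-negative 63-bit value.
inline std::int64_t pack_spatial_id(std::int32_t lon, std::int32_t lat) {
    assert(lon >= -kLonOffset && lon <= kLonOffset);
    assert(lat >= -kLatOffset && lat <= kLatOffset);
    const auto lon_u = static_cast<std::uint64_t>(lon + kLonOffset);
    const auto lat_u = static_cast<std::uint64_t>(lat + kLatOffset);
    return static_cast<std::int64_t>((lon_u << kLatBits) | lat_u);
}

// Recover the coordinate from an ID produced by pack_spatial_id().
inline LonLat unpack_spatial_id(std::int64_t id) {
    const auto u = static_cast<std::uint64_t>(id);
    const std::uint64_t lat_u = u & ((std::uint64_t{1} << kLatBits) - 1);
    const std::uint64_t lon_u = u >> kLatBits;
    return LonLat{
        static_cast<std::int32_t>(static_cast<std::int64_t>(lon_u) - kLonOffset),
        static_cast<std::int32_t>(static_cast<std::int64_t>(lat_u) - kLatOffset)};
}
```

IDs produced this way land almost anywhere in the positive 64-bit range, which is exactly the distribution that a dense ID set handles badly. A real scheme would presumably also reserve some bits to stay clear of the OSM ID range.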

When using IdSetDense (which is what osmium-tool uses), this leads to massive memory allocation problems: as soon as a very high ID is encountered, the dense ID set is resized to something extremely large (hundreds of GB or more). We often want to manipulate fairly large parts of the dataset, and IdSetSmall doesn't perform well once it gets too big.
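For a sense of scale, here is some back-of-the-envelope arithmetic. `flat_bitmap_bytes` assumes one bit per ID up to the largest ID seen; `chunk_table_bytes` assumes a chunked bitmap where only a table of chunk pointers grows with the largest ID. Both function names are hypothetical, and the chunk and pointer sizes are parameters rather than values taken from libosmium's source, so check the real IdSetDense layout before relying on the numbers.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed by a flat bitmap holding one bit per ID in [0, max_id].
constexpr std::uint64_t flat_bitmap_bytes(std::uint64_t max_id) {
    return max_id / 8 + 1;
}

// Bytes needed just for the top-level pointer table of a chunked bitmap,
// before any chunk is allocated. chunk_bytes and pointer_bytes are
// assumptions supplied by the caller.
inline std::uint64_t chunk_table_bytes(std::uint64_t max_id,
                                       std::uint64_t chunk_bytes,
                                       std::uint64_t pointer_bytes) {
    assert(chunk_bytes > 0);
    const std::uint64_t ids_per_chunk = chunk_bytes * 8; // one bit per ID
    return (max_id / ids_per_chunk + 1) * pointer_bytes;
}
```

Under these assumptions, the example ID 3847193284910278 would need roughly 480 TB as a flat bitmap, and an ID near 2^60 would need about 256 GiB for the pointer table alone with 4 MiB chunks and 8-byte pointers.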

I'd like to propose that libosmium gain support for an IdSetSparse: a hash-backed ID set for use when the IDs are spread broadly across the 64-bit space. In local testing with a hacked fork of libosmium, it makes osmium extract quite usable on .pbf files with sparsely allocated IDs that the mainline tool cannot handle (OOM errors).

@joto I'm happy to write and submit a PR for this if you think it's a good idea. In my local experiments, a couple of things popped up that should be discussed:

  • use of third-party includes - I used ankerl::unordered_dense::set from https://github.com/martinus/unordered_dense - I gather that adding a dependency is not your preference in libosmium, but please say if that's not the case
  • std::unordered_set is pretty slow - it works, but ankerl blows it away
  • I could implement a standalone open-addressed set just for this, similar in style to what IdSetDense does. That would keep the implementation tightly within libosmium, at the expense of more lines of code
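As a rough illustration of the standalone option, here is a minimal sketch of an open-addressed set of 64-bit IDs using linear probing. The class name `OpenAddressedIdSet`, the `set`/`get`/`empty`/`clear` method names, the initial capacity of 16 and the 1/2 load factor are all assumptions made for this sketch; it is not code from libosmium or from the fork mentioned above, and it has not been benchmarked against ankerl or std::unordered_set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Value marking an empty slot. If a caller inserts this exact value,
// it is tracked in a separate flag instead of in the table.
constexpr std::uint64_t kEmptySlot = ~std::uint64_t{0};

class OpenAddressedIdSet {
    std::vector<std::uint64_t> m_slots; // size is always a power of two
    std::size_t m_size = 0;             // IDs stored in m_slots
    bool m_has_sentinel = false;        // true if kEmptySlot was inserted

    // Mixing steps from the splitmix64 finalizer, to spread clustered IDs.
    static std::uint64_t mix(std::uint64_t x) noexcept {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    // Index of the slot holding id, or of the first empty slot on its
    // probe sequence. Terminates because the table is never full.
    std::size_t find_slot(std::uint64_t id) const noexcept {
        const std::size_t mask = m_slots.size() - 1;
        std::size_t i = static_cast<std::size_t>(mix(id)) & mask;
        while (m_slots[i] != kEmptySlot && m_slots[i] != id) {
            i = (i + 1) & mask; // linear probing
        }
        return i;
    }

    // Double the table and reinsert every stored ID.
    void grow() {
        std::vector<std::uint64_t> old(m_slots.size() * 2, kEmptySlot);
        old.swap(m_slots);
        for (const std::uint64_t id : old) {
            if (id != kEmptySlot) {
                m_slots[find_slot(id)] = id;
            }
        }
    }

public:
    OpenAddressedIdSet() : m_slots(16, kEmptySlot) {}

    void set(std::uint64_t id) {
        if (id == kEmptySlot) { m_has_sentinel = true; return; }
        std::size_t i = find_slot(id);
        if (m_slots[i] == id) { return; } // already present
        if ((m_size + 1) * 2 > m_slots.size()) { // keep load factor <= 1/2
            grow();
            i = find_slot(id);
        }
        assert(m_slots[i] == kEmptySlot);
        m_slots[i] = id;
        ++m_size;
    }

    bool get(std::uint64_t id) const noexcept {
        if (id == kEmptySlot) { return m_has_sentinel; }
        return m_slots[find_slot(id)] == id;
    }

    std::size_t size() const noexcept { return m_size + (m_has_sentinel ? 1 : 0); }
    bool empty() const noexcept { return size() == 0; }

    void clear() {
        m_slots.assign(16, kEmptySlot);
        m_size = 0;
        m_has_sentinel = false;
    }
};
```

Memory use here scales with the number of stored IDs (8 bytes per slot, at most half of them occupied) rather than with the largest ID, which is the property that matters for sparsely allocated IDs. The sketch omits removal and iteration, which a real implementation would need to consider.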

Thoughts? If this is acceptable, I can make the PR, and I would additionally make a follow-up submission to osmium-tool to add support for runtime selection of the ID set store. This would be a handy feature for anyone working with OSM-like data with sparse IDs.
