Feature proposal: IdSetSparse for Ids spanning a large value range #406

We use `libosmium` and `osmium-tool` for many things.  One of them is processing OSM-format data that has been modified or augmented: new ways and nodes, merged additional data, edits, and so on.

To avoid conflicts with the OSM ID space, we typically give new or modified entities IDs in a much higher part of the 64-bit ID space.  Very large IDs, like `3847193284910278`, are not uncommon.  Sometimes IDs are generated spatially by casting integer lon/lat values into masked parts of the 64-bit ID; sometimes we use hashes of various properties to create semi-stable IDs.

When using `IdSetDense` (which is what `osmium-tool` uses), this leads to massive memory allocation problems: as soon as a very high ID is encountered, the dense ID set is resized to something extremely large (hundreds of GB or more).  We often want to manipulate fairly large parts of the dataset, and `IdSetSmall` doesn't perform well once it gets too big.
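
To make the scaling concrete, here is a rough back-of-the-envelope sketch. It does not reproduce libosmium's actual `IdSetDense` layout; it assumes two simplified models (a flat bitmap with one bit per ID, and a chunked bitmap that still needs one pointer per chunk up to the largest ID). The function names and the chunk and pointer sizes are illustrative assumptions. In both models the footprint is driven by the largest ID seen, not by the number of IDs stored. For `3847193284910278`, the flat model alone comes to roughly 480 TB.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model (assumption): a flat bitmap with one bit per ID,
// sized so that max_id itself fits. Not libosmium's actual layout.
constexpr std::uint64_t flat_bitmap_bytes(std::uint64_t max_id) noexcept {
    return max_id / 8 + 1;
}

// Simplified model (assumption): a chunked bitmap that allocates chunks
// lazily but keeps one pointer per chunk up to the chunk holding max_id.
// chunk_bytes and pointer_bytes are illustrative parameters.
inline std::uint64_t chunk_index_bytes(std::uint64_t max_id,
                                       std::uint64_t chunk_bytes,
                                       std::uint64_t pointer_bytes) noexcept {
    assert(chunk_bytes > 0);
    const std::uint64_t ids_per_chunk = chunk_bytes * 8; // one bit per ID
    return (max_id / ids_per_chunk + 1) * pointer_bytes;
}
```

Either way, a single stray high ID determines the allocation, which is the behaviour a hash-backed set avoids.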

I'd like to propose that `libosmium` gain support for an `IdSetSparse`: a hash-backed ID set for use when the IDs are broadly spread across the 64-bit space.  In local testing (with a hacked fork of libosmium), it makes `osmium extract` quite usable on `.pbf` files with sparsely allocated IDs that the mainline tool cannot handle (OOM errors).

@joto I'm happy to write and submit a PR for this if you think it's a good idea.  In my local experiments, a couple of things popped up that should be discussed:

  - Use of third-party includes: I used `ankerl::unordered_dense::set` from https://github.com/martinus/unordered_dense.  I gather that's not your preference in `libosmium`, but please say if it's otherwise.
  - `std::unordered_set` is pretty slow.  It works, but `ankerl` blows it away.
  - I could implement a standalone open-addressed set just for this, similar in style to what `IdSetDense` does.  That would keep the implementation entirely within libosmium, at the expense of more lines of code.
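
To give an idea of the size of the third option, here is a minimal sketch of a standalone open-addressed set with linear probing. It is not the code from my fork and not an existing libosmium class: the name `IdSetSparseSketch`, the `set`/`get`/`size`/`empty`/`clear` methods, the unsigned 64-bit ID type, the initial capacity of 16 and the 75% load factor are all assumptions made for illustration. The hash is the well-known splitmix64 finalizer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel marking an unused slot; the ID with this value is tracked separately.
constexpr std::uint64_t sparse_empty_slot = ~std::uint64_t{0};

// Hypothetical sketch of a hash-backed ID set (not part of libosmium).
class IdSetSparseSketch {
    std::vector<std::uint64_t> m_slots; // power-of-two sized table
    std::size_t m_size = 0;             // stored IDs, excluding the sentinel value
    bool m_has_sentinel = false;        // true if the sentinel value itself was added

    // splitmix64 finalizer: spreads clustered IDs across the table.
    static std::uint64_t mix(std::uint64_t x) noexcept {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    // Linear probing: returns the slot holding id, or the empty slot where it belongs.
    static std::size_t find_slot(const std::vector<std::uint64_t>& slots,
                                 std::uint64_t id) noexcept {
        assert(id != sparse_empty_slot);
        const std::size_t mask = slots.size() - 1;
        std::size_t i = static_cast<std::size_t>(mix(id)) & mask;
        while (slots[i] != sparse_empty_slot && slots[i] != id) {
            i = (i + 1) & mask;
        }
        return i;
    }

    // Double the table and re-insert every stored ID.
    void grow() {
        std::vector<std::uint64_t> bigger(m_slots.size() * 2, sparse_empty_slot);
        for (const std::uint64_t id : m_slots) {
            if (id != sparse_empty_slot) {
                bigger[find_slot(bigger, id)] = id;
            }
        }
        m_slots.swap(bigger);
    }

public:
    IdSetSparseSketch() : m_slots(16, sparse_empty_slot) {}

    void set(std::uint64_t id) {
        if (id == sparse_empty_slot) { m_has_sentinel = true; return; }
        // Keep the load factor at or below 75% so probing always finds an empty slot.
        if ((m_size + 1) * 4 > m_slots.size() * 3) {
            grow();
        }
        const std::size_t i = find_slot(m_slots, id);
        if (m_slots[i] == sparse_empty_slot) {
            m_slots[i] = id;
            ++m_size;
        }
    }

    bool get(std::uint64_t id) const noexcept {
        if (id == sparse_empty_slot) { return m_has_sentinel; }
        return m_slots[find_slot(m_slots, id)] == id;
    }

    std::size_t size() const noexcept { return m_size + (m_has_sentinel ? 1 : 0); }
    bool empty() const noexcept { return size() == 0; }

    void clear() {
        m_slots.assign(16, sparse_empty_slot);
        m_size = 0;
        m_has_sentinel = false;
    }
};
```

Memory here scales with the number of IDs stored (8 bytes per slot, at most 75% full) rather than with the largest ID. A real implementation would also need iteration and whatever else the existing ID set classes provide.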

Thoughts?  If this is acceptable, I can make the PR, and I would additionally make a follow-up submission to `osmium-tool` to add support for runtime selection of the ID set store.  This would be a handy feature for anyone working with OSM-like data with sparse IDs.
