We use libosmium and osmium-tool for many things. One of them is processing OSM-format data that has been modified or augmented: new ways and nodes, merged additional data, edits, etc.
To avoid conflicts with the OSM ID space, we typically give new or modified entities IDs in a much higher part of the 64-bit ID space. Very large IDs, like 3847193284910278, are not uncommon. Sometimes IDs are generated spatially by casting integer lon/lat values into masked parts of the 64-bit ID; sometimes we use hashes of various properties to create semi-stable IDs.
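To make the spatial case concrete, here is a minimal sketch of how such an ID could be packed. The bit layout, the 1e-5 degree resolution, and the names `synthetic_flag` and `make_spatial_id` are assumptions made up for this illustration, not our exact scheme:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout (illustration only):
//   bit 62      marks the ID as synthetic, far above the real OSM ID range
//   bits 31..56 hold the longitude, offset to be non-negative
//   bits 0..24  hold the latitude, offset to be non-negative
constexpr std::int64_t synthetic_flag = std::int64_t{1} << 62;

// Pack coordinates given in units of 1e-5 degrees into a positive 64-bit ID.
inline std::int64_t make_spatial_id(std::int32_t lon_e5, std::int32_t lat_e5) {
    assert(lon_e5 >= -18000000 && lon_e5 <= 18000000);
    assert(lat_e5 >= -9000000 && lat_e5 <= 9000000);
    const auto lon = static_cast<std::uint64_t>(lon_e5 + 18000000); // 0..36000000, fits in 26 bits
    const auto lat = static_cast<std::uint64_t>(lat_e5 + 9000000);  // 0..18000000, fits in 25 bits
    return synthetic_flag | static_cast<std::int64_t>((lon << 31) | lat);
}
```

Every ID produced this way is at least 2^62, so a container whose size depends on the largest ID it has seen is in trouble as soon as the first one arrives.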
When using IdSetDense (which is what osmium-tool uses), this leads to massive memory allocation problems: as soon as a very high ID is encountered, the dense ID set is resized to something enormous (hundreds of GB or more). We often want to manipulate fairly large parts of the dataset, and IdSetSmall doesn't perform well once it gets too big.
I'd like to propose that libosmium gain support for an IdSetSparse: a hash-backed ID set for use when the IDs are spread broadly across the 64-bit space. In local testing (with a hacked fork of libosmium), it makes osmium extract quite usable on .pbf files with sparsely allocated IDs that the mainline tool cannot handle (OOM errors).
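Roughly, the idea looks like the sketch below. It is standalone and uses std::unordered_set only to stay dependency-free; the class name `IdSetSparseSketch` is a placeholder, and the method names are chosen to resemble the existing ID sets as I understand them, so treat the exact interface as an assumption to be settled in the PR:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Sketch only: a hash-backed ID set. Memory grows with the number of IDs
// stored, not with the value of the largest ID.
template <typename T>
class IdSetSparseSketch {
    std::unordered_set<T> m_ids;

public:
    // Add an ID to the set.
    void set(T id) { m_ids.insert(id); }

    // Add an ID; returns true if it was not in the set before.
    bool check_and_set(T id) { return m_ids.insert(id).second; }

    // Remove an ID from the set (no-op if it is not present).
    void unset(T id) { m_ids.erase(id); }

    // Is this ID in the set?
    bool get(T id) const { return m_ids.count(id) != 0; }

    bool empty() const noexcept { return m_ids.empty(); }
    std::size_t size() const noexcept { return m_ids.size(); }
    void clear() { m_ids.clear(); }
};
```

With this, storing a single ID like 3847193284910278 costs one hash table entry rather than an allocation sized by the ID value.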
@joto I'm happy to write and submit a PR for this if you think it's a good idea. In my local experiments, a couple of things popped up that should be discussed:
- use of third-party includes - I used ankerl::unordered_dense::set from https://github.com/martinus/unordered_dense - I gather that's not your preference in libosmium, but please say if it's otherwise
- std::unordered_set is pretty slow - it works, but ankerl blows it away
- I could implement a standalone open-addressed set just for this, similar in style to what IdSetDense does. That would keep the implementation tightly within libosmium, at the expense of more lines of code
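To give a feel for the size of the standalone option, here is a minimal sketch of an open-addressed set with linear probing. The class name `OpenAddressedIdSet` is hypothetical, and the sketch assumes ID 0 is never stored so that 0 can mark an empty slot. It leaves out removal (which would need tombstones or backward shifting) and iteration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: open-addressed hash set of 64-bit IDs with linear probing.
// Assumption: ID 0 is never stored, so 0 marks an empty slot.
class OpenAddressedIdSet {
    std::vector<std::uint64_t> m_slots;
    std::size_t m_size = 0;

    // Mix the bits so clustered or masked IDs spread over the table
    // (this is the finalizer of the splitmix64 generator).
    static std::uint64_t mix(std::uint64_t x) noexcept {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // Index of the slot holding id, or of the empty slot where it belongs.
    std::size_t find_slot(std::uint64_t id) const noexcept {
        const std::size_t mask = m_slots.size() - 1; // table size is a power of two
        std::size_t i = static_cast<std::size_t>(mix(id)) & mask;
        while (m_slots[i] != 0 && m_slots[i] != id) {
            i = (i + 1) & mask;
        }
        return i;
    }

    // Double the table and re-insert every stored ID.
    void grow() {
        std::vector<std::uint64_t> old(m_slots.size() * 2, 0);
        old.swap(m_slots); // m_slots is now the empty table, old has the IDs
        for (const auto id : old) {
            if (id != 0) {
                m_slots[find_slot(id)] = id;
            }
        }
    }

public:
    OpenAddressedIdSet() : m_slots(16, 0) {}

    // Add an ID; returns true if it was not in the set before.
    bool check_and_set(std::uint64_t id) {
        assert(id != 0);
        if ((m_size + 1) * 4 > m_slots.size() * 3) { // keep load factor <= 0.75
            grow();
        }
        const std::size_t i = find_slot(id);
        if (m_slots[i] == id) {
            return false;
        }
        m_slots[i] = id;
        ++m_size;
        return true;
    }

    void set(std::uint64_t id) { check_and_set(id); }

    bool get(std::uint64_t id) const noexcept {
        return id != 0 && m_slots[find_slot(id)] == id;
    }

    std::size_t size() const noexcept { return m_size; }
    bool empty() const noexcept { return m_size == 0; }
};
```

The table costs 8 bytes per slot and stays at most 75% full, so memory again depends on how many IDs are stored rather than on how large they are.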
Thoughts? If this is acceptable, I can make the PR, and I would additionally make a follow-up submission to osmium-tool to add support for runtime selection of the ID set store. This would be a handy feature for anyone working with OSM-like data with sparse IDs.