Thanks! :)
In principle, yes. And there's already #181 with some related discussion. "In principle", because the segmenting stage basically consumes a stream of input data to produce the output file system image. I'm currently working on a new abstraction layer that allows the ubiquitous uses of
Definitely. There are quite a few things you just cannot do with streaming input. E.g., whole-file deduplication won't work because the files can be too large to be kept in memory (it could still be done for files up to a certain size). Also, any kind of ordering that is usually applied before segmentation won't be possible, since ordering requires seeing all of the input up front.

In other words: while it would be nice to have streaming input capability, I wouldn't expect it to produce impressively small output artifacts. Finally, the streaming input would need to have some structure, e.g. TAR as suggested by #181.

When you're talking about "high bandwidth log streams", I assume you're talking about just that, a single stream of log messages? If that is the case, there's no "magic" in DwarFS that would make this compress any better than if you'd run it through, say, a general-purpose stream compressor.
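To make the "structured streaming input" and "dedup only up to a certain size" points a bit more concrete, here is a minimal Python sketch (my own illustration, not DwarFS code; the size cap and all names are made up) that consumes a TAR archive as a stream and de-duplicates whole files by content hash only when they are small enough to buffer:

```python
# A minimal sketch, not DwarFS code: consume a TAR archive strictly
# sequentially (as suggested in #181) and apply whole-file de-duplication
# only to files below a size cap that we can afford to buffer in memory.
import hashlib
import sys
import tarfile

SIZE_CAP = 16 * 1024 * 1024          # hypothetical per-file buffering limit
seen: dict[str, str] = {}            # content hash -> first path with that content

# mode "r|*" reads the archive as a stream, no seeking back
with tarfile.open(fileobj=sys.stdin.buffer, mode="r|*") as tar:
    for member in tar:
        if not member.isreg():
            continue
        if member.size > SIZE_CAP:
            # Too large to buffer: has to be passed through without dedup.
            print(f"stream through: {member.name} ({member.size} bytes)")
            continue
        data = tar.extractfile(member).read()      # small file: safe to buffer
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            print(f"duplicate of {seen[digest]}: {member.name}")
        else:
            seen[digest] = member.name
            print(f"store: {member.name}")
```

Something like `tar cf - /some/tree | python dedup_sketch.py` would exercise it (the script name is hypothetical); a real implementation would write the kept data somewhere instead of just printing decisions.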
Correct. In the case of the 50 GB of Perl files, almost 60% of the data is in files that are exact duplicates. A large fraction of the remaining 40% are near duplicates, and these generally end up next to each other thanks to similarity ordering. Segmentation de-duplicates large chunks (4 KiB or more by default) and saves more than 3/4 of the near-duplicate files. Ultimately, this leaves only about 4 GB for the "real" compression algorithm.

Segmenting can still be done with streaming input, but it will be less effective without ordering. It is still surprisingly effective, though. I ran a quick test where I disabled both similarity ordering and whole-file de-duplication, and the segmenter is still able to discard about 2/3 of the input data. Still, the resulting image is almost an order of magnitude larger.
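For a rough intuition of why the segmenter can still discard so much data even without ordering or whole-file de-duplication, here is a toy sketch (my own simplification, not the actual segmenter): it de-duplicates fixed-size 4 KiB blocks by hash, whereas the real segmenter matches through a sliding window and therefore also catches chunks that are not block-aligned.

```python
import hashlib
from typing import BinaryIO

BLOCK_SIZE = 4096   # mirrors the default minimum match size mentioned above


def dedup_blocks(stream: BinaryIO) -> tuple[int, int]:
    """Return (bytes_read, bytes_kept) after dropping repeated 4 KiB blocks.

    Toy model only: real segmentation matches at arbitrary offsets via a
    sliding window, so it removes considerably more than this does.
    """
    seen: set[bytes] = set()
    bytes_read = bytes_kept = 0
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        bytes_read += len(block)
        digest = hashlib.sha256(block).digest()
        if digest not in seen:
            seen.add(digest)
            bytes_kept += len(block)
    return bytes_read, bytes_kept


if __name__ == "__main__":
    import sys
    read, kept = dedup_blocks(sys.stdin.buffer)
    print(f"read {read} bytes, kept {kept} bytes "
          f"({100.0 * kept / max(read, 1):.1f}% remaining)")
```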
dwarfs is really impressive in so many ways. I was curious whether its design is amenable to streaming input. For example, I want to create a system that optimizes the compression of high-bandwidth log streams. From what I've read, dwarfs has already solved the general compression of the type of data I want to compress very well. However, if the input comes in as a stream, I just know that redundancy will be left on the table when only segments are compressed at a time.
One of the big issues with compression is that you can take compressible content and get, say, a 10% compression ratio out of it, but attempting to compress already-compressed content together won't yield much gain. That is what the redundant video file examples demonstrate: using a naive compression algorithm on huge content whose redundancy is not "reachable" by that algorithm means it will fail to compress that redundancy out.
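A small, self-contained way to see that "reachability" limit (my own illustration, not from the thread): DEFLATE only matches within a 32 KiB window, so a duplicate that sits 1 MiB back in the input is invisible to it.

```python
import os
import zlib

# 1 MiB of incompressible data, then the same data twice in a row.
# Because the duplicate starts 1 MiB back, far beyond zlib's 32 KiB
# window, the second copy compresses no better than the first.
block = os.urandom(1 << 20)
one_copy = len(zlib.compress(block, 9))
two_copies = len(zlib.compress(block * 2, 9))
print(f"one copy:   {one_copy} bytes after compression")
print(f"two copies: {two_copies} bytes after compression (roughly double)")
```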
The 50 GB of Perl files is quite a good test case, anyway.
If we compress the whole thing in one go, a spectacular sub-1% compression ratio can be achieved. But suppose I want to compress this dataset by streaming it, running on a much weaker machine with a much smaller working memory, so that we have to split it into 5 GB chunks. Something tells me the compression ratio won't be nearly as good.
Or maybe it actually still would be. Chunking is guaranteed to make the ratio worse; the question is just how much worse. I'll come back with performance numbers when I test!
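In case it's useful for that test, here is a rough sketch of the comparison, with LZMA standing in for whatever compressor actually gets used and the file/chunk sizes scaled down to something that fits in memory (all names are made up):

```python
# Usage: python chunk_compare.py <sample-file> <chunk-size-bytes>
# Compares compressing a sample in one go vs. in independent chunks.
import lzma
import sys
from pathlib import Path


def compressed_size(data: bytes) -> int:
    return len(lzma.compress(data, preset=6))


def chunked_size(data: bytes, chunk_size: int) -> int:
    # Compress each chunk independently, as a memory-constrained streaming
    # setup would have to, and sum the compressed sizes.
    return sum(compressed_size(data[i:i + chunk_size])
               for i in range(0, len(data), chunk_size))


if __name__ == "__main__":
    sample = Path(sys.argv[1]).read_bytes()   # a sample small enough for RAM
    chunk = int(sys.argv[2])                  # e.g. scale 5 GB down to 5 MB
    whole = compressed_size(sample)
    split = chunked_size(sample, chunk)
    print(f"whole stream:  {whole} bytes "
          f"({100.0 * whole / len(sample):.2f}% of input)")
    print(f"{chunk}-byte chunks: {split} bytes "
          f"({100.0 * split / len(sample):.2f}% of input)")
```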