Thanks! :)
In principle, yes. And there's already #181 with some related discussion. "In principle", because the segmenting stage basically consumes a stream of input data to produce the output file system image. I'm currently working on a new abstraction layer that allows the ubiquitous uses of
Definitely. There are quite a few things you just cannot do with streaming input. E.g., whole-file deduplication won't work because the files can be too large to be kept in memory (it could still be done for files up to a certain size). Also, any kind of ordering that is usually applied before segmentation won't be possible, since ordering requires seeing all of the input up front.

In other words: while it would be nice to have streaming input capability, I wouldn't expect it to produce impressively small output artifacts. Finally, the streaming input would need to have some structure, e.g. TAR as suggested by #181.

When you're talking about "high bandwidth log streams", I assume you're talking about just that, a single stream of log messages? If that is the case, there's no "magic" in DwarFS that would make this compress any better than if you'd run it through, say, a general-purpose stream compressor.
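To make the "structured streaming input" and "dedup only up to a certain size" points a bit more concrete, here is a minimal Python sketch (my own illustration, not DwarFS code; the size cap and all names are made up) that consumes a TAR archive as a stream and de-duplicates whole files by content hash only when they are small enough to buffer:

```python
# A minimal sketch, not DwarFS code: consume a TAR archive strictly
# sequentially (as suggested in #181) and apply whole-file de-duplication
# only to files below a size cap that we can afford to buffer in memory.
import hashlib
import sys
import tarfile

SIZE_CAP = 16 * 1024 * 1024          # hypothetical per-file buffering limit
seen: dict[str, str] = {}            # content hash -> first path with that content

# mode "r|*" reads the archive as a stream, no seeking back
with tarfile.open(fileobj=sys.stdin.buffer, mode="r|*") as tar:
    for member in tar:
        if not member.isreg():
            continue
        if member.size > SIZE_CAP:
            # Too large to buffer: has to be passed through without dedup.
            print(f"stream through: {member.name} ({member.size} bytes)")
            continue
        data = tar.extractfile(member).read()      # small file: safe to buffer
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            print(f"duplicate of {seen[digest]}: {member.name}")
        else:
            seen[digest] = member.name
            print(f"store: {member.name}")
```

Something like `tar cf - /some/tree | python dedup_sketch.py` would exercise it (the script name is hypothetical); a real implementation would write the kept data somewhere instead of just printing decisions.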
Correct. In the case of the 50 GB of Perl files, almost 60% of the data is in files that are exact duplicates. A large fraction of the remaining 40% are near duplicates, and these generally end up next to each other thanks to similarity ordering. Segmentation de-duplicates large chunks (4 KiB or more by default) and saves more than 3/4 of the near-duplicate files. Ultimately, this leaves only about 4 GB for the "real" compression algorithm.

Segmenting can still be done with streaming input, but it will be less effective without ordering. It is still surprisingly effective, though. I ran a quick test where I disabled both similarity ordering and whole-file de-duplication, and the segmenter is still able to discard about 2/3 of the input data. Still, the resulting image is almost an order of magnitude larger.
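For a rough intuition of why the segmenter can still discard so much data even without ordering or whole-file de-duplication, here is a toy sketch (my own simplification, not the actual segmenter): it de-duplicates fixed-size 4 KiB blocks by hash, whereas the real segmenter matches through a sliding window and therefore also catches chunks that are not block-aligned.

```python
import hashlib
from typing import BinaryIO

BLOCK_SIZE = 4096   # mirrors the default minimum match size mentioned above


def dedup_blocks(stream: BinaryIO) -> tuple[int, int]:
    """Return (bytes_read, bytes_kept) after dropping repeated 4 KiB blocks.

    Toy model only: real segmentation matches at arbitrary offsets via a
    sliding window, so it removes considerably more than this does.
    """
    seen: set[bytes] = set()
    bytes_read = bytes_kept = 0
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        bytes_read += len(block)
        digest = hashlib.sha256(block).digest()
        if digest not in seen:
            seen.add(digest)
            bytes_kept += len(block)
    return bytes_read, bytes_kept


if __name__ == "__main__":
    import sys
    read, kept = dedup_blocks(sys.stdin.buffer)
    print(f"read {read} bytes, kept {kept} bytes "
          f"({100.0 * kept / max(read, 1):.1f}% remaining)")
```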
dwarfs is really impressive in so many ways. I was curious whether its design is amenable to streaming input. For example, I want to create a system that optimizes the compression of high-bandwidth log streams. From what I've read, dwarfs has already solved the general compression of the type of data I want to compress very well. However, if the input comes in as a stream, I just know that redundancy will be left on the table when only segments are compressed at a time.
One of the big issues with compression is that you can take compressible content and get, say, a 10% compression ratio out of it, but attempting to compress already-compressed content together won't yield much gain. That is what the redundant video file examples demonstrate: using a naive compression algorithm on huge content whose redundancy is not "reachable" by that algorithm means it will fail to compress that redundancy out.
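A small, self-contained way to see that "reachability" limit (my own illustration, not from the thread): DEFLATE only matches within a 32 KiB window, so a duplicate that sits 1 MiB back in the input is invisible to it.

```python
import os
import zlib

# 1 MiB of incompressible data, then the same data twice in a row.
# Because the duplicate starts 1 MiB back, far beyond zlib's 32 KiB
# window, the second copy compresses no better than the first.
block = os.urandom(1 << 20)
one_copy = len(zlib.compress(block, 9))
two_copies = len(zlib.compress(block * 2, 9))
print(f"one copy:   {one_copy} bytes after compression")
print(f"two copies: {two_copies} bytes after compression (roughly double)")
```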
The 50 GB of Perl files is quite a good test case, anyway.
If we compress the whole thing in one go, a spectacular sub-1% compression ratio can be achieved. But suppose I want to compress this dataset by streaming it, running on a much weaker machine with a much smaller working memory, so that we have to split it into 5 GB chunks. Something tells me the compression ratio won't be nearly as good.
Or maybe it actually still would be. Chunking is guaranteed to make the ratio worse; the question is just how much worse. I'll come back with performance numbers when I test!
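In case it's useful for that test, here is a rough sketch of the comparison, with LZMA standing in for whatever compressor actually gets used and the file/chunk sizes scaled down to something that fits in memory (all names are made up):

```python
# Usage: python chunk_compare.py <sample-file> <chunk-size-bytes>
# Compares compressing a sample in one go vs. in independent chunks.
import lzma
import sys
from pathlib import Path


def compressed_size(data: bytes) -> int:
    return len(lzma.compress(data, preset=6))


def chunked_size(data: bytes, chunk_size: int) -> int:
    # Compress each chunk independently, as a memory-constrained streaming
    # setup would have to, and sum the compressed sizes.
    return sum(compressed_size(data[i:i + chunk_size])
               for i in range(0, len(data), chunk_size))


if __name__ == "__main__":
    sample = Path(sys.argv[1]).read_bytes()   # a sample small enough for RAM
    chunk = int(sys.argv[2])                  # e.g. scale 5 GB down to 5 MB
    whole = compressed_size(sample)
    split = chunked_size(sample, chunk)
    print(f"whole stream:  {whole} bytes "
          f"({100.0 * whole / len(sample):.2f}% of input)")
    print(f"{chunk}-byte chunks: {split} bytes "
          f"({100.0 * split / len(sample):.2f}% of input)")
```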