binseq support #31
Conversation
… avoid mutable and immutable usage
Hi Noam, nice PR! I will look more closely at this when I have time, but for now here are the results on the same machine for which I posted benchmarks on Bluesky, with plain FASTQ added for good measure. Results look great. As with uncompressed FASTA/FASTQ, scaling falls off above ~2Gbp/s with 16 threads, where Deacon may be saturating memory bandwidth (cc @RagnarGrootKoerkamp). Great to hear that you've had good results with hundreds of threads for other applications. Out of interest, does this involve Paraseq?
I used rsviruses17900.fastq.gz encoded as vbq; the commands were along the lines of the sketch at the end of this comment. I appreciate this test is far from ideal. I measured the average sustained throughput, excluding index loading time. These PBSIM3-simulated long reads occasionally contain ambiguous bases, and it's interesting to see the impact that substituting these with random nucleotides during encoding has on classification accuracy. I've only skimmed the Binseq paper – can the vbq format accommodate Ns somehow? Ideally we want to skip ambiguous minimizers.
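For reference, the encode step was something like this. From memory, so treat the exact `bqtools` invocation as an assumption and check `bqtools encode --help`:

```bash
# Encode the gzipped FASTQ as VBQ (output format inferred from the extension)
bqtools encode rsviruses17900.fastq.gz -o rsviruses17900.vbq
```

Deacon was then pointed at the resulting `.vbq` in place of the `.fastq.gz`.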
Ah, that's fascinating that memory bandwidth saturates at 16 threads. It probably doesn't help that bq and vbq are sharing that memory bandwidth while decoding from binary back to ASCII and then back to binary.

Paraseq often can't scale to hundreds of threads. It does scale very well, though, when the per-sequence task is complex enough that it doesn't completely saturate the reader threads (like in mmr). But the largest advantage binseq has over compressed fastq is in very fast per-sequence tasks, where decompression becomes the largest bottleneck. I'm working on a project now, which I hope will come out soon, where Paraseq can scale up to maybe 8 or so threads but binseq can get past 128.

Not as it is now, but it will be coming soon! Interesting that it's making a large difference in classification accuracy. I'd actually be curious to see what would happen if you changed the ambiguous nucleotide policy (default: random) to some fixed nucleotide, to see whether that changes the results.
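Something along these lines; the policy flag and its values here are illustrative rather than bqtools' exact spelling:

```bash
# Default policy: each ambiguous base is replaced by a randomly drawn nucleotide
bqtools encode reads.fastq.gz -o reads.vbq

# Hypothetical: pin every N to a fixed base instead, making encoding deterministic
bqtools encode reads.fastq.gz -o reads.vbq --policy set-to-a
```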
We certainly agree on this 🙂. Deacon is now heavily rate-limited by gzip decompression.
From my perspective, N support in vbq and bqtools would be a really nice feature, allowing Deacon to generate identical results for fastq and vbq even in the presence of ambiguity. I would be happiest merging this PR once vbq and bqtools have N support. Another change (to Paraseq) that would be very impactful for Deacon is parallel paired fastq reading from separate files. This would almost double decompression-limited throughput for paired reads. Ragnar has opened an issue and I think is also looking into it.
Force-pushed from 08ec7de to 97868a0

Hey @bede,

This PR adds support for bq and vbq. I saw your thread-scaling post on bsky and was super curious to see how binseq files would do.

The current implementation just checks whether the input file ends in `{bq,vbq}` to determine whether it's a binseq input. I then implemented the `binseq::ParallelReader` trait, which handles both single and paired records and follows the same logic as the `paraseq` impl; a rough sketch is below.
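The shape of it is roughly this. A minimal sketch, not the actual diff, and the `binseq` signatures are from memory, so treat them as assumptions:

```rust
use binseq::{BinseqRecord, ParallelProcessor, Result};

// Illustrative stand-in for Deacon's per-thread state, not the real struct.
#[derive(Clone, Default)]
struct FilterProcessor {
    seq_buf: Vec<u8>, // reused buffer for decoding binary back to ASCII
    kept: usize,
}

impl ParallelProcessor for FilterProcessor {
    fn process_record<B: BinseqRecord>(&mut self, record: B) -> Result<()> {
        self.seq_buf.clear();
        record.decode_s(&mut self.seq_buf)?; // primary sequence
        // Paired records carry the mate as an extended sequence, decoded
        // separately (e.g. record.decode_x(...) when record.is_paired()).

        // ...the same minimizer check as the paraseq impl goes here...
        self.kept += 1;
        Ok(())
    }
}
```

Dispatch is then just a `BinseqReader` (which abstracts over `.bq` and `.vbq`) driving the processor across threads, e.g. `BinseqReader::new(path)?.process_parallel(processor, num_threads)?`.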
I tried to minimize the amount of code changed, but unfortunately `should_keep_sequence` was a little difficult to work around, because I needed to borrow immutably and mutably at the same time. My solution was just to make it an associated function of the struct (sketched below), though I think a better solution exists. It shouldn't change much, but it requires more arguments at call time, which is not as ergonomic.
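To make the borrow issue concrete, this is the pattern, with illustrative stand-in types rather than Deacon's actual ones:

```rust
use std::collections::HashSet;

struct Filter {
    index: HashSet<u64>, // minimizer index, only ever read
    scratch: Vec<u64>,   // reusable minimizer buffer, written per record
}

impl Filter {
    // As a `&mut self` method, calling this while another field of `self`
    // is already borrowed at the call site trips the borrow checker.
    // As an associated function, the two borrows are of disjoint fields.
    fn should_keep_sequence(
        index: &HashSet<u64>,
        scratch: &mut Vec<u64>,
        seq: &[u8],
    ) -> bool {
        scratch.clear();
        let _ = seq; // ...fill `scratch` with the sequence's minimizers...
        !scratch.iter().any(|m| index.contains(m))
    }
}

// Call site becomes:
// Filter::should_keep_sequence(&filter.index, &mut filter.scratch, seq)
```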
I've just tried this with 100M random human sequences from `wgsim`, and got the following at 16 threads (though roughly 5s of each run was spent loading the index). For the single version I ran just the R1 of the fastq through, converting that R1 into either bq or vbq using `bqtools`. For the paired version I passed in both R1 and R2 and created a paired bq or vbq with `bqtools`.

| fastq-single | bq-single | vbq-single | fastq-paired | bq-paired | vbq-paired |
| --- | --- | --- | --- | --- | --- |

I've found with other programs that bq and vbq scale linearly with the number of threads well into 128+ threads. Would love to see that same benchmark you ran, with binseq inputs as well!
Cheers,
Noam