
Conversation

@noamteyssier

Hey @bede,

This PR adds support for bq and vbq.

I saw your thread-scaling post on Bluesky and I was super curious to see how BINSEQ files would do.

The current implementation just checks whether the input file ends in {bq,vbq} to determine if it's a BINSEQ input. I then implemented the binseq::ParallelReader trait, which handles either single or paired records and follows the same logic as the paraseq impl.

I tried to minimize the amount of code changed, but unfortunately should_keep_sequence was a little difficult to work around because I needed to borrow immutably and mutably at the same time. My solution was to make it an associated function of the struct, but I think a better solution exists. It shouldn't change much, but it requires more arguments at the call site, which is less ergonomic.

I've just tried this with 100M random human sequences from wgsim and got the following at 16 threads (though roughly 5 s of each run was spent loading the index). For the single-end version I ran just the R1 of the FASTQ through, converting R1 into either bq or vbq using bqtools. For the paired version I passed in both R1 and R2 and created a paired bq or vbq with bqtools.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| fastq-single | 17.767 ± 0.060 | 17.730 | 17.836 | 1.67 ± 0.01 |
| bq-single | 10.668 ± 0.040 | 10.627 | 10.706 | 1.00 |
| vbq-single | 11.298 ± 0.064 | 11.225 | 11.345 | 1.06 ± 0.01 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| fastq-paired | 33.552 ± 0.270 | 33.382 | 33.864 | 1.57 ± 0.02 |
| bq-paired | 21.313 ± 0.270 | 21.123 | 21.622 | 1.00 |
| vbq-paired | 22.549 ± 0.069 | 22.477 | 22.615 | 1.06 ± 0.01 |

I've found with other programs that bq and vbq scale linearly with the number of threads well into 128+ threads. Would love to see that same benchmark you ran with BINSEQ inputs as well!

Cheers,
Noam

@bede
Owner

bede commented Aug 17, 2025

Hi Noam, nice PR! I will look more closely at this when I have time, but for now here are results from the same machine I posted Bluesky benchmarks for, with plain FASTQ added for good measure. Results look great. As with uncompressed FASTA/FASTQ, scaling falls off above ~2 Gbp/s at 16 threads, where Deacon may be saturating memory bandwidth (cc @RagnarGrootKoerkamp). Great to hear that you've had good results with hundreds of threads for other applications; is this involving Paraseq, out of interest?

[Figure: deacon-thread-scaling-vbq (throughput vs. thread count)]

```json
[
    {"threads": 1, "mbps": 139.1, "format": "vbq (PR #31)"},
    {"threads": 2, "mbps": 275.2, "format": "vbq (PR #31)"},
    {"threads": 4, "mbps": 537.9, "format": "vbq (PR #31)"},
    {"threads": 8, "mbps": 1053.1, "format": "vbq (PR #31)"},
    {"threads": 16, "mbps": 1956.2, "format": "vbq (PR #31)"},
    {"threads": 32, "mbps": 2500.1, "format": "vbq (PR #31)"}
]
```

I used rsviruses17900.fastq.gz encoded as vbq like so:

```shell
gzip -dc data/rsviruses17900/rsviruses17900.fastq.gz | bqtools encode -p r -f a -o data/rsviruses17900/rsviruses17900.vbq
```

and e.g.

```shell
cargo run -r -- filter -t 32 data/panhuman-1.k31w15.idx data/rsviruses17900/rsviruses17900.vbq > /dev/null
```

I appreciate this test is far from ideal. I measured average sustained throughput, excluding index loading time. These PBSIM3-simulated long reads occasionally contain ambiguous bases, and it's interesting to see the impact that substituting these with random nucleotides during encoding has on classification accuracy. I've only skimmed the BINSEQ paper; can the vbq format accommodate Ns somehow? Ideally we want to skip ambiguous minimizers.

@noamteyssier
Author

Ah, that's fascinating that memory bandwidth saturates at 16 threads. It probably doesn't help that bq and vbq are sharing that memory bandwidth while decoding from binary to ASCII and then back to binary.

> Great to hear that you've good results with hundreds of threads for other applications, is this involving Paraseq out of interest?

Paraseq oftentimes can't scale to hundreds of threads. It does scale very well, though, when the per-sequence task is complex enough that it doesn't completely saturate the reader threads (like in mmr).

But the largest advantage BINSEQ has over compressed FASTQ is in very fast per-sequence tasks, where decompression becomes the largest bottleneck. I'm working on a project now, which I hope will come out soon, where paraseq can scale up to maybe 8 or so threads but binseq can get up past 128.

> can the vbq format accommodate Ns somehow?

Not as it is now, but it will be coming soon! Interesting that it's making a large difference in classification accuracy. I'd actually be curious to see what would happen if you changed the ambiguous-nucleotide policy (default: random) to some fixed nucleotide, to see if that would change the results.

@bede
Owner

bede commented Aug 18, 2025

> But the largest advantage binseq has over compressed fastq is in very fast per-sequence tasks where decompression becomes the largest bottleneck

We certainly agree on this 🙂. Deacon is now heavily rate-limited by gzip decompression.

> not as it is now, but will be coming soon!

From my perspective, N support in vbq and bqtools would be a really nice feature, allowing Deacon to generate identical results for FASTQ and vbq even in the presence of ambiguity. I would be happy to merge this PR once vbq and bqtools have N support.

Another change (to Paraseq) that would be very impactful for Deacon is parallel paired FASTQ reading from separate files. This would almost double the throughput of decompression-limited paired-read processing. Ragnar has opened an issue and I think is also looking into it.

@bede force-pushed the main branch 2 times, most recently from 08ec7de to 97868a0 on November 20, 2025.