Speed up short reads filtering by processing multiple records together #32

imartayan · 2025-08-18T16:18:40Z

This PR aims to speed up short-read filtering by packing multiple records until a length threshold is reached (currently 8000 bp) and processing the batch with a single call to simd_minimizers. It introduces a RecordBuffer type that packs multiple sequences and headers into a single Vec<u8>, and FilterProcessor uses this type internally to batch operations. Currently, batching is only implemented for the ParallelProcessor trait and isn't used for paired reads yet.
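For readers who want a concrete picture of the packing idea, here is a minimal sketch. It is not the code from this PR: the field names, the `push`/`records`/`clear` methods, and the `process_in_batches` helper are all hypothetical, and the closure `process_batch` merely stands in for the single batched call to simd_minimizers. It also assumes the threshold counts sequence bases only, not header bytes.

```rust
use std::ops::Range;

/// Illustrative buffer packing several records into one `Vec<u8>`.
/// Names and layout are assumptions, not the PR's actual RecordBuffer.
pub struct RecordBuffer {
    data: Vec<u8>,              // headers and sequences stored back to back
    headers: Vec<Range<usize>>, // byte range of each header within `data`
    seqs: Vec<Range<usize>>,    // byte range of each sequence within `data`
    seq_len: usize,             // total number of sequence bases packed so far
    threshold: usize,           // batch is "full" once seq_len reaches this
}

impl RecordBuffer {
    pub fn new(threshold: usize) -> Self {
        Self { data: Vec::new(), headers: Vec::new(), seqs: Vec::new(), seq_len: 0, threshold }
    }

    /// Appends one record and returns true once the packed sequence
    /// length has reached the threshold.
    pub fn push(&mut self, header: &[u8], seq: &[u8]) -> bool {
        let h_start = self.data.len();
        self.data.extend_from_slice(header);
        let s_start = self.data.len();
        self.data.extend_from_slice(seq);
        self.headers.push(h_start..s_start);
        self.seqs.push(s_start..self.data.len());
        self.seq_len += seq.len();
        self.seq_len >= self.threshold
    }

    /// Number of records currently packed.
    pub fn len(&self) -> usize {
        self.seqs.len()
    }

    pub fn is_empty(&self) -> bool {
        self.seqs.is_empty()
    }

    /// Iterates over (header, sequence) slices borrowed from the packed bytes.
    pub fn records(&self) -> impl Iterator<Item = (&[u8], &[u8])> + '_ {
        self.headers
            .iter()
            .zip(&self.seqs)
            .map(move |(h, s)| (&self.data[h.clone()], &self.data[s.clone()]))
    }

    /// Empties the buffer while keeping its allocations for reuse.
    pub fn clear(&mut self) {
        self.data.clear();
        self.headers.clear();
        self.seqs.clear();
        self.seq_len = 0;
    }
}

/// Packs records and calls `process_batch` once per full batch, plus once
/// for any leftover records. Returns the number of batches processed.
pub fn process_in_batches<'a, I, F>(records: I, threshold: usize, mut process_batch: F) -> usize
where
    I: IntoIterator<Item = (&'a [u8], &'a [u8])>,
    F: FnMut(&RecordBuffer),
{
    let mut buf = RecordBuffer::new(threshold);
    let mut batches = 0;
    for (header, seq) in records {
        if buf.push(header, seq) {
            process_batch(&buf);
            buf.clear();
            batches += 1;
        }
    }
    if !buf.is_empty() {
        process_batch(&buf);
        batches += 1;
    }
    batches
}
```

With a threshold of 8000 and 150 bp reads, a sketch like this would hand roughly 54 reads to each batched call instead of making one call per read, which is where the per-call overhead is amortized. The extra copy into `data` is the cost mentioned in the next paragraph.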

This may copy slightly more data than before, since some records have to be kept around longer (slightly slowing down long-read processing), but it should bring a significant speedup for short reads.

Let me know if you're happy with the new performance and whether you get results consistent with the previous implementation. If so, I can adapt the code to support paired reads as well.

Best,
Igor

bede · 2025-08-18T17:26:35Z

Many thanks @imartayan!
Results look great at a glance. For the uncompressed fastq R1 reads (forward reads only) of the 2x150bp simulated reads for rsviruses17900, this PR increases throughput from 300Mbp/s to 542Mbp/s on my local M1 machine. For fastq.gz, throughput remains capped at ~190Mbp/s (though, as discussed, we can hopefully double this in the future with parallel readers). This approach would also deliver improvements in conjunction with faster compression approaches and with binary formats like uBAM (#33) and vbq (#31) that may be supported in future.

bede · 2025-09-10T12:05:20Z

Hi Igor,
I'm sorry for not responding here yet; first impressions are really good. I'll get back to you after a closer look.

Thanks,
Bede

imartayan added 2 commits August 18, 2025 18:00

Batch processing of short records by packing them together

fc42693

Fix MinimalRecord id range

ab318c0

bede force-pushed the main branch 2 times, most recently from 08ec7de to 97868a0 on November 20, 2025 16:56
