Buffer overflow in k-mer prefilter with highly conserved sequences #1092

@mzueva

mmseqs easy-search segfaults in the prefilter step when an antibody query is run against a multi-million-sequence antibody database (e.g. 10M clonotype sequences). Any dataset in which conserved k-mers match a large fraction of targets will trigger it.

The problem is caused by an underestimate of the output size in the buffer overflow check in CacheFriendlyOperations::findDuplicates. The check uses std::min(elementCount, currBinSize/2), which assumes that at most half of the entries in a bin are duplicates. This assumption breaks when query k-mers are shared across a large fraction of the target database, as happens with antibody variable region sequences, where conserved framework k-mers match ~70% of targets on consistent diagonals.
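
To illustrate why the half-of-bin bound can be too small, here is a minimal toy model. It is not the MMseqs2 code. The Hit struct, the helper names, and the counting rule (a hit counts as a duplicate when it falls on the same diagonal as the previous hit for that target) are assumptions made for this sketch. Only the std::min(elementCount, currBinSize/2) expression is taken from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a bin entry: target id plus diagonal of the k-mer hit.
struct Hit {
    unsigned int targetId;
    unsigned char diagonal;
};

// Toy counting rule (an assumption, not the MMseqs2 source): a hit is counted
// as a duplicate when the previous hit for the same target had the same diagonal.
std::size_t countDuplicates(const std::vector<Hit>& bin) {
    std::unordered_map<unsigned int, unsigned char> lastDiagonal;
    std::size_t duplicates = 0;
    for (const Hit& h : bin) {
        auto it = lastDiagonal.find(h.targetId);
        if (it != lastDiagonal.end() && it->second == h.diagonal) {
            ++duplicates;
        }
        lastDiagonal[h.targetId] = h.diagonal;
    }
    return duplicates;
}

// The estimate quoted in the issue: at most half of the bin is assumed duplicate.
std::size_t estimatedDuplicates(std::size_t elementCount, std::size_t currBinSize) {
    return std::min(elementCount, currBinSize / 2);
}

// Builds a bin in which every target receives hitsPerTarget hits on one diagonal,
// mimicking conserved framework k-mers that hit consistently.
std::vector<Hit> makeConservedBin(unsigned int targets, unsigned int hitsPerTarget) {
    std::vector<Hit> bin;
    for (unsigned int t = 0; t < targets; ++t) {
        for (unsigned int k = 0; k < hitsPerTarget; ++k) {
            bin.push_back(Hit{t, 7});
        }
    }
    return bin;
}
```

Under this toy rule, makeConservedBin(2, 5) yields a bin of 10 entries with 8 counted duplicates, while the estimate caps at 5. Any bin where targets receive three or more same-diagonal hits exceeds the half-of-bin bound in the same way.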

Here is a suggested fix for the issue: #1091