Alignments extending into non-homologous regions in cluster and linclust #1104

@alephreish

Sorry if this has been reported before.

I find the following behavior very counter-intuitive. Take the following two sequences:

>HBB
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
>HBB_alt
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
VKVYKVTYRGAHPPAEHFQWQPRKLAQ

They are identical in the first 120 residues and differ in the last 27 residues:

HBB:     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
         mvhltpeeksavtalwgkvnvdevggealgrllvvypwtqrffesfgdlstpdavmgnpkvkahgkkvlgafsdglahldnlkgtfatlselhcdklhvdpenfrllgnvlvcvlahhfg
HBB_alt: MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGVKVYKVTYRGAHPPAEHFQWQPRKLAQ
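
The numbers above can be checked independently of MMseqs2. The following is a minimal sketch that measures the shared prefix and the identity of a hypothetical ungapped end-to-end alignment of the two sequences. The helper names (`shared_prefix_length`, `ungapped_identity`) are made up for illustration, and identity is assumed here to be matches divided by alignment length, which may not correspond exactly to how MMseqs2 reports it.

```python
# The two sequences from the report, joined into single strings.
hbb = (
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK"
    "VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG"
    "KEFTPPVQAAYQKVVAGVANALAHKYH"
)
hbb_alt = (
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK"
    "VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG"
    "VKVYKVTYRGAHPPAEHFQWQPRKLAQ"
)


def shared_prefix_length(a: str, b: str) -> int:
    """Number of leading positions at which both sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of identical positions in an ungapped end-to-end alignment.

    Assumption: identity = matches / alignment length, for equal-length inputs.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


prefix = shared_prefix_length(hbb, hbb_alt)
full_identity = ungapped_identity(hbb, hbb_alt)

print(f"length: {len(hbb)}")
print(f"shared prefix: {prefix}")
print(f"ungapped full-length identity: {full_identity:.3f}")
```

Under these assumptions, an alignment restricted to the shared prefix would have 100% identity, whereas one spanning both sequences end to end would have a noticeably lower identity because the tails contribute only mismatches.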

A natural expectation, I think, is that the local alignment would cover only the first 120 residues. Yet it apparently covers both sequences in full, so that when clustering with -c 1 the two sequences end up in the same cluster:

$ mmseqs | grep Version
MMseqs2 Version: 45111b641859ed0ddd875b94d6fd1aef1a675b7e
$ mmseqs easy-cluster seqs.faa clust tmp -c 1 --min-seq-id 0.5
$ awk '{_[$1]=1}END{print length(_) " cluster(s)"}' clust_cluster.tsv
1 cluster(s)

However, when clustering with relaxed coverage but a higher identity threshold, they end up in separate clusters:

$ mmseqs easy-cluster seqs.faa clust tmp -c 0.5 --min-seq-id 0.95
$ awk '{_[$1]=1}END{print length(_) " cluster(s)"}' clust_cluster.tsv
2 cluster(s)

Coverage mode and gap penalties have no influence on the outcome. The same happens with linclust.
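
To make the reasoning behind the two runs explicit, here is a rough sketch comparing two hypothetical alignments against both threshold pairs: one restricted to the 120 shared residues, and one spanning all 147 residues. It uses a deliberately simplified model (coverage = aligned residues / sequence length, identity = matches / aligned residues) that is an assumption for illustration and not MMseqs2's actual implementation; the `Candidate` class and `passes` function are hypothetical names.

```python
from dataclasses import dataclass

SEQ_LEN = 147  # length of both sequences in the report


@dataclass
class Candidate:
    """A hypothetical ungapped alignment between the two sequences."""
    name: str
    aligned: int  # number of aligned residues
    matches: int  # number of identical aligned residues

    @property
    def coverage(self) -> float:
        # Simplified: same value for query and target, since lengths are equal.
        return self.aligned / SEQ_LEN

    @property
    def identity(self) -> float:
        return self.matches / self.aligned


def passes(c: Candidate, min_cov: float, min_id: float) -> bool:
    """True if the candidate meets both the coverage and identity thresholds."""
    return c.coverage >= min_cov and c.identity >= min_id


prefix_only = Candidate("prefix-only", aligned=120, matches=120)
full_length = Candidate("full-length", aligned=147, matches=120)

# Threshold pairs taken from the two easy-cluster runs in the report.
runs = {
    "-c 1 --min-seq-id 0.5": (1.0, 0.5),
    "-c 0.5 --min-seq-id 0.95": (0.5, 0.95),
}

for label, (min_cov, min_id) in runs.items():
    for cand in (prefix_only, full_length):
        verdict = "same cluster" if passes(cand, min_cov, min_id) else "separate"
        print(f"{label:26s} {cand.name:12s} "
              f"cov={cand.coverage:.3f} id={cand.identity:.3f} -> {verdict}")
```

In this simplified model, only the full-length candidate reproduces both observations (one cluster in the first run, two in the second), whereas a prefix-only alignment would give the opposite outcome in each run. That is consistent with the alignment being extended over the non-homologous tails, although it does not prove that this is what happens internally.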

Is there any way of controlling the alignment extension? I noticed that this does not happen with --single-step-clustering.
