Sorry if this has been reported before.
I find the following behavior very counter-intuitive. Take the following two sequences:
>HBB
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
>HBB_alt
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
VKVYKVTYRGAHPPAEHFQWQPRKLAQ
They are identical in the first 120 residues and differ in the last 27 residues:
HBB: MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
mvhltpeeksavtalwgkvnvdevggealgrllvvypwtqrffesfgdlstpdavmgnpkvkahgkkvlgafsdglahldnlkgtfatlselhcdklhvdpenfrllgnvlvcvlahhfg
HBB_alt: MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGVKVYKVTYRGAHPPAEHFQWQPRKLAQ
A natural expectation, I think, is that the local alignment would cover only the first 120 residues. Yet it apparently covers both sequences in full, so that when clustering at -c 1 the two sequences end up in the same cluster:
$ mmseqs | grep Version
MMseqs2 Version: 45111b641859ed0ddd875b94d6fd1aef1a675b7e
$ mmseqs easy-cluster seqs.faa clust tmp -c 1 --min-seq-id 0.5
$ awk '{_[$1]=1}END{print length(_) " cluster(s)"}' clust_cluster.tsv
1 cluster(s)
But when clustering with relaxed coverage and a stricter identity threshold, they do not:
$ mmseqs easy-cluster seqs.faa clust tmp -c 0.5 --min-seq-id 0.95
$ awk '{_[$1]=1}END{print length(_) " cluster(s)"}' clust_cluster.tsv
2 cluster(s)
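For reference, here is a rough sanity check of the numbers involved. It is only a sketch: it uses a plain position-by-position comparison, which is an assumption on my part and not necessarily how MMseqs2 computes identity or coverage. The helper names are made up for this example.

```python
# Sequences copied from the two FASTA records above.
HBB = (
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK"
    "VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG"
    "KEFTPPVQAAYQKVVAGVANALAHKYH"
)
HBB_ALT = (
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK"
    "VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG"
    "VKVYKVTYRGAHPPAEHFQWQPRKLAQ"
)


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading positions at which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of identical positions in an end-to-end, gap-free comparison.

    Assumption: both sequences have the same length, as they do here.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


prefix_len = common_prefix_len(HBB, HBB_ALT)
full_identity = ungapped_identity(HBB, HBB_ALT)
prefix_coverage = prefix_len / len(HBB)

print(f"sequence length:           {len(HBB)}")
print(f"identical prefix:          {prefix_len}")
print(f"identity over full length: {full_identity:.3f}")
print(f"coverage of prefix alone:  {prefix_coverage:.3f}")
```

Under these assumptions, an alignment restricted to the shared prefix would have 100% identity at roughly 82% coverage, which should pass -c 0.5 --min-seq-id 0.95. An end-to-end alignment would have roughly 82% identity at 100% coverage, which passes -c 1 --min-seq-id 0.5 but not --min-seq-id 0.95. Both results above are consistent with the alignment being extended over the full length.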
Coverage mode and gap penalties have no influence on the outcome. The same happens with linclust.
Is there any way of controlling the alignment extension? I noticed that this does not happen with --single-step-clustering.