Polynomial-time completion of phylogenetic tree sets

2604.23984

Aleksandr Koshkarov, Nadia Tahiri

q-bio.PE, cs.CC

TLDR

This paper introduces a polynomial-time algorithm for completing sets of phylogenetic trees with partial taxon overlap, retaining the taxa and branch-length information that pruning to shared leaves would discard.

Key contributions

  • Presents a polynomial-time algorithm for completing phylogenetic tree sets with partial taxon overlap.
  • Identifies maximal completion subtrees and uses a weighted majority-rule consensus for robust completion.
  • Preserves distances among original taxa and ensures order-independent, unique completion.
  • In experiments on amphibian, mammal, shark, and squamate datasets, achieves the lowest distance to the subset reference trees among all compared methods, in both topology and branch lengths.
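
To make the weighted majority-rule step concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `weighted_majority_clades`, the representation of each subtree as a set of clades, and the example weights are all assumptions chosen for illustration. The summary does not say how the paper derives its weights.

```python
from collections import defaultdict


def weighted_majority_clades(clade_sets, weights):
    """Return the clades whose summed weight exceeds half of the total weight.

    clade_sets: one set of clades per subtree; each clade is a frozenset of leaf names.
    weights:    one non-negative weight per subtree (hypothetical values here).
    """
    total = sum(weights)
    support = defaultdict(float)
    for clades, weight in zip(clade_sets, weights):
        for clade in clades:
            support[clade] += weight
    # Strict majority: a clade must carry more than half of the total weight.
    return {clade for clade, s in support.items() if s > total / 2}


# Three hypothetical candidate subtrees over leaves A-D, listed by non-trivial clades.
subtrees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AC"), frozenset("ACD")},
]
weights = [2.0, 1.0, 1.0]  # assumed weights, for illustration only

consensus = weighted_majority_clades(subtrees, weights)
```

With these inputs only the clade {A, B} is retained: it carries weight 3.0 out of 4.0, while {A, B, C} carries exactly half (2.0) and so fails the strict-majority test.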

Why it matters

Comparative phylogenetic analyses usually need trees on identical taxon sets, but real trees often cover distinct, overlapping taxa, and pruning them to the shared leaves throws away data. This algorithm instead completes the tree sets in polynomial time, keeping the phylogenetic signal carried by the non-shared taxa and their branch lengths. That makes comparative analyses more comprehensive without sacrificing the distances among the original taxa.

Original Abstract

Comparative analyses of phylogenetic trees typically require identical taxon sets; however, in practice, trees often include distinct but overlapping taxa. Pruning non-shared leaves discards phylogenetic signal, whereas tree completion can preserve both taxa and branch-length information. This work introduces a polynomial-time algorithm for set-wide completion of phylogenetic trees with partial taxon overlap. The proposed method identifies and extracts maximal completion subtrees that frequently appear across the source trees and constructs a weighted majority-rule consensus. Branch lengths are scaled using rates derived from common leaves. Each consensus subtree is inserted at the position that minimizes the quadratic distance error measured against information from the source trees, with candidate positions restricted to the original branches of the target tree. We demonstrate that the algorithm runs in polynomial time and preserves distances among the original taxa, yielding a unique completion that is order-independent with respect to the processing order of target trees. An experimental evaluation on amphibians, mammals, sharks, and squamates shows that the proposed method consistently achieves the lowest distance to the subset reference trees across subsets among all methods, in both topology and branch lengths. An open-source Python implementation of the proposed algorithm and the biological datasets utilized in this study are publicly available at: https://github.com/tahiri-lab/overlap-treeset-completion/.
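
The rate scaling and the quadratic-error placement described in the abstract can be illustrated with a small Python sketch. Everything below is an assumption made for illustration: the abstract does not give the exact rate formula or error function, so the sketch takes the rate to be a ratio of summed pairwise distances among common leaves, and takes the error to be a sum of squared differences. The helper names, candidate labels, and all distances are hypothetical.

```python
from itertools import combinations


def scaling_rate(target_dist, source_dist, common_leaves):
    """Assumed rate: summed target distances over summed source distances,
    taken over all pairs of common leaves."""
    pairs = list(combinations(sorted(common_leaves), 2))
    return sum(target_dist[p] for p in pairs) / sum(source_dist[p] for p in pairs)


def best_position(candidates, source_to_new, rate):
    """Pick the candidate position with the smallest quadratic distance error.

    candidates:    position name -> {common leaf: distance to the new taxon if
                   it were attached at that position in the target tree}.
    source_to_new: {common leaf: distance to the new taxon in the source tree}.
    """
    def error(dists):
        return sum((dists[leaf] - rate * source_to_new[leaf]) ** 2
                   for leaf in source_to_new)

    return min(candidates, key=lambda name: error(candidates[name]))


# Hypothetical pairwise distances among the common leaves A, B, C.
target_dist = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
source_dist = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "C"): 2.0}
rate = scaling_rate(target_dist, source_dist, {"A", "B", "C"})

# Distances from the missing taxon to each common leaf, read off the source tree.
source_to_new = {"A": 1.5, "B": 1.5, "C": 2.0}

# Hypothetical attachment points on existing branches of the target tree.
candidates = {
    "branch_to_A": {"A": 1.0, "B": 3.0, "C": 5.0},
    "branch_to_C": {"A": 5.0, "B": 5.0, "C": 1.0},
    "internal_AB": {"A": 3.0, "B": 3.0, "C": 4.5},
}
best = best_position(candidates, source_to_new, rate)
```

Here the target tree is twice as long as the source tree on the common leaves, so the rate is 2.0 and the scaled source distances are 3.0, 3.0, and 4.0. The candidate `internal_AB` has the smallest squared error (0.25) and is selected. Restricting candidates to existing branches of the target tree, as the abstract states, is what keeps distances among the original taxa unchanged.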
