Phylo cookbook

From Biopython
Revision as of 15:13, 26 June 2010 by EricTalevich (Talk | contribs)

Here are some examples of using Bio.Phylo for common tasks. Some of these functions might be added to Biopython in a later release, but you can already use them in your own code with Biopython 1.54.


Convenience functions

Index clades by name

For large trees, it helps to be able to select a clade by name, or by some other unique identifier, rather than searching the whole tree for it during each operation.

def lookup_by_names(tree):
    names = {}
    for clade in tree.find_clades():
        # Skip unnamed clades (typically internal nodes)
        if clade.name:
            if clade.name in names:
                raise ValueError("Duplicate key: %s" % clade.name)
            names[clade.name] = clade
    return names

Now you can retrieve a clade by name in constant time:

from Bio import Phylo

tree = Phylo.read('ncbi_taxonomy.xml', 'phyloxml')
names = lookup_by_names(tree)
for phylum in ('Apicomplexa', 'Euglenozoa', 'Fungi'):
    print("Phylum size %d" % len(names[phylum].get_terminals()))

A potential issue: the above implementation of lookup_by_names skips unnamed clades, which are generally internal nodes. We can fix this by giving each clade a unique identifier. Here, every clade name is prefixed with a unique number (which can be useful for searching, too), and an unnamed clade is named with that number alone. Note that this renames the clades of the tree in place:

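def tabulate_names(tree):
    names = {}
    for idx, clade in enumerate(tree.find_clades()):
        if clade.name:
            # Prefix existing names with the clade's index
            clade.name = '%d_%s' % (idx, clade.name)
        else:
            # Unnamed clades get the index alone as their name
            clade.name = str(idx)
        names[clade.name] = clade
    return names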

Calculate distances between neighboring terminals

Suggested by Joel Berendzen

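import itertools

def terminal_neighbor_dists(self):
    """Return a list of distances between adjacent terminals."""
    def generate_pairs(self):
        # Duplicate the iterator, then advance the second copy by one
        # so that each terminal is paired with the one that follows it.
        pairs = itertools.tee(self)
        next(pairs[1], None)
        return zip(pairs[0], pairs[1])
    return [self.distance(*i) for i in
            generate_pairs(self.find_clades(terminal=True))]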
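
To check the idea on a concrete case, here is a minimal sketch that computes neighbor distances on a small, made-up Newick tree. The tree string and the helper name neighbor_dists are illustrative assumptions, not part of Bio.Phylo. Instead of itertools, it pairs the list of terminals with a copy of itself shifted by one.

```python
from io import StringIO
from Bio import Phylo

def neighbor_dists(tree):
    """Distances between each terminal and the next one in tree order."""
    terminals = tree.get_terminals()
    # Pair every terminal with its right-hand neighbor: (A, B), (B, C), ...
    return [tree.distance(a, b) for a, b in zip(terminals, terminals[1:])]

# Hypothetical example tree with branch lengths
tree = Phylo.read(StringIO("((A:1,B:2):1,C:4);"), "newick")
dists = neighbor_dists(tree)
print(dists)
```

A tree with n terminals yields n - 1 distances, each measured along the branches through the pair's most recent common ancestor.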

Test for "semi-preterminal" clades

Suggested by Joel Berendzen

The existing tree method is_preterminal returns True if all of the direct descendants are terminal. This snippet will instead return True if any direct descendant is terminal, but still False if the given clade itself is terminal.

def is_semipreterminal(clade):
    """True if any direct descendant is terminal."""
    for child in clade:
        if child.is_terminal():
            return True
    return False

In Python 2.5 and later, this is simplified with the built-in any function:

def is_semipreterminal(clade):
    return any(child.is_terminal() for child in clade)
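
The difference between the two tests is easiest to see side by side. The sketch below, using a made-up Newick tree, prints both results for every internal clade; the clade (C,(D,E)) has one terminal child and one internal child, so it is semi-preterminal but not preterminal.

```python
from io import StringIO
from Bio import Phylo

def is_semipreterminal(clade):
    """True if any direct descendant is terminal."""
    return any(child.is_terminal() for child in clade)

# Hypothetical example tree; internal clades are visited in preorder:
# root, (A,B), (C,(D,E)), (D,E)
tree = Phylo.read(StringIO("((A,B),(C,(D,E)));"), "newick")
results = [(clade.is_preterminal(), is_semipreterminal(clade))
           for clade in tree.get_nonterminals()]
for is_pre, is_semi in results:
    print(is_pre, is_semi)
```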

Comparing trees


  • Symmetric difference / partition metric, a.k.a. topological distance
  • Quartets distance
  • Nearest-neighbor interchange
  • Path-length-difference

Consensus methods


Rooting methods


  • Root at the midpoint between the two most distant nodes (or "center" of all tips)
  • Root with the given outgroup (terminal or nonterminal)



  • Party tricks with draw_graphviz, covering each keyword argument

Exporting to other types

Convert to a PyCogent tree

The tree objects used by Biopython and PyCogent are different. Nonetheless, both toolkits support the Newick file format, so interoperability is straightforward at that level:

from Bio import Phylo
import cogent
Phylo.write(bptree, 'mytree.nwk', 'newick')  # Biopython tree
ctree = cogent.LoadTree('mytree.nwk')        # PyCogent tree
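
If a temporary file is inconvenient, the same Newick text can be produced in memory. The sketch below shows only the Biopython side, with a made-up example tree: it writes to a StringIO handle and reads the result back to confirm that the terminals survive the round trip. How the other toolkit accepts a Newick string is left open here.

```python
from io import StringIO
from Bio import Phylo

# Hypothetical example tree standing in for bptree
bptree = Phylo.read(StringIO("((A:1,B:2):1,C:4);"), "newick")

# Serialize to a Newick string in memory instead of a file on disk
handle = StringIO()
Phylo.write(bptree, handle, "newick")
newick_text = handle.getvalue()
print(newick_text.strip())

# Read the string back to check that nothing was lost
roundtrip = Phylo.read(StringIO(newick_text), "newick")
names = [term.name for term in roundtrip.get_terminals()]
```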


  • Convert objects directly, preserving some PhyloXML annotations if possible

Convert to a NumPy array or matrix


  • Adjacency matrix: cells are True if parent-child relationship exists, otherwise False
  • Distance matrix: cells are branch lengths if a branch exists, otherwise Inf or NaN
  • Relationship matrix? See Martins and Housworth 2002
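
The first two items could be sketched as follows. This is one possible layout, not an existing Bio.Phylo function: the helper name tree_matrices is hypothetical, rows are parents and columns are children (so the matrices are directed, not symmetric), and clades are numbered in preorder. As a further assumption, a branch whose length is undefined is recorded as NaN, while cells with no branch at all hold Inf.

```python
from io import StringIO
import numpy as np
from Bio import Phylo

def tree_matrices(tree):
    """Return (clades, adjacency, distances) for a tree.

    adjacency[i, j] is True if clades[j] is a direct child of clades[i].
    distances[i, j] is that branch's length, NaN if the length is
    undefined, and Inf where there is no branch.
    """
    clades = list(tree.find_clades())  # preorder: parents before children
    index = {id(clade): i for i, clade in enumerate(clades)}
    n = len(clades)
    adjacency = np.zeros((n, n), dtype=bool)
    distances = np.full((n, n), np.inf)
    for parent in clades:
        for child in parent.clades:
            i, j = index[id(parent)], index[id(child)]
            adjacency[i, j] = True
            if child.branch_length is None:
                distances[i, j] = np.nan
            else:
                distances[i, j] = child.branch_length
    return clades, adjacency, distances

# Hypothetical example tree; preorder is root, (A,B), A, B, C
tree = Phylo.read(StringIO("((A:1,B:2):1,C:4);"), "newick")
clades, adjacency, distances = tree_matrices(tree)
print(adjacency.astype(int))
print(distances)
```

Returning the list of clades alongside the matrices keeps the row and column order explicit, so a cell can always be traced back to the clade objects it describes.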