Phylo

From Biopython
(Difference between revisions)
Jump to: navigation, search
(Exporting to other object representations: to_adjacency_matrix was removed from utils)
(Example pipeline)
 
(22 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
This module provides classes, functions and I/O support for working with phylogenetic trees.
 
This module provides classes, functions and I/O support for working with phylogenetic trees.
  
This code is included in Biopython 1.54, and also described in the Phylogenetics chapter of the [http://biopython.org/DIST/docs/tutorial/Tutorial.html Biopython Tutorial] and the Bio.Phylo [http://biopython.org/DIST/docs/api/Bio.Phylo-module.html API pages] generated from the source code.
+
For more complete documentation, see the Phylogenetics chapter of the [http://biopython.org/DIST/docs/tutorial/Tutorial.html Biopython Tutorial] and the Bio.Phylo [http://biopython.org/DIST/docs/api/Bio.Phylo-module.html API pages] generated from the source code. The [[Phylo cookbook]] page has more examples of how to use this module, and the [[PhyloXML]] page describes how to attach graphical cues and additional information to a tree.
  
 
==Availability==
 
==Availability==
  
The source code for this module is included in Biopython 1.54. If you're interested in testing newer additions to this code before the next official release, see [[SourceCode]] for instructions on getting a copy of the development branch.
+
This module is included in Biopython 1.54 and later. If you're interested in testing newer additions to this code before the next official release, see [[SourceCode]] for instructions on getting a copy of the development branch.
 
+
Requirements:
+
* Python 2.4 or newer
+
* ElementTree module
+
  
 
To draw trees (optional), you'll also need these packages:
 
To draw trees (optional), you'll also need these packages:
* [http://networkx.lanl.gov/index.html NetworkX] 1.0rc1 (or 0.36 for snapshot at the end of GSoC 2009)
 
* [http://networkx.lanl.gov/pygraphviz/ PyGraphviz] 0.99.1 or [http://dkbza.org/pydot.html pydot]
 
 
* [http://matplotlib.sourceforge.net/ matplotlib]
 
* [http://matplotlib.sourceforge.net/ matplotlib]
 +
* [http://networkx.lanl.gov/index.html NetworkX] -- for the functions 'draw_graphviz' and 'to_networkx'
 +
* [http://networkx.lanl.gov/pygraphviz/ PyGraphviz] or [http://dkbza.org/pydot.html pydot] -- for the functions 'draw_graphviz' and 'to_networkx'
  
The I/O and tree-manipulation functionality will work without them; they're imported on demand when the functions to_networkx() and draw_graphviz() are called.
+
The I/O and tree-manipulation functionality will work without them; they're imported on demand when the functions draw(), draw_graphviz() and to_networkx() are called.
  
The XML parser used in the PhyloXMLIO sub-module is ElementTree, added to the Python standard library in Python 2.5. To use this module in Python 2.4, you'll need to install a separate package that provides the ElementTree interface. Two exist:
+
To use this module in older versions of Biopython with Python 2.4, you'll also need to install a separate package that provides the ElementTree interface for XML parsing. Either of these will work:
 
* [http://codespeak.net/lxml/ lxml]
 
* [http://codespeak.net/lxml/ lxml]
 
* [http://effbot.org/zone/element-index.htm elementtree] (or cElementTree)
 
* [http://effbot.org/zone/element-index.htm elementtree] (or cElementTree)
  
PhyloXMLIO attempts to import each of these compatible ElementTree implementations until it succeeds. The given XML file handle is then parsed incrementally to instantiate an object hierarchy containing the relevant phylogenetic information.
+
The Phylo module has also been successfully tested on Jython 2.5.1, minus the Graphviz- and NetworkX-based functions. However, parsing phyloXML files is noticeably slower because Jython uses a different version of the underlying XML parsing library.
 
+
This module has also been successfully tested on Jython 2.5.1, minus the Graphviz- and NetworkX-based functions. However, parsing phyloXML files is noticeably slower because Jython uses a different version of the underlying XML parsing library.
+
  
 
==I/O functions==
 
==I/O functions==
Line 35: Line 29:
  
 
The second argument to each function is the target format. Currently, the following formats are supported:
 
The second argument to each function is the target format. Currently, the following formats are supported:
* phyloxml
 
 
* newick
 
* newick
 
* nexus
 
* nexus
 +
* nexml
 +
* phyloxml
 +
* cdao
  
 
See the [[PhyloXML]] page for more examples of using tree objects.
 
See the [[PhyloXML]] page for more examples of using tree objects.
Line 76: Line 72:
  
 
<python>
 
<python>
tree = Phylo.read('example.xml', 'phyloxml')
+
tree = Phylo.read('example.dnd', 'newick')
 
print tree
 
print tree
 
</python>
 
</python>
 +
 +
If you have your tree data already loaded as a Python string, you can parse it with the help of StringIO (in Python's standard library):
 +
 +
<python>
 +
from cStringIO import StringIO
 +
 +
treedata = "(A, (B, C), (D, E))"
 +
handle = StringIO(treedata)
 +
tree = Phylo.read(handle, "newick")
 +
</python>
 +
 +
In one line:
 +
 +
<python>
 +
tree = Phylo.read(StringIO("(A, (B, C), (D, E))"), "newick")
 +
</python>
 +
 +
The other I/O functions also can be used with StringIO.
 +
 +
(General tip: if you write to the StringIO object and want to re-read the contents, you'll need to call the <code>seek(0)</code> method to move the handle back to the start of the StringIO data -- the same as an open file handle. See examples of this in the unit tests for Phylo, <code>Tests/test_Phylo.py</code>, in the Biopython source code.)
  
 
===write()===
 
===write()===
Line 104: Line 120:
 
'''PhyloXMLIO:'''
 
'''PhyloXMLIO:'''
 
Support for the [http://www.phyloxml.org/ phyloXML] format. See the [[PhyloXML]] page for details.
 
Support for the [http://www.phyloxml.org/ phyloXML] format. See the [[PhyloXML]] page for details.
 +
 +
'''NeXMLIO:'''
 +
Support for the [http://www.nexml.org/ NeXML] format.
  
 
'''NewickIO:'''
 
'''NewickIO:'''
Line 111: Line 130:
 
Wrappers around Bio.Nexus to support the Nexus tree format.
 
Wrappers around Bio.Nexus to support the Nexus tree format.
  
The Nexus format actually contains several sub-formats for different kinds of data; to represent trees, Nexus provides a block containing some metadata and one or more Newick trees. (Another kind of Nexus block can represent alignments; this is handled in [[AlignIO]].) So to parse a complete Nexus file with all block types handled, use Bio.Nexus directly, and to extract just the trees, use Bio.Phylo. Integration between Bio.Nexus and Bio.Phylo will be improved in the future.
+
'''CDAOIO:'''
 +
Support for the Comparative Data Analysis Ontology (CDAO). Requires [https://github.com/RDFLib/rdflib RDFlib].
 +
 
 +
The Nexus format actually contains several sub-formats for different kinds of data; to represent trees, Nexus provides a block containing some metadata and one or more Newick trees. (Another kind of Nexus block can represent alignments; this is handled in [[AlignIO]].) So to parse a complete Nexus file with all block types handled, use Bio.Nexus directly, and to extract just the trees, use Bio.Phylo.
  
 
==Tree and Subtree classes==
 
==Tree and Subtree classes==
Line 134: Line 156:
  
 
Where a third-party package is required, that package is imported when the function itself is called, so these dependencies are not necessary to install or use the rest of the Tree module.
 
Where a third-party package is required, that package is imported when the function itself is called, so these dependencies are not necessary to install or use the rest of the Tree module.
 
===Exporting to other object representations===
 
 
Although any phylogenetic tree can reasonably be represented by a directed acyclic graph, the Phylo module does not attempt to provide a generally usable graph library -- only the minimum functionality to represent phylogenetic trees. Instead, it provides functions for exporting tree objects to the standard graph representations, adjacency list (dict) and adjacency matrix, using third-party libraries.
 
 
'''to_networkx''' returns the given tree as a [http://networkx.lanl.gov/ NetworkX] LabeledDiGraph or LabeledGraph object (depending on whether the tree is rooted). You'll probably need to import networkx directly for subsequent operations on the graph object. From this point you can also try using one of networkx's drawing functions to display the tree, and for simple, fully labeled trees it may even work -- but you'll have better results with Phylo's own draw_graphviz function, discussed below.
 
 
<python>
 
import networkx, pylab
 
tree = Phylo.read('example.xml', 'phyloxml')
 
net = Phylo.to_networkx(tree)
 
networkx.draw(net)
 
pylab.show()
 
</python>
 
  
 
===Displaying trees===
 
===Displaying trees===
  
'''pretty_print''' produces a plain-text representation of the entire tree. Uses str() to display nodes by default; for the longer repr() representation, add the argument show_all=True.
+
'''str('''tree''')''' produces a plain-text representation of the entire tree. Strings are automatically truncated to ensure reasonable display.
  
Strings are automatically truncated to ensure reasonable display.
+
Use this with the print statement to get a quick overview of your tree:
  
<pre>
 
>>> phx = Phylo.parse('phyloxml_examples.xml', 'phyloxml').next()
 
>>> Phylo.pretty_print(phx)
 
Phylogeny example from Prof. Joe Felsenstein's book "Inferring Phylogenies"
 
Clade
 
Clade
 
Clade A
 
Clade B
 
Clade C
 
...
 
</pre>
 
  
<pre>>>> Phylo.pretty_print(phx, show_all=True)
+
<pre>
 +
>>> tree = Phylo.parse('phyloxml_examples.xml', 'phyloxml').next()
 +
>>> print tree
 
Phylogeny(description='phyloXML allows to use either a "branch_length"
 
Phylogeny(description='phyloXML allows to use either a "branch_length"
 
attribute or element to indicate branch lengths.', name='example from
 
attribute or element to indicate branch lengths.', name='example from
Line 178: Line 177:
 
...
 
...
 
</pre>
 
</pre>
 +
 +
[[File:phylo-draw-apaf1.png|256px|thumb|right|Rooted phylogram, via Phylo.draw]]
 +
'''draw''' displays a rooted phylogram using matplotlib or pylab. ''New in Biopython 1.58.''
 +
 +
Try this:
 +
<python>
 +
tree = Phylo.read('apaf.xml', 'phyloxml')
 +
tree.ladderize()  # Flip branches so deeper clades are displayed at top
 +
Phylo.draw(tree)
 +
</python>
 +
  
 
'''draw_graphviz''' mimics the networkx function of the same name, with some tweaks to improve the display of the graph. If a file name is given, the graph is drawn directly to that file, and options such as image format (default PDF) may be used.
 
'''draw_graphviz''' mimics the networkx function of the same name, with some tweaks to improve the display of the graph. If a file name is given, the graph is drawn directly to that file, and options such as image format (default PDF) may be used.
  
[[File:Phylo-apaf.png|256px|thumb|right|Phylogram with colored nodes]]
+
[[File:Phylo-apaf.png|256px|thumb|right|Unrooted tree with colored nodes]]
  
 
Prerequisites: In addition to networkx, you'll need a local installation of Graphviz, [http://matplotlib.sourceforge.net/ matplotlib] and either [http://networkx.lanl.gov/pygraphviz/ PyGraphviz] or [http://dkbza.org/pydot.html pydot].
 
Prerequisites: In addition to networkx, you'll need a local installation of Graphviz, [http://matplotlib.sourceforge.net/ matplotlib] and either [http://networkx.lanl.gov/pygraphviz/ PyGraphviz] or [http://dkbza.org/pydot.html pydot].
Line 199: Line 209:
 
</python>
 
</python>
  
See the function's docstring for more explanation.
+
See the [[Phylo cookbook]] page for more drawing features and options.
 
+
''TODO: Set up a cookbook page to demonstrate the more exotic options.''
+
  
 
'''draw_ascii''' prints an ascii-art rooted phylogram to standard output, or another file handle if specified. Only terminal node labels are shown; these are the result of ''str(clade)'' (usually clade names). The width of the text field used for drawing is 80 characters by default, adjustable with the ''column_width'' keyword argument, and the height in character rows is twice the number of terminals in the tree.
 
'''draw_ascii''' prints an ascii-art rooted phylogram to standard output, or another file handle if specified. Only terminal node labels are shown; these are the result of ''str(clade)'' (usually clade names). The width of the text field used for drawing is 80 characters by default, adjustable with the ''column_width'' keyword argument, and the height in character rows is twice the number of terminals in the tree.
Line 292: Line 300:
 
</pre>
 
</pre>
  
 +
 +
===Exporting to other object representations===
 +
 +
Although any phylogenetic tree can reasonably be represented by a directed acyclic graph, the Phylo module does not attempt to provide a generally usable graph library -- only the minimum functionality to represent phylogenetic trees. Instead, it provides functions for exporting tree objects to the standard graph representations, adjacency list (dict) and adjacency matrix, using third-party libraries.
 +
 +
'''to_networkx''' returns the given tree as a [http://networkx.lanl.gov/ NetworkX] LabeledDiGraph or LabeledGraph object (depending on whether the tree is rooted). You'll probably need to import networkx directly for subsequent operations on the graph object. From this point you can also try using one of networkx's drawing functions to display the tree, and for simple, fully labeled trees it may even work -- but you'll have better results with Phylo's own '''draw_graphviz''' function, discussed above.
 +
 +
<python>
 +
import networkx, pylab
 +
tree = Phylo.read('example.xml', 'phyloxml')
 +
net = Phylo.to_networkx(tree)
 +
networkx.draw(net)
 +
pylab.show()
 +
</python>
 +
 +
Recipes for exporting to other libraries, including '''ape''' (via Rpy2) and '''PyCogent''', are available on the [[Phylo_cookbook|Phylo cookbook]] page.
 +
 +
== Upcoming GSoC 2013 features ==
 +
 +
Many new features for building and processing phylogenetic trees were developed by Yanbo Ye for [[Google Summer of Code]] 2013. They are available on the development branch (see [[SourceCode]]) but are not yet included in an official Biopython release version. Note that the behavior and API of these features may change before the upcoming official release.
 +
 +
 +
=== Tree Construction ===
 +
 +
In addition to wrappers of tree construction programs(PHYLIP programs through EMBOSS wrappers in Bio.Emboss.Applications), now Biopython also provides several tree construction algorithm implementations in pure python in the <code>Bio.Phylo.TreeConstruction</code> module.
 +
 +
All algorithms are designed as worker subclasses of a base class <code>TreeConstructor</code>. All constructors have the same method <code>build_tree</code> that accept a <code>MultipleSeqAlignment</code> object to construct the tree. Currently there are two types of tree constructors: <code>DistanceTreeConstructor</code> and <code>ParsimonyTreeConstructor</code>.
 +
 +
==== DistanceTreeConstructor ====
 +
 +
The <code>DistanceTreeConstructor</code> has two algorithms: UPGMA(Unweighted Pair Group Method with Arithmetic Mean) and NJ(Neighbor Joining).
 +
 +
Both algorithms construct trees based on a distance matrix. So before using these algorithms, let me introduce the <code>DistanceCalculator</code> to generate the distance matrix from a <code>MultipleSeqAlignment</code> object. The following code shows a common way to do this:
 +
 +
<pre>&gt;&gt;&gt; from Bio.Phylo.TreeConstruction import DistanceCalculator
 +
&gt;&gt;&gt; from Bio import AlignIO
 +
&gt;&gt;&gt; aln = AlignIO.read('Tests/TreeConstruction/msa.phy', 'phylip')
 +
&gt;&gt;&gt; print aln
 +
SingleLetterAlphabet() alignment with 5 rows and 13 columns
 +
AACGTGGCCACAT Alpha
 +
AAGGTCGCCACAC Beta
 +
GAGATTTCCGCCT Delta
 +
GAGATCTCCGCCC Epsilon
 +
CAGTTCGCCACAA Gamma
 +
&gt;&gt;&gt; calculator = DistanceCalculator('identity')
 +
&gt;&gt;&gt; dm = calculator.get_distance(aln)
 +
&gt;&gt;&gt; dm
 +
DistanceMatrix(names=['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon'], matrix=[[0], [0.23076923076923073, 0], [0.3846153846153846, 0.23076923076923073, 0], [0.5384615384615384, 0.5384615384615384, 0.5384615384615384, 0], [0.6153846153846154, 0.3846153846153846, 0.46153846153846156, 0.15384615384615385, 0]])
 +
&gt;&gt;&gt; print dm
 +
Alpha  0
 +
Beta    0.230769230769  0
 +
Gamma  0.384615384615  0.230769230769  0
 +
Delta  0.538461538462  0.538461538462  0.538461538462  0
 +
Epsilon 0.615384615385  0.384615384615  0.461538461538  0.153846153846  0
 +
    Alpha  Beta    Gamma  Delta  Epsilon</pre>
 +
As you see, we create a <code>DistanceCalculator</code> object with a string 'identity', which is the name of the model(scoring matrix) to calculate the distance. The 'identity' model is the default one and can be used both for DNA and Protein sequence. To check available models for DNA, protein or all, use the attribute of the calculator <code>dna_models</code>, <code>protein_models</code>, <code>models</code> respectively.
 +
 +
After the calculator is created with the model, simply use the <code>get_distance()</code> method to get the distance matrix of a given alignment object. Then you will get a <code>DistanceMatrix</code> object, a subclass of <code>Matrix</code>(we will talk about this later).
 +
 +
Now, let's get back to the <code>DistanceTreeConstructor</code>. We can pass the <code>DistanceCalculator</code> object and a string parameter('nj' or 'upgma') to initialize it, and then call its <code>build_tree()</code> as mentioned before.
 +
 +
<pre>&gt;&gt;&gt; from TreeConstruction import DistanceTreeConstructor
 +
&gt;&gt;&gt; constructor = DistanceTreeConstructor(calculator, 'nj')
 +
&gt;&gt;&gt; tree = constructor.build_tree(aln)
 +
&gt;&gt;&gt; print tree
 +
Tree(rooted=False)
 +
    Clade(branch_length=0, name='Inner3')
 +
        Clade(branch_length=0.182692307692, name='Alpha')
 +
        Clade(branch_length=0.0480769230769, name='Beta')
 +
        Clade(branch_length=0.0480769230769, name='Inner2')
 +
            Clade(branch_length=0.278846153846, name='Inner1')
 +
                Clade(branch_length=0.0512820512821, name='Epsilon')
 +
                Clade(branch_length=0.102564102564, name='Delta')
 +
            Clade(branch_length=0.144230769231, name='Gamma')</pre>
 +
While sometimes you might want to use your own <code>DistanceMatrix</code> directly instead of the raw alignment, we provide another direct way to use both algorithms.
 +
 +
<pre>&gt;&gt;&gt; from TreeConstruction import DistanceTreeConstructor
 +
&gt;&gt;&gt; constructor = DistanceTreeConstructor()
 +
&gt;&gt;&gt; tree = constructor.nj(dm)
 +
&gt;&gt;&gt; print tree
 +
Tree(rooted=False)
 +
    Clade(branch_length=0, name='Inner3')
 +
        Clade(branch_length=0.182692307692, name='Alpha')
 +
        Clade(branch_length=0.0480769230769, name='Beta')
 +
        Clade(branch_length=0.0480769230769, name='Inner2')
 +
            Clade(branch_length=0.278846153846, name='Inner1')
 +
                Clade(branch_length=0.0512820512821, name='Epsilon')
 +
                Clade(branch_length=0.102564102564, name='Delta')
 +
            Clade(branch_length=0.144230769231, name='Gamma')
 +
&gt;&gt;&gt; tree = constructor.upgma(dm)
 +
&gt;&gt;&gt; print tree
 +
Tree(rooted=True)
 +
    Clade(branch_length=0, name='Inner4')
 +
        Clade(branch_length=0.1875, name='Inner1')
 +
            Clade(branch_length=0.0769230769231, name='Epsilon')
 +
            Clade(branch_length=0.0769230769231, name='Delta')
 +
        Clade(branch_length=0.110576923077, name='Inner3')
 +
            Clade(branch_length=0.0384615384615, name='Inner2')
 +
                Clade(branch_length=0.115384615385, name='Gamma')
 +
                Clade(branch_length=0.115384615385, name='Beta')
 +
            Clade(branch_length=0.153846153846, name='Alpha')
 +
</pre>
 +
 +
==== ParsimonyTreeConstructor ====
 +
 +
Unlike <code>DistanceTreeConstructor</code>, the concrete algorithm of <code>ParsimonyTreeConstructor</code> is delegated to two different worker classes: the <code>ParsimonyScorer</code> to calculate the parsimony score of a target tree by the given alignment, and the <code>TreeSearcher</code> to search the best tree that minimize the parsimony score. A typical usage example can be as follows:
 +
 +
<pre>&gt;&gt;&gt; from Bio import AlignIO
 +
&gt;&gt;&gt; from TreeConstruction import *
 +
&gt;&gt;&gt; aln = AlignIO.read(open('Tests/TreeConstruction/msa.phy'), 'phylip')
 +
&gt;&gt;&gt; starting_tree = Phylo.read('Tests/TreeConstruction/nj.tre', 'newick')
 +
&gt;&gt;&gt; scorer = ParsimonyScorer()
 +
&gt;&gt;&gt; searcher = NNITreeSearcher(scorer)
 +
&gt;&gt;&gt; constructor = ParsimonyTreeConstructor(searcher, starting_tree)
 +
&gt;&gt;&gt; pars_tree = constructor.build_tree(aln)
 +
&gt;&gt;&gt; print pars_tree
 +
Tree(weight=1.0, rooted=True)
 +
    Clade(branch_length=0.0)
 +
        Clade(branch_length=0.197335, name='Inner1')
 +
            Clade(branch_length=0.13691, name='Delta')
 +
            Clade(branch_length=0.08531, name='Epsilon')
 +
        Clade(branch_length=0.041935, name='Inner2')
 +
            Clade(branch_length=0.01421, name='Inner3')
 +
                Clade(branch_length=0.17523, name='Gamma')
 +
                Clade(branch_length=0.07477, name='Beta')
 +
            Clade(branch_length=0.29231, name='Alpha')</pre>
 +
The <code>ParsimonyScorer</code> is a combination of the Fitch algorithm and Sankoff algorithm. It will work as Fitch algorithm by default if no parameter is provide, and work as Sankoff algorithm if a parsimony scoring matrix(a <code>Matrix</code> object) is given, .
 +
 +
Then the scorer is passed to a <code>TreeSearcher</code> to tell it how to evaluate different trees during searching. Currently, only one searcher <code>NNITreeSearcher</code>, the Nearest Neighbor Interchange(NNI) algorithm, is implemented.
 +
 +
By passing the searcher and a starting tree to the <code>ParsimonyTreeConstructor</code>, we finally get the instance of it. If no starting tree is provided, a simple upgma tree will be created instead, with the 'identity' model. To use this parsimony constructor, just simply call the <code>build_tree</code> method with an alignment.
 +
 +
=== Consensus Tree ===
 +
 +
==== Strict, Majority Rule and Adam Consensus ====
 +
 +
Same as tree construction algorithms, three consensus tree algorithms(Strict, Majority Rule and Adam Consensus) in pure python are also implemented in the <code>Bio.Phylo.Consensus</code> module. You can directly call <code>strict_consensus</code>, <code>majority_consensus</code> and <code>adam_consensus</code> to use these algorithms with a list of trees as the input.
 +
 +
<pre>&gt;&gt;&gt; from Bio import Phylo
 +
&gt;&gt;&gt; from Bio.Phylo.Consensus import *
 +
&gt;&gt;&gt; trees = list(Phylo.parse('Tests/TreeConstruction/trees.tre', 'newick'))
 +
&gt;&gt;&gt; strict_tree = strict_consensus(trees)
 +
&gt;&gt;&gt; majority_tree = majority_consensus(trees, 0.5)
 +
&gt;&gt;&gt; adam_tree = adam_consensus(trees)</pre>
 +
Instead of using 50% as the cutoff, the <code>majority_consensus</code> method allows you to set your own one by providing an extra <code>cutoff</code> parameter(0~1, 0 by default). That means it can also work as strict consensus algorithm when the <code>cutoff</code> equals 1. One more thing different to strict and adam consensus tree, the result majority rule consensus tree has branch support value that are automatically assigned during calculation.
 +
 +
==== Bootstrap Methods ====
 +
 +
To get the consensus tree, we must construct a list of bootstrap replicate trees. So in the <code>Bio.Phylo.Consensus</code> module, we also provide several useful bootstrap methods to achieve this.
 +
 +
<pre>&gt;&gt;&gt; from Bio import Phylo
 +
&gt;&gt;&gt; from Bio.Phylo.Consensus import *
 +
&gt;&gt;&gt; msa = AlignIO.read('Tests/TreeConstruction/msa.phy', 'phylip')
 +
&gt;&gt;&gt; msas = bootstrap(msa, 100)</pre>
 +
As you see, the <code>bootstrap</code> method accepts a <code>MultipleSeqAlignment</code> object and generates its bootstrap replicate 100 times. The you can use them to build replicate trees. While, we also provide a convenient method to do this.
 +
 +
<pre>&gt;&gt;&gt; calculator = DistanceCalculator('blosum62')
 +
&gt;&gt;&gt; constructor = DistanceTreeConstructor(calculator)
 +
&gt;&gt;&gt; trees = bootstrap_trees(msa, 100, constructor)</pre>
 +
This time we pass an extra <code>DistanceTreeConstructor</code> object to a <code>bootstrap_trees</code> method and finally got the replicate trees. Note that both <code>bootstrap</code> and <code>bootstrap_trees</code> are generator functions. You need to use <code>list()</code> function to turn the result into a list of alignment or trees.
 +
 +
Another useful function is <code>bootstrap_consensus</code>. By passing the consensus method as another extra parameter, we can directly get the consensus tree.
 +
 +
<pre>&gt;&gt;&gt; consensus_tree = bootstrap_consensus(msa, 100, constructor, majority_consensus)</pre>
 +
 +
==== Branch Support ====
 +
 +
To get the branch support of a specific tree, we can use the <code>get_support</code> method.
 +
 +
<pre>&gt;&gt;&gt; from Bio import Phylo
 +
&gt;&gt;&gt; from Bio.Phylo.Consensus import *
 +
&gt;&gt;&gt; trees = list(Phylo.parse('Tests/TreeConstruction/trees.tre', 'newick'))
 +
&gt;&gt;&gt; target_tree = trees[0]
 +
&gt;&gt;&gt; support_tree = get_support(target_tree, trees)</pre>
 +
In the above code, we use the first tree as the target tree that we want to calculate its branch support. The <code>get_support</code> method accepts the target tree and a list of trees, and returns a tree, with all internal clades assigned with branch support values.
 +
 +
=== Other Useful Classes ===
 +
 +
There are some other classes in both <code>TreeConstruction</code> and <code>Consensus</code> module that are used in those algorithms. They may not be used commonly, but might be useful to you in some cases.
 +
 +
==== _Matrix ====
 +
 +
The <code>_Matrix</code> class in the <code>TreeConstruction</code> module is the super class of <code>_DistanceMatrix</code>. They are both actually constructed by a list of names and a nested list of numbers in lower triangular matrix format. The only difference between them is that the diagonal elements in <code>_DistanceMatrix</code> will all be 0 no matter what values are assigned.
 +
 +
To create a <code>_Matrix</code> object:
 +
 +
<pre>&gt;&gt;&gt; from Bio.Phylo.TreeConstruction import _Matrix
 +
&gt;&gt;&gt; names = ['Alpha', 'Beta', 'Gamma', 'Delta']
 +
&gt;&gt;&gt; matrix = [[0], [1, 0], [2, 3, 0], [4, 5, 6, 0]]
 +
&gt;&gt;&gt; m = _Matrix(names, matrix)
 +
&gt;&gt;&gt; m
 +
_Matrix(names=['Alpha', 'Beta', 'Gamma', 'Delta'], matrix=[[0], [1, 0], [2, 3, 0], [4, 5, 6, 0]])
 +
&gt;&gt;&gt; print m
 +
Alpha  0
 +
Beta    1  0
 +
Gamma  2  3  0
 +
Delta  4  5  6  0
 +
    Alpha  Beta    Gamma  Delta</pre>
 +
You can use two indices to get or assign an element in the matrix, and the indices are exchangeable.
 +
 +
<pre>&gt;&gt;&gt; m[1,2]
 +
3
 +
&gt;&gt;&gt; m[2,1]
 +
3
 +
&gt;&gt;&gt; m['Beta','Gamma']
 +
3
 +
&gt;&gt;&gt; m['Beta','Gamma'] = 4
 +
&gt;&gt;&gt; m['Beta','Gamma']
 +
4</pre>
 +
Further more, you can use one index to get or assign a list of elements related to that index:
 +
 +
<pre>&gt;&gt;&gt; m[0]
 +
[0, 1, 2, 4]
 +
&gt;&gt;&gt; m['Alpha']
 +
[0, 1, 2, 4]
 +
&gt;&gt;&gt; m['Alpha'] = [0, 7, 8, 9]
 +
&gt;&gt;&gt; m[0]
 +
[0, 7, 8, 9]
 +
&gt;&gt;&gt; m[0,1]
 +
7</pre>
 +
Also you can delete or insert a column&amp;row of elements by index:
 +
 +
<pre>&gt;&gt;&gt; m
 +
_Matrix(names=['Alpha', 'Beta', 'Gamma', 'Delta'], matrix=[[0], [7, 0], [8, 4, 0], [9, 5, 6, 0]])
 +
&gt;&gt;&gt; del m['Alpha']
 +
&gt;&gt;&gt; m
 +
_Matrix(names=['Beta', 'Gamma', 'Delta'], matrix=[[0], [4, 0], [5, 6, 0]])
 +
&gt;&gt;&gt; m.insert('Alpha', [0, 7, 8, 9] , 0)
 +
&gt;&gt;&gt; m
 +
_Matrix(names=['Alpha', 'Beta', 'Gamma', 'Delta'], matrix=[[0], [7, 0], [8, 4, 0], [9, 5, 6, 0]])</pre>
 +
 +
==== _BitString ====
 +
 +
<code>_BitString</code> is an assistant class used frequently in the algorithms in the <code>Consensus</code> module. It's a sub-class of <code>str</code> object that only accepts two characters('0' and '1'), with additional functions for binary-like manipulation(&amp;|^~). Common usage is as follows:
 +
 +
<pre>&gt;&gt;&gt; from Bio.Phylo.Consensus import _BitString
 +
&gt;&gt;&gt; bitstr1 = _BitString('11111')
 +
&gt;&gt;&gt; bitstr2 = _BitString('11100')
 +
&gt;&gt;&gt; bitstr3 = _BitString('01101')
 +
&gt;&gt;&gt; bitstr1
 +
_BitString('11111')
 +
&gt;&gt;&gt; bitstr2 &amp; bitstr3
 +
_BitString('01100')
 +
&gt;&gt;&gt; bitstr2 | bitstr3
 +
_BitString('11101')
 +
&gt;&gt;&gt; bitstr2 ^ bitstr3
 +
_BitString('10001')
 +
&gt;&gt;&gt; bitstr2.index_one()
 +
[0, 1, 2]
 +
&gt;&gt;&gt; bitstr3.index_one()
 +
[1, 2, 4]
 +
&gt;&gt;&gt; bitstr3.index_zero()
 +
[0, 3]
 +
&gt;&gt;&gt; bitstr1.contains(bitstr2)
 +
True
 +
&gt;&gt;&gt; bitstr2.contains(bitstr3)
 +
False
 +
&gt;&gt;&gt; bitstr2.independent(bitstr3)
 +
False
 +
&gt;&gt;&gt; bitstr2.independent(bitstr4)
 +
True
 +
&gt;&gt;&gt; bitstr1.iscompatible(bitstr2)
 +
True
 +
&gt;&gt;&gt; bitstr2.iscompatible(bitstr3)
 +
False
 +
&gt;&gt;&gt; bitstr2.iscompatible(bitstr4)
 +
True</pre>
 +
In consensus and branch support algorithms, it is used to count and store the clades in multiple trees. During counting, the clades will be considered the same if their terminals(in terms of <code>name</code> attribute) are the same.
 +
 +
For example, let's say two trees are provided as below to search their strict consensus tree:
 +
 +
<pre>tree1: (((A, B), C),(D, E))
 +
tree2: ((A, (B, C)),(D, E))</pre>
 +
For both trees, a <code>_BitString</code> object '11111' will represent their root clade. Each '1' stands for the terminal clade in the list [A, B, C, D, E] (the order might not be the same, it's determined by the <code>get_terminal</code> method of the first tree provided). For the clade ((A, B), C) in tree1 and (A, (B, C)) in tree2, they both can be represented by '11100'. Similarly, '11000' represents clade (A, B) in tree1, '01100' represents clade (B, C) in tree2, and '00011' represents clade (D, E) in both trees.
 +
 +
So, with the <code>_count_clades</code> function in this module, finally we can get the clade counts and their _BitString representation as follows(the root and terminals are omitted):
 +
 +
<pre>clade  _BitString  count
 +
ABC    '11100'    2
 +
DE      '00011'    2
 +
AB      '11000'    1
 +
BC      '01100'    1</pre>
 +
To get the _BitString representation of a clade, we can use the following code snippet:
 +
 +
<pre># suppose we are provided with a tree list, the first thing to do is
 +
# to get all the terminal names in the first tree
 +
term_names = [term.name for term in trees[0].get_terminals()]
 +
# for a specific clade in any of the tree, also get its terminal names
 +
clade_term_names = [term.name for term in clade.get_terminals()]
 +
# then create a boolean list
 +
boolvals = [name in clade_term_names for name in term_names]
 +
# create the string version and pass it to _BitString 
 +
bitstr = _BitString(''.join(map(str, map(int, boolvals))))</pre>
 +
To convert back:
 +
 +
<pre># get all the terminal clades of the first tree
 +
terms = [term for term in trees[0].get_terminals()]
 +
# get the index of terminal clades in bitstr
 +
index_list = bitstr.index_one()
 +
# get all terminal clades by index
 +
clade_terms = [terms[i] for i in index_list]
 +
# create a new calde and append all the terminal clades
 +
new_clade = BaseTree.Clade()
 +
new_clade.clades.extend(clade_terms)</pre>
 +
This is how the <code>_BitString</code> is used in the consensus and branch support algorithms. And I think it can be used in many other conditions.
 +
 +
For example, we can even use it to check whether the structures of two trees are the same.
 +
 +
<pre># store and return all _BitStrings
 +
def _bitstrs(tree):
 +
    bitstrs = set()
 +
    term_names = [term.name for term in tree.get_terminals()]
 +
    term_names.sort()
 +
    for clade in tree.get_nonterminals():
 +
        clade_term_names = [term.name for term in clade.get_terminals()]
 +
        boolvals = [name in clade_term_names for name in term_names] 
 +
        bitstr = _BitString(''.join(map(str, map(int, boolvals))))
 +
        bitstrs.add(bitstr)
 +
    return bitstrs
 +
   
 +
def compare(tree1, tree2):
 +
    term_names1 =  [term.name for term in tree1.get_terminals()]
 +
    term_names2 =  [term.name for term in tree2.get_terminals()]
 +
    # false if terminals are not the same
 +
    if set(term_names1) != set(term_names2):
 +
        return False
 +
    # true if _BitStrings are the same
 +
    if _bitstrs(tree1) == _bitstrs(tree2):
 +
        return True
 +
    else:
 +
        return False</pre>
 +
To use it:
 +
 +
<pre>&gt;&gt;&gt; tree1 = Phylo.read('Tests/TreeConstruction/upgma.tre', 'newick')
 +
&gt;&gt;&gt; tree2 = Phylo.read('Tests/TreeConstruction/nj.tre', 'newick')
 +
&gt;&gt;&gt; tree3 = Phylo.read('Tests/TreeConstruction/pars1.tre', 'newick')
 +
&gt;&gt;&gt; compare(tree1, tree2)
 +
False
 +
&gt;&gt;&gt; compare(tree1, tree3)
 +
True
 +
&gt;&gt;&gt; compare(tree2, tree3)
 +
False</pre>
 +
 +
==External Applications==
 +
 +
* Bio.Phylo.PhymlCommandline provides a wrapper for [http://www.atgc-montpellier.fr/phyml/ PhyML] following the usual Bio.Applications API.
 +
* Bio.Phylo.[[PAML]] provides wrappers, parsers and utilities for working with the PAML suite of programs.
 +
 +
==Example pipeline==
 +
 +
See the [http://biopython.org/DIST/docs/tutorial/Tutorial.html Biopython Tutorial] sections on sequence alignment and BLAST for explanations of the first few steps shown here.
 +
 +
1. Search for homologs of a protein sequence using BLAST.
 +
<python>
 +
from Bio.Blast import NBCIStandalone, NCBIXML
 +
 +
query_fname = 'AAG35789.fasta'
 +
result_handle, error_handle = NCBIStandalone.blastall('/usr/bin/blastall', 'blastp',
 +
                                                      '/db/fasta/swissprot', query_fname)
 +
blast_record = NCBIXML.read(result_handle)  # This takes some time to run
 +
</python>
 +
 +
2. Extract the best hits from the BLAST result.
 +
<python>
 +
from Bio import SeqIO
 +
from Bio.Seq import Seq
 +
from Bio.SeqRecord import SeqRecord
 +
 +
def get_seqrecs(alignments, threshold):
 +
    for aln in alignments:
 +
        for hsp in aln.hsps:
 +
            if hsp.expect < threshold:
 +
                yield SeqRecord(Seq(hsp.sbjct), id=aln.accession)
 +
                break
 +
 +
best_seqs = get_seqrecs(blast_record.alignments, 1e-90)
 +
SeqIO.write(best_seqs, 'egfr-family.fasta', 'fasta')
 +
</python>
 +
 +
To help with annotating to your tree later, pick a lookup key here (e.g. accession number) and build a dictionary mapping that key to any additional data you can glean from the BLAST output, such as taxonomy and GI numbers. In this example, we're only keeping the original sequence and accession number.
 +
 +
3. Re-align the sequences using Muscle. The program creates an alignment file in Clustal format, "egfr-family.aln".
 +
 +
<python>
 +
from Bio.Align.Applications import MuscleCommandline
 +
 +
cmdline = MuscleCommandline(input="egfr-family.fasta", out="efgr-family.aln", clw=True)
 +
cmdline()
 +
</python>
 +
 +
4. Infer a gene tree using PhyML. First, convert the alignment from step 3 to "relaxed Phylip" format (new in Biopython 1.58):
 +
 +
<python>
 +
from Bio import AlignIO
 +
 +
AlignIO.convert("egfr-family.aln", "clustal", "egfr-family.phy", "phylip-relaxed")
 +
</python>
 +
 +
Feed the alignment to PhyML using the command line wrapper:
 +
 +
<python>
 +
from Bio.Phylo.Applications import PhymlCommandline
 +
 +
cmdline = PhymlCommandline(input='egfr-family.phy', datatype='aa', model='WAG', alpha='e', bootstrap=100)
 +
out_log, err_log = cmdline()
 +
</python>
 +
 +
5. Load the gene tree with Phylo, and take a quick look at the topology. (PhyML writes the tree to a file named after the input file plus "_phyml_tree.txt".)
 +
 +
<python>
 +
from Bio import Phylo
 +
 +
egfr_tree = Phylo.read("egfr-family.phy_phyml_tree.txt", "newick")
 +
Phylo.draw_ascii(egfr_tree)
 +
</python>
 +
 +
6. Add accession numbers and sequences to the tree -- now we're using [[PhyloXML]]'s extra features.
 +
<python>
 +
from Bio.Phylo import PhyloXML
 +
 +
# Promote the basic tree to PhyloXML
 +
egfr_phy = egfr_tree.as_phyloxml()
 +
 +
# Make a lookup table for sequences
 +
lookup = dict((rec.id, str(rec.seq)) for rec in best_seqs)
 +
 +
for clade in egfr_phy.get_terminals():
 +
    key = clade.name
 +
    accession = PhyloXML.Accession(key, 'NCBI')
 +
    mol_seq = PhyloXML.MolSeq(lookup[key], is_aligned=True)
 +
    sequence = PhyloXML.Sequence(type='aa', accession=accession, mol_seq=mol_seq)
 +
    clade.sequences.append(sequence)
 +
 +
# Save the annotated phyloXML file
 +
Phylo.write(egfr_phy, 'egfr-family-annotated.xml', 'phyloxml')
 +
</python>
  
 
[[Category:Wiki Documentation]]
 
[[Category:Wiki Documentation]]

Latest revision as of 13:49, 16 January 2014

This module provides classes, functions and I/O support for working with phylogenetic trees.

For more complete documentation, see the Phylogenetics chapter of the Biopython Tutorial and the Bio.Phylo API pages generated from the source code. The Phylo cookbook page has more examples of how to use this module, and the PhyloXML page describes how to attach graphical cues and additional information to a tree.

Contents

Availability

This module is included in Biopython 1.54 and later. If you're interested in testing newer additions to this code before the next official release, see SourceCode for instructions on getting a copy of the development branch.

To draw trees (optional), you'll also need these packages:

The I/O and tree-manipulation functionality will work without them; they're imported on demand when the functions draw(), draw_graphviz() and to_networkx() are called.

To use this module in older versions of Biopython with Python 2.4, you'll also need to install a separate package that provides the ElementTree interface for XML parsing. Either of these will work:

The Phylo module has also been successfully tested on Jython 2.5.1, minus the Graphviz- and NetworkX-based functions. However, parsing phyloXML files is noticeably slower because Jython uses a different version of the underlying XML parsing library.

I/O functions

Wrappers for supported file formats are available from the top level of the module:

from Bio import Phylo

Like SeqIO and AlignIO, this module provides four I/O functions: parse(), read(), write() and convert(). Each function accepts either a file name or an open file handle, so data can be also loaded from compressed files, StringIO objects, and so on. If the file name is passed as a string, the file is automatically closed when the function finishes; otherwise, you're responsible for closing the handle yourself.

The second argument to each function is the target format. Currently, the following formats are supported:

  • newick
  • nexus
  • nexml
  • phyloxml
  • cdao

See the PhyloXML page for more examples of using tree objects.

parse()

Incrementally parse each tree in the given file or handle, returning an iterator of Tree objects (i.e. some subclass of the Bio.Phylo.BaseTree Tree class, depending on the file format).

>>> trees = Phylo.parse('phyloxml_examples.xml', 'phyloxml')
>>> for tree in trees:
...     print tree.name
example from Prof. Joe Felsenstein's book "Inferring Phylogenies"
example from Prof. Joe Felsenstein's book "Inferring Phylogenies"
same example, with support of type "bootstrap"
same example, with species and sequence
same example, with gene duplication information and sequence relationships
similar example, with more detailed sequence data
network, node B is connected to TWO nodes: AB and C
...

If there's only one tree, then the next() method on the resulting generator will return it.

>>> tree = Phylo.parse('phyloxml_examples.xml', 'phyloxml').next()
>>> tree.name
'example from Prof. Joe Felsenstein\'s book "Inferring Phylogenies"'

Note that this doesn't immediately reveal whether there are any remaining trees -- if you want to verify that, use read() instead.

read()

Parse and return exactly one tree from the given file or handle. If the file contains zero or multiple trees, a ValueError is raised. This is useful if you know a file contains just one tree, to load that tree object directly rather than through parse() and next(), and as a safety check to ensure the input file does in fact contain exactly one phylogenetic tree at the top level.

tree = Phylo.read('example.dnd', 'newick')
print tree

If you have your tree data already loaded as a Python string, you can parse it with the help of StringIO (in Python's standard library):

from cStringIO import StringIO
 
treedata = "(A, (B, C), (D, E))"
handle = StringIO(treedata)
tree = Phylo.read(handle, "newick")

In one line:

tree = Phylo.read(StringIO("(A, (B, C), (D, E))"), "newick")

The other I/O functions also can be used with StringIO.

(General tip: if you write to the StringIO object and want to re-read the contents, you'll need to call the seek(0) method to move the handle back to the start of the StringIO data -- the same as an open file handle. See examples of this in the unit tests for Phylo, Tests/test_Phylo.py, in the Biopython source code.)

write()

Write a sequence of Tree objects to the given file or handle. Passing a single Tree object instead of a list or iterable will also work. (See, Phylo is friendly.)

tree1 = Phylo.read('example1.xml', 'phyloxml')
tree2 = Phylo.read('example2.xml', 'phyloxml')
Phylo.write([tree1, tree2], 'example-both.xml', 'phyloxml')

convert()

Given two files (or handles) and two formats, both supported by Bio.Phylo, convert the first file from the first format to the second format, writing the output to the second file.

Phylo.convert('example.nhx', 'newick', 'example2.nex', 'nexus')

Sub-modules

Within the Phylo module are parsers and writers for specific file formats, conforming to the basic top-level API and sometimes adding additional features.

PhyloXMLIO: Support for the phyloXML format. See the PhyloXML page for details.

NeXMLIO: Support for the NeXML format.

NewickIO: A port of the parser in Bio.Nexus.Trees to support the Newick (a.k.a. New Hampshire) format through the Phylo API.

NexusIO: Wrappers around Bio.Nexus to support the Nexus tree format.

CDAOIO: Support for the Comparative Data Analysis Ontology (CDAO). Requires RDFlib.

The Nexus format actually contains several sub-formats for different kinds of data; to represent trees, Nexus provides a block containing some metadata and one or more Newick trees. (Another kind of Nexus block can represent alignments; this is handled in AlignIO.) So to parse a complete Nexus file with all block types handled, use Bio.Nexus directly, and to extract just the trees, use Bio.Phylo.

Tree and Subtree classes

The basic objects are defined in Bio.Phylo.BaseTree.

Format-specific extensions

To support additional information stored in specific file formats, sub-modules within Tree offer additional classes that inherit from BaseTree classes.

Each sub-class of BaseTree.Tree or Node has a class method to promote an object from the basic type to the format-specific one. These sub-class objects can generally be treated as instances of the basic type without any explicit conversion.

PhyloXML: Support for the phyloXML format. See the PhyloXML page for details.

Newick: The Newick module provides minor enhancements to the BaseTree classes, plus several shims for compatibility with the existing Bio.Nexus module. The API for this module is under development and should not be relied on, other than the functionality already provided by BaseTree.

Utilities

Some additional tools are located in the Utils module under Bio.Phylo. These functions are also loaded to the top level of the Phylo module on import for easy access.

Where a third-party package is required, that package is imported when the function itself is called, so these dependencies are not necessary to install or use the rest of the Tree module.

Displaying trees

str(tree) produces a plain-text representation of the entire tree. Strings are automatically truncated to ensure reasonable display.

Use this with the print statement to get a quick overview of your tree:


>>> tree = Phylo.parse('phyloxml_examples.xml', 'phyloxml').next()
>>> print tree
Phylogeny(description='phyloXML allows to use either a "branch_length"
attribute or element to indicate branch lengths.', name='example from
Prof. Joe Felsenstein's book "Inferring Phylogenies"')
	Clade()
		Clade(branch_length=0.06)
			Clade(branch_length=0.102, name='A')
			Clade(branch_length=0.23, name='B')
		Clade(branch_length=0.4, name='C')
...
Rooted phylogram, via Phylo.draw

draw displays a rooted phylogram using matplotlib or pylab. New in Biopython 1.58.

Try this:

tree = Phylo.read('apaf.xml', 'phyloxml')
tree.ladderize()   # Flip branches so deeper clades are displayed at top
Phylo.draw(tree)


draw_graphviz mimics the networkx function of the same name, with some tweaks to improve the display of the graph. If a file name is given, the graph is drawn directly to that file, and options such as image format (default PDF) may be used.

Unrooted tree with colored nodes

Prerequisites: In addition to networkx, you'll need a local installation of Graphviz, matplotlib and either PyGraphviz or pydot.

Drawing a basic dendrogram is simple:

import pylab
tree = Phylo.read('apaf.xml', 'phyloxml')
Phylo.draw_graphviz(tree)
pylab.show()
Phylogram with plain text nodes

Here's the same tree without the circles at each labelled node:

Phylo.draw_graphviz(tree, node_size=0)

See the Phylo cookbook page for more drawing features and options.

draw_ascii prints an ascii-art rooted phylogram to standard output, or another file handle if specified. Only terminal node labels are shown; these are the result of str(clade) (usually clade names). The width of the text field used for drawing is 80 characters by default, adjustable with the column_width keyword argument, and the height in character rows is twice the number of terminals in the tree.

A simple tree with defined branch lengths looks like this:

>>> tree = Phylo.parse('phyloxml_examples.xml', 'phyloxml').next()
>>> Phylo.draw_ascii(tree)
          _____________ A
  _______|
_|       |_______________________________ B
 |
 |_______________________________________________________ C

The same topology without branch lengths is drawn with equal-length branches:

                              ___________________________ A
  ___________________________|
_|                           |___________________________ B
 |
 |___________________________ C

A larger tree (apaf.xml, 31 leaf nodes) drawn with the default column width demonstrates how relatively short branches are handled:

>>> apaf = Phylo.read('apaf.xml', 'phyloxml')
>>> Phylo.draw_ascii(apaf)
                                   _ 22_MOUSE
                                  |
                                 _| Apaf-1_HUMAN
                                | |
                               ,| | 12_CANFA
                               ||
                              _||___ 11_CHICK
                             | |
                             | |___________ 16_XENLA
                      _______|
                     |       |       , 14_FUGRU
                     |       |     __|
                     |       |____|  |__ 15_TETNG
                _____|            |
               |     |            |____ 17_BRARE
               |     |
               |     |    ______ 1_BRAFL
               |     | __|
         ______|     ||  |_________ 18_NEMVE
        |      |      |
        |      |      |____________ 23_STRPU
        |      |
       _|      |          _________ 26_STRPU
      | |      |_________|
      | |                |________ 25_STRPU
      | |
      | |                                    ___ CED4_CAEEL
      | |___________________________________|
  ____|                                     |_ 31_CAEBR
 |    |
 |    |                                ___ 28_DROPS
 |    |          _____________________|
 |    |   ______|                     |____ Dark_DROME
 |    |  |      |
 |    |__|      |_______________________ 29_AEDAE
 |       |
 |       |__________________________ 30_TRICA
 |
 |                                                           _ 34_BRAFL
 |                                 _________________________|
_|                           _____|                         |_ 35_BRAFL
 |                          |     |
 |                        __|     |_______________ 8_BRAFL
 |                       |  |
 |                       |  |        ___________________ 20_NEMVE
 |         ______________|  |_______|
 |        |              |          |__________________________ 21_NEMVE
 |        |              |
 |     ___|              |______________________________ 9_BRAFL
 |    |   |
 |    |   |                _____________ 3_BRAFL
 |    |   |          _____|
 |    |   |_________|     |_________________ 2_BRAFL
 |____|             |
      |             |_______________ 19_NEMVE
      |
      |                                     _____ 37_BRAFL
      |            ________________________|
      |___________|                        |____ 36_BRAFL
                  |
                  |______________________ 33_BRAFL


Exporting to other object representations

Although any phylogenetic tree can reasonably be represented by a directed acyclic graph, the Phylo module does not attempt to provide a generally usable graph library -- only the minimum functionality to represent phylogenetic trees. Instead, it provides functions for exporting tree objects to the standard graph representations, adjacency list (dict) and adjacency matrix, using third-party libraries.

to_networkx returns the given tree as a NetworkX LabeledDiGraph or LabeledGraph object (depending on whether the tree is rooted). You'll probably need to import networkx directly for subsequent operations on the graph object. From this point you can also try using one of networkx's drawing functions to display the tree, and for simple, fully labeled trees it may even work -- but you'll have better results with Phylo's own draw_graphviz function, discussed above.

import networkx, pylab
tree = Phylo.read('example.xml', 'phyloxml')
net = Phylo.to_networkx(tree)
networkx.draw(net)
pylab.show()

Recipes for exporting to other libraries, including ape (via Rpy2) and PyCogent, are available on the Phylo cookbook page.

Upcoming GSoC 2013 features

Many new features for building and processing phylogenetic trees were developed by Yanbo Ye for Google Summer of Code 2013. They are available on the development branch (see SourceCode) but are not yet included in an official Biopython release version. Note that the behavior and API of these features may change before the upcoming official release.


Tree Construction

In addition to wrappers of tree construction programs(PHYLIP programs through EMBOSS wrappers in Bio.Emboss.Applications), now Biopython also provides several tree construction algorithm implementations in pure python in the Bio.Phylo.TreeConstruction module.

All algorithms are designed as worker subclasses of a base class TreeConstructor. All constructors have the same method build_tree that accept a MultipleSeqAlignment object to construct the tree. Currently there are two types of tree constructors: DistanceTreeConstructor and ParsimonyTreeConstructor.

DistanceTreeConstructor

The DistanceTreeConstructor has two algorithms: UPGMA(Unweighted Pair Group Method with Arithmetic Mean) and NJ(Neighbor Joining).

Both algorithms construct trees based on a distance matrix. So before using these algorithms, let me introduce the DistanceCalculator to generate the distance matrix from a MultipleSeqAlignment object. The following code shows a common way to do this:

>>> from Bio.Phylo.TreeConstruction import DistanceCalculator
>>> from Bio import AlignIO
>>> aln = AlignIO.read('Tests/TreeConstruction/msa.phy', 'phylip')
>>> print aln
SingleLetterAlphabet() alignment with 5 rows and 13 columns
AACGTGGCCACAT Alpha
AAGGTCGCCACAC Beta
GAGATTTCCGCCT Delta
GAGATCTCCGCCC Epsilon
CAGTTCGCCACAA Gamma
>>> calculator = DistanceCalculator('identity')
>>> dm = calculator.get_distance(aln)
>>> dm
DistanceMatrix(names=['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon'], matrix=[[0], [0.23076923076923073, 0], [0.3846153846153846, 0.23076923076923073, 0], [0.5384615384615384, 0.5384615384615384, 0.5384615384615384, 0], [0.6153846153846154, 0.3846153846153846, 0.46153846153846156, 0.15384615384615385, 0]])
>>> print dm
Alpha   0
Beta    0.230769230769  0
Gamma   0.384615384615  0.230769230769  0
Delta   0.538461538462  0.538461538462  0.538461538462  0
Epsilon 0.615384615385  0.384615384615  0.461538461538  0.153846153846  0
    Alpha   Beta    Gamma   Delta   Epsilon

As you see, we create a DistanceCalculator object with a string 'identity', which is the name of the model(scoring matrix) to calculate the distance. The 'identity' model is the default one and can be used both for DNA and Protein sequence. To check available models for DNA, protein or all, use the attribute of the calculator dna_models, protein_models, models respectively.

After the calculator is created with the model, simply use the get_distance() method to get the distance matrix of a given alignment object. Then you will get a DistanceMatrix object, a subclass of Matrix(we will talk about this later).

Now, let's get back to the DistanceTreeConstructor. We can pass the DistanceCalculator object and a string parameter('nj' or 'upgma') to initialize it, and then call its build_tree() as mentioned before.

>>> from TreeConstruction import DistanceTreeConstructor
>>> constructor = DistanceTreeConstructor(calculator, 'nj')
>>> tree = constructor.build_tree(aln)
>>> print tree
Tree(rooted=False)
    Clade(branch_length=0, name='Inner3')
        Clade(branch_length=0.182692307692, name='Alpha')
        Clade(branch_length=0.0480769230769, name='Beta')
        Clade(branch_length=0.0480769230769, name='Inner2')
            Clade(branch_length=0.278846153846, name='Inner1')
                Clade(branch_length=0.0512820512821, name='Epsilon')
                Clade(branch_length=0.102564102564, name='Delta')
            Clade(branch_length=0.144230769231, name='Gamma')

While sometimes you might want to use your own DistanceMatrix directly instead of the raw alignment, we provide another direct way to use both algorithms.

>>> from TreeConstruction import DistanceTreeConstructor
>>> constructor = DistanceTreeConstructor()
>>> tree = constructor.nj(dm)
>>> print tree
Tree(rooted=False)
    Clade(branch_length=0, name='Inner3')
        Clade(branch_length=0.182692307692, name='Alpha')
        Clade(branch_length=0.0480769230769, name='Beta')
        Clade(branch_length=0.0480769230769, name='Inner2')
            Clade(branch_length=0.278846153846, name='Inner1')
                Clade(branch_length=0.0512820512821, name='Epsilon')
                Clade(branch_length=0.102564102564, name='Delta')
            Clade(branch_length=0.144230769231, name='Gamma')
>>> tree = constructor.upgma(dm)
>>> print tree
Tree(rooted=True)
    Clade(branch_length=0, name='Inner4')
        Clade(branch_length=0.1875, name='Inner1')
            Clade(branch_length=0.0769230769231, name='Epsilon')
            Clade(branch_length=0.0769230769231, name='Delta')
        Clade(branch_length=0.110576923077, name='Inner3')
            Clade(branch_length=0.0384615384615, name='Inner2')
                Clade(branch_length=0.115384615385, name='Gamma')
                Clade(branch_length=0.115384615385, name='Beta')
            Clade(branch_length=0.153846153846, name='Alpha')

ParsimonyTreeConstructor

Unlike DistanceTreeConstructor, the concrete algorithm of ParsimonyTreeConstructor is delegated to two different worker classes: the ParsimonyScorer to calculate the parsimony score of a target tree by the given alignment, and the TreeSearcher to search the best tree that minimize the parsimony score. A typical usage example can be as follows:

>>> from Bio import AlignIO
>>> from TreeConstruction import *
>>> aln = AlignIO.read(open('Tests/TreeConstruction/msa.phy'), 'phylip')
>>> starting_tree = Phylo.read('Tests/TreeConstruction/nj.tre', 'newick')
>>> scorer = ParsimonyScorer()
>>> searcher = NNITreeSearcher(scorer)
>>> constructor = ParsimonyTreeConstructor(searcher, starting_tree)
>>> pars_tree = constructor.build_tree(aln)
>>> print pars_tree
Tree(weight=1.0, rooted=True)
    Clade(branch_length=0.0)
        Clade(branch_length=0.197335, name='Inner1')
            Clade(branch_length=0.13691, name='Delta')
            Clade(branch_length=0.08531, name='Epsilon')
        Clade(branch_length=0.041935, name='Inner2')
            Clade(branch_length=0.01421, name='Inner3')
                Clade(branch_length=0.17523, name='Gamma')
                Clade(branch_length=0.07477, name='Beta')
            Clade(branch_length=0.29231, name='Alpha')

The ParsimonyScorer is a combination of the Fitch algorithm and Sankoff algorithm. It will work as Fitch algorithm by default if no parameter is provide, and work as Sankoff algorithm if a parsimony scoring matrix(a Matrix object) is given, .

Then the scorer is passed to a TreeSearcher to tell it how to evaluate different trees during searching. Currently, only one searcher NNITreeSearcher, the Nearest Neighbor Interchange(NNI) algorithm, is implemented.

By passing the searcher and a starting tree to the ParsimonyTreeConstructor, we finally get the instance of it. If no starting tree is provided, a simple upgma tree will be created instead, with the 'identity' model. To use this parsimony constructor, just simply call the build_tree method with an alignment.

Consensus Tree

Strict, Majority Rule and Adam Consensus

Same as tree construction algorithms, three consensus tree algorithms(Strict, Majority Rule and Adam Consensus) in pure python are also implemented in the Bio.Phylo.Consensus module. You can directly call strict_consensus, majority_consensus and adam_consensus to use these algorithms with a list of trees as the input.

>>> from Bio import Phylo
>>> from Bio.Phylo.Consensus import *
>>> trees = list(Phylo.parse('Tests/TreeConstruction/trees.tre', 'newick'))
>>> strict_tree = strict_consensus(trees)
>>> majority_tree = majority_consensus(trees, 0.5)
>>> adam_tree = adam_consensus(trees)

Instead of using 50% as the cutoff, the majority_consensus method allows you to set your own one by providing an extra cutoff parameter(0~1, 0 by default). That means it can also work as strict consensus algorithm when the cutoff equals 1. One more thing different to strict and adam consensus tree, the result majority rule consensus tree has branch support value that are automatically assigned during calculation.

Bootstrap Methods

To get the consensus tree, we must construct a list of bootstrap replicate trees. So in the Bio.Phylo.Consensus module, we also provide several useful bootstrap methods to achieve this.

>>> from Bio import Phylo
>>> from Bio.Phylo.Consensus import *
>>> msa = AlignIO.read('Tests/TreeConstruction/msa.phy', 'phylip')
>>> msas = bootstrap(msa, 100)

As you see, the bootstrap method accepts a MultipleSeqAlignment object and generates its bootstrap replicate 100 times. The you can use them to build replicate trees. While, we also provide a convenient method to do this.

>>> calculator = DistanceCalculator('blosum62')
>>> constructor = DistanceTreeConstructor(calculator)
>>> trees = bootstrap_trees(msa, 100, constructor)

This time we pass an extra DistanceTreeConstructor object to a bootstrap_trees method and finally got the replicate trees. Note that both bootstrap and bootstrap_trees are generator functions. You need to use list() function to turn the result into a list of alignment or trees.

Another useful function is bootstrap_consensus. By passing the consensus method as another extra parameter, we can directly get the consensus tree.

>>> consensus_tree = bootstrap_consensus(msa, 100, constructor, majority_consensus)

Branch Support

To get the branch support of a specific tree, we can use the get_support method.

>>> from Bio import Phylo
>>> from Bio.Phylo.Consensus import *
>>> trees = list(Phylo.parse('Tests/TreeConstruction/trees.tre', 'newick'))
>>> target_tree = trees[0]
>>> support_tree = get_support(target_tree, trees)

In the above code, we use the first tree as the target tree that we want to calculate its branch support. The get_support method accepts the target tree and a list of trees, and returns a tree, with all internal clades assigned with branch support values.

Other Useful Classes

There are some other classes in both TreeConstruction and Consensus module that are used in those algorithms. They may not be used commonly, but might be useful to you in some cases.

_Matrix

The _Matrix class in the TreeConstruction module is the super class of _DistanceMatrix. They are both actually constructed by a list of names and a nested list of numbers in lower triangular matrix format. The only difference between them is that the diagonal elements in _DistanceMatrix will all be 0 no matter what values are assigned.

To create a _Matrix object:

>>> from Bio.Phylo.TreeConstruction import _Matrix
>>> names = ['Alpha', 'Beta', 'Gamma', 'Delta']
>>> matrix = [[0], [1, 0], [2, 3, 0], [4, 5, 6, 0]]
>>> m = _Matrix(names, matrix)
>>> m
_Matrix(names=['Alpha', 'Beta', 'Gamma', 'Delta'], matrix=[[0], [1, 0], [2, 3, 0], [4, 5, 6, 0]])
>>> print m
Alpha   0
Beta    1   0
Gamma   2   3   0
Delta   4   5   6   0
    Alpha   Beta    Gamma   Delta

You can use two indices to get or assign an element in the matrix, and the indices are exchangeable.

>>> m[1,2]
3
>>> m[2,1]
3
>>> m['Beta','Gamma']
3
>>> m['Beta','Gamma'] = 4
>>> m['Beta','Gamma']
4

Further more, you can use one index to get or assign a list of elements related to that index:

>>> m[0]
[0, 1, 2, 4]
>>> m['Alpha']
[0, 1, 2, 4]
>>> m['Alpha'] = [0, 7, 8, 9]
>>> m[0]
[0, 7, 8, 9]
>>> m[0,1]
7

Also you can delete or insert a column&row of elements by index:

>>> m
_Matrix(names=['Alpha', 'Beta', 'Gamma', 'Delta'], matrix=[[0], [7, 0], [8, 4, 0], [9, 5, 6, 0]])
>>> del m['Alpha']
>>> m
_Matrix(names=['Beta', 'Gamma', 'Delta'], matrix=[[0], [4, 0], [5, 6, 0]])
>>> m.insert('Alpha', [0, 7, 8, 9] , 0)
>>> m
_Matrix(names=['Alpha', 'Beta', 'Gamma', 'Delta'], matrix=[[0], [7, 0], [8, 4, 0], [9, 5, 6, 0]])

_BitString

_BitString is an assistant class used frequently in the algorithms in the Consensus module. It's a sub-class of str object that only accepts two characters('0' and '1'), with additional functions for binary-like manipulation(&|^~). Common usage is as follows:

>>> from Bio.Phylo.Consensus import _BitString
>>> bitstr1 = _BitString('11111')
>>> bitstr2 = _BitString('11100')
>>> bitstr3 = _BitString('01101')
>>> bitstr1
_BitString('11111')
>>> bitstr2 & bitstr3
_BitString('01100')
>>> bitstr2 | bitstr3
_BitString('11101')
>>> bitstr2 ^ bitstr3
_BitString('10001')
>>> bitstr2.index_one()
[0, 1, 2]
>>> bitstr3.index_one()
[1, 2, 4]
>>> bitstr3.index_zero()
[0, 3]
>>> bitstr1.contains(bitstr2)
True
>>> bitstr2.contains(bitstr3)
False
>>> bitstr2.independent(bitstr3)
False
>>> bitstr2.independent(bitstr4)
True
>>> bitstr1.iscompatible(bitstr2)
True
>>> bitstr2.iscompatible(bitstr3)
False
>>> bitstr2.iscompatible(bitstr4)
True

In consensus and branch support algorithms, it is used to count and store the clades in multiple trees. During counting, the clades will be considered the same if their terminals(in terms of name attribute) are the same.

For example, let's say two trees are provided as below to search their strict consensus tree:

tree1: (((A, B), C),(D, E))
tree2: ((A, (B, C)),(D, E))

For both trees, a _BitString object '11111' will represent their root clade. Each '1' stands for the terminal clade in the list [A, B, C, D, E] (the order might not be the same, it's determined by the get_terminal method of the first tree provided). For the clade ((A, B), C) in tree1 and (A, (B, C)) in tree2, they both can be represented by '11100'. Similarly, '11000' represents clade (A, B) in tree1, '01100' represents clade (B, C) in tree2, and '00011' represents clade (D, E) in both trees.

So, with the _count_clades function in this module, finally we can get the clade counts and their _BitString representation as follows(the root and terminals are omitted):

clade   _BitString   count
ABC     '11100'     2
DE      '00011'     2
AB      '11000'     1
BC      '01100'     1

To get the _BitString representation of a clade, we can use the following code snippet:

# suppose we are provided with a tree list, the first thing to do is 
# to get all the terminal names in the first tree
term_names = [term.name for term in trees[0].get_terminals()]
# for a specific clade in any of the tree, also get its terminal names
clade_term_names = [term.name for term in clade.get_terminals()]
# then create a boolean list 
boolvals = [name in clade_term_names for name in term_names]
# create the string version and pass it to _BitString  
bitstr = _BitString(''.join(map(str, map(int, boolvals))))

To convert back:

# get all the terminal clades of the first tree
terms = [term for term in trees[0].get_terminals()]
# get the index of terminal clades in bitstr
index_list = bitstr.index_one()
# get all terminal clades by index
clade_terms = [terms[i] for i in index_list]
# create a new calde and append all the terminal clades
new_clade = BaseTree.Clade()
new_clade.clades.extend(clade_terms)

This is how the _BitString is used in the consensus and branch support algorithms. And I think it can be used in many other conditions.

For example, we can even use it to check whether the structures of two trees are the same.

# store and return all _BitStrings
def _bitstrs(tree):
    bitstrs = set()
    term_names = [term.name for term in tree.get_terminals()]
    term_names.sort()
    for clade in tree.get_nonterminals():
        clade_term_names = [term.name for term in clade.get_terminals()]
        boolvals = [name in clade_term_names for name in term_names]  
        bitstr = _BitString(''.join(map(str, map(int, boolvals))))
        bitstrs.add(bitstr)
    return bitstrs
    
def compare(tree1, tree2):
    term_names1 =  [term.name for term in tree1.get_terminals()]
    term_names2 =  [term.name for term in tree2.get_terminals()]
    # false if terminals are not the same 
    if set(term_names1) != set(term_names2):
        return False
    # true if _BitStrings are the same
    if _bitstrs(tree1) == _bitstrs(tree2):
        return True
    else:
        return False

To use it:

>>> tree1 = Phylo.read('Tests/TreeConstruction/upgma.tre', 'newick')
>>> tree2 = Phylo.read('Tests/TreeConstruction/nj.tre', 'newick')
>>> tree3 = Phylo.read('Tests/TreeConstruction/pars1.tre', 'newick')
>>> compare(tree1, tree2)
False
>>> compare(tree1, tree3)
True
>>> compare(tree2, tree3)
False

External Applications

  • Bio.Phylo.PhymlCommandline provides a wrapper for PhyML following the usual Bio.Applications API.
  • Bio.Phylo.PAML provides wrappers, parsers and utilities for working with the PAML suite of programs.

Example pipeline

See the Biopython Tutorial sections on sequence alignment and BLAST for explanations of the first few steps shown here.

1. Search for homologs of a protein sequence using BLAST.

from Bio.Blast import NBCIStandalone, NCBIXML
 
query_fname = 'AAG35789.fasta'
result_handle, error_handle = NCBIStandalone.blastall('/usr/bin/blastall', 'blastp',
                                                      '/db/fasta/swissprot', query_fname)
blast_record = NCBIXML.read(result_handle)  # This takes some time to run

2. Extract the best hits from the BLAST result.

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
 
def get_seqrecs(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.expect < threshold:
                yield SeqRecord(Seq(hsp.sbjct), id=aln.accession)
                break
 
best_seqs = get_seqrecs(blast_record.alignments, 1e-90)
SeqIO.write(best_seqs, 'egfr-family.fasta', 'fasta')

To help with annotating to your tree later, pick a lookup key here (e.g. accession number) and build a dictionary mapping that key to any additional data you can glean from the BLAST output, such as taxonomy and GI numbers. In this example, we're only keeping the original sequence and accession number.

3. Re-align the sequences using Muscle. The program creates an alignment file in Clustal format, "egfr-family.aln".

from Bio.Align.Applications import MuscleCommandline
 
cmdline = MuscleCommandline(input="egfr-family.fasta", out="efgr-family.aln", clw=True)
cmdline()

4. Infer a gene tree using PhyML. First, convert the alignment from step 3 to "relaxed Phylip" format (new in Biopython 1.58):

from Bio import AlignIO
 
AlignIO.convert("egfr-family.aln", "clustal", "egfr-family.phy", "phylip-relaxed")

Feed the alignment to PhyML using the command line wrapper:

from Bio.Phylo.Applications import PhymlCommandline
 
cmdline = PhymlCommandline(input='egfr-family.phy', datatype='aa', model='WAG', alpha='e', bootstrap=100)
out_log, err_log = cmdline()

5. Load the gene tree with Phylo, and take a quick look at the topology. (PhyML writes the tree to a file named after the input file plus "_phyml_tree.txt".)

from Bio import Phylo
 
egfr_tree = Phylo.read("egfr-family.phy_phyml_tree.txt", "newick")
Phylo.draw_ascii(egfr_tree)

6. Add accession numbers and sequences to the tree -- now we're using PhyloXML's extra features.

from Bio.Phylo import PhyloXML
 
# Promote the basic tree to PhyloXML
egfr_phy = egfr_tree.as_phyloxml()
 
# Make a lookup table for sequences
lookup = dict((rec.id, str(rec.seq)) for rec in best_seqs)
 
for clade in egfr_phy.get_terminals():
    key = clade.name
    accession = PhyloXML.Accession(key, 'NCBI')
    mol_seq = PhyloXML.MolSeq(lookup[key], is_aligned=True)
    sequence = PhyloXML.Sequence(type='aa', accession=accession, mol_seq=mol_seq)
    clade.sequences.append(sequence)
 
# Save the annotated phyloXML file
Phylo.write(egfr_phy, 'egfr-family-annotated.xml', 'phyloxml')
Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox