(Added links to the cookbook and PhyloXML pages in the intro)
(→read(): example of parsing a string via StringIO)
|Line 76:||Line 76:|
tree = Phylo.read('example.
tree = Phylo.read('example.', '')
Revision as of 21:45, 10 October 2010
This module provides classes, functions and I/O support for working with phylogenetic trees.
This code is included in Biopython 1.54, and also described in the Phylogenetics chapter of the Biopython Tutorial and the Bio.Phylo API pages generated from the source code. The Phylo cookbook page has more examples of how to use this module, and the PhyloXML page describes how to attach graphical cues and additional information to a tree.
The source code for this module is included in Biopython 1.54. If you're interested in testing newer additions to this code before the next official release, see SourceCode for instructions on getting a copy of the development branch.
- Python 2.4 or newer
- ElementTree module
To draw trees (optional), you'll also need these packages:
- NetworkX 1.0rc1 (or 0.36 for snapshot at the end of GSoC 2009)
- PyGraphviz 0.99.1 or pydot
The I/O and tree-manipulation functionality will work without them; they're imported on demand when the functions to_networkx() and draw_graphviz() are called.
The XML parser used in the PhyloXMLIO sub-module is ElementTree, added to the Python standard library in Python 2.5. To use this module in Python 2.4, you'll need to install a separate package that provides the ElementTree interface. Two exist:
PhyloXMLIO attempts to import each of these compatible ElementTree implementations until it succeeds. The given XML file handle is then parsed incrementally to instantiate an object hierarchy containing the relevant phylogenetic information.
This module has also been successfully tested on Jython 2.5.1, minus the Graphviz- and NetworkX-based functions. However, parsing phyloXML files is noticeably slower because Jython uses a different version of the underlying XML parsing library.
Wrappers for supported file formats are available from the top level of the module:
from Bio import Phylo
Like SeqIO and AlignIO, this module provides four I/O functions: parse(), read(), write() and convert(). Each function accepts either a file name or an open file handle, so data can be also loaded from compressed files, StringIO objects, and so on. If the file name is passed as a string, the file is automatically closed when the function finishes; otherwise, you're responsible for closing the handle yourself.
The second argument to each function is the target format. Currently, the following formats are supported:
See the PhyloXML page for more examples of using tree objects.
Incrementally parse each tree in the given file or handle, returning an iterator of Tree objects (i.e. some subclass of the Bio.Phylo.BaseTree Tree class, depending on the file format).
>>> trees = Phylo.parse('phyloxml_examples.xml', 'phyloxml') >>> for tree in trees: ... print tree.name
example from Prof. Joe Felsenstein's book "Inferring Phylogenies" example from Prof. Joe Felsenstein's book "Inferring Phylogenies" same example, with support of type "bootstrap" same example, with species and sequence same example, with gene duplication information and sequence relationships similar example, with more detailed sequence data network, node B is connected to TWO nodes: AB and C ...
If there's only one tree, then the next() method on the resulting generator will return it.
>>> tree = Phylo.parse('phyloxml_examples.xml', 'phyloxml').next() >>> tree.name 'example from Prof. Joe Felsenstein\'s book "Inferring Phylogenies"'
Note that this doesn't immediately reveal whether there are any remaining trees -- if you want to verify that, use read() instead.
Parse and return exactly one tree from the given file or handle. If the file contains zero or multiple trees, a ValueError is raised. This is useful if you know a file contains just one tree, to load that tree object directly rather than through parse() and next(), and as a safety check to ensure the input file does in fact contain exactly one phylogenetic tree at the top level.
tree = Phylo.read('example.dnd', 'newick') print tree
If you have your tree data already loaded as a Python string, you can parse it with the help of StringIO (in Python's standard library):
from cStringIO import StringIO treedata = "(A, (B, C), (D, E))" handle = StringIO(treedata) tree = Phylo.read(handle, "newick")
In one line:
tree = Phylo.read(StringIO("(A, (B, C), (D, E))"), "newick")
The other I/O functions also can be used with StringIO.
(General tip: if you write to the StringIO object and want to re-read the contents, you'll need to call the
seek(0) method to move the handle back to the start of the StringIO data -- the same as an open file handle. See examples of this in the unit tests for Phylo,
Tests/test_Phylo.py, in the Biopython source code.)
Write a sequence of Tree objects to the given file or handle. Passing a single Tree object instead of a list or iterable will also work. (See, Phylo is friendly.)
tree1 = Phylo.read('example1.xml', 'phyloxml') tree2 = Phylo.read('example2.xml', 'phyloxml') Phylo.write([tree1, tree2], 'example-both.xml', 'phyloxml')
Given two files (or handles) and two formats, both supported by Bio.Phylo, convert the first file from the first format to the second format, writing the output to the second file.
Phylo.convert('example.nhx', 'newick', 'example2.nex', 'nexus')
Within the Phylo module are parsers and writers for specific file formats, conforming to the basic top-level API and sometimes adding additional features.
NewickIO: A port of the parser in Bio.Nexus.Trees to support the Newick (a.k.a. New Hampshire) format through the Phylo API.
NexusIO: Wrappers around Bio.Nexus to support the Nexus tree format.
The Nexus format actually contains several sub-formats for different kinds of data; to represent trees, Nexus provides a block containing some metadata and one or more Newick trees. (Another kind of Nexus block can represent alignments; this is handled in AlignIO.) So to parse a complete Nexus file with all block types handled, use Bio.Nexus directly, and to extract just the trees, use Bio.Phylo. Integration between Bio.Nexus and Bio.Phylo will be improved in the future.
Tree and Subtree classes
The basic objects are defined in Bio.Phylo.BaseTree.
To support additional information stored in specific file formats, sub-modules within Tree offer additional classes that inherit from BaseTree classes.
Each sub-class of BaseTree.Tree or Node has a class method to promote an object from the basic type to the format-specific one. These sub-class objects can generally be treated as instances of the basic type without any explicit conversion.
PhyloXML: Support for the phyloXML format. See the PhyloXML page for details.
Newick: The Newick module provides minor enhancements to the BaseTree classes, plus several shims for compatibility with the existing Bio.Nexus module. The API for this module is under development and should not be relied on, other than the functionality already provided by BaseTree.
Some additional tools are located in the Utils module under Bio.Phylo. These functions are also loaded to the top level of the Phylo module on import for easy access.
Where a third-party package is required, that package is imported when the function itself is called, so these dependencies are not necessary to install or use the rest of the Tree module.
Exporting to other object representations
Although any phylogenetic tree can reasonably be represented by a directed acyclic graph, the Phylo module does not attempt to provide a generally usable graph library -- only the minimum functionality to represent phylogenetic trees. Instead, it provides functions for exporting tree objects to the standard graph representations, adjacency list (dict) and adjacency matrix, using third-party libraries.
to_networkx returns the given tree as a NetworkX LabeledDiGraph or LabeledGraph object (depending on whether the tree is rooted). You'll probably need to import networkx directly for subsequent operations on the graph object. From this point you can also try using one of networkx's drawing functions to display the tree, and for simple, fully labeled trees it may even work -- but you'll have better results with Phylo's own draw_graphviz function, discussed below.
import networkx, pylab tree = Phylo.read('example.xml', 'phyloxml') net = Phylo.to_networkx(tree) networkx.draw(net) pylab.show()
str(tree) produces a plain-text representation of the entire tree. Strings are automatically truncated to ensure reasonable display.
Use this with the print statement to get a quick overview of your tree:
>>> tree = Phylo.parse('phyloxml_examples.xml', 'phyloxml').next() >>> print tree Phylogeny(description='phyloXML allows to use either a "branch_length" attribute or element to indicate branch lengths.', name='example from Prof. Joe Felsenstein's book "Inferring Phylogenies"') Clade() Clade(branch_length=0.06) Clade(branch_length=0.102, name='A') Clade(branch_length=0.23, name='B') Clade(branch_length=0.4, name='C') ...
draw_graphviz mimics the networkx function of the same name, with some tweaks to improve the display of the graph. If a file name is given, the graph is drawn directly to that file, and options such as image format (default PDF) may be used.
Drawing a basic dendrogram is simple:
import pylab tree = Phylo.read('apaf.xml', 'phyloxml') Phylo.draw_graphviz(tree) pylab.show()
Here's the same tree without the circles at each labelled node:
See the Phylo cookbook page for more drawing features and options.
draw_ascii prints an ascii-art rooted phylogram to standard output, or another file handle if specified. Only terminal node labels are shown; these are the result of str(clade) (usually clade names). The width of the text field used for drawing is 80 characters by default, adjustable with the column_width keyword argument, and the height in character rows is twice the number of terminals in the tree.
A simple tree with defined branch lengths looks like this:
>>> tree = Phylo.parse('phyloxml_examples.xml', 'phyloxml').next() >>> Phylo.draw_ascii(tree) _____________ A _______| _| |_______________________________ B | |_______________________________________________________ C
The same topology without branch lengths is drawn with equal-length branches:
___________________________ A ___________________________| _| |___________________________ B | |___________________________ C
A larger tree (apaf.xml, 31 leaf nodes) drawn with the default column width demonstrates how relatively short branches are handled:
>>> apaf = Phylo.read('apaf.xml', 'phyloxml') >>> Phylo.draw_ascii(apaf) _ 22_MOUSE | _| Apaf-1_HUMAN | | ,| | 12_CANFA || _||___ 11_CHICK | | | |___________ 16_XENLA _______| | | , 14_FUGRU | | __| | |____| |__ 15_TETNG _____| | | | |____ 17_BRARE | | | | ______ 1_BRAFL | | __| ______| || |_________ 18_NEMVE | | | | | |____________ 23_STRPU | | _| | _________ 26_STRPU | | |_________| | | |________ 25_STRPU | | | | ___ CED4_CAEEL | |___________________________________| ____| |_ 31_CAEBR | | | | ___ 28_DROPS | | _____________________| | | ______| |____ Dark_DROME | | | | | |__| |_______________________ 29_AEDAE | | | |__________________________ 30_TRICA | | _ 34_BRAFL | _________________________| _| _____| |_ 35_BRAFL | | | | __| |_______________ 8_BRAFL | | | | | | ___________________ 20_NEMVE | ______________| |_______| | | | |__________________________ 21_NEMVE | | | | ___| |______________________________ 9_BRAFL | | | | | | _____________ 3_BRAFL | | | _____| | | |_________| |_________________ 2_BRAFL |____| | | |_______________ 19_NEMVE | | _____ 37_BRAFL | ________________________| |___________| |____ 36_BRAFL | |______________________ 33_BRAFL
See the Biopython Tutorial sections on sequence alignment and BLAST for explanations of the first few steps shown here.
1. Search for homologs of a protein sequence using BLAST.
from Bio.Blast import NBCIStandalone, NCBIXML query_fname = 'human-egfr.fasta' result_handle, error_handle = NCBIStandalone.blastall('/usr/bin/blastall', 'blastp', '/db/fasta/nr', query_fname) blast_record = NCBIXML.read(result_handle) # This takes some time to run
2. Extract the best hits from the BLAST result.
from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord def get_seqrecs(alignments, threshold): for aln in alignments: for hsp in aln.hsps: if hsp.expect < threshold: yield SeqRecord(Seq(hsp.sbjct), id=aln.accession) break best_seqs = get_seqrecs(blast_record.alignments, 1e-50) SeqIO.write(best_seqs, open('egfr-family.fasta', 'w+'), 'fasta')
To help with annotating to your tree later, pick a lookup key here (e.g. accession number) and build a dictionary mapping that key to any additional data you can glean from the BLAST output, such as taxonomy and GI numbers. In this example, we're only keeping the original sequence and accession number.
3. Re-align the sequences using ClustalW. The program creates an alignment file in Clustal format, "egfr-family.aln", and a gene tree in Newick format, "egfr-family.dnd".
import sys, subprocess from Bio import AlignIO from Bio.Align.Applications import ClustalwCommandline cline = ClustalwCommandline("clustalw", infile="egfr-family.fasta") child = subprocess.call(str(cline), shell=(sys.platform!="win32"))
4. Load the gene tree with Phylo, and take a quick look at the topology.
from Bio import Phylo egfr_tree = Phylo.read("egfr-family.dnd", "newick") Phylo.draw_ascii(egfr_tree)
5. Add accession numbers and sequences to the tree -- now we're using PhyloXML's extra features.
from Bio.Phylo import PhyloXML # Make a lookup table for sequences lookup = dict((rec.id, str(rec.seq)) for rec in best_seqs) for clade in egfr_tree.get_terminals(): key = clade.name accession = PhyloXML.Accession(key, 'NCBI') mol_seq = PhyloXML.MolSeq(lookup[key], is_aligned=True) sequence = PhyloXML.Sequence(type='aa', accession=accession, mol_seq=mol_seq) clade.sequences.append(sequence) # Save the annotated phyloXML file Phylo.write(tree, 'egfr-family-annotated.xml', 'phyloxml')