PhyloXML

From Biopython
Revision as of 02:32, 24 September 2009 by EricTalevich (Talk | contribs)
Jump to: navigation, search

This module handles the parsing, generation and manipulation of files in the phyloXML format.

This code is not yet part of Biopython, and therefore the documentation has not been integrated into the Biopython Tutorial yet either.

Contents

Availability

The source code for this module currently lives on Eric's phyloxml branch in GitHub. If you're interested in testing this code before it's been merged into Biopython, follow the instructions there to create your own fork, or just clone the phyloxml branch onto your machine.

Requirements:

  • Biopython 1.51 or newer (older may work, but hasn't been tested)
  • Python 2.4 or newer
  • ElementTree module

The XML parser used in this module is ElementTree, new to the Python standard library in Python 2.5. To use this module in Python 2.4, you'll need to install a separate package that provides the ElementTree interface. Two exist:

The PhyloXMLIO sub-module attempts to import each of these compatible ElementTree implementations until it succeeds. The given XML file handle is then parsed incrementally to instantiate an object hierarchy containing the relevant phylogenetic information.

To draw trees (optional), you'll also need these packages:

The I/O and tree-manipulation functionality will work without them; they're imported on demand when the functions to_networkx() and draw_graphviz() are called.

About the format

A complete phyloXML document has a root node with the tag "phyloxml". Directly under the root is a sequence of "phylogeny" elements (phylogenetic trees), possibly followed by other arbitrary data not included in the phyloXML spec. The main structural element of these phylogenetic trees is the Clade: a tree has a clade attribute, along with other attributes, and each clade contains a series of clades (and other attributes), recursively.

The child nodes and attributes of each XML node are mapped onto classes in the PhyloXML.Tree module, keeping the names the same where possible; the XML document structure is closely mirrored in the Phyloxml objects produced by Bio.TreeIO.PhyloXMLIO.read().

For example, this XML:

<phyloxml>
   <phylogeny rooted="true">
      <name>An example</name>
      <clade>
         <clade branch_length="0.06">
            <clade branch_length="0.102">
               <name>A</name>
            </clade>
            <clade branch_length="0.23">
               <name>B</name>
            </clade>
         </clade>
         <clade branch_length="0.4">
            <name>C</name>
         </clade>
      </clade>
   </phylogeny>
</phyloxml>

produces an object hierarchy like this:

Phyloxml(phylogenies=[
            Phylogeny(name='An example',
                      rooted='True',
                      clade=Clade(clades=[
                                    Clade(branch_length='0.06',
                                          clades=[
                                            Clade(branch_length='0.102', name='A'),
                                            Clade(branch_length='0.23', name='B'),
                                            ]),
                                    Clade(branch_length='0.4', name='C'),
                                    ]))
    ])

which represents a phylogeny like this:


         .102
       _______A
   .06 |
 ______|
 |     | .23
 |     |______B
_|
 |
 |    .4
 |____________C

The tree objects are derived from base classes in Bio.Tree; see that page for more about this object representation.

I/O functions

To start working with phyloXML files, use the TreeIO package with 'phyloxml' as the format argument:

>>> from Bio import TreeIO
>>> tree = TreeIO.read('some-trees.xml', 'phyloxml')
# ValueError: There are multiple trees in this file; use parse() instead.
>>> trees = TreeIO.parse('some-trees.xml', 'phyloxml')
>>> TreeIO.write(trees.next(), 'first-tree.xml', 'phyloxml')
1
>>> TreeIO.write(trees, 'rest-trees.xml', 'phyloxml')
12

These functions work with Phylogeny objects (derived from BaseTree.Tree) from the Bio.Tree.PhyloXML module. This standard API is enough for most use cases.

PhyloXMLIO

Within the Bio.TreeIO module, the I/O functions for the phyloXML format are implemented in the PhyloXMLIO sub-module. For access to some additional functionality beyond the basic TreeIO API, or to skip specifying the 'phyloxml' format argument each time, this can be imported directly:

from Bio.TreeIO import PhyloXMLIO

The read() function returns a single Bio.Tree.PhyloXML.Phyloxml object representing the entire file's data. The phylogenetic trees are in the "phylogenies" attribute, and any other arbitrary data is stored in "other".

>>> phx = PhyloXMLIO.read('phyloxml_examples.xml')
>>> print phx
Phyloxml
>>> len(phx.phylogenies)
13
>>> len(phx.other)
1
>>> print phx.other
[Other(tag='alignment', namespace='http://example.org/align')]
>>> print phx.other[0].children
[Other(tag='seq', namespace='http://www.phyloxml.org', value='acgtcgcggcccgtggaagtcctctcct'),
Other(tag='seq', namespace='http://www.phyloxml.org', value='aggtcgcggcctgtggaagtcctctcct'),
Other(tag='seq', namespace='http://www.phyloxml.org', value='taaatcgc--cccgtgg-agtccc-cct')]

If you aren't interested in the "other" data, you can use parse() to iteratively construct just the phylogenetic trees contained in the file -- this is exactly the same as calling TreeIO.parse() with the 'phyloxml' format argument.

PhyloXMLIO.write() is similar to TreeIO.write(), but also accepts a Phyloxml object (the result of read() or to_phyloxml()) to serialize. Optionally, an encoding other than UTF-8 can be specified.

>>> phx = PhyloXMLIO.read('phyloxml_examples.xml')
>>> print phx.other
[Other(tag='alignment', namespace='http://example.org/align')]
>>> phx.other = []
>>> PhyloXMLIO.write(phx, 'ex_no_other.xml')
13
>>> phx_no = PhyloXMLIO.read('ex_no_other.xml')
>>> phx_no.other
[]

PhyloXMLIO also contains a utility called dump_tags() for printing all of the XML tags as they are encountered in a phyloXML file. This can be helpful for debugging, or used along with grep or sort -u on the command line to obtain a list of the tags a phyloXML file contains.

>>> PhyloXMLIO.dump_tags('phyloxml_examples.xml')
{http://www.phyloxml.org}phyloxml
{http://www.phyloxml.org}phylogeny
{http://www.phyloxml.org}name
{http://www.phyloxml.org}description
{http://www.phyloxml.org}clade
...

Using PhyloXML objects

Standard Python syntactic sugar is supported wherever it's reasonable.

  • str() makes a string of the object's class name and an identifier, suitable for labeling a node in generated graph
  • repr() makes a string resembling the object constructor call, such that eval(repr(obj)) will return obj for simpler PhyloXML objects, and at least partially rebuild more complex objects.
  • iter() is supported by Phyloxml and Clade objects, iterating over the contained phylogenies and sub-clades, respectively
  • len() is supported by the same objects that support iteration, with expected results

Clade objects also support slicing and multiple indexing:

tree = TreeIO.parse('example.xml', 'phyloxml').next()
assert tree.clade[0] == tree.clade.clades[0]
assert tree.clade[0,1] == tree.clade.clades[0].clades[1]

Since valid Phylogeny objects always have a single clade attribute, this style of indexing is a handy way to reach specific nodes buried deep in the tree if you happen to know exactly where they are.

A couple of methods allow converting a selection to a new PhyloXML object: Phylogeny.to_phyloxml() and Clade.to_phylogeny(). A few use cases:

  • Parse a phyloXML containing multiple phylogenetic trees. Check each tree sequentially, and upon finding a tree with the desired characteristic, isolate it as a new PhyloXML object.
for tree in TreeIO.parse('example.xml', 'phyloxml'):
    if tree.name == 'monitor lizards':
        return tree.to_phyloxml()
  • Extract a specific sub-clade and make it a separate phylogeny (and probably a new phyloXML file).
tree = TreeIO.parse('example.xml', 'phyloxml').next()
best = None
for clade in tree.clade:
    if (clade.confidences[0].type == 'bootstrap'
            and (best is None
                or clade.confidences[0].value > best.confidences[0].value)):
        best = clade
phyloxml = best.to_phylogeny(rooted=True).to_phyloxml()
TreeIO.write(phyloxml, 'example_best.xml', 'phyloxml')

Integrating with the rest of Biopython

The classes used by this module inherit from the Tree module's generalized BaseTree classes, and therefore have access to the methods defined on those base classes. Since the phyloXML specification is very detailed, these subclasses are kept in a separate module, Bio.Tree.PhyloXML, and offer additional methods for converting between phyloXML and standard Biopython types.

The PhyloXML.Sequence class contains methods for converting to and from Biopython SeqRecord objects -- to_seqrecord() and from_seqrecord(). This includes the molecular sequence (mol_seq) as a Seq object, and the protein domain architecture as list of SeqFeature objects. Likewise, PhyloXML.ProteinDomain objects have a to_seqfeature() method.

Example pipeline

See the Biopython Tutorial sections on sequence alignment and BLAST for explanations of the first few steps shown here.

1. Search for homologs of a protein sequence using BLAST.

from Bio.Blast import NBCIStandalone, NCBIXML
 
query_fname = 'egfr-kinase.fasta'
result_handle, error_handle = NCBIStandalone.blastall('/usr/bin/blastall', 'blastp',
                                                      '/db/fasta/nr', query_fname)
blast_record = NCBIXML.read(result_handle)  # This takes some time to run

2. Extract the best hits from the BLAST result.

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
 
def get_seqrecs(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.expect < threshold:
                yield SeqRecord(Seq(hsp.sbjct), id=aln.accession)
                break
 
best_seqs = get_seqrecs(blast_record.alignments, 1e-50)
SeqIO.write(best_seqs, open('tyr-kinases.fasta', 'w+'), 'fasta')

To help with annotating to your tree later, pick a lookup key here (e.g. accession number) and build a dictionary mapping that key to any additional data you can glean from the BLAST output, such as taxonomy and GI numbers. In this example, we're only keeping the original sequence and accession number.

3. Re-align the sequences using MUSCLE, and format the alignment for use with stand-alone Phylip.

import sys, subprocess
from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
 
cline = MuscleCommandline(input="tyr-kinases.fasta")
child = subprocess.call(str(cline),
                        stdout=subprocess.PIPE,
                        shell=(sys.platform!="win32"))
align = AlignIO.read(child.stdout, "fasta")
AlignIO.write([align], open('tyr-kinases.phy', 'w+'), 'phylip')

(Note: Phylip alignments have only 9-letter sequence identifiers, which must be unique. For didactic purposes, let's say there are no name collisions and the accession numbers we used as IDs are all less than 10 characters.)

Now run phylip proml with tyr-kinases.phy as the input file, and convert the resulting tree file outtree to phyloXML format using one of the converters listed at the bottom of this page. Call the result tyr-kinases.xml.

4. Add accession numbers and sequences to the tree -- now we're using Tree and TreeIO.

from Bio import TreeIO
from Bio.Tree import PhyloXML
 
# Make a lookup table for sequences
lookup = dict((rec.id, str(rec.seq)) for rec in best_seqs)
 
tree = TreeIO.read('tyr-kinases.xml', 'phyloxml')
for node in tree.find(terminal=True):
    key = node.name
    accession = PhyloXML.Accession(key, 'NCBI')
    mol_seq = PhyloXML.MolSeq(lookup[key], is_aligned=True)
    sequence = PhyloXML.Sequence(type='aa', accession=accession, mol_seq=mol_seq)
    node.sequences.append(sequence)
 
# Save the annotated phyloXML file
TreeIO.write(tree, 'tyr-kinases-pretty.xml', 'phyloxml')

Performance

This parser is meant to be able to handle large files, meaning several thousand external nodes. (Benchmarks of relevant XML parsers for Python are here.) It has been tested with files of this size; for example, the complete NCBI taxonomy parses in about 100 seconds and consumes about 1.3 GB of memory. Provided enough memory is available on the system, the writer can also rebuild phyloXML files of this size.

The read() and parse() functions process a complete file in about the same amount of CPU time. Most of the underlying code is the same, and the majority of the time is spent building Clade objects (the most common node type). For small files (smaller than ncbi_taxonomy_mollusca.xml), the write() function serializes the complete object back to an equivalent file slightly slower than the corresponding read() call; for very large files, write() finishes faster than read().

Here are some times on a 2.00GHz Intel Xeon E5405 processor (only 1 CPU core used) with 7.7GB memory, running the standard Python 2.6.2 on Ubuntu 9.04, choosing the best of 3 runs for each function:

File Ext. Nodes Size (uncompressed) Read (s) Parse (s) Write (s)
apaf.xml 38 KB 0.01 0.01 0.02
bcl_2.xml 105 KB 0.02 0.02 0.04
ncbi_taxonomy_mollusca.xml 5632 1.5 MB 0.51 0.49 0.80
tol_life_on_earth_1.xml 57124 46 MB 10.28 10.67 10.36
ncbi_taxonomy_metazoa.xml 73907 33 MB 15.76 16.15 10.69
ncbi_taxonomy.xml 263691 31 MB (unindented) 109.70 109.14 32.39

On 32-bit architectures, psyco might improve these times significantly, at the risk of increasing memory usage. (I haven't tested it.) For comparison, the Java-based parser used in Forester and ATV (see below) reads the same files about 3-5 times as quickly, or up to 15x for the largest file.

For Python 2.4, performance depends on which ElementTree implementation is used. Using the original pure-Python elementtree, reading/parsing takes about twice as much time for all file sizes, but writing is only significantly slower for very large files.

Summer of Code project

This module is being developed by Eric Talevich as a project for Google Summer of Code 2009, with NESCent as the mentoring organization and Brad Chapman as the primary mentor.

Main SoC project page: PhyloSoC:Biopython support for parsing and writing phyloXML

Related software

Christian Zmasek, one of the authors of the phyloXML specification, has released some software that uses this format:

  • Forester -- a collection of Java and Ruby libraries for working with phylogenetic data
  • Archaopteryx -- Java application for the visualization of annotated phylogenetic trees (also available in applet form)

Another list is maintained at phylosoft.org.

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox