PhyloXML

Revision as of 03:52, 2 July 2009

This module handles the parsing and generation of files in the phyloXML format.

This code is not yet part of Biopython, so its documentation has not yet been integrated into the Biopython Tutorial.

Installation

The source code for this module currently lives on the phyloxml branch on GitHub (http://github.com/etal/biopython/tree/phyloxml). If you're interested in testing this code before it has been merged into Biopython, follow the instructions there to create your own fork, or just clone the phyloxml branch onto your machine.

Requirements

The XML parser used in this module is ElementTree, new to the Python standard library in Python 2.5. To use this module in Python 2.4, you'll need to install a separate package that provides the ElementTree interface. Two exist:

The module attempts to import each of these compatible ElementTree implementations until it succeeds. The given XML file handle is then parsed incrementally to instantiate an object hierarchy containing the relevant phylogenetic information. Existing Biopython classes will be reused for these objects wherever appropriate.
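The import fallback described above can be sketched roughly as follows. This is only an illustration: the candidate names and their order are assumptions rather than the list the module actually uses, and load_elementtree is a hypothetical helper.

```python
import importlib

# Candidate modules offering the ElementTree interface.
# ASSUMPTION: these names and their order are illustrative only.
CANDIDATES = (
    "xml.etree.cElementTree",   # C accelerator shipped with older Pythons
    "xml.etree.ElementTree",    # standard library since Python 2.5
    "lxml.etree",               # third-party package
    "elementtree.ElementTree",  # stand-alone package for Python 2.4
)


def load_elementtree(candidates=CANDIDATES):
    """Return the first importable module from `candidates` (hypothetical helper)."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed (or removed in this Python version): try the next one.
            continue
    raise ImportError("No ElementTree implementation found; tried: "
                      + ", ".join(candidates))


ElementTree = load_elementtree()
```

Whichever implementation is found, the rest of the code only relies on the shared ElementTree interface, for example ElementTree.iterparse.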

This parser is designed to handle large files containing several thousand external nodes. (Benchmarks of relevant XML parsers for Python are available at http://effbot.org/zone/celementtree.htm#benchmarks.) It has been tested with files of this size; for example, the complete NCBI taxonomy parses in about 100 seconds and consumes about 1.6 GB of memory. Provided enough memory is available on the system, the writer can also rebuild phyloXML files of this size.
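To give a feel for incremental parsing, here is a minimal sketch that counts external nodes (clades without child clades) using the standard library's iterparse. It is not this module's parser; count_external_nodes and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# phyloXML elements live in this XML namespace.
NS = "{http://www.phyloxml.org}"


def count_external_nodes(handle):
    """Count clades with no child clades, reading the file incrementally."""
    open_clades = []  # one child-clade counter per clade that is still open
    external = 0
    for event, elem in ET.iterparse(handle, events=("start", "end")):
        if elem.tag != NS + "clade":
            continue
        if event == "start":
            if open_clades:
                open_clades[-1] += 1  # register this clade with its parent
            open_clades.append(0)
        else:
            if open_clades.pop() == 0:
                external += 1
            elem.clear()  # drop the finished clade's text and children
    return external


# Hypothetical sample: a root clade with one leaf and one two-leaf subclade.
sample = io.BytesIO(
    b'<phyloxml xmlns="http://www.phyloxml.org"><phylogeny rooted="true">'
    b'<clade><clade><name>A</name></clade>'
    b'<clade><clade><name>B</name></clade><clade><name>C</name></clade></clade>'
    b'</clade></phylogeny></phyloxml>'
)
n_external = count_external_nodes(sample)
```

Children are counted on "start" events because a clade that has already been cleared no longer exposes its children to its parent.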

Usage

The most useful parts of this package are available from the top level of the module:

from Bio import PhyloXML

The main structural element of these phylogenetic trees is the Clade.

Parser

This module provides two functions: read() returns a single object representing the entire file's data, while parse() iteratively constructs and yields the phylogenetic trees contained in the file (there may be more than one).

Both functions accept either a file name or an open file handle, so phyloXML data can also be loaded from compressed files, StringIO objects, and so on.
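The behavior described above can be mimicked with the standard library alone. The read() and parse() below are hypothetical stand-ins that return plain ElementTree elements, not this module's functions or its tree objects; _as_handle and SAMPLE are likewise invented for the sketch.

```python
import io
import xml.etree.ElementTree as ET

PHYLOXML_NS = "{http://www.phyloxml.org}"


def _as_handle(source):
    """Return (handle, opened_here) for a file name or an open handle."""
    if hasattr(source, "read"):
        return source, False
    return open(source, "rb"), True


def parse(source):
    """Yield each <phylogeny> element as soon as it has been fully read."""
    handle, opened_here = _as_handle(source)
    try:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == PHYLOXML_NS + "phylogeny":
                yield elem
    finally:
        if opened_here:
            handle.close()


def read(source):
    """Return the root element, which holds the whole file's data."""
    handle, opened_here = _as_handle(source)
    try:
        return ET.parse(handle).getroot()
    finally:
        if opened_here:
            handle.close()


# Hypothetical sample file containing two trees.
SAMPLE = (
    b'<phyloxml xmlns="http://www.phyloxml.org">'
    b'<phylogeny rooted="true"><name>first</name><clade/></phylogeny>'
    b'<phylogeny rooted="false"><name>second</name><clade/></phylogeny>'
    b'</phyloxml>'
)

names = [tree.findtext(PHYLOXML_NS + "name") for tree in parse(io.BytesIO(SAMPLE))]
root = read(io.BytesIO(SAMPLE))
```

Because only a read() method is required of the source, a handle returned by gzip.open() can be passed in the same way as an ordinary file.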

Writer

The writer portion of this module hasn't been written yet, but presumably it will be based on ElementTree as well.

Integration

The Tree.Sequence class contains methods for converting to and from Biopython SeqRecord objects. This includes the molecular sequence (mol_seq) as a Seq object, and the protein domain architecture as a list of SeqFeature objects.

At some point this module should be merged into the Biopython trunk, and it would be nice to have a common interface with Bio.Nexus and Newick. Should these three modules be reorganized to extract a common Bio.TreeIO interface? Let's discuss it at some point.

Tricks

Summer of Code project

This module is being developed by Eric as a project for Google Summer of Code 2009, with NESCent as the mentoring organization and Brad as the primary mentor.

Main SoC project page: PhyloSoC:Biopython support for parsing and writing phyloXML

Other software

Christian Zmasek, author of the phyloXML specification, has released some software that uses this format:

  • Forester -- a collection of Java and Ruby libraries for working with phylogenetic data
  • Archaopteryx -- Java application for the visualization of annotated phylogenetic trees (also available in applet form)

Another list is maintained at phylosoft.org.
