From Biopython
Revision as of 04:57, 20 May 2009 by Matzke (Talk | contribs)



BioGeography is a module under development by Nick Matzke for a Google Summer of Code 2009 project. It is run through NESCent's Phyloinformatics Summer of Code 2009. See the project proposal at: Biogeographical Phylogenetics for BioPython. The mentors are Stephen Smith (primary), Brad Chapman, and David Kidd. The code currently lives in the nmatzke branch on GitHub, and you can see a timeline and other info about ongoing development here.

Abstract: Create a BioPython module that will enable users to automatically access and parse species locality records from online biodiversity databases; link these to user-specified phylogenies; calculate basic alpha- and beta-phylodiversity summary statistics; produce input files for the various algorithms available for inferring historical biogeography; and convert output from these programs into files suitable for mapping, e.g. in Google Earth (KML files).

Work Plan

May, week 1: Functions to read locality data and place points in geographic regions (Tasks 1-2)

  • Function: readshapefile
Parses polygon, point, and multipoint shapefiles into python objects (storing latitude/longitude coordinates and feature names, e.g. the region name associated with each polygon)
  • Function: readGBIFrecord
Parse a manually downloaded GBIF record, extracting latitude/longitude and taxon names
  • Function: points2ranges
Input geographic points, determine which region (polygon) each point falls in (via a point-in-polygon algorithm); also output points that are unclassified. For example, some GBIF locations were mis-typed in the source database, so a record will fall in the middle of the ocean.
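The point-in-polygon step can be sketched with the standard ray-casting (even-odd) test. This is only an illustration, not the module's code: the names point_in_polygon and classify_points are hypothetical, polygons are assumed to be simple lists of (lon, lat) vertices, and coordinates are treated as planar (no handling of the antimeridian or the poles).

```python
def point_in_polygon(lon, lat, polygon):
    """Return True if (lon, lat) is inside polygon, a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing to the east flips the state
    return inside


def classify_points(points, regions):
    """Assign each point to the first region containing it.

    points  -- dict: record name -> (lon, lat)
    regions -- dict: region name -> polygon (list of (lon, lat))
    Returns (assigned, unclassified): a dict name -> region name, and a list
    of record names that fell in no region (e.g. mis-typed coordinates).
    """
    assigned = {}
    unclassified = []
    for name, (lon, lat) in points.items():
        for region_name, polygon in regions.items():
            if point_in_polygon(lon, lat, polygon):
                assigned[name] = region_name
                break
        else:
            unclassified.append(name)
    return assigned, unclassified
```

Returning the unclassified records separately, rather than dropping them, keeps suspicious localities visible to the user.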

June, week 1: Functions to search GBIF and download occurrence records

Note: creating functions for all possible interactions with GBIF is not possible in the time available, so I will focus on searching and downloading basic occurrence record data. The relevant GBIF web service is here:

  • Function: searchGBIFrecords – user inputs parameters and a list of GBIF records is returned
  • Function: gettaxonconceptkey – user inputs a taxon name and gets the GBIF key back (useful for searching GBIF records and finding e.g. synonyms and daughter taxa). The GBIF taxon concepts are accessed via the taxon web service:

June, week 2: Functions to get GBIF records

  • Function: getGBIFrecord – retrieves the record (for this project, just the “brief” format of the record) and saves it
  • Function: getGBIFrecords – calls getGBIFrecord for a user-specified list of records (derived from searchGBIFrecords function call)
  • Function: readGBIFrecords – calls readGBIFrecord on a list of saved records

June, week 3: Functions to read user-specified Newick files (with ages and internal node labels) and generate basic summary information.

(note: I have scripts doing all of these functions already, so the work is integrating them into a Biopython module, testing them, etc.)

  • Function: read_ultrametric_Newick – read a Newick file into a tree object (a series of node objects linked to parent and daughter nodes), also reading node ages and node labels if any.
  • Function: treelength – get the total branchlength above a given node
  • Function: phylodistance – get the phylogenetic distance (branch length) between two nodes
  • Function: get_distance_matrix – get a matrix of all of the pairwise distances between the tips of a tree.

This can be a slow function for large trees. Currently I call a Java function from Python, and this is probably the way to go.

  • Function: subset_tree – given a list of tips and a tree, remove all other tips and resulting redundant nodes to produce a new smaller tree (as in Phylomatic)
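As a rough sketch of the tree functions above, assuming a minimal hypothetical Node class (not the module's actual tree object) in which each node stores the length of the branch leading to it. Here "above a given node" is interpreted as the subtree descending from that node, which is an assumption about the intended meaning.

```python
class Node:
    """Minimal tree node; branch_length is the branch leading to this node."""

    def __init__(self, name=None, branch_length=0.0, parent=None):
        self.name = name
        self.branch_length = branch_length
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def treelength(node):
    """Total branch length of the subtree descending from node.

    The branch leading to node itself is not counted.
    """
    total = 0.0
    stack = list(node.children)
    while stack:
        current = stack.pop()
        total += current.branch_length
        stack.extend(current.children)
    return total


def phylodistance(a, b):
    """Branch-length distance between two nodes via their common ancestor."""
    # Record the distance from a to each of its ancestors (including a itself).
    dist_from_a = {}
    d = 0.0
    node = a
    while node is not None:
        dist_from_a[id(node)] = d
        d += node.branch_length
        node = node.parent
    # Walk up from b until we hit a node that is also an ancestor of a.
    d = 0.0
    node = b
    while node is not None:
        if id(node) in dist_from_a:
            return dist_from_a[id(node)] + d
        d += node.branch_length
        node = node.parent
    raise ValueError("nodes are not in the same tree")
```

A pairwise distance matrix could be built by calling phylodistance on every pair of tips, which is quadratic in the number of tips and is consistent with the note above that this step is slow for large trees.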

June, week 4: Functions to summarize taxon diversity in regions, given a phylogeny and a list of taxa and the regions they are in.

(note: I have scripts doing all of these functions already, so the work is integrating them into a Biopython module, testing them, etc.)

  • Function: alphadiversity – alpha diversity of a region (number of taxa in the region)
  • Function: betadiversity – beta diversity (Sørensen’s index) between two regions
  • Function: alphaphylodistance – total branchlength of a phylogeny of taxa within a region
  • Function: phylosor – phylogenetic Sørensen’s index between two regions
  • Function: meanphylodistance – average distance between all tips on a region’s phylogeny
  • Function: meanminphylodistance – average distance to nearest neighbor for tips on a region’s phylogeny
  • Function: netrelatednessindex – standardized index of mean phylodistance
  • Function: nearesttaxonindex – standardized index of mean minimum phylodistance
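A few of these statistics can be sketched with plain sets and a precomputed tip-to-tip distance table. The signatures below are assumptions made for illustration: the distance table is taken to be a dict keyed by frozenset pairs of tip names, and betadiversity returns the Sørensen similarity 2C / (A + B), where C is the number of shared taxa and A and B are the taxon counts of the two regions. The standardized indices (netrelatednessindex, nearesttaxonindex) need a null model and are not shown.

```python
from itertools import combinations


def alphadiversity(taxa):
    """Number of distinct taxa in a region."""
    return len(set(taxa))


def betadiversity(taxa_a, taxa_b):
    """Sørensen similarity between two regions: 2 * shared / (n_a + n_b)."""
    a, b = set(taxa_a), set(taxa_b)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def meanphylodistance(taxa, dist):
    """Mean pairwise distance between all tips in a region.

    dist -- dict: frozenset((tip1, tip2)) -> phylogenetic distance
    """
    pairs = list(combinations(sorted(set(taxa)), 2))
    if not pairs:
        return 0.0
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


def meanminphylodistance(taxa, dist):
    """Mean distance from each tip to its nearest neighbor in the region."""
    taxa = sorted(set(taxa))
    if len(taxa) < 2:
        return 0.0
    nearest = [
        min(dist[frozenset((tip, other))] for other in taxa if other != tip)
        for tip in taxa
    ]
    return sum(nearest) / len(nearest)
```

Keying the table by frozenset means the order of the two tip names does not matter when looking up a distance.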

July, week 1: lagrange input/output handling (Task 6)

(note: lagrange requires a number of input files, e.g. hypothesized histories of connectivity; the only inputs suitable for automation in this project are the species ranges and phylogeny)

  • Function: make_lagrange_species_range_inputs – convert list of taxa/ranges to input format:
  • Function: check_input_lagrange_tree – checks if input phylogeny meets the requirements for lagrange, i.e. has ultrametric branchlengths, tips end at time 0, tip names are in the species/ranges input file
  • Function: parse_lagrange_output – take the output file from lagrange and get ages and estimated regions for each node
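The tree checks described for check_input_lagrange_tree might look roughly like the following. This sketch does not read or write any actual lagrange file format: the tree is assumed to be a dict mapping each node name to (parent name, branch length), the range data is reduced to a list of taxon names, and the function names and tolerance are hypothetical.

```python
def root_to_tip_distances(parents):
    """Return dict: tip name -> summed branch length from root to that tip.

    parents -- dict: node name -> (parent name or None, branch length)
    """
    internal = {parent for parent, _ in parents.values() if parent is not None}
    tips = [name for name in parents if name not in internal]
    distances = {}
    for tip in tips:
        d = 0.0
        node = tip
        while node is not None:
            parent, length = parents[node]
            d += length
            node = parent
        distances[tip] = d
    return distances


def check_tree(parents, range_taxa, tol=1e-6):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    distances = root_to_tip_distances(parents)
    # Ultrametric: every tip is the same distance from the root, so all tips
    # end at the same time (time 0 if the root age equals that distance).
    if max(distances.values()) - min(distances.values()) > tol:
        problems.append("tree is not ultrametric")
    # Every tip must have an entry in the species/ranges data.
    missing = sorted(set(distances) - set(range_taxa))
    if missing:
        problems.append("tips missing from range file: " + ", ".join(missing))
    return problems
```

Returning a list of problems, rather than raising on the first one, lets a user fix all of the input issues in one pass.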

July, weeks 2-3: Devise algorithm for representing estimated node histories (location of nodes in categorical regions) as latitude/longitude points, necessary for input into geographic display files.

  • Regarding where to put reconstructed nodes, or tips for which the only location information is a region: within regions, when linking already geo-located tips, spatial averaging can be used, as currently happens with GeoPhyloBuilder. If there is only one node in a region, the centroid or something similar could be used (e.g. the "root" of the polygon skeleton, which would work even for oddly shaped concave polygons).
  • If there are multiple ancestral nodes or region-only tips in a region, they need to be spread out inside the polygon, or lines will just be drawn on top of each other. This can be done by putting the most ancient node at the root of the polygon skeleton/medial axis, and then spreading out the daughter nodes along the skeleton/medial axis of the polygon.
  • Function: get_polygon_skeleton – this is a standard operation:
  • Function: assign_node_locations_in_region – within a region’s polygon, given a list of nodes, their relationship, and ages, spread the nodes out along the middle 50% of the longest axis of the polygon skeleton, with the oldest node in the middle
  • Function: assign_node_locations_between_regions – connect the nodes that are linked to branches that cross between regions (for this initial project, just the great circle lines)
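For the great circle lines between regions, intermediate points can be computed by spherical interpolation between the two endpoints. The sketch below assumes a spherical Earth and a hypothetical function name, great_circle_points; it is not defined for antipodal endpoints, where the great circle is not unique.

```python
import math


def great_circle_points(lat1, lon1, lat2, lon2, n=10):
    """Return n (lat, lon) points in degrees, evenly spaced along the great
    circle from (lat1, lon1) to (lat2, lon2), endpoints included."""
    if n < 2:
        raise ValueError("n must be at least 2")
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Angular distance between the endpoints (haversine formula).
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    d = 2 * math.asin(min(1.0, math.sqrt(a)))
    if d < 1e-12:
        # Endpoints coincide: nothing to interpolate.
        return [(lat1, lon1)] * n
    points = []
    for i in range(n):
        f = i / (n - 1)  # fraction of the way along the arc
        wa = math.sin((1 - f) * d) / math.sin(d)
        wb = math.sin(f * d) / math.sin(d)
        # Weighted sum of the endpoint unit vectors stays on the sphere.
        x = wa * math.cos(p1) * math.cos(l1) + wb * math.cos(p2) * math.cos(l2)
        y = wa * math.cos(p1) * math.sin(l1) + wb * math.cos(p2) * math.sin(l2)
        z = wa * math.sin(p1) + wb * math.sin(p2)
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        points.append((lat, lon))
    return points
```

The resulting list of points can be passed directly to whatever writes the line geometry, so that long branches follow the curve of the globe instead of a straight line in latitude/longitude space.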

July, week 4 and August, week 1: Write functions for converting the output from the above into graphical display formats, e.g. shapefiles for ArcGIS, KML files for Google Earth.

  • Function: write_history_to_shapefile – write the biogeographic history to a shapefile
  • Function: write_history_to_KML – write the biogeographic history to a KML file for input into Google Earth
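A minimal sketch of the KML side, using only the Python standard library. The input structures are assumptions for illustration (a dict of node name to (lat, lon) and a list of (parent, child) branches), and history_to_kml is a hypothetical name. Each node becomes a Point placemark and each branch a LineString placemark; note that KML coordinates are written in longitude,latitude,altitude order.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def _sub(parent, tag, text=None):
    """Append a namespaced KML child element, optionally with text."""
    element = ET.SubElement(parent, "{%s}%s" % (KML_NS, tag))
    if text is not None:
        element.text = text
    return element


def history_to_kml(nodes, branches):
    """Return a KML document string.

    nodes    -- dict: node name -> (lat, lon) in degrees
    branches -- list of (parent name, child name) pairs
    """
    ET.register_namespace("", KML_NS)  # serialize without a namespace prefix
    kml = ET.Element("{%s}kml" % KML_NS)
    doc = _sub(kml, "Document")
    for name, (lat, lon) in sorted(nodes.items()):
        placemark = _sub(doc, "Placemark")
        _sub(placemark, "name", name)
        point = _sub(placemark, "Point")
        _sub(point, "coordinates", "%s,%s,0" % (lon, lat))
    for parent, child in branches:
        placemark = _sub(doc, "Placemark")
        _sub(placemark, "name", "%s to %s" % (parent, child))
        line = _sub(placemark, "LineString")
        _sub(line, "tessellate", "1")  # let the line follow the terrain
        coords = " ".join(
            "%s,%s,0" % (nodes[n][1], nodes[n][0]) for n in (parent, child)
        )
        _sub(line, "coordinates", coords)
    return ET.tostring(kml, encoding="unicode")
```

The returned string can be written to a .kml file and opened in Google Earth.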

August, week 2: Beta testing

Make the series of functions available, along with suggested input files; have others with various levels of expertise run them on various platforms (e.g. the Evolutionary Biogeography Discussion Group at U.C. Berkeley). Also get final feedback from mentors and advisors.

August, week 3: Wrapup

Assemble documentation, FAQ, project results writeup for Phyloinformatics Summer of Code.
