BioGeography

From Biopython
Revision as of 18:56, 2 July 2009 by Matzke (Talk | contribs)


Introduction

BioGeography is a module under development by Nick Matzke for a Google Summer of Code 2009 project. It is run through NESCent's Phyloinformatics Summer of Code 2009. See the project proposal at: Biogeographical Phylogenetics for BioPython. The mentors are Stephen Smith (primary), Brad Chapman, and David Kidd. The source code is in the Bio/Geography directory of the Geography fork of the nmatzke branch on GitHub, and you can see a timeline and other info about ongoing development of the module here. The new module is being documented on the BioPython wiki as BioGeography.

Abstract: Create a BioPython module that will enable users to automatically access and parse species locality records from online biodiversity databases; link these to user-specified phylogenies; calculate basic alpha- and beta-phylodiversity summary statistics; produce input files for the various algorithms available for inferring historical biogeography; and convert output from these programs into files suitable for mapping, e.g. in Google Earth (KML files).

Work Plan

Note: all major functions are being placed in the file geogUtils.py for the moment. Also, the immediate goal is to just get everything basically working, so details of where to put various functions, what to call them, etc. are being left for later.

Code usage: For a few things, an entire necessary function already exists (e.g. for reading a shapefile), and re-inventing the wheel seems pointless. In most cases the material used appears to be open source (e.g. from a previous Google Summer of Code). For a few short code snippets found online in various places I am less sure. In all cases I am noting the source, and when finalizing this project I will go back and determine whether the material is under copyright and, if so, email the authors for permission to use it.

May, week 1: Functions to read locality data and place points in geographic regions (Tasks 1-2)

readshpfile

Parses polygon, point, and multipoint shapefiles into python objects (storing latitude/longitude coordinates and feature names, e.g. the region name associated with each polygon)

extract_latlong

Parse a manually downloaded GBIF record, extracting latitude/longitude and taxon names

shapefile_points_in_poly, tablefile_points_in_poly

Input geographic points and determine which region (polygon) each point falls in (via a point-in-polygon algorithm). Points that cannot be classified are also output; e.g. some GBIF locations were mis-typed in the source database, so a record will fall in the middle of the ocean.
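The point-in-polygon step can be sketched with the standard ray-casting test. This is a minimal illustration, not the code in geogUtils.py: the function names point_in_polygon and classify_points and the data layout (dictionaries of named points and named polygons) are assumptions made for the example, and coordinates are treated as planar, so polygons crossing the antimeridian or the poles are not handled.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def classify_points(points, regions):
    """Assign each named point to the first region containing it.

    Returns (classified, unclassified): a dict of point name -> region name,
    and a list of point names that fall in no region.
    """
    classified, unclassified = {}, []
    for name, (lon, lat) in points.items():
        for region_name, polygon in regions.items():
            if point_in_polygon(lon, lat, polygon):
                classified[name] = region_name
                break
        else:
            unclassified.append(name)
    return classified, unclassified
```

Returning the unclassified names separately makes it easy to review records whose coordinates were probably mis-typed.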

Code

June, week 1: Functions to search GBIF and download occurrence records

Note: creating functions for all possible interactions with GBIF is not possible in the time available, so I will focus on searching for and downloading basic occurrence record data.

access_gbif

Utility function invoked by other functions: the user inputs parameters, and the GBIF response in XML/DarwinCore format is returned. The relevant GBIF web service, with the search commands etc., is here: http://data.gbif.org/ws/rest/occurrence

get_hits

Get the actual hits that are returned by a given search; returns the name of the file where they are saved

get_xml_hits

Like get_hits, but returns a parsed XML tree

fix_ASCII

Files downloaded from GBIF contain HTML character entities and Unicode characters (mostly umlauts), which break printing of results to the Python prompt; this function fixes that

paramsdict_to_string

converts user's search parameters (in python dictionary format; see here for params http://data.gbif.org/ws/rest/occurrence ) to a string for submission via access_gbif
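A conversion of this kind can be sketched with the standard library's urlencode. The parameter names in the usage note are illustrative only and are not checked against the GBIF service; how the resulting string is attached to the service URL is left to access_gbif.

```python
from urllib.parse import urlencode


def paramsdict_to_string(params):
    """Turn a dict of search parameters into a URL query string.

    Keys are sorted so that the same dict always gives the same string,
    and values are percent-encoded (spaces become '+').
    """
    return urlencode(sorted(params.items()))
```

For example, paramsdict_to_string({"name": "Puma concolor", "limit": 10}) gives "limit=10&name=Puma+concolor".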

xmlstring_to_xmltree(xmlstring)

Take the text string returned by GBIF and parse to an XML tree using ElementTree. Requires the intermediate step of saving to a temporary file (required to make ElementTree.parse work, apparently).
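With the ElementTree module in current Python standard libraries, a string can also be parsed directly in memory with fromstring, which avoids the temporary file. The sketch below assumes that module is available as xml.etree.ElementTree; it is an alternative to, not a copy of, the approach described above.

```python
import xml.etree.ElementTree as ET


def xmlstring_to_xmltree(xmlstring):
    """Parse an XML string and return it wrapped as an ElementTree."""
    root = ET.fromstring(xmlstring)  # parses in memory, returns the root Element
    return ET.ElementTree(root)
```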

element_items_to_dictionary

If the XML tree element has items encoded in the tag, e.g. key/value or whatever, this function puts them in a python dictionary and returns them.

extract_numhits

Search an element of a parsed XML string and find the number of hits, if it exists. Recursively searches, if there are subelements.
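The recursive search can be sketched as follows. The name of the attribute that holds the hit count is passed in as attr_name because the actual name used in the GBIF response is not given here; the XML in the usage note is made up for illustration.

```python
import xml.etree.ElementTree as ET


def element_items_to_dictionary(element):
    """Return the attributes of an XML element as a plain dict."""
    return dict(element.items())


def extract_numhits(element, attr_name):
    """Depth-first search for an attribute holding the number of hits.

    Returns the value as an int, or None if no element carries attr_name.
    """
    attrs = element_items_to_dictionary(element)
    if attr_name in attrs:
        return int(attrs[attr_name])
    for child in element:
        found = extract_numhits(child, attr_name)
        if found is not None:
            return found
    return None
```

For example, with root = ET.fromstring("<response><summary total='42'/></response>"), extract_numhits(root, "total") returns 42.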

print_xmltree

Prints all the elements & subelements of the xmltree to screen (may require fix_ASCII to input file to succeed)

Deleted (turns out this was unnecessary): gettaxonconceptkey

user inputs a taxon name and gets the GBIF key back (useful for searching GBIF records and finding e.g. synonyms and daughter taxa). The GBIF taxon concepts are accessed via the taxon web service: http://data.gbif.org/ws/rest/taxon

Code

June, week 2: Functions to get GBIF records

Added functions to download and parse large numbers of records, get TaxonOccurrence gbifKeys, and search with those keys.

get_record

Retrieves a single specified record in DarwinCore XML format, and returns an xmltree for it.

extract_occurrence_elements

Returns a list of the elements, picking elements by TaxonOccurrence; this should return a list of elements equal to the number of hits.

extract_taxonconceptkeys_tolist

Searches an element in an XML tree for TaxonOccurrence gbifKeys, and the complete name. Searches recursively, if there are subelements. Returns list.

extract_taxonconceptkeys_tofile

Searches an element in an XML tree for TaxonOccurrence gbifKeys, and the complete name. Searches recursively, if there are subelements. Returns file at outfh.

get_all_records_by_increment

Download all of the records in stages, store in list of elements. Increments of e.g. 100 to not overload server. Currently stores results in a list of tempfiles which is returned (could return a list of handles I guess).
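The staging logic can be sketched independently of the network code. In this illustration fetch_page is a hypothetical callable supplied by the caller (for instance a wrapper around access_gbif) that takes a start index and a record count; the sketch returns whatever fetch_page returns rather than tempfiles.

```python
def get_all_records_by_increment(fetch_page, numhits, increment=100):
    """Fetch numhits records in chunks of at most `increment`.

    fetch_page(start, count) is a caller-supplied (hypothetical) function
    that retrieves `count` records beginning at index `start`.
    """
    pages = []
    for start in range(0, numhits, increment):
        count = min(increment, numhits - start)  # last chunk may be smaller
        pages.append(fetch_page(start, count))
    return pages
```

With 250 hits and an increment of 100, fetch_page is called with (0, 100), (100, 100), and (200, 50).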

Code

Updated functions have been pushed to Github here

June, week 3: Functions to read user-specified Newick files (with ages and internal node labels) and generate basic summary information.

(note: I have scripts doing all of these functions already, so the work is integrating them into a Biopython module, testing them, etc.)

read_ultrametric_Newick(newickstr)

Read a Newick file into a tree object (a series of node objects linked to parent and daughter nodes), also reading node ages and node labels if any.

list_leaves(phylo_obj)

Print out all of the leaves above a node object

treelength(node)

Gets the total branchlength above a given node by recursively adding through tree.
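The recursion used by list_leaves and treelength can be sketched on a minimal node class. The Node class here is a stand-in invented for the example (label, branch length to the parent, list of children); it is not the tree object produced by read_ultrametric_Newick.

```python
class Node:
    """Minimal tree node: a label, a branch length to the parent, children."""

    def __init__(self, label=None, length=0.0, children=None):
        self.label = label
        self.length = length
        self.children = children or []


def list_leaves(node):
    """Return the labels of all tips above (descended from) a node."""
    if not node.children:
        return [node.label]
    leaves = []
    for child in node.children:
        leaves.extend(list_leaves(child))
    return leaves


def treelength(node):
    """Sum all branch lengths above a node, not counting the node's own branch."""
    return sum(child.length + treelength(child) for child in node.children)


# The tree ((A:1,B:1):2,C:3)
root = Node("root", 0.0, [
    Node("AB", 2.0, [Node("A", 1.0), Node("B", 1.0)]),
    Node("C", 3.0),
])
```

Here list_leaves(root) gives ["A", "B", "C"] and treelength(root) gives 7.0.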

phylodistance(node1, node2)

Get the phylogenetic distance (branch length) between two nodes.

get_distance_matrix(phylo_obj)

Get a matrix of all of the pairwise distances between the tips of a tree.

get_mrca_array(phylo_obj)

Get a square list of lists (array) listing the mrca of each pair of leaves (half-diagonal matrix)

subset_tree(phylo_obj, list_to_keep)

Given a list of tips and a tree, remove all other tips and resulting redundant nodes to produce a new smaller tree.

prune_single_desc_nodes(node)

Follow a tree from the bottom up, pruning any nodes with only one descendant

find_new_root(node)

Search up tree from root and make new root at first divergence

make_None_list_array(xdim, ydim)

Make a list of lists ("array") with the specified dimensions

get_PD_to_mrca(node, mrca, PD)

Add up the phylogenetic distance from a node to the specified ancestor (mrca). Find mrca with find_1st_match.

find_1st_match(list1, list2)

Find the first match in two ordered lists.

get_ancestors_list(node, anc_list)

Get the list of ancestors of a given node
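The way these helpers combine into phylodistance can be sketched as follows. For brevity the tree is represented as a dictionary mapping each node name to a (parent name, branch length) pair, which is an assumption of this example; the signatures therefore differ from the ones listed above.

```python
def get_ancestors_list(name, parent):
    """Ancestors of a node, nearest first, ending at the root."""
    ancestors = []
    while name in parent:
        name = parent[name][0]
        ancestors.append(name)
    return ancestors


def find_1st_match(list1, list2):
    """Return the first item of list1 that also occurs in list2, else None."""
    in_list2 = set(list2)
    for item in list1:
        if item in in_list2:
            return item
    return None


def get_PD_to_mrca(name, mrca, parent):
    """Add up branch lengths from a node down to the given ancestor."""
    total = 0.0
    while name != mrca:
        up, length = parent[name]
        total += length
        name = up
    return total


def phylodistance(name1, name2, parent):
    """Branch-length distance between two nodes via their MRCA."""
    # Each node is put at the head of its own list so that the case where
    # one node is an ancestor of the other is handled.
    path1 = [name1] + get_ancestors_list(name1, parent)
    path2 = [name2] + get_ancestors_list(name2, parent)
    mrca = find_1st_match(path1, path2)
    return get_PD_to_mrca(name1, mrca, parent) + get_PD_to_mrca(name2, mrca, parent)


# The tree ((A:1,B:1)AB:2,C:3)root
parent = {"A": ("AB", 1.0), "B": ("AB", 1.0), "AB": ("root", 2.0), "C": ("root", 3.0)}
```

On this tree phylodistance("A", "B", parent) is 2.0 and phylodistance("A", "C", parent) is 6.0.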

addup_PD(node, PD)

Adds the branchlength of the current node to the total PD measure.

print_tree_outline_format(phylo_obj)

Prints the tree out in "outline" format (daughter clades are indented, etc.)

print_Node(node, rank)

Prints the node in question, and recursively all daughter nodes, maintaining rank as it goes.

lagrange_disclaimer()

Just prints lagrange citation etc. in code using lagrange libraries.

Code

June, week 4: Functions to summarize taxon diversity in regions, given a phylogeny and a list of taxa and the regions they are in.

(note: I have scripts doing all of these functions already, so the work is integrating them into a Biopython module, testing them, etc.)

Priority for this week:

Following up on suggestions to make the code more standard. The priority is figuring out how to revise the current BioPython phylogeny class to resemble the better version in lagrange, so that there is a generic, flexible phylogeny/Newick parser that can be used generally as well as by the BioGeography package specifically.


Added a bunch of tools for managing/parsing xmltree structures from ElementTree parsing of XML:

find_to_elements_w_ancs(xmltree, el_tag, anc_el_tag)

Burrow into XML to get elements with tag el_tag; return only those elements that sit underneath a particular ancestor element with tag anc_el_tag


create_sub_xmltree(element)

Create a subset xmltree (to avoid going back to irrelevant parents)


xml_recursive_search_w_anc(xmltree, element, el_tag, anc_el_tag, match_el_list)

Recursively burrows down to find whatever elements with el_tag exist inside an ancestor element with tag anc_el_tag.


xml_burrow_up(xmltree, element, anc_el_tag, found_anc)

Burrow up xml to find anc_el_tag


xml_burrow_up_cousin(xmltree, element, cousin_el_tag, found_cousin)

Burrow up from element of interest, until a cousin is found with cousin_el_tag


return_parent_in_xmltree(xmltree, child_to_search_for)

Search through an xmltree to get the parent of child_to_search_for


return_parent_in_element(potential_parent, child_to_search_for, returned_parent)

Search through an XML element to return parent of child_to_search_for
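ElementTree elements do not store a reference to their parent, which is why the parent has to be searched for. One common alternative, sketched here, is to build a child-to-parent dictionary once and then walk upward through it. The name build_parent_map is an assumption of this example, and xml_burrow_up is given a simplified signature that differs from the one listed above.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element under root to its parent element."""
    return {child: parent for parent in root.iter() for child in parent}


def xml_burrow_up(element, anc_el_tag, parent_map):
    """Walk up from element to the nearest ancestor tagged anc_el_tag.

    Returns the ancestor element, or None if there is no such ancestor.
    """
    current = parent_map.get(element)
    while current is not None and current.tag != anc_el_tag:
        current = parent_map.get(current)
    return current
```

The map costs one pass over the tree to build, after which each parent lookup is a dictionary access instead of a fresh search.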


find_1st_matching_element(element, el_tag, return_element)

Burrow down into the XML tree, retrieve the first element with the matching tag


element_items_to_string(items)

Input a list of items, get string back

Code

These still need to be integrated:

alphadiversity

alpha diversity of a region (number of taxa in the region)

betadiversity

beta diversity (Sørensen’s index) between two regions

alphaphylodistance

total branchlength of a phylogeny of taxa within a region

phylosor

phylogenetic Sørensen’s index between two regions

meanphylodistance

average distance between all tips on a region’s phylogeny

meanminphylodistance

average distance to nearest neighbor for tips on a region’s phylogeny

netrelatednessindex

standardized index of mean phylodistance

nearesttaxonindex

standardized index of mean minimum phylodistance
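Several of these statistics reduce to a few lines once the taxon lists and a tip-to-tip distance matrix (as from get_distance_matrix) are available. The sketch below makes some assumptions: taxa are given as lists of names, the matrix is a square list of lists, and the Sørensen index is computed in its similarity form, 2 * shared / (n1 + n2), so a dissimilarity would be one minus this value. The standardized indices and phylosor are not covered, since they need a null model or branch-level bookkeeping.

```python
def alphadiversity(taxa):
    """Number of distinct taxa in a region."""
    return len(set(taxa))


def betadiversity(taxa1, taxa2):
    """Sørensen similarity between two regions: 2 * shared / (n1 + n2)."""
    set1, set2 = set(taxa1), set(taxa2)
    total = len(set1) + len(set2)
    if total == 0:
        return 0.0  # two empty regions: defined as 0 here by convention
    return 2.0 * len(set1 & set2) / total


def meanphylodistance(dist):
    """Mean of all pairwise tip distances (upper triangle of the matrix)."""
    n = len(dist)
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(pairs) / len(pairs)


def meanminphylodistance(dist):
    """Mean, over tips, of the distance to each tip's nearest neighbor."""
    n = len(dist)
    nearest = [min(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    return sum(nearest) / n
```

Both distance functions assume a region with at least two tips.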

July, week 1: lagrange input/output handling (Task 6)

(note: lagrange requires a number of input files, e.g. hypothesized histories of connectivity; the only inputs suitable for automation in this project are the species ranges and phylogeny)

make_lagrange_species_range_inputs

convert list of taxa/ranges to input format: http://www.reelab.net/lagrange/configurator/index

check_input_lagrange_tree

checks if input phylogeny meets the requirements for lagrange, i.e. has ultrametric branchlengths, tips end at time 0, tip names are in the species/ranges input file
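Two of these checks can be sketched on a toy tree. The tree representation (nested tuples of label, branch length, and children), the tolerance, and the returned list of problem messages are all assumptions of this example. Equal root-to-tip distances are used as the test for ultrametricity, which also means all tips end at the same time.

```python
def tip_depths(node, depth=0.0):
    """Return {tip label: distance from the root} for a nested-tuple tree.

    Each node is a tuple (label, branch_length, children).
    """
    label, length, children = node
    depth += length
    if not children:
        return {label: depth}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, depth))
    return depths


def check_input_lagrange_tree(tree, range_taxa, tol=1e-6):
    """Return a list of problems found; an empty list means the checks passed."""
    depths = tip_depths(tree)
    problems = []
    if max(depths.values()) - min(depths.values()) > tol:
        problems.append("tree is not ultrametric")
    missing = sorted(set(depths) - set(range_taxa))
    if missing:
        problems.append("tips missing from ranges file: " + ", ".join(missing))
    return problems


# The tree ((A:1,B:1):2,C:3), in which every tip is 3.0 from the root
tree = ("root", 0.0, [
    ("AB", 2.0, [("A", 1.0, []), ("B", 1.0, [])]),
    ("C", 3.0, []),
])
```

Returning messages rather than a single boolean lets the caller report every problem at once.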

parse_lagrange_output

take the output file from lagrange and get ages and estimated regions for each node

July, weeks 2-3: Devise algorithm for representing estimated node histories (location of nodes in categorical regions) as latitude/longitude points, necessary for input into geographic display files.

  • Regarding where to put reconstructed nodes, or tips for which the only location information is the region: within regions, when linking already geo-located tips, spatial averaging can be used, as currently happens with GeoPhyloBuilder. If there is only one node in a region, the centroid or something similar could be used (i.e. the "root" of the polygon skeleton would work even with weird concave polygons).
  • If there are multiple ancestral nodes or region-only tips in a region, they need to be spread out inside the polygon, or lines will just be drawn on top of each other. This can be done by putting the most ancient node at the root of the polygon skeleton/medial axis, and then spreading out the daughter nodes along the skeleton/medial axis of the polygon.

get_polygon_skeleton

this is a standard operation: http://en.wikipedia.org/wiki/Straight_skeleton

assign_node_locations_in_region

within a region’s polygon, given a list of nodes, their relationship, and ages, spread the nodes out along the middle 50% of the longest axis of the polygon skeleton, with the oldest node in the middle
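One possible reading of this placement rule is sketched below. It simplifies heavily: the longest skeleton axis is replaced by a straight segment between two given (lat, lon) endpoints, positions are interpolated linearly in latitude and longitude, and "oldest in the middle" is interpreted as assigning older nodes to positions closer to the center. Node relationships are ignored, and the argument names are assumptions of the example.

```python
def assign_node_locations_in_region(node_ages, axis_start, axis_end):
    """Spread nodes over the middle 50% of an axis, oldest nearest the center.

    node_ages maps node name -> age; axis_start and axis_end are (lat, lon).
    Returns a dict of node name -> (lat, lon).
    """
    # Oldest node first.
    names = sorted(node_ages, key=lambda name: node_ages[name], reverse=True)
    n = len(names)
    if n == 1:
        fractions = [0.5]
    else:
        # Evenly spaced positions between 25% and 75% of the way along the axis.
        fractions = [0.25 + 0.5 * i / (n - 1) for i in range(n)]
    # Positions nearest the center are handed out first, i.e. to older nodes.
    fractions.sort(key=lambda f: abs(f - 0.5))
    (lat1, lon1), (lat2, lon2) = axis_start, axis_end
    locations = {}
    for name, f in zip(names, fractions):
        locations[name] = (lat1 + f * (lat2 - lat1), lon1 + f * (lon2 - lon1))
    return locations
```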

assign_node_locations_between_regions

connect the nodes that are linked to branches that cross between regions (for this initial project, just the great circle lines)
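A great circle line between two nodes can be approximated by a series of intermediate points. The sketch below uses the haversine formula for the angular distance and the usual spherical interpolation between the two endpoints, on a perfectly spherical Earth. The function name and the (lat, lon) tuple convention are assumptions of the example, and antipodal endpoints (for which the route is not unique) are not handled.

```python
import math


def great_circle_points(start, end, n_segments=10):
    """Return n_segments + 1 (lat, lon) points, in degrees, along the great circle."""
    lat1, lon1 = (math.radians(v) for v in start)
    lat2, lon2 = (math.radians(v) for v in end)
    # Angular distance between the endpoints (haversine formula).
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    d = 2 * math.asin(min(1.0, math.sqrt(h)))
    if d < 1e-12:
        return [start] * (n_segments + 1)  # endpoints coincide
    points = []
    for i in range(n_segments + 1):
        f = i / n_segments
        # Weights of the two endpoint vectors at fraction f of the way along.
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        x = a * math.cos(lat1) * math.cos(lon1) + b * math.cos(lat2) * math.cos(lon2)
        y = a * math.cos(lat1) * math.sin(lon1) + b * math.cos(lat2) * math.sin(lon2)
        z = a * math.sin(lat1) + b * math.sin(lat2)
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        points.append((lat, lon))
    return points
```

The resulting list of points can be written out as a single line feature in either display format.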

July, week 4 and August, week 1: Write functions for converting the output from the above into graphical display formats, e.g. shapefiles for ArcGIS, KML files for Google Earth.

write_history_to_shapefile

write the biogeographic history to a shapefile

write_history_to_KML

write the biogeographic history to a KML file for input into Google Earth
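A minimal KML writer can be sketched with ElementTree. The input layout (a dict of node name to (lat, lon) and a list of (parent, child) name pairs) is an assumption of this example, as are the sample node names and coordinates. Note that KML expects coordinates in longitude,latitude,altitude order.

```python
import io
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def _tag(name):
    """Qualify a tag name with the KML namespace."""
    return "{%s}%s" % (KML_NS, name)


def write_history_to_KML(nodes, branches, outfile):
    """Write nodes as point placemarks and branches as line placemarks.

    outfile is a filename or a file object opened in binary mode.
    """
    ET.register_namespace("", KML_NS)
    kml = ET.Element(_tag("kml"))
    doc = ET.SubElement(kml, _tag("Document"))
    for name, (lat, lon) in nodes.items():
        placemark = ET.SubElement(doc, _tag("Placemark"))
        ET.SubElement(placemark, _tag("name")).text = name
        point = ET.SubElement(placemark, _tag("Point"))
        ET.SubElement(point, _tag("coordinates")).text = "%f,%f,0" % (lon, lat)
    for parent, child in branches:
        placemark = ET.SubElement(doc, _tag("Placemark"))
        ET.SubElement(placemark, _tag("name")).text = "%s to %s" % (parent, child)
        line = ET.SubElement(placemark, _tag("LineString"))
        coords = ["%f,%f,0" % (nodes[n][1], nodes[n][0]) for n in (parent, child)]
        ET.SubElement(line, _tag("coordinates")).text = " ".join(coords)
    ET.ElementTree(kml).write(outfile, encoding="UTF-8", xml_declaration=True)


# Usage with made-up nodes, written to an in-memory buffer instead of a file.
example_nodes = {"anc": (10.0, 20.0), "tip": (12.0, 25.0)}
buffer = io.BytesIO()
write_history_to_KML(example_nodes, [("anc", "tip")], buffer)
kml_bytes = buffer.getvalue()
```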

August, week 2: Beta testing

Make the series of functions available, along with suggested input files; have others run on various platforms, with various levels of expertise (e.g. Evolutionary Biogeography Discussion Group at U.C. Berkeley). Also get final feedback from mentors and advisors.

August, week 3: Wrapup

Assemble documentation, FAQ, project results writeup for Phyloinformatics Summer of Code.
