Google Summer of Code


Revision as of 20:14, 7 March 2011

As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2010. We are supporting João Rodrigues in his project, "Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module."

In 2009, Biopython was involved with GSoC in collaboration with our friends at NESCent, and had two projects funded.

Please read the GSoC page at the Open Bioinformatics Foundation and the main Google Summer of Code page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the mailing list.


2010 Project ideas

Biopython and PyCogent interoperability

PyCogent and Biopython are two widely used toolkits for computational biology and bioinformatics work in Python. The libraries have traditionally had different focuses: Biopython on sequence parsing and retrieval, PyCogent on evolutionary and phylogenetic processing. Both user communities would benefit from increased interoperability between the code bases, easing the development of complex workflows.
The student would focus on soliciting use case scenarios from developers and the larger communities associated with both projects, and use these as the basis for adding glue code and documentation to both libraries. Some use cases of immediate interest as a starting point are:
  • Allow round-trip conversion between Biopython and PyCogent core objects (sequence, alignment, tree, etc.).
  • Build workflows combining codon usage analyses in PyCogent with clustering code in Biopython.
  • Connect Biopython-acquired sequences to PyCogent's alignment, phylogenetic tree preparation and tree visualization code.
  • Integrate Biopython's phyloXML support, developed during GSoC 2009, with PyCogent.
  • Develop a standardised controller architecture for interrogating genome databases by extending PyCogent's Ensembl code, including export to Biopython objects.
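As a sketch of what the round-trip glue in the first bullet might look like, the snippet below uses FASTA text as a neutral interchange format. The function names and the plain-dict representation are illustrative only; real glue code would accept and return Biopython SeqRecord and PyCogent sequence objects directly rather than strings.

```python
# Illustrative interchange layer using FASTA text as a common format.
# In real glue code these would convert between Biopython SeqRecord
# objects and PyCogent sequence objects; plain dicts stand in here.

def records_to_fasta(records):
    """Serialize {name: sequence} pairs to a FASTA string."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records.items())

def fasta_to_records(text):
    """Parse a FASTA string back into {name: sequence} pairs."""
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line.strip():
            chunks.append(line.strip())
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

The key design requirement is that the conversion is lossless in both directions, so that `fasta_to_records(records_to_fasta(recs))` returns the original data unchanged.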
This project gives the student a lot of freedom to create useful interoperability between two feature-rich libraries. Unlike projects that require churning out many lines of new code, the major challenge here is defining useful APIs and interfaces for existing code. Inventiveness and strong coding skill will be needed to write the glue code; we feel library integration is an extremely valuable skill. We also value clear, use-case-based documentation to support the new interfaces.
Involved toolkits or projects
  • Biopython
  • PyCogent
Degree of difficulty and needed skills 
Medium to Hard. At a minimum, the student will need to be highly competent in Python and become familiar with core objects in PyCogent and Biopython. Sub-projects will require additional expertise, for instance: familiarity with concepts in phylogenetics and genome biology; understanding SQL dialects.
Mentors
Gavin Huttley, Rob Knight, Brad Chapman, Eric Talevich

Galaxy phylogenetics pipeline development

Galaxy is a popular web-based interface for integrating biological tools and analysis pipelines. It is widely used by bench biologists for their analysis work, and by computational biologists for building interfaces to developed tools. HyPhy is a popular package for molecular evolution and sequence statistical analysis, and its associated web server provides web-based workflows for a number of common HyPhy tasks. This project bridges these two complementary projects by bringing HyPhy workflows into the Galaxy system, standardizing these analyses on a widely used platform.
The student would bring existing HyPhy workflows to Galaxy. The general approach would be to pick a workflow, wrap the relevant tools using Galaxy's XML tool definition language, and implement a shared pipeline with Galaxy's workflow system. Functional tests would be developed for tools and workflows, along with high-level documentation for end users.
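To give a flavour of the wrapping step, a Galaxy tool definition might look roughly like the fragment below. This is a hypothetical example: the tool id, the `run_hyphy_dnds.py` driver script, and the parameter names and formats are all placeholders invented for illustration, not an existing HyPhy wrapper.

```xml
<!-- Hypothetical Galaxy tool definition; tool id, script name,
     parameters and data formats are placeholders, not a real wrapper. -->
<tool id="hyphy_dnds" name="dN/dS estimation (HyPhy)">
  <command>run_hyphy_dnds.py $input_alignment $input_tree $output</command>
  <inputs>
    <param name="input_alignment" type="data" format="fasta" label="Codon alignment"/>
    <param name="input_tree" type="data" format="txt" label="Phylogenetic tree"/>
  </inputs>
  <outputs>
    <data name="output" format="tabular" label="Per-branch dN/dS estimates"/>
  </outputs>
  <help>Estimates dN/dS ratios for each branch using HyPhy.</help>
</tool>
```

Each wrapped tool would pair a definition like this with functional tests and documentation, and the wrapped tools would then be chained together with Galaxy's workflow system.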
This project requires the student to become comfortable working in the existing Galaxy framework. This is a useful practical skill as Galaxy is widely used in the biological community. Similarly, the student should become familiar with the statistical evolutionary methods in HyPhy to feel comfortable wrapping and testing them in Galaxy. Since the tools would be widely used from the main Galaxy website and installed instances, we place a strong emphasis on students who feel comfortable building tests and examples that would ensure the developed workflows function as expected.
Involved toolkits or projects
  • Galaxy
  • HyPhy
Degree of difficulty and needed skills 
Medium to Hard. As envisioned, the project involves implementing full phylogenetic pipelines with the Galaxy toolkit, which requires becoming familiar with the Galaxy tool integration framework as well as with HyPhy tools and current pipelines. The student should be comfortable with XML for developing the tool interfaces, and with Python for integrating scripts and tests with Galaxy and HyPhy.
Mentors
Sergei L Kosakovsky Pond, Brad Chapman, Anton Nekrutenko

Accessing R phylogenetic tools from Python

The R statistical language is a powerful open-source environment for statistical computation and visualization. Python serves as an excellent complement to R since it has a wide variety of available libraries to make data processing, analysis, and web presentation easier. The two can be smoothly interfaced using Rpy2, allowing programmers to leverage the best features of each language. Here we propose to build Rpy2 library components to help ease access to phylogenetic and biogeographical libraries in R.
Rpy2 contains higher-level interfaces to popular R libraries. For instance, the ggplot2 interface allows Python users to access powerful plotting functionality in R with an intuitive API. Providing similar high-level APIs for biological toolkits available in R would help expose these toolkits to a wider audience of Python programmers. A nice introduction to phylogenetic analysis in R is available from Rich Glor at the Bodega Bay Marine Lab wiki. Some examples of R libraries for which integration would be welcomed are:
  • ape (Analysis of Phylogenetics and Evolution) -- an interactive library environment for phylogenetic and evolutionary analyses
  • ade4 -- Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods
  • geiger -- macroevolutionary simulation, and estimation of diversification parameters from comparative phylogenetic data
  • picante -- R tools for integrating phylogenies and ecology
  • mefa -- multivariate data handling for ecological and biogeographical data
The student would have the opportunity to learn an available R toolkit, and then code in Python and R to make this available via an intuitive API. This will involve digging into the R code examples to discover the most useful parts for analysis, and then projecting this into a library that is intuitive to Python coders. Beyond the coding and design aspects, the student should feel comfortable writing up use case documentation to support the API and encourage its adoption.
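One recurring design task here is mapping R's naming conventions onto a Pythonic surface. The sketch below shows one way that mapping could work. It is a design illustration only: `r_functions` is a plain dict of Python callables standing in for an rpy2 handle to a loaded R package (so the example runs without an R installation), and the stub `read.tree` function is invented for the demonstration.

```python
# Sketch of a high-level wrapper pattern for exposing an R package to
# Python. In real code, `r_functions` would come from rpy2 (e.g. a
# handle to the loaded "ape" package); a plain dict stands in here so
# the design is visible without an R installation.

class RPackageWrapper:
    """Expose an R package's functions under Pythonic names."""

    def __init__(self, r_functions):
        self._r = r_functions  # name -> callable; R-side in real code

    def __getattr__(self, name):
        # Map Pythonic names (read_tree) to R names (read.tree),
        # since dots are not valid in Python identifiers.
        r_name = name.replace("_", ".")
        try:
            return self._r[r_name]
        except KeyError:
            raise AttributeError(name)

# Stand-in for ape's read.tree: here it just counts the leaf labels
# in a Newick string, purely to make the example self-contained.
fake_ape = RPackageWrapper({
    "read.tree": lambda text: text.count(",") + 1,
})
```

With this pattern, a Python user writes `fake_ape.read_tree("((a,b),c);")` rather than dealing with R's `read.tree` directly; the same idea scales to translating R argument names and return values into Python equivalents.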
Involved toolkits or projects
  • Rpy2
  • R phylogenetics and ecology libraries (ape, ade4, geiger, picante, mefa)
Degree of difficulty and needed skills 
Moderate. The project requires familiarity with coding in Python and R, and knowledge of phylogenetics or biogeography. The student has plenty of flexibility to define the project based on their biological interests (e.g. microarrays and heatmaps); there is also the possibility to venture further into data visualization once access to the analysis methods is in place. Tools such as GenGIS can give ideas about what is possible.
Mentors
Laurent Gautier, Brad Chapman, Peter Cock

Integration with a third-party structural biology application

Biopython is already a useful toolkit for computational structural biology, and Python is a popular language for scripting and configuring a number of separate molecular modelling and simulation tools. Support for controlling these external tools from within Biopython, however, is relatively sparse. This project addresses the issue, starting with a single application.
Select a stable, popular and well-supported third-party application for structural biology (see below) to support from within Biopython. Identify "pain points" that would occur when trying to control various workflows involving your application of choice from Biopython -- e.g. data formats Biopython doesn't yet support, or command-line programs with complex options. In a new Biopython module, write code that makes these common procedures easier to perform, without also making common errors easier to commit.
Most of the relevant third-party tools have quite extensive functionality, and the corresponding Biopython module should support essentially all of it (unless there's a good reason to skip a particular feature). Some simulation and modelling operations take a long time, too; it should be possible to launch these long-running processes and protect them from interruption.
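Launching a long-running external process so that it survives an interrupted terminal session could be sketched as below. This is a minimal illustration under the assumption of a POSIX system: the `launch_detached` helper is a name invented here, and the command shown is a trivial stand-in for a real modelling or simulation run.

```python
# Minimal sketch of launching a long-running external tool so it is
# protected from interruption (e.g. Ctrl-C in the parent's terminal).
# POSIX-only: start_new_session runs the child in its own session.
import subprocess
import sys

def launch_detached(cmd, log_path):
    """Start cmd in its own session, logging output; return the Popen."""
    log = open(log_path, "w")
    return subprocess.Popen(
        cmd,
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # detach from our process group/signals
    )

# Stand-in for a long modelling run: a trivial Python one-liner.
proc = launch_detached([sys.executable, "-c", "print('run complete')"],
                       "run.log")
proc.wait()
```

A fuller implementation would also record the child's PID so a later Biopython session can check on or collect results from a run started hours earlier.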
Involved toolkits or projects 
  • Biopython: Bio.PDB, and other modules as needed
  • An external application of your choice: Modeller, AutoDock, PyMol, MolProbity, ...
Degree of difficulty and needed skills 
Medium to hard. Simpler tasks include providing "glue" at each point of I/O; more challenging design problems arise in automating common parts of a simulation or modelling pipeline. Experience with the tool being wrapped is probably essential.
Mentors
Eric Talevich (looking for co-mentors)