Google Summer of Code

== 2011 Project ideas ==
=== Biopython and PyCogent interoperability ===
; Rationale : [ PyCogent] and [ Biopython] are two widely used toolkits for performing computational biology and bioinformatics work in Python. The libraries have traditionally had different focuses, with Biopython concentrating on sequence parsing and retrieval and PyCogent on evolutionary and phylogenetic processing. Both user communities would benefit from increased interoperability between the code bases, easing the development of complex workflows.
; Approach : The student would focus on soliciting use case scenarios from developers and the larger communities associated with both projects, and use these as the basis for adding glue code and documentation to both libraries. Some use cases of immediate interest as a starting point are:
:* Allow round-trip conversion between Biopython and PyCogent core objects (sequence, alignment, tree, etc.).
:* Building workflows using Codon Usage analyses in PyCogent with clustering code in Biopython.
:* Connecting Biopython acquired sequences to PyCogent's alignment, phylogenetic tree preparation and tree visualization code.
:* Integrate Biopython's [ phyloXML support], developed during GSoC 2009, with PyCogent.
:* Develop a standardised controller architecture for interrogation of genome databases by extending PyCogent's Ensembl code, including export to Biopython objects.
; Challenges : This project provides the student with a lot of freedom to create useful interoperability between two feature rich libraries. As opposed to projects which might require churning out more lines of code, the major challenge here will be defining useful APIs and interfaces for existing code. High level inventiveness and coding skill will be required for generating glue code; we feel library integration is an extremely beneficial skill. We also value clear use case based documentation to support the new interfaces.
; Involved toolkits or projects :
:* [ Biopython]
:* [ PyCogent]
; Degree of difficulty and needed skills : Medium to Hard. At a minimum, the student will need to be highly competent in Python and become familiar with core objects in PyCogent and Biopython. Sub-projects will require additional expertise, for instance: familiarity with concepts in phylogenetics and genome biology; understanding SQL dialects.
; Mentors : [ Gavin Huttley], [ Rob Knight], [ Brad Chapman], [[User:EricTalevich|Eric Talevich]]
=== Accessing R phylogenetic tools from Python ===
; Rationale : The [ R statistical language] is a powerful open-source environment for statistical computation and visualization. [ Python] serves as an excellent complement to R since it has a wide variety of available libraries to make data processing, analysis, and web presentation easier. The two can be smoothly interfaced using [ Rpy2], allowing programmers to leverage the best features of each language. Here we propose to build Rpy2 library components to help ease access to phylogenetic and biogeographical libraries in R.
; Approach : Rpy2 contains higher level interfaces to popular R libraries. For instance, the [ ggplot2 interface] allows Python users to access powerful plotting functionality in R with an intuitive API. Providing similar high level APIs for biological toolkits available in R would help expose these toolkits to a wider audience of Python programmers. A nice introduction to phylogenetic analysis in R is available from Rich Glor at the [ Bodega Bay Marine Lab wiki]. Some examples of R libraries for which integration would be welcomed are:
:* [ ape (Analysis of Phylogenetics and Evolution)] -- an interactive library environment for phylogenetic and evolutionary analyses
:* [ ade4] -- Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods
:* [ geiger] -- Running macroevolutionary simulation, and estimating parameters related to diversification from comparative phylogenetic data.
:* [ picante] -- R tools for integrating phylogenies and ecology
:* [ mefa] -- multivariate data handling for ecological and biogeographical data
; Challenges : The student would have the opportunity to learn an available R toolkit, and then code in Python and R to make this available via an intuitive API. This will involve digging into the R code examples to discover the most useful parts for analysis, and then projecting this into a library that is intuitive to Python coders. Beyond the coding and design aspects, the student should feel comfortable writing up use case documentation to support the API and encourage its adoption.
; Involved toolkits or projects :
:* [ ape (Analysis of Phylogenetics and Evolution)]
:* [ Rpy2]
:* [ Biopython]
; Degree of difficulty and needed skills : Moderate. The project requires familiarity with coding in Python and R, and knowledge of phylogeny or biogeography. The student has plenty of flexibility to define the project based on their biological interests (e.g. [ microarrays and heatmaps]); there is also the possibility to venture far into data visualization once access to analysis methods is made. [ GenGIS] and can give ideas about what is possible.
; Mentors : [ Laurent Gautier], [ Brad Chapman], [ Peter Cock]

Revision as of 19:49, 18 March 2011

As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2010. We are supporting João Rodrigues in his project, "Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module."

In 2009, Biopython was involved with GSoC in collaboration with our friends at NESCent, and had two projects funded:

In 2010, another project was funded:

Please read the GSoC page at the Open Bioinformatics Foundation and the main Google Summer of Code page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the mailing list.

2011 Project ideas

Mocapy++Biopython: from data to probabilistic models of biomolecules

Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It supports directional statistics, that is, the statistics of angles, orientations and directions. This feature makes the toolkit especially well suited to formulating probabilistic models of biomolecular structure, and it has already been used to develop published, peer-reviewed models of protein and RNA structure in atomic detail. Mocapy++ is implemented in C++ and does not provide any Python bindings. The goal of this proposal is to develop an easy-to-use Python interface to Mocapy++ and to integrate this interface with the Biopython project. Through its Bio.PDB module (initially implemented by the mentor of this proposal, T. Hamelryck), Biopython provides excellent functionality for data mining of biomolecular structure databases. Integrating Mocapy++ and Biopython would create strong synergy: it would become easy to extract data from the databases and then use that data to train a probabilistic model. This would give a strong impetus to the field of protein structure prediction, design and simulation. Possible applications beyond bioinformatics include probabilistic models of human or animal movement, or any other application that involves directional data.
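To make the idea of directional statistics concrete, here is a minimal Python sketch showing why angles need their own statistics. It does not use Mocapy++ or Bio.PDB; the function names and the sample dihedral angles are made up for illustration. An ordinary arithmetic mean fails for angles that cluster around the ±180° boundary, while averaging unit vectors gives the expected direction.

```python
import math


def circular_mean(angles_deg):
    """Mean direction of angles in degrees, returned in the range [-180, 180]."""
    # Treat each angle as a unit vector and sum the vectors.
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))


def mean_resultant_length(angles_deg):
    """Length of the mean unit vector: near 1 means tightly clustered angles."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    return math.hypot(s, c)


# Hypothetical dihedral angles (degrees) clustered around the +/-180 boundary.
angles = [170.0, -175.0, 178.0, -179.0]

naive = sum(angles) / len(angles)       # arithmetic mean, misleading here
circ = circular_mean(angles)            # close to 178.5 degrees
spread = mean_resultant_length(angles)  # close to 1 for this tight cluster

print(f"arithmetic mean: {naive:.1f}")
print(f"circular mean:   {circ:.1f}")
print(f"resultant length: {spread:.3f}")
```

The arithmetic mean lands near 0°, on the opposite side of the circle from the data, which is the kind of error that directional distributions are designed to avoid.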
Ideally, the student (or several students) would first gain some understanding of the theoretical background of the algorithms used in Mocapy++, such as parameter learning of Bayesian networks using Stochastic Expectation Maximization (S-EM). Next, the student would study some of the toolkit's use cases, drawing on the published articles that involve Mocapy++. After becoming familiar with the internals of Mocapy++, the student would implement Python bindings using the Boost C++ library. Based on the use cases, the student would finally implement some example applications that involve data mining of biomolecular structure using Biopython, the subsequent formulation of probabilistic models using Python-Mocapy++, and their application to a biologically relevant problem. Schematically, the project involves the following steps:
  • Gaining some understanding of S-EM and directional statistics
  • Study of Mocapy++ use cases
  • Study of Mocapy++ internals and code
  • Design of interface strategy
  • Implementing Python bindings using Boost
  • Example applications, involving Bio.PDB data mining
The project is highly interdisciplinary and ideally requires skills in programming (C++, Python, wrapping C++ libraries in Python, Boost) and machine learning, together with knowledge of biomolecular structure and statistics. The project could be extended (for example, by implementing additional functionality in Mocapy++) or limited (for example, by limiting the time spent on understanding the theory behind Mocapy++). The project would certainly benefit from several students with complementary skills.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Hard. The student needs to be fluent in C++, Python and the C++ Boost library. Experience with machine learning, Bayesian statistics and biomolecular structure would be clear advantages.
Mentors 
Thomas Hamelryck

Variant representation, parser, generator, and coordinate converter

Computational analysis of genomic variation requires the ability to reliably translate between human and computer representations of genomic variants. While several standards for human variation syntax have been proposed, community support is limited because of the technical complexity of the proposals and the lack of software libraries that implement them. The goal of this project is to initiate freely available, language-neutral tools to parse, generate, and convert between representations of genomic variation.
Approach and Goals 
  • identify variation types to be represented (SNV, CNV, repeats, inversions, etc.)
  • develop internal machine representation for variation types in Python
  • develop language-neutral grammar for the (reasonably) supportable subset of the Human Genome Variation Society nomenclature guidelines
  • write a Python library to convert between machine and human representations of variation (i.e., parsing and generating)
  • develop coordinate mapping between genomic, cDNA, and protein sequences (at least)
  • release code to appropriate community efforts and write short manuscript
  • as time permits:
    • build Perl modules or Java libraries with identical functionality
    • develop syntactic and semantic validation
    • implement web service for coordinate conversion using NCBI Eutilities
    • develop a new variant syntax that is representation-complete
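As a rough illustration of the parsing and generating goals above, the following sketch round-trips one very small subset of the syntax: a single-nucleotide substitution such as `c.76A>T`. The class name, function name, regular expression and the accession `NM_000000.0` are all hypothetical and chosen only for this example; a real grammar would need to cover far more of the nomenclature.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Substitution:
    """Internal (machine) representation of a single-base substitution."""
    accession: str  # reference sequence identifier
    kind: str       # coordinate type: "g" (genomic) or "c" (coding DNA)
    position: int   # 1-based position in that coordinate system
    ref: str        # reference base
    alt: str        # variant base

    def to_hgvs(self):
        """Generate the human-readable form from the internal representation."""
        return f"{self.accession}:{self.kind}.{self.position}{self.ref}>{self.alt}"


# Deliberately narrow pattern: one accession, g. or c. coordinates, one base change.
_SUBSTITUTION = re.compile(
    r"^(?P<accession>[^:\s]+):(?P<kind>[gc])\."
    r"(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(text):
    """Parse the human-readable form into a Substitution, or raise ValueError."""
    m = _SUBSTITUTION.match(text)
    if m is None:
        raise ValueError(f"unsupported or malformed variant: {text!r}")
    return Substitution(
        m["accession"], m["kind"], int(m["position"]), m["ref"], m["alt"]
    )


# Made-up accession, used only to exercise the round trip.
variant = parse_substitution("NM_000000.0:c.76A>T")
print(variant)
print(variant.to_hgvs())  # prints NM_000000.0:c.76A>T
```

Keeping the parsed object separate from its string form is one way to let other external representations share the same internal model.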
The major challenge in this project is to design an API which cleanly separates internal representations of variation from the multiple external representations. For example, coordinate conversion per se does not require any sequence information, but validating a variant does. Ideally, the libraries developed in this project will provide low-level functionality of coordinate conversion and parsing, and high-level functionality for the most common use cases. This aim requires analyzing the proposals to determine which aspects may be impossible or difficult to represent with a simple grammar.
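The point that coordinate conversion needs no sequence information can be sketched with exon intervals alone. The example below is a simplification under stated assumptions: the transcript lies on the forward strand, exons are given as 1-based inclusive genomic intervals in transcript order, and transcript positions are counted from the first exon base (real c. coordinates are counted from the start codon instead). The function names and exon coordinates are hypothetical.

```python
def transcript_to_genomic(t_pos, exons):
    """Map a 1-based transcript position to a 1-based genomic position.

    exons: list of (start, end) genomic intervals, 1-based inclusive,
    in transcript order on the forward strand (an assumption of this sketch).
    """
    if t_pos < 1:
        raise ValueError("transcript positions are 1-based")
    remaining = t_pos
    for start, end in exons:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1
        remaining -= length  # skip this exon and the intron that follows it
    raise ValueError("position lies beyond the end of the transcript")


def genomic_to_transcript(g_pos, exons):
    """Inverse mapping; raises ValueError for intronic or outside positions."""
    offset = 0  # number of transcript bases in the exons already passed
    for start, end in exons:
        if start <= g_pos <= end:
            return offset + (g_pos - start) + 1
        offset += end - start + 1
    raise ValueError("genomic position is not inside any exon")


# Hypothetical two-exon transcript: 10 bases, an intron, then 5 bases.
exons = [(100, 109), (200, 204)]

print(transcript_to_genomic(10, exons))   # last base of exon 1: 109
print(transcript_to_genomic(11, exons))   # first base of exon 2: 200
print(genomic_to_transcript(204, exons))  # last transcript base: 15
```

Validating a variant against these coordinates, by contrast, would require fetching the reference sequence, which is why it helps to keep the two concerns in separate layers of the API.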
Involved toolkits or projects 
Degree of difficulty and needed skills 
Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable.
Mentors 
Reece Hart, Locus Development, San Francisco