Google Summer of Code

(Difference between revisions)
(2011 projects to past tense)
(Link to new SearchIO page)
(9 intermediate revisions by 4 users not shown)
Line 1: Line 1:
As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2011. This page contains a list of project ideas for the upcoming summer; potential GSoC students can base an application on any of these ideas, or propose something new.
+
As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2012. This page contains a list of project ideas for the upcoming summer; potential GSoC students can base an application on any of these ideas, or propose something new.
  
 
In 2009, Biopython was involved with GSoC in collaboration with our friends at [https://www.nescent.org/wg_phyloinformatics/Main_Page NESCent], and had two projects funded:
 
In 2009, Biopython was involved with GSoC in collaboration with our friends at [https://www.nescent.org/wg_phyloinformatics/Main_Page NESCent], and had two projects funded:
Line 10: Line 10:
 
* João Rodrigues [[GSOC2010_Joao|worked on the Structural Biology module Bio.PDB]], adding several features used in everyday structural bioinformatics. These features are now gradually being merged into the mainline with João's help.
 
* João Rodrigues [[GSOC2010_Joao|worked on the Structural Biology module Bio.PDB]], adding several features used in everyday structural bioinformatics. These features are now gradually being merged into the mainline with João's help.
  
In 2011, three projects were funded in Biopython via OBF:
+
In 2011, three projects were funded in Biopython via the OBF:
  
 
* [[User:Mtrellet|Mikael Trellet]] added [[GSoC2011_mtrellet|support for biomolecular interface analysis]] to the Bio.PDB module.
 
* [[User:Mtrellet|Mikael Trellet]] added [[GSoC2011_mtrellet|support for biomolecular interface analysis]] to the Bio.PDB module.
 
* Michele Silva wrote a [[GSOC2011_Mocapy|Python bridge for Mocapy++]] and linked it to Bio.PDB to enable statistical analysis of protein structures.
 
* Michele Silva wrote a [[GSOC2011_Mocapy|Python bridge for Mocapy++]] and linked it to Bio.PDB to enable statistical analysis of protein structures.
 
* Justinas Daugmaudis also enhanced Mocapy++ in a complementary way, developing a [[GSOC2011_MocapyExt|plugin system for Mocapy++]] allowing users to easily write new nodes (probability distribution functions) in Python.
 
* Justinas Daugmaudis also enhanced Mocapy++ in a complementary way, developing a [[GSOC2011_MocapyExt|plugin system for Mocapy++]] allowing users to easily write new nodes (probability distribution functions) in Python.
 +
 +
In 2012, two projects were funded in Biopython via the OBF:
 +
 +
* Wibowo Arindrarto: ''[[SearchIO]] Implementation in Biopython'' ([http://bow.web.id/blog/tag/gsoc/ blog])
 +
* Lenna Peterson: ''Diff My DNA: Development of a Genomic Variant Toolkit for Biopython'' ([http://arklenna.tumblr.com/tagged/gsoc2012 blog])
  
 
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the main [http://code.google.com/soc Google Summer of Code] page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the [http://biopython.org/wiki/Mailing_lists mailing list].
 
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the main [http://code.google.com/soc Google Summer of Code] page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the [http://biopython.org/wiki/Mailing_lists mailing list].
  
== 2011 Project ideas ==
+
== 2012 Project ideas ==
  
=== Mocapy++Biopython: from data to probabilistic models of biomolecules ===
+
=== SearchIO ===
  
; Rationale : [http://sourceforge.net/projects/mocapy/ Mocapy++] is a machine learning toolkit for training and using [http://en.wikipedia.org/wiki/Bayesian_network Bayesian networks]. Mocapy++ supports the use of [http://en.wikipedia.org/wiki/Directional_statistics directional statistics]; the statistics of angles, orientations and directions. This unique feature of Mocapy++ makes the toolkit especially suited for the formulation of probabilistic models of biomolecular structure. The toolkit has already been used to develop (published and peer reviewed) models of [http://www.pnas.org/content/105/26/8932.abstract?etoc protein] and [http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000406 RNA] structure in atomic detail. Mocapy++ is implemented in C++, and does not provide any Python bindings. The goal of this proposal is to develop an easy-to-use Python interface to Mocapy++, and to integrate this interface with the Biopython project. Through its [http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf Bio.PDB] module (initially implemented by the mentor of this proposal, [http://www.binf.ku.dk/research/structural_bioinformatics/ T. Hamelryck]), Biopython provides excellent functionality for data mining of biomolecular structure databases. Integrating Mocapy++ and Biopython would create strong synergy, as it would become quite easy to extract data from the databases, and subsequently use this data to train a probabilistic model. As such, it would provide a strong impulse to the field of protein structure prediction, design and simulation. Possible applications beyond bioinformatics are obvious, and include probabilistic models of human or animal movement, or any other application that involves directional data.  
+
; Rationale : Biopython has general APIs for parsing and writing assorted sequence file formats ([[SeqIO]]), multiple sequence alignments ([[AlignIO]]), phylogenetic trees ([[Phylo]]) and motifs (Bio.Motif). An obvious omission is something equivalent to [[bp:HOWTO:SearchIO|BioPerl's SearchIO]]. The goal of this proposal is to develop an easy-to-use Python interface in the same style as [[SeqIO]], [[AlignIO]], etc., but for pairwise search results. This would aim to cover EMBOSS needle & water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.
  
; Approach : Ideally, the student (or several students) would first gain some understanding of the theoretical background of the algorithms that are used in Mocapy++, such as parameter learning of Bayesian networks using [http://en.wikipedia.org/wiki/Expectation-maximization_algorithm Stochastic Expectation Maximization (S-EM)]. Next, the student would study some of the use cases of the toolkit, making use of some of the published articles that involve Mocapy++. After becoming familiar with the internals of Mocapy++, Python bindings will then be implemented using the [http://www.boost.org Boost C++ library]. Based on the use cases, the student would finally implement some example applications that involve data mining of biomolecular structure using Biopython, the subsequent formulation of probabilistic models using Python-Mocapy++, and its application to some biologically relevant problem. Schematically, the following steps are involved for the student: 
+
Much of the low-level parsing code to handle these file formats already exists in Biopython, and much as the [[SeqIO]] and [[AlignIO]] modules are linked and share code, similar links would apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object.
  
:* Gaining some understanding of S-EM and directional statistics
+
Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.
:* Study of Mocapy++ use cases
+
:* Study of Mocapy++ internals and code
+
:* Design of interface strategy
+
:* Implementing Python bindings using Boost
+
:* Example applications, involving Bio.PDB data mining
+
  
; Challenges : The project is highly interdisciplinary, and ideally requires skills in programming (C++, Python, wrapping C++ libraries in Python, Boost), machine learning, knowledge of biomolecular structure and statistics. The project could be extended (for example, by implementing additional functionality in Mocapy++) or limited (for example, by limiting the time spent on understanding the theory behind Mocapy++). The project would certainly benefit from several students with complementary skills.
+
; Challenges : The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.
  
 
; Involved toolkits or projects :
 
; Involved toolkits or projects :
  
:* [http://biopython.org/wiki/Main_Page Biopython]
+
:* Biopython
:* [http://sourceforge.net/projects/mocapy/ Mocapy++]
+
  
; Degree of difficulty and needed skills : Hard. The student needs to be fluent in C++, Python and the [http://www.boost.org C++ Boost library]. Experience with machine learning, Bayesian statistics and biomolecular structure would be clear advantages.  
+
; Degree of difficulty and needed skills : Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python. Experience with all of the command line tools listed would be a clear advantage, as would first-hand experience using [[bp:HOWTO:SearchIO|BioPerl's SearchIO]]. You will also need to know or learn the git version control system.
  
; Mentors : [http://www.binf.ku.dk/research/structural_bioinformatics/ Thomas Hamelryck]
+
; Mentors : Peter Cock
  
=== Variant representation, parser, generator, and coordinate converter ===
+
=== Representation and manipulation of genomic variants ===
  
; Rationale : Computational analysis of genomic variation requires the ability to reliably translate between human and computer representations of genomic variants. While several standards for human variation syntax have been proposed, community support is limited because of the technical complexity of the proposals and the lack of software libraries that implement them. The goal of this project is to initiate freely-available, language-neutral tools to parse, generate, and convert between representations of genomic variation.
+
; Rationale : Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within Biopython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.
  
; Approach and Goals :
+
; Approach and Goals
:* identify variation types to be represented (SNV, CNV, repeats, inversions, etc)
+
* Object representation
:* develop internal machine representation for variation types in Python, perhaps by implementing subclasses of BioPython's SeqFeature class.
+
** identify variation types to be represented (SNV, CNV, repeats, inversions, etc)
:* develop language-neutral grammar for the (reasonably) supportable subset of the Human Genome Variation Society nomeclature guidelines
+
** develop internal machine representation for variation types
:* write a Python library to convert between machine and human representations of variation (i.e., parsing and generating)  
+
** ensure coverage of essential standards, including HGVS, GFF, VCF
:* develop coordinate mapping between genomic, cDNA, and protein sequences (at least)
+
* External representations
:* release code to appropriate community efforts and write short manuscript
+
** write parsers and generators between objects and external string and file formats
:* as time permits:
+
* Manipulations
:** build Perl modules or Java libraries with identical functionality
+
** canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).
:** develop syntactic and semantic validation
+
** develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)
:** implement web service for coordinate conversion using NCBI Eutilities
+
* Other
:** develop a new variant syntax that is representation-complete
+
** release code to appropriate community efforts and write short manuscript
 +
** implement web service for HGVS conversion
  
; Challenges : The major challenge in this project is to design an API which cleanly separates internal representations of variation from the multiple external representations. For example, coordinate conversion per se does not require any sequence information, but validating a variant does. Ideally, the libraries developed in this project will provide low-level functionality of coordinate conversion and parsing, and high-level functionality for the most common use cases. This aim requires analyzing the proposals to determine which aspects may be impossible or difficult to represent with a simple grammar.  
+
; Challenges : The major challenge in this project is to design an API that separates internal representations of variation from the multiple external representations. Ideally, the libraries developed in this project will provide low-level functionality of coordinate conversion and parsing, and high-level functionality for the most common use cases. This aim requires analyzing the proposals to determine which aspects may be impossible or difficult to represent with a simple grammar.
  
; Involved toolkits or projects :
+
; Resources
:* BioPython
+
* [http://biopython.org BioPython]
:* Related: http://www.mutalyzer.nl/2.0/, http://www.hgvs.org/mutnomen/
+
* [https://github.com/jamescasbon/PyVCF PyVCF]
 +
* [http://www.cgat.org/~andreas/documentation/pysam/api.html#pysam.VCF pysam VCF support]
 +
* [http://biopython.org/wiki/GFF_Parsing Biopython GFF support]
 +
* [http://www.mutalyzer.nl/2.0/ HGVS "nomenclature"]
 +
* [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 VCF], [http://vcftools.sourceforge.net/ VCFtools]
 +
* [http://www.sequenceontology.org/gff3.shtml GFF3], [http://www.sequenceontology.org/resources/gvf.html GVF]
 +
* [http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit GATK]
  
; Degree of difficulty and needed skills : Easy-to-Medium depending on how many objectives are attempted. The student will need have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, BioPython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable.
+
; Degree of difficulty and needed skills : Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI E-utilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.
  
; Mentors : [http://linkedin.com/in/reece Reece Hart] ([http://locusdevelopmentinc.com Locus Development], San Francisco); [http://bcbio.wordpress.com Brad Chapman]
+
; Mentors: [http://linkedin.com/in/reece Reece Hart] ([http://locusdevelopmentinc.com Locus Development], San Francisco); [http://bcbio.wordpress.com Brad Chapman]; [http://casbon.me James Casbon]

Revision as of 11:01, 28 May 2012

As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2012. This page contains a list of project ideas for the upcoming summer; potential GSoC students can base an application on any of these ideas, or propose something new.

In 2009, Biopython was involved with GSoC in collaboration with our friends at NESCent, and had two projects funded:

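In 2010, another project was funded:

  • João Rodrigues worked on the Structural Biology module Bio.PDB, adding several features used in everyday structural bioinformatics. These features are now gradually being merged into the mainline with João's help.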

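In 2011, three projects were funded in Biopython via the OBF:

  • Mikael Trellet added support for biomolecular interface analysis to the Bio.PDB module.
  • Michele Silva wrote a Python bridge for Mocapy++ and linked it to Bio.PDB to enable statistical analysis of protein structures.
  • Justinas Daugmaudis also enhanced Mocapy++ in a complementary way, developing a plugin system for Mocapy++ allowing users to easily write new nodes (probability distribution functions) in Python.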

In 2012, two projects were funded in Biopython via the OBF:

  • Wibowo Arindrarto: SearchIO Implementation in Biopython (blog)
  • Lenna Peterson: Diff My DNA: Development of a Genomic Variant Toolkit for Biopython (blog)

Please read the GSoC page at the Open Bioinformatics Foundation and the main Google Summer of Code page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the mailing list.

2012 Project ideas

SearchIO

Rationale 
Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc., but for pairwise search results. This would aim to cover EMBOSS needle & water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.

Much of the low-level parsing code to handle these file formats already exists in Biopython, and much as the SeqIO and AlignIO modules are linked and share code, similar links would apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object.
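To make the idea of a search-result object hierarchy more concrete, here is a minimal sketch in plain Python. It is not Biopython code and not the design the project settled on: the class names `QueryResult`, `Hit` and `HSP`, the function `parse_blast_tab`, and the sample rows are all hypothetical. It assumes the default 12-column BLAST tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score) and shows how a result can exist without any alignment attached.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import Iterable, Iterator, List, Optional


@dataclass
class HSP:
    """One high-scoring pair; `alignment` stays None when the format has none."""
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int
    evalue: float
    bitscore: float
    alignment: Optional[object] = None  # tabular output carries no alignment


@dataclass
class Hit:
    """All HSPs between one query and one database sequence."""
    hit_id: str
    hsps: List[HSP] = field(default_factory=list)


@dataclass
class QueryResult:
    """All hits reported for one query sequence."""
    query_id: str
    hits: List[Hit] = field(default_factory=list)


def parse_blast_tab(lines: Iterable[str]) -> Iterator[QueryResult]:
    """Yield one QueryResult per query, reading rows lazily.

    Assumes rows for the same query (and the same hit) are consecutive.
    """
    rows = (
        line.rstrip("\r\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")
    )
    for query_id, query_rows in groupby(rows, key=lambda r: r[0]):
        result = QueryResult(query_id)
        for hit_id, hit_rows in groupby(query_rows, key=lambda r: r[1]):
            hit = Hit(hit_id)
            for r in hit_rows:
                hit.hsps.append(
                    HSP(int(r[6]), int(r[7]), int(r[8]), int(r[9]),
                        float(r[10]), float(r[11]))
                )
            result.hits.append(hit)
        yield result


# Made-up rows, for illustration only.
SAMPLE = (
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t370\n"
    "q1\ts1\t90.00\t50\t5\t0\t220\t269\t300\t349\t2e-10\t60.2\n"
    "q1\ts2\t85.00\t100\t15\t0\t1\t100\t5\t104\t3e-20\t95.1\n"
    "q2\ts3\t100.00\t30\t0\t0\t1\t30\t1\t30\t4e-08\t55.4\n"
)

if __name__ == "__main__":
    for result in parse_blast_tab(SAMPLE.splitlines()):
        for hit in result.hits:
            print(result.query_id, hit.hit_id, len(hit.hsps))
```

A parser for a format that does include alignments could fill the `alignment` slot, or return a subclass that also behaves like an alignment object, which is the inheritance question raised above.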

Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.
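One common way to get random access into a large text file is to record the byte offset at which each record starts and seek to it on demand. The sketch below applies that idea to tab-delimited search output. The function names `build_offset_index` and `fetch_rows` are hypothetical, the index lives in an in-memory dict, and rows for one query are assumed to be consecutive. It illustrates the technique only and does not reproduce how Bio.SeqIO.index is implemented.

```python
from typing import Dict, List


def build_offset_index(path: str) -> Dict[str, int]:
    """Map each query id to the byte offset of its first row."""
    index: Dict[str, int] = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b"#") or not line.strip():
                continue  # skip comment and blank lines
            query_id = line.split(b"\t", 1)[0].decode()
            index.setdefault(query_id, offset)  # keep the first offset only
    return index


def fetch_rows(path: str, index: Dict[str, int], query_id: str) -> List[List[str]]:
    """Seek straight to one query's rows and read until the id changes."""
    rows: List[List[str]] = []
    with open(path, "rb") as handle:
        handle.seek(index[query_id])
        for line in handle:
            fields = line.decode().rstrip("\r\n").split("\t")
            if fields[0] != query_id:
                break
            rows.append(fields)
    return rows
```

Only the ids and offsets are held in memory, so the cost of a lookup does not depend on where in the file the query sits. For datasets whose index itself is too large for memory, the same mapping could be stored on disk instead of in a dict.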

Challenges 
The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.
Involved toolkits or projects 
  • Biopython
Degree of difficulty and needed skills 
Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python. Experience with all of the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.
Mentors 
Peter Cock

Representation and manipulation of genomic variants

Rationale 
Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within Biopython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.
Approach and Goals
  • Object representation
    • identify variation types to be represented (SNV, CNV, repeats, inversions, etc)
    • develop internal machine representation for variation types
    • ensure coverage of essential standards, including HGVS, GFF, VCF
  • External representations
    • write parsers and generators between objects and external string and file formats
  • Manipulations
    • canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).
    • develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)
  • Other
    • release code to appropriate community efforts and write short manuscript
    • implement web service for HGVS conversion
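The canonicalization goal in the list above can be illustrated with left shifting of indels inside a repeat. The sketch below is one possible approach, not a settled design: the function names are hypothetical, coordinates are 0-based, a deletion is the half-open interval [start, end) on the reference, and an insertion is placed before reference index `pos`. A deletion can move one base to the left without changing the resulting sequence exactly when the base entering the interval equals the base leaving it; the insertion case rotates the inserted string in the same way.

```python
from typing import Tuple


def left_shift_deletion(ref: str, start: int, end: int) -> Tuple[int, int]:
    """Return the leftmost equivalent interval for deleting ref[start:end]."""
    if not 0 <= start < end <= len(ref):
        raise ValueError("deletion interval must be non-empty and inside ref")
    while start > 0 and ref[start - 1] == ref[end - 1]:
        start -= 1
        end -= 1
    return start, end


def left_shift_insertion(ref: str, pos: int, inserted: str) -> Tuple[int, str]:
    """Return the leftmost equivalent (position, inserted sequence) pair."""
    if not inserted or not 0 <= pos <= len(ref):
        raise ValueError("need a non-empty insertion at a position inside ref")
    while pos > 0 and ref[pos - 1] == inserted[-1]:
        inserted = inserted[-1] + inserted[:-1]  # rotate right by one base
        pos -= 1
    return pos, inserted


def apply_deletion(ref: str, start: int, end: int) -> str:
    """Sequence that results from deleting ref[start:end]."""
    return ref[:start] + ref[end:]


def apply_insertion(ref: str, pos: int, inserted: str) -> str:
    """Sequence that results from inserting before ref[pos]."""
    return ref[:pos] + inserted + ref[pos:]


if __name__ == "__main__":
    ref = "GCACACAT"  # made-up reference with a CA repeat
    print(left_shift_deletion(ref, 5, 7))
    print(left_shift_insertion("GCACAT", 5, "CA"))
```

Two descriptions of the same event then compare equal after normalization. Note that conventions differ between standards: VCF tooling commonly left-aligns, while the HGVS recommendations place a variant in a repeat at its most 3' position, so a toolkit would likely need both directions.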
Challenges 
The major challenge in this project is to design an API that separates internal representations of variation from the multiple external representations. Ideally, the libraries developed in this project will provide low-level functionality of coordinate conversion and parsing, and high-level functionality for the most common use cases. This aim requires analyzing the proposals to determine which aspects may be impossible or difficult to represent with a simple grammar.
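The separation described here can be sketched with a small internal object that knows nothing about string formats, plus free functions that translate to and from external forms. Everything below is illustrative and hypothetical: the `Substitution` class, the function names, the regular expression (which covers only simple single-base substitutions such as `c.76A>T`, a tiny subset of HGVS), and the accession `NM_000000.0`, which is a made-up placeholder.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Substitution:
    """Internal representation: plain fields, no parsing or formatting logic."""
    accession: str
    coord_type: str  # "g" genomic, "c" coding DNA, "n" non-coding, "m" mitochondrial
    position: int    # 1-based, as written in the HGVS-style string
    ref: str
    alt: str


# Only simple substitutions; anything else is rejected rather than guessed at.
_HGVS_SUB = re.compile(
    r"^(?P<acc>[^:\s]+):(?P<kind>[cgmn])\.(?P<pos>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_hgvs_substitution(text: str) -> Substitution:
    """External string -> internal object."""
    match = _HGVS_SUB.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple substitution: {text!r}")
    return Substitution(
        accession=match.group("acc"),
        coord_type=match.group("kind"),
        position=int(match.group("pos")),
        ref=match.group("ref"),
        alt=match.group("alt"),
    )


def format_hgvs(variant: Substitution) -> str:
    """Internal object -> HGVS-style string."""
    return (f"{variant.accession}:{variant.coord_type}."
            f"{variant.position}{variant.ref}>{variant.alt}")


def format_tab(variant: Substitution) -> str:
    """Internal object -> a second, tab-delimited external form."""
    return "\t".join([variant.accession, variant.coord_type,
                      str(variant.position), variant.ref, variant.alt])


if __name__ == "__main__":
    v = parse_hgvs_substitution("NM_000000.0:c.76A>T")
    print(format_hgvs(v))
    print(format_tab(v))
```

Because the object holds only data, adding another external format means adding another pair of functions and leaves the class untouched. Operations that need sequence context, such as validating that `ref` matches the reference, would sit in a separate layer.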
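Resources
  • BioPython
  • PyVCF
  • pysam VCF support
  • Biopython GFF support
  • HGVS "nomenclature"
  • VCF, VCFtools
  • GFF3, GVF
  • GATK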
Degree of difficulty and needed skills 
Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI E-utilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.
Mentors
Reece Hart (Locus Development, San Francisco); Brad Chapman; James Casbon