
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://biopython.org/w/skins/common/feed.css?303"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
	<channel>
		<title>Biopython - User contributions [en]</title>
		<link>http://biopython.org/wiki/Special:Contributions/Joaor</link>
		<description>User contributions</description>
		<language>en</language>
		<generator>MediaWiki 1.18.1</generator>
		<lastBuildDate>Sun, 19 May 2013 14:51:08 GMT</lastBuildDate>
		<item>
			<title>GSOC</title>
			<link>http://biopython.org/wiki/GSOC</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC</guid>
			<description>&lt;p&gt;Joaor: Redirect to main GSOC page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Google_Summer_of_Code]]&lt;/div&gt;</description>
			<pubDate>Wed, 13 Mar 2013 13:21:34 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC</comments>		</item>
		<item>
			<title>Google Summer of Code</title>
			<link>http://biopython.org/wiki/Google_Summer_of_Code</link>
			<guid isPermaLink="false">http://biopython.org/wiki/Google_Summer_of_Code</guid>
			<description>&lt;p&gt;Joaor: Content update for GSOC 2013&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Open Bioinformatics Foundation successfully [http://www.open-bio.org/wiki/Google_Summer_of_Code applied to participate in the Google Summer of Code].&lt;br /&gt;
&lt;br /&gt;
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the [http://code.google.com/soc main Google Summer of Code page] for more details about the program.&lt;br /&gt;
&lt;br /&gt;
== Mentor List ==&lt;br /&gt;
&lt;br /&gt;
Usually, each Biopython proposal has one or more mentors assigned to it. Nevertheless, we encourage potential students and mentors to contact the [http://biopython.org/wiki/Mailing_lists mailing lists] with their own ideas for proposals. There is therefore no fixed list of 'available' mentors, since availability depends on which projects are proposed each year.&lt;br /&gt;
&lt;br /&gt;
Past mentors include:&lt;br /&gt;
&lt;br /&gt;
*  [http://casbon.me/ James Casbon]&lt;br /&gt;
*  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
*  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
*  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
*  [http://www.linkedin.com/in/reece Reece Hart]&lt;br /&gt;
*  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
*  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
== Proposals ==&lt;br /&gt;
=== 2013 ===&lt;br /&gt;
&lt;br /&gt;
The BioPython proposals for the Google Summer of Code 2013 will be published here once discussed. We encourage potential students and mentors to join the [http://biopython.org/wiki/Mailing_lists BioPython mailing lists] and actively participate in these discussions, either by submitting their own ideas or contributing to improving existing ones.&lt;br /&gt;
&lt;br /&gt;
== Past Proposals ==&lt;br /&gt;
&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
==== [http://biopython.org/wiki/SearchIO SearchIO] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc., but for pairwise search results. This would aim to cover EMBOSS needle &amp;amp; water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.&lt;br /&gt;
;  Approach&lt;br /&gt;
:  Much of the low-level parsing code to handle these file formats already exists in Biopython. Just as the SeqIO and AlignIO modules are linked and share code, similar links will apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object. Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.&lt;br /&gt;
; Challenges&lt;br /&gt;
: The project will cover a range of important file formats from major Bioinformatics tools, and will therefore require familiarity with running these tools and understanding their output and its meaning. Inter-converting file formats is part of this.&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python and have knowledge of the Biopython codebase. Experience with the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
&lt;br /&gt;
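As a rough illustration of the iterator-based parsing described in the SearchIO approach, the sketch below groups default 12-column BLAST tabular lines into one result object per query. The names Hit, QueryResult and parse_blast_tab are hypothetical and are not part of Biopython; only a few of the twelve columns are kept.&lt;br /&gt;

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical record types for illustration only (not Biopython classes).
Hit = namedtuple("Hit", "subject_id percent_identity evalue bitscore")
QueryResult = namedtuple("QueryResult", "query_id hits")


def parse_blast_tab(lines):
    """Yield one QueryResult per query from default BLAST tabular lines.

    Assumes the standard 12 tab-separated columns (qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore)
    and that all rows for a query are consecutive, as BLAST writes them.
    """
    rows = (
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")
    )
    # Consecutive rows sharing the same query ID form one result.
    for query_id, group in groupby(rows, key=lambda cols: cols[0]):
        hits = [Hit(c[1], float(c[2]), float(c[10]), float(c[11])) for c in group]
        yield QueryResult(query_id, hits)
```

Because the function is a generator, it can be fed an open file handle and consumed one query at a time, which keeps memory use flat for large result files. No alignment is stored here, matching the case where only tabular output is available.&lt;br /&gt;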
====  [http://arklenna.tumblr.com/tagged/gsoc2012 Representation and manipulation of genomic variants] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Object representation&lt;br /&gt;
** identify variation types to be represented (SNV, CNV, repeats, inversions, etc)&lt;br /&gt;
** develop internal machine representation for variation types&lt;br /&gt;
** ensure coverage of essential standards, including HGVS, GFF, VCF&lt;br /&gt;
* External representations&lt;br /&gt;
** write parsers and generators between objects and external string and file formats&lt;br /&gt;
* Manipulations&lt;br /&gt;
** canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).&lt;br /&gt;
** develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)&lt;br /&gt;
* Other&lt;br /&gt;
** release code to appropriate community efforts and write short manuscript&lt;br /&gt;
** implement web service for HGVS conversion&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI E-utilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://www.linkedin.com/in/reece Reece Hart] &lt;br /&gt;
:  [https://github.com/chapmanb Brad Chapman] &lt;br /&gt;
:  [http://casbon.me/ James Casbon]&lt;br /&gt;
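The canonicalization goal listed under Manipulations (ins versus dup, left shifting repeats) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: positions are 0-based, an insertion at pos places the new bases before ref[pos], and all function names are hypothetical.&lt;br /&gt;

```python
def apply_insertion(ref, pos, ins):
    """Return the sequence obtained by inserting ins before ref[pos]."""
    return ref[:pos] + ins + ref[pos:]


def left_shift_insertion(ref, pos, ins):
    """Move an insertion as far left as possible without changing the result.

    If the base just before the insertion point equals the last inserted
    base, the same edited sequence is produced by inserting a rotated copy
    one position earlier. Repeat until that no longer holds.
    """
    while pos and ins and ref[pos - 1] == ins[-1]:
        ins = ins[-1] + ins[:-1]
        pos -= 1
    return pos, ins


def is_duplication(ref, pos, ins):
    """True if the inserted bases repeat the bases directly before pos."""
    return bool(ins) and ref[:pos].endswith(ins)
```

For the reference GCACACAT, inserting CA before the final T is a duplication of the preceding CA, and left shifting moves the same event to just after the leading G. Both descriptions yield the same edited sequence, which is why a canonical form is needed before variants can be compared.&lt;br /&gt;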
=== 2011 ===&lt;br /&gt;
====  [http://biopython.org/wiki/GSoC2011_mtrellet Biomolecular Interface Analysis] ====&lt;br /&gt;
;  Student&lt;br /&gt;
: Mikael Trellet&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Analysis of protein-protein complex interfaces at the residue level yields significant information on the overall binding process. Such information can be broadly used, for example, in binding affinity studies, interface design, and enzymology. To tap into it, there is a need for tools that systematically and automatically analyze protein structures, or that provide means to this end. ProtorP (http://www.bioinformatics.sussex.ac.uk/protorp/) is an example of such a tool, and the high number of citations the server has received since its publication attests to its importance. However, being a web server, ProtorP is not suited for large-scale analysis and it leaves the community dependent on its maintainers to keep the service available. On the other hand, Biopython’s structural biology module, Bio.PDB, provides the ideal parsing machinery and programmatic structures for the development of an offline, open-source library for interface analysis. Such a library could be easily used in large-scale analysis of protein-protein interfaces, for example in the CAPRI experiment evaluation or in benchmark statistics. It would also be reasonable, if time permits, to extend this module to deal with protein-DNA or protein-RNA complexes, as Biopython already supports nucleic acids.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Add the new module backbone in current Bio.PDB code base&lt;br /&gt;
** Evaluate possible code reuse and call existing code from the new module&lt;br /&gt;
** Run simple calculations to check that the different modules (parsing, for example) and functions work together reliably&lt;br /&gt;
* Define a stable benchmark&lt;br /&gt;
** Select a few PDB files that differ in interface size and protein size&lt;br /&gt;
* Extend IUPAC.Data module with residue information&lt;br /&gt;
** Deduce residue weight from Atom objects instead of direct dictionary storage&lt;br /&gt;
** Polar/charge character (dictionary or influenced by pH)&lt;br /&gt;
** Hydrophobicity scale(s)&lt;br /&gt;
* Implement Extended Residue class as a subclass of Residue&lt;br /&gt;
* Implement Interface object and InterfaceAnalysis module&lt;br /&gt;
* Develop functions for interface analysis&lt;br /&gt;
** Calculation of interface polar character statistics (% of polar residues, apolar, etc)&lt;br /&gt;
** Calculation of BSA calling MSMS or HSA&lt;br /&gt;
** Calculation of SS element statistics in the interface through DSSP&lt;br /&gt;
** Unit tests and use of results as input for further calculations by other tools and scripts&lt;br /&gt;
* Develop functions for Interface comparison&lt;br /&gt;
* Code organization and final testing&lt;br /&gt;
&lt;br /&gt;
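The interface polar character statistics mentioned in the goals above could look roughly like the following sketch. It works on plain three-letter residue names rather than Bio.PDB objects, and the residue grouping is an assumption chosen for illustration; the function names are hypothetical.&lt;br /&gt;

```python
from collections import Counter

# Illustrative residue classes. This grouping is an assumption; other
# schemes (including pH-dependent ones) are equally defensible.
CHARGED = {"ASP", "GLU", "LYS", "ARG", "HIS"}
POLAR = {"SER", "THR", "ASN", "GLN", "TYR", "CYS"}


def residue_class(resname):
    """Classify a three-letter residue name as charged, polar or apolar."""
    name = resname.upper()
    if name in CHARGED:
        return "charged"
    if name in POLAR:
        return "polar"
    # Everything else is counted as apolar, so waters and ligands should be
    # filtered out before calling this function.
    return "apolar"


def interface_composition(resnames):
    """Return the percentage of interface residues in each class."""
    counts = Counter(residue_class(name) for name in resnames)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: 100.0 * n / total for cls, n in counts.items()}
```

In a Bio.PDB setting the input list would typically come from calling get_resname() on each residue selected as part of the interface.&lt;br /&gt;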
;  Difficulty and needed skills&lt;br /&gt;
:  Easy/Medium. Working knowledge of the Bio.PDB module of BioPython. Knowledge of structural biology in general and associated file formats (PDB).&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2011_Mocapy A Python bridge for Mocapy++] ====&lt;br /&gt;
;  Student&lt;br /&gt;
: Michele Silva&lt;br /&gt;
;  Rationale&lt;br /&gt;
: Discovering the structure of biomolecules is one of the biggest problems in biology. Given an amino acid or base sequence, what is the three-dimensional structure? One approach to biomolecular structure prediction is the construction of probabilistic models. A Bayesian network is a probabilistic model composed of a set of variables and their joint probability distribution, represented as a directed acyclic graph. A dynamic Bayesian network is a Bayesian network that represents sequences of variables. These sequences can be time series or sequences of symbols, such as protein sequences. Directional statistics is concerned mainly with observations which are unit vectors in the plane or in three-dimensional space. The sample space is typically a circle or a sphere, so special directional methods are needed that take the structure of the sample space into account. The union of graphical models and directional statistics allows the development of probabilistic models of biomolecular structures. Through the use of dynamic Bayesian networks with directional output it becomes possible to construct a joint probability distribution over sequence and structure. Biomolecular structures can be represented in a geometrically natural, continuous space. Mocapy++ is an open source toolkit for inference and learning using dynamic Bayesian networks that provides support for directional statistics. Mocapy++ is excellent for constructing probabilistic models of biomolecular structures; it has been used to develop models of protein and RNA structure in atomic detail. Mocapy++ has been used in several high-impact publications, and will form the core of the molecular modeling package Phaistos, which will be released soon. The goal of this project is to develop a highly useful Python interface to Mocapy++, and to integrate that interface with the Biopython project. Through the Bio.PDB module, Biopython provides excellent functionality for data mining biomolecular structure databases. 
Integrating Mocapy++ with Biopython will allow a probabilistic model to be trained using data extracted from a database. It will also create a powerful toolkit for researchers to quickly implement and test new ideas, try a variety of approaches and refine their methods, and it will provide strong support for the field of biomolecular structure prediction, design, and simulation.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It has been used to develop probabilistic models of biomolecular structures. The goal of this project is to develop a Python interface to Mocapy++ and integrate it with Biopython. This will allow the training of a probabilistic model using data extracted from a database. The integration of Mocapy++ with Biopython will provide a strong support for the field of protein structure prediction, design and simulation.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2011_MocapyExt MocapyExt] ====&lt;br /&gt;
; Student&lt;br /&gt;
: Justinas V. Daugmaudis&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython is a very popular library in Bioinformatics and Computational Biology. Mocapy++ is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs), which encode probabilistic relationships among random variables in a domain. Mocapy++ is freely available under the GNU General Public Licence (GPL) from SourceForge. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics, notably the Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven useful in formulating probabilistic models of protein and RNA structure. Such a highly useful and powerful library, which has been used with great success in projects such as TorusDBN, Basilisk, and FB5HMM, is the result of a long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. Currently, Mocapy++ has no provisions for dynamic loading of custom node types, so a mechanism for plugging in new node types without modifying and recompiling the library is of interest. Such a plug-in interface would assist rapid prototyping by allowing new probability distributions to be implemented and tested quickly, which, in turn, could substantially reduce development time and effort; the user would be empowered to extend Mocapy++ without modifications and subsequent recompilations. Recognizing this need, the project (herein referred to as MocapyEXT), which aims to improve the current Mocapy++ node type extension mechanism, has been proposed by T. Hamelryck.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: The MocapyEXT project is largely an engineering effort to bring a transparent Python plug-in interface to Mocapy++, where built-in and dynamically loaded node types could be used in a uniform manner. Also, externally implemented and dynamically loaded nodes could be modified by a user and these changes will not necessitate the recompilation of the client program, nor the accompanying Mocapy++ library. This will facilitate rapid prototyping, ease the adaptation of currently existing code, and improve the software interoperability whilst introducing minimal changes to the existing Mocapy++ interface, thus facilitating a smooth acceptance of the changes introduced by MocapyEXT.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2010_Joao Improving Bio.PDB] ====&lt;br /&gt;
; Student&lt;br /&gt;
: [http://nmr.chem.uu.nl/~joaor João Rodrigues]&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser and allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community. Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed edPDB or the more complete Biskit library render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation and structure checking services and software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
:  Diana Jaunzeikare&lt;br /&gt;
&lt;br /&gt;
=== 2009 ===&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/PhyloXML PhyloXML] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  PhyloXML is an XML format for phylogenetic trees, designed to allow storing information about the trees themselves (such as branch lengths and multiple support values) along with data such as taxonomic and genomic annotations. Connecting these pieces of evolutionary information in a standard format is key for comparative genomics.&lt;br /&gt;
A BioPerl driver for phyloXML was created during the 2008 Summer of Code; this project aims to build a similar module for the popular Biopython package.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
:  Christian Zmasek&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/BioGeography Biogeographical Phylogenetics for BioPython] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  I developed Bio.Geography, a new module for the bioinformatics programming toolkit Biopython. Bio.Geography expands upon Biopython's traditional capabilities for accessing gene and protein sequences from online databases by allowing automated searching, downloading, and parsing of geographic location records from GBIF, the authoritative aggregator of specimen information from natural history collections worldwide. This will enable analyses of evolutionary biogeography that require the areas inhabited by the species at the tips of the phylogeny, particularly for large-scale analyses where it is necessary to process thousands of specimen occurrence records. The module will also facilitate applications such as species mapping, niche modeling, error-checking of museum records, and monitoring range changes.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
:  Stephen Smith&lt;br /&gt;
:  David Kidd&lt;br /&gt;
&lt;br /&gt;
&lt;/div&gt;</description>
			<pubDate>Wed, 13 Mar 2013 13:20:59 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:Google_Summer_of_Code</comments>		</item>
		<item>
			<title>GSOC</title>
			<link>http://biopython.org/wiki/GSOC</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC</guid>
			<description>&lt;p&gt;Joaor: Added earlier projects (2010, 2009)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Open Bioinformatics Foundation successfully [http://www.open-bio.org/wiki/Google_Summer_of_Code applied to participate in the Google Summer of Code].&lt;br /&gt;
&lt;br /&gt;
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the [http://code.google.com/soc main Google Summer of Code page] for more details about the program.&lt;br /&gt;
&lt;br /&gt;
== Mentor List ==&lt;br /&gt;
&lt;br /&gt;
Usually, each Biopython proposal has one or more mentors assigned to it. Nevertheless, we encourage potential students and mentors to contact the [http://biopython.org/wiki/Mailing_lists mailing lists] with their own ideas for proposals. There is therefore no fixed list of 'available' mentors, since availability depends on which projects are proposed each year.&lt;br /&gt;
&lt;br /&gt;
Past mentors include:&lt;br /&gt;
&lt;br /&gt;
*  [http://casbon.me/ James Casbon]&lt;br /&gt;
*  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
*  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
*  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
*  [http://www.linkedin.com/in/reece Reece Hart]&lt;br /&gt;
*  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
*  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
== Proposals ==&lt;br /&gt;
=== 2013 ===&lt;br /&gt;
&lt;br /&gt;
The BioPython proposals for 2013 will be published here once discussed. We encourage potential students and mentors to join the [http://biopython.org/wiki/Mailing_lists mailing lists] and actively participate in these discussions, either by submitting their own ideas or contributing to improving existing ones.&lt;br /&gt;
&lt;br /&gt;
== Past Proposals ==&lt;br /&gt;
&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
==== [http://biopython.org/wiki/SearchIO SearchIO] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc., but for pairwise search results. This would aim to cover EMBOSS needle &amp;amp; water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.&lt;br /&gt;
;  Approach&lt;br /&gt;
:  Much of the low-level parsing code to handle these file formats already exists in Biopython. Just as the SeqIO and AlignIO modules are linked and share code, similar links will apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object. Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.&lt;br /&gt;
; Challenges&lt;br /&gt;
: The project will cover a range of important file formats from major Bioinformatics tools, and will therefore require familiarity with running these tools and understanding their output and its meaning. Inter-converting file formats is part of this.&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python and have knowledge of the Biopython codebase. Experience with the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
&lt;br /&gt;
====  [http://arklenna.tumblr.com/tagged/gsoc2012 Representation and manipulation of genomic variants] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Object representation&lt;br /&gt;
** identify variation types to be represented (SNV, CNV, repeats, inversions, etc)&lt;br /&gt;
** develop internal machine representation for variation types&lt;br /&gt;
** ensure coverage of essential standards, including HGVS, GFF, VCF&lt;br /&gt;
* External representations&lt;br /&gt;
** write parsers and generators between objects and external string and file formats&lt;br /&gt;
* Manipulations&lt;br /&gt;
** canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).&lt;br /&gt;
** develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)&lt;br /&gt;
* Other&lt;br /&gt;
** release code to appropriate community efforts and write short manuscript&lt;br /&gt;
** implement web service for HGVS conversion&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI E-utilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://www.linkedin.com/in/reece Reece Hart] &lt;br /&gt;
:  [https://github.com/chapmanb Brad Chapman] &lt;br /&gt;
:  [http://casbon.me/ James Casbon]&lt;br /&gt;
=== 2011 ===&lt;br /&gt;
====  [http://biopython.org/wiki/GSoC2011_mtrellet Biomolecular Interface Analysis] ====&lt;br /&gt;
;  Student&lt;br /&gt;
: Mikael Trellet&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Analysis of protein-protein complex interfaces at the residue level yields significant information on the overall binding process. Such information can be broadly used, for example, in binding affinity studies, interface design, and enzymology. To tap into it, there is a need for tools that systematically and automatically analyze protein structures, or that provide means to this end. ProtorP (http://www.bioinformatics.sussex.ac.uk/protorp/) is an example of such a tool, and the high number of citations the server has received since its publication attests to its importance. However, being a web server, ProtorP is not suited for large-scale analysis and it leaves the community dependent on its maintainers to keep the service available. On the other hand, Biopython’s structural biology module, Bio.PDB, provides the ideal parsing machinery and programmatic structures for the development of an offline, open-source library for interface analysis. Such a library could be easily used in large-scale analysis of protein-protein interfaces, for example in the CAPRI experiment evaluation or in benchmark statistics. It would also be reasonable, if time permits, to extend this module to deal with protein-DNA or protein-RNA complexes, as Biopython already supports nucleic acids.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Add the new module backbone in current Bio.PDB code base&lt;br /&gt;
** Evaluate possible code reuse and call existing code from the new module&lt;br /&gt;
** Run simple calculations to check that the different modules (parsing, for example) and functions work together reliably&lt;br /&gt;
* Define a stable benchmark&lt;br /&gt;
** Select a few PDB files that differ in interface size and protein size&lt;br /&gt;
* Extend IUPAC.Data module with residue information&lt;br /&gt;
** Deduce residue weight from Atom objects instead of direct dictionary storage&lt;br /&gt;
** Polar/charge character (dictionary or influenced by pH)&lt;br /&gt;
** Hydrophobicity scale(s)&lt;br /&gt;
* Implement Extended Residue class as a subclass of Residue&lt;br /&gt;
* Implement Interface object and InterfaceAnalysis module&lt;br /&gt;
* Develop functions for interface analysis&lt;br /&gt;
** Calculation of interface polar character statistics (% of polar residues, apolar, etc)&lt;br /&gt;
** Calculation of BSA calling MSMS or HSA&lt;br /&gt;
** Calculation of SS element statistics in the interface through DSSP&lt;br /&gt;
** Unit tests and use of results as input for further calculations by other tools and scripts&lt;br /&gt;
* Develop functions for Interface comparison&lt;br /&gt;
* Code organization and final testing&lt;br /&gt;
&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy/Medium. Working knowledge of the Bio.PDB module of BioPython. Knowledge of structural biology in general and associated file formats (PDB).&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2011_Mocapy A Python bridge for Mocapy++] ====&lt;br /&gt;
;  Student&lt;br /&gt;
: Michele Silva&lt;br /&gt;
;  Rationale&lt;br /&gt;
: Discovering the structure of biomolecules is one of the biggest problems in biology. Given an amino acid or base sequence, what is the three-dimensional structure? One approach to biomolecular structure prediction is the construction of probabilistic models. A Bayesian network is a probabilistic model composed of a set of variables and their joint probability distribution, represented as a directed acyclic graph. A dynamic Bayesian network is a Bayesian network that represents sequences of variables. These sequences can be time series or sequences of symbols, such as protein sequences. Directional statistics is concerned mainly with observations which are unit vectors in the plane or in three-dimensional space. The sample space is typically a circle or a sphere, so special directional methods are needed that take the structure of the sample space into account. The union of graphical models and directional statistics allows the development of probabilistic models of biomolecular structures. Through the use of dynamic Bayesian networks with directional output it becomes possible to construct a joint probability distribution over sequence and structure. Biomolecular structures can be represented in a geometrically natural, continuous space. Mocapy++ is an open source toolkit for inference and learning using dynamic Bayesian networks that provides support for directional statistics. Mocapy++ is excellent for constructing probabilistic models of biomolecular structures; it has been used to develop models of protein and RNA structure in atomic detail. Mocapy++ has been used in several high-impact publications, and will form the core of the molecular modeling package Phaistos, which will be released soon. The goal of this project is to develop a highly useful Python interface to Mocapy++, and to integrate that interface with the Biopython project. Through the Bio.PDB module, Biopython provides excellent functionality for data mining biomolecular structure databases. 
Integrating Mocapy++ with Biopython will allow a probabilistic model to be trained using data extracted from a database. It will also create a powerful toolkit for researchers to quickly implement and test new ideas, try a variety of approaches and refine their methods, and it will provide strong support for the field of biomolecular structure prediction, design, and simulation.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It has been used to develop probabilistic models of biomolecular structures. The goal of this project is to develop a Python interface to Mocapy++ and integrate it with Biopython. This will allow the training of a probabilistic model using data extracted from a database. The integration of Mocapy++ with Biopython will provide strong support for the field of protein structure prediction, design and simulation.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2011_MocapyExt MocapyExt] ====&lt;br /&gt;
; Student&lt;br /&gt;
: Justinas V. Daugmaudis&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython is a very popular library in Bioinformatics and Computational Biology. Mocapy++ is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs), which encode probabilistic relationships among random variables in a domain. Mocapy++ is freely available under the GNU General Public Licence (GPL) from SourceForge. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics, notably the Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven useful in formulating probabilistic models of protein and RNA structure. The library, which has been used with great success in projects such as TorusDBN, Basilisk, and FB5HMM, is the result of a long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. Mocapy++ currently has no provisions for dynamic loading of custom node types, so a mechanism for plugging in new node types without modifying and recompiling the library is of interest. Such a plug-in interface would assist rapid prototyping by making it possible to quickly implement and test new probability distributions, which, in turn, could substantially reduce development time and effort. Recognizing this need, T. Hamelryck proposed this project (herein referred to as MocapyEXT), which aims to improve the current Mocapy++ node type extension mechanism.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: The MocapyEXT project is largely an engineering effort to bring a transparent Python plug-in interface to Mocapy++, where built-in and dynamically loaded node types could be used in a uniform manner. Also, externally implemented and dynamically loaded nodes could be modified by a user and these changes will not necessitate the recompilation of the client program, nor the accompanying Mocapy++ library. This will facilitate rapid prototyping, ease the adaptation of currently existing code, and improve the software interoperability whilst introducing minimal changes to the existing Mocapy++ interface, thus facilitating a smooth acceptance of the changes introduced by MocapyEXT.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2010_Joao Improving Bio.PDB] ====&lt;br /&gt;
; Student&lt;br /&gt;
: [http://nmr.chem.uu.nl/~joaor João Rodrigues]&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser and allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community. Another aspect of Bio.PDB that can be improved is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed edPDB or the more complete Biskit library render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
:  Diana Jaunzeikare&lt;br /&gt;
&lt;br /&gt;
=== 2009 ===&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/PhyloXML PhyloXML] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  PhyloXML is an XML format for phylogenetic trees, designed to allow storing information about the trees themselves (such as branch lengths and multiple support values) along with data such as taxonomic and genomic annotations. Connecting these pieces of evolutionary information in a standard format is key for comparative genomics.&lt;br /&gt;
A BioPerl driver for phyloXML was created during the 2008 Summer of Code; this project aims to build a similar module for the popular Biopython package.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
:  Christian Zmasek&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/BioGeography Biogeographical Phylogenetics for BioPython] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  I developed Bio.Geography, a new module for the bioinformatics programming toolkit Biopython. Bio.Geography expands upon Biopython's traditional capabilities for accessing gene and protein sequences from online databases by allowing automated searching, downloading, and parsing of geographic location records from GBIF, the authoritative aggregator of specimen information from natural history collections worldwide. This will enable analyses of evolutionary biogeography that require the areas inhabited by the species at the tips of the phylogeny, particularly for large-scale analyses where it is necessary to process thousands of specimen occurrence records. The module will also facilitate applications such as species mapping, niche modeling, error-checking of museum records, and monitoring range changes.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
:  Stephen Smith&lt;br /&gt;
:  David Kidd&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
=== XXXX ===&lt;br /&gt;
====  Mock Proposal ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  aaa&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: zzz&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  yyy&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  xxx&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</description>
			<pubDate>Fri, 08 Mar 2013 22:57:49 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC</comments>		</item>
		<item>
			<title>GSOC</title>
			<link>http://biopython.org/wiki/GSOC</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC</guid>
			<description>&lt;p&gt;Joaor: /* MocapyExt */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Open Bioinformatics Foundation successfully [http://www.open-bio.org/wiki/Google_Summer_of_Code applied to participate in the Google Summer of Code].&lt;br /&gt;
&lt;br /&gt;
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the [http://code.google.com/soc main Google Summer of Code page] for more details about the program.&lt;br /&gt;
&lt;br /&gt;
== Mentor List ==&lt;br /&gt;
&lt;br /&gt;
Usually, each Biopython proposal has one or more mentors assigned to it. Nevertheless, we encourage potential students to contact the mailing list with their own ideas for proposals. There is therefore no set list of 'available' mentors, since it depends heavily on which projects are proposed each year.&lt;br /&gt;
&lt;br /&gt;
Past mentors include:&lt;br /&gt;
&lt;br /&gt;
*  [http://casbon.me/ James Casbon]&lt;br /&gt;
*  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
*  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
*  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
*  [http://www.linkedin.com/in/reece Reece Hart]&lt;br /&gt;
*  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
*  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
== Proposals ==&lt;br /&gt;
=== 2013 ===&lt;br /&gt;
&lt;br /&gt;
The Biopython proposals for 2013 will be published here once discussed. We encourage potential students to join the mailing lists and actively participate in these discussions, either by submitting their own ideas or by helping to improve existing ones.&lt;br /&gt;
&lt;br /&gt;
== Past Proposals ==&lt;br /&gt;
&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
==== [http://biopython.org/wiki/SearchIO SearchIO] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc., but for pairwise search results. This would aim to cover EMBOSS needle &amp;amp; water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.&lt;br /&gt;
;  Approach&lt;br /&gt;
:  Much of the low-level parsing code to handle these file formats already exists in Biopython, and just as the SeqIO and AlignIO modules are linked and share code, similar links would apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object. Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets. &lt;br /&gt;
; Challenges&lt;br /&gt;
: The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python and have knowledge of the Biopython codebase. Experience with the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
&lt;br /&gt;
====  [http://arklenna.tumblr.com/tagged/gsoc2012 Representation and manipulation of genomic variants] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Object representation&lt;br /&gt;
** identify variation types to be represented (SNV, CNV, repeats, inversions, etc)&lt;br /&gt;
** develop internal machine representation for variation types&lt;br /&gt;
** ensure coverage of essential standards, including HGVS, GFF, VCF&lt;br /&gt;
* External representations&lt;br /&gt;
** write parser and generators between objects and external string and file formats&lt;br /&gt;
* Manipulations&lt;br /&gt;
** canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).&lt;br /&gt;
** develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)&lt;br /&gt;
* Other&lt;br /&gt;
** release code to appropriate community efforts and write short manuscript&lt;br /&gt;
** implement web service for HGVS conversion&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI E-utilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://www.linkedin.com/in/reece Reece Hart] &lt;br /&gt;
:  [https://github.com/chapmanb Brad Chapman] &lt;br /&gt;
:  [http://casbon.me/ James Casbon]&lt;br /&gt;
=== 2011 ===&lt;br /&gt;
====  [http://biopython.org/wiki/GSoC2011_mtrellet Biomolecular Interface Analysis] ====&lt;br /&gt;
;  Student&lt;br /&gt;
: Mikael Trellet&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Analysis of the interfaces of protein-protein complexes at the residue level yields significant information on the overall binding process. Such information can be broadly used, for example, in binding affinity studies, interface design, and enzymology. To tap into it, there is a need for tools that systematically and automatically analyze protein structures, or that provide means to this end. PROTORP (http://www.bioinformatics.sussex.ac.uk/protorp/) is an example of such a tool, and the high number of citations the server has received since its publication attests to its importance. However, being a web server, PROTORP is not suited for large-scale analysis and it leaves the community dependent on its maintainers to keep the service available. On the other hand, Biopython’s structural biology module, Bio.PDB, provides the ideal parsing machinery and programmatic structures for the development of an offline, open-source library for interface analysis. Such a library could be easily used in large-scale analysis of protein-protein interfaces, for example in the CAPRI experiment evaluation or in benchmark statistics. It would also be reasonable, if time permits, to extend this module to deal with protein-DNA or protein-RNA complexes, as Biopython already supports nucleic acids.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Add the new module backbone in current Bio.PDB code base&lt;br /&gt;
** Evaluate possible code reuse and call it into the new module&lt;br /&gt;
** Try simple calculations to be sure that there is stability between the different modules (parsing for example) and functions&lt;br /&gt;
* Define a stable benchmark&lt;br /&gt;
** Select a few PDB files that differ in interface size and protein size&lt;br /&gt;
* Extend IUPAC.Data module with residue information&lt;br /&gt;
** Deduce residue weight from Atom instead of direct dictionary storage&lt;br /&gt;
** Polar/charge character (dictionary or influenced by pH)&lt;br /&gt;
** Hydrophobicity scale(s)&lt;br /&gt;
* Implement Extended Residue class as a subclass of Residue&lt;br /&gt;
* Implement Interface object and InterfaceAnalysis module&lt;br /&gt;
* Develop functions for interface analysis&lt;br /&gt;
** Calculation of interface polar character statistics (% of polar residues, apolar, etc)&lt;br /&gt;
** Calculation of BSA calling MSMS or HSA&lt;br /&gt;
** Calculation of SS element statistics in the interface through DSSP&lt;br /&gt;
** Unit tests and use of results as input for further calculations by other tools and scripts&lt;br /&gt;
* Develop functions for Interface comparison&lt;br /&gt;
* Code organization and final testing&lt;br /&gt;
&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy/Medium. Working knowledge of the Bio.PDB module of BioPython. Knowledge of structural biology in general and associated file formats (PDB).&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2011_Mocapy A Python bridge for Mocapy++] ====&lt;br /&gt;
;  Student&lt;br /&gt;
: Michele Silva&lt;br /&gt;
;  Rationale&lt;br /&gt;
: Discovering the structure of biomolecules is one of the biggest problems in biology. Given an amino acid or base sequence, what is the three-dimensional structure? One approach to biomolecular structure prediction is the construction of probabilistic models. A Bayesian network is a probabilistic model composed of a set of variables and their joint probability distribution, represented as a directed acyclic graph. A dynamic Bayesian network is a Bayesian network that represents sequences of variables. These sequences can be time series or sequences of symbols, such as protein sequences. Directional statistics is concerned mainly with observations that are unit vectors in the plane or in three-dimensional space, so the sample space is typically a circle or a sphere. Such data call for special directional methods that take the structure of the sample space into account. The union of graphical models and directional statistics allows the development of probabilistic models of biomolecular structures. Through the use of dynamic Bayesian networks with directional output, it becomes possible to construct a joint probability distribution over sequence and structure, with biomolecular structures represented in a geometrically natural, continuous space. Mocapy++ is an open source toolkit for inference and learning using dynamic Bayesian networks that provides support for directional statistics. Mocapy++ is well suited to constructing probabilistic models of biomolecular structures; it has been used to develop models of protein and RNA structure in atomic detail. Mocapy++ is used in several high-impact publications, and will form the core of the molecular modeling package Phaistos, which will be released soon. The goal of this project is to develop a Python interface to Mocapy++ and to integrate that interface with the Biopython project. Through the Bio.PDB module, Biopython provides excellent functionality for data mining biomolecular structure databases. 
Integrating Mocapy++ and Biopython will allow training a probabilistic model using data extracted from a database. It will also create a powerful toolkit for researchers to quickly implement and test new ideas, try a variety of approaches and refine their methods, and will provide strong support for the field of biomolecular structure prediction, design, and simulation.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It has been used to develop probabilistic models of biomolecular structures. The goal of this project is to develop a Python interface to Mocapy++ and integrate it with Biopython. This will allow the training of a probabilistic model using data extracted from a database. The integration of Mocapy++ with Biopython will provide strong support for the field of protein structure prediction, design and simulation.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2011_MocapyExt MocapyExt] ====&lt;br /&gt;
; Student&lt;br /&gt;
: Justinas V. Daugmaudis&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython is a very popular library in Bioinformatics and Computational Biology. Mocapy++ is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs), which encode probabilistic relationships among random variables in a domain. Mocapy++ is freely available under the GNU General Public Licence (GPL) from SourceForge. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics, notably the Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven useful in formulating probabilistic models of protein and RNA structure. The library, which has been used with great success in projects such as TorusDBN, Basilisk, and FB5HMM, is the result of a long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. Mocapy++ currently has no provisions for dynamic loading of custom node types, so a mechanism for plugging in new node types without modifying and recompiling the library is of interest. Such a plug-in interface would assist rapid prototyping by making it possible to quickly implement and test new probability distributions, which, in turn, could substantially reduce development time and effort. Recognizing this need, T. Hamelryck proposed this project (herein referred to as MocapyEXT), which aims to improve the current Mocapy++ node type extension mechanism.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: The MocapyEXT project is largely an engineering effort to bring a transparent Python plug-in interface to Mocapy++, where built-in and dynamically loaded node types could be used in a uniform manner. Also, externally implemented and dynamically loaded nodes could be modified by a user and these changes will not necessitate the recompilation of the client program, nor the accompanying Mocapy++ library. This will facilitate rapid prototyping, ease the adaptation of currently existing code, and improve the software interoperability whilst introducing minimal changes to the existing Mocapy++ interface, thus facilitating a smooth acceptance of the changes introduced by MocapyEXT.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
=== 2009 ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
=== XXXX ===&lt;br /&gt;
====  Mock Proposal ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  aaa&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: zzz&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  yyy&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  xxx&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</description>
			<pubDate>Fri, 08 Mar 2013 22:44:56 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC</comments>		</item>
		<item>
			<title>GSOC</title>
			<link>http://biopython.org/wiki/GSOC</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC</guid>
			<description>&lt;p&gt;Joaor: Added 2011 proposals&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Open Bioinformatics Foundation successfully [http://www.open-bio.org/wiki/Google_Summer_of_Code applied to participate in the Google Summer of Code].&lt;br /&gt;
&lt;br /&gt;
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the [http://code.google.com/soc main Google Summer of Code page] for more details about the program.&lt;br /&gt;
&lt;br /&gt;
== Mentor List ==&lt;br /&gt;
&lt;br /&gt;
Usually, each Biopython proposal has one or more mentors assigned to it. Nevertheless, we encourage potential students to contact the mailing list with their own ideas for proposals. There is therefore no set list of 'available' mentors, since it depends heavily on which projects are proposed each year.&lt;br /&gt;
&lt;br /&gt;
Past mentors include:&lt;br /&gt;
&lt;br /&gt;
*  [http://casbon.me/ James Casbon]&lt;br /&gt;
*  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
*  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
*  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
*  [http://www.linkedin.com/in/reece Reece Hart]&lt;br /&gt;
*  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
*  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
== Proposals ==&lt;br /&gt;
=== 2013 ===&lt;br /&gt;
&lt;br /&gt;
The Biopython proposals for 2013 will be published here once discussed. We encourage potential students to join the mailing lists and actively participate in these discussions, either by submitting their own ideas or by helping to improve existing ones.&lt;br /&gt;
&lt;br /&gt;
== Past Proposals ==&lt;br /&gt;
&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
==== [http://biopython.org/wiki/SearchIO SearchIO] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc., but for pairwise search results. This would aim to cover EMBOSS needle &amp;amp; water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.&lt;br /&gt;
;  Approach&lt;br /&gt;
:  Much of the low-level parsing code to handle these file formats already exists in Biopython, and just as the SeqIO and AlignIO modules are linked and share code, similar links would apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object. Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets. &lt;br /&gt;
; Challenges&lt;br /&gt;
: The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python and have knowledge of the Biopython codebase. Experience with the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
&lt;br /&gt;
====  [http://arklenna.tumblr.com/tagged/gsoc2012 Representation and manipulation of genomic variants] ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Object representation&lt;br /&gt;
** identify variation types to be represented (SNV, CNV, repeats, inversions, etc)&lt;br /&gt;
** develop internal machine representation for variation types&lt;br /&gt;
** ensure coverage of essential standards, including HGVS, GFF, VCF&lt;br /&gt;
* External representations&lt;br /&gt;
** write parser and generators between objects and external string and file formats&lt;br /&gt;
* Manipulations&lt;br /&gt;
** canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).&lt;br /&gt;
** develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)&lt;br /&gt;
* Other&lt;br /&gt;
** release code to appropriate community efforts and write short manuscript&lt;br /&gt;
** implement web service for HGVS conversion&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI E-utilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://www.linkedin.com/in/reece Reece Hart] &lt;br /&gt;
:  [https://github.com/chapmanb Brad Chapman] &lt;br /&gt;
:  [http://casbon.me/ James Casbon]&lt;br /&gt;
=== 2011 ===&lt;br /&gt;
====  [http://biopython.org/wiki/GSoC2011_mtrellet Biomolecular Interface Analysis] ====&lt;br /&gt;
;  Student&lt;br /&gt;
: Mikael Trellet&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Analysis of the interfaces of protein-protein complexes at the residue level yields significant information on the overall binding process. Such information can be broadly used, for example, in binding affinity studies, interface design, and enzymology. To tap into it, there is a need for tools that systematically and automatically analyze protein structures, or that provide means to this end. PROTORP (http://www.bioinformatics.sussex.ac.uk/protorp/) is an example of such a tool, and the high number of citations the server has received since its publication attests to its importance. However, being a web server, PROTORP is not suited for large-scale analysis and it leaves the community dependent on its maintainers to keep the service available. On the other hand, Biopython’s structural biology module, Bio.PDB, provides the ideal parsing machinery and programmatic structures for the development of an offline, open-source library for interface analysis. Such a library could be easily used in large-scale analysis of protein-protein interfaces, for example in the CAPRI experiment evaluation or in benchmark statistics. It would also be reasonable, if time permits, to extend this module to deal with protein-DNA or protein-RNA complexes, as Biopython already supports nucleic acids.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Add the new module backbone to the current Bio.PDB code base&lt;br /&gt;
** Evaluate which existing code can be reused and call it from the new module&lt;br /&gt;
** Run simple calculations to check that the different modules (parsing, for example) and functions work together reliably&lt;br /&gt;
* Define a stable benchmark&lt;br /&gt;
** Select a few PDB files that differ in interface size and protein size&lt;br /&gt;
* Extend IUPAC.Data module with residue information&lt;br /&gt;
** Deduce residue weights from Atom instead of direct dictionary storage&lt;br /&gt;
** Polar/charge character (dictionary or influenced by pH)&lt;br /&gt;
** Hydrophobicity scale(s)&lt;br /&gt;
* Implement Extended Residue class as a subclass of Residue&lt;br /&gt;
* Implement Interface object and InterfaceAnalysis module&lt;br /&gt;
* Develop functions for interface analysis&lt;br /&gt;
** Calculation of interface polar character statistics (% of polar residues, apolar, etc)&lt;br /&gt;
** Calculation of BSA calling MSMS or HSA&lt;br /&gt;
** Calculation of SS element statistics in the interface through DSSP&lt;br /&gt;
** Unit tests and use of results as input for further calculations by other tools and scripts&lt;br /&gt;
* Develop functions for Interface comparison&lt;br /&gt;
* Code organization and final testing&lt;br /&gt;
&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy/Medium. Working knowledge of the Bio.PDB module of BioPython. Knowledge of structural biology in general and associated file formats (PDB).&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2011_Mocapy A Python bridge for Mocapy++] ====&lt;br /&gt;
;  Student&lt;br /&gt;
: Michele Silva&lt;br /&gt;
;  Rationale&lt;br /&gt;
: Discovering the structure of biomolecules is one of the biggest problems in biology. Given an amino acid or base sequence, what is the three dimensional structure? One approach to biomolecular structure prediction is the construction of probabilistic models. A Bayesian network is a probabilistic model composed of a set of variables and their joint probability distribution, represented as a directed acyclic graph. A dynamic Bayesian network is a Bayesian network that represents sequences of variables. These sequences can be time-series or sequences of symbols, such as protein sequences. Directional statistics is concerned mainly with observations which are unit vectors in the plane or in three-dimensional space. The sample space is typically a circle or a sphere. There must be special directional methods which take into account the structure of the sample spaces. The union of graphical models and directional statistics allows the development of probabilistic models of biomolecular structures. Through the use of dynamic Bayesian networks with directional output it becomes possible to construct a joint probability distribution over sequence and structure. Biomolecular structures can be represented in a geometrically natural, continuous space. Mocapy++ is an open source toolkit for inference and learning using dynamic Bayesian networks that provides support for directional statistics. Mocapy++ is excellent for constructing probabilistic models of biomolecular structures; it has been used to develop models of protein and RNA structure in atomic detail. Mocapy++ is used in several high-impact publications, and will form the core of the molecular modeling package Phaistos, which will be released soon. The goal of this project is to develop a highly useful Python interface to Mocapy++, and to integrate that interface with the Biopython project. Through the Bio.PDB module, Biopython provides excellent functionality for data mining biomolecular structure databases. 
Integrating Mocapy++ and Biopython will allow training a probabilistic model using data extracted from a database. Integrating Mocapy++ with Biopython will create a powerful toolkit for researchers to quickly implement and test new ideas, try a variety of approaches and refine their methods. It will provide strong support for the field of biomolecular structure prediction, design, and simulation.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It has been used to develop probabilistic models of biomolecular structures. The goal of this project is to develop a Python interface to Mocapy++ and integrate it with Biopython. This will allow the training of a probabilistic model using data extracted from a database. The integration of Mocapy++ with Biopython will provide strong support for the field of protein structure prediction, design and simulation.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
&lt;br /&gt;
====  [http://biopython.org/wiki/GSOC2011_MocapyExt MocapyExt] ====&lt;br /&gt;
; Student&lt;br /&gt;
: Justinas V. Daugmaudis&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  BioPython is a very popular library in Bioinformatics and Computational Biology. Mocapy++ is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs), which encode probabilistic relationships among random variables in a domain. Mocapy++ is freely available under the GNU General Public Licence (GPL) from SourceForge. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics. These notably include the Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven useful in formulating probabilistic models of protein and RNA structure.&lt;br /&gt;
Such a highly useful and powerful library, which has been used with great success in projects such as TorusDBN, Basilisk and FB5HMM, is the result of a long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. As a result, Mocapy++ currently has no provisions for dynamic loading of custom node types, and a mechanism to plug in new node types without modifying and recompiling the library is of interest. Such a plug-in interface would assist rapid prototyping by allowing new probability distributions to be implemented and tested quickly, which, in turn, could substantially reduce development time and effort; the user would be empowered to extend Mocapy++ without modifications and subsequent recompilations. Recognizing this need, the project (herein referred to as MocapyEXT), with the aim of improving the current Mocapy++ node type extension mechanism, has been proposed by T. Hamelryck.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: The MocapyEXT project is largely an engineering effort to bring a transparent Python plug-in interface to Mocapy++, where built-in and dynamically loaded node types could be used in a uniform manner. Also, externally implemented and dynamically loaded nodes could be modified by a user and these changes will not necessitate the recompilation of the client program, nor the accompanying Mocapy++ library. This will facilitate rapid prototyping, ease the adaptation of currently existing code, and improve the software interoperability whilst introducing minimal changes to the existing Mocapy++ interface, thus facilitating a smooth acceptance of the changes introduced by MocapyEXT.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  [http://etal.myweb.uga.edu/ Eric Talevich] &lt;br /&gt;
:  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
=== 2009 ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
=== XXXX ===&lt;br /&gt;
====  Mock Proposal ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  aaa&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
: zzz&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  yyy&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  xxx&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</description>
			<pubDate>Fri, 08 Mar 2013 22:44:18 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC</comments>		</item>
		<item>
			<title>GSOC</title>
			<link>http://biopython.org/wiki/GSOC</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC</guid>
			<description>&lt;p&gt;Joaor: Reorganization. Added more mentors&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Open Bioinformatics foundation successfully [http://www.open-bio.org/wiki/Google_Summer_of_Code applied to participate in the Google Summer of Code].&lt;br /&gt;
&lt;br /&gt;
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the [http://code.google.com/soc main Google Summer of Code page] for more details about the program.&lt;br /&gt;
&lt;br /&gt;
== Mentor List ==&lt;br /&gt;
&lt;br /&gt;
Usually, each BioPython proposal has one or more mentors assigned to it. Nevertheless, we encourage potential students to contact the mailing list with their own ideas for proposals. There is therefore not a set list of 'available' mentors, since it highly depends on which projects are proposed every year.&lt;br /&gt;
&lt;br /&gt;
Past mentors include:&lt;br /&gt;
&lt;br /&gt;
*  [http://casbon.me/ James Casbon]&lt;br /&gt;
*  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
*  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
*  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
*  [http://www.linkedin.com/in/reece Reece Hart]&lt;br /&gt;
*  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
*  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
== Proposals ==&lt;br /&gt;
=== 2013 ===&lt;br /&gt;
&lt;br /&gt;
The BioPython proposals for 2013 will be published here once discussed. We encourage potential students to join the mailing lists and actively participate in these discussions, either by submitting their own ideas or contributing to improving existing ones.&lt;br /&gt;
&lt;br /&gt;
== Past Proposals ==&lt;br /&gt;
&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
==== SearchIO ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc but for pairwise search results. This would aim to cover EMBOSS muscle &amp;amp; water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.&lt;br /&gt;
;  Approach&lt;br /&gt;
:  Much of the low-level parsing code to handle these file formats already exists in Biopython, and much as the SeqIO and AlignIO modules are linked and share code, similar links apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object. Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets. &lt;br /&gt;
; Challenges&lt;br /&gt;
: The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python and have knowledge of the BioPython codebase. Experience with all of the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  Peter Cock&lt;br /&gt;
&lt;br /&gt;
====  Representation and manipulation of genomic variants ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Object representation&lt;br /&gt;
** identify variation types to be represented (SNV, CNV, repeats, inversions, etc)&lt;br /&gt;
** develop internal machine representation for variation types&lt;br /&gt;
** ensure coverage of essential standards, including HGVS, GFF, VCF&lt;br /&gt;
* External representations&lt;br /&gt;
** write parsers and generators between objects and external string and file formats&lt;br /&gt;
* Manipulations&lt;br /&gt;
** canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).&lt;br /&gt;
** develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)&lt;br /&gt;
* Other&lt;br /&gt;
** release code to appropriate community efforts and write short manuscript&lt;br /&gt;
** implement web service for HGVS conversion&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, BioPython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  Reece Hart (Locus Development, San Francisco); Brad Chapman; James Casbon&lt;br /&gt;
=== 2011 ===&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
=== 2009 ===&lt;/div&gt;</description>
			<pubDate>Fri, 08 Mar 2013 22:17:59 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC</comments>		</item>
		<item>
			<title>GSOC</title>
			<link>http://biopython.org/wiki/GSOC</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC</guid>
			<description>&lt;p&gt;Joaor: Page Creation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The Open Bioinformatics foundation successfully [http://www.open-bio.org/wiki/Google_Summer_of_Code applied to participate in the Google Summer of Code].&lt;br /&gt;
&lt;br /&gt;
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the [http://code.google.com/soc main Google Summer of Code page] for more details about the program.&lt;br /&gt;
&lt;br /&gt;
== Mentor List ==&lt;br /&gt;
&lt;br /&gt;
Usually, each BioPython proposal has one or more mentors assigned to it. Nevertheless, we encourage potential students to contact the mailing list with their own ideas for proposals. There is therefore not a set list of 'available' mentors, since it highly depends on which projects are proposed every year.&lt;br /&gt;
&lt;br /&gt;
Past mentors include:&lt;br /&gt;
&lt;br /&gt;
*  [https://github.com/chapmanb Brad Chapman]&lt;br /&gt;
*  [http://www.hutton.ac.uk/staff/peter-cock Peter Cock]&lt;br /&gt;
*  [http://wiki.binf.ku.dk/User:Thomas_Hamelryck Thomas Hamelryck]&lt;br /&gt;
*  [http://nmr.chem.uu.nl/~joao João Rodrigues] &lt;br /&gt;
*  [http://etal.myweb.uga.edu/ Eric Talevich]&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
=== Proposal 2013 ===&lt;br /&gt;
&lt;br /&gt;
==== Add project ideas ====&lt;br /&gt;
&lt;br /&gt;
The BioPython proposals for 2013 will be published here once discussed. We encourage potential students to join the mailing lists and actively participate in these discussions, either by submitting their own ideas or contributing to improving existing ones.&lt;br /&gt;
&lt;br /&gt;
=== Proposals 2012 ===&lt;br /&gt;
==== SearchIO ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc but for pairwise search results. This would aim to cover EMBOSS muscle &amp;amp; water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.&lt;br /&gt;
;  Approach&lt;br /&gt;
:  Much of the low-level parsing code to handle these file formats already exists in Biopython, and much as the SeqIO and AlignIO modules are linked and share code, similar links apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object. Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets. &lt;br /&gt;
; Challenges&lt;br /&gt;
: The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python and have knowledge of the BioPython codebase. Experience with all of the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  Peter Cock&lt;br /&gt;
&lt;br /&gt;
====  Representation and manipulation of genomic variants ====&lt;br /&gt;
;  Rationale&lt;br /&gt;
:  Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.&lt;br /&gt;
;  Approach &amp;amp; Goals&lt;br /&gt;
* Object representation&lt;br /&gt;
** identify variation types to be represented (SNV, CNV, repeats, inversions, etc)&lt;br /&gt;
** develop internal machine representation for variation types&lt;br /&gt;
** ensure coverage of essential standards, including HGVS, GFF, VCF&lt;br /&gt;
* External representations&lt;br /&gt;
** write parsers and generators between objects and external string and file formats&lt;br /&gt;
* Manipulations&lt;br /&gt;
** canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).&lt;br /&gt;
** develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)&lt;br /&gt;
* Other&lt;br /&gt;
** release code to appropriate community efforts and write short manuscript&lt;br /&gt;
** implement web service for HGVS conversion&lt;br /&gt;
;  Difficulty and needed skills&lt;br /&gt;
:  Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, BioPython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.&lt;br /&gt;
;  Mentors&lt;br /&gt;
:  Reece Hart (Locus Development, San Francisco); Brad Chapman; James Casbon&lt;br /&gt;
&lt;br /&gt;
== Past Projects ==&lt;br /&gt;
&lt;br /&gt;
=== 2011 ===&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
=== 2009 ===&lt;/div&gt;</description>
			<pubDate>Fri, 08 Mar 2013 22:13:02 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC</comments>		</item>
		<item>
			<title>Participants</title>
			<link>http://biopython.org/wiki/Participants</link>
			<guid isPermaLink="false">http://biopython.org/wiki/Participants</guid>
			<description>&lt;p&gt;Joaor: Added João's details&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: People are listed here alphabetically by surname.&lt;br /&gt;
This is only a partial listing, see also the [http://biopython.org/SRC/biopython/CONTRIB contributor listing] in the Biopython source code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Tiago Antao =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail 	          || [mailto:tiago@gmail.com tiagoantao@gmail.com]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation 	          || Liverpool School of Tropical Medicine&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location 	          || Liverpool, UK&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Almost all programming stuff&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Population genetics, Infectious diseases (malaria), domain specific languages (DSLs) and declarative programming, pharmacology&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Bio.PopGen&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://tiago.org&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Sebastian Bassi =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail 	          || [mailto:sbassi@genesdigitales.com sbassi@genesdigitales.com]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation 	          || Universidad Nacional de Quilmes&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location 	          || Balcarce, Buenos Aires, Argentina&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Bioinformatics and data manipulation&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || IT Manager Advanta Seeds in Balcarce Research Station&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || LCC and primer Tm calculation function&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.bioinformatica.info&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Jeffrey Chang =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || [mailto:jchang@smi.stanford.edu jchang@smi.stanford.edu]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || Postdoctoral Fellow, Duke University&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Durham, NC&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Eating spam&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Bioinformatics&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Co-Founder&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.jeffchang.com/&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Brad Chapman =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || Massachusetts General Hospital&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Boston, MA&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Docs, GenBank, [[BioSQL]]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://bcbio.wordpress.com/&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= [[User:Peter|Peter Cock]] =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || See my web page&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || The James Hutton Institute (formerly SCRI); previously MOAC Doctoral Training Centre, University of Warwick&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Dundee, Scotland, UK&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Bioinformatics, controlling R with rpy, ...&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Bacterial signalling, genomics, sequencing&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Sequence parsing including [[SeqIO|Bio.SeqIO]], [[AlignIO|Bio.AlignIO]], maintaining the [[BioSQL|BioSQL interface]], and documentation&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.hutton.ac.uk/staff/peter-cock and http://www.warwick.ac.uk/go/peter_cock/python/&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| github || http://github.com/peterjc&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Andrew Dalke =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || [mailto:dalke@dalkescientific.com dalke@dalkescientific.com]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || Dalke Scientific Software, LLC&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Santa Fe, NM&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Just about anything&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Large-scale usable systems for scientists&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Co-Founder, Seq, Martel, indexing, EUtils, patterns, parsing, ...&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.dalkescientific.com/&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= [[User:Mdehoon|Michiel de Hoon]] =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || See my web page&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || RIKEN Omics Science Center&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Yokohama&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || High-throughput data analysis &amp;amp; Scientific visualization&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || RNA Genomics&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Bio.Cluster; Bio.Entrez; Windows installer&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://bonsai.ims.u-tokyo.ac.jp/~mdehoon&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Iddo Friedberg =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || idoerg &amp;quot;at&amp;quot; gmail.com&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || University of California San Diego&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || La-Jolla, CA, USA&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Maintaining World Domination&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Structural Bioinformatics, metagenomics&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || SubsMat, FSSP, bits of Align, bits of the Manual, and a lot of silly questions to the lists&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://iddo-friedberg.org&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Christian Gunning =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || bioboy at uga dot edu&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || human, mountain&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Athens, GA&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || strings, as glue; also on laundry and dirty dishes&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || arabidopsis; Biological Sequence Analysis, Durbin et al.; Primer3; www.swig.org; R programming language and rpy.sourceforge.net&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions ||&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.botany.uga.edu/courses/bioinformatics/current/index.html&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Thomas Hamelryck =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || thamelry - binf ku dk&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || University of Copenhagen&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Copenhagen, Denmark&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Annoying FORTRAN programmers&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Structural bioinformatics&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Bio.PDB, KDTree, SVDSuperimposer&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.binf.ku.dk/users/thamelry&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Michael Hoffman =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || grouse at alumni period utexas period net&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || The University of Texas at Austin&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Austin, TX, USA&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Biopython!&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || RNA, Genome annotations&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Bio.GFF, Bio.DocSQL&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://spice.cc.utexas.edu/~grouse/&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Frank Kauff =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || fkauff at biologie uni-kl de&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || University of Kaiserslautern&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Kaiserslautern, Germany&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Phylogenetics and everything else&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Phylogenetics and all that's related, Fungi, Lichens, Cyanobacteria&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Phd, Ace, Nexus (mostly with C. Cox)&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.uni-kl.de/wcms/ag-kauff.html&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= [[User:lpritc|Leighton Pritchard]] =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || lpritc(squiggly symbol)scri ac uk&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || [http://www.scri.ac.uk/ Scottish Crop Research Institute]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Invergowrie, Scotland&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Generally when I have to explain to a computer exactly what I want it to do&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Comparative genomics; Systems Biology; Protein sequence-structure-function relationships; Plant host-pathogen interactions and genomics (heavy on the pathogens).&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || GenomeDiagram, bits and bobs&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.scri.ac.uk/staff/leightonpritchard&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= João Rodrigues =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail 	          || [mailto:anaryin@gmail.com anaryin@gmail.com]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation 	          || Bijvoet Center for Biomolecular Research, Utrecht University&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location 	          || Utrecht, NL&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || Pretty much all my (programming) tasks&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Structural Biology, Biophysics, Molecular Simulations, Protein Docking, Homology Modelling, etc.&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Bio.PDB (here and there)&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://nmr.chem.uu.nl/~joaor&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Bartek Wilczyński =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || bartek_AT_rezolwenta.eu.org&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || Institute of Mathematics, Polish Academy of Science&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Warsaw, Poland&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || most of his computations&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || mathematical models of gene regulation&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Bio.AlignAce&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://bartek.rezolwenta.eu.org&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Harry Zuzan =&lt;br /&gt;
{| border=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| E-mail                  || [mailto:iliketobicycle@yahoo.ca iliketobicycle@yahoo.ca]&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Affiliation             || Genome Quebec&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Location                || Montreal&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Uses Python for         || you name it&lt;br /&gt;
&lt;br /&gt;
|- &lt;br /&gt;
&lt;br /&gt;
| Work/Research Interests || Statistics applied to molecular biology and genetics&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Biopython Contributions || Affy package for Affymetrix data&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
| Relevant URL            || http://www.oligopython.org&lt;br /&gt;
&lt;br /&gt;
|}&lt;/div&gt;</description>
			<pubDate>Mon, 21 Nov 2011 11:02:42 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:Participants</comments>		</item>
		<item>
			<title>GSoC2011 mtrellet</title>
			<link>http://biopython.org/wiki/GSoC2011_mtrellet</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSoC2011_mtrellet</guid>
			<description>&lt;p&gt;Joaor: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Mtrellet|Mikael Trellet]] mikael.trellet@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: João Rodrigues&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Analysis of protein-protein complex interfaces at the residue level yields significant information on the overall binding process. Such information is broadly useful, for example in binding affinity studies, interface design, and enzymology. To tap into it, there is a need for tools that systematically and automatically analyze protein structures, or that provide the means to this end. PROTORP (http://www.bioinformatics.sussex.ac.uk/protorp/) is an example of such a tool, and the high number of citations the server has received since its publication attests to its importance. However, being a web server, PROTORP is not suited for large-scale analysis, and it leaves the community dependent on its maintainers to keep the service available. On the other hand, Biopython’s structural biology module, Bio.PDB, provides the ideal parsing machinery and programmatic structures for the development of an offline, open-source library for interface analysis. Such a library could easily be used in large-scale analyses of protein-protein interfaces, for example in the CAPRI experiment evaluation or in benchmark statistics. It would also be reasonable, if time permits, to extend this module to deal with protein-DNA or protein-RNA complexes, as Biopython already supports nucleic acids.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [23rd May - 31st May] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add the new module backbone in current Bio.PDB code base ====&lt;br /&gt;
&lt;br /&gt;
*Evaluate which existing code can be reused and call it from the new module&lt;br /&gt;
*Try simple calculations to make sure that the different modules (parsing, for example) and functions work together reliably&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Define a stable benchmark  ====&lt;br /&gt;
&lt;br /&gt;
*Select a few PDB files that differ in interface size and protein size&lt;br /&gt;
*Add some basic unit tests&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
=== Weeks 2-3 [1st - 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Extend IUPAC.Data module with residue information ====&lt;br /&gt;
&lt;br /&gt;
* Deduce residue weights from Atom objects instead of storing them directly in a dictionary&lt;br /&gt;
* Polar/charge character (dictionary or influenced by pH)&lt;br /&gt;
* Hydrophobicity scale(s)&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
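&lt;br /&gt;
As a rough illustration of deducing a residue weight from its atoms rather than from a stored per-residue table, the sketch below sums approximate atomic masses over the element symbols of a residue. The function name and the plain list of element symbols are assumptions made for this example, not part of Bio.PDB. Hydrogens are usually absent from crystal structures, so the value obtained is a heavy-atom mass.&lt;br /&gt;

```python
# Approximate standard atomic masses (in daltons) for elements common in proteins.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def residue_mass(elements):
    """Sum the atomic masses of a residue, given the element symbol of each atom.

    Hypothetical helper: `elements` is a plain list of element symbols,
    not a Bio.PDB Residue object.
    """
    return sum(ATOMIC_MASS[element.upper()] for element in elements)


# Heavy atoms of a glycine residue as listed in a PDB file: N, CA, C, O.
glycine_elements = ["N", "C", "C", "O"]
glycine_mass = residue_mass(glycine_elements)
print("%.3f" % glycine_mass)
```

An element missing from the table raises a KeyError here, which makes unexpected atoms visible instead of silently skipping them.&lt;br /&gt;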
&lt;br /&gt;
&lt;br /&gt;
=== Week 4 [14th - 21st June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement Extended Residue class as a subclass of Residue ====&lt;br /&gt;
&lt;br /&gt;
* Build Extended Residue on the fly or have it hard-coded (?)&lt;br /&gt;
* Allow regular operations on Residue to be performed seamlessly in Extended Residue (should come with inheritance)&lt;br /&gt;
* Unit tests on pdb files containing particular residues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 5-6-7 [22nd June - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement InterfaceAnalysis module ====&lt;br /&gt;
&lt;br /&gt;
* Develop Interface class as a subclass of Model&lt;br /&gt;
* Develop method to automatically extract Interface from parsed structure upon class instantiation&lt;br /&gt;
** e.g. I = Interface(Structure)&lt;br /&gt;
** Allow threshold for distance&lt;br /&gt;
** Allow chain pairs to ignore (to avoid intra-molecule contacts)&lt;br /&gt;
* Unit tests against results from scripts commonly used by scientists&lt;br /&gt;
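&lt;br /&gt;
The extraction step planned above (a distance threshold plus chain pairs to ignore) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: atoms are given as simple tuples rather than Bio.PDB objects, the function and parameter names are hypothetical, and the search is a brute-force loop over all atom pairs rather than a spatial index.&lt;br /&gt;

```python
import math


def find_interface(atoms, threshold=5.0, ignore_pairs=()):
    """Return residues that have an atom within `threshold` of another chain.

    `atoms` is a list of (chain_id, residue_number, (x, y, z)) tuples.
    `ignore_pairs` lists chain pairs whose contacts should be skipped.
    Both the data layout and the names are assumptions for this sketch.
    """
    ignored = set(frozenset(pair) for pair in ignore_pairs)
    interface = set()
    for i, (chain_a, res_a, xyz_a) in enumerate(atoms):
        for chain_b, res_b, xyz_b in atoms[i + 1:]:
            # Skip contacts within one chain and between ignored chain pairs.
            if chain_a == chain_b or frozenset((chain_a, chain_b)) in ignored:
                continue
            distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(xyz_a, xyz_b)))
            if threshold >= distance:
                interface.add((chain_a, res_a))
                interface.add((chain_b, res_b))
    return sorted(interface)


# Made-up coordinates, one atom per residue, for illustration only.
atoms = [
    ("A", 1, (0.0, 0.0, 0.0)),
    ("A", 2, (3.8, 0.0, 0.0)),
    ("B", 1, (0.0, 4.0, 0.0)),
    ("B", 2, (0.0, 20.0, 0.0)),
    ("C", 1, (0.0, -4.0, 0.0)),
]
all_contacts = find_interface(atoms)
without_ac = find_interface(atoms, ignore_pairs=[("A", "C")])
```

The all-pairs loop scales quadratically with the number of atoms, so a real implementation would more likely rely on a neighbour search structure.&lt;br /&gt;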
&lt;br /&gt;
&lt;br /&gt;
=== Mid term evaluation ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 7-8 [12th July - 25th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Develop functions for interface analysis ====&lt;br /&gt;
&lt;br /&gt;
* Calculation of interface polar character statistics (% of polar residues, apolar, etc)&lt;br /&gt;
* Calculation of BSA calling MSMS or HSA&lt;br /&gt;
* Calculation of SS element statistics in the interface through DSSP&lt;br /&gt;
* ...&lt;br /&gt;
* Unit tests and use of results as input for further calculations by other tools and scripts&lt;br /&gt;
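&lt;br /&gt;
The polar character statistics mentioned in this list could be computed along the following lines. The grouping of residues into charged, polar and apolar classes is an illustrative assumption (published classifications differ, for instance for HIS, CYS and GLY), and the function name is hypothetical.&lt;br /&gt;

```python
from collections import Counter

# Illustrative grouping only; published classifications differ in places.
RESIDUE_CLASS = {
    "ASP": "charged", "GLU": "charged", "LYS": "charged", "ARG": "charged",
    "SER": "polar", "THR": "polar", "ASN": "polar", "GLN": "polar",
    "TYR": "polar", "HIS": "polar", "CYS": "polar",
    "ALA": "apolar", "VAL": "apolar", "LEU": "apolar", "ILE": "apolar",
    "MET": "apolar", "PHE": "apolar", "TRP": "apolar", "PRO": "apolar",
    "GLY": "apolar",
}


def polar_character(resnames):
    """Return the percentage of residues in each class (hypothetical helper)."""
    counts = Counter(RESIDUE_CLASS.get(name.upper(), "other") for name in resnames)
    total = sum(counts.values())
    if total == 0:
        return {}
    return dict((cls, 100.0 * n / total) for cls, n in counts.items())


# Residue names of a made-up interface.
interface_residues = ["ASP", "LYS", "SER", "LEU", "ALA", "VAL", "PHE", "TYR"]
stats = polar_character(interface_residues)
```

Residue names that are not in the table are counted under "other", so non-standard residues stay visible in the statistics.&lt;br /&gt;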
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 9-10 [26th July - 8th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Develop functions for Interface comparison ====&lt;br /&gt;
&lt;br /&gt;
* Perhaps adapt current RMSD functions to allow usage of Interface Residues&lt;br /&gt;
* Otherwise, should be called through something like Ia.rmsd_to(Ib) where Ia and Ib are interface objects&lt;br /&gt;
* Calculation of iRMSD&lt;br /&gt;
* Calculation of FCC (Fraction of Common Contacts)&lt;br /&gt;
* Rough Identity and Similarity percentage&lt;br /&gt;
* ...&lt;br /&gt;
* Unit tests, comparison with specific tools such as ProFit&lt;br /&gt;
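&lt;br /&gt;
As an example of one comparison measure from this list, the Fraction of Common Contacts can be sketched as a set operation. The sketch assumes that each interface is represented as a collection of residue-residue contacts, each contact being a pair of (chain, residue number) tuples; this representation and the function name are assumptions for illustration.&lt;br /&gt;

```python
def fraction_common_contacts(contacts_a, contacts_b):
    """Fraction of the contacts of interface A that are also found in interface B."""
    contacts_a = set(contacts_a)
    if not contacts_a:
        return 0.0
    common = contacts_a.intersection(contacts_b)
    return len(common) / float(len(contacts_a))


# Made-up contacts: each one pairs a residue of chain A with a residue of chain B.
interface_a = [
    (("A", 10), ("B", 45)),
    (("A", 11), ("B", 45)),
    (("A", 12), ("B", 47)),
    (("A", 15), ("B", 50)),
]
interface_b = [
    (("A", 10), ("B", 45)),
    (("A", 11), ("B", 45)),
    (("A", 30), ("B", 70)),
]
fcc_ab = fraction_common_contacts(interface_a, interface_b)
fcc_ba = fraction_common_contacts(interface_b, interface_a)
```

Written this way the measure is not symmetric: the value for A against B differs from the value for B against A whenever the two interfaces contain a different number of contacts.&lt;br /&gt;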
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [9th July - 8th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Code organization and final testing  ====&lt;br /&gt;
&lt;br /&gt;
== Unit tests ==&lt;br /&gt;
&lt;br /&gt;
Unit tests will be performed throughout the project, so that only one larger test, gathering all the tests already written, is needed at the end.&lt;br /&gt;
&lt;br /&gt;
The aim will then be to optimize, where possible, the efficiency and speed of some parts of the code without changing the algorithms. Several days will be set aside to package the code and make sure that everything works together with Biopython.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
=== Implementation of Interface object backbone ===&lt;br /&gt;
&lt;br /&gt;
* Theory&lt;br /&gt;
&lt;br /&gt;
We began by looking for an easy way to add the Interface as a new level of the SMCRA scheme, the idea being a new scheme: SM-I-CRA. Unfortunately, the Interface object is not so simply defined as a child of Model and a parent of Chain. Indeed, an interface consists mainly of residues, and even of residue pairs. We want to keep the chain information, but we cannot keep the chains as they are currently defined, since that would lead to overlaps, duplication and incompatibilities between the chains of our model and the chains of our interface. Likewise, our attempt to link the creation of the interface to existing modules such as StructureBuilder and Model was not successful.&lt;br /&gt;
So we decided to simplify the concept somewhat by adding the Interface-related classes independently. Links will obviously exist between the different levels of SMCRA, but Interface is now considered a parallel entity, not fully integrated in the SMCRA scheme.&lt;br /&gt;
&lt;br /&gt;
* Coding&lt;br /&gt;
&lt;br /&gt;
Interface.py defines the Interface object, which inherits from Entity and has the following methods: '''__init__'''(self, id),  '''add'''(self, entity) and '''get_chains'''(self).&lt;br /&gt;
&lt;br /&gt;
The add method overrides the add method of Entity in order to provide an easy way to sort residues by their respective chains.&lt;br /&gt;
The get_chains method returns the chains involved in the interface defined by the Interface object.&lt;br /&gt;
&lt;br /&gt;
The second class created is InterfaceBuilder.py, which deals directly with building the interface (hard to guess..!)&lt;br /&gt;
It has the following methods: '''__init__'''(self, model, id=None, threshold=5.0, include_waters=False, *chains),  '''_unpack_chains'''(self, list_of_tuples),  '''get_interface'''(self),  '''_add_residue'''(self, residue),  '''_build_interface'''(self, model, id, threshold, include_waters=False, *chains)&lt;br /&gt;
&lt;br /&gt;
'''__init__''': To initialize an interface you need to provide the model for which you want to calculate the interface; that is the only mandatory argument.&lt;br /&gt;
&lt;br /&gt;
'''_unpack_chains''': Method used by __init__ to create self.chain_list, a variable read in many parts of the class. It transforms a list of tuples (given by the user) into a list of characters representing the chains that will be involved in the definition of the interface.&lt;br /&gt;
&lt;br /&gt;
'''get_interface''': Simply returns the interface&lt;br /&gt;
&lt;br /&gt;
'''_add_residue''': Allows the user to add specific residues to the interface&lt;br /&gt;
&lt;br /&gt;
'''_build_interface''': The machinery to build the interface; it uses NeighborSearch and Selection to define the interface depending on the arguments given by the user.&lt;br /&gt;
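&lt;br /&gt;
To make the chain handling described for _unpack_chains more concrete, here is a minimal standalone sketch of turning a list of chain-pair tuples into a list of chain identifiers. It is a guess at the behaviour from the description above, not the code in the repository; in particular, removing duplicates and sorting the result are assumptions.&lt;br /&gt;

```python
def unpack_chains(list_of_tuples):
    """Flatten chain-pair tuples into a sorted list of unique chain identifiers.

    Standalone illustration; the de-duplication and sorting are assumptions.
    """
    chain_ids = set()
    for pair in list_of_tuples:
        chain_ids.update(pair)
    return sorted(chain_ids)


# Chain pairs as a user might supply them.
chain_list = unpack_chains([("A", "B"), ("A", "C")])
print(chain_list)
```

With the pairs above, the resulting list contains each of the chains A, B and C once.&lt;br /&gt;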
&lt;br /&gt;
* Github repository&lt;br /&gt;
&lt;br /&gt;
[https://github.com/mtrellet/biopython/commit/4cfa4359d0f927609c076ed7b66f37add5aabdfb Interface.py]&lt;br /&gt;
[https://github.com/mtrellet/biopython/commit/194efe37ac8f88d688e0cf528f1fb896c8441866 InterfaceBuilder.py]&lt;/div&gt;</description>
			<pubDate>Wed, 08 Jun 2011 08:36:21 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSoC2011_mtrellet</comments>		</item>
		<item>
			<title>GSoC2011 mtrellet</title>
			<link>http://biopython.org/wiki/GSoC2011_mtrellet</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSoC2011_mtrellet</guid>
			<description>&lt;p&gt;Joaor: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Mtrellet|Mikael Trellet]] mikael.trellet@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: João Rodrigues&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Analysis of protein-protein complex interfaces at the residue level yields significant information on the overall binding process. Such information is broadly useful, for example in binding affinity studies, interface design, and enzymology. To tap into it, there is a need for tools that systematically and automatically analyze protein structures, or that provide the means to this end. PROTORP (http://www.bioinformatics.sussex.ac.uk/protorp/) is an example of such a tool, and the high number of citations the server has received since its publication attests to its importance. However, being a web server, PROTORP is not suited for large-scale analysis, and it leaves the community dependent on its maintainers to keep the service available. On the other hand, Biopython’s structural biology module, Bio.PDB, provides the ideal parsing machinery and programmatic structures for the development of an offline, open-source library for interface analysis. Such a library could easily be used in large-scale analyses of protein-protein interfaces, for example in the CAPRI experiment evaluation or in benchmark statistics. It would also be reasonable, if time permits, to extend this module to deal with protein-DNA or protein-RNA complexes, as Biopython already supports nucleic acids.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [23rd May - 31st May] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add the new module backbone in current Bio.PDB code base ====&lt;br /&gt;
&lt;br /&gt;
*Evaluate which existing code can be reused and call it from the new module&lt;br /&gt;
*Try simple calculations to make sure that the different modules (parsing, for example) and functions work together reliably&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Define a stable benchmark  ====&lt;br /&gt;
&lt;br /&gt;
*Select a few PDB files that differ in interface size and protein size&lt;br /&gt;
*Add some basic unit tests&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
=== Weeks 2-3 [1st - 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Extend IUPAC.Data module with residue information ====&lt;br /&gt;
&lt;br /&gt;
* Deduce residue weights from Atom objects instead of storing them directly in a dictionary&lt;br /&gt;
* Polar/charge character (dictionary or influenced by pH)&lt;br /&gt;
* Hydrophobicity scale(s)&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 4 [14th - 21st June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement Extended Residue class as a subclass of Residue ====&lt;br /&gt;
&lt;br /&gt;
* Build Extended Residue on the fly or have it hard-coded (?)&lt;br /&gt;
* Allow regular operations on Residue to be performed seamlessly in Extended Residue (should come with inheritance)&lt;br /&gt;
* Unit tests on pdb files containing particular residues&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 5-6-7 [22nd June - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement InterfaceAnalysis module ====&lt;br /&gt;
&lt;br /&gt;
* Develop Interface class as a subclass of Model&lt;br /&gt;
* Develop method to automatically extract Interface from parsed structure upon class instantiation&lt;br /&gt;
** e.g. I = Interface(Structure)&lt;br /&gt;
** Allow threshold for distance&lt;br /&gt;
** Allow chain pairs to ignore (to avoid intra-molecule contacts)&lt;br /&gt;
* Unit tests against results from scripts commonly used by scientists&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Mid term evaluation ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 7-8 [12th July - 25th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Develop functions for interface analysis ====&lt;br /&gt;
&lt;br /&gt;
* Calculation of interface polar character statistics (% of polar residues, apolar, etc)&lt;br /&gt;
* Calculation of BSA calling MSMS or HSA&lt;br /&gt;
* Calculation of SS element statistics in the interface through DSSP&lt;br /&gt;
* ...&lt;br /&gt;
* Unit tests and use of results as input for further calculations by other tools and scripts&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 9-10 [26th July - 8th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Develop functions for Interface comparison ====&lt;br /&gt;
&lt;br /&gt;
* Perhaps adapt current RMSD functions to allow usage of Interface Residues&lt;br /&gt;
* Otherwise, should be called through something like Ia.rmsd_to(Ib) where Ia and Ib are interface objects&lt;br /&gt;
* Calculation of iRMSD&lt;br /&gt;
* Calculation of FCC (Fraction of Common Contacts)&lt;br /&gt;
* Rough Identity and Similarity percentage&lt;br /&gt;
* ...&lt;br /&gt;
* Unit tests, comparison with specific tools such as ProFit&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [9th July - 8th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Code organization and final testing  ====&lt;br /&gt;
&lt;br /&gt;
== Unit tests ==&lt;br /&gt;
&lt;br /&gt;
Unit tests will be performed throughout the project, so that only one larger test, gathering all the tests already written, is needed at the end.&lt;br /&gt;
&lt;br /&gt;
The aim will then be to optimize, where possible, the efficiency and speed of some parts of the code without changing the algorithms. Several days will be set aside to package the code and make sure that everything works together with Biopython.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
=== Implementation of Interface object backbone ===&lt;br /&gt;
&lt;br /&gt;
* Theory&lt;br /&gt;
&lt;br /&gt;
We began by looking for an easy way to add the Interface as a new level of the SMCRA scheme, the idea being a new scheme: SM-I-CRA. Unfortunately, the Interface object is not so simply defined as a child of Model and a parent of Chain. Indeed, an interface consists mainly of residues, and even of residue pairs. We want to keep the chain information, but we cannot keep the chains as they are currently defined, since that would lead to overlaps, duplication and incompatibilities between the chains of our model and the chains of our interface. Likewise, our attempt to link the creation of the interface to existing modules such as StructureBuilder and Model was not successful.&lt;br /&gt;
So we decided to simplify the concept somewhat by adding the Interface-related classes independently. Links will obviously exist between the different levels of SMCRA, but Interface is now considered a parallel entity, not fully integrated in the SMCRA scheme.&lt;br /&gt;
&lt;br /&gt;
* Coding&lt;br /&gt;
&lt;br /&gt;
Interface.py defines the Interface object, which inherits from Entity and has the following methods: '''__init__'''(self, id),  '''add'''(self, entity) and '''get_chains'''(self).&lt;br /&gt;
&lt;br /&gt;
The add method overrides the add method of Entity in order to provide an easy way to sort residues by their respective chains.&lt;br /&gt;
The get_chains method returns the chains involved in the interface defined by the Interface object.&lt;br /&gt;
&lt;br /&gt;
The second class created is InterfaceBuilder.py, which deals directly with building the interface (hard to guess..!)&lt;br /&gt;
It has the following methods: '''__init__'''(self, model, id=None, threshold=5.0, include_waters=False, *chains),  '''_unpack_chains'''(self, list_of_tuples),  '''get_interface'''(self),  '''_add_residue'''(self, residue),  '''_build_interface'''(self, model, id, threshold, include_waters=False, *chains)&lt;br /&gt;
&lt;br /&gt;
'''__init__''': To initialize an interface you need to provide the model for which you want to calculate the interface; that is the only mandatory argument.&lt;br /&gt;
&lt;br /&gt;
'''_unpack_chains''': Method used by __init__ to create self.chain_list, a variable read in many parts of the class. It transforms a list of tuples (given by the user) into a list of characters representing the chains that will be involved in the definition of the interface.&lt;br /&gt;
&lt;br /&gt;
'''get_interface''': Simply returns the interface&lt;br /&gt;
&lt;br /&gt;
'''_add_residue''': Allows the user to add specific residues to the interface&lt;br /&gt;
&lt;br /&gt;
'''_build_interface''': The machinery to build the interface; it uses NeighborSearch and Selection to define the interface depending on the arguments given by the user.&lt;br /&gt;
&lt;br /&gt;
* Github repository&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[https://github.com/mtrellet/biopython/commit/194efe37ac8f88d688e0cf528f1fb896c8441866 InterfaceBuilder.py]&lt;/div&gt;</description>
			<pubDate>Wed, 08 Jun 2011 08:35:56 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSoC2011_mtrellet</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* CATH Dataset */ Corrected statistics for new lengths.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed. &lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ Github) | Warnings module replaced _handle_pdb_exception() &amp;amp;&amp;amp; Other minor changes&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. The garbage collection module - &amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt; - was necessary to avoid dead objects still in memory causing the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([l[17:26] for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
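&lt;br /&gt;
The script measures structure length by collecting the slice [17:26] of every ATOM line, which in the fixed-column PDB format covers the residue name, chain identifier and residue sequence number. The sketch below isolates that step so it can be tried on a few records. The atom_record helper is hypothetical and only builds the first 26 columns of a record for demonstration.&lt;br /&gt;

```python
def atom_record(serial, atom_name, res_name, chain_id, res_seq, record="ATOM"):
    """Build the first 26 columns of a PDB coordinate record (hypothetical helper)."""
    return "%-6s%5d %-4s %3s %1s%4d" % (
        record, serial, atom_name, res_name, chain_id, res_seq)


def count_residues(pdb_lines):
    """Count unique residues among ATOM records, as the benchmark script does."""
    residues = set()
    for line in pdb_lines:
        if line.startswith("ATOM"):
            # Columns 18-26: residue name, chain identifier, residue number.
            residues.add(line[17:26])
    return len(residues)


lines = [
    atom_record(1, "N", "GLY", "A", 1),
    atom_record(2, "CA", "GLY", "A", 1),
    atom_record(3, "N", "ALA", "A", 2),
    atom_record(4, "N", "GLY", "B", 1),
    atom_record(5, "O", "HOH", "A", 3, record="HETATM"),
]
n_residues = count_residues(lines)
```

Because the slice stops at column 26, the insertion code in column 27 is not part of the key, so residues that differ only by insertion code are counted once. HETATM records are skipped entirely.&lt;br /&gt;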
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 146 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3660            25.09&lt;br /&gt;
 100 &amp;lt;= x &amp;lt; 200        5296            44.67&lt;br /&gt;
 200 &amp;lt;= x &amp;lt; 500        2330            83.40&lt;br /&gt;
 500 &amp;lt;= x &amp;lt; 1000       43              177.10&lt;br /&gt;
 &amp;gt;= 1000               1               320.10&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
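&lt;br /&gt;
A per-length summary like the one above can be derived from the per-structure lengths and timings collected by the benchmarking script. The following sketch shows one way to do the grouping; the function name, the bin list and the sample numbers are assumptions for illustration and are not taken from the original analysis.&lt;br /&gt;

```python
# Length bins as (lower bound inclusive, upper bound exclusive); None means open-ended.
BINS = [(0, 100), (100, 200), (200, 500), (500, 1000), (1000, None)]


def summarize(lengths, times_ms):
    """Return one (lower, upper, count, mean_ms) row per length bin."""
    rows = []
    for lower, upper in BINS:
        selected = [t for n, t in zip(lengths, times_ms)
                    if n >= lower and (upper is None or upper > n)]
        mean = sum(selected) / len(selected) if selected else 0.0
        rows.append((lower, upper, len(selected), mean))
    return rows


# Made-up lengths (residues) and parsing times (ms), for illustration only.
sample_lengths = [50, 150, 150, 600, 1200]
sample_times = [20.0, 40.0, 60.0, 180.0, 320.0]
table = summarize(sample_lengths, sample_times)
for lower, upper, count, mean in table:
    print("%s-%s\t%d\t%.2f" % (lower, upper, count, mean))
```

Empty bins are reported with a count of zero and a mean of 0.0 rather than being dropped, which keeps the rows aligned between benchmark runs.&lt;br /&gt;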
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3660            32.55&lt;br /&gt;
 100 &amp;lt;= x &amp;lt; 200        5296            57.75&lt;br /&gt;
 200 &amp;lt;= x &amp;lt; 500        2330            107.54&lt;br /&gt;
 500 &amp;lt;= x &amp;lt; 1000       43              236.62&lt;br /&gt;
 &amp;gt;= 1000               1               486.602&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.38ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3660            33.21&lt;br /&gt;
 100 &amp;lt;= x &amp;lt; 200        5296            58.45&lt;br /&gt;
 200 &amp;lt;= x &amp;lt; 500        2330            108.76&lt;br /&gt;
 500 &amp;lt;= x &amp;lt; 1000       43              234.37&lt;br /&gt;
 &amp;gt;= 1000               1               797.583&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 589 residues&lt;br /&gt;
&lt;br /&gt;
Failed to parse 2 structures due to errors:&lt;br /&gt;
  1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN (still being discussed)&lt;br /&gt;
  2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent  (ms)&lt;br /&gt;
 &amp;lt; 100                 8402            410.270&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        12226           461.744&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        26493           182.027&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       15878           290.693&lt;br /&gt;
 &amp;gt;= 1000               9837            942.513&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 29516.480s  (~8.20h)&lt;br /&gt;
 Average Time per Structure: 405.246 ms/structure&lt;br /&gt;
 Average Structures per Second: 2.47 structures/s&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent  (ms)&lt;br /&gt;
 &amp;lt; 100                 8402            451.819&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        12226           505.933&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        26493           190.991&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       15878           304.453&lt;br /&gt;
 &amp;gt;= 1000               9837            980.047&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           405.246&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.57.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Mon, 16 May 2011 13:46:40 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* Biopython 1.57+ (In progress) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ GitHub) | Warnings module replaces _handle_pdb_exception(); other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. An explicit call to the garbage collection module (&amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt;) after each parse was necessary to keep dead objects from accumulating in memory and causing the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([l[17:26] for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 589 residues&lt;br /&gt;
&lt;br /&gt;
Failed to parse 2 structures due to errors:&lt;br /&gt;
  1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN (still being discussed)&lt;br /&gt;
  2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent  (ms)&lt;br /&gt;
 &amp;lt; 100                 8402            410.270&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        12226           461.744&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        26493           182.027&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       15878           290.693&lt;br /&gt;
 &amp;gt;= 1000               9837            942.513&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 29516.480s  (~8.20h)&lt;br /&gt;
 Average Time per Structure: 405.246 ms/structure&lt;br /&gt;
 Average Structures per Second: 2.47 structures/s&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent  (ms)&lt;br /&gt;
 &amp;lt; 100                 8402            451.819&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        12226           505.933&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        26493           190.991&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       15878           304.453&lt;br /&gt;
 &amp;gt;= 1000               9837            980.047&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           405.246&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.57.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Mon, 16 May 2011 13:20:40 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* PDB Dataset */ Added 1.57 results. Minor corrections&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ GitHub) | Warnings module replaces _handle_pdb_exception(); other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. An explicit call to the garbage collection module (&amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt;) after each parse was necessary to keep dead objects from accumulating in memory and causing the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([l[17:26] for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 589 residues&lt;br /&gt;
&lt;br /&gt;
Failed to parse 2 structures due to errors:&lt;br /&gt;
  1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN (still being discussed)&lt;br /&gt;
  2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent  (ms)&lt;br /&gt;
 &amp;lt; 100                 8402            410.270&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        12226           461.744&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        26493           182.027&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       15878           290.693&lt;br /&gt;
 &amp;gt;= 1000               9837            942.513&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 29516.480s  (~8.20h)&lt;br /&gt;
 Average Time per Structure: 405.246 ms/structure&lt;br /&gt;
 Average Structures per Second: 2.47 structures/s&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent  (ms)&lt;br /&gt;
 &amp;lt; 100                 8402            451.819&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        12226           505.933&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        26493           190.991&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       15878           304.453&lt;br /&gt;
 &amp;gt;= 1000               9837            980.047&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           405.246&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.57.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Mon, 16 May 2011 13:19:14 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* Biopython 1.49 */ Removed link to PNG (too much noise) and corrected statistics according to new lengths.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ GitHub) | Warnings module replaces _handle_pdb_exception(); other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. An explicit call to the garbage collection module (&amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt;) after each parse was necessary to keep dead objects from accumulating in memory and causing the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([l[17:26] for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent  (ms)&lt;br /&gt;
 &amp;lt; 100                 8402            410.270&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        12226           461.744&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        26493           182.027&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       15878           290.693&lt;br /&gt;
 &amp;gt;= 1000               9837            942.513&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Mon, 16 May 2011 13:17:12 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* Benchmarking Script */ Corrected residue length calculation line..&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ GitHub) | Warnings module replaces _handle_pdb_exception(); other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. An explicit call to the garbage collection module (&amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt;) after each parse was necessary to keep dead objects from accumulating in memory and causing the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([l[17:26] for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
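&lt;br /&gt;
The residue count in the script above keys each ATOM record on columns 18 to 26 of the PDB fixed-width format, which hold the residue name (18-20), the chain identifier (22), and the residue sequence number (23-26). As a minimal Python 3 sketch of why the wide slice matters, the snippet below compares it with a narrower slice, l[23:26], that covers only part of the residue number. The records are made up, and atom_line is a hypothetical helper for building them, not part of Bio.PDB.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch (Python 3): count unique residues from PDB ATOM records.
# atom_line is a hypothetical helper; the records are illustrative only.

def atom_line(serial, name, resname, chain, resseq):
    """Build a minimal fixed-width ATOM record (coordinates omitted)."""
    return "ATOM  %5d %-4s %3s %1s%4d" % (serial, name, resname, chain, resseq)

def count_residues(lines):
    """Count distinct (residue name, chain, residue number) keys."""
    return len({l[17:26] for l in lines if l.startswith("ATOM")})

def count_by_number_only(lines):
    """Narrow key: the last three digits of the residue number only."""
    return len({int(l[23:26]) for l in lines if l.startswith("ATOM")})

SAMPLE = [
    atom_line(1, "N", "ALA", "A", 1),
    atom_line(2, "CA", "ALA", "A", 1),   # same residue as the line above
    atom_line(3, "N", "GLY", "A", 2),
    atom_line(4, "N", "ALA", "B", 1),    # same number, different chain
    atom_line(5, "N", "SER", "A", 1001), # four-digit residue number
]

print("wide key  :", count_residues(SAMPLE))
print("narrow key:", count_by_number_only(SAMPLE))
```
&lt;br /&gt;
On these five made-up records the wide key finds four residues, while the narrow key finds two: it merges chains A and B and reads residue 1001 as residue 1.&lt;br /&gt;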
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]&lt;br /&gt;
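&lt;br /&gt;
The benchmark asks whether newer versions parse more slowly. The following is a minimal Python 3 sketch that uses only the CATH totals reported above and turns them into an average time per structure and a slowdown relative to Biopython 1.49. The function names average_ms and relative_slowdown are chosen here for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch (Python 3): compare the CATH totals reported above.

N_CATH = 11330  # structures in the CATH dataset

# Total parsing time in seconds, copied from the results above.
TOTALS = {
    "Biopython 1.49": 530.686,
    "Biopython 1.57+": 686.176,
    "Biopython PDB branch": 695.405,
}

def average_ms(total_seconds, n_structures):
    """Average parsing time per structure, in milliseconds."""
    return 1000.0 * total_seconds / n_structures

def relative_slowdown(total_seconds, baseline_seconds):
    """Extra time relative to the baseline, as a percentage."""
    return 100.0 * (total_seconds / baseline_seconds - 1.0)

baseline = TOTALS["Biopython 1.49"]
for version, seconds in TOTALS.items():
    print("%-22s %6.2f ms/structure  %+5.1f%% vs 1.49"
          % (version, average_ms(seconds, N_CATH),
             relative_slowdown(seconds, baseline)))
```
&lt;br /&gt;
With these totals, Biopython 1.57+ and the PDB branch take roughly 29% and 31% longer than Biopython 1.49 on the CATH set.&lt;br /&gt;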
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.png Plot of the full results]&lt;br /&gt;
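&lt;br /&gt;
The result tables group structures into length classes. The source does not show how that grouping was done, so the following Python 3 sketch is only one possible approach. It assumes per-structure residue counts and timings like the pdb_length and tps lists built by the benchmarking script, and uses made-up values here.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch (Python 3): average parsing time per length class.
import bisect
from collections import defaultdict

EDGES = [100, 200, 500, 1000]  # class boundaries used in the tables above
LABELS = ["under 100", "100 to 199", "200 to 499", "500 to 999", "1000 or more"]

def length_class(n_residues):
    """Map a residue count to the label of its length class."""
    return LABELS[bisect.bisect_right(EDGES, n_residues)]

def average_by_class(lengths, times_ms):
    """Group per-structure times by length class and average each group."""
    groups = defaultdict(list)
    for n, t in zip(lengths, times_ms):
        groups[length_class(n)].append(t)
    return {label: sum(ts) / len(ts) for label, ts in groups.items()}

# Made-up residue counts and timings, for illustration only.
lengths = [80, 150, 199, 200, 1200]
times_ms = [20.0, 40.0, 50.0, 90.0, 900.0]

for label, avg in average_by_class(lengths, times_ms).items():
    print("%-14s %8.2f ms" % (label, avg))
```
&lt;br /&gt;
bisect_right places a structure whose length equals a boundary in the upper class, which matches the half-open ranges in the tables (for example, 100 residues falls in the 100 to 199 class).&lt;br /&gt;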
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Mon, 16 May 2011 13:15:34 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* Biopython 1.49 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ GitHub) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. The garbage collection module (&amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt;) is called after each structure so that dead objects do not accumulate in memory and push the machine into swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Thu, 12 May 2011 18:07:38 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* Biopython 1.57+ */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ GitHub) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. The garbage collection module (&amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt;) is called after each structure so that dead objects do not accumulate in memory and push the machine into swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Thu, 12 May 2011 18:07:15 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* Biopython PDB Branch */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ GitHub) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. The garbage collection module (&amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt;) is called after each structure so that dead objects do not accumulate in memory and push the machine into swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Thu, 12 May 2011 18:06:42 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: /* Biopython 1.49 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ GitHub) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. The garbage collection module (&amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt;) was needed to free dead objects after each parse; left in memory, they caused the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
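&lt;br /&gt;
The per-length tables in the Results section can be derived from the per-structure timings that the script prints. A minimal sketch of that aggregation follows. It is not part of the original benchmark: the function name summarize_by_length and the bin labels are hypothetical, and the bin edges are copied from the tables on this page.&lt;br /&gt;
&lt;br /&gt;
```python
from bisect import bisect_right

# Bin edges (in residues), copied from the tables on this page.
EDGES = [100, 200, 500, 1000]
# Hypothetical labels, one per bin.
LABELS = ["0-99", "100-199", "200-499", "500-999", "1000+"]


def summarize_by_length(lengths, times_ms):
    """Return {label: (n_structures, mean_time_ms)} for each non-empty bin."""
    totals = [0.0] * len(LABELS)
    counts = [0] * len(LABELS)
    for length, t in zip(lengths, times_ms):
        # bisect_right puts a length of exactly 100 in the "100-199" bin.
        i = bisect_right(EDGES, length)
        totals[i] += t
        counts[i] += 1
    return {
        LABELS[i]: (counts[i], totals[i] / counts[i])
        for i in range(len(LABELS))
        if counts[i]
    }
```
&lt;br /&gt;
Called as summarize_by_length(pdb_length, tps) with the two lists built by the script, this would give the rows of one results table.&lt;br /&gt;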
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.38ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
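&lt;br /&gt;
The CATH averages can be compared directly to quantify the change in parsing speed between versions. The following is a small sketch, not part of the original benchmark: the helper name percent_slower is hypothetical, and the inputs are the values from the TOTAL rows of the three CATH tables.&lt;br /&gt;
&lt;br /&gt;
```python
def percent_slower(baseline_ms, candidate_ms):
    """Percentage by which candidate_ms exceeds baseline_ms."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# Average time per structure (ms) on the CATH dataset, from the TOTAL rows.
cath_ms = {"1.49": 46.84, "1.57+": 60.56, "PDB branch": 61.38}

for version in ("1.57+", "PDB branch"):
    change = percent_slower(cath_ms["1.49"], cath_ms[version])
    print("%s vs 1.49: %.1f%% slower" % (version, change))
```
&lt;br /&gt;
On these figures, most of the slowdown appears between 1.49 and 1.57+, and the PDB branch adds little on top of it.&lt;br /&gt;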
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython PDB Branch (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Thu, 12 May 2011 18:06:14 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython 1.58-] (May 2011 @ GitHub) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. The garbage collection module - &amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt; - was necessary to avoid a memory leak that caused the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.58- ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.38ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.58- (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Thu, 12 May 2011 14:04:46 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython 1.58-] (May 2011 @ GitHub) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. The garbage collection module - &amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt; - was necessary to avoid a memory leak that caused the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.58- ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.38ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython1.49.time Link to full results]&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.58- (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Thu, 12 May 2011 14:04:09 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython 1.58-] (May 2011 @ GitHub) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. The garbage collection module - &amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt; - was necessary to avoid a memory leak that caused the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.58- ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.58- (In progress) ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;/div&gt;</description>
			<pubDate>Thu, 12 May 2011 13:58:19 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>PDBParser</title>
			<link>http://biopython.org/wiki/PDBParser</link>
			<guid isPermaLink="false">http://biopython.org/wiki/PDBParser</guid>
			<description>&lt;p&gt;Joaor: Creation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This is a draft page for the PDBParser class.&lt;br /&gt;
&lt;br /&gt;
= Benchmark =&lt;br /&gt;
&lt;br /&gt;
A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.&lt;br /&gt;
&lt;br /&gt;
== Datasets ==&lt;br /&gt;
&lt;br /&gt;
[http://release.cathdb.info/v3.4.0/CathDomainList CATH Domain Collection] - 11330 Structures containing only coordinate information (no Element assigned).&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/ Protein Data Bank Collection] - 72836 Structures containing both headers and coordinate information.&lt;br /&gt;
&lt;br /&gt;
== Versions Tested ==&lt;br /&gt;
&lt;br /&gt;
[http://biopython.org/DIST/biopython-1.49.zip Biopython 1.49]  (Nov. 2008)&lt;br /&gt;
&lt;br /&gt;
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.&lt;br /&gt;
&lt;br /&gt;
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython 1.58-] (May 2011 @ Github) | Warnings module replaced _handle_pdb_exception() and other minor changes.&lt;br /&gt;
&lt;br /&gt;
== Benchmarking Script ==&lt;br /&gt;
&lt;br /&gt;
The following script was used to benchmark the parser. An explicit call to the garbage collection module - &amp;lt;b&amp;gt;gc&amp;lt;/b&amp;gt; - after each parsed structure was necessary to avoid a memory leak that caused the machine to start swapping.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/env python&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot; Script to Benchmark Bio.PDB PDBParser &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
import sys, os, warnings&lt;br /&gt;
&lt;br /&gt;
# Parsing Function&lt;br /&gt;
def parse_structure(path):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Parses a PDB file &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
    &lt;br /&gt;
    s = P.get_structure('test', path)&lt;br /&gt;
&lt;br /&gt;
    return 0&lt;br /&gt;
&lt;br /&gt;
def fancy_output(tps):&lt;br /&gt;
    &amp;quot;&amp;quot;&amp;quot; Outputs the results in a nicer way &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
    print &amp;quot;# Bio.PDB PDBParser Benchmark&amp;quot;&lt;br /&gt;
    print&lt;br /&gt;
    print &amp;quot;Structure \tLength \tTime Spent (ms)&amp;quot;&lt;br /&gt;
    for i,s in enumerate(tps):&lt;br /&gt;
        print &amp;quot; %s\t (%s) \t%3.3f&amp;quot; %(os.path.basename(pdb_library[i]), pdb_length[i], s)&lt;br /&gt;
    print &lt;br /&gt;
    print &amp;quot;Total time spent: %5.3fs&amp;quot; %(sum(tps)/1000)&lt;br /&gt;
    print &amp;quot;Average time per structure: %5.3fms&amp;quot; %(sum(tps)/len(tps))&lt;br /&gt;
&lt;br /&gt;
if __name__=='__main__':&lt;br /&gt;
&lt;br /&gt;
    import time, gc&lt;br /&gt;
    from Bio.PDB import PDBParser&lt;br /&gt;
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).&lt;br /&gt;
   &lt;br /&gt;
    library_path = sys.argv[1]&lt;br /&gt;
&lt;br /&gt;
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]&lt;br /&gt;
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues&lt;br /&gt;
    sys.stderr.write(&amp;quot;Loaded %s structures (Average Length: %4.3f residues)\n&amp;quot; %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))&lt;br /&gt;
&lt;br /&gt;
    tps = []&lt;br /&gt;
    # Run the Test&lt;br /&gt;
    for i, pdb_file in enumerate(pdb_library):    &lt;br /&gt;
        sys.stderr.write( &amp;quot;[%s] %i Structure(s) Parsed \n&amp;quot; %(os.path.basename(pdb_file), i+1) )&lt;br /&gt;
        a = time.time()&lt;br /&gt;
        parse_structure(pdb_file)&lt;br /&gt;
        b = time.time()-a&lt;br /&gt;
        tps.append(b*1000)&lt;br /&gt;
        gc.collect()&lt;br /&gt;
    # Output Results&lt;br /&gt;
    fancy_output(tps)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
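&lt;br /&gt;
The per-length tables in the Results section can be derived from the per-structure lengths and timings that the script collects. The sketch below (Python 3) is not part of the original benchmark: the function name summarise_by_length and the sample values are assumptions for illustration, while the bin edges (100, 200, 500 and 1000 residues) are the ones used in the tables.&lt;br /&gt;
&lt;br /&gt;
```python
from bisect import bisect_right

# Bin edges (in residues) taken from the result tables on this page.
EDGES = [100, 200, 500, 1000]
LABELS = ["under 100", "100 to 199", "200 to 499", "500 to 999", "1000 or more"]


def summarise_by_length(lengths, times_ms):
    """Return {label: (count, mean_ms)} for each non-empty length bin."""
    grouped = {}
    for length, elapsed in zip(lengths, times_ms):
        # bisect_right maps a length to the index of its bin:
        # 99 falls in bin 0, 100 in bin 1, 1000 in bin 4.
        label = LABELS[bisect_right(EDGES, length)]
        grouped.setdefault(label, []).append(elapsed)
    return {
        label: (len(values), sum(values) / len(values))
        for label, values in grouped.items()
    }


if __name__ == "__main__":
    # Hypothetical timings, not measured data.
    demo = summarise_by_length([80, 150, 160, 1200], [20.0, 40.0, 50.0, 900.0])
    for label in LABELS:
        if label in demo:
            count, mean_ms = demo[label]
            print("%-14s %5d %10.2f" % (label, count, mean_ms))
```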
&lt;br /&gt;
== Results ==&lt;br /&gt;
&lt;br /&gt;
=== CATH Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 147 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 530.686s&lt;br /&gt;
 Average Time per Structure: 46.84ms/structure&lt;br /&gt;
 Average Structures per Second: 21.38 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            25.11&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            44.68&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            83.41&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              180.35&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           46.84&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 686.176s&lt;br /&gt;
 Average Time per Structure: 60.56ms/structure&lt;br /&gt;
 Average Structures per Second: 16.51 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            32.57&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            57.76&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            107.56&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              242.30&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           60.56&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.58- ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 695.405s&lt;br /&gt;
 Average Time per Structure: 61.37ms/structure&lt;br /&gt;
 Average Structures per Second: 16.29 structures/s&lt;br /&gt;
 Failed to parse 0 structures due to errors.&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 3663            33.24&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        5295            58.46&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        2328            108.77&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       44              247.171&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 11330           61.38&lt;br /&gt;
&lt;br /&gt;
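The average times per structure reported above for the CATH dataset can be turned into relative slowdowns with respect to Biopython 1.49. This is a small illustrative calculation in Python 3; the helper name relative_slowdown is an assumption, and the numbers are copied from the TOTAL rows of the tables.&lt;br /&gt;
&lt;br /&gt;
```python
# Average time per structure (ms) on the CATH dataset, from the tables above.
CATH_MEAN_MS = {"1.49": 46.84, "1.57+": 60.56, "1.58-": 61.38}


def relative_slowdown(mean_ms, baseline="1.49"):
    """Return the percentage increase in mean parse time over the baseline."""
    base = mean_ms[baseline]
    return {
        version: 100.0 * (value - base) / base
        for version, value in mean_ms.items()
        if version != baseline
    }


if __name__ == "__main__":
    for version, percent in sorted(relative_slowdown(CATH_MEAN_MS).items()):
        print("Biopython %s: %.1f%% slower than 1.49" % (version, percent))
```
&lt;br /&gt;
On these figures, both newer versions are roughly 30% slower than 1.49 on CATH-sized domains.&lt;br /&gt;
&lt;br /&gt;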
=== PDB Dataset ===&lt;br /&gt;
&lt;br /&gt;
Average Structure Length: 263 residues&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.49 ====&lt;br /&gt;
&lt;br /&gt;
 Total Time Spent: 27801.934s (7.72h)&lt;br /&gt;
 Average Time per Structure: 381.706ms/structure&lt;br /&gt;
 Average Structures per Second: 2.62 structures/s&lt;br /&gt;
 Failed to parse 2 structures due to errors:&lt;br /&gt;
   1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN&lt;br /&gt;
   2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs&lt;br /&gt;
&lt;br /&gt;
 Length                N. Structures   Average Time Spent (ms)&lt;br /&gt;
 &amp;lt; 100                 10986           392.227&lt;br /&gt;
 100 =&amp;lt; x &amp;lt; 200        19641           371.716&lt;br /&gt;
 200 =&amp;lt; x &amp;lt; 500        35299           301.699&lt;br /&gt;
 500 =&amp;lt; x &amp;lt; 1000       6291            597.472&lt;br /&gt;
 &amp;gt;= 1000               619             2881.520&lt;br /&gt;
&lt;br /&gt;
 TOTAL                 72836           381.706&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.57+ ====&lt;br /&gt;
&lt;br /&gt;
In progress&lt;br /&gt;
&lt;br /&gt;
==== Biopython 1.58- ====&lt;br /&gt;
&lt;br /&gt;
Queued...&lt;/div&gt;</description>
			<pubDate>Thu, 12 May 2011 13:57:33 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:PDBParser</comments>		</item>
		<item>
			<title>Google Summer of Code</title>
			<link>http://biopython.org/wiki/Google_Summer_of_Code</link>
			<guid isPermaLink="false">http://biopython.org/wiki/Google_Summer_of_Code</guid>
			<description>&lt;p&gt;Joaor: Added GSOC 2010 project and updated the dates to 2011&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2010. We are supporting João Rodrigues in his project, &amp;quot;[[GSOC2010_Joao|Extending Bio.PDB: broadening the usefulness of BioPython's Structural Biology module]].&amp;quot;&lt;br /&gt;
&lt;br /&gt;
In 2009, Biopython was involved with GSoC in collaboration with our friends at [https://www.nescent.org/wg_phyloinformatics/Main_Page NESCent], and had two projects funded:&lt;br /&gt;
&lt;br /&gt;
* Nick Matzke worked on [https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2009#Biogeographical_Phylogenetics_for_BioPython Biogeographical Phylogenetics].&lt;br /&gt;
* Eric Talevich added support for [https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2009#Biopython_support_for_parsing_and_writing_phyloXML parsing and writing phyloXML].&lt;br /&gt;
&lt;br /&gt;
In 2010, another project was funded:&lt;br /&gt;
&lt;br /&gt;
* João Rodrigues worked on [http://www.biopython.org/wiki/GSOC2010_Joao the Structural Biology module Bio.PDB] adding several features used in everyday structural bioinformatics.&lt;br /&gt;
&lt;br /&gt;
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the main [http://code.google.com/soc Google Summer of Code] page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the [http://biopython.org/wiki/Mailing_lists mailing list].&lt;br /&gt;
&lt;br /&gt;
== 2011 Project ideas ==&lt;br /&gt;
&lt;br /&gt;
=== Biopython and PyCogent interoperability ===&lt;br /&gt;
&lt;br /&gt;
; Rationale : [http://pycogent.sourceforge.net/ PyCogent] and [http://biopython.org/wiki/Main_Page Biopython] are two widely used toolkits for performing computational biology and bioinformatics work in Python. The libraries have traditionally had different focuses: Biopython on sequence parsing and retrieval, and PyCogent on evolutionary and phylogenetic processing. Both user communities would benefit from increased interoperability between the code bases, easing the development of complex workflows.&lt;br /&gt;
&lt;br /&gt;
; Approach : The student would focus on soliciting use case scenarios from developers and the larger communities associated with both projects, and use these as the basis for adding glue code and documentation to both libraries. Some use cases of immediate interest as a starting point are:&lt;br /&gt;
&lt;br /&gt;
:* Allow round-trip conversion between biopython and pycogent core objects (sequence, alignment, tree, etc.).&lt;br /&gt;
:* Building workflows using Codon Usage analyses in PyCogent with clustering code in Biopython.&lt;br /&gt;
:* Connecting Biopython acquired sequences to PyCogent's alignment, phylogenetic tree preparation and tree visualization code.&lt;br /&gt;
:* Integrate Biopython's [http://biopython.org/wiki/Phylo phyloXML support], developed during GSoC 2009, with PyCogent.&lt;br /&gt;
:* Develop a standardised controller architecture for interrogation of genome databases by extending PyCogent's Ensembl code, including export to Biopython objects.&lt;br /&gt;
&lt;br /&gt;
; Challenges : This project provides the student with a lot of freedom to create useful interoperability between two feature rich libraries. As opposed to projects which might require churning out more lines of code, the major challenge here will be defining useful APIs and interfaces for existing code. High level inventiveness and coding skill will be required for generating glue code; we feel library integration is an extremely beneficial skill. We also value clear use case based documentation to support the new interfaces.&lt;br /&gt;
&lt;br /&gt;
; Involved toolkits or projects :&lt;br /&gt;
&lt;br /&gt;
:* [http://biopython.org/wiki/Main_Page Biopython]&lt;br /&gt;
:* [http://pycogent.sourceforge.net/ PyCogent]&lt;br /&gt;
&lt;br /&gt;
; Degree of difficulty and needed skills : Medium to Hard. At a minimum, the student will need to be highly competent in Python and become familiar with core objects in PyCogent and Biopython. Sub-projects will require additional expertise, for instance: familiarity with concepts in phylogenetics and genome biology; understanding SQL dialects.&lt;br /&gt;
&lt;br /&gt;
; Mentors : [http://jcsmr.anu.edu.au/org/dmb/compgen/ Gavin Huttley], [http://chem.colorado.edu/index.php?option=com_content&amp;amp;view=article&amp;amp;id=263:rob-knight Rob Knight], [http://bcbio.wordpress.com Brad Chapman], [[User:EricTalevich|Eric Talevich]]&lt;br /&gt;
&lt;br /&gt;
=== Galaxy phylogenetics pipeline development ===&lt;br /&gt;
&lt;br /&gt;
; Rationale : [http://main.g2.bx.psu.edu/ Galaxy] is a popular web based interface for integrating biological tools and analysis pipelines. It is widely used by bench biologists for their analysis work, and by computational biologists for building interfaces to developed tools. [http://hyphy.org HyPhy] provides a popular package for molecular evolution and sequence statistical analysis, and the [http://www.datamonkey.org/ datamonkey.org] server provides web based workflows to perform a number of common tasks with HyPhy. This project bridges these two complementary projects by bringing HyPhy workflows into the Galaxy system, standardizing these analyses on a widely used platform.&lt;br /&gt;
&lt;br /&gt;
; Approach : The student would bring existing workflows from datamonkey.org to Galaxy. The general approach would be to pick a datamonkey.org workflow, wrap the relevant tools using [http://bitbucket.org/galaxy/galaxy-central/wiki/AddToolTutorial Galaxy's XML tool definition language], and implement a shared pipeline with [http://screencast.g2.bx.psu.edu/galaxy/flash/WorkflowFromHistory.html Galaxy's workflow system]. Functional tests will be developed for tools and workflows, along with high level documentation for end users.&lt;br /&gt;
&lt;br /&gt;
; Challenges : This project requires the student to become comfortable working in the existing Galaxy framework. This is a useful practical skill as Galaxy is widely used in the biological community. Similarly, the student should become familiar with the statistical evolutionary methods in HyPhy to feel comfortable wrapping and testing them in Galaxy. Since the tools would be widely used from the main Galaxy website and installed instances, we place a strong emphasis on students who feel comfortable building tests and examples that would ensure the developed workflows function as expected.&lt;br /&gt;
&lt;br /&gt;
; Involved toolkits or projects : &lt;br /&gt;
&lt;br /&gt;
:* [http://bitbucket.org/galaxy/galaxy-central/wiki/Home Galaxy]&lt;br /&gt;
:* [http://hyphy.org HyPhy]&lt;br /&gt;
:* [http://www.datamonkey.org Adaptive Evolution Server]&lt;br /&gt;
&lt;br /&gt;
; Degree of difficulty and needed skills : Medium to Hard. As envisioned, the project would involve implementing full phylogenetic pipelines with the Galaxy toolkits. This would require becoming familiar with the Galaxy tool integration framework as well as being comfortable with HyPhy tools and current pipelines. This would involve comfort with XML for developing the tool interfaces, and Python for integrating scripts and tests with Galaxy and HyPhy.&lt;br /&gt;
&lt;br /&gt;
; Mentors : [http://www.hyphy.org/sergei/ Sergei L Kosakovsky Pond], [http://bcbio.wordpress.com Brad Chapman], [http://www.bx.psu.edu/~anton/ Anton Nekrutenko]&lt;br /&gt;
&lt;br /&gt;
=== Accessing R phylogenetic tools from Python ===&lt;br /&gt;
&lt;br /&gt;
; Rationale : The [http://www.r-project.org/ R statistical language] is a powerful open-source environment for statistical computation and visualization. [http://www.python.org/ Python] serves as an excellent complement to R since it has a wide variety of available libraries to make data processing, analysis, and web presentation easier. The two can be smoothly interfaced using [http://bitbucket.org/lgautier/rpy2/ Rpy2], allowing programmers to leverage the best features of each language. Here we propose to build Rpy2 library components to help ease access to phylogenetic and biogeographical libraries in R.&lt;br /&gt;
&lt;br /&gt;
; Approach : Rpy2 contains higher level interfaces to popular R libraries. For instance, the [http://rpy.sourceforge.net/rpy2/doc-2.1/html/graphics.html#package-ggplot2 ggplot2 interface] allows python users to access powerful plotting functionality in R with an intuitive API. Providing similar high level APIs for biological toolkits available in R would help expose these toolkits to a wider audience of Python programmers. A nice introduction to phylogenetic analysis in R is available from Rich Glor at the [http://bodegaphylo.wikispot.org/Phylogenetics_and_Comparative_Methods_in_R Bodega Bay Marine Lab wiki]. Some examples of R libraries for which integration would be welcomed are:&lt;br /&gt;
&lt;br /&gt;
:* [http://ape.mpl.ird.fr/ ape (Analysis of Phylogenetics and Evolution)] -- an interactive library environment for phylogenetic and evolutionary analyses&lt;br /&gt;
:* [http://pbil.univ-lyon1.fr/ADE-4/home.php?lang=eng ade4] -- Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods&lt;br /&gt;
:* [http://cran.r-project.org/web/packages/geiger/index.html geiger] -- Running macroevolutionary simulation, and estimating parameters related to diversification from comparative phylogenetic data.&lt;br /&gt;
:* [http://picante.r-forge.r-project.org/ picante] -- R tools for integrating phylogenies and ecology&lt;br /&gt;
:* [http://mefa.r-forge.r-project.org/ mefa] -- multivariate data handling for ecological and biogeographical data&lt;br /&gt;
&lt;br /&gt;
; Challenges : The student would have the opportunity to learn an available R toolkit, and then code in Python and R to make this available via an intuitive API. This will involve digging into the R code examples to discover the most useful parts for analysis, and then projecting this into a library that is intuitive to Python coders. Beyond the coding and design aspects, the student should feel comfortable writing up use case documentation to support the API and encourage its adoption.&lt;br /&gt;
&lt;br /&gt;
; Involved toolkits or projects :&lt;br /&gt;
&lt;br /&gt;
:* [http://ape.mpl.ird.fr/ ape (Analysis of Phylogenetics and Evolution)]&lt;br /&gt;
:* [http://bitbucket.org/lgautier/rpy2/ Rpy2]&lt;br /&gt;
:* [http://biopython.org/wiki/Main_Page Biopython]&lt;br /&gt;
&lt;br /&gt;
; Degree of difficulty and needed skills : Moderate. The project requires familiarity with coding in Python and R, and knowledge of phylogeny or biogeography. The student has plenty of flexibility to define the project based on their biological interests (e.g. [http://www.warwick.ac.uk/go/peter_cock/python/heatmap/ microarrays and heatmaps]); there is also the possibility to venture far into data visualization once access to analysis methods is made. [http://kiwi.cs.dal.ca/GenGIS/Main_Page GenGIS] can give ideas about what is possible.&lt;br /&gt;
&lt;br /&gt;
; Mentors : [http://dk.linkedin.com/pub/laurent-gautier/8/81/869 Laurent Gautier], [http://bcbio.wordpress.com Brad Chapman], [http://www.scri.ac.uk/staff/petercock Peter Cock]&lt;/div&gt;</description>
			<pubDate>Thu, 10 Mar 2011 17:24:08 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:Google_Summer_of_Code</comments>		</item>
		<item>
			<title>Struct</title>
			<link>http://biopython.org/wiki/Struct</link>
			<guid isPermaLink="false">http://biopython.org/wiki/Struct</guid>
			<description>&lt;p&gt;Joaor: Added Struct Content&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This module extends Bio.PDB, providing additional features that prove useful to structural biologists using Biopython.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Availability ===&lt;br /&gt;
&lt;br /&gt;
This module is only available (for now) in [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 João's GSOC2010 branch in Github].&lt;br /&gt;
&lt;br /&gt;
After checking out the branch, the module can be loaded with a single import statement, which gives access to all of its functions. Importing specific submodules should only be necessary for more specialised operations.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import Struct&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== I/O Functions === &lt;br /&gt;
&lt;br /&gt;
Bio.Struct provides a simple I/O interface to read and write structures in PDB format. It has two functions, '''read''' and '''write''', that wrap Bio.PDB.PDBParser and Bio.PDB.PDBIO respectively. They are not intended to replace these classes, merely to offer a simpler way of performing I/O tasks.&lt;br /&gt;
&lt;br /&gt;
==== read() ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;s = Struct.read('protein_A.pdb')&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The name of the resulting Structure object is based on the filename. &lt;br /&gt;
&lt;br /&gt;
It has an additional ''id'' argument that allows specification of the Structure object name, much like the first argument in ''PDBParser.get_structure()''.&lt;br /&gt;
&lt;br /&gt;
==== write() ====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;Struct.write(s)&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The name of the resulting output is based on the Structure object id. &lt;br /&gt;
&lt;br /&gt;
It has an additional ''name'' argument that allows naming the output file (much like the main argument in ''PDBIO.save()''). If a file already exists by that name, the write() function automatically renames the output adding ''_0'' (or _1, etc).&lt;br /&gt;
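&lt;br /&gt;
The renaming behaviour can be pictured with the following sketch (Python 3). It is not the code used by write(): the helper name next_free_name and the placement of the suffix before the file extension are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import os
import tempfile


def next_free_name(path):
    """Return path if it is unused, else the first free 'stem_N.ext' variant."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    counter = 0
    # Try stem_0.ext, stem_1.ext, ... until a name is not taken.
    while os.path.exists("%s_%d%s" % (stem, counter, ext)):
        counter += 1
    return "%s_%d%s" % (stem, counter, ext)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "protein_A.pdb")
        open(target, "w").close()  # pretend an output file already exists
        print(os.path.basename(next_free_name(target)))
```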
&lt;br /&gt;
=== Protein Class ===&lt;br /&gt;
&lt;br /&gt;
We created a Structure-based class named Protein to confer protein specific methods to a given SMCRA object. A method was also added to Bio.PDB.Structure that allows easy interconversion between the two classes: '''as_protein()'''. We will discuss each method in detail below.&lt;br /&gt;
&lt;br /&gt;
==== as_protein() ====&lt;br /&gt;
&lt;br /&gt;
Converts a Structure object to the Protein class. The conversion also filters the residues, excluding all those that are not amino acids (HETATMs are also excluded). This filtering can be disabled by setting the optional ''filter_residues'' argument to False.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;s = Struct.read('protein_A.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Edited for shortening purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(p)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== search_ss_bonds() ====&lt;br /&gt;
&lt;br /&gt;
This method returns an iterator over pairs (tuples) of cysteine residues that are close enough to each other to form a disulphide (SS) bond. The default distance threshold for an S-S contact to count as an SS bond is 3.0 Å, but it can be changed through the ''threshold'' argument.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;for bond in p.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&amp;lt;/python&amp;gt;&lt;br /&gt;
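&lt;br /&gt;
Conceptually, the search amounts to a pairwise distance check between cysteine sulphur (SG) atoms. The sketch below (Python 3.8 or later) illustrates the idea on plain coordinate tuples rather than Bio.PDB objects. The function name pair_ss_bonds, the input layout and the coordinates are assumptions, and this is not the implementation found in the branch.&lt;br /&gt;
&lt;br /&gt;
```python
import itertools
import math
import operator


def pair_ss_bonds(sg_coords, threshold=3.0):
    """Yield pairs of residue numbers whose SG atoms lie within threshold.

    sg_coords maps a cysteine residue number to the (x, y, z) position
    of its SG atom, in Angstrom.
    """
    residues = sorted(sg_coords.items())
    for (res_a, xyz_a), (res_b, xyz_b) in itertools.combinations(residues, 2):
        # Keep the pair when the SG-SG distance does not exceed the threshold.
        if operator.le(math.dist(xyz_a, xyz_b), threshold):
            yield (res_a, res_b)


if __name__ == "__main__":
    # Hypothetical coordinates, chosen only to exercise the function.
    demo = {
        5: (0.0, 0.0, 0.0),
        55: (2.0, 0.0, 0.0),
        14: (10.0, 0.0, 0.0),
        38: (10.0, 2.05, 0.0),
        30: (20.0, 20.0, 20.0),
    }
    for pair in pair_ss_bonds(demo):
        print(pair)
```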
&lt;br /&gt;
==== coarse_grain() ====&lt;br /&gt;
&lt;br /&gt;
Although coarse-graining applies to all kinds of molecules, the currently implemented methods concern proteins only. To ease the introduction of new CG models, a CG_models.py module defines how each residue should be coarse-grained. As of now, three models are supported.&lt;br /&gt;
&lt;br /&gt;
===== Example Usage: MARTINI =====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===== Supported Models =====&lt;br /&gt;
&lt;br /&gt;
====== CA Trace ======&lt;br /&gt;
&lt;br /&gt;
Protein residues are reduced to their alpha carbon atom. This is the default method.&lt;br /&gt;
&lt;br /&gt;
====== ENCAD 3pt Model ======&lt;br /&gt;
&lt;br /&gt;
ENCAD is the Energy Calculation and Dynamics program, developed by Michael Levitt since the 1980s. Its 3pt coarse-grained model reduces each protein residue to 3 points (some exceptions are reduced to 4 points): Ca, O, and a side chain bead centered on a particular pre-defined atom.&lt;br /&gt;
&lt;br /&gt;
====== MARTINI Model ======&lt;br /&gt;
&lt;br /&gt;
MARTINI is a well-known CG model that reduces each protein residue to a backbone bead (BB) plus, depending on the residue type, zero or more side chain beads (S1, S2, ...), as shown in the example output above.&lt;br /&gt;
&lt;br /&gt;
==== check_missing_atoms() ====&lt;br /&gt;
&lt;br /&gt;
Compares the residues in the Protein object with a pre-defined topology (derived from AMBER) to check for missing atoms. Outputs a dictionary of tuples for each incomplete residue. Automatically ignores Hydrogen atoms (''ha_only'' argument can be set to False to override this) and allows the usage of a particular template through the ''template'' argument (default None). Templates are dictionaries with residue names as keys and lists with atom names as values.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;p.check_missing_atoms()&amp;lt;/python&amp;gt;&lt;br /&gt;
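&lt;br /&gt;
A template, as described above, is a dictionary with residue names as keys and lists of atom names as values. The sketch below (Python 3) shows how such a template could be compared against a set of residues. The function name missing_atoms, the input layout and the abbreviated heavy-atom template are assumptions for illustration; it is neither the module's own code nor its AMBER-derived topology.&lt;br /&gt;
&lt;br /&gt;
```python
# Abbreviated heavy-atom template: residue name mapped to expected atom names.
TEMPLATE = {
    "GLY": ["N", "CA", "C", "O"],
    "ALA": ["N", "CA", "C", "O", "CB"],
    "SER": ["N", "CA", "C", "O", "CB", "OG"],
}


def missing_atoms(residues, template=TEMPLATE):
    """Return {(resname, resseq): (missing atom names, ...)} for incomplete residues.

    residues is an iterable of (resname, resseq, atom_names) triples.
    Residues whose name is absent from the template are skipped.
    """
    report = {}
    for resname, resseq, atom_names in residues:
        expected = template.get(resname)
        if expected is None:
            continue
        present = set(atom_names)
        absent = tuple(name for name in expected if name not in present)
        if absent:
            report[(resname, resseq)] = absent
    return report


if __name__ == "__main__":
    # Hypothetical input: a complete glycine and a serine lacking its OG atom.
    sample = [
        ("GLY", 1, ["N", "CA", "C", "O"]),
        ("SER", 2, ["N", "CA", "C", "O", "CB"]),
    ]
    print(missing_atoms(sample))
```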
&lt;br /&gt;
==== find_seq_homologues() ====&lt;br /&gt;
&lt;br /&gt;
Wraps the qblast() function of Bio.Blast.NCBIWWW, allowing a direct sequence homology search using the Protein object's sequence. Parameters are adjusted automatically for short sequences. For more complex homology searches, use the Bio.Blast.NCBIWWW module directly, as this is meant to be just a convenience function.&lt;br /&gt;
&lt;br /&gt;
It returns a list ranked by Expectation Value with some informational values (e-value, identities, positives, gaps), the PDB code of the match, and the alignment.&lt;br /&gt;
&lt;br /&gt;
Allows an argument, ''raw_output'', that replaces the default parsed results with raw XML output from the BLAST search.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;seq_homologues = p.find_seq_homologues()&lt;br /&gt;
&lt;br /&gt;
for homologues in seq_homologues:&lt;br /&gt;
    print homologues[0], homologues[1]&lt;br /&gt;
    print homologues[-1]&lt;br /&gt;
    print&lt;br /&gt;
&lt;br /&gt;
2BUO 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
.....&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Fri, 13 Aug 2010 23:41:35 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:Struct</comments>		</item>
		<item>
			<title>Struct</title>
			<link>http://biopython.org/wiki/Struct</link>
			<guid isPermaLink="false">http://biopython.org/wiki/Struct</guid>
			<description>&lt;p&gt;Joaor: Creation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes Bio.Struct, a new module based on [[Bio.PDB]] that adds further structural biology features.&lt;br /&gt;
&lt;br /&gt;
== Aims == &lt;br /&gt;
&lt;br /&gt;
Some thing here...&lt;/div&gt;</description>
			<pubDate>Fri, 13 Aug 2010 00:52:03 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:Struct</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: /* Support for MODELLER PIR format in SeqIO */ Edited mpir to pir-modeller&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect of Bio.PDB that can be improved is a smooth integration/interaction layer for heavyweights in macromolecular simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to support these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation and structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. The weeks also include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June- 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build a conversion table between the different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Renumbering residues of a structure ===&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, simply renumbering the residues consecutively based on ATOM records would lose the information on gaps. Instead, I subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are preserved. I also added an argument (start) that lets the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
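&lt;br /&gt;
The offset logic described above can be sketched on plain residue numbers, independently of Bio.PDB. This is only an illustration under stated assumptions: the function name renumber_with_offset and the sample numbers are hypothetical and are not taken from the branch.&lt;br /&gt;

```python
def renumber_with_offset(resseqs, start=1):
    """Shift residue numbers so the first one equals start, keeping gaps.

    Hypothetical helper for illustration; it works on a plain list of
    integers rather than on Bio.PDB Residue objects.
    """
    if not resseqs:
        return []
    # Same idea as subtracting (first residue number - 1) when start is 1
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Made-up numbering with a gap between 1030 and 1033
original = [1029, 1030, 1033]
renumbered = renumber_with_offset(original)
renumbered_from_ten = renumber_with_offset(original, start=10)
```

With these made-up numbers, the gap between 1030 and 1033 survives as a gap between 2 and 5.&lt;br /&gt;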
&lt;br /&gt;
   &lt;br /&gt;
===  Probe disulphide bridges in the structure ===&lt;br /&gt;
&lt;br /&gt;
For the same reason that SEQRES records are not read, SSBOND records are not looked up. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, which is overloaded to return the distance between two atoms (page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. The user can still set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator over tuples, each holding a pair of bonded residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
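&lt;br /&gt;
The distance test behind this search can be sketched without Bio.PDB as a pairwise comparison of cysteine SG coordinates against the threshold. The function name find_ss_bonds, the data layout and the coordinates below are assumptions made for illustration; the branch itself works on Atom objects and their overloaded minus operator.&lt;br /&gt;

```python
import math
from itertools import combinations


def find_ss_bonds(sg_atoms, threshold=3.0):
    """Return pairs of residue numbers whose SG atoms are within threshold.

    sg_atoms is a list of (residue number, (x, y, z)) tuples, one per
    cysteine. Hypothetical helper, not the method from the branch.
    """
    pairs = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sg_atoms, 2):
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(xyz_a, xyz_b)))
        if threshold >= distance:
            pairs.append((res_a, res_b))
    return pairs


# Made-up coordinates (in angstroms), not taken from 4PTI
sg_atoms = [
    (5, (0.0, 0.0, 0.0)),
    (55, (2.0, 0.0, 0.0)),
    (14, (10.0, 0.0, 0.0)),
    (38, (10.0, 2.05, 0.0)),
    (30, (20.0, 20.0, 20.0)),
]
bonds = find_ss_bonds(sg_atoms)
```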
&lt;br /&gt;
   &lt;br /&gt;
===  Extract Biological Unit ===&lt;br /&gt;
&lt;br /&gt;
Added parsing of REMARK 350 to parse_pdb_header, since some code for parsing another REMARK section was already there. This extracts the transformation matrices and translation vectors from the header, which are then fed to the Structure function. Each new transformed copy of the structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance and because NMR structures rarely have REMARK 350 entries (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
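&lt;br /&gt;
As a rough sketch of what one transformation does, each coordinate is multiplied by the 3x3 rotation matrix and then shifted by the translation vector. The function name apply_transformation and the sample matrix are assumptions chosen for illustration; this is not the code that build_biological_unit() runs.&lt;br /&gt;

```python
def apply_transformation(coords, rotation, translation):
    """Apply x' = R * x + t to every (x, y, z) tuple in coords.

    rotation is a 3x3 nested list and translation a 3-item sequence.
    Hypothetical helper for illustration only.
    """
    transformed = []
    for point in coords:
        transformed.append(tuple(
            sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return transformed


# Made-up operator: 180 degree rotation about z plus a shift along x
rotation = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
translation = [10.0, 0.0, 0.0]
moved = apply_transformation([(1.0, 2.0, 3.0)], rotation, translation)

# The identity operator leaves coordinates unchanged
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unchanged = apply_transformation([(1.0, 2.0, 3.0)], identity, [0.0, 0.0, 0.0])
```

The identity case shows why such an operator adds nothing new, consistent with it being ignored in the 4PTI example above.&lt;br /&gt;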
&lt;br /&gt;
=== Hydrogenation of PDB files ===&lt;br /&gt;
&lt;br /&gt;
Following a discussion with the mentors, we decided that it would be better to include not only a web server interface for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing with this server involved writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It does not yet support water molecules, or any molecules other than proteins. Such issues will hopefully be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow for protonation of any structure, regardless of its nature (protein, DNA, etc). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only; I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===  Coarse Grain Structure ===&lt;br /&gt;
&lt;br /&gt;
A Center of Mass function was developed first as part of a new module Bio.Struct.Geometry. It allows for calculation of the center of geometry (all masses are equal) and center of mass (taking into account elemental masses for the atoms). The masses are a new Atom object feature derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMol. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
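&lt;br /&gt;
The difference between the two modes can be sketched in plain Python as a weighted versus an unweighted mean of the coordinates. The name center_of_mass_sketch, the data layout and the rounded masses are assumptions for illustration; the Bio.Struct.Geometry function reads masses from the Atom objects instead.&lt;br /&gt;

```python
def center_of_mass_sketch(atoms, geometric=False):
    """Return the mass-weighted (or unweighted) mean position.

    atoms is a list of (mass, (x, y, z)) tuples. Hypothetical helper,
    not the Bio.Struct.Geometry implementation.
    """
    # The geometric centre treats every atom as if it had the same mass
    weights = [1.0 if geometric else mass for mass, _ in atoms]
    total = sum(weights)
    return [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]


# Made-up two-atom example with rounded masses for carbon and oxygen
atoms = [(12.0, (0.0, 0.0, 0.0)), (16.0, (7.0, 0.0, 0.0))]
gravitic = center_of_mass_sketch(atoms)
geometric = center_of_mass_sketch(atoms, geometric=True)
```

In this toy case the gravitic centre is pulled towards the heavier atom, while the geometric centre sits at the midpoint.&lt;br /&gt;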
&lt;br /&gt;
&lt;br /&gt;
As of now, 3 CG models are supported.&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Removal of disordered atoms ===&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and a single Atom object is added, corresponding to the location of the user's choice (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
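&lt;br /&gt;
The selection rule can be sketched on plain tuples: ordered atoms are kept as they are, and of each disordered atom only the copy at the chosen alternate location survives. The function name keep_location and the data layout are assumptions for illustration; the branch operates on DisorderedAtom and Atom objects.&lt;br /&gt;

```python
def keep_location(atoms, keep_loc="A"):
    """Keep ordered atoms plus one alternate location of disordered ones.

    atoms is a list of (name, altloc, (x, y, z)) tuples, where a blank
    altloc marks an ordered atom. Hypothetical helper for illustration.
    """
    kept = []
    for name, altloc, coords in atoms:
        if altloc in (" ", keep_loc):
            # The surviving copy is stored as an ordinary, ordered atom
            kept.append((name, " ", coords))
    return kept


# Made-up serine fragment with a disordered OG atom
atoms = [
    ("CA", " ", (1.0, 2.0, 3.0)),
    ("OG", "A", (4.0, 5.0, 6.0)),
    ("OG", "B", (4.5, 5.5, 6.5)),
]
default_choice = keep_location(atoms)
other_choice = keep_location(atoms, keep_loc="B")
```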
&lt;br /&gt;
=== Sequence Homologues from Structures ===&lt;br /&gt;
&lt;br /&gt;
Biopython supports BLAST (locally and remotely through the NCBI servers). We bridged the Bio.PDB and Bio.Blast modules to allow an easier search for sequence homologues. For now, it supports remote BLAST through Bio.Blast.NCBIWWW and functions as a black box, i.e. users cannot change any search parameter. To use BLAST to its full extent, use the regular BLAST module; this is just a convenience function.&lt;br /&gt;
&lt;br /&gt;
It is accessible only to Protein objects. It queries the PDB subset database of the NCBI BLAST servers with the Structure object's sequence, auto-adjusting parameters for short sequences (fewer than 15 residues).&lt;br /&gt;
&lt;br /&gt;
It returns a list ranked by Expectation Value with some informational values (e-value, identities, positives, gaps), the PDB code of the match, and the alignment.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1A8O.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
seq_homologues = p.find_seq_homologues()&lt;br /&gt;
&lt;br /&gt;
for homologues in seq_homologues:&lt;br /&gt;
    print homologues[0], homologues[1]&lt;br /&gt;
    print homologues[-1]&lt;br /&gt;
    print&lt;br /&gt;
&lt;br /&gt;
2BUO 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
1AUM 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
.....&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Support for MODELLER PIR format in SeqIO ===&lt;br /&gt;
&lt;br /&gt;
MODELLER PIR format support was added to SeqIO as 'pir-modeller'. Currently, the format can be read but not written. An example of the format follows, as well as an example of the parser's usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt;P1;5fd1&lt;br /&gt;
structureX:5fd1:1    :A:106  :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19&lt;br /&gt;
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA&lt;br /&gt;
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
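&lt;br /&gt;
The second line of the record above is a colon-separated header with ten fields. A minimal sketch of splitting it is given here; the function name parse_modeller_header and the keys "code" and "name" are hypothetical, the other keys mirror the annotation names in the parser output on this page, and unlike that output the sketch strips surrounding whitespace. It is not the SeqIO parser itself.&lt;br /&gt;

```python
FIELD_NAMES = (
    "record_type", "code", "initial_residue", "initial_chain",
    "end_residue", "end_chain", "name", "source_organism",
    "resolution", "r_factor",
)


def parse_modeller_header(line):
    """Split a MODELLER PIR header line into a dict of stripped fields.

    Hypothetical helper for illustration, not the 'pir-modeller' parser.
    """
    values = [field.strip() for field in line.split(":")]
    if len(values) != len(FIELD_NAMES):
        raise ValueError("expected %d colon-separated fields" % len(FIELD_NAMES))
    return dict(zip(FIELD_NAMES, values))


header = parse_modeller_header(
    "structureX:5fd1:1    :A:106  :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19"
)
```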
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import SeqIO&lt;br /&gt;
&lt;br /&gt;
handle = open('test_pir.txt')&lt;br /&gt;
&lt;br /&gt;
records = SeqIO.parse(handle, 'pir-modeller')&lt;br /&gt;
&lt;br /&gt;
for i in records:&lt;br /&gt;
    print i&lt;br /&gt;
&lt;br /&gt;
ID: 5fd1&lt;br /&gt;
Name: 5fd1&lt;br /&gt;
Description: ferredoxin&lt;br /&gt;
Number of features: 0&lt;br /&gt;
/r_factor= 0.19&lt;br /&gt;
/end_residue=106&lt;br /&gt;
/initial_chain=a&lt;br /&gt;
/end_chain=a&lt;br /&gt;
/record_type=X-Ray Structure&lt;br /&gt;
/initial_residue=1&lt;br /&gt;
/resolution= 1.90&lt;br /&gt;
/source_organism=Azotobacter vinelandii&lt;br /&gt;
Seq('AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAI...LER', ProteinAlphabet())&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Fri, 13 Aug 2010 00:38:45 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: /* Support for MODELLER PIR format in SeqIO */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect of Bio.PDB that can be improved is a smooth integration/interaction layer for heavyweights in macromolecular simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to support these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation and structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. The weeks also include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June- 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build a conversion table between the different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to the protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Renumbering residues of a structure ===&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, renumbering sequentially based on ATOM records alone would lose the information on gaps. Instead, I subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and the gaps are preserved. I also added an argument (start) that lets the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
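&lt;br /&gt;
The offset rule described above can be sketched in plain Python, independently of Bio.PDB. This is only an illustration under stated assumptions: the helper name renumber and the list of residue numbers are hypothetical and are not taken from the GSoC branch.&lt;br /&gt;
&lt;br /&gt;
```python
def renumber(resseqs, start=1):
    """Shift residue numbers so the first one equals start, keeping gaps."""
    # Hypothetical helper for illustration; not the branch's implementation.
    if not resseqs:
        return []
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Residues 1031 and 1032 are missing, so the gap must survive renumbering.
original = [1029, 1030, 1033, 1034]
print(renumber(original))
print(renumber(original, start=10))
```
&lt;br /&gt;
Because a single offset is applied to every residue, the distance between any two residue numbers is unchanged, which is what keeps the gaps visible.&lt;br /&gt;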
&lt;br /&gt;
   &lt;br /&gt;
===  Probe disulphide bridges in the structure ===&lt;br /&gt;
&lt;br /&gt;
The same rationale given for SEQRES applies to not reading SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, which has been overloaded to return the distance between two atoms (page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. The user can still set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator over tuples, each holding a pair of residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
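&lt;br /&gt;
The distance test behind this search can be sketched without Bio.PDB. The snippet below is a minimal illustration, assuming the SG coordinates have already been extracted into a dictionary; the function names and the coordinates are invented for the example and do not come from the branch.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations
from math import sqrt


def distance(a, b):
    """Euclidean distance between two (x, y, z) coordinate tuples."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def find_ss_pairs(sg_atoms, threshold=3.0):
    """Yield pairs of residue numbers whose SG atoms lie within threshold."""
    # sg_atoms maps a residue number to the coordinates of its SG atom.
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_atoms.items()), 2):
        if threshold >= distance(xyz_a, xyz_b):
            yield (res_a, res_b)


# Made-up coordinates, in Angstrom, chosen only to exercise the function.
sg_atoms = {
    5: (0.0, 0.0, 0.0),
    55: (2.0, 0.0, 0.0),
    14: (10.0, 0.0, 0.0),
    38: (10.0, 2.05, 0.0),
    30: (20.0, 20.0, 20.0),
}
print(list(find_ss_pairs(sg_atoms)))
```
&lt;br /&gt;
An all-against-all comparison is affordable here because a protein usually contains only a handful of cysteines.&lt;br /&gt;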
&lt;br /&gt;
   &lt;br /&gt;
===  Extract Biological Unit ===&lt;br /&gt;
&lt;br /&gt;
Parsing of REMARK 350 was added to parse_pdb_header, since it already contained some code for another REMARK section. This extracts the transformation matrices and translation vectors from the header, which are then fed to the Structure's build_biological_unit method. Each transformed copy of the structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL and because NMR structures rarely have REMARK 350 records (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
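&lt;br /&gt;
Each REMARK 350 operator is a 3x3 rotation matrix plus a translation vector, applied to every atom as new = R * old + t. The following plain Python sketch shows that step only; the function name, the operator values, and the coordinates are assumptions made for illustration and are not read from a PDB file.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_transformation(coords, rotation, translation):
    """Return rotation * coord + translation for every (x, y, z) in coords."""
    transformed = []
    for xyz in coords:
        new_xyz = tuple(
            sum(rotation[i][j] * xyz[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        transformed.append(new_xyz)
    return transformed


# Two-fold rotation about the z axis followed by a shift along x.
# These values are invented for illustration, not taken from a PDB file.
rotation = [[-1.0, 0.0, 0.0],
            [0.0, -1.0, 0.0],
            [0.0, 0.0, 1.0]]
translation = [10.0, 0.0, 0.0]

model_0 = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
model_1 = apply_transformation(model_0, rotation, translation)
print(model_1)
```
&lt;br /&gt;
In this sketch, model_1 plays the role of the additional MODEL that holds the transformed copy.&lt;br /&gt;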
&lt;br /&gt;
=== Hydrogenation of PDB files ===&lt;br /&gt;
&lt;br /&gt;
Following discussion with the mentors, we decided that it would be better to include not only a web server interface for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF Protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing with this server included writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It does not yet support water molecules or any molecules other than proteins. These issues will hopefully be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow protonation of any structure, regardless of its nature (protein, DNA, etc.). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only; I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===  Coarse Grain Structure ===&lt;br /&gt;
&lt;br /&gt;
A Center of Mass function was developed first as part of a new module Bio.Struct.Geometry. It allows for calculation of the center of geometry (all masses are equal) and center of mass (taking into account elemental masses for the atoms). The masses are a new Atom object feature derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMol. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
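&lt;br /&gt;
The difference between the two modes comes down to the weights used in the average. The sketch below is a stand-alone illustration, assuming atoms are given as (mass, coordinates) pairs; the function name center_of_mass_sketch and the two-atom example are hypothetical and are not the code in Bio.Struct.Geometry.&lt;br /&gt;
&lt;br /&gt;
```python
def center_of_mass_sketch(atoms, geometric=False):
    """Return the mass-weighted (or unweighted) mean position of atoms."""
    # atoms is a list of (mass, (x, y, z)) tuples.
    # In geometric mode every atom gets the same weight.
    weights = [1.0 if geometric else mass for mass, _ in atoms]
    total = sum(weights)
    return [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]


# A carbon and an oxygen with rounded masses, 7 Angstrom apart along x.
atoms = [(12.0, (0.0, 0.0, 0.0)), (16.0, (7.0, 0.0, 0.0))]
print(center_of_mass_sketch(atoms))
print(center_of_mass_sketch(atoms, geometric=True))
```
&lt;br /&gt;
The mass-weighted result is pulled towards the heavier oxygen, while the geometric result sits at the midpoint.&lt;br /&gt;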
&lt;br /&gt;
&lt;br /&gt;
As of now, 3 CG models are supported.&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Removal of disordered atoms ===&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and replaced by a single Atom object corresponding to the alternate location of the user's choice (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Sequence Homologues from Structures ===&lt;br /&gt;
&lt;br /&gt;
Biopython supports BLAST (local, and remote through the NCBI servers). We bridged the Bio.PDB and Bio.Blast modules to make searching for sequence homologues easier. For now, only remote BLAST through Bio.Blast.NCBIWWW is supported, and it works as a black box, i.e. users cannot change any search parameter. Anyone who needs full control over BLAST should use the regular BLAST module; this is just a convenience function.&lt;br /&gt;
&lt;br /&gt;
It is accessible only to Protein objects. It queries the PDB subset database of the NCBI BLAST servers with the Structure object's sequence, auto-adjusting parameters for short sequences (fewer than 15 residues).&lt;br /&gt;
&lt;br /&gt;
It returns a list ranked by expectation value, containing some informational values (e-value, identities, positives, gaps), the PDB code of the match, and the alignment.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1A8O.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
seq_homologues = p.find_seq_homologues()&lt;br /&gt;
&lt;br /&gt;
for homologues in seq_homologues:&lt;br /&gt;
    print homologues[0], homologues[1]&lt;br /&gt;
    print homologues[-1]&lt;br /&gt;
    print&lt;br /&gt;
&lt;br /&gt;
2BUO 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
1AUM 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
.....&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Support for MODELLER PIR format in SeqIO ===&lt;br /&gt;
&lt;br /&gt;
MODELLER PIR format support was added to SeqIO as 'mpir'. Currently, the format can be read but not written. An example of the format follows, as well as an example of the parser's usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt;P1;5fd1&lt;br /&gt;
structureX:5fd1:1    :A:106  :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19&lt;br /&gt;
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA&lt;br /&gt;
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
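&lt;br /&gt;
The second line of the record above is a colon-separated description with ten fields. As a rough sketch of how such a line can be split, assuming the field order of the MODELLER PIR format: the names code and name are labels chosen here, the remaining names mirror the annotation keys printed by the parser, and the function is hypothetical rather than the 'mpir' parser itself (which, for instance, also maps the record type to a readable label).&lt;br /&gt;
&lt;br /&gt;
```python
FIELDS = (
    "record_type", "code", "initial_residue", "initial_chain",
    "end_residue", "end_chain", "name", "source_organism",
    "resolution", "r_factor",
)


def parse_pir_header(line):
    """Split a MODELLER PIR description line into named, stripped fields."""
    values = [value.strip() for value in line.split(":")]
    return dict(zip(FIELDS, values))


header = ("structureX:5fd1:1    :A:106  :A:ferredoxin:"
          "Azotobacter vinelandii: 1.90: 0.19")
fields = parse_pir_header(header)
for key in FIELDS:
    print(key, "=", fields[key])
```
&lt;br /&gt;
All values are kept as stripped strings in this sketch; converting residue numbers or the resolution to numbers is left out on purpose.&lt;br /&gt;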
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import SeqIO&lt;br /&gt;
&lt;br /&gt;
handle = open('test_pir.txt')&lt;br /&gt;
&lt;br /&gt;
records = SeqIO.parse(handle, 'mpir')&lt;br /&gt;
&lt;br /&gt;
for i in records:&lt;br /&gt;
    print i&lt;br /&gt;
&lt;br /&gt;
ID: 5fd1&lt;br /&gt;
Name: 5fd1&lt;br /&gt;
Description: ferredoxin&lt;br /&gt;
Number of features: 0&lt;br /&gt;
/r_factor= 0.19&lt;br /&gt;
/end_residue=106&lt;br /&gt;
/initial_chain=a&lt;br /&gt;
/end_chain=a&lt;br /&gt;
/record_type=X-Ray Structure&lt;br /&gt;
/initial_residue=1&lt;br /&gt;
/resolution= 1.90&lt;br /&gt;
/source_organism=Azotobacter vinelandii&lt;br /&gt;
Seq('AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAI...LER', ProteinAlphabet())&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Thu, 12 Aug 2010 04:53:37 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: /* Support for MODELLER PIR format in SeqIO */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect of Bio.PDB that can be improved is a smooth integration/interaction layer for heavyweights in macromolecular simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. The weeks also include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June- 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Test and consolidate the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to the protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Renumbering residues of a structure ===&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, renumbering sequentially based on ATOM records alone would lose the information on gaps. Instead, I subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and the gaps are preserved. I also added an argument (start) that lets the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
===  Probe disulphide bridges in the structure ===&lt;br /&gt;
&lt;br /&gt;
The same rationale given for SEQRES applies to not reading SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, which has been overloaded to return the distance between two atoms (page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. The user can still set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator over tuples, each holding a pair of residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
===  Extract Biological Unit ===&lt;br /&gt;
&lt;br /&gt;
Parsing of REMARK 350 was added to parse_pdb_header, since it already contained some code for another REMARK section. This extracts the transformation matrices and translation vectors from the header, which are then fed to the Structure's build_biological_unit method. Each transformed copy of the structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL and because NMR structures rarely have REMARK 350 records (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hydrogenation of PDB files ===&lt;br /&gt;
&lt;br /&gt;
Following discussion with the mentors, we decided that it would be better to include not only a web server interface for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF Protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing with this server included writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It does not yet support water molecules or any molecules other than proteins. These issues will hopefully be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on PyMOL's algorithm. It seems to allow protonation of any structure, regardless of its nature (protein, DNA, etc.). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only; I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===  Coarse Grain Structure ===&lt;br /&gt;
&lt;br /&gt;
A center of mass function was developed first, as part of a new module, Bio.Struct.Geometry. It calculates either the center of geometry (all masses equal) or the center of mass (using the elemental mass of each atom). The masses are a new Atom object feature, derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, every atom of a structure now gets its mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
As of now, three coarse-grained (CG) models are supported:&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Removal of disordered atoms ===&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py, based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and replaced by a single Atom object corresponding to the alternate location of the user's choice (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Sequence Homologues from Structures ===&lt;br /&gt;
&lt;br /&gt;
Biopython supports BLAST (local, and remote through the NCBI servers). We bridged the Bio.PDB and Bio.Blast modules to make searching for sequence homologues easier. For now, the function supports remote BLAST through Bio.Blast.NCBIWWW and works as a black box, i.e. users cannot change any search parameters. Users who need full control over BLAST should use the regular Bio.Blast module; this is just a convenience function.&lt;br /&gt;
&lt;br /&gt;
It is available only on Protein objects. It queries the PDB subset database on the NCBI BLAST servers with the Structure object's sequence, automatically adjusting the parameters for short sequences (fewer than 15 residues).&lt;br /&gt;
&lt;br /&gt;
It returns a list ranked by Expectation Value with some informational values (e-value, identities, positives, gaps), the PDB code of the match, and the alignment.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1A8O.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
seq_homologues = p.find_seq_homologues()&lt;br /&gt;
&lt;br /&gt;
for homologues in seq_homologues:&lt;br /&gt;
    print homologues[0], homologues[1]&lt;br /&gt;
    print homologues[-1]&lt;br /&gt;
    print&lt;br /&gt;
&lt;br /&gt;
2BUO 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
1AUM 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
.....&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Support for MODELLER PIR format in SeqIO ===&lt;br /&gt;
&lt;br /&gt;
MODELLER PIR format support was added to SeqIO as 'mpir'. Currently, the format can be read but not written. An example of the format follows, as well as an example of the parser's usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&lt;br /&gt;
&amp;gt;P1;5fd1&lt;br /&gt;
structureX:5fd1:1    :A:106  :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19&lt;br /&gt;
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA&lt;br /&gt;
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
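&lt;br /&gt;
To illustrate how the colon-separated second line of a record maps onto named fields, here is a minimal standalone sketch. The parse_pir_header function and the FIELDS names are illustrative assumptions (most of them mirror the annotation keys shown in the parser output further down); this is not the code of the 'mpir' parser itself.&lt;br /&gt;
&lt;br /&gt;
```python
# Field order of the colon-separated description line in a MODELLER PIR record
FIELDS = ('record_type', 'code', 'initial_residue', 'initial_chain',
          'end_residue', 'end_chain', 'description', 'source_organism',
          'resolution', 'r_factor')


def parse_pir_header(line):
    # Split the description line on colons and strip the padding spaces
    values = [value.strip() for value in line.split(':')]
    return dict(zip(FIELDS, values))


header = parse_pir_header(
    'structureX:5fd1:1    :A:106  :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19'
)
print(header['source_organism'], header['resolution'])
```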
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import SeqIO&lt;br /&gt;
&lt;br /&gt;
handle = open('test_pir.txt')&lt;br /&gt;
&lt;br /&gt;
records = SeqIO.parse(handle, 'mpir')&lt;br /&gt;
&lt;br /&gt;
for i in records:&lt;br /&gt;
    print i&lt;br /&gt;
&lt;br /&gt;
ID: 5fd1&lt;br /&gt;
Name: 5fd1&lt;br /&gt;
Description: ferredoxin&lt;br /&gt;
Number of features: 0&lt;br /&gt;
/r_factor= 0.19&lt;br /&gt;
/end_residue=106&lt;br /&gt;
/initial_chain=a&lt;br /&gt;
/end_chain=a&lt;br /&gt;
/record_type=X-Ray Structure&lt;br /&gt;
/initial_residue=1&lt;br /&gt;
/resolution= 1.90&lt;br /&gt;
/source_organism=Azotobacter vinelandii&lt;br /&gt;
Seq('AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAI...LER', ProteinAlphabet())&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Thu, 12 Aug 2010 04:52:50 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: Added MODELLER PIR support to SeqIO&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser and supports several calculations on macromolecules (Neighbour Search, RMS), it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that could be incorporated into Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal, residue renaming (to account for the different existing nomenclatures), and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect of Bio.PDB that can be improved is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it would be more advantageous to support these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with interfaces to model validation and structure checking services and software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was designed to be flexible, so some features will likely be finished early. Each week also includes documentation and unit testing for its features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighborSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June- 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighborSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are only useful or meaningful for proteins, exposing them in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to the protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Renumbering residues of a structure ===&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I decided not to read SEQRES records to account for gaps. However, simply renumbering the residues consecutively from the ATOM records would lose the information on gaps. Instead, I subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are preserved. I also added an argument (start) that lets the user set the number to start counting from.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
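&lt;br /&gt;
The offset rule described above can be sketched independently of Bio.PDB. This is only an illustration: the renumber function and the plain list of residue numbers are assumptions made for the example, not the code in the branch.&lt;br /&gt;
&lt;br /&gt;
```python
def renumber(resseqs, start=1):
    # Shift every residue number by the same offset so that the first
    # residue gets `start`; gaps in the numbering are left untouched.
    if not resseqs:
        return []
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Hypothetical chain starting at 1029 with a gap between 1030 and 1034
original = [1029, 1030, 1034, 1035]
renumbered = renumber(original)
print(renumbered)   # [1, 2, 6, 7]
shifted = renumber(original, start=10)
print(shifted)      # [10, 11, 15, 16]
```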
&lt;br /&gt;
   &lt;br /&gt;
===  Probe disulphide bridges in the structure ===&lt;br /&gt;
&lt;br /&gt;
The rationale given above for SEQRES also applies to SSBOND, so header records are not consulted. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, which is overloaded to return the distance between two atoms (page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. Users can still set their own threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator over tuples, each containing a pair of bonded residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
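&lt;br /&gt;
The pairwise distance test can be sketched without Bio.PDB as follows. The function names, the dictionary of SG coordinates and the coordinates themselves are made up for illustration; only the 3.0 threshold comes from the text above.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations
from math import sqrt
from operator import le  # le(a, b) is True when a is at most b


def distance(a, b):
    # Euclidean distance between two (x, y, z) tuples
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def close_sg_pairs(sg_atoms, threshold=3.0):
    # sg_atoms maps a cysteine residue number to its SG coordinates.
    # Yields each pair of residue numbers whose SG atoms lie within threshold.
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_atoms.items()), 2):
        if le(distance(xyz_a, xyz_b), threshold):
            yield res_a, res_b


# Made-up coordinates, for illustration only
sg_atoms = {
    5: (0.0, 0.0, 0.0),
    55: (2.0, 0.0, 0.0),
    14: (10.0, 0.0, 0.0),
    38: (10.0, 2.1, 0.0),
    30: (20.0, 0.0, 0.0),
}
pairs = list(close_sg_pairs(sg_atoms))
print(pairs)   # [(5, 55), (14, 38)]
```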
&lt;br /&gt;
   &lt;br /&gt;
===  Extract Biological Unit ===&lt;br /&gt;
&lt;br /&gt;
I added parsing for REMARK 350 to parse_pdb_header, since it already contained some code for another REMARK section. The parser extracts the rotation matrices and translation vectors from the header, which are then passed to the Structure object's build_biological_unit method. Each transformed copy of the structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL, and NMR structures rarely have REMARK 350 records (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
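&lt;br /&gt;
Each REMARK 350 operator amounts to a rotation followed by a translation of every atom. A minimal sketch of that arithmetic, in plain Python rather than the Bio.PDB code, is shown here; apply_transformation, the operator and the coordinates are all hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_transformation(coords, rotation, translation):
    # new = R * old + t for every (x, y, z) coordinate
    transformed = []
    for xyz in coords:
        rotated = [sum(r * c for r, c in zip(row, xyz)) for row in rotation]
        transformed.append([v + t for v, t in zip(rotated, translation)])
    return transformed


# Made-up operator: 180 degree rotation about z plus a shift along x
rotation = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
translation = [10.0, 0.0, 0.0]
asymmetric_unit = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# One model per operator; the original coordinates stay as the first model
models = [asymmetric_unit,
          apply_transformation(asymmetric_unit, rotation, translation)]
print(models[1])   # [[9.0, -2.0, 3.0], [6.0, -5.0, 6.0]]
```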
&lt;br /&gt;
=== Hydrogenation of PDB files ===&lt;br /&gt;
&lt;br /&gt;
After discussing this with the mentors, we decided to provide not only an interface to a web server but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface to the WHATIF protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing with this server required writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It does not yet support water molecules or any molecules other than proteins. These issues will hopefully be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on PyMOL's algorithm. It seems to allow protonation of any structure, regardless of its nature (protein, DNA, etc.). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only; I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===  Coarse Grain Structure ===&lt;br /&gt;
&lt;br /&gt;
A center of mass function was developed first, as part of a new module, Bio.Struct.Geometry. It calculates either the center of geometry (all masses equal) or the center of mass (using the elemental mass of each atom). The masses are a new Atom object feature, derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, every atom of a structure now gets its mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
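&lt;br /&gt;
The difference between the two centers comes down to the weights used in the average. The following standalone sketch shows this; the weighted_center function, the small MASSES table (approximate values) and the two-atom example are assumptions for illustration and not the Bio.Struct.Geometry code.&lt;br /&gt;
&lt;br /&gt;
```python
# Approximate atomic masses for a few elements (illustrative table)
MASSES = {'H': 1.008, 'C': 12.011, 'N': 14.007, 'O': 15.999, 'S': 32.06}


def weighted_center(atoms, geometric=False):
    # atoms is a list of (element, (x, y, z)) tuples.
    # With geometric=True every atom gets the same weight.
    weights = [1.0 if geometric else MASSES[element] for element, _ in atoms]
    total = sum(weights)
    return [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]


# A made-up C-O pair lying on the x axis
atoms = [('C', (0.0, 0.0, 0.0)), ('O', (2.0, 0.0, 0.0))]
geometric_center = weighted_center(atoms, geometric=True)
mass_center = weighted_center(atoms)
print(geometric_center)   # [1.0, 0.0, 0.0]
print(mass_center)        # x is pulled towards the heavier oxygen
```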
&lt;br /&gt;
&lt;br /&gt;
As of now, three coarse-grained (CG) models are supported:&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Removal of disordered atoms ===&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and replaced by a single Atom object at the alternate location chosen by the user (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Sequence Homologues from Structures ===&lt;br /&gt;
&lt;br /&gt;
Biopython supports BLAST (local and remote through NCBI servers). We bridged the Bio.PDB and Bio.Blast modules to make searching for sequence homologues easier. For now, it supports only remote BLAST through Bio.Blast.NCBIWWW and works as a black box, i.e. users cannot change any search parameter. Users who need full control over BLAST should use the regular BLAST module; this is just a convenience function.&lt;br /&gt;
&lt;br /&gt;
It is accessible only to Protein objects. It queries the PDB subset database of NCBI BLAST servers with the Structure object's sequence, auto-adjusting parameters for short sequences (less than 15 residues).&lt;br /&gt;
&lt;br /&gt;
It returns a list ranked by Expectation Value with some informational values (e-value, identities, positives, gaps), the PDB code of the match, and the alignment.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1A8O.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
seq_homologues = p.find_seq_homologues()&lt;br /&gt;
&lt;br /&gt;
for homologues in seq_homologues:&lt;br /&gt;
    print homologues[0], homologues[1]&lt;br /&gt;
    print homologues[-1]&lt;br /&gt;
    print&lt;br /&gt;
&lt;br /&gt;
2BUO 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
1AUM 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
.....&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Support for MODELLER PIR format in SeqIO ===&lt;br /&gt;
&lt;br /&gt;
MODELLER PIR format support was added to SeqIO as 'mpir'. Currently, the format can be read but not written. An example of the format follows, as well as an example of the parser's usage.&lt;br /&gt;
&lt;br /&gt;
&amp;gt;P1;5fd1&lt;br /&gt;
structureX:5fd1:1    :A:106  :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19&lt;br /&gt;
AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAIFSEDEVPEDMQEFIQLNAELA&lt;br /&gt;
EVWPNITEKKDPLPDAEDWDGVKGKLQHLER*&lt;br /&gt;
&lt;br /&gt;
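The second line of each record packs ten colon-separated fields. As a rough illustration of how such a line could be split into named annotations, here is a minimal Python 3 sketch. The names parse_pir_header and PIR_FIELDS are hypothetical, the field names are borrowed from the annotation keys reported by the parser, and this is not the code used by the 'mpir' parser itself.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical helper for illustration only; not the SeqIO 'mpir' parser.
PIR_FIELDS = (
    "record_type", "code", "initial_residue", "initial_chain",
    "end_residue", "end_chain", "name", "source_organism",
    "resolution", "r_factor",
)


def parse_pir_header(line):
    """Split a MODELLER PIR description line into a dict of stripped fields."""
    values = [field.strip() for field in line.split(":")]
    if len(values) != len(PIR_FIELDS):
        raise ValueError("expected %d colon-separated fields, got %d"
                         % (len(PIR_FIELDS), len(values)))
    return dict(zip(PIR_FIELDS, values))


# The description line from the example record above.
header = ("structureX:5fd1:1    :A:106  :A:ferredoxin:"
          "Azotobacter vinelandii: 1.90: 0.19")
fields = parse_pir_header(header)
print(fields["source_organism"], fields["resolution"])
```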
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import SeqIO&lt;br /&gt;
&lt;br /&gt;
handle = open('test_pir.txt')&lt;br /&gt;
&lt;br /&gt;
records = SeqIO.parse(handle, 'mpir')&lt;br /&gt;
&lt;br /&gt;
for i in records:&lt;br /&gt;
    print i&lt;br /&gt;
&lt;br /&gt;
ID: 5fd1&lt;br /&gt;
Name: 5fd1&lt;br /&gt;
Description: ferredoxin&lt;br /&gt;
Number of features: 0&lt;br /&gt;
/r_factor= 0.19&lt;br /&gt;
/end_residue=106&lt;br /&gt;
/initial_chain=a&lt;br /&gt;
/end_chain=a&lt;br /&gt;
/record_type=X-Ray Structure&lt;br /&gt;
/initial_residue=1&lt;br /&gt;
/resolution= 1.90&lt;br /&gt;
/source_organism=Azotobacter vinelandii&lt;br /&gt;
Seq('AFVVTDNCIKCKYTDCVEVCPVDCFYEGPNFLVIHPDECIDCALCEPECPAQAI...LER', ProteinAlphabet())&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Thu, 12 Aug 2010 04:49:44 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: Added find_seq_homologues description&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. The weeks also include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Getting familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June- 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term Evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that allows protein-specific methods to be accessed. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Renumbering residues of a structure ===&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, renumbering sequentially from the ATOM records alone would lose the information on gaps. Instead, I subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are kept. I also added an argument (start) that lets the user set the number to start counting from.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
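&lt;br /&gt;
The offset rule described above can be sketched on a plain list of residue numbers. This is a minimal Python 3 illustration, not the code of renumber_residues itself; the function name renumber and the residue numbers are made up for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def renumber(resseqs, start=1):
    """Shift residue numbers so the first one becomes `start`, keeping gaps."""
    if not resseqs:
        return []
    # One constant offset for every residue preserves the gaps between them.
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Made-up numbering with a gap between 1031 and 1035.
original = [1029, 1030, 1031, 1035, 1036]
print(renumber(original))
print(renumber(original, start=10))
```
&lt;br /&gt;
Because the same offset is applied everywhere, the three missing residues in the made-up input are still missing after renumbering.&lt;br /&gt;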
&lt;br /&gt;
   &lt;br /&gt;
===  Probe disulphide bridges in the structure ===&lt;br /&gt;
&lt;br /&gt;
The same rationale as for SEQRES applies to skipping the SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, since it has been overloaded to return the distance between two atoms (page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. The user can still set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator with tuples of pairs of residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
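&lt;br /&gt;
The pairwise distance test can be sketched without Bio.PDB by working on bare coordinates. The following Python 3 snippet is only an illustration under stated assumptions: search_ss_pairs is a hypothetical name, the SG coordinates are invented, and math.dist stands in for the overloaded minus operator on Atom objects.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations
from math import dist  # available from Python 3.8


def search_ss_pairs(sg_atoms, threshold=3.0):
    """Yield pairs of residue numbers whose SG atoms lie within `threshold`."""
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sg_atoms, 2):
        if threshold >= dist(xyz_a, xyz_b):
            yield res_a, res_b


# Invented (residue number, SG coordinates) entries, not taken from 4PTI.
sg_atoms = [
    (5, (0.0, 0.0, 0.0)),
    (14, (10.0, 0.0, 0.0)),
    (38, (10.0, 2.0, 0.0)),
    (55, (0.0, 0.0, 2.05)),
]
print(list(search_ss_pairs(sg_atoms)))
```
&lt;br /&gt;
Checking every pair is quadratic in the number of cysteines, which is usually small enough that a spatial index such as NeighborSearch is not needed.&lt;br /&gt;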
&lt;br /&gt;
   &lt;br /&gt;
===  Extract Biological Unit ===&lt;br /&gt;
&lt;br /&gt;
Added parsing for REMARK350 to parse_pdb_header, since a parser for another REMARK section was already there. This extracts the transformation matrices and translation vectors from the header, which are then fed to the Structure function. Each newly transformed copy of the structure is added as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance, and because NMR models rarely have REMARK 350 (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
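&lt;br /&gt;
Each REMARK 350 operator amounts to a 3x3 rotation matrix plus a translation vector applied to every atom. A minimal Python 3 sketch of that step follows; apply_transformation is a hypothetical name, the matrix and coordinates are invented, and the real build_biological_unit code may organise this differently.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_transformation(coords, rotation, translation):
    """Map each (x, y, z) through x' = R x + t and return the new coordinates."""
    transformed = []
    for x, y, z in coords:
        transformed.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + shift
            for row, shift in zip(rotation, translation)
        ))
    return transformed


# Invented example: rotate 90 degrees about the z axis, then translate.
coords = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
rot_z_90 = [(0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
shift = (10.0, 0.0, 5.0)
print(apply_transformation(coords, rot_z_90, shift))
```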
&lt;br /&gt;
=== Hydrogenation of PDB files ===&lt;br /&gt;
&lt;br /&gt;
Following discussion between the mentors and me, we decided that it was better to include not only a web server for this purpose but also a local algorithm. This way, the user is not limited when he/she lacks an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF Protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing this server included writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It does not support water molecules yet, nor any molecules other than proteins. Such issues will hopefully be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow for protonation of any structure, regardless of its nature (protein, DNA, etc.). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only. I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===  Coarse Grain Structure ===&lt;br /&gt;
&lt;br /&gt;
A Center of Mass function was developed first as part of a new module Bio.Struct.Geometry. It allows for calculation of the center of geometry (all masses are equal) and center of mass (taking into account elemental masses for the atoms). The masses are a new Atom object feature derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
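&lt;br /&gt;
The difference between the two centers comes down to the weights used in the average. The Python 3 sketch below shows the arithmetic on (mass, coordinates) pairs; it is not the Bio.Struct.Geometry implementation, and the two atoms and their rounded masses are invented for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def center_of_mass(atoms, geometric=False):
    """Weighted mean of coordinates; geometric=True treats all masses as 1."""
    atoms = list(atoms)
    weights = [1.0 if geometric else mass for mass, _ in atoms]
    total = sum(weights)
    return [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]


# Invented two-atom example with rounded masses (12 and 16).
atoms = [(12.0, (0.0, 0.0, 0.0)), (16.0, (1.0, 0.0, 0.0))]
print(center_of_mass(atoms))
print(center_of_mass(atoms, geometric=True))
```
&lt;br /&gt;
The mass-weighted center is pulled towards the heavier atom, while the geometric center sits exactly halfway between the two.&lt;br /&gt;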
&lt;br /&gt;
&lt;br /&gt;
As of now, 3 CG models are supported.&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Removal of disordered atoms ===&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and replaced by a single Atom object at the alternate location chosen by the user (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
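&lt;br /&gt;
The selection logic can be sketched on a toy data model in which a residue is a dict of atom names, and a disordered atom is itself a dict keyed by alternate location. This Python 3 snippet is an assumption-laden illustration: remove_disordered and the data layout are hypothetical and do not reflect how DisorderedAtom objects are stored in Bio.PDB.&lt;br /&gt;
&lt;br /&gt;
```python
def remove_disordered(residue, keep_loc="A"):
    """Return (cleaned residue, names of the atoms that were disordered)."""
    cleaned, modified = {}, []
    for name, entry in residue.items():
        if isinstance(entry, dict):
            # Disordered atom: keep only the requested alternate location.
            # A missing keep_loc raises KeyError in this sketch.
            cleaned[name] = entry[keep_loc]
            modified.append(name)
        else:
            cleaned[name] = entry
    return cleaned, modified


# Invented serine-like residue with one disordered atom (OG).
residue = {
    "CA": (1.0, 2.0, 3.0),
    "OG": {"A": (0.0, 0.0, 0.0), "B": (0.5, 0.5, 0.5)},
}
cleaned, modified = remove_disordered(residue)
print(cleaned, modified)
```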
&lt;br /&gt;
=== Sequence Homologues from Structures ===&lt;br /&gt;
&lt;br /&gt;
Biopython supports BLAST (local and remote through NCBI servers). We bridged the Bio.PDB and Bio.Blast modules to make searching for sequence homologues easier. For now, it supports only remote BLAST through Bio.Blast.NCBIWWW and works as a black box, i.e. users cannot change any search parameter. Users who need full control over BLAST should use the regular BLAST module; this is just a convenience function.&lt;br /&gt;
&lt;br /&gt;
It is accessible only to Protein objects. It queries the PDB subset database of NCBI BLAST servers with the Structure object's sequence, auto-adjusting parameters for short sequences (less than 15 residues).&lt;br /&gt;
&lt;br /&gt;
It returns a list ranked by Expectation Value with some informational values (e-value, identities, positives, gaps), the PDB code of the match, and the alignment.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1A8O.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
seq_homologues = p.find_seq_homologues()&lt;br /&gt;
&lt;br /&gt;
for homologues in seq_homologues:&lt;br /&gt;
    print homologues[0], homologues[1]&lt;br /&gt;
    print homologues[-1]&lt;br /&gt;
    print&lt;br /&gt;
&lt;br /&gt;
2BUO 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
1AUM 1.82482e-31&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW-TETLLVQNANPDCKTILKALGPGATLEE--TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNW TETLLVQNANPDCKTILKALGPGATLEE  TACQG&lt;br /&gt;
DIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPGATLEEMMTACQG&lt;br /&gt;
&lt;br /&gt;
.....&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Thu, 12 Aug 2010 04:44:12 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: /* Weeks 2-5 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. The weeks also include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June - 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to fit the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct BLAST from a structure object (e.g. protein.find_homoseq())&lt;br /&gt;
** Returns a list of (E-value, dictionary) tuples; the dictionary holds name, length of alignment, etc.&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Returns a list of (Z-score, dictionary) tuples; the dictionary holds name, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Review documentation and code, write tests for new functions ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Renumbering residues of a structure ===&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted to forgo reading SEQRES records to account for gaps. However, simply renumbering the residues sequentially from the ATOM records would lose the information on gaps. I therefore opted to subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are preserved. I also added an argument (start) that lets the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
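&lt;br /&gt;
The offset arithmetic described above can be sketched in plain Python, without Bio.PDB. The function name shift_numbering and its list-of-integers input are hypothetical and chosen only for illustration; this is not the code in the branch:&lt;br /&gt;
&lt;br /&gt;
```python
def shift_numbering(resseqs, start=1):
    """Shift residue numbers so the first one equals `start`, keeping gaps."""
    if not resseqs:
        return []
    # Hypothetical helper: one constant offset is applied to every residue
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Residues 1029 to 1031, then a gap, then 1035
print(shift_numbering([1029, 1030, 1031, 1035]))            # [1, 2, 3, 7]
print(shift_numbering([1029, 1030, 1031, 1035], start=10))  # [10, 11, 12, 16]
```
&lt;br /&gt;
Because every number is shifted by the same offset, the gap between 1031 and 1035 survives as the gap between 3 and 7.&lt;br /&gt;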
&lt;br /&gt;
   &lt;br /&gt;
===  Probe disulphide bridges in the structure ===&lt;br /&gt;
&lt;br /&gt;
The same rationale given for SEQRES applies to skipping the SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, which is overloaded to return the distance between two atoms (page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. The user can still set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator of tuples, each containing a pair of residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
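&lt;br /&gt;
The distance test itself can be illustrated without Bio.PDB. In this sketch the function find_ss_bonds, the dictionary input and the coordinates are all made up for illustration; only the 3.0 Å default comes from the text above:&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations
from math import dist
from operator import le  # le(a, b) is True when a is less than or equal to b


def find_ss_bonds(sg_atoms, threshold=3.0):
    """Yield pairs of residue numbers whose SG atoms lie within `threshold`.

    `sg_atoms` maps a cysteine residue number to the (x, y, z) of its SG atom.
    This is a hypothetical stand-in, not the search_ss_bonds implementation.
    """
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_atoms.items()), 2):
        if le(dist(xyz_a, xyz_b), threshold):
            yield res_a, res_b


# Made-up coordinates, for illustration only
atoms = {
    5: (0.0, 0.0, 0.0),
    55: (2.0, 0.0, 0.0),
    14: (10.0, 0.0, 0.0),
    38: (10.0, 2.05, 0.0),
    30: (20.0, 20.0, 20.0),
}
print(list(find_ss_bonds(atoms)))  # [(5, 55), (14, 38)]
```
&lt;br /&gt;
Comparing every pair of cysteines is quadratic in their number, which is usually acceptable because a protein contains few of them.&lt;br /&gt;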
&lt;br /&gt;
   &lt;br /&gt;
===  Extract Biological Unit ===&lt;br /&gt;
&lt;br /&gt;
Parsing for REMARK 350 was added to parse_pdb_header, since it already contained some code for another REMARK section. This extracts the transformation matrices and the translation vectors from the header, which are then fed to the Structure function. Each newly transformed copy of the structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance, and because NMR structures don't often have REMARK 350 (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
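&lt;br /&gt;
Each REMARK 350 operator is a 3x3 rotation matrix plus a translation vector, applied to every atom as rotation * xyz + translation. A minimal sketch of that step follows; apply_transformation is a hypothetical helper working on plain tuples, not the method used by build_biological_unit:&lt;br /&gt;
&lt;br /&gt;
```python
def apply_transformation(coords, rotation, translation):
    """Return rotation * xyz + translation for every (x, y, z) in `coords`."""
    transformed = []
    for xyz in coords:
        # One output component per matrix row: dot(row, xyz) + shift
        new_xyz = tuple(
            sum(r * c for r, c in zip(row, xyz)) + shift
            for row, shift in zip(rotation, translation)
        )
        transformed.append(new_xyz)
    return transformed


# Made-up operator: 180-degree rotation about z, then a shift of 5 along x
rotation = [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, 1.0)]
translation = (5.0, 0.0, 0.0)
print(apply_transformation([(1.0, 2.0, 3.0)], rotation, translation))
# [(4.0, -2.0, 3.0)]
```
&lt;br /&gt;
Applying the identity operator returns the input coordinates unchanged, which is why it can be skipped.&lt;br /&gt;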
&lt;br /&gt;
=== Hydrogenation of PDB files ===&lt;br /&gt;
&lt;br /&gt;
Following discussion with the mentors, we decided that it would be better to include not only a web server interface for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing with this server involved writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It doesn't support water molecules yet, or any molecules other than proteins. Hopefully these issues will be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow protonation of any structure, regardless of its nature (protein, DNA, etc.). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only; I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===  Coarse Grain Structure ===&lt;br /&gt;
&lt;br /&gt;
A center of mass function was developed first, as part of a new module, Bio.Struct.Geometry. It calculates either the center of geometry (all masses are treated as equal) or the center of mass (taking into account the elemental masses of the atoms). The masses are a new Atom object feature, derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
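&lt;br /&gt;
The difference between the two modes comes down to the weights used in the average. The sketch below shows this on plain tuples; weighted_center and its input format are assumptions made for illustration and are not the Bio.Struct.Geometry code:&lt;br /&gt;
&lt;br /&gt;
```python
def weighted_center(atoms, geometric=False):
    """Return the mass-weighted (or, if geometric, unweighted) mean position.

    `atoms` is a list of (mass, (x, y, z)) pairs (hypothetical input format).
    """
    weights = [1.0 if geometric else mass for mass, _ in atoms]
    total = sum(weights)
    return [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]


# Rounded masses for carbon and oxygen; the coordinates are made up
atoms = [(12.0, (0.0, 0.0, 0.0)), (16.0, (7.0, 0.0, 0.0))]
print(weighted_center(atoms))                  # [4.0, 0.0, 0.0]
print(weighted_center(atoms, geometric=True))  # [3.5, 0.0, 0.0]
```
&lt;br /&gt;
The heavier oxygen pulls the mass-weighted center towards itself, while the geometric center sits exactly halfway between the two atoms.&lt;br /&gt;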
&lt;br /&gt;
&lt;br /&gt;
As of now, 3 CG models are supported.&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
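&lt;br /&gt;
To illustrate the idea behind a three-point reduction, the sketch below keeps CA and O and replaces the side chain with a single bead. The helper three_point_residue, its dictionary input and the use of an unweighted centroid for the bead are assumptions made for illustration; the bead is named CMA only to mirror the output above:&lt;br /&gt;
&lt;br /&gt;
```python
BACKBONE = {"N", "CA", "C", "O"}


def three_point_residue(atoms):
    """Reduce {atom name: (x, y, z)} to CA, O and, if present, a side-chain bead."""
    beads = {name: atoms[name] for name in ("CA", "O")}
    side_chain = [xyz for name, xyz in atoms.items() if name not in BACKBONE]
    if side_chain:  # glycine has no side chain, so it gets no bead
        # Assumption: the bead sits at the unweighted centroid of the side chain
        beads["CMA"] = tuple(sum(axis) / len(side_chain) for axis in zip(*side_chain))
    return beads


# Made-up coordinates for a serine; the glycine reuses its backbone atoms
serine = {
    "N": (0.0, 0.0, 0.0), "CA": (1.0, 0.0, 0.0), "C": (2.0, 0.0, 0.0),
    "O": (2.0, 1.0, 0.0), "CB": (1.0, 1.0, 1.0), "OG": (3.0, 1.0, 1.0),
}
glycine = {name: xyz for name, xyz in serine.items() if name in BACKBONE}

print(sorted(three_point_residue(serine)))   # ['CA', 'CMA', 'O']
print(sorted(three_point_residue(glycine)))  # ['CA', 'O']
```
&lt;br /&gt;
This reproduces the pattern seen in the ENCAD_3P output, where glycine residues carry only CA and O.&lt;br /&gt;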
&lt;br /&gt;
=== Removal of disordered atoms ===&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and loosely based on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue, and a single Atom object is added at the location chosen by the user (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
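&lt;br /&gt;
The selection rule can be shown on a plain list of atoms. In this sketch keep_location and the (name, altloc) tuples are hypothetical; only the keep_loc argument and its default of A come from the description above:&lt;br /&gt;
&lt;br /&gt;
```python
def keep_location(atoms, keep_loc="A"):
    """Keep ordered atoms plus the `keep_loc` copy of each disordered atom.

    `atoms` is a list of (name, altloc) pairs; altloc is " " for ordered atoms.
    The kept copy is returned with a blank altloc, like an ordinary atom.
    """
    return [
        (name, " ")
        for name, altloc in atoms
        if altloc in (" ", keep_loc)
    ]


# A serine with two alternate positions for OG
serine = [("N", " "), ("CA", " "), ("CB", " "), ("OG", "A"), ("OG", "B")]
print(keep_location(serine))
# [('N', ' '), ('CA', ' '), ('CB', ' '), ('OG', ' ')]
```
&lt;br /&gt;
Passing keep_loc="B" would keep the second position of OG instead.&lt;br /&gt;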
&lt;br /&gt;
=== Current Development ===&lt;br /&gt;
&lt;br /&gt;
Unit tests are being developed for all of the features above. Some other features have already been developed and are being tested. Additionally, some unplanned features have been developed and tested (e.g. checking for missing atoms).&lt;br /&gt;
&lt;br /&gt;
Current work is focusing on the nomenclature normalisation and the homology features. We have established contact with the Dali server developers and are working towards a solution for the structural homology feature. However, since results take too long to be returned, it seems this would not be a valuable addition to the project.&lt;/div&gt;</description>
			<pubDate>Thu, 12 Aug 2010 04:29:48 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: /* Week 1 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser and supports several calculations on macromolecules (Neighbour Search, RMS), it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal, residue renaming (to account for the different existing nomenclatures) and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it would be more advantageous to support these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. Also, the weeks include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June - 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to fit the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct BLAST from a structure object (e.g. protein.find_homoseq())&lt;br /&gt;
** Returns a list of (E-value, dictionary) tuples; the dictionary holds name, length of alignment, etc.&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Returns a list of (Z-score, dictionary) tuples; the dictionary holds name, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Review documentation and code, write tests for new functions ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Renumbering residues of a structure ===&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted to forgo reading SEQRES records to account for gaps. However, simply renumbering the residues sequentially from the ATOM records would lose the information on gaps. I therefore opted to subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are preserved. I also added an argument (start) that lets the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
===  Probe disulphide bridges in the structure ===&lt;br /&gt;
&lt;br /&gt;
The same rationale given for SEQRES applies to skipping the SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, which is overloaded to return the distance between two atoms (page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. The user can still set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator of tuples, each containing a pair of residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
===  Extract Biological Unit ===&lt;br /&gt;
&lt;br /&gt;
Parsing for REMARK 350 was added to parse_pdb_header, since it already contained some code for another REMARK section. This extracts the transformation matrices and the translation vectors from the header, which are then fed to the Structure function. Each newly transformed copy of the structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance, and because NMR structures don't often have REMARK 350 (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Weeks 2-5 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Hydrogenation of PDB files ====&lt;br /&gt;
&lt;br /&gt;
Following discussion with the mentors, we decided it would be better to include not only a web server interface for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing with this server included writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It doesn't support water molecules yet, or any molecules other than proteins. These issues will hopefully be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow for protonation of any structure, regardless of its nature (protein, DNA, etc.). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only; I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
A center of mass function was developed first, as part of a new module, Bio.Struct.Geometry. It calculates either the center of geometry (all masses are equal) or the center of mass (taking into account the elemental masses of the atoms). The masses are a new Atom object feature, derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
As of now, 3 CG models are supported.&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and replaced by a single Atom object corresponding to the alternate location of the user's choice (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Current Development ===&lt;br /&gt;
&lt;br /&gt;
Unit tests are being developed for all of the features above. Some other features have already been developed and are being tested. Additionally, some unplanned features have been developed and tested (e.g. a check for missing atoms).&lt;br /&gt;
&lt;br /&gt;
Current work is focusing on the nomenclature normalization and the homology features. We have established contact with the Dali server developers and are working towards a solution for the structural homology feature. However, since results take too long to be returned, this does not seem to be a valuable addition to the project.&lt;/div&gt;</description>
			<pubDate>Thu, 12 Aug 2010 04:26:48 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: /* Project Progress */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved in Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecular simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. Also, each week includes documentation and unit testing efforts for its features, with extended periods for reviewing these efforts at two points during the project (halfway and final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighborSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK 350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June - 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt: c-alpha&lt;br /&gt;
** 2pt: c-alpha / c-beta&lt;br /&gt;
** 3pt: c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Test and consolidate the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns a list of (E-value, dictionary) tuples (name, length of alignment, etc.)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Returns a list of (Z-score, dictionary) tuples (name, etc.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighborSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Review documentation and code, write tests for new functions ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that allows protein-specific methods to be accessed. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Week 1 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted to forgo reading SEQRES records to account for gaps. However, simply renumbering sequentially based on ATOM records would lose the information on gaps. Instead, I subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are kept. I also added an argument (start) that lets the user set the number to start counting from.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
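&lt;br /&gt;
The offset rule described above can be sketched without Biopython. This is only an illustration: plain integers stand in for residue numbers, and renumber is a hypothetical helper, not the renumber_residues method from the branch.&lt;br /&gt;
&lt;br /&gt;
```python
def renumber(resseqs, start=1):
    """Shift residue numbers so the first one becomes `start`, keeping gaps."""
    if not resseqs:
        return []
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Made-up numbering with a gap between 1031 and 1035
original = [1029, 1030, 1031, 1035, 1036]
print(renumber(original))            # [1, 2, 3, 7, 8]
print(renumber(original, start=10))  # [10, 11, 12, 16, 17]
```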
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
The same rationale as for SEQRES applies to not looking up SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, since it has been overloaded to return the distance between two atoms (Page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. Still, the user can set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator that yields one tuple per bond, each containing the two residues involved.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
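&lt;br /&gt;
As a rough sketch of the pairwise distance test described above, assuming only the sulphur (SG) coordinates of each cysteine are available: the function name, the dictionary layout and the coordinates below are made up for illustration and are not the search_ss_bonds implementation.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from itertools import combinations


def search_ss_pairs(sg_coords, threshold=3.0):
    """Yield pairs of residue numbers whose SG atoms lie within `threshold` angstroms."""
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_coords.items()), 2):
        if threshold >= math.dist(xyz_a, xyz_b):
            yield (res_a, res_b)


# Made-up SG coordinates (angstroms), keyed by residue number
sg_coords = {
    5: (0.0, 0.0, 0.0),
    55: (2.0, 0.0, 0.0),
    14: (10.0, 0.0, 0.0),
    38: (10.0, 2.1, 0.0),
    30: (20.0, 20.0, 20.0),
}
print(list(search_ss_pairs(sg_coords)))  # [(5, 55), (14, 38)]
```
&lt;br /&gt;
Testing every pair is quadratic in the number of cysteines, which is usually small enough for this to be acceptable.&lt;br /&gt;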
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
Added parsing for REMARK 350 to parse_pdb_header, since there was already a bit written for another REMARK section. This extracts the transformation matrices and translation vectors from the header, which are then fed to the Structure function. Each new rotated structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance, and also because NMR structures rarely have REMARK 350 records (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
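&lt;br /&gt;
The idea of applying each operator to the coordinates and storing the result as a new model can be sketched as follows. This is a simplified illustration under stated assumptions: coordinates are plain tuples, the operators are hypothetical rather than parsed from a real REMARK 350 record, and build_models is not the build_biological_unit method itself.&lt;br /&gt;
&lt;br /&gt;
```python
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
NO_SHIFT = (0.0, 0.0, 0.0)


def transform(coords, rotation, translation):
    """Return the coordinates after applying x' = R*x + t to every point."""
    return [
        tuple(
            sum(r * c for r, c in zip(row, xyz)) + t
            for row, t in zip(rotation, translation)
        )
        for xyz in coords
    ]


def build_models(coords, operators):
    """Return the original coordinates plus one copy per non-identity operator."""
    models = [list(coords)]
    for rotation, translation in operators:
        if rotation == IDENTITY and translation == NO_SHIFT:
            continue  # the identity operator reproduces the input, so skip it
        models.append(transform(coords, rotation, translation))
    return models


# Hypothetical operators: identity, and a 180 degree rotation about z plus a shift along x
operators = [
    (IDENTITY, NO_SHIFT),
    (((-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, 1.0)), (5.0, 0.0, 0.0)),
]
models = build_models([(1.0, 2.0, 3.0)], operators)
print(len(models))  # 2
print(models[1])    # [(4.0, -2.0, 3.0)]
```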
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 2-5 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Hydrogenation of PDB files ====&lt;br /&gt;
&lt;br /&gt;
Following discussion with the mentors, we decided it would be better to include not only a web server interface for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing with this server included writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It doesn't support water molecules yet, or any molecules other than proteins. These issues will hopefully be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow for protonation of any structure, regardless of its nature (protein, DNA, etc.). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only; I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
A center of mass function was developed first, as part of a new module, Bio.Struct.Geometry. It calculates either the center of geometry (all masses are equal) or the center of mass (taking into account the elemental masses of the atoms). The masses are a new Atom object feature, derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
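&lt;br /&gt;
The difference between the two modes is just the weighting used in the average. Below is a minimal stand-alone sketch, assuming atoms are given as (element, coordinates) pairs and using a small table of rounded atomic masses; it mirrors the geometric flag shown above but is not the code in Bio.Struct.Geometry.&lt;br /&gt;
&lt;br /&gt;
```python
# Rounded atomic masses for a few elements (illustrative subset only)
MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def center_of_mass(atoms, geometric=False):
    """atoms: iterable of (element, (x, y, z)) pairs. Returns [x, y, z]."""
    atoms = list(atoms)
    # Equal weights give the center of geometry, elemental masses the center of mass
    weights = [1.0 if geometric else MASSES[element] for element, _ in atoms]
    total = sum(weights)
    return [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]


atoms = [("C", (0.0, 0.0, 0.0)), ("S", (2.0, 0.0, 0.0))]
print(center_of_mass(atoms, geometric=True))  # [1.0, 0.0, 0.0]
print(center_of_mass(atoms))  # x is pulled towards the heavier sulphur atom
```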
&lt;br /&gt;
&lt;br /&gt;
As of now, 3 CG models are supported.&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and replaced by a single Atom object corresponding to the alternate location of the user's choice (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
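&lt;br /&gt;
The keep_loc selection described above can be sketched without Bio.PDB. This is only an illustration written for Python 3: the function name remove_disordered and the dictionary layout (a nested dictionary standing in for a DisorderedAtom) are assumptions made for the example, not the code in the branch.&lt;br /&gt;
&lt;br /&gt;
```python
def remove_disordered(residue, keep_loc='A'):
    """Return (cleaned residue, names of the atoms that were disordered).

    `residue` maps atom names to coordinates. In this sketch a disordered
    atom is represented by a dict mapping alternate location to coordinates.
    """
    cleaned, modified = {}, []
    for name, atom in residue.items():
        if isinstance(atom, dict):
            # Keep only the chosen alternate location.
            cleaned[name] = atom[keep_loc]
            modified.append(name)
        else:
            cleaned[name] = atom
    return cleaned, modified


# Made-up serine fragment with a disordered OG atom.
residue = {
    'CA': (1.0, 1.0, 1.0),
    'OG': {'A': (2.0, 2.0, 2.0), 'B': (2.5, 2.0, 2.0)},
}
cleaned, modified = remove_disordered(residue)
print(cleaned, modified)
```

The sketch assumes that every disordered atom has the requested alternate location; a missing location would raise a KeyError.&lt;br /&gt;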
&lt;br /&gt;
=== Current Development ===&lt;br /&gt;
&lt;br /&gt;
Unit Tests are being developed for all these previous features. Some other features were already developed and are being tested. Additionally, other unplanned features have been developed and tested (check missing atoms).&lt;br /&gt;
&lt;br /&gt;
Current work is focusing on the Nomenclature Normalization and the Homology features. We have established contact with the Dali Server developers and are working towards a solution for the Structural Homology feature. However, since the results take too long to compute, this does not seem to be a valuable addition to the project.&lt;/div&gt;</description>
			<pubDate>Thu, 05 Aug 2010 01:33:24 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: /* Coarse Grain Structure */  Added MARTINI description&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. Each week also includes documentation and unit testing for its features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June - 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
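&lt;br /&gt;
The design choice behind as_protein() can be illustrated with two toy classes. This is a sketch under stated assumptions, written for Python 3: StructureSketch, ProteinSketch and the AMINO_ACIDS set are hypothetical names, and the body of search_ss_bonds is a placeholder rather than the real search.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical subset of residue names treated as protein in this sketch.
AMINO_ACIDS = {'ALA', 'CYS', 'GLY', 'ASP', 'PRO'}


class StructureSketch:
    """Toy stand-in for a structure: just a list of residue names."""

    def __init__(self, resnames):
        self.resnames = list(resnames)

    def renumber_residues(self):
        # Generic method, meaningful for any kind of molecule.
        return list(enumerate(self.resnames, start=1))

    def as_protein(self):
        # Keep only amino acid residues and return the protein-specific view.
        return ProteinSketch(r for r in self.resnames if r in AMINO_ACIDS)


class ProteinSketch(StructureSketch):
    """Adds methods that only make sense for proteins."""

    def search_ss_bonds(self):
        # Placeholder: only lists the positions of candidate cysteines.
        return [i for i, r in enumerate(self.resnames, start=1) if r == 'CYS']


s = StructureSketch(['CYS', 'HOH', 'GLY', 'CYS'])
prot = s.as_protein()
print(hasattr(s, 'search_ss_bonds'), hasattr(prot, 'search_ss_bonds'))
```

Because the protein view is a subclass, the generic methods stay available on it while the protein-only ones never appear on a plain structure.&lt;br /&gt;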
&lt;br /&gt;
=== Week 1 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, simply renumbering sequentially based on ATOM records would lose the information on gaps. I therefore subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are preserved. I also added an argument (start) to allow the user to set which number to start the counting from.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
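&lt;br /&gt;
The offset logic described above can be sketched on a plain list of residue numbers. This is a minimal Python 3 illustration; the function name renumber and the sample numbers are made up for the example and do not come from the branch.&lt;br /&gt;
&lt;br /&gt;
```python
def renumber(resseqs, start=1):
    """Shift residue numbers so that the first one becomes `start`.

    Every number is shifted by the same offset, so gaps in the
    original numbering are preserved.
    """
    if not resseqs:
        return []
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Made-up numbering with a gap between 1031 and 1035.
original = [1029, 1030, 1031, 1035, 1036]
print(renumber(original))
print(renumber(original, start=10))
```

Note that the gap of three missing residues is still visible after renumbering, which is the point of shifting instead of counting.&lt;br /&gt;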
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
The same rationale as for SEQRES applies to not looking up SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, since it has been overloaded to return the distance between two atoms (page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. Still, users can set their own threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator with tuples of pairs of residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
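&lt;br /&gt;
The pairwise distance test can be sketched without Bio.PDB. The snippet below is a Python 3 illustration only: find_ss_bonds, the (residue number, coordinates) tuples and the coordinates themselves are assumptions made for the example, and math.dist stands in for the overloaded minus operator on Atom objects.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations
from math import dist


def find_ss_bonds(sg_atoms, threshold=3.0):
    """Yield pairs of residue numbers whose SG atoms are within `threshold`.

    `sg_atoms` is a list of (residue number, (x, y, z)) tuples, one per
    cysteine; distances are in angstroms.
    """
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sg_atoms, 2):
        if threshold >= dist(xyz_a, xyz_b):
            yield (res_a, res_b)


# Made-up SG coordinates, not taken from a real structure.
sg_atoms = [
    (5, (0.0, 0.0, 0.0)),
    (55, (2.0, 0.0, 0.0)),
    (14, (10.0, 0.0, 0.0)),
    (38, (10.0, 2.05, 0.0)),
    (30, (20.0, 20.0, 20.0)),
]
print(list(find_ss_bonds(sg_atoms)))
```

Checking every pair is quadratic in the number of cysteines, which is usually small enough that a neighbour search is not needed.&lt;br /&gt;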
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
Added parsing of REMARK 350 to parse_pdb_header, since there was already some code written for another REMARK section. This extracts the transformation matrices and translation vectors from the header, which are then fed to the Structure function. Each new transformed structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance and also because NMR models don't have REMARK 350 that often (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
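&lt;br /&gt;
How a rotation matrix and a translation vector generate a new copy of the coordinates can be sketched as follows. This is an assumption-laden Python 3 illustration: apply_transformation, build_models and the sample operation are invented for the example, and plain lists stand in for atoms and models.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_transformation(coords, rotation, translation):
    """Return new coordinates x' = R.x + t for every point x."""
    moved = []
    for x in coords:
        rotated = [sum(r * c for r, c in zip(row, x)) for row in rotation]
        moved.append([v + t for v, t in zip(rotated, translation)])
    return moved


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def build_models(coords, transformations):
    """Return one transformed copy per non-identity transformation."""
    models = []
    for rotation, translation in transformations:
        if rotation == IDENTITY and not any(translation):
            continue  # the identity operation reproduces the input; skip it
        models.append(apply_transformation(coords, rotation, translation))
    return models


coords = [[1.0, 2.0, 3.0]]
# Made-up operation: half turn around z followed by a shift along x.
half_turn_z = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
transformations = [
    (IDENTITY, [0.0, 0.0, 0.0]),
    (half_turn_z, [10.0, 0.0, 0.0]),
]
print(build_models(coords, transformations))
```

Skipping the identity operation mirrors the behaviour noted in the example above, where a structure whose only operation is the identity reports zero processed transformations.&lt;br /&gt;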
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 2-5 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Hydrogenation of PDB files ====&lt;br /&gt;
&lt;br /&gt;
Following discussion between the mentors and me, we decided that it would be better to include not only a web server interface for this purpose but also a local algorithm. This way, the user is not limited when he/she lacks an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF Protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing with this server included writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It doesn't support water molecules yet, nor any molecules other than proteins. Such issues will hopefully be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow for protonation of any structure, regardless of its nature (protein, DNA, etc.). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only. I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
A Center of Mass function was developed first as part of a new module, Bio.Struct.Geometry. It allows for calculation of the center of geometry (all masses are equal) and the center of mass (taking into account elemental masses for the atoms). The masses are a new Atom object feature derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
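&lt;br /&gt;
The difference between the two centers comes down to the weights used in the average. A minimal Python 3 sketch follows; center_of_mass_sketch and the (mass, coordinates) tuples are hypothetical, and the two masses are rounded values chosen only to keep the arithmetic easy to follow.&lt;br /&gt;
&lt;br /&gt;
```python
def center_of_mass_sketch(atoms, geometric=False):
    """Return the weighted average position of `atoms`.

    `atoms` is a list of (mass, (x, y, z)) tuples. With geometric=True
    every atom gets the same weight, giving the center of geometry.
    """
    weights = [1.0 if geometric else mass for mass, _ in atoms]
    total = sum(weights)
    return [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]


# Two made-up atoms: a lighter one at the origin and a heavier one along x.
atoms = [(12.0, (0.0, 0.0, 0.0)), (16.0, (7.0, 0.0, 0.0))]
print(center_of_mass_sketch(atoms))
print(center_of_mass_sketch(atoms, geometric=True))
```

The mass-weighted center is pulled towards the heavier atom, while the geometric center sits exactly halfway between the two.&lt;br /&gt;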
&lt;br /&gt;
&lt;br /&gt;
As of now, 3 CG models are supported.&lt;br /&gt;
&lt;br /&gt;
1) CA-Trace&lt;br /&gt;
2) ENCAD 3-point model (CA, O, Side Chain bead)&lt;br /&gt;
3) MARTINI protein model (BB, Side Chain points [S1 to S4])&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_encad = p.coarse_grain('ENCAD_3P')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_encad.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&lt;br /&gt;
cg_martini = p.coarse_grain('MARTINI')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_martini.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;, &amp;lt;Atom S2&amp;gt;, &amp;lt;Atom S3&amp;gt;]&lt;br /&gt;
......&lt;br /&gt;
CYS [&amp;lt;Atom BB&amp;gt;, &amp;lt;Atom S1&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom BB&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and replaced by a single Atom object corresponding to the alternate location of the user's choice (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Thu, 05 Aug 2010 01:28:02 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: /* Hydrogenation of PDB files */ Added local implementation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. Each week also includes documentation and unit testing for its features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June - 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Week 1 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, simply renumbering sequentially from the ATOM records would lose the information on gaps. I therefore subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and the gaps are kept. I also added an argument (start) that lets the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
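&lt;br /&gt;
As a rough illustration of the offset logic described above (this is not the code in the branch), the shift can be sketched in plain Python 3. The function name renumber_sketch and the sample residue numbers are made up for this example.&lt;br /&gt;
&lt;br /&gt;
```python
def renumber_sketch(resseqs, start=1):
    """Shift residue numbers so the first one becomes `start`, keeping gaps."""
    if not resseqs:
        return []
    # Hypothetical helper: one constant offset is applied to every residue,
    # so the spacing between consecutive numbers (the gaps) is unchanged.
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Made-up numbering with a gap between 1031 and 1035
original = [1029, 1030, 1031, 1035, 1036]
print(renumber_sketch(original))            # [1, 2, 3, 7, 8]
print(renumber_sketch(original, start=10))  # [10, 11, 12, 16, 17]
```
&lt;br /&gt;
Because the same offset is applied everywhere, a gap of three missing residues in the input is still a gap of three in the output.&lt;br /&gt;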
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
The same rationale as for SEQRES applies to not looking up SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, which has been overloaded to return the distance between two atoms (Page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. The user can still set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator over tuples, each holding a pair of bonded residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
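&lt;br /&gt;
The distance test itself is simple. Below is a minimal Python 3 sketch of the idea, assuming the SG coordinates of each cysteine have already been collected into a dictionary; the function name find_ss_bonds_sketch and the coordinates are invented for illustration and are not taken from the branch or from 4PTI.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from itertools import combinations


def find_ss_bonds_sketch(sg_atoms, threshold=3.0):
    """Yield pairs of residue numbers whose SG atoms lie within `threshold`.

    sg_atoms maps a residue number to the (x, y, z) position of its SG atom.
    """
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_atoms.items()), 2):
        # Keep the pair only if the S-S distance does not exceed the cutoff
        if threshold >= math.dist(xyz_a, xyz_b):
            yield (res_a, res_b)


# Invented coordinates: two close pairs and one isolated cysteine
sg_atoms = {
    5: (0.0, 0.0, 0.0),
    55: (2.0, 0.0, 0.0),
    14: (10.0, 0.0, 0.0),
    38: (10.0, 2.1, 0.0),
    30: (20.0, 20.0, 20.0),
}
print(list(find_ss_bonds_sketch(sg_atoms)))  # [(5, 55), (14, 38)]
```
&lt;br /&gt;
Checking every pair is quadratic in the number of cysteines, which is fine for the handful found in a typical protein.&lt;br /&gt;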
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
I added parsing of REMARK 350 to parse_pdb_header, since it already contained some code for another REMARK section. This extracts the transformation matrices and the translation vectors from the header, which are then fed to the Structure function. Each newly transformed copy of the structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance, and also because NMR models don't have REMARK 350 that often (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
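&lt;br /&gt;
For readers unfamiliar with REMARK 350, each operator is a 3x3 matrix plus a translation vector, applied to every atom as x' = R.x + t. The plain Python 3 sketch below shows that single step only; the function name apply_transformation_sketch and the operator values are made up and are not the ones parsed from 1IHM.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_transformation_sketch(coords, rotation, translation):
    """Return the coordinates after x' = R.x + t.

    rotation is a 3x3 nested list, translation a list of three numbers.
    """
    transformed = []
    for x, y, z in coords:
        # One output component per matrix row, plus the matching shift
        transformed.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + shift
            for row, shift in zip(rotation, translation)
        ))
    return transformed


# Invented operator: 90 degree rotation about z, then a shift of 5 along x
rotation = [[0.0, -1.0, 0.0],
            [1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0]]
translation = [5.0, 0.0, 0.0]
coords = [(1.0, 0.0, 0.0), (0.0, 2.0, 3.0)]
print(apply_transformation_sketch(coords, rotation, translation))
# [(5.0, 1.0, 0.0), (3.0, 0.0, 3.0)]
```
&lt;br /&gt;
Applying one such operator to a copy of the original coordinates gives one additional copy of the structure.&lt;br /&gt;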
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 2-5 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Hydrogenation of PDB files ====&lt;br /&gt;
&lt;br /&gt;
Following discussion between the mentors and me, we decided that it would be better to include not only a web server interface for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF Protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing this server included writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It doesn't support water molecules yet, nor any molecules other than proteins. Hopefully, such issues will be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow for protonation of any structure, regardless of its nature (protein, DNA, etc). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module. This first implementation works for proteins only. I'll add general molecule support later.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
from Bio.Struct import Hydrogenate as H&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('1ctf.pdb')&lt;br /&gt;
p = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
prot = H.Hydrogenate_Protein()&lt;br /&gt;
prot.add_hydrogens(p)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
A Center of Mass function was developed first, as part of a new module, Bio.Struct.Geometry. It calculates either the center of geometry (all masses are equal) or the center of mass (taking into account the elemental masses of the atoms). The masses are a new Atom object feature derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
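&lt;br /&gt;
The difference between the two centers is only in the weights. Here is a minimal Python 3 sketch of the arithmetic, assuming the atoms are given as (mass, coordinates) pairs; the name center_of_mass_sketch and the two-atom input with round masses are invented for illustration and do not reproduce the Bio.Struct.Geometry code.&lt;br /&gt;
&lt;br /&gt;
```python
def center_of_mass_sketch(atoms, geometric=False):
    """Return the weighted mean position of (mass, (x, y, z)) pairs.

    With geometric=True every atom gets weight 1, giving the center of geometry.
    """
    atoms = list(atoms)
    weights = [1.0 if geometric else mass for mass, _ in atoms]
    total = sum(weights)
    return [
        sum(w * xyz[axis] for w, (_, xyz) in zip(weights, atoms)) / total
        for axis in range(3)
    ]


# Invented two-atom system with round masses to keep the numbers easy to follow
atoms = [(12.0, (0.0, 0.0, 0.0)), (4.0, (4.0, 0.0, 0.0))]
print(center_of_mass_sketch(atoms))                  # [1.0, 0.0, 0.0]
print(center_of_mass_sketch(atoms, geometric=True))  # [2.0, 0.0, 0.0]
```
&lt;br /&gt;
The heavier atom pulls the center of mass towards itself, while the center of geometry sits midway between the two atoms.&lt;br /&gt;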
&lt;br /&gt;
Coarse-graining makes use of this function when reducing each residue to 3 points (ENCAD-like): Ca, peptide-bond O, and the center of mass of the side chain. The default CG function reduces the protein to a Ca-trace. I am still working on other CG representations, such as the one used by CABS.&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_3pt = p.coarse_grain('3pt')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_3pt.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and a single Atom object is added, corresponding to the location of the user's choice (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Fri, 02 Jul 2010 18:52:20 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: Week 3 and 4 Update.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. Also, the weeks include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Getting familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June- 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term Evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that exposes the protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Week 1 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, simply renumbering sequentially from the ATOM records would lose the information on gaps. I therefore subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and the gaps are kept. I also added an argument (start) that lets the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
The same rationale as for SEQRES applies to not looking up SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, which has been overloaded to return the distance between two atoms (Page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. The user can still set a different threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator over tuples, each holding a pair of bonded residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
I added parsing of REMARK 350 to parse_pdb_header, since it already contained some code for another REMARK section. This extracts the transformation matrices and the translation vectors from the header, which are then fed to the Structure function. Each newly transformed copy of the structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance, and also because NMR models don't have REMARK 350 that often (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 2-5 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Hydrogenation of PDB files ====&lt;br /&gt;
&lt;br /&gt;
Following discussion between the mentors and me, we decided that it would be better to include not only a web server interface for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
The interface for the WHATIF Protonation service has been implemented, although it should be regarded as '''highly experimental''' for now. Interfacing this server included writing a small parser for a PDBXML-like format, which is expected to have serious bugs in its initial versions. I ran some simple tests and it works. It doesn't support water molecules yet, nor any molecules other than proteins. Hopefully, such issues will be solved later on.&lt;br /&gt;
&lt;br /&gt;
For those brave enough to want to test it (and help me debug it), here's an example usage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.WWW import WHATIF&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
server = WHATIF.WHATIF() # Performs a sort of PING to the server. Gracefully exits if the servers are down.&lt;br /&gt;
&lt;br /&gt;
# Get the protein structure&lt;br /&gt;
structure = Struct.read('4PTI.pdb')&lt;br /&gt;
protein = structure.as_protein() # This excludes water molecules&lt;br /&gt;
&lt;br /&gt;
# Upload the structure to the WHATIF server&lt;br /&gt;
# This should convert the structure from a Structure object to a string via tempfile and PDBIO&lt;br /&gt;
# I was having some issues uploading structures...&lt;br /&gt;
&lt;br /&gt;
id = server.UploadPDB(protein)&lt;br /&gt;
&lt;br /&gt;
# Protonate&lt;br /&gt;
# Returns a Structure Object / WARNING! Bug prone for now.&lt;br /&gt;
&lt;br /&gt;
protein_h = server.PDBasXMLwithSymwithPolarH(id)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Regarding the local implementation, after much reading I settled on using PyMOL's algorithm. It seems to allow for protonation of any structure, regardless of its nature (protein, DNA, etc). Its vector and matrix operations can likely be optimized with NumPy and Biopython's Vector.py module.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
A Center of Mass function was developed first, as part of a new module, Bio.Struct.Geometry. It calculates either the center of geometry (all masses are equal) or the center of mass (taking into account the elemental masses of the atoms). The masses are a new Atom object feature derived from [http://www.chem.qmul.ac.uk/iupac/AtWt/ this list] and from PyMOL. Essentially, all atoms of a structure now get their mass defined when the Structure is created (check Atom.py and [http://lists.open-bio.org/pipermail/biopython-dev/2010-June/007880.html this thread] for details). This is obviously experimental.&lt;br /&gt;
&lt;br /&gt;
To calculate the center of mass of any Entity (Structure, Model, Chain, Residue) or a List of Atoms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.Struct.Geometry import center_of_mass&lt;br /&gt;
from Bio import Struct&lt;br /&gt;
&lt;br /&gt;
s = Struct.read('4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
print center_of_mass.__doc__&lt;br /&gt;
&lt;br /&gt;
    Returns gravitic or geometric center of mass of an Entity.&lt;br /&gt;
    Geometric assumes all masses are equal (geometric=True)&lt;br /&gt;
    Defaults to Gravitic.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s)&lt;br /&gt;
[14.833301303933874, 21.431581746366263, 4.1218478418007134]&lt;br /&gt;
&lt;br /&gt;
print center_of_mass(s, geometric=True)&lt;br /&gt;
[14.805324902127458, 21.365571977563405, 4.1108949403803985]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Coarse-graining makes use of this function when reducing each residue to 3 points (ENCAD-like): Ca, peptide-bond O, and the center of mass of the side chain. The default CG function reduces the protein to a Ca-trace. I am still working on other CG representations, such as the one used by CABS.&lt;br /&gt;
&lt;br /&gt;
An example, picking up the s Structure from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
p = s.as_protein() # To expose the CG method&lt;br /&gt;
&lt;br /&gt;
ca_trace = p.coarse_grain()&lt;br /&gt;
&lt;br /&gt;
# One atom per residue&lt;br /&gt;
print ( len(list(p.get_residues())) == len(list(ca_trace.get_atoms())) )&lt;br /&gt;
True&lt;br /&gt;
&lt;br /&gt;
cg_3pt = p.coarse_grain('3pt')&lt;br /&gt;
&lt;br /&gt;
for residue in cg_3pt.get_residues():&lt;br /&gt;
  print residue.resname, residue.child_list&lt;br /&gt;
&lt;br /&gt;
ARG [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
ASP [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PHE [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
LEU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLU [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
PRO [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
TYR [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
.....&lt;br /&gt;
CYS [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
GLY [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;]&lt;br /&gt;
ALA [&amp;lt;Atom CA&amp;gt;, &amp;lt;Atom O&amp;gt;, &amp;lt;Atom CMA&amp;gt;]&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
Implemented as part of Structure.py and based loosely on the [http://www.biopython.org/wiki/Remove_PDB_disordered_atoms contribution of Ramon Crehuet]. The DisorderedAtom objects are removed from the residue and a single Atom object is added, corresponding to the location of the user's choice (keep_loc argument), which defaults to A.&lt;br /&gt;
&lt;br /&gt;
An example, still keeping s from above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
s = s.remove_disordered_atoms(verbose=True)&lt;br /&gt;
0 residues were modified&lt;br /&gt;
&lt;br /&gt;
# Now if we load a structure with disordered atoms&lt;br /&gt;
ds = Struct.read('1MC2.pdb')&lt;br /&gt;
ds.remove_disordered_atoms(verbose=True)&lt;br /&gt;
Residue TRP:1010 has 8 disordered atoms: CD1/CD2/NE1/CE2/CE3/CZ2/CZ3/CH2&lt;br /&gt;
Residue VAL:1018 has 3 disordered atoms: CB/CG1/CG2&lt;br /&gt;
Residue LEU:1024 has 4 disordered atoms: CB/CG/CD1/CD2&lt;br /&gt;
Residue ARG:1043 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue MET:1092 has 4 disordered atoms: CB/CG/SD/CE&lt;br /&gt;
Residue ARG:1107 has 7 disordered atoms: CB/CG/CD/NE/CZ/NH1/NH2&lt;br /&gt;
Residue GLU:1108 has 4 disordered atoms: CG/CD/OE1/OE2&lt;br /&gt;
Residue ASP:1111 has 4 disordered atoms: CB/CG/OD1/OD2&lt;br /&gt;
Residue SER:1116 has 1 disordered atoms: OG&lt;br /&gt;
Residue SER:1131 has 1 disordered atoms: O&lt;br /&gt;
10 residues were modified&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Fri, 25 Jun 2010 22:12:18 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: Weeks 2 and 3. Hydrogenation Discussion and Coarse Graining.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. Also, the weeks include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June- 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build a conversion table from the different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to fit the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Week 1 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, ignoring this information and renumbering sequentially based on ATOM records would lose the information on gaps. Instead, I subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are preserved. I also added an argument (start) to let the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
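&lt;br /&gt;
The offset arithmetic described above can be sketched in plain Python. This is only an illustration under stated assumptions: the renumber helper and the residue numbers below are hypothetical and are not the actual renumber_residues code in the branch.&lt;br /&gt;
&lt;br /&gt;
```python
def renumber(resseqs, start=1):
    # Shift every residue number by the same offset so that the first
    # residue becomes `start`. Gaps in the numbering are preserved.
    # Hypothetical helper, not part of Bio.PDB.
    if not resseqs:
        return []
    offset = resseqs[0] - start
    return [number - offset for number in resseqs]


# Made-up numbering in which residues 1031 and 1032 are missing.
original = [1029, 1030, 1033, 1034]
print(renumber(original))            # [1, 2, 5, 6]
print(renumber(original, start=10))  # [10, 11, 14, 15]
```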
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
The same rationale as for SEQRES applies to not looking up SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, since it has been overloaded to return the distance between two atoms (Page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. Still, users can set their own threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator with tuples of pairs of residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
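&lt;br /&gt;
The pairwise distance test can be sketched without Bio.PDB as follows. The search_ss_pairs helper and the coordinates are hypothetical and chosen only for illustration; with real Atom objects, the distance would come from the overloaded minus operator (atom_a - atom_b) rather than from math.dist.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from itertools import combinations


def search_ss_pairs(sg_coords, threshold=3.0):
    # Yield pairs of residue numbers whose SG atoms lie within
    # `threshold` angstroms of each other.
    # Hypothetical helper, not the search_ss_bonds method itself.
    residues = sorted(sg_coords.items())
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(residues, 2):
        if threshold >= math.dist(xyz_a, xyz_b):
            yield (res_a, res_b)


# Made-up SG coordinates keyed by residue number (not taken from 4PTI).
sg = {
    5: (0.0, 0.0, 0.0),
    55: (2.0, 0.0, 0.0),
    14: (10.0, 0.0, 0.0),
    38: (10.0, 2.1, 0.0),
    30: (20.0, 20.0, 20.0),
}
print(list(search_ss_pairs(sg)))  # [(5, 55), (14, 38)]
```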
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
I added parsing of REMARK 350 to parse_pdb_header, since some code had already been written there for another REMARK section. This extracts the transformation matrices and the translation vector from the header, which are then fed to the Structure function. Each new rotated structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance and also because NMR models don't have REMARK 350 that often (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
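&lt;br /&gt;
Each REMARK 350 operator is a 3x3 rotation matrix plus a translation vector, applied to every atom as R * x + t. A minimal sketch of that step in plain Python is given here; apply_transformation and the example matrix are hypothetical and do not reproduce the build_biological_unit implementation.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_transformation(coords, rotation, translation):
    # Return R * x + t for every (x, y, z) tuple in `coords`.
    # Hypothetical helper, not part of Bio.PDB.
    transformed = []
    for xyz in coords:
        rotated = [sum(r * c for r, c in zip(row, xyz)) for row in rotation]
        transformed.append(tuple(v + t for v, t in zip(rotated, translation)))
    return transformed


# A 180-degree rotation about the z axis followed by a shift of 5 along x.
rotation = [[-1.0, 0.0, 0.0],
            [0.0, -1.0, 0.0],
            [0.0, 0.0, 1.0]]
translation = [5.0, 0.0, 0.0]

print(apply_transformation([(1.0, 2.0, 3.0)], rotation, translation))
# [(4.0, -2.0, 3.0)]
```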
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 2 and 3 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Hydrogenation of PDB files ====&lt;br /&gt;
&lt;br /&gt;
Following discussion between the mentors and me, we decided that it was better to include not only a web server for this purpose but also a local algorithm, so that users are not limited when they lack an internet connection.&lt;br /&gt;
&lt;br /&gt;
Our webserver of choice was WHATIF. The simplicity of its access justifies its choice, together with the stability and proven results of the method. The service we want to implement is available as a test webservice via a REST or SOAP interface. Access is as simple as this, via REST:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
#!/usr/bin/python&lt;br /&gt;
&lt;br /&gt;
import urllib&lt;br /&gt;
import xml.dom.minidom&lt;br /&gt;
&lt;br /&gt;
# read a local PDB file&lt;br /&gt;
f = open('../pdb1crn.ent', 'r')&lt;br /&gt;
data = f.read()&lt;br /&gt;
f.close()&lt;br /&gt;
&lt;br /&gt;
# now upload this PDB file to the What IF webservice&lt;br /&gt;
f = urllib.urlopen(&amp;quot;http://www.cmbi.ru.nl/wiwsd/rest/UploadPDB&amp;quot;, data)&lt;br /&gt;
x = xml.dom.minidom.parse(f)&lt;br /&gt;
id = x.getElementsByTagName(&amp;quot;response&amp;quot;)[0].childNodes[0].data&lt;br /&gt;
&lt;br /&gt;
# Call a what-if function, we use SymmetryContact as an example&lt;br /&gt;
f = urllib.urlopen(&amp;quot;http://www.cmbi.ru.nl/wiwsd/rest/SymmetryContact/id/&amp;quot; + id)&lt;br /&gt;
x = xml.dom.minidom.parse(f)&lt;br /&gt;
&lt;br /&gt;
# and now we have the data, print out a simple list&lt;br /&gt;
for node in x.getElementsByTagName(&amp;quot;response&amp;quot;):&lt;br /&gt;
	nr = node.getElementsByTagName(&amp;quot;number&amp;quot;)[0].childNodes[0].data&lt;br /&gt;
	cnt = node.getElementsByTagName(&amp;quot;contact_count&amp;quot;)[0].childNodes[0].data&lt;br /&gt;
	print nr + &amp;quot;\t&amp;quot; + cnt&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We are still discussing the technical details of the implementation.&lt;br /&gt;
&lt;br /&gt;
Regarding the local algorithm, several approaches were studied: [http://zhang.bioinformatics.ku.edu/HAAD/ HAAD], [http://manual.gromacs.org/current/online/protonate.html GROMACS], [http://pymolwiki.org/index.php/H_Add PyMOL], and [http://sourcesup.cru.fr/projects/mmtk/ MMTK]. All of them add hydrogens based on geometrical criteria, with HAAD doing some sort of pseudo-minimization to get the atoms in the right orientation. Although I think this is a valuable strategy, it is a bit beyond the scope of this project. Furthermore, the only way I know of getting perfectly good orientations is under a real force field. Therefore, I believe that adding hydrogens based on geometrical constraints is a simple yet good addition to Biopython.&lt;br /&gt;
&lt;br /&gt;
I am still studying these algorithms to try and replicate their efficiency (I don't want to reinvent the wheel). Since some of the code is GPL'ed, it can be directly associated with our library.&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
This feature has been written and implemented for proteins only. It is based on ENCAD's coarse-graining strategy (3 points per residue), and I'm still adding the CABS structure too. I wrote a center of mass function that is independent of any Entity subclass and can therefore be useful for other purposes. We are deciding where to place this code. I will add an example when the code is mature enough.&lt;/div&gt;</description>
			<pubDate>Thu, 17 Jun 2010 20:39:16 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: Added progress&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these packages' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below was organised to be flexible, which means that some features will likely be done early. Also, the weeks include documentation and unit testing efforts for the features, with extended periods for reviewing these efforts at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK350 contains rotation and translation information&lt;br /&gt;
*If REMARK is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June- 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build a conversion table from the different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1pt*c-alpha&lt;br /&gt;
** 2pt*c-alpha / c-beta&lt;br /&gt;
** 3pt*c-alpha / c-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Testing and consolidating the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to fit the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns list of tuples with E-Value *Dictionary (name, length of alignment, etc..)&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Return list of tuples with Z-Score*Dictionary (name, etc...)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via NeighbourSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via ResidueDepth Class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Reviewing documentation, code, write tests for new functions. ====&lt;br /&gt;
&lt;br /&gt;
== Project Code ==&lt;br /&gt;
&lt;br /&gt;
Hosted at [http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 this GitHub branch]&lt;br /&gt;
&lt;br /&gt;
== Project Progress ==&lt;br /&gt;
&lt;br /&gt;
Since I'm adding some methods that are useful/logical only for proteins, having them exposed in Structure.py for every molecule could be misleading. We therefore decided to add an 'as_protein()' method that gives access to protein-specific methods. The following example demonstrates how this call works. Note how the &amp;quot;search_ss_bonds&amp;quot; method is absent from dir(s) but present in dir(prot).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
dir(s)&lt;br /&gt;
# Cut for viewing purposes&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'set_parent', 'xtra']&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
dir(prot)&lt;br /&gt;
&lt;br /&gt;
['__doc__', ... , 'renumber_residues', 'search_ss_bonds', 'set_parent', 'xtra']&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Week 1 ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
Since parse_pdb_header is far from optimal and is likely to change in the future, I opted not to read SEQRES records to account for gaps. However, ignoring this information and renumbering sequentially based on ATOM records would lose the information on gaps. Instead, I subtract (first residue number - 1) from every residue number, so that the numbering starts at 1 and gaps are preserved. I also added an argument (start) to let the user set the number from which counting starts.&lt;br /&gt;
&lt;br /&gt;
Example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '1IHM.pdb')&lt;br /&gt;
&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1029 icode= &amp;gt;&lt;br /&gt;
&lt;br /&gt;
s.renumber_residues()&lt;br /&gt;
print list(s.get_residues())[0]&lt;br /&gt;
&amp;lt;Residue ASP het=  resseq=1 icode= &amp;gt;&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
The same rationale as for SEQRES applies to not looking up SSBOND records. Also, instead of using NeighborSearch to look for pairs of cysteines within bonding distance, I used the minus operator, since it has been overloaded to return the distance between two atoms (Page 10 of the [http://www.biopython.org/DIST/docs/cookbook/biopdb_faq.pdf FAQ]). The average distance cited in the literature is 2.05 Å, but other software packages and my own tests suggest 3.0 Å as a good threshold. Still, users can set their own threshold manually.&lt;br /&gt;
&lt;br /&gt;
The function returns an iterator with tuples of pairs of residues.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
s = p.get_structure('example', '4PTI.pdb')&lt;br /&gt;
&lt;br /&gt;
prot = s.as_protein()&lt;br /&gt;
&lt;br /&gt;
for bond in prot.search_ss_bonds():&lt;br /&gt;
  print bond&lt;br /&gt;
&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=5 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=55 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=14 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=38 icode= &amp;gt;)&lt;br /&gt;
(&amp;lt;Residue CYS het=  resseq=30 icode= &amp;gt;, &amp;lt;Residue CYS het=  resseq=51 icode= &amp;gt;)&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
I added parsing of REMARK 350 to parse_pdb_header, since some code had already been written there for another REMARK section. This extracts the transformation matrices and the translation vector from the header, which are then fed to the Structure function. Each new rotated structure is created as a new MODEL. I chose this because crystal structures very rarely have more than one MODEL instance and also because NMR models don't have REMARK 350 that often (at least to my knowledge).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;python&amp;gt;&lt;br /&gt;
from Bio.PDB import PDBParser&lt;br /&gt;
&lt;br /&gt;
p = PDBParser()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
s1 = p.get_structure('a', '4PTI.pdb')&lt;br /&gt;
s1.build_biological_unit()&lt;br /&gt;
'Processed 0 transformations on the structure.' # Identity matrix is ignored.&lt;br /&gt;
&lt;br /&gt;
s2 = p.get_structure('b', 'homol_1bd8.pdb') # A homology model&lt;br /&gt;
s2.build_biological_unit()&lt;br /&gt;
'PDB File lacks appropriate REMARK 350 entries to build Biological Unit.'&lt;br /&gt;
&lt;br /&gt;
s3 = p.get_structure('c', '1IHM.pdb')&lt;br /&gt;
s3.build_biological_unit()&lt;br /&gt;
'Processed 59 transformations on the structure.'&lt;br /&gt;
&amp;lt;/python&amp;gt;&lt;/div&gt;</description>
			<pubDate>Fri, 11 Jun 2010 16:01:15 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract	==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these programs' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below is organised to be flexible, so some features will likely be done early. Each week includes documentation and unit testing for its features, with extended review periods at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via the NeighborSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
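&lt;br /&gt;
A minimal sketch of the distance-based probe, in plain Python rather than through Bio.PDB. The input tuples, the function name find_disulphides, and the 3.0 angstrom cutoff are illustrative assumptions, not part of the proposal or of Biopython's API.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from itertools import combinations
from operator import le  # le(a, b) is the function form of "a is at most b"


def find_disulphides(sg_atoms, cutoff=3.0):
    """Return pairs of cysteine SG atoms lying within `cutoff` angstrom.

    sg_atoms is a hypothetical input format: one
    (chain, residue number, (x, y, z)) tuple per cysteine SG atom.
    """
    bridges = []
    for a, b in combinations(sg_atoms, 2):
        dist = math.dist(a[2], b[2])
        if le(dist, cutoff):
            bridges.append((a[:2], b[:2], round(dist, 2)))
    return bridges


# Made-up coordinates: only the first two SG atoms are close enough.
sg_atoms = [
    ("A", 5, (0.0, 0.0, 0.0)),
    ("A", 55, (2.0, 0.0, 0.0)),
    ("A", 30, (10.0, 10.0, 10.0)),
]
bridges = find_disulphides(sg_atoms)
```
&lt;br /&gt;
In Bio.PDB the pairwise loop would presumably be replaced by a neighbour search over the cysteine SG atoms, and SSBOND records from the header could be used to confirm the hits.&lt;br /&gt;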
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK 350 contains rotation and translation information&lt;br /&gt;
*If REMARK 350 is absent, do nothing.&lt;br /&gt;
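&lt;br /&gt;
The idea can be sketched as follows, assuming the BIOMT rows of REMARK 350 are whitespace-separated (real PDB files are column-based, so this is a simplification). The function names parse_biomt and apply_transform are hypothetical and are not Bio.PDB methods.&lt;br /&gt;
&lt;br /&gt;
```python
def parse_biomt(lines):
    """Collect REMARK 350 BIOMT rows into {serial: [row1, row2, row3]}.

    Each row holds four floats: three rotation terms and one translation term.
    An input without REMARK 350 records yields an empty dictionary.
    """
    matrices = {}
    for line in lines:
        fields = line.split()
        if (
            len(fields) == 8
            and fields[:2] == ["REMARK", "350"]
            and fields[2].startswith("BIOMT")
        ):
            serial = int(fields[3])
            row = [float(value) for value in fields[4:8]]
            matrices.setdefault(serial, []).append(row)
    return matrices


def apply_transform(coord, matrix):
    """Rotate and then translate one (x, y, z) coordinate."""
    return tuple(
        sum(r * c for r, c in zip(row[:3], coord)) + row[3] for row in matrix
    )


# Made-up records: a two-fold rotation about z plus a shift of 10 along x.
pdb_lines = [
    "REMARK 350   BIOMT1   2 -1.000000  0.000000  0.000000       10.00000",
    "REMARK 350   BIOMT2   2  0.000000 -1.000000  0.000000        0.00000",
    "REMARK 350   BIOMT3   2  0.000000  0.000000  1.000000        0.00000",
]
matrices = parse_biomt(pdb_lines)
moved = apply_transform((1.0, 2.0, 3.0), matrices[2])
```
&lt;br /&gt;
Building the full biological unit would then mean applying every parsed matrix to a copy of the relevant chains.&lt;br /&gt;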
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June - 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1 point: C-alpha&lt;br /&gt;
** 2 points: C-alpha / C-beta&lt;br /&gt;
** 3 points: C-alpha / C-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
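&lt;br /&gt;
As an illustration of the three-point option with a plain side-chain centroid, here is a small sketch in pure Python. The dictionary input, the bead label SC, and the function names are assumptions made for this example; they do not come from Bio.PDB.&lt;br /&gt;
&lt;br /&gt;
```python
BACKBONE = {"N", "CA", "C", "O"}


def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) tuples."""
    count = len(coords)
    return tuple(sum(axis) / count for axis in zip(*coords))


def coarse_grain(residue_atoms):
    """Reduce one residue to up to three beads: CA, CB and a side-chain centroid.

    residue_atoms is a hypothetical mapping of atom name to (x, y, z).
    Residues without side-chain atoms (glycine) keep only the CA bead.
    """
    beads = {"CA": residue_atoms["CA"]}
    if "CB" in residue_atoms:
        beads["CB"] = residue_atoms["CB"]
    side_chain = [
        xyz for name, xyz in residue_atoms.items() if name not in BACKBONE
    ]
    if side_chain:
        beads["SC"] = centroid(side_chain)
    return beads


# Made-up coordinates for a serine-like residue.
serine = {
    "N": (0.0, 0.0, 0.0),
    "CA": (1.0, 0.0, 0.0),
    "C": (2.0, 0.0, 0.0),
    "O": (2.0, -1.0, 0.0),
    "CB": (1.0, 1.0, 0.0),
    "OG": (3.0, 1.0, 2.0),
}
beads = coarse_grain(serine)
```
&lt;br /&gt;
A pseudo-centroid variant would replace the arithmetic mean with whatever representative point the chosen coarse-grained model defines.&lt;br /&gt;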
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Test and consolidate the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns a list of (E-value, dictionary) tuples, where the dictionary holds the name, length of alignment, etc.&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Returns a list of (Z-score, dictionary) tuples, where the dictionary holds the name, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via the NeighborSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via the ResidueDepth class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Review documentation and code; write tests for new functions ====&lt;/div&gt;</description>
			<pubDate>Thu, 20 May 2010 01:23:05 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>User:Joaor</title>
			<link>http://biopython.org/wiki/User:Joaor</link>
			<guid isPermaLink="false">http://biopython.org/wiki/User:Joaor</guid>
			<description>&lt;p&gt;Joaor: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Hello :)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''' About Me '''&lt;br /&gt;
&lt;br /&gt;
This is the Wiki page of João Rodrigues, currently doing research in protein structure refinement with [http://csb.stanford.edu Prof. Michael Levitt] at Stanford University. I have previously worked with [http://haddock.chem.uu.nl Prof. Alexandre Bonvin] in protein-protein docking at Utrecht University.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''' Contributions to Biopython '''&lt;br /&gt;
&lt;br /&gt;
I have been a member of the Biopython community since 2007 but only became an active participant with the GSOC 2010 programme. My project is on the enrichment of the Bio.PDB module with several features that will hopefully make it more attractive and useful to the structural biology community.&lt;br /&gt;
&lt;br /&gt;
A detailed layout of my project is to be presented [[GSOC2010_Joao|elsewhere]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[http://stanford.edu/~joaor My personal page]  ||  Last updated: 05.2010&lt;/div&gt;</description>
			<pubDate>Wed, 19 May 2010 15:13:37 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/User_talk:Joaor</comments>		</item>
		<item>
			<title>GSOC2010 Joao</title>
			<link>http://biopython.org/wiki/GSOC2010_Joao</link>
			<guid isPermaLink="false">http://biopython.org/wiki/GSOC2010_Joao</guid>
			<description>&lt;p&gt;Joaor: Creation&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Author &amp;amp; Mentors == &lt;br /&gt;
&lt;br /&gt;
[[User:Joaor|João Rodrigues]] anaryin@gmail.com&lt;br /&gt;
&lt;br /&gt;
'''Mentors'''&lt;br /&gt;
: Eric Talevich&lt;br /&gt;
: Diana Jaunzeikare&lt;br /&gt;
: Peter Cock&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Abstract ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Biopython is a very popular library in Bioinformatics and Computational Biology. Its Bio.PDB module, originally developed by Thomas Hamelryck, is a simple yet powerful tool for structural biologists. Although it provides a reliable PDB parser feature and it allows several calculations (Neighbour Search, RMS) to be made on macromolecules, it still lacks a number of features that are part of a researcher's daily routine. Probing for disulphide bridges in a structure and adding polar hydrogen atoms accordingly are two examples that can be incorporated in Bio.PDB, given the module's clever structure and good overall organisation. Cosmetic operations such as chain removal and residue renaming – to account for the different existing nomenclatures – and renumbering would also be greatly appreciated by the community.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Another aspect that can be improved for Bio.PDB is a smooth integration/interaction layer for heavyweights in macromolecule simulation such as MODELLER, GROMACS, AutoDock, and HADDOCK. It could be argued that the easiest solution would be to code hooks to these packages' functions and routines. However, projects such as the recently developed [http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html edPDB] or the more complete [http://biskit.pasteur.fr/ Biskit library] render, in my opinion, such interfacing efforts redundant. Instead, I believe it to be more advantageous to include these programs' input/output formats in Biopython's SeqIO and AlignIO modules. This, together with the creation of interfaces for model validation/structure checking services/software, would allow Biopython to be used as a pre- and post-simulation tool. Eventually, it would pave the way for its inclusion in pipelines and workflows for structure modelling, molecular dynamics, and docking simulations.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Schedule ==&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
The schedule below is organised to be flexible, so some features will likely be done early. Each week includes documentation and unit testing for its features, with extended review periods at two points during the project (halfway and in the final week).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Community Bonding Period ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Get familiar with the development environment (GitHub account, Git, Biopython's repository, bug tracking system, etc.)&lt;br /&gt;
&lt;br /&gt;
*Gather scientific literature and discuss some of the to-be-implemented methods.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 1 [31st May - 6th June] ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Renumbering residues of a structure ====&lt;br /&gt;
&lt;br /&gt;
*Read SEQRES record to account for gaps&lt;br /&gt;
*Alternatively read ATOM records.&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Probe disulphide bridges in the structure ====&lt;br /&gt;
&lt;br /&gt;
*Via the NeighborSearch class&lt;br /&gt;
*Also use SSBOND in header&lt;br /&gt;
&lt;br /&gt;
   &lt;br /&gt;
====  Extract Biological Unit ====&lt;br /&gt;
&lt;br /&gt;
*REMARK 350 contains rotation and translation information&lt;br /&gt;
*If REMARK 350 is absent, do nothing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 2 [7th – 13th June] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Structure Hydrogenation ====&lt;br /&gt;
&lt;br /&gt;
*Add all/polar hydrogens through interface with WHATIF server.&lt;br /&gt;
*Optionally define a set pH&lt;br /&gt;
** [http://www3.interscience.wiley.com/journal/112117957/abstract pKa values algorithm]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Hydrogenation Report ====&lt;br /&gt;
&lt;br /&gt;
*Produces a brief list of polar hydrogen atoms in the structure.&lt;br /&gt;
** Chain | Residue [number] | Atom&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 3-5 [14th June - 4th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Removal of disordered atoms ====&lt;br /&gt;
&lt;br /&gt;
*[[Remove_PDB_disordered_atoms|Solution proposed in the Biopython wiki]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Residue name normalisation ====&lt;br /&gt;
&lt;br /&gt;
*Build conversion table from different nomenclatures (research them during the community bonding period)&lt;br /&gt;
*Write function to make a given structure compliant with a given software nomenclature:&lt;br /&gt;
** Amber&lt;br /&gt;
** CNS/HADDOCK&lt;br /&gt;
** GROMACS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Coarse Grain Structure ====&lt;br /&gt;
&lt;br /&gt;
*Implement function to reduce complexity of a structure&lt;br /&gt;
** 1 point: C-alpha&lt;br /&gt;
** 2 points: C-alpha / C-beta&lt;br /&gt;
** 3 points: C-alpha / C-beta / side-chain pseudo-centroid OR side-chain centroid&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 6 (Mid-Term) [5th - 11th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
: Test and consolidate the features thoroughly.&lt;br /&gt;
: Write documentation &amp;amp; examples for each feature, to be included in Biopython's Wiki and Bio.PDB's FAQ.&lt;br /&gt;
: Mid-term evaluations. Discuss the current state of the project with the mentors and adjust the remaining schedule to the project's needs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 7 [12th - 19th July] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add support for MODELLER's PIR format to Biopython ====&lt;br /&gt;
&lt;br /&gt;
*[http://www.salilab.org/modeller/manual/node445.html#alignmentformat Format Description]&lt;br /&gt;
*SeqIO&lt;br /&gt;
*AlignIO&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Allow conversion of Structure Object to Sequence Object ====&lt;br /&gt;
&lt;br /&gt;
*Based on Bio.PDB.Polypeptide function&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Weeks 8-10 [20th July - 9th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Add Sequence/Structure Homology functions ====&lt;br /&gt;
&lt;br /&gt;
*Create call to Biopython's BLAST interfaces&lt;br /&gt;
** Allow direct blast from structure object ( e.g. protein.find_homoseq() )&lt;br /&gt;
** Returns a list of (E-value, dictionary) tuples, where the dictionary holds the name, length of alignment, etc.&lt;br /&gt;
*Create interface with structural homology web services&lt;br /&gt;
** e.g. [http://ekhidna.biocenter.helsinki.fi/dali_server/ Dali server]&lt;br /&gt;
** Returns a list of (Z-score, dictionary) tuples, where the dictionary holds the name, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Implement basic structure validation checks ====&lt;br /&gt;
&lt;br /&gt;
*Via the NeighborSearch class&lt;br /&gt;
** Same Charge contacts&lt;br /&gt;
** Atom Clashes&lt;br /&gt;
*Via the ResidueDepth class&lt;br /&gt;
** Buried Charges&lt;br /&gt;
*Interface WHATIF PDBReport web service&lt;br /&gt;
** Parse WARNING and ERROR messages&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Week 11 [10th - 17th August] === &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====  Review documentation and code; write tests for new functions ====&lt;/div&gt;</description>
			<pubDate>Thu, 13 May 2010 08:18:36 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:GSOC2010_Joao</comments>		</item>
		<item>
			<title>Active projects</title>
			<link>http://biopython.org/wiki/Active_projects</link>
			<guid isPermaLink="false">http://biopython.org/wiki/Active_projects</guid>
			<description>&lt;p&gt;Joaor: /* Current projects */  Added GSOC 2010 project&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page provides a central location to collect references to active projects. This is a good place to start if you are interested in contributing to Biopython and want to find larger projects in progress. For developers, use this to reference [[git]] branches or other projects which you will be working on for an extended period of time. Please keep it up to date as projects are finished and integrated into Biopython.&lt;br /&gt;
&lt;br /&gt;
== Current projects ==&lt;br /&gt;
&lt;br /&gt;
=== Population Genetics development ===&lt;br /&gt;
&lt;br /&gt;
Giovanni and Tiago are working on expanding population genetics code in Biopython. See the [[PopGen_dev|PopGen development page]] for more details.&lt;br /&gt;
&lt;br /&gt;
=== GFF parser ===&lt;br /&gt;
&lt;br /&gt;
Brad is working on a Biopython GFF parser. Source code is available from [http://github.com/chapmanb/bcbb/tree/master/gff GitHub]. Documentation is in progress at [[GFF Parsing]]. See blog posts on the [http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/ initial implementation] and [http://bcbio.wordpress.com/2009/03/22/mapreduce-implementation-of-gff-parsing-for-biopython/ MapReduce parallel version].&lt;br /&gt;
&lt;br /&gt;
=== Phylo ===&lt;br /&gt;
&lt;br /&gt;
[[User:EricTalevich|Eric]] is working on a new module for phylogenetics, [[Phylo|Bio.Phylo]]. It grew out of a [[Google Summer of Code]] 2009 [http://socghop.appspot.com/student_project/show/google/gsoc2009/nescent/t124022798969 project], mentored by Brad, to add support for [http://www.phyloxml.org/ phyloXML] to Biopython; it also refactors part of Bio.Nexus. Most of the code has been pushed to the main development branch on GitHub already, but new features appear first on Eric's [http://github.com/etal/biopython/tree/phyloxml phyloxml branch].&lt;br /&gt;
&lt;br /&gt;
=== Biogeography ===&lt;br /&gt;
&lt;br /&gt;
[[Matzke|Nick]] is working on developing a Biogeography module for BioPython.  This work was funded by a [[Google Summer of Code]] 2009 [http://socghop.appspot.com/program/home/google/gsoc2009 project] through NESCENT's [https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2009 Phyloinformatics Summer of Code 2009]. See the project proposal at: [http://socghop.appspot.com/student_project/show/google/gsoc2009/nescent/t124022798250 Biogeographical Phylogenetics for BioPython]. The mentors are [http://blackrim.org/ Stephen Smith] (primary), [http://bcbio.wordpress.com/ Brad Chapman], and [http://evoviz.nescent.org/ David Kidd].  The code currently lives at the Bio/Geography directory of the [http://github.com/nmatzke/biopython/tree/Geography Geography fork of the nmatzke branch on  GitHub], and you can see a timeline and other info about ongoing development [[BioGeography|here]]. The new module is being documented on this wiki as [[BioGeography]].&lt;br /&gt;
&lt;br /&gt;
=== Roche 454 SFF parsing in Bio.SeqIO ===&lt;br /&gt;
&lt;br /&gt;
See [http://bugzilla.open-bio.org/show_bug.cgi?id=2837 Bug 2837], based on code from Jose Blanca. Recently merged into the trunk and should be included in Biopython 1.54 onwards.&lt;br /&gt;
&lt;br /&gt;
=== Multiple Sequence Alignments ===&lt;br /&gt;
&lt;br /&gt;
Peter is working on a new alignment class for sequence alignments (not the kind in next gen sequencing), the core of which was recently merged into the trunk and should be included in Biopython 1.54 onwards.&lt;br /&gt;
&lt;br /&gt;
=== Open Enhancement Bugs ===&lt;br /&gt;
&lt;br /&gt;
This [http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&amp;amp;bug_status=NEW&amp;amp;bug_status=ASSIGNED&amp;amp;bug_status=REOPENED&amp;amp;bug_severity=enhancement Bugzilla Search] will list all open enhancement bugs (any filed by core developers are fairly likely to be integrated, some are just wish list entries).&lt;br /&gt;
&lt;br /&gt;
=== (GSOC 2010) Extending Bio.PDB ===&lt;br /&gt;
&lt;br /&gt;
A GSOC 2010 project that aims to introduce several new features to the Bio.PDB structural biology module. These include functions for adding polar hydrogens to structures, probing for disulphide bridges based on structural information and annotations, renumbering residues, and coarse-graining a structure. A more comprehensive layout of the project is available [[GSOC2010_Joao|here]].&lt;br /&gt;
&lt;br /&gt;
== Project ideas ==&lt;br /&gt;
&lt;br /&gt;
Please add any ideas or proposals for new additions to Biopython. Bugs and enhancements for current code should be discussed through our Bugzilla interface.&lt;br /&gt;
&lt;br /&gt;
* Build a general tool to filter sequences containing ambiguous or low quality bases. Chris Fields from BioPerl is interested in coordinating the BioPerl/Biopython implementations. See these threads on the mailing lists for discussion: http://lists.open-bio.org/pipermail/biopython/2009-July/005355.html, http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html&lt;br /&gt;
&lt;br /&gt;
* Use SQLAlchemy, an object relational mapper, for BioSQL internals. This would add an additional external dependency to Biopython, but provides ready support for additional databases like SQLite. It also would provide a raw object interface to BioSQL databases when the SeqRecord-like interface is not sufficient. Brad and Kyle have some initial code for this.&lt;br /&gt;
&lt;br /&gt;
* Revamp the GEO SOFT parser, drawing on the ideas used in [http://www.bioconductor.org/packages/bioc/html/GEOquery.html Sean Davis' GEOquery parser in R/Bioconductor].  See also [http://www.warwick.ac.uk/go/peter_cock/r/geo/ this page].&lt;br /&gt;
&lt;br /&gt;
== Enhancement list ==&lt;br /&gt;
&lt;br /&gt;
Maintaining software involves incremental improvements for new format changes and removal of bugs. Please see our [http://bugzilla.open-bio.org/ bugzilla] page for a current list. Post to the developer mailing list if you are interested in tackling any open issues.&lt;/div&gt;</description>
			<pubDate>Thu, 13 May 2010 07:25:37 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/Talk:Active_projects</comments>		</item>
		<item>
			<title>User:Joaor</title>
			<link>http://biopython.org/wiki/User:Joaor</link>
			<guid isPermaLink="false">http://biopython.org/wiki/User:Joaor</guid>
			<description>&lt;p&gt;Joaor: Creation.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Hello :)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''' About Me '''&lt;br /&gt;
&lt;br /&gt;
This is the Wiki page of João Rodrigues, currently doing research in protein structure refinement with [http://csb.stanford.edu Prof. Michael Levitt] at Stanford University. I have previously worked with [http://haddock.chem.uu.nl Prof. Alexandre Bonvin] in protein-protein docking at Utrecht University.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''' Contributions to Biopython '''&lt;br /&gt;
&lt;br /&gt;
I have been a member of the Biopython community since 2007 but only became an active participant with the GSOC 2010 programme. My project involves enriching the Bio.PDB module with several features that will hopefully make it more attractive and useful to the structural biology community.&lt;br /&gt;
&lt;br /&gt;
A detailed layout of my project is to be presented [[GSOC2010_Joao|elsewhere]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[http://stanford.edu/~joaor My personal page]  ||  Last updated: 05.2010&lt;/div&gt;</description>
			<pubDate>Thu, 13 May 2010 07:16:54 GMT</pubDate>			<dc:creator>Joaor</dc:creator>			<comments>http://biopython.org/wiki/User_talk:Joaor</comments>		</item>
	</channel>
</rss>