Mocapy++ is a machine learning toolkit for training and using Bayesian networks. It has been used to develop probabilistic models of biomolecular structures. The goal of this project is to develop a Python interface to Mocapy++ and integrate it with Biopython. This will allow the training of a probabilistic model using data extracted from a database. The integration of Mocapy++ with Biopython will provide a strong support for the field of protein structure prediction, design and simulation.

Discovering the structure of biomolecules is one of the biggest problems in biology. Given an amino acid or base sequence, what is the three dimensional structure? One approach to biomolecular structure prediction is the construction of probabilistic models. A Bayesian network is a probabilistic model composed of a set of variables and their joint probability distribution, represented as a directed acyclic graph. A dynamic Bayesian network is a Bayesian network that represents sequences of variables. These sequences can be time-series or sequences of symbols, such as protein sequences. Directional statistics is concerned mainly with observations which are unit vectors in the plane or in three-dimensional space. The sample space is typically a circle or a sphere. There must be special directional methods which take into account the structure of the sample spaces. The union of graphical models and directional statistics allows the development of probabilistic models of biomolecular structures. Through the use of dynamic Bayesian networks with directional output it becomes possible to construct a joint probability distribution over sequence and structure. Biomolecular structures can be represented in a geometrically natural, continuous space. Mocapy++ is an open source toolkit for inference and learning using dynamic Bayesian networks that provides support for directional statistics. Mocapy++ is excellent for constructing probabilistic models of biomolecular structures; it has been used to develop models of protein and RNA structure in atomic detail. Mocapy++ is used in several high-impact publications, and will form the core of the molecular modeling package Phaistos, which will be released soon. The goal of this project is to develop a highly useful Python interface to Mocapy++, and to integrate that interface with the Biopython project. Through the Bio.PDB module, Biopython provides excellent functionality for data mining biomolecular structure databases. Integrating Mocapy++ and Biopython will allow training a probabilistic model using data extracted from a database. Integrating Mocapy++ with Biopython will create a powerful toolkit for researchers to quickly implement and test new ideas, try a variety of approaches and refine their methods. It will provide strong support for the field of biomolecular structure prediction, design, and simulation.

Michele Silva michele.silva@gmail.com

**Mentors**

Thomas Hamelryck

Eric Talevich

’'’Gain understanding of SEM and directional statistics ‘’’

- Review the theory behind machine learning for bioinformatics, Markov chain Monte Carlo and dynamic Bayesian networks.

- Build the theoretical background on the algorithms used in Mocapy++, such as parameter learning of Bayesian networks using Stochastic Expectation Maximization (SEM).

’'’Study Mocapy++’s use cases ‘’’

- Read several papers and attempt to replicate part of the experiments described using Mocapy++.

- Get a better understanding of biological sequence analysis done through probabilistic models of proteins and nucleic acids.

’'’Work with Mocapy++ ‘’’

- Understand Mocapy++’s internal architecture and algorithms by exploring its source code and running its test cases.

- Research other applications of Mocapy++ in Bioinformatics.

’'’Design Mocapy++’s Python interface ‘’’

- Explore the source code of Biopython to understand its design and implementation. The Mocapy++ interface to be included in Biopython must be made compatible with the methods of solving problems in Biopython.

- Design a Python interface for Mocapy++, based on its data structures and algorithms. Examine Mocapy++’s use cases and existing test cases to provide guidance for the interface design.

’'’Implement Python bindings ‘’’

- Implement test cases in Python using the new interface to Mocapy++.

- Implement python bindings for the defined interface.

’'’Explore Mocapy++’s applications ‘’’

- Develop example applications that involve data mining of biomolecular structure databases using Biopython.

- Formulate probabilistic models using Python-Mocapy++. Apply the models to solve biological problems. Examples of problems that can be solved using dynamic Bayesian networks include deciding if a pair of sequences is evolutionarily related, finding sequences which are homologous to a known evolutionary family and predicting RNA secondary structure.

Hosted at the gSoC11 Mocapy branch

There is already an effort to provide bindings for Mocapy++ using Swig. However, Swig is not the best option if performance is to be required. The Sage project aims at providing an open source alternative to Mathematica or Maple. Cython was developed in conjunction with Sage (it is an independent project, though), thus it is based on Sage’s requirements. They tried Swig, but declined it for performance issues. According to the Sage programming guide “The idea was to write code in C++ for SAGE that needed to be fast, then wrap it in SWIG. This ground to a halt, because the result was not sufficiently fast. First, there is overhead when writing code in C++ in the first place. Second, SWIG generates several layers of code between Python and the code that does the actual work”. This was written back in 2004, but it seems things didn’t evolve much. The only reason I would consider Swig is for future including Mocapy++ bindings on BioJava and BioRuby projects.

Boost Python is comprehensive and well accepted by the Python community. I would go for it for its extensive use and testing. I would decline it for being hard to debug and having a complicated building system. I don’t think it would be worth including a boost dependency just for the sake of creating the Python bindings, but since Mocapy++ already depends on Boost, using it becomes a more attractive option. In my personal experience, Boost Python is very mature and there are no limitations on what one can do with it. When it comes to performance, Cython still overcomes it. Have a look at the Cython C++ wrapping benchmarks and check the timings of Cython against Boost Python. There are also previous benchmarks comparing Swig and Boost Python.

It is incredibly faster than other options to create python bindings to C++ code, according to several benchmarks available on the web. Check the Simple benchmark between Cython and Boost.Python. It is also very clean and simple, yet powerful. Python’s doc on porting extension modules mentions cython: “If you are writing a new extension module, you might consider Cython.” Cython has now support for efficient interaction with numpy arrays. it is a young, but developing language and I would definitely give it a try for its leanness and speed.

Since Boost is well supported and Mocapy++ already relies on it, we decided to use Boost.Python for the bindings.

For further information see Mocapy++Biopython - Box of ideas.

The source code for the prototype is on the gSoC11 branch: http://mocapy.svn.sourceforge.net/viewvc/mocapy/branches/gSoC11/bindings_prototype/

Bindings for a few Mocapy++ features and a couple of examples to find possible implementation and performance issues.

**Procedure**

- Implemented the examples hmm_discrete and discrete_hmm_with_prior in Python, assuming the interface Mocapy++ already provides.

- Implemented the bindings to provide a minimum subset of functionality, in order to run the implemented examples.

- Compared the performance of C++ and Python versions.

Mocapy++’s interface remained unchanged, so the tests look similar to the ones in Mocapy/examples.

In the prototype the bindings were all implemented in a single module. For the actual implementation, we could mirror the src packages structure, having separated bindings for each package such as discrete, inference, etc.

It was possible to implement all the functionality required to run the examples. It was not possible to use the vector_indexing_suite when creating bindings for vectors of MDArrays. A few operators (in the MDArray) must be implemented in order to export indexable C++ containers to Python.

Two Mocapy++ examples that use discrete nodes were implemented in Python. There was no problem in exposing Mocapy’s data structures and algorithms. The performance of the Python version is very close to the original Mocapy++.

For additional details have a look at the Mocapy++ Bindings Prototype report.

’’’ Data structures ‘’’

Mocapy uses an internal data structure to represent arrays: MDArray. In order to make it easier for the user to interact with Mocapy’s API, it was decided to provide an interface that accepts numpy arrays. Therefore, it was necessary to implement a translation between a numpy array and an MDArray.

The translation from MDArray to python was done through the use of Boost.Python to_python_converter. We’ve implemented a template method convert_MDArray_to_numpy_array, which converts an MDArray of any basic type to a corresponding numpy array. In order to perform the translation the original array’s shape and internal data are copied into a new numpy array.

The numpy array was created using the Numpy Array API. The creation of a new PyArrayObject using existing data (PyArray_SimpleNewFromData) doesn’t copy the array data, it just stores a pointer to it. Thus, one can only free the data when there is no reference to the object. This was done through the use of a Capsule. Besides encapsulating the data, the capsule also stores a destructor to be used when the array is destroyed. The PyArrayObject has a field named “base” which points to the capsule.

The translation from Python to C++, i.e. creating an MDArray from a numpy array is slightly more complex. Boost.Python will provide a chunk of memory into which the new C++ object must constructed in-place. See the How to write boost.python converters article for more details.

A translation between std::vector of basic types (double, int…) and Python list was also implemented. For std::vector of custom types, such as Node, the translation to a Python list was not performed. If done the same ways as for basic types, a type error is raised: “TypeError: No to_python (by-value) converter found for C++ type”. When using vector_indexing_suite this problem was already solved. See Wrapping std::vector<AbstractClass*>. The only inconvenience of using the vector_indexing_suite is creating new types such as vector_Node, instead of using a standard Python list.

The code for the translations is in the mocapy_data_structures module.

’’’ Core functions ‘’’

The mocapy Python packages follow Mocapy’s current source tree. For each package, a shared library with the bindings was created. This makes compilation faster and debug easier. Also, if a single library was created it wouldn’t be possible to define packages.

Each of the libraries is called libmocapy_

The bindings code can be found in the Bindings directory.

Currently, tests to the just created interface are being developed. There are a few tests already implemented under the framework package: mocapy/framework/tests

’’’ Data structures ‘’’

While implementing the bindings for the remaining Mocapy++ functionality there were problems with methods that take pointers and references to an mdarray:

- It is not possible to call a method which takes a pointer if the object is created on the python side. See how to call a function that expects a pointer?.

- It is not possible to automatically translate a non const reference. The custom rvalue converters only match functions with the following signatures: