Bio.SubsMat package¶
Module contents¶
Substitution matrices, log odds matrices, and operations on them.
General:¶
This module provides a class and a few routines for generating substitution matrices, similar ot BLOSUM or PAM matrices, but based on user-provided data. The class used for these matrices is SeqMat
Matrices are implemented as a dictionary. Each index contains a 2-tuple, which are the two residue/nucleotide types replaced. The value differs according to the matrix’s purpose: e.g in a log-odds frequency matrix, the value would be log(Pij/(Pi*Pj)) where: Pij: frequency of substitution of letter (residue/nucleotide) i by j Pi, Pj: expected frequencies of i and j, respectively.
Usage:¶
The following section is laid out in the order by which most people wish to generate a log-odds matrix. Of course, interim matrices can be generated and investigated. Most people just want a log-odds matrix, that’s all.
Generating an Accepted Replacement Matrix:¶
Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in ARM are the _counted_ number of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be: [‘A’,’C’]: 10, [‘C’,’A’] 12 As order doesn’t matter, user can already provide only one entry: [‘A’,’C’]: 22 A SeqMat instance may be initialized with either a full (first method of counting: 10, 12) or half (the latter method, 22) matrix. A Full protein alphabet matrix would be of the size 20x20 = 400. A Half matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because same-letter entries don’t change. (The matrix diagonal). Given an alphabet size of N: Full matrix size:N*N Half matrix size: N(N+1)/2
If you provide a full matrix, the constructor will create a half-matrix automatically. If you provide a half-matrix, make sure of a (low, high) sorted order in the keys: there should only be a (‘A’,’C’) not a (‘C’,’A’).
Internal functions:
Generating the observed frequency matrix (OFM):¶
Use: OFM = _build_obs_freq_mat(ARM) The OFM is generated from the ARM, only instead of replacement counts, it contains replacement frequencies.
Generating an expected frequency matrix (EFM):¶
Use: EFM = _build_exp_freq_mat(OFM,exp_freq_table) exp_freq_table: should be a freqTableC instantiation. See freqTable.py for detailed information. Briefly, the expected frequency table has the frequencies of appearance for each member of the alphabet
Generating a substitution frequency matrix (SFM):¶
Use: SFM = _build_subs_mat(OFM,EFM) Accepts an OFM, EFM. Provides the division product of the corresponding values.
Generating a log-odds matrix (LOM):¶
Use: LOM=_build_log_odds_mat(SFM[,logbase=10,factor=10.0,roundit=1]) Accepts an SFM. logbase: base of the logarithm used to generate the log-odds values. factor: factor used to multiply the log-odds values. roundit: default - true. Whether to round the values. Each entry is generated by log(LOM[key])*factor And rounded if required.
External:¶
In most cases, users will want to generate a log-odds matrix only, without explicitly calling the OFM –> EFM –> SFM stages. The function build_log_odds_matrix does that. User provides an ARM and an expected frequency table. The function returns the log-odds matrix.
Methods for subtraction, addition and multiplication of matrices:¶
Generation of an expected frequency table from an observed frequency matrix.
Calculation of linear correlation coefficient between two matrices.
Calculation of relative entropy is now done using the _make_relative_entropy method and is stored in the member self.relative_entropy
Calculation of entropy is now done using the _make_entropy method and is stored in the member self.entropy.
Jensen-Shannon distance between the distributions from which the matrices are derived. This is a distance function based on the distribution’s entropies.
-
class
Bio.SubsMat.
SeqMat
(data=None, alphabet=None, mat_name='', build_later=0)¶ Bases:
dict
A Generic sequence matrix class.
The key is a 2-tuple containing the letter indices of the matrix. Those should be sorted in the tuple (low, high). Because each matrix is dealt with as a half-matrix.
-
__init__
(self, data=None, alphabet=None, mat_name='', build_later=0)¶ Initialize.
User may supply:
data: matrix itself
mat_name: its name. See below.
alphabet: an instance of Bio.Alphabet, or a subclass. If not supplied, constructor builds its own from that matrix.
build_later: skip the matrix size assertion. User will build the matrix after creating the instance. Constructor builds a half matrix filled with zeroes.
-
make_entropy
(self)¶ Calculate and set the entropy attribute.
-
sum
(self)¶ Return sum of the results.
-
print_full_mat
(self, f=None, format='%4d', topformat='%4s', alphabet=None, factor=1, non_sym=None)¶ Print the full matrix to the file handle f or stdout.
-
print_mat
(self, f=None, format='%4d', bottomformat='%4s', alphabet=None, factor=1)¶ Print a nice half-matrix.
f=sys.stdout to see on the screen.
User may pass own alphabet, which should contain all letters in the alphabet of the matrix, but may be in a different order. This order will be the order of the letters on the axes.
-
__str__
(self)¶ Print a nice half-matrix.
-
__sub__
(self, other)¶ Return integer subtraction product of the two matrices.
-
__mul__
(self, other)¶ Matrix multiplication.
Returns a matrix for which each entry is the multiplication product of the two matrices passed.
-
__add__
(self, other)¶ Matrix addition.
-
-
class
Bio.SubsMat.
AcceptedReplacementsMatrix
(data=None, alphabet=None, mat_name='', build_later=0)¶ Bases:
Bio.SubsMat.SeqMat
Accepted replacements matrix.
-
class
Bio.SubsMat.
ObservedFrequencyMatrix
(data=None, alphabet=None, mat_name='', build_later=0)¶ Bases:
Bio.SubsMat.SeqMat
Observed frequency matrix.
-
class
Bio.SubsMat.
ExpectedFrequencyMatrix
(data=None, alphabet=None, mat_name='', build_later=0)¶ Bases:
Bio.SubsMat.SeqMat
Expected frequency matrix.
-
class
Bio.SubsMat.
SubstitutionMatrix
(data=None, alphabet=None, mat_name='', build_later=0)¶ Bases:
Bio.SubsMat.SeqMat
Substitution matrix.
-
calculate_relative_entropy
(self, obs_freq_mat)¶ Calculate and return relative entropy w.r.t. observed frequency matrix.
-
-
class
Bio.SubsMat.
LogOddsMatrix
(data=None, alphabet=None, mat_name='', build_later=0)¶ Bases:
Bio.SubsMat.SeqMat
Log odds matrix.
-
calculate_relative_entropy
(self, obs_freq_mat)¶ Calculate and return relative entropy w.r.t. observed frequency matrix.
-
-
Bio.SubsMat.
make_log_odds_matrix
(acc_rep_mat, exp_freq_table=None, logbase=2, factor=1.0, round_digit=9, keep_nd=0)¶ Make log-odds matrix.
-
Bio.SubsMat.
observed_frequency_to_substitution_matrix
(obs_freq_mat)¶ Convert observed frequency table into substitution matrix.
-
Bio.SubsMat.
read_text_matrix
(data_file)¶ Read a matrix from a text file.
-
Bio.SubsMat.
two_mat_relative_entropy
(mat_1, mat_2, logbase=2, diag=3)¶ Return relative entropy of two matrices.
-
Bio.SubsMat.
two_mat_correlation
(mat_1, mat_2)¶ Return linear correlation coefficient between two matrices.
-
Bio.SubsMat.
two_mat_DJS
(mat_1, mat_2, pi_1=0.5, pi_2=0.5)¶ Return Jensen-Shannon Distance between two observed frequence matrices.