Package SubsMat
Substitution matrices, log odds matrices, and operations on them.
General:
This module provides a class and a few routines for generating
substitution matrices, similar to BLOSUM or PAM matrices, but based on
user-provided data.
The class used for these matrices is SeqMat.
Matrices are implemented as a dictionary. Each key is a 2-tuple holding
the two residue/nucleotide types replaced. The value differs according to
the matrix's purpose: e.g. in a log-odds frequency matrix, the value would
be log(Pij/(Pi*Pj)) where:
Pij: frequency of substitution of letter (residue/nucleotide) i by j
Pi, Pj: expected frequencies of i and j, respectively.
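As a worked illustration of the formula above, the following sketch computes a single log-odds value. The frequencies, the base-2 logarithm, and the helper name `log_odds` are assumptions made up for this example; they are not taken from the module.

```python
import math


def log_odds(p_ij, p_i, p_j, logbase=2):
    """Return log(Pij / (Pi * Pj)) in the given base (hypothetical helper)."""
    return math.log(p_ij / (p_i * p_j), logbase)


# Made-up frequencies: the pair (i, j) is observed twice as often as the
# background frequencies of i and j would predict (0.02 vs 0.1 * 0.1).
score = log_odds(p_ij=0.02, p_i=0.1, p_j=0.1)
```

A positive value means the pair occurs more often than expected by chance, a negative value less often.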
Usage:
The following section is laid out in the order in which most people will
want to generate a log-odds matrix. Interim matrices can be generated and
inspected along the way, but most people just want the log-odds matrix.
Generating an Accepted Replacement Matrix:
Initially, you should generate an accepted replacement matrix (ARM)
from your data. The values in ARM are the _counted_ number of
replacements according to your data. The data could be a set of pairs
or multiple alignments. So for instance if Alanine was replaced by
Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding
ARM entries would be:
('A','C'): 10,
('C','A'): 12
As order doesn't matter, the user can instead provide a single entry:
('A','C'): 22
A SeqMat instance may be initialized with either a full matrix (the first
method of counting: 10, 12) or a half matrix (the latter method: 22). A
full protein alphabet matrix has 20x20 = 400 entries. A half matrix of
that alphabet has 20x20/2 + 20/2 = 210 entries, because same-letter
entries (the matrix diagonal) are not duplicated. Given an alphabet size
of N:
Full matrix size: N*N
Half matrix size: N(N+1)/2
If you provide a full matrix, the constructor will create a half matrix
automatically.
If you provide a half matrix, make sure the keys are in (low, high) sorted
order: there should only be a ('A','C'), not a ('C','A').
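The relationship between full and half matrices can be sketched in plain Python. This is only an illustration using ordinary dictionaries; `fold_to_half` and `half_matrix_size` are hypothetical helper names, not part of the module, and the code is not the SeqMat constructor.

```python
def fold_to_half(full_counts):
    """Fold a full count matrix into a half matrix with (low, high) keys."""
    half = {}
    for (a, b), count in full_counts.items():
        key = (a, b) if a <= b else (b, a)  # enforce sorted key order
        half[key] = half.get(key, 0) + count
    return half


def half_matrix_size(n):
    """Number of entries in a half matrix over an alphabet of n letters."""
    return n * (n + 1) // 2


# The Alanine/Cysteine counts from the text, plus a made-up diagonal entry.
full = {("A", "C"): 10, ("C", "A"): 12, ("A", "A"): 50}
half = fold_to_half(full)
```

Here `half[("A", "C")]` holds the combined count of 22, and `half_matrix_size(20)` gives the 210 entries quoted for the protein alphabet.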
Internal functions:
Generating the observed frequency matrix (OFM):
Use: OFM = _build_obs_freq_mat(ARM)
The OFM is generated from the ARM, only instead of replacement counts, it
contains replacement frequencies.
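A minimal sketch of the count-to-frequency step, assuming each count is simply divided by the total number of counted replacements. The function name `observed_frequencies` and the counts are made up for illustration; a plain dictionary stands in for a SeqMat instance.

```python
def observed_frequencies(acc_rep_mat):
    """Turn replacement counts into frequencies that sum to 1."""
    total = float(sum(acc_rep_mat.values()))
    return {pair: count / total for pair, count in acc_rep_mat.items()}


# Made-up half-matrix counts over a two-letter alphabet.
arm = {("A", "A"): 50, ("A", "C"): 22, ("C", "C"): 28}
ofm = observed_frequencies(arm)
```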
Generating an expected frequency matrix (EFM):
Use: EFM = _build_exp_freq_mat(OFM,exp_freq_table)
exp_freq_table: should be a FreqTable instance. See FreqTable.py for
detailed information. Briefly, the expected frequency table holds the
frequency of appearance of each member of the alphabet.
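One way to picture the expected frequency matrix is as the pair frequencies implied by independent letter frequencies. The sketch below assumes a half matrix, so each off-diagonal entry covers both orderings of the pair and is therefore doubled; that convention, the function name `expected_frequencies`, and the use of a plain dictionary in place of a FreqTable are all assumptions of this example.

```python
def expected_frequencies(exp_freq_table):
    """Build a half matrix of expected pair frequencies from letter frequencies."""
    letters = sorted(exp_freq_table)
    efm = {}
    for idx, a in enumerate(letters):
        for b in letters[idx:]:
            if a == b:
                # Diagonal: both positions hold the same letter.
                efm[(a, b)] = exp_freq_table[a] ** 2
            else:
                # Off-diagonal: (a, b) and (b, a) share one half-matrix entry.
                efm[(a, b)] = 2.0 * exp_freq_table[a] * exp_freq_table[b]
    return efm


# Made-up letter frequencies for a two-letter alphabet.
table = {"A": 0.6, "C": 0.4}
efm = expected_frequencies(table)
```

With this convention the entries of the expected matrix sum to 1, just as the observed frequencies do.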
Generating a substitution frequency matrix (SFM):
Use: SFM = _build_subs_mat(OFM,EFM)
Accepts an OFM and an EFM. Returns the quotient of the corresponding
values (observed frequency divided by expected frequency).
Generating a log-odds matrix (LOM):
Use: LOM = _build_log_odds_mat(SFM[, logbase=2, factor=10.0, round_digit=0, keep_nd=0])
Accepts an SFM. logbase: base of the logarithm used to generate the
log-odds values. factor: factor by which the log-odds values are
multiplied. round_digit: number of places after the decimal point to
round to.
Each entry is generated by log(SFM[key]) * factor
and rounded if required.
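The last two stages can be sketched together: divide observed by expected frequencies, then take the scaled, rounded logarithm. The helper names and the input frequencies are made up for illustration, plain dictionaries stand in for SeqMat instances, and the defaults mirror the `_build_log_odds_mat` signature listed further down this page. Handling of undetermined entries is left out.

```python
import math


def substitution_ratios(obs_freq_mat, exp_freq_mat):
    """Divide each observed frequency by the matching expected frequency."""
    return {pair: obs_freq_mat[pair] / exp_freq_mat[pair] for pair in obs_freq_mat}


def log_odds_matrix(subs_mat, logbase=2, factor=10.0, round_digit=0):
    """Apply round(factor * log(ratio), round_digit) to every entry."""
    return {
        pair: round(factor * math.log(ratio, logbase), round_digit)
        for pair, ratio in subs_mat.items()
    }


# Made-up observed and expected frequencies over a two-letter alphabet.
ofm = {("A", "A"): 0.5, ("A", "C"): 0.25, ("C", "C"): 0.25}
efm = {("A", "A"): 0.25, ("A", "C"): 0.5, ("C", "C"): 0.25}
sfm = substitution_ratios(ofm, efm)
lom = log_odds_matrix(sfm)
```

Pairs seen more often than expected get positive scores, pairs seen less often get negative scores, and a ratio of exactly 1 scores zero.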
External:
In most cases, users will want to generate a log-odds matrix only, without
explicitly calling the OFM -> EFM -> SFM stages. The function
make_log_odds_matrix does that: the user provides an ARM and, optionally,
an expected frequency table, and the function returns the log-odds matrix.
Methods for subtraction, addition and multiplication of matrices:
- Generation of an expected frequency table from an observed frequency
  matrix.
- Calculation of linear correlation coefficient between two matrices.
- Calculation of relative entropy is now done using the
  _make_relative_entropy method and is stored in the member
  self.relative_entropy.
- Calculation of entropy is now done using the _make_entropy method and
  is stored in the member self.entropy.
- Jensen-Shannon distance between the distributions from which the
  matrices are derived. This is a distance function based on the
  distributions' entropies.



_exp_freq_table_from_obs_freq(obs_freq_mat)
Build expected frequency table from observed frequencies (PRIVATE).
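A sketch of how letter frequencies could be recovered from an observed frequency half matrix: each diagonal entry counts fully towards its letter, and each off-diagonal entry is split evenly between its two letters. The even split, the function name `letter_frequencies`, and the input values are assumptions of this example, not a statement of the module's implementation.

```python
def letter_frequencies(obs_freq_mat):
    """Derive per-letter frequencies from a half matrix of pair frequencies."""
    table = {}
    for (a, b), freq in obs_freq_mat.items():
        if a == b:
            table[a] = table.get(a, 0.0) + freq
        else:
            # Split a mixed pair evenly between its two letters.
            table[a] = table.get(a, 0.0) + freq / 2.0
            table[b] = table.get(b, 0.0) + freq / 2.0
    return table


# Made-up observed frequencies over a two-letter alphabet.
ofm = {("A", "A"): 0.5, ("A", "C"): 0.25, ("C", "C"): 0.25}
table = letter_frequencies(ofm)
```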





_build_subs_mat(obs_freq_mat,
exp_freq_mat)
Build the substitution matrix (PRIVATE). 





make_log_odds_matrix(acc_rep_mat,
exp_freq_table=None,
logbase=2,
factor=1.0,
round_digit=9,
keep_nd=0)
Make log-odds matrix.



observed_frequency_to_substitution_matrix(obs_freq_mat)
Convert observed frequency table into substitution matrix. 



read_text_matrix(data_file)
Read a matrix from a text file. 



two_mat_relative_entropy(mat_1,
mat_2,
logbase=2,
diag=3)
Return relative entropy of two matrices. 



two_mat_correlation(mat_1,
mat_2)
Return linear correlation coefficient between two matrices. 
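For orientation, a linear (Pearson) correlation coefficient between two matrices can be computed over the entries they have in common. The sketch below uses plain dictionaries; restricting the comparison to shared keys, the function name `matrix_correlation`, and the sample values are assumptions of this example.

```python
import math


def matrix_correlation(mat_1, mat_2):
    """Pearson correlation over the keys present in both matrices."""
    keys = sorted(set(mat_1) & set(mat_2))
    xs = [mat_1[k] for k in keys]
    ys = [mat_2[k] for k in keys]
    n = len(keys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up matrices: m2 is m1 scaled by 2, so the two are perfectly correlated.
m1 = {("A", "A"): 4, ("A", "C"): -1, ("C", "C"): 9}
m2 = {("A", "A"): 8, ("A", "C"): -2, ("C", "C"): 18}
r = matrix_correlation(m1, m2)
```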



two_mat_DJS(mat_1,
mat_2,
pi_1=0.5,
pi_2=0.5)
Return Jensen-Shannon distance between two observed frequency matrices.
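The entropy-based Jensen-Shannon quantity can be sketched as the entropy of the weighted mixture minus the weighted entropies of the two inputs. The code below is an illustration with plain dictionaries and base-2 logarithms; the helper names `entropy` and `jensen_shannon` and the sample distributions are assumptions, while `pi_1` and `pi_2` follow the weights in the signature above.

```python
import math


def entropy(freqs, logbase=2):
    """Shannon entropy of a frequency dictionary, skipping zero entries."""
    return -sum(p * math.log(p, logbase) for p in freqs.values() if p > 0)


def jensen_shannon(mat_1, mat_2, pi_1=0.5, pi_2=0.5):
    """H(pi_1*P + pi_2*Q) - pi_1*H(P) - pi_2*H(Q) for frequency dicts P, Q."""
    keys = set(mat_1) | set(mat_2)
    mixture = {
        k: pi_1 * mat_1.get(k, 0.0) + pi_2 * mat_2.get(k, 0.0) for k in keys
    }
    return entropy(mixture) - pi_1 * entropy(mat_1) - pi_2 * entropy(mat_2)


# Made-up distributions: p and q share no entries at all.
p = {("A", "A"): 1.0}
q = {("C", "C"): 1.0}
same = jensen_shannon(p, p)
apart = jensen_shannon(p, q)
```

Identical distributions give zero, and with equal weights and base-2 logarithms two completely disjoint distributions give the maximum of 1.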



NOTYPE = 0


ACCREP = 1


OBSFREQ = 2


SUBS = 3


EXPFREQ = 4


LO = 5


EPSILON = 1e-14


diagNO = 1


diagONLY = 2


diagALL = 3


__package__ = 'Bio.SubsMat'

Build observed frequency matrix (PRIVATE).
Build the observed frequency matrix from an accepted replacements matrix.
The acc_rep_mat matrix should be generated by the user.

Build an expected frequency matrix (PRIVATE).
exp_freq_table: should be a FreqTable instance.

_build_log_odds_mat(subs_mat,
logbase=2,
factor=10.0,
round_digit=0,
keep_nd=0)

Build a log-odds matrix (PRIVATE).
- logbase=2: base of logarithm used to build (default 2)
- factor=10.: a factor by which each matrix entry is multiplied
- round_digit: round-off place after decimal point
- keep_nd: if true, keeps the -999 value for non-determined values (for
  which there are no substitutions in the frequency substitutions matrix).
  If false, plants the minimum log-odds value of the matrix in entries
  containing -999.
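The effect of keep_nd can be illustrated on a plain dictionary, assuming the non-determined sentinel is -999. The constant `ND`, the function name `fill_non_determined`, and the sample scores are made up for this sketch; it shows only the replacement step, not how the module builds the matrix.

```python
ND = -999  # assumed sentinel marking non-determined entries


def fill_non_determined(log_odds, keep_nd=False):
    """Replace sentinel entries with the smallest determined value."""
    determined = [v for v in log_odds.values() if v != ND]
    if keep_nd or not determined:
        return dict(log_odds)  # leave sentinel values in place
    floor = min(determined)
    return {k: (floor if v == ND else v) for k, v in log_odds.items()}


# Made-up scores with one non-determined entry.
lom = {("A", "A"): 5, ("A", "C"): ND, ("C", "C"): -3}
filled = fill_non_determined(lom)
kept = fill_non_determined(lom, keep_nd=True)
```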
