Bio.motifs.matrix module

Support for various forms of sequence motif matrices.

Implementation of frequency (count) matrices, position-weight matrices, and position-specific scoring matrices.

class Bio.motifs.matrix.GenericPositionMatrix(alphabet, values)

Bases: dict

Base class for the support of position matrix operations.

__init__(alphabet, values): Initialize the class.

__str__(): Return a string containing nucleotides and counts of the alphabet in the Matrix.

__getitem__(key): Return the position matrix of index key.

property consensus: Return the consensus sequence.

property anticonsensus: Return the anticonsensus sequence.

property degenerate_consensus: Return the degenerate consensus sequence.

calculate_consensus(substitution_matrix=None, plurality=None, identity=0, setcase=None)

Return the consensus sequence (as a string) for the given parameters.

This function largely follows the conventions of the EMBOSS cons tool.

Arguments:

substitution_matrix - the scoring matrix used when comparing sequences. By default, it is None, in which case we simply count the frequency of each letter. Instead of the default value, you can use the substitution matrices available in Bio.Align.substitution_matrices. Common choices are BLOSUM62 (also known as EBLOSUM62) for protein, and NUC.4.4 (also known as EDNAFULL) for nucleotides. NOTE: This has not yet been implemented.
plurality - threshold value for the number of positive matches, divided by the total count in a column, required to reach consensus. If substitution_matrix is None, then this argument must be None, and is ignored; a ValueError is raised otherwise. If substitution_matrix is not None, then the default value of the plurality is 0.5.
identity - number of identities, divided by the total count in a column, required to define a consensus value. If the number of identities is less than identity * total count in a column, then the undefined character (‘N’ for nucleotides and ‘X’ for amino acid sequences) is used in the consensus sequence. If identity is 1.0, then only columns of identical letters contribute to the consensus. Default value is zero.
setcase - threshold for the positive matches, divided by the total count in a column, above which the consensus is in upper-case and below which the consensus is in lower-case. By default, this is equal to 0.5.

property gc_content: Compute the fraction GC content.

reverse_complement(): Compute reverse complement.

class Bio.motifs.matrix.FrequencyPositionMatrix(alphabet, values)

Bases: GenericPositionMatrix

Class for the support of frequency calculations on the Position Matrix.

normalize(pseudocounts=None)

Create and return a position-weight matrix by normalizing the counts matrix.

If pseudocounts is None (default), no pseudocounts are added to the counts.

If pseudocounts is a number, it is added to the counts before calculating the position-weight matrix.

Alternatively, the pseudocounts can be a dictionary with a key for each letter in the alphabet associated with the motif.

__annotations__ = {}

class Bio.motifs.matrix.PositionWeightMatrix(alphabet, counts)

Bases: GenericPositionMatrix

Class for the support of weight calculations on the Position Matrix.

__init__(alphabet, counts): Initialize the class.

log_odds(background=None)

Return the Position-Specific Scoring Matrix.

The Position-Specific Scoring Matrix (PSSM) contains the log-odds scores computed from the probability matrix and the background probabilities. If the background is None, a uniform background distribution is assumed.

__annotations__ = {}

class Bio.motifs.matrix.PositionSpecificScoringMatrix(alphabet, values)

Bases: GenericPositionMatrix

Class for the support of Position Specific Scoring Matrix calculations.

calculate(sequence)

Return the PWM score for a given sequence for all positions.

Notes:

the sequence can only be a DNA sequence
the search is performed only on one strand
if the sequence and the motif have the same length, a single number is returned
otherwise, the result is a one-dimensional numpy array

search(sequence, threshold=0.0, both=True, chunksize=10**6)

Find hits with PWM score above given threshold.

A generator function, returning found hits in the given sequence with the pwm score higher than the threshold.

property max

Maximal possible score for this motif.

returns the score computed for the consensus sequence.

property min

Minimal possible score for this motif.

returns the score computed for the anticonsensus sequence.

property gc_content: Compute the GC-ratio.

mean(background=None): Return expected value of the score of a motif.

std(background=None): Return standard deviation of the score of a motif.

dist_pearson(other)

Return the similarity score based on pearson correlation for the given motif against self.

We use the Pearson’s correlation of the respective probabilities.

dist_pearson_at(other, offset): Return the similarity score based on pearson correlation at the given offset.

__annotations__ = {}

distribution(background=None, precision=10**3): Calculate the distribution of the scores at the given precision.