Pairwise sequence alignment
Pairwise sequence alignment is the process of aligning two sequences to
each other by optimizing the similarity score between them. The
Bio.Align
module contains the PairwiseAligner
class for global
and local alignments using the Needleman-Wunsch, Smith-Waterman, Gotoh
(three-state), and Waterman-Smith-Beyer global and local pairwise
alignment algorithms, with numerous options to change the alignment
parameters. We refer to Durbin et al. [Durbin1998]
for in-depth information on sequence alignment algorithms.
Basic usage
To generate pairwise alignments, first create a PairwiseAligner
object:
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
The PairwiseAligner
object aligner
(see
Section The pairwise aligner object) stores the alignment parameters
to be used for the pairwise alignments. These attributes can be set in
the constructor of the object:
>>> aligner = Align.PairwiseAligner(match_score=1.0)
or after the object is made:
>>> aligner.match_score = 1.0
Use the aligner.score
method to calculate the alignment score
between two sequences:
>>> target = "GAACT"
>>> query = "GAT"
>>> score = aligner.score(target, query)
>>> score
3.0
The aligner.align
method returns Alignment
objects, each
representing one alignment between the two sequences:
>>> alignments = aligner.align(target, query)
>>> alignment = alignments[0]
>>> alignment
<Alignment object (2 rows x 5 columns) at ...>
Iterate over the Alignment
objects and print them to see the
alignments:
>>> for alignment in alignments:
... print(alignment)
...
target 0 GAACT 5
0 ||--| 5
query 0 GA--T 3
target 0 GAACT 5
0 |-|-| 5
query 0 G-A-T 3
Each alignment stores the alignment score:
>>> alignment.score
3.0
as well as pointers to the sequences that were aligned:
>>> alignment.target
'GAACT'
>>> alignment.query
'GAT'
Internally, the alignment is stored in terms of the sequence coordinates:
>>> alignment = alignments[0]
>>> alignment.coordinates
array([[0, 2, 4, 5],
[0, 2, 2, 3]])
Here, the two rows refer to the target and query sequence. These coordinates show that the alignment consists of the following three blocks:
target[0:2]
aligned toquery[0:2]
;target[2:4]
aligned to a gap, sincequery[2:2]
is an empty string;target[4:5]
aligned toquery[2:3]
.
The number of aligned sequences is always 2 for a pairwise alignment:
>>> len(alignment)
2
The alignment length is defined as the number of columns in the alignment as printed. This is equal to the sum of the number of matches, number of mismatches, and the total length of gaps in the target and query:
>>> alignment.length
5
The aligned
property, which returns the start and end indices of
aligned subsequences, returns two tuples of length 2 for the first
alignment:
>>> alignment.aligned
array([[[0, 2],
[4, 5]],
[[0, 2],
[2, 3]]])
while for the alternative alignment, two tuples of length 3 are returned:
>>> alignment = alignments[1]
>>> print(alignment)
target 0 GAACT 5
0 |-|-| 5
query 0 G-A-T 3
>>> alignment.aligned
array([[[0, 1],
[2, 3],
[4, 5]],
[[0, 1],
[1, 2],
[2, 3]]])
Note that different alignments may have the same subsequences aligned to each other. In particular, this may occur if alignments differ from each other in terms of their gap placement only:
>>> aligner.mode = "global"
>>> aligner.mismatch_score = -10
>>> alignments = aligner.align("AAACAAA", "AAAGAAA")
>>> len(alignments)
2
>>> print(alignments[0])
target 0 AAAC-AAA 7
0 |||--||| 8
query 0 AAA-GAAA 7
>>> alignments[0].aligned
array([[[0, 3],
[4, 7]],
[[0, 3],
[4, 7]]])
>>> print(alignments[1])
target 0 AAA-CAAA 7
0 |||--||| 8
query 0 AAAG-AAA 7
>>> alignments[1].aligned
array([[[0, 3],
[4, 7]],
[[0, 3],
[4, 7]]])
The map
method can be applied on a pairwise alignment alignment1
to find the pairwise alignment of the query of alignment2
to the
target of alignment1
, where the target of alignment2
and the
query of alignment1
are identical. A typical example is where
alignment1
is the pairwise alignment between a chromosome and a
transcript, alignment2
is the pairwise alignment between the
transcript and a sequence (e.g., an RNA-seq read), and we want to find
the alignment of the sequence to the chromosome:
>>> aligner.mode = "local"
>>> aligner.open_gap_score = -1
>>> aligner.extend_gap_score = 0
>>> chromosome = "AAAAAAAACCCCCCCAAAAAAAAAAAGGGGGGAAAAAAAA"
>>> transcript = "CCCCCCCGGGGGG"
>>> alignments1 = aligner.align(chromosome, transcript)
>>> len(alignments1)
1
>>> alignment1 = alignments1[0]
>>> print(alignment1)
target 8 CCCCCCCAAAAAAAAAAAGGGGGG 32
0 |||||||-----------|||||| 24
query 0 CCCCCCC-----------GGGGGG 13
>>> sequence = "CCCCGGGG"
>>> alignments2 = aligner.align(transcript, sequence)
>>> len(alignments2)
1
>>> alignment2 = alignments2[0]
>>> print(alignment2)
target 3 CCCCGGGG 11
0 |||||||| 8
query 0 CCCCGGGG 8
>>> mapped_alignment = alignment1.map(alignment2)
>>> print(mapped_alignment)
target 11 CCCCAAAAAAAAAAAGGGG 30
0 ||||-----------|||| 19
query 0 CCCC-----------GGGG 8
>>> format(mapped_alignment, "psl")
'8\t0\t0\t0\t0\t0\t1\t11\t+\tquery\t8\t0\t8\ttarget\t40\t11\t30\t2\t4,4,\t0,4,\t11,26,\n'
Mapping the alignment does not depend on the sequence contents. If we delete the sequence contents, the same alignment is found in PSL format (though we obviously lose the ability to print the sequence alignment):
>>> from Bio.Seq import Seq
>>> alignment1.target = Seq(None, len(alignment1.target))
>>> alignment1.query = Seq(None, len(alignment1.query))
>>> alignment2.target = Seq(None, len(alignment2.target))
>>> alignment2.query = Seq(None, len(alignment2.query))
>>> mapped_alignment = alignment1.map(alignment2)
>>> format(mapped_alignment, "psl")
'8\t0\t0\t0\t0\t0\t1\t11\t+\tquery\t8\t0\t8\ttarget\t40\t11\t30\t2\t4,4,\t0,4,\t11,26,\n'
By default, a global pairwise alignment is performed, which finds the
optimal alignment over the whole length of target
and query
.
Instead, a local alignment will find the subsequence of target
and
query
with the highest alignment score. Local alignments can be
generated by setting aligner.mode
to "local"
:
>>> aligner.mode = "local"
>>> target = "AGAACTC"
>>> query = "GAACT"
>>> score = aligner.score(target, query)
>>> score
5.0
>>> alignments = aligner.align(target, query)
>>> for alignment in alignments:
... print(alignment)
...
target 1 GAACT 6
0 ||||| 5
query 0 GAACT 5
Note that there is some ambiguity in the definition of the best local alignments if segments with a score 0 can be added to the alignment. We follow the suggestion by Waterman & Eggert [Waterman1987] and disallow such extensions.
The pairwise aligner object
The PairwiseAligner
object stores all alignment parameters to be
used for the pairwise alignments. To see an overview of the values for
all parameters, use
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(match_score=1.0, mode="local")
>>> print(aligner)
Pairwise sequence aligner with parameters
wildcard: None
match_score: 1.000000
mismatch_score: 0.000000
target_internal_open_gap_score: 0.000000
target_internal_extend_gap_score: 0.000000
target_left_open_gap_score: 0.000000
target_left_extend_gap_score: 0.000000
target_right_open_gap_score: 0.000000
target_right_extend_gap_score: 0.000000
query_internal_open_gap_score: 0.000000
query_internal_extend_gap_score: 0.000000
query_left_open_gap_score: 0.000000
query_left_extend_gap_score: 0.000000
query_right_open_gap_score: 0.000000
query_right_extend_gap_score: 0.000000
mode: local
See Sections Substitution scores,
Affine gap scores, and
General gap scores below for the definition of
these parameters. The attribute mode
(described above in
Section Basic usage) can be set equal to "global"
or "local"
to specify global or local pairwise alignment,
respectively.
Depending on the gap scoring parameters (see
Sections Affine gap scores and
General gap scores) and mode, a
PairwiseAligner
object automatically chooses the appropriate
algorithm to use for pairwise sequence alignment. To verify the selected
algorithm, use
>>> aligner.algorithm
'Smith-Waterman'
This attribute is read-only.
A PairwiseAligner
object also stores the precision \(\epsilon\)
to be used during alignment. The value of \(\epsilon\) is stored in
the attribute aligner.epsilon
, and by default is equal to
\(10^{-6}\):
>>> aligner.epsilon
1e-06
Two scores will be considered equal to each other for the purpose of the alignment if the absolute difference between them is less than \(\epsilon\).
Substitution scores
Substitution scores define the value to be added to the total score when
two letters (nucleotides or amino acids) are aligned to each other. The
substitution scores to be used by the PairwiseAligner
can be
specified in two ways:
By specifying a match score for identical letters, and a mismatch scores for mismatched letters. Nucleotide sequence alignments are typically based on match and mismatch scores. For example, by default BLAST [Altschul1990] uses a match score of \(+1\) and a mismatch score of \(-2\) for nucleotide alignments by
megablast
, with a gap penalty of 2.5 (see section Affine gap scores for more information on gap scores). Match and mismatch scores can be specified by setting thematch
andmismatch
attributes of thePairwiseAligner
object:>>> from Bio import Align >>> aligner = Align.PairwiseAligner() >>> aligner.match_score 1.0 >>> aligner.mismatch_score 0.0 >>> score = aligner.score("ACGT", "ACAT") >>> print(score) 3.0 >>> aligner.match_score = 1.0 >>> aligner.mismatch_score = -2.0 >>> aligner.gap_score = -2.5 >>> score = aligner.score("ACGT", "ACAT") >>> print(score) 1.0
When using match and mismatch scores, you can specify a wildcard character (
None
by default) for unknown letters. These will get a zero score in alignments, irrespective of the value of the match or mismatch score:>>> aligner.wildcard = "?" >>> score = aligner.score("ACGT", "AC?T") >>> print(score) 3.0
Alternatively, you can use the
substitution_matrix
attribute of thePairwiseAligner
object to specify a substitution matrix. This allows you to apply different scores for different pairs of matched and mismatched letters. This is typically used for amino acid sequence alignments. For example, by default BLAST [Altschul1990] uses the BLOSUM62 substitution matrix for protein alignments byblastp
. This substitution matrix is available from Biopython:>>> from Bio.Align import substitution_matrices >>> substitution_matrices.load() ['BENNER22', 'BENNER6', 'BENNER74', 'BLASTN', 'BLASTP', 'BLOSUM45', 'BLOSUM50', 'BLOSUM62', ..., 'TRANS'] >>> matrix = substitution_matrices.load("BLOSUM62") >>> print(matrix) # Matrix made by matblas from blosum62.iij ... A R N D C Q ... A 4.0 -1.0 -2.0 -2.0 0.0 -1.0 ... R -1.0 5.0 0.0 -2.0 -3.0 1.0 ... N -2.0 0.0 6.0 1.0 -3.0 0.0 ... D -2.0 -2.0 1.0 6.0 -3.0 0.0 ... C 0.0 -3.0 -3.0 -3.0 9.0 -3.0 ... Q -1.0 1.0 0.0 0.0 -3.0 5.0 ... ... >>> aligner.substitution_matrix = matrix >>> score = aligner.score("ACDQ", "ACDQ") >>> score 24.0 >>> score = aligner.score("ACDQ", "ACNQ") >>> score 19.0
When using a substitution matrix,
X
is not interpreted as an unknown character. Instead, the score provided by the substitution matrix will be used:>>> matrix["D", "X"] -1.0 >>> score = aligner.score("ACDQ", "ACXQ") >>> score 17.0
By default, aligner.substitution_matrix
is None
. The attributes
aligner.match_score
and aligner.mismatch_score
are ignored if
aligner.substitution_matrix
is not None
. Setting
aligner.match_score
or aligner.mismatch_score
to valid values
will reset aligner.substitution_matrix
to None
.
Affine gap scores
Affine gap scores are defined by a score to open a gap, and a score to extend an existing gap:
\(\textrm{gap score} = \textrm{open gap score} + (n-1) \times \textrm{extend gap score}\),
where \(n\) is the length of the gap. Biopython’s pairwise sequence
aligner allows fine-grained control over the gap scoring scheme by
specifying the following twelve attributes of a PairwiseAligner
object:
Opening scores |
Extending scores |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
These attributes allow for different gap scores for internal gaps and on either end of the sequence, as shown in this example:
target |
query |
score |
---|---|---|
A |
query left open gap score |
|
C |
query left extend gap score |
|
C |
query left extend gap score |
|
G |
G |
match score |
G |
T |
mismatch score |
G |
query internal open gap score |
|
A |
query internal extend gap score |
|
A |
query internal extend gap score |
|
T |
T |
match score |
A |
A |
match score |
G |
query internal open gap score |
|
C |
C |
match score |
C |
target internal open gap score |
|
C |
target internal extend gap score |
|
C |
C |
match score |
T |
G |
mismatch score |
C |
C |
match score |
C |
target internal open gap score |
|
A |
A |
match score |
T |
target right open gap score |
|
A |
target right extend gap score |
|
A |
target right extend gap score |
For convenience, PairwiseAligner
objects have additional attributes
that refer to a number of these values collectively, as shown
(hierarchically) in Table Meta-attributes of the pairwise aligner objects..
Meta-attribute |
Attributes it maps to |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
General gap scores
For even more fine-grained control over the gap scores, you can specify a gap scoring function. For example, the gap scoring function below disallows a gap after two nucleotides in the query sequence:
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> def my_gap_score_function(start, length):
... if start == 2:
... return -1000
... else:
... return -1 * length
...
>>> aligner.query_gap_score = my_gap_score_function
>>> alignments = aligner.align("AACTT", "AATT")
>>> for alignment in alignments:
... print(alignment)
...
target 0 AACTT 5
0 -|.|| 5
query 0 -AATT 4
target 0 AACTT 5
0 |-.|| 5
query 0 A-ATT 4
target 0 AACTT 5
0 ||.-| 5
query 0 AAT-T 4
target 0 AACTT 5
0 ||.|- 5
query 0 AATT- 4
Using a pre-defined substitution matrix and gap scores
By default, a PairwiseAligner
object is initialized with a match
score of +1.0, a mismatch score of 0.0, and all gap scores equal to 0.0,
While this has the benefit of being a simple scoring scheme, in general
it does not give the best performance. Instead, you can use the argument
scoring
to select a predefined scoring scheme when initializing a
PairwiseAligner
object. Currently, the provided scoring schemes are
blastn
and megablast
, which are suitable for nucleotide
alignments, and blastp
, which is suitable for protein alignments.
Selecting these scoring schemes will initialize the PairwiseAligner
object to the default scoring parameters used by BLASTN, MegaBLAST, and
BLASTP, respectively.
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(scoring="blastn")
>>> print(aligner)
Pairwise sequence aligner with parameters
substitution_matrix: <Array object at ...>
target_internal_open_gap_score: -7.000000
target_internal_extend_gap_score: -2.000000
target_left_open_gap_score: -7.000000
target_left_extend_gap_score: -2.000000
target_right_open_gap_score: -7.000000
target_right_extend_gap_score: -2.000000
query_internal_open_gap_score: -7.000000
query_internal_extend_gap_score: -2.000000
query_left_open_gap_score: -7.000000
query_left_extend_gap_score: -2.000000
query_right_open_gap_score: -7.000000
query_right_extend_gap_score: -2.000000
mode: global
>>> print(aligner.substitution_matrix[:, :])
A T G C S W R Y K M B V H D N
A 2.0 -3.0 -3.0 -3.0 -3.0 -1.0 -1.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -2.0
T -3.0 2.0 -3.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -2.0
G -3.0 -3.0 2.0 -3.0 -1.0 -3.0 -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -3.0 -1.0 -2.0
C -3.0 -3.0 -3.0 2.0 -1.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -3.0 -2.0
S -3.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
W -1.0 -1.0 -3.0 -3.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
R -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
Y -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
K -3.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -2.0
M -1.0 -3.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
B -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
V -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
H -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
D -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
N -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0
Iterating over alignments
The alignments
returned by aligner.align
are a kind of immutable
iterable objects (similar to range
). While they appear similar to a
tuple
or list
of Alignment
objects, they are different in
the sense that each Alignment
object is created dynamically when it
is needed. This approach was chosen because the number of alignments can
be extremely large, in particular for poor alignments (see
Section Examples for an example).
You can perform the following operations on alignments
:
len(alignments)
returns the number of alignments stored. This function returns quickly, even if the number of alignments is huge. If the number of alignments is extremely large (typically, larger than 9,223,372,036,854,775,807, which is the largest integer that can be stored as along int
on 64 bit machines),len(alignments)
will raise anOverflowError
. A large number of alignments suggests that the alignment quality is low.>>> from Bio import Align >>> aligner = Align.PairwiseAligner() >>> alignments = aligner.align("AAA", "AA") >>> len(alignments) 3
You can extract a specific alignment by index:
>>> from Bio import Align >>> aligner = Align.PairwiseAligner() >>> alignments = aligner.align("AAA", "AA") >>> print(alignments[2]) target 0 AAA 3 0 -|| 3 query 0 -AA 2 >>> print(alignments[0]) target 0 AAA 3 0 ||- 3 query 0 AA- 2
You can iterate over alignments, for example as in
>>> for alignment in alignments: ... print(alignment) ...
The
alignments
iterator can be converted into alist
ortuple
:>>> alignments = list(alignments)
It is wise to check the number of alignments by calling
len(alignments)
before attempting to calllist(alignments)
to save all alignments as a list.The alignment score (which has the same value for each alignment in
alignments
) is stored as an attribute. This allows you to check the alignment score before proceeding to extract individual alignments:>>> print(alignments.score) 2.0
Aligning to the reverse strand
By default, the pairwise aligner aligns the forward strand of the query
to the forward strand of the target. To calculate the alignment score
for query
to the reverse strand of target
, use strand="-"
:
>>> from Bio import Align
>>> from Bio.Seq import reverse_complement
>>> target = "AAAACCC"
>>> query = "AACC"
>>> aligner = Align.PairwiseAligner()
>>> aligner.mismatch_score = -1
>>> aligner.internal_gap_score = -1
>>> aligner.score(target, query) # strand is "+" by default
4.0
>>> aligner.score(target, reverse_complement(query), strand="-")
4.0
>>> aligner.score(target, query, strand="-")
0.0
>>> aligner.score(target, reverse_complement(query))
0.0
The alignments against the reverse strand can be obtained by specifying
strand="-"
when calling aligner.align
:
>>> alignments = aligner.align(target, query)
>>> len(alignments)
1
>>> print(alignments[0])
target 0 AAAACCC 7
0 --||||- 7
query 0 --AACC- 4
>>> print(alignments[0].format("bed"))
target 2 6 query 4 + 2 6 0 1 4, 0,
>>> alignments = aligner.align(target, reverse_complement(query), strand="-")
>>> len(alignments)
1
>>> print(alignments[0])
target 0 AAAACCC 7
0 --||||- 7
query 4 --AACC- 0
>>> print(alignments[0].format("bed"))
target 2 6 query 4 - 2 6 0 1 4, 0,
>>> alignments = aligner.align(target, query, strand="-")
>>> len(alignments)
2
>>> print(alignments[0])
target 0 AAAACCC---- 7
0 ----------- 11
query 4 -------GGTT 0
>>> print(alignments[1])
target 0 ----AAAACCC 7
0 ----------- 11
query 4 GGTT------- 0
Note that the score for aligning query
to the reverse strand of
target
may be different from the score for aligning the reverse
complement of query
to the forward strand of target
if the left
and right gap scores are different:
>>> aligner.left_gap_score = -0.5
>>> aligner.right_gap_score = -0.2
>>> aligner.score(target, query)
2.8
>>> alignments = aligner.align(target, query)
>>> len(alignments)
1
>>> print(alignments[0])
target 0 AAAACCC 7
0 --||||- 7
query 0 --AACC- 4
>>> aligner.score(target, reverse_complement(query), strand="-")
3.1
>>> alignments = aligner.align(target, reverse_complement(query), strand="-")
>>> len(alignments)
1
>>> print(alignments[0])
target 0 AAAACCC 7
0 --||||- 7
query 4 --AACC- 0
Substitution matrices
Substitution matrices [Durbin1998] provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. Biopython provides a ton of common substitution matrices, including the famous PAM and BLOSUM series of matrices, and also provides functionality for creating your own substitution matrices.
Array objects
You can think of substitutions matrices as 2D arrays in which the
indices are letters (nucleotides or amino acids) rather than integers.
The Array
class in Bio.Align.substitution_matrices
is a subclass
of numpy arrays that supports indexing both by integers and by specific
strings. An Array
instance can either be a one-dimensional array or
a square two-dimensional arrays. A one-dimensional Array
object can
for example be used to store the nucleotide frequency of a DNA sequence,
while a two-dimensional Array
object can be used to represent a
scoring matrix for sequence alignments.
To create a one-dimensional Array
, only the alphabet of allowed
letters needs to be specified:
>>> from Bio.Align.substitution_matrices import Array
>>> counts = Array("ACGT")
>>> print(counts)
A 0.0
C 0.0
G 0.0
T 0.0
The allowed letters are stored in the alphabet
property:
>>> counts.alphabet
'ACGT'
This property is read-only; modifying the underlying _alphabet
attribute may lead to unexpected results. Elements can be accessed both
by letter and by integer index:
>>> counts["C"] = -3
>>> counts[2] = 7
>>> print(counts)
A 0.0
C -3.0
G 7.0
T 0.0
>>> counts[1]
-3.0
Using a letter that is not in the alphabet, or an index that is out of
bounds, will cause a IndexError
:
>>> counts["U"]
Traceback (most recent call last):
...
IndexError: 'U'
>>> counts["X"] = 6
Traceback (most recent call last):
...
IndexError: 'X'
>>> counts[7]
Traceback (most recent call last):
...
IndexError: index 7 is out of bounds for axis 0 with size 4
A two-dimensional Array
can be created by specifying dims=2
:
>>> from Bio.Align.substitution_matrices import Array
>>> counts = Array("ACGT", dims=2)
>>> print(counts)
A C G T
A 0.0 0.0 0.0 0.0
C 0.0 0.0 0.0 0.0
G 0.0 0.0 0.0 0.0
T 0.0 0.0 0.0 0.0
Again, both letters and integers can be used for indexing, and
specifying a letter that is not in the alphabet will cause an
IndexError
:
>>> counts["A", "C"] = 12.0
>>> counts[2, 1] = 5.0
>>> counts[3, "T"] = -2
>>> print(counts)
A C G T
A 0.0 12.0 0.0 0.0
C 0.0 0.0 0.0 0.0
G 0.0 5.0 0.0 0.0
T 0.0 0.0 0.0 -2.0
>>> counts["X", 1]
Traceback (most recent call last):
...
IndexError: 'X'
>>> counts["A", 5]
Traceback (most recent call last):
...
IndexError: index 5 is out of bounds for axis 1 with size 4
Selecting a row or column from the two-dimensional array will return a
one-dimensional Array
:
>>> counts = Array("ACGT", dims=2)
>>> counts["A", "C"] = 12.0
>>> counts[2, 1] = 5.0
>>> counts[3, "T"] = -2
>>> counts["G"]
Array([0., 5., 0., 0.],
alphabet='ACGT')
>>> counts[:, "C"]
Array([12., 0., 5., 0.],
alphabet='ACGT')
Array
objects can thus be used as an array and as a dictionary. They
can be converted to plain numpy arrays or plain dictionary objects:
>>> import numpy as np
>>> x = Array("ACGT")
>>> x["C"] = 5
>>> x
Array([0., 5., 0., 0.],
alphabet='ACGT')
>>> a = np.array(x) # create a plain numpy array
>>> a
array([0., 5., 0., 0.])
>>> d = dict(x) # create a plain dictionary
>>> d
{'A': 0.0, 'C': 5.0, 'G': 0.0, 'T': 0.0}
While the alphabet of an Array
is usually a string, you may also use
a tuple of (immutable) objects. This is used for example for a codon
substitution matrix (as in the
substitution_matrices.load("SCHNEIDER")
example shown later), where
the keys are not individual nucleotides or amino acids but instead
three-nucleotide codons.
While the alphabet
property of an Array
is immutable, you can
create a new Array
object by selecting the letters you are
interested in from the alphabet. For example,
>>> a = Array("ABCD", dims=2, data=np.arange(16).reshape(4, 4))
>>> print(a)
A B C D
A 0.0 1.0 2.0 3.0
B 4.0 5.0 6.0 7.0
C 8.0 9.0 10.0 11.0
D 12.0 13.0 14.0 15.0
>>> b = a.select("CAD")
>>> print(b)
C A D
C 10.0 8.0 11.0
A 2.0 0.0 3.0
D 14.0 12.0 15.0
Note that this also allows you to reorder the alphabet.
Data for letters that are not found in the alphabet are set to zero:
>>> c = a.select("DEC")
>>> print(c)
D E C
D 15.0 0.0 14.0
E 0.0 0.0 0.0
C 11.0 0.0 10.0
As the Array
class is a subclass of numpy array, it can be used as
such. A ValueError
is triggered if the Array
objects appearing
in a mathematical operation have different alphabets, for example
>>> from Bio.Align.substitution_matrices import Array
>>> d = Array("ACGT")
>>> r = Array("ACGU")
>>> d + r
Traceback (most recent call last):
...
ValueError: alphabets are inconsistent
Calculating a substitution matrix from a pairwise sequence alignment
As Array
is a subclass of a numpy array, you can apply mathematical
operations on an Array
object in much the same way. Here, we
illustrate this by calculating a scoring matrix from the alignment of
the 16S ribosomal RNA gene sequences of Escherichia coli and Bacillus
subtilis. First, we create a PairwiseAligner
object (see
Chapter Pairwise sequence alignment) and initialize it with the default
scores used by blastn
:
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner(scoring="blastn")
>>> aligner.mode = "local"
Next, we read in the 16S ribosomal RNA gene sequence of Escherichia
coli and Bacillus subtilis (provided in Tests/Align/ecoli.fa
and
Tests/Align/bsubtilis.fa
), and align them to each other:
>>> from Bio import SeqIO
>>> sequence1 = SeqIO.read("ecoli.fa", "fasta")
>>> sequence2 = SeqIO.read("bsubtilis.fa", "fasta")
>>> alignments = aligner.align(sequence1, sequence2)
The number of alignments generated is very large:
>>> len(alignments)
1990656
However, as they only differ trivially from each other, we arbitrarily choose the first alignment, and count the number of each substitution:
>>> alignment = alignments[0]
>>> substitutions = alignment.substitutions
>>> print(substitutions)
A C G T
A 307.0 19.0 34.0 19.0
C 15.0 280.0 25.0 29.0
G 34.0 24.0 401.0 20.0
T 24.0 36.0 20.0 228.0
We normalize against the total number to find the probability of each substitution, and create a symmetric matrix of observed frequencies:
>>> observed_frequencies = substitutions / substitutions.sum()
>>> observed_frequencies = (observed_frequencies + observed_frequencies.transpose()) / 2.0
>>> print(format(observed_frequencies, "%.4f"))
A C G T
A 0.2026 0.0112 0.0224 0.0142
C 0.0112 0.1848 0.0162 0.0215
G 0.0224 0.0162 0.2647 0.0132
T 0.0142 0.0215 0.0132 0.1505
The background probability is the probability of finding an A, C, G, or T nucleotide in each sequence separately. This can be calculated as the sum of each row or column:
>>> background = observed_frequencies.sum(0)
>>> print(format(background, "%.4f"))
A 0.2505
C 0.2337
G 0.3165
T 0.1993
The number of substitutions expected at random is simply the product of the background distribution with itself:
>>> expected_frequencies = background[:, None].dot(background[None, :])
>>> print(format(expected_frequencies, "%.4f"))
A C G T
A 0.0627 0.0585 0.0793 0.0499
C 0.0585 0.0546 0.0740 0.0466
G 0.0793 0.0740 0.1002 0.0631
T 0.0499 0.0466 0.0631 0.0397
The scoring matrix can then be calculated as the logarithm of the odds-ratio of the observed and the expected probabilities:
>>> oddsratios = observed_frequencies / expected_frequencies
>>> import numpy as np
>>> scoring_matrix = np.log2(oddsratios)
>>> print(scoring_matrix)
A C G T
A 1.7 -2.4 -1.8 -1.8
C -2.4 1.8 -2.2 -1.1
G -1.8 -2.2 1.4 -2.3
T -1.8 -1.1 -2.3 1.9
The matrix can be used to set the substitution matrix for the pairwise aligner (see Chapter Pairwise sequence alignment):
>>> aligner.substitution_matrix = scoring_matrix
Calculating a substitution matrix from a multiple sequence alignment
In this example, we’ll first read a protein sequence alignment from the Clustalw file protein.aln (also available online here)
>>> from Bio import Align
>>> filename = "protein.aln"
>>> alignment = Align.read(filename, "clustal")
Section ClustalW contains more information on doing this.
The substitutions
property of the alignment stores the number of
times different residues substitute for each other:
>>> substitutions = alignment.substitutions
To make the example more readable, we’ll select only amino acids with polar charged side chains:
>>> substitutions = substitutions.select("DEHKR")
>>> print(substitutions)
D E H K R
D 2360.0 270.0 15.0 1.0 48.0
E 241.0 3305.0 15.0 45.0 2.0
H 0.0 18.0 1235.0 8.0 0.0
K 0.0 9.0 24.0 3218.0 130.0
R 2.0 2.0 17.0 103.0 2079.0
Rows and columns for other amino acids were removed from the matrix.
Next, we normalize the matrix and make it symmetric.
>>> observed_frequencies = substitutions / substitutions.sum()
>>> observed_frequencies = (observed_frequencies + observed_frequencies.transpose()) / 2.0
>>> print(format(observed_frequencies, "%.4f"))
D E H K R
D 0.1795 0.0194 0.0006 0.0000 0.0019
E 0.0194 0.2514 0.0013 0.0021 0.0002
H 0.0006 0.0013 0.0939 0.0012 0.0006
K 0.0000 0.0021 0.0012 0.2448 0.0089
R 0.0019 0.0002 0.0006 0.0089 0.1581
Summing over rows or columns gives the relative frequency of occurrence of each residue:
>>> background = observed_frequencies.sum(0)
>>> print(format(background, "%.4f"))
D 0.2015
E 0.2743
H 0.0976
K 0.2569
R 0.1697
>>> sum(background) == 1.0
True
The expected frequency of residue pairs is then
>>> expected_frequencies = background[:, None].dot(background[None, :])
>>> print(format(expected_frequencies, "%.4f"))
D E H K R
D 0.0406 0.0553 0.0197 0.0518 0.0342
E 0.0553 0.0752 0.0268 0.0705 0.0465
H 0.0197 0.0268 0.0095 0.0251 0.0166
K 0.0518 0.0705 0.0251 0.0660 0.0436
R 0.0342 0.0465 0.0166 0.0436 0.0288
Here, background[:, None]
creates a 2D array consisting of a single
column with the values of expected_frequencies
, and
rxpected_frequencies[None, :]
a 2D array with these values as a
single row. Taking their dot product (inner product) creates a matrix of
expected frequencies where each entry consists of two
expected_frequencies
values multiplied with each other. For example,
expected_frequencies['D', 'E']
is equal to
residue_frequencies['D'] * residue_frequencies['E']
.
We can now calculate the log-odds matrix by dividing the observed frequencies by the expected frequencies and taking the logarithm:
>>> import numpy as np
>>> scoring_matrix = np.log2(observed_frequencies / expected_frequencies)
>>> print(scoring_matrix)
D E H K R
D 2.1 -1.5 -5.1 -10.4 -4.2
E -1.5 1.7 -4.4 -5.1 -8.3
H -5.1 -4.4 3.3 -4.4 -4.7
K -10.4 -5.1 -4.4 1.9 -2.3
R -4.2 -8.3 -4.7 -2.3 2.5
This matrix can be used as the substitution matrix when performing alignments. For example,
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.substitution_matrix = scoring_matrix
>>> aligner.gap_score = -3.0
>>> alignments = aligner.align("DEHEK", "DHHKK")
>>> print(alignments[0])
target 0 DEHEK 5
0 |.|.| 5
query 0 DHHKK 5
>>> print("%.2f" % alignments.score)
-2.18
>>> score = (
... scoring_matrix["D", "D"]
... + scoring_matrix["E", "H"]
... + scoring_matrix["H", "H"]
... + scoring_matrix["E", "K"]
... + scoring_matrix["K", "K"]
... )
>>> print("%.2f" % score)
-2.18
(see Chapter Pairwise sequence alignment for details).
Reading Array
objects from file
Bio.Align.substitution_matrices
includes a parser to read one- and
two-dimensional Array
objects from file. One-dimensional arrays are
represented by a simple two-column format, with the first column
containing the key and the second column the corresponding value. For
example, the file hg38.chrom.sizes
(obtained from UCSC), available
in the Tests/Align
subdirectory of the Biopython distribution,
contains the size in nucleotides of each chromosome in human genome
assembly hg38:
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
...
chrUn_KI270385v1 990
chrUn_KI270423v1 981
chrUn_KI270392v1 971
chrUn_KI270394v1 970
To parse this file, use
>>> from Bio.Align import substitution_matrices
>>> with open("hg38.chrom.sizes") as handle:
... table = substitution_matrices.read(handle)
...
>>> print(table)
chr1 248956422.0
chr2 242193529.0
chr3 198295559.0
chr4 190214555.0
...
chrUn_KI270423v1 981.0
chrUn_KI270392v1 971.0
chrUn_KI270394v1 970.0
Use dtype=int
to read the values as integers:
>>> with open("hg38.chrom.sizes") as handle:
... table = substitution_matrices.read(handle, int)
...
>>> print(table)
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
...
chrUn_KI270423v1 981
chrUn_KI270392v1 971
chrUn_KI270394v1 970
For two-dimensional arrays, we follow the file format of substitution
matrices provided by NCBI. For example, the BLOSUM62 matrix, which is
the default substitution matrix for NCBI’s protein-protein BLAST
[Altschul1990] program blastp
, is stored as
follows:
# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
...
This file is included in the Biopython distribution under
Bio/Align/substitution_matrices/data
. To parse this file, use
>>> from Bio.Align import substitution_matrices
>>> with open("BLOSUM62") as handle:
... matrix = substitution_matrices.read(handle)
...
>>> print(matrix.alphabet)
ARNDCQEGHILKMFPSTWYVBZX*
>>> print(matrix["A", "D"])
-2.0
The header lines starting with #
are stored in the attribute
header
:
>>> matrix.header[0]
'Matrix made by matblas from blosum62.iij'
We can now use this matrix as the substitution matrix on an aligner object:
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.substitution_matrix = matrix
To save an Array object, create a string first:
>>> text = str(matrix)
>>> print(text)
# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
A R N D C Q E G H I L K M F P S ...
A 4.0 -1.0 -2.0 -2.0 0.0 -1.0 -1.0 0.0 -2.0 -1.0 -1.0 -1.0 -1.0 -2.0 -1.0 1.0 ...
R -1.0 5.0 0.0 -2.0 -3.0 1.0 0.0 -2.0 0.0 -3.0 -2.0 2.0 -1.0 -3.0 -2.0 -1.0 ...
N -2.0 0.0 6.0 1.0 -3.0 0.0 0.0 0.0 1.0 -3.0 -3.0 0.0 -2.0 -3.0 -2.0 1.0 ...
D -2.0 -2.0 1.0 6.0 -3.0 0.0 2.0 -1.0 -1.0 -3.0 -4.0 -1.0 -3.0 -3.0 -1.0 0.0 ...
C 0.0 -3.0 -3.0 -3.0 9.0 -3.0 -4.0 -3.0 -3.0 -1.0 -1.0 -3.0 -1.0 -2.0 -3.0 -1.0 ...
...
and write the text
to a file.
Loading predefined substitution matrices
Biopython contains a large set of substitution matrices defined in the
literature, including BLOSUM (Blocks Substitution Matrix)
[Henikoff1992] and PAM (Point Accepted Mutation)
matrices [Dayhoff1978]. These matrices are available
as flat files in the Bio/Align/substitution_matrices/data
directory,
and can be loaded into Python using the load
function in the
substitution_matrices
submodule. For example, the BLOSUM62 matrix
can be loaded by running
>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("BLOSUM62")
This substitution matrix has an alphabet consisting of the 20 amino acids used in the genetic code, the three ambiguous amino acids B (asparagine or aspartic acid), Z (glutamine or glutamic acid), and X (representing any amino acid), and the stop codon represented by an asterisk:
>>> m.alphabet
'ARNDCQEGHILKMFPSTWYVBZX*'
To get a full list of available substitution matrices, use load
without an argument:
>>> substitution_matrices.load()
['BENNER22', 'BENNER6', 'BENNER74', 'BLASTN', 'BLASTP', 'BLOSUM45', 'BLOSUM50', ..., 'TRANS']
Note that the substitution matrix provided by Schneider et al. [Schneider2005] uses an alphabet consisting of three-nucleotide codons:
>>> m = substitution_matrices.load("SCHNEIDER")
>>> m.alphabet
('AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', ..., 'TTG', 'TTT')
Examples
Suppose you want to do a global pairwise alignment between the same two
hemoglobin sequences from above (HBA_HUMAN
, HBB_HUMAN
) stored in
alpha.faa
and beta.faa
:
>>> from Bio import Align
>>> from Bio import SeqIO
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> aligner = Align.PairwiseAligner()
>>> score = aligner.score(seq1.seq, seq2.seq)
>>> print(score)
72.0
showing an alignment score of 72.0. To see the individual alignments, do
>>> alignments = aligner.align(seq1.seq, seq2.seq)
In this example, the total number of optimal alignments is huge (more
than \(4 \times 10^{37}\)), and calling len(alignments)
will
raise an OverflowError
:
>>> len(alignments)
Traceback (most recent call last):
...
OverflowError: number of optimal alignments is larger than 9223372036854775807
Let’s have a look at the first alignment:
>>> alignment = alignments[0]
The alignment object stores the alignment score, as well as the alignment itself:
>>> print(alignment.score)
72.0
>>> print(alignment)
target 0 MV-LS-PAD--KTN--VK-AA-WGKV-----GAHAGEYGAEALE-RMFLSF----P-TTK
0 ||-|--|----|----|--|--||||-----|---||--|--|--|--|------|-|--
query 0 MVHL-TP--EEK--SAV-TA-LWGKVNVDEVG---GE--A--L-GR--L--LVVYPWT--
target 41 TY--FPHF----DLSHGS---AQVK-G------HGKKV--A--DA-LTNAVAHV-DDMPN
60 ----|--|----|||------|-|--|------|||||--|--|--|--|--|--|---|
query 39 --QRF--FESFGDLS---TPDA-V-MGNPKVKAHGKKVLGAFSD-GL--A--H-LD---N
target 79 ALS----A-LSD-LHAH--KLR-VDPV-NFK-LLSHC---LLVT--LAAHLPA----EFT
120 -|-----|-||--||----||--|||--||--||------|-|---||-|-------|||
query 81 -L-KGTFATLS-ELH--CDKL-HVDP-ENF-RLL---GNVL-V-CVLA-H---HFGKEFT
target 119 PA-VH-ASLDKFLAS---VSTV------LTS--KYR- 142
180 |--|--|------|----|--|------|----||-- 217
query 124 P-PV-QA------A-YQKV--VAGVANAL--AHKY-H 147
Better alignments are usually obtained by penalizing gaps: higher costs
for opening a gap and lower costs for extending an existing gap. For
amino acid sequences match scores are usually encoded in matrices like
PAM
or BLOSUM
. Thus, a more meaningful alignment for our example
can be obtained by using the BLOSUM62 matrix, together with a gap open
penalty of 10 and a gap extension penalty of 0.5:
>>> from Bio import Align
>>> from Bio import SeqIO
>>> from Bio.Align import substitution_matrices
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> aligner = Align.PairwiseAligner()
>>> aligner.open_gap_score = -10
>>> aligner.extend_gap_score = -0.5
>>> aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
>>> score = aligner.score(seq1.seq, seq2.seq)
>>> print(score)
292.5
>>> alignments = aligner.align(seq1.seq, seq2.seq)
>>> len(alignments)
2
>>> print(alignments[0].score)
292.5
>>> print(alignments[0])
target 0 MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGS
0 ||-|.|..|..|.|.||||--...|.|.|||.|.....|.|...|..|-|||-----.|.
query 0 MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
target 53 AQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
60 ..||.|||||..|.....||.|........||.||..||.|||.||.||...|...||.|
query 58 PKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH
target 113 LPAEFTPAVHASLDKFLASVSTVLTSKYR 142
120 ...||||.|.|...|..|.|...|..||. 149
query 118 FGKEFTPPVQAAYQKVVAGVANALAHKYH 147
This alignment has the same score that we obtained earlier with EMBOSS needle using the same sequences and the same parameters.
To perform a local alignment, set aligner.mode
to 'local'
:
>>> aligner.mode = "local"
>>> aligner.open_gap_score = -10
>>> aligner.extend_gap_score = -1
>>> alignments = aligner.align("LSPADKTNVKAA", "PEEKSAV")
>>> print(len(alignments))
1
>>> alignment = alignments[0]
>>> print(alignment)
target 2 PADKTNV 9
0 |..|..| 7
query 0 PEEKSAV 7
>>> print(alignment.score)
16.0
Generalized pairwise alignments
In most cases, PairwiseAligner
is used to perform alignments of
sequences (strings or Seq
objects) consisting of single-letter
nucleotides or amino acids. More generally, PairwiseAligner
can also
be applied to lists or tuples of arbitrary objects. This section will
describe some examples of such generalized pairwise alignments.
Generalized pairwise alignments using a substitution matrix and alphabet
Schneider et al. [Schneider2005] created a substitution matrix for aligning three-nucleotide codons (see below in section Substitution matrices for more information). This substitution matrix is associated with an alphabet consisting of all three-letter codons:
>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("SCHNEIDER")
>>> m.alphabet
('AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', ..., 'TTG', 'TTT')
We can use this matrix to align codon sequences to each other:
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> aligner.substitution_matrix = m
>>> aligner.gap_score = -1.0
>>> s1 = ("AAT", "CTG", "TTT", "TTT")
>>> s2 = ("AAT", "TTA", "TTT")
>>> alignments = aligner.align(s1, s2)
>>> len(alignments)
2
>>> print(alignments[0])
AAT CTG TTT TTT
||| ... ||| ---
AAT TTA TTT ---
>>> print(alignments[1])
AAT CTG TTT TTT
||| ... --- |||
AAT TTA --- TTT
Note that aligning TTT
to TTA
, as in this example:
AAT CTG TTT TTT
||| --- ... |||
AAT --- TTA TTT
would get a much lower score:
>>> print(m["CTG", "TTA"])
7.6
>>> print(m["TTT", "TTA"])
-0.3
presumably because CTG
and TTA
both code for leucine, while
TTT
codes for phenylalanine. The three-letter codon substitution
matrix also reveals a preference among codons representing the same
amino acid. For example, TTA
has a preference for CTG
preferred
compared to CTC
, though all three code for leucine:
>>> s1 = ("AAT", "CTG", "CTC", "TTT")
>>> s2 = ("AAT", "TTA", "TTT")
>>> alignments = aligner.align(s1, s2)
>>> len(alignments)
1
>>> print(alignments[0])
AAT CTG CTC TTT
||| ... --- |||
AAT TTA --- TTT
>>> print(m["CTC", "TTA"])
6.5
Generalized pairwise alignments using match/mismatch scores and an alphabet
Using the three-letter amino acid symbols, the sequences above translate to
>>> s1 = ("Asn", "Leu", "Leu", "Phe")
>>> s2 = ("Asn", "Leu", "Phe")
We can align these sequences directly to each other by using a three-letter amino acid alphabet:
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> aligner.alphabet = ['Ala', 'Arg', 'Asn', 'Asp', 'Cys',
... 'Gln', 'Glu', 'Gly', 'His', 'Ile',
... 'Leu', 'Lys', 'Met', 'Phe', 'Pro',
... 'Ser', 'Thr', 'Trp', 'Tyr', 'Val'] # fmt: skip
...
We use +6/-1 match and mismatch scores as an approximation of the BLOSUM62 matrix, and align these sequences to each other:
>>> aligner.match = +6
>>> aligner.mismatch = -1
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
Asn Leu Leu Phe
||| ||| --- |||
Asn Leu --- Phe
>>> print(alignments[1])
Asn Leu Leu Phe
||| --- ||| |||
Asn --- Leu Phe
>>> print(alignments.score)
18.0
Generalized pairwise alignments using match/mismatch scores and integer sequences
Internally, the first step when performing an alignment is to replace the two sequences by integer arrays consisting of the indices of each letter in each sequence in the alphabet associated with the aligner. This step can be bypassed by passing integer arrays directly:
>>> import numpy as np
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> s1 = np.array([2, 10, 10, 13], np.int32)
>>> s2 = np.array([2, 10, 13], np.int32)
>>> aligner.match = +6
>>> aligner.mismatch = -1
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
2 10 10 13
| || -- ||
2 10 -- 13
>>> print(alignments[1])
2 10 10 13
| -- || ||
2 -- 10 13
>>> print(alignments.score)
18.0
Note that the indices should consist of 32-bit integers, as specified in
this example by numpy.int32
.
Unknown letters can again be included by defining a wildcard character, and using the corresponding Unicode code point number as the index:
>>> aligner.wildcard = "?"
>>> ord(aligner.wildcard)
63
>>> s2 = np.array([2, 63, 13], np.int32)
>>> aligner.gap_score = -3
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
2 10 10 13
| .. -- ||
2 63 -- 13
>>> print(alignments[1])
2 10 10 13
| -- .. ||
2 -- 63 13
>>> print(alignments.score)
9.0
Generalized pairwise alignments using a substitution matrix and integer sequences
Integer sequences can also be aligned using a substitution matrix, in this case a numpy square array without an alphabet associated with it. In this case, all index values must be non-negative, and smaller than the size of the substitution matrix:
>>> from Bio import Align
>>> import numpy as np
>>> aligner = Align.PairwiseAligner()
>>> m = np.eye(5)
>>> m[0, 1:] = m[1:, 0] = -2
>>> m[2, 2] = 3
>>> print(m)
[[ 1. -2. -2. -2. -2.]
[-2. 1. 0. 0. 0.]
[-2. 0. 3. 0. 0.]
[-2. 0. 0. 1. 0.]
[-2. 0. 0. 0. 1.]]
>>> aligner.substitution_matrix = m
>>> aligner.gap_score = -1
>>> s1 = np.array([0, 2, 3, 4], np.int32)
>>> s2 = np.array([0, 3, 2, 1], np.int32)
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
0 - 2 3 4
| - | . -
0 3 2 1 -
>>> print(alignments[1])
0 - 2 3 4
| - | - .
0 3 2 - 1
>>> print(alignments.score)
2.0
Codon alignments
The CodonAligner
class in the Bio.Align
module implements a
specialized aligner for aligning a nucleotide sequence to the amino acid
sequence it encodes. Such alignments are non-trivial if frameshifts
occur during translation.
Aligning a nucleotide sequence to an amino acid sequence
To align a nucleotide sequence to an amino acid sequence, first create a
CodonAligner
object:
>>> from Bio import Align
>>> aligner = Align.CodonAligner()
The CodonAligner
object aligner
stores the alignment parameters
to be used for the alignments:
>>> print(aligner)
Codon aligner with parameters
wildcard: 'X'
match_score: 1.0
mismatch_score: 0.0
frameshift_minus_two_score: -3.0
frameshift_minus_one_score: -3.0
frameshift_plus_one_score: -3.0
frameshift_plus_two_score: -3.0
The wildcard
, match_score
, and mismatch_score
parameters are
defined in the same was as for the PairwiseAligner
class described
above (see Section The pairwise aligner object). The values
specified by the frameshift_minus_two_score
,
frameshift_minus_one_score
, frameshift_plus_one_score
, and
frameshift_plus_two_score
parameters are added to the alignment
score whenever a -2, -1, +1, or +2 frame shift, respectively, occurs in
the alignment. By default, the frame shift scores are set to -3.0.
Similar to the PairwiseAligner
class
(Table Meta-attributes of the pairwise aligner objects.), the CodonAligner
class defines additional attributes that refer to a number of these
values collectively, as shown in
Table Meta-attributes of CodonAligner objects..
Meta-attribute |
Attributes it maps to |
---|---|
|
|
|
|
|
|
|
|
|
|
Now let’s consider two nucleotide sequences and the amino acid sequences they encode:
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> nuc1 = Seq("TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG")
>>> rna1 = SeqRecord(nuc1, id="rna1")
>>> nuc2 = Seq("TCAGGGACTTCGAGAACCAAGCGCTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG")
>>> rna2 = SeqRecord(nuc2, id="rna2")
>>> aa1 = Seq("SGTARTKLLLLLAALCAAGGALE")
>>> aa2 = Seq("SGTSRTKRLLLLAALGAAGGALE")
>>> pro1 = SeqRecord(aa1, id="pro1")
>>> pro2 = SeqRecord(aa2, id="pro2")
While the two protein sequences both consist of 23 amino acids, the first nucleotide sequence consists of \(3 \times 23 = 69\) nucleotides while the second nucleotide sequence tonsists of only 68 nucleotides:
>>> len(pro1)
23
>>> len(pro2)
23
>>> len(rna1)
69
>>> len(rna2)
68
This is due to a -1 frame shift event during translation of the second
nucleotide sequence. Use CodonAligner.align
to align rna1
to
pro1
, and rna2
to pro2
, returning an iterator of
Alignment
objects:
>>> alignments1 = aligner.align(pro1, rna1)
>>> len(alignments1)
1
>>> alignment1 = next(alignments1)
>>> print(alignment1)
pro1 0 S G T A R T K L L L L L A A L C A A G G
rna1 0 TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGG
pro1 20 A L E 23
rna1 60 GCGCTGGAG 69
>>> alignment1.coordinates
array([[ 0, 23],
[ 0, 69]])
>>> alignment1[0]
'SGTARTKLLLLLAALCAAGGALE'
>>> alignment1[1]
'TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG'
>>> alignments2 = aligner.align(pro2, rna2)
>>> len(alignments2)
1
>>> alignment2 = next(alignments2)
>>> print(alignment2)
pro2 0 S G T S R T K R 8
rna2 0 TCAGGGACTTCGAGAACCAAGCGC 24
pro2 8 L L L L A A L G A A G G A L E 23
rna2 23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68
>>> alignment2[0]
'SGTSRTKRLLLLAALGAAGGALE'
>>> alignment2[1]
'TCAGGGACTTCGAGAACCAAGCGCCTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG'
>>> alignment2.coordinates
array([[ 0, 8, 8, 23],
[ 0, 24, 23, 68]])
While alignment1
is a continuous alignment of the 69 nucleotides to
the 23 amino acids, in alignment2
we find a -1 frame shift after 24
nucleotides. As alignment2[1]
contains the nucleotide sequence after
applying the -1 frame shift, it is one nucleotide longer than nuc2
and can be translated directly, resulting in the amino acid sequence
aa2
:
>>> from Bio.Seq import translate
>>> len(nuc2)
68
>>> len(alignment2[1])
69
>>> translate(alignment2[1])
'SGTSRTKRLLLLAALGAAGGALE'
>>> _ == aa2
True
The alignment score is stored as an attribute on the alignments1
and
alignments2
iterators, and on the individual alignments
alignment1
and alignment2
:
>>> alignments1.score
23.0
>>> alignment1.score
23.0
>>> alignments2.score
20.0
>>> alignment2.score
20.0
where the score of the rna1
-pro1
alignment is equal to the
number of aligned amino acids, and the score of the rna2
-pro2
alignment is 3 less due to the penalty for the frame shift. To calculate
the alignment score without calculating the alignment itself, the
score
method can be used:
>>> score = aligner.score(pro1, rna1)
>>> print(score)
23.0
>>> score = aligner.score(pro2, rna2)
>>> print(score)
20.0
Generating a multiple sequence alignment of codon sequences
Suppose we have a third related amino acid sequence and its associated nucleotide sequence:
>>> aa3 = Seq("MGTALLLLLAALCAAGGALE")
>>> pro3 = SeqRecord(aa3, id="pro3")
>>> nuc3 = Seq("ATGGGAACCGCGCTGCTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG")
>>> rna3 = SeqRecord(nuc3, id="rna3")
>>> nuc3.translate() == aa3
True
As above, we use the CodonAligner
to align the nucleotide sequence
to the amino acid sequence:
>>> alignments3 = aligner.align(pro3, rna3)
>>> len(alignments3)
1
>>> alignment3 = next(alignments3)
>>> print(alignment3)
pro3 0 M G T A L L L L L A A L C A A G G A L E
rna3 0 ATGGGAACCGCGCTGCTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG
pro3 20
rna3 60
The three amino acid sequences can be aligned to each other, for example using ClustalW. Here, we create the alignment by hand:
>>> import numpy as np
>>> from Bio.Align import Alignment
>>> sequences = [pro1, pro2, pro3]
>>> protein_alignment = Alignment(
... sequences, coordinates=np.array([[0, 4, 7, 23], [0, 4, 7, 23], [0, 4, 4, 20]])
... )
>>> print(protein_alignment)
pro1 0 SGTARTKLLLLLAALCAAGGALE 23
pro2 0 SGTSRTKRLLLLAALGAAGGALE 23
pro3 0 MGTA---LLLLLAALCAAGGALE 20
Now we can use the mapall
method on the protein alignment, with the
nucleotide-to-protein pairwise alignments as the argument, to obtain the
corresponding codon alignment:
>>> codon_alignment = protein_alignment.mapall([alignment1, alignment2, alignment3])
>>> print(codon_alignment)
rna1 0 TCAGGGACTGCGAGAACCAAGCTA 24
rna2 0 TCAGGGACTTCGAGAACCAAGCGC 24
rna3 0 ATGGGAACCGCG---------CTG 15
rna1 24 CTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG 69
rna2 23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68
rna3 15 CTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG 60
Analyzing a codon alignment
Calculating the number of nonsynonymous and synonymous substitutions per site
The most important application of a codon alignment is to estimate the
number of nonsynonymous substitutions per site (dN) and synonymous
substitutions per site (dS). These can be calculated by the
calculate_dn_ds
function in Bio.Align.analysis
. This function
takes a pairwise codon alignment and input, as well as the optional
arguments method
specifying the calculation method, codon_table
(defaulting to the Standard Code), the ratio k
of the transition and
transversion rates, and cfreq
to specify the equilibrium codon
frequency. Biopython currently supports three counting based methods
(NG86
, LWL85
, YN00
) as well as the maximum likelihood method
(ML
) to estimate dN and dS:
NG86
: Nei and Gojobori (1986) [Nei1986] (default). With this method, you can also specify the ratio of the transition and transversion rates via the argumentk
, defaulting to1.0
.LWL85
: Li et al. (1985) [Li1985].YN00
: Yang and Nielsen (2000) [Yang2000].ML
: Goldman and Yang (1994) [Goldman1994]. With this method, you can also specify the equilibrium codon frequency via thecfreq
argument, with the following options:F1x4
: count the nucleotide frequency in the provided codon sequences, and use it to calculate the background codon frequency;F3x4
: (default) count the nucleotide frequency separately for the first, second, and third position in the provided codons, and use it to calculate the background codon frequency;F61
: count the frequency of codons from the provided codon sequences, with a pseudocount of 0.1.
The calculate_dN_dS
method can be applied to a pairwise codon
alignment. In general, the different calculation methods will result in
slightly different estimates for dN and dS:
>>> from Bio.Align import analysis
>>> pairwise_codon_alignment = codon_alignment[:2]
>>> print(pairwise_codon_alignment)
rna1 0 TCAGGGACTGCGAGAACCAAGCTA 24
0 |||||||||.||||||||||||..
rna2 0 TCAGGGACTTCGAGAACCAAGCGC 24
rna1 24 CTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG 69
24 ||.||||||||||||||||||.|||||||||||||.||.|||||| 69
rna2 23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="NG86")
>>> print(dN, dS)
0.067715... 0.201197...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="LWL85")
>>> print(dN, dS)
0.068728... 0.207551...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="YN00")
>>> print(dN, dS)
0.081468... 0.127706...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="ML")
>>> print(dN, dS)
0.069475... 0.205754...
For a multiple alignment of codon sequences, you can calculate a matrix of dN and dS values:
>>> dN, dS = analysis.calculate_dn_ds_matrix(codon_alignment, method="NG86")
>>> print(dN)
rna1 0.000000
rna2 0.067715 0.000000
rna3 0.060204 0.145469 0.000000
rna1 rna2 rna3
>>> print(dS)
rna1 0.000000
rna2 0.201198 0.000000
rna3 0.664268 0.798957 0.000000
rna1 rna2 rna3
The objects dN
and dS
returned by calculate_dn_ds_matrix
are
instances of the DistanceMatrix
class in
Bio.Phylo.TreeConstruction
. This function only takes codon_table
as an optional argument.
From these two sequences, you can create a dN tree and a dS tree using
Bio.Phylo.TreeConstruction
:
>>> from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
>>> dn_constructor = DistanceTreeConstructor()
>>> ds_constructor = DistanceTreeConstructor()
>>> dn_tree = dn_constructor.upgma(dN)
>>> ds_tree = ds_constructor.upgma(dS)
>>> print(type(dn_tree))
<class 'Bio.Phylo.BaseTree.Tree'>
>>> print(dn_tree)
Tree(rooted=True)
Clade(branch_length=0, name='Inner2')
Clade(branch_length=0.053296..., name='rna2')
Clade(branch_length=0.023194..., name='Inner1')
Clade(branch_length=0.0301021..., name='rna3')
Clade(branch_length=0.0301021..., name='rna1')
>>> print(ds_tree)
Tree(rooted=True)
Clade(branch_length=0, name='Inner2')
Clade(branch_length=0.365806..., name='rna3')
Clade(branch_length=0.265207..., name='Inner1')
Clade(branch_length=0.100598..., name='rna2')
Clade(branch_length=0.100598..., name='rna1')
Performing the McDonald-Kreitman test
The McDonald-Kreitman test assesses the amount of adaptive evolution by
comparing the within species synonymous substitutions and nonsynonymous
substitutions to the between species synonymous substitutions and
nonsynonymous substitutions to see if they are from the same
evolutionary process. The test requires gene sequences sampled from
different individuals of the same species. In the following example, we
will use Adh gene from fruit fly. The data includes 11 individuals from
Drosophila melanogaster, 4 individuals from Drosophila simulans, and
12 individuals from Drosophila yakuba. The protein alignment data and
the nucleotide sequences are available in the Tests/codonalign
directory as the files adh.aln
and drosophila.fasta
,
respectively, in the Biopython distribution. The function mktest
in
Bio.Align.analysis
implements the Mcdonald-Kreitman test.
>>> from Bio import SeqIO
>>> from Bio import Align
>>> from Bio.Align import CodonAligner
>>> from Bio.Align.analysis import mktest
>>> aligner = CodonAligner()
>>> nucleotide_records = SeqIO.index("drosophila.fasta", "fasta")
>>> for nucleotide_record in nucleotide_records.values():
... print(nucleotide_record.description)
...
gi|9097|emb|X57361.1| Drosophila simulans (individual c) ...
gi|9099|emb|X57362.1| Drosophila simulans (individual d) ...
gi|9101|emb|X57363.1| Drosophila simulans (individual e) ...
gi|9103|emb|X57364.1| Drosophila simulans (individual f) ...
gi|9217|emb|X57365.1| Drosophila yakuba (individual a) ...
gi|9219|emb|X57366.1| Drosophila yakuba (individual b) ...
gi|9221|emb|X57367.1| Drosophila yakuba (individual c) ...
gi|9223|emb|X57368.1| Drosophila yakuba (individual d) ...
gi|9225|emb|X57369.1| Drosophila yakuba (individual e) ...
gi|9227|emb|X57370.1| Drosophila yakuba (individual f) ...
gi|9229|emb|X57371.1| Drosophila yakuba (individual g) ...
gi|9231|emb|X57372.1| Drosophila yakuba (individual h) ...
gi|9233|emb|X57373.1| Drosophila yakuba (individual i) ...
gi|9235|emb|X57374.1| Drosophila yakuba (individual j) ...
gi|9237|emb|X57375.1| Drosophila yakuba (individual k) ...
gi|9239|emb|X57376.1| Drosophila yakuba (individual l) ...
gi|156879|gb|M17837.1|DROADHCK D.melanogaster (strain Ja-F) ...
gi|156863|gb|M19547.1|DROADHCC D.melanogaster (strain Af-S) ...
gi|156877|gb|M17836.1|DROADHCJ D.melanogaster (strain Af-F) ...
gi|156875|gb|M17835.1|DROADHCI D.melanogaster (strain Wa-F) ...
gi|156873|gb|M17834.1|DROADHCH D.melanogaster (strain Fr-F) ...
gi|156871|gb|M17833.1|DROADHCG D.melanogaster (strain Fl-F) ...
gi|156869|gb|M17832.1|DROADHCF D.melanogaster (strain Ja-S) ...
gi|156867|gb|M17831.1|DROADHCE D.melanogaster (strain Fl-2S) ...
gi|156865|gb|M17830.1|DROADHCD D.melanogaster (strain Fr-S) ...
gi|156861|gb|M17828.1|DROADHCB D.melanogaster (strain Fl-1S) ...
gi|156859|gb|M17827.1|DROADHCA D.melanogaster (strain Wa-S) ...
>>> protein_alignment = Align.read("adh.aln", "clustal")
>>> len(protein_alignment)
27
>>> print(protein_alignment)
gi|9217|e 0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
gi|9219|e 0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
gi|9221|e 0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
...
gi|156859 0 MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENPAAIAELKAINPKVTVT
...
gi|9217|e 240 GTLEAIQWSKHWDSGI 256
gi|9219|e 240 GTLEAIQWSKHWDSGI 256
gi|9221|e 240 GTLEAIQWSKHWDSGI 256
...
gi|156859 240 GTLEAIQWTKHWDSGI 256
>>> codon_alignments = []
>>> for protein_record in protein_alignment.sequences:
... nucleotide_record = nucleotide_records[protein_record.id]
... alignments = aligner.align(protein_record, nucleotide_record)
... assert len(alignments) == 1
... codon_alignment = next(alignments)
... codon_alignments.append(codon_alignment)
...
>>> print(codon_alignment)
gi|156859 0 M S F T L T N K N V I F V A G L G G I G
gi|156859 0 ATGTCGTTTACTTTGACCAACAAGAACGTGATTTTCGTTGCCGGTCTGGGAGGCATTGGT
gi|156859 20 L D T S K E L L K R D L K N L V I L D R
gi|156859 60 CTGGACACCAGCAAGGAGCTGCTCAAGCGCGATCTGAAGAACCTGGTGATCCTCGACCGC
...
gi|156859 240 G T L E A I Q W T K H W D S G I 256
gi|156859 720 GGCACCCTGGAGGCCATCCAGTGGACCAAGCACTGGGACTCCGGCATC 768
>>> nucleotide_records.close() # Close indexed FASTA file
>>> alignment = protein_alignment.mapall(codon_alignments)
>>> print(alignment)
gi|9217|e 0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
gi|9219|e 0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
gi|9221|e 0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
...
gi|156859 0 ATGTCGTTTACTTTGACCAACAAGAACGTGATTTTCGTTGCCGGTCTGGGAGGCATTGGT
...
gi|9217|e 720 GGCACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
gi|9219|e 720 GGCACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
gi|9221|e 720 GGTACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
...
gi|156859 720 GGCACCCTGGAGGCCATCCAGTGGACCAAGCACTGGGACTCCGGCATC 768
>>> unique_species = ["Drosophila simulans", "Drosophila yakuba", "D.melanogaster"]
>>> species = []
>>> for record in alignment.sequences:
... description = record.description
... for s in unique_species:
... if s in description:
... break
... else:
... raise Exception(f"Failed to find species for {description}")
... species.append(s)
...
>>> print(species)
['Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila simulans', 'Drosophila simulans', 'Drosophila simulans', 'Drosophila simulans', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster']
>>> pvalue = mktest(alignment, species)
>>> print(pvalue)
0.00206457...
In addition to the multiple codon alignment, the function mktest
takes as input the species to which each sequence in the alignment
belongs to. The codon table can be provided as an optional argument
codon_table
.