Pairwise sequence alignment

Pairwise sequence alignment is the process of aligning two sequences to each other by optimizing the similarity score between them. The Bio.Align module contains the PairwiseAligner class for global and local alignments using the Needleman-Wunsch, Smith-Waterman, Gotoh (three-state), and Waterman-Smith-Beyer global and local pairwise alignment algorithms, and the Fast Optimal Global Alignment Algorithm (FOGSAA), with numerous options to change the alignment parameters. We refer to Durbin et al. [Durbin1998] for in-depth information on sequence alignment algorithms.

Basic usage

To generate pairwise alignments, first create a PairwiseAligner object:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()

The PairwiseAligner object aligner (see Section The pairwise aligner object) stores the alignment parameters to be used for the pairwise alignments. These attributes can be set in the constructor of the object:

>>> aligner = Align.PairwiseAligner(match_score=1.0)

or after the object is made:

>>> aligner.match_score = 1.0

Use the aligner.score method to calculate the alignment score between two sequences:

>>> target = "GAACT"
>>> query = "GAT"
>>> score = aligner.score(target, query)
>>> score
3.0

The aligner.align method returns Alignment objects, each representing one alignment between the two sequences:

>>> alignments = aligner.align(target, query)
>>> alignment = alignments[0]
>>> alignment  
<Alignment object (2 rows x 5 columns) at ...>

Iterate over the Alignment objects and print them to see the alignments:

>>> for alignment in alignments:
...     print(alignment)
...
target            0 GAACT 5
                  0 ||--| 5
query             0 GA--T 3

target            0 GAACT 5
                  0 |-|-| 5
query             0 G-A-T 3

Each alignment stores the alignment score:

>>> alignment.score
3.0

as well as pointers to the sequences that were aligned:

>>> alignment.target
'GAACT'
>>> alignment.query
'GAT'

Internally, the alignment is stored in terms of the sequence coordinates:

>>> alignment = alignments[0]
>>> alignment.coordinates
array([[0, 2, 4, 5],
       [0, 2, 2, 3]])

Here, the two rows refer to the target and query sequence. These coordinates show that the alignment consists of the following three blocks:

  • target[0:2] aligned to query[0:2];

  • target[2:4] aligned to a gap, since query[2:2] is an empty string;

  • target[4:5] aligned to query[2:3].

The number of aligned sequences is always 2 for a pairwise alignment:

>>> len(alignment)
2

The alignment length is defined as the number of columns in the alignment as printed. This is equal to the sum of the number of matches, number of mismatches, and the total length of gaps in the target and query:

>>> alignment.length
5

The aligned property, which returns the start and end indices of aligned subsequences, returns two tuples of length 2 for the first alignment:

>>> alignment.aligned
array([[[0, 2],
        [4, 5]],

       [[0, 2],
        [2, 3]]])

while for the alternative alignment, two tuples of length 3 are returned:

>>> alignment = alignments[1]
>>> print(alignment)
target            0 GAACT 5
                  0 |-|-| 5
query             0 G-A-T 3

>>> alignment.aligned
array([[[0, 1],
        [2, 3],
        [4, 5]],

       [[0, 1],
        [1, 2],
        [2, 3]]])

Note that different alignments may have the same subsequences aligned to each other. In particular, this may occur if alignments differ from each other in terms of their gap placement only:

>>> aligner.mode = "global"
>>> aligner.mismatch_score = -10
>>> alignments = aligner.align("AAACAAA", "AAAGAAA")
>>> len(alignments)
2
>>> print(alignments[0])
target            0 AAAC-AAA 7
                  0 |||--||| 8
query             0 AAA-GAAA 7

>>> alignments[0].aligned
array([[[0, 3],
        [4, 7]],

       [[0, 3],
        [4, 7]]])
>>> print(alignments[1])
target            0 AAA-CAAA 7
                  0 |||--||| 8
query             0 AAAG-AAA 7

>>> alignments[1].aligned
array([[[0, 3],
        [4, 7]],

       [[0, 3],
        [4, 7]]])

The map method can be applied on a pairwise alignment alignment1 to find the pairwise alignment of the query of alignment2 to the target of alignment1, where the target of alignment2 and the query of alignment1 are identical. A typical example is where alignment1 is the pairwise alignment between a chromosome and a transcript, alignment2 is the pairwise alignment between the transcript and a sequence (e.g., an RNA-seq read), and we want to find the alignment of the sequence to the chromosome:

>>> aligner.mode = "local"
>>> aligner.open_gap_score = -1
>>> aligner.extend_gap_score = 0
>>> chromosome = "AAAAAAAACCCCCCCAAAAAAAAAAAGGGGGGAAAAAAAA"
>>> transcript = "CCCCCCCGGGGGG"
>>> alignments1 = aligner.align(chromosome, transcript)
>>> len(alignments1)
1
>>> alignment1 = alignments1[0]
>>> print(alignment1)
target            8 CCCCCCCAAAAAAAAAAAGGGGGG 32
                  0 |||||||-----------|||||| 24
query             0 CCCCCCC-----------GGGGGG 13

>>> sequence = "CCCCGGGG"
>>> alignments2 = aligner.align(transcript, sequence)
>>> len(alignments2)
1
>>> alignment2 = alignments2[0]
>>> print(alignment2)
target            3 CCCCGGGG 11
                  0 ||||||||  8
query             0 CCCCGGGG  8

>>> mapped_alignment = alignment1.map(alignment2)
>>> print(mapped_alignment)
target           11 CCCCAAAAAAAAAAAGGGG 30
                  0 ||||-----------|||| 19
query             0 CCCC-----------GGGG  8

>>> format(mapped_alignment, "psl")
'8\t0\t0\t0\t0\t0\t1\t11\t+\tquery\t8\t0\t8\ttarget\t40\t11\t30\t2\t4,4,\t0,4,\t11,26,\n'

Mapping the alignment does not depend on the sequence contents. If we delete the sequence contents, the same alignment is found in PSL format (though we obviously lose the ability to print the sequence alignment):

>>> from Bio.Seq import Seq
>>> alignment1.target = Seq(None, len(alignment1.target))
>>> alignment1.query = Seq(None, len(alignment1.query))
>>> alignment2.target = Seq(None, len(alignment2.target))
>>> alignment2.query = Seq(None, len(alignment2.query))
>>> mapped_alignment = alignment1.map(alignment2)
>>> format(mapped_alignment, "psl")
'8\t0\t0\t0\t0\t0\t1\t11\t+\tquery\t8\t0\t8\ttarget\t40\t11\t30\t2\t4,4,\t0,4,\t11,26,\n'

By default, a global pairwise alignment is performed, which finds the optimal alignment over the whole length of target and query. Instead, a local alignment will find the subsequence of target and query with the highest alignment score. Local alignments can be generated by setting aligner.mode to "local":

>>> aligner.mode = "local"
>>> target = "AGAACTC"
>>> query = "GAACT"
>>> score = aligner.score(target, query)
>>> score
5.0
>>> alignments = aligner.align(target, query)
>>> for alignment in alignments:
...     print(alignment)
...
target            1 GAACT 6
                  0 ||||| 5
query             0 GAACT 5

Note that there is some ambiguity in the definition of the best local alignments if segments with a score 0 can be added to the alignment. We follow the suggestion by Waterman & Eggert [Waterman1987] and disallow such extensions.

If aligner.mode is set to “fogsaa”, then the Fast Optimal Global Alignment Algorithm [Chakraborty2013] with some modifications is used. This mode calculates a global alignment, but it is not like the regular “global” mode. It is best suited for long alignments between similar sequences. Rather than calculating all possible alignments like other algorithms do, FOGSAA uses a heuristic to detect steps in an alignment that cannot lead to an optimal alignment. This can speed up alignment, however, the heuristic makes assumptions about your match, mismatch, and gap scores. If the match score is less than the mismatch score or any gap score, or if any gap score is greater than the mismatch score, then a warning is raised and the algorithm may return incorrect results. Unlike other modes that may return more than one alignment, FOGSAA always returns only one alignment.

>>> aligner.mode = "fogsaa"
>>> aligner.mismatch_score = -10
>>> alignments = aligner.align("AAACAAA", "AAAGAAA")
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAC-AAA 7
                  0 |||--||| 8
query             0 AAA-GAAA 7

The pairwise aligner object

The PairwiseAligner object stores all alignment parameters to be used for the pairwise alignments. To see an overview of the values for all parameters, use

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(match_score=1.0, mode="local")
>>> print(aligner)
Pairwise sequence aligner with parameters
  wildcard: None
  match_score: 1.000000
  mismatch_score: 0.000000
  target_internal_open_gap_score: 0.000000
  target_internal_extend_gap_score: 0.000000
  target_left_open_gap_score: 0.000000
  target_left_extend_gap_score: 0.000000
  target_right_open_gap_score: 0.000000
  target_right_extend_gap_score: 0.000000
  query_internal_open_gap_score: 0.000000
  query_internal_extend_gap_score: 0.000000
  query_left_open_gap_score: 0.000000
  query_left_extend_gap_score: 0.000000
  query_right_open_gap_score: 0.000000
  query_right_extend_gap_score: 0.000000
  mode: local

See Sections Substitution scores, Affine gap scores, and General gap scores below for the definition of these parameters. The attribute mode (described above in Section Basic usage) can be set equal to "global" or "local" to specify global or local pairwise alignment, respectively.

Depending on the gap scoring parameters (see Sections Affine gap scores and General gap scores) and mode, a PairwiseAligner object automatically chooses the appropriate algorithm to use for pairwise sequence alignment. To verify the selected algorithm, use

>>> aligner.algorithm
'Smith-Waterman'

This attribute is read-only.

A PairwiseAligner object also stores the precision \(\epsilon\) to be used during alignment. The value of \(\epsilon\) is stored in the attribute aligner.epsilon, and by default is equal to \(10^{-6}\):

>>> aligner.epsilon
1e-06

Two scores will be considered equal to each other for the purpose of the alignment if the absolute difference between them is less than \(\epsilon\).

Substitution scores

Substitution scores define the value to be added to the total score when two letters (nucleotides or amino acids) are aligned to each other. The substitution scores to be used by the PairwiseAligner can be specified in two ways:

  • By specifying a match score for identical letters, and a mismatch scores for mismatched letters. Nucleotide sequence alignments are typically based on match and mismatch scores. For example, by default BLAST [Altschul1990] uses a match score of \(+1\) and a mismatch score of \(-2\) for nucleotide alignments by megablast, with a gap penalty of 2.5 (see section Affine gap scores for more information on gap scores). Match and mismatch scores can be specified by setting the match and mismatch attributes of the PairwiseAligner object:

    >>> from Bio import Align
    >>> aligner = Align.PairwiseAligner()
    >>> aligner.match_score
    1.0
    >>> aligner.mismatch_score
    0.0
    >>> score = aligner.score("ACGT", "ACAT")
    >>> print(score)
    3.0
    >>> aligner.match_score = 1.0
    >>> aligner.mismatch_score = -2.0
    >>> aligner.gap_score = -2.5
    >>> score = aligner.score("ACGT", "ACAT")
    >>> print(score)
    1.0
    

    When using match and mismatch scores, you can specify a wildcard character (None by default) for unknown letters. These will get a zero score in alignments, irrespective of the value of the match or mismatch score:

    >>> aligner.wildcard = "?"
    >>> score = aligner.score("ACGT", "AC?T")
    >>> print(score)
    3.0
    
  • Alternatively, you can use the substitution_matrix attribute of the PairwiseAligner object to specify a substitution matrix. This allows you to apply different scores for different pairs of matched and mismatched letters. This is typically used for amino acid sequence alignments. For example, by default BLAST [Altschul1990] uses the BLOSUM62 substitution matrix for protein alignments by blastp. This substitution matrix is available from Biopython:

    >>> from Bio.Align import substitution_matrices
    >>> substitution_matrices.load()  
    ['BENNER22', 'BENNER6', 'BENNER74', 'BLASTN', 'BLASTP', 'BLOSUM45', 'BLOSUM50', 'BLOSUM62', ..., 'TRANS']
    >>> matrix = substitution_matrices.load("BLOSUM62")
    >>> print(matrix)  
    #  Matrix made by matblas from blosum62.iij
    ...
         A    R    N    D    C    Q ...
    A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 ...
    R -1.0  5.0  0.0 -2.0 -3.0  1.0 ...
    N -2.0  0.0  6.0  1.0 -3.0  0.0 ...
    D -2.0 -2.0  1.0  6.0 -3.0  0.0 ...
    C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 ...
    Q -1.0  1.0  0.0  0.0 -3.0  5.0 ...
    ...
    >>> aligner.substitution_matrix = matrix
    >>> score = aligner.score("ACDQ", "ACDQ")
    >>> score
    24.0
    >>> score = aligner.score("ACDQ", "ACNQ")
    >>> score
    19.0
    

    When using a substitution matrix, X is not interpreted as an unknown character. Instead, the score provided by the substitution matrix will be used:

    >>> matrix["D", "X"]
    -1.0
    >>> score = aligner.score("ACDQ", "ACXQ")
    >>> score
    17.0
    

By default, aligner.substitution_matrix is None. The attributes aligner.match_score and aligner.mismatch_score are ignored if aligner.substitution_matrix is not None. Setting aligner.match_score or aligner.mismatch_score to valid values will reset aligner.substitution_matrix to None.

Affine gap scores

Affine gap scores are defined by a score to open a gap, and a score to extend an existing gap:

\(\textrm{gap score} = \textrm{open gap score} + (n-1) \times \textrm{extend gap score}\),

where \(n\) is the length of the gap. Biopython’s pairwise sequence aligner allows fine-grained control over the gap scoring scheme by specifying the following twelve attributes of a PairwiseAligner object:

Opening scores

Extending scores

query_left_open_gap_score

query_left_extend_gap_score

query_internal_open_gap_score

query_internal_extend_gap_score

query_right_open_gap_score

query_right_extend_gap_score

target_left_open_gap_score

target_left_extend_gap_score

target_internal_open_gap_score

target_internal_extend_gap_score

target_right_open_gap_score

target_right_extend_gap_score

These attributes allow for different gap scores for internal gaps and on either end of the sequence, as shown in this example:

target

query

score

A

query left open gap score

C

query left extend gap score

C

query left extend gap score

G

G

match score

G

T

mismatch score

G

query internal open gap score

A

query internal extend gap score

A

query internal extend gap score

T

T

match score

A

A

match score

G

query internal open gap score

C

C

match score

C

target internal open gap score

C

target internal extend gap score

C

C

match score

T

G

mismatch score

C

C

match score

C

target internal open gap score

A

A

match score

T

target right open gap score

A

target right extend gap score

A

target right extend gap score

For convenience, PairwiseAligner objects have additional attributes that refer to a number of these values collectively, as shown (hierarchically) in Table Meta-attributes of the pairwise aligner objects..

Table 1 Meta-attributes of the pairwise aligner objects.

Meta-attribute

Attributes it maps to

gap_score

target_gap_score, query_gap_score

open_gap_score

target_open_gap_score, query_open_gap_score

extend_gap_score

target_extend_gap_score, query_extend_gap_score

internal_gap_score

target_internal_gap_score, query_internal_gap_score

internal_open_gap_score

target_internal_open_gap_score, query_internal_open_gap_score

internal_extend_gap_score

target_internal_extend_gap_score, query_internal_extend_gap_score

end_gap_score

target_end_gap_score, query_end_gap_score

end_open_gap_score

target_end_open_gap_score, query_end_open_gap_score

end_extend_gap_score

target_end_extend_gap_score, query_end_extend_gap_score

left_gap_score

target_left_gap_score, query_left_gap_score

right_gap_score

target_right_gap_score, query_right_gap_score

left_open_gap_score

target_left_open_gap_score, query_left_open_gap_score

left_extend_gap_score

target_left_extend_gap_score, query_left_extend_gap_score

right_open_gap_score

target_right_open_gap_score, query_right_open_gap_score

right_extend_gap_score

target_right_extend_gap_score, query_right_extend_gap_score

target_open_gap_score

target_internal_open_gap_score, target_left_open_gap_score, target_right_open_gap_score

target_extend_gap_score

target_internal_extend_gap_score, target_left_extend_gap_score, target_right_extend_gap_score

target_gap_score

target_open_gap_score, target_extend_gap_score

query_open_gap_score

query_internal_open_gap_score, query_left_open_gap_score, query_right_open_gap_score

query_extend_gap_score

query_internal_extend_gap_score, query_left_extend_gap_score, query_right_extend_gap_score

query_gap_score

query_open_gap_score, query_extend_gap_score

target_internal_gap_score

target_internal_open_gap_score, target_internal_extend_gap_score

target_end_gap_score

target_end_open_gap_score, target_end_extend_gap_score

target_end_open_gap_score

target_left_open_gap_score, target_right_open_gap_score

target_end_extend_gap_score

target_left_extend_gap_score, target_right_extend_gap_score

target_left_gap_score

target_left_open_gap_score, target_left_extend_gap_score

target_right_gap_score

target_right_open_gap_score, target_right_extend_gap_score

query_end_gap_score

query_end_open_gap_score, query_end_extend_gap_score

query_end_open_gap_score

query_left_open_gap_score, query_right_open_gap_score

query_end_extend_gap_score

query_left_extend_gap_score, query_right_extend_gap_score

query_internal_gap_score

query_internal_open_gap_score, query_internal_extend_gap_score

query_left_gap_score

query_left_open_gap_score, query_left_extend_gap_score

query_right_gap_score

query_right_open_gap_score, query_right_extend_gap_score

General gap scores

For even more fine-grained control over the gap scores, you can specify a gap scoring function. For example, the gap scoring function below disallows a gap after two nucleotides in the query sequence:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> def my_gap_score_function(start, length):
...     if start == 2:
...         return -1000
...     else:
...         return -1 * length
...
>>> aligner.query_gap_score = my_gap_score_function
>>> alignments = aligner.align("AACTT", "AATT")
>>> for alignment in alignments:
...     print(alignment)
...
target            0 AACTT 5
                  0 -|.|| 5
query             0 -AATT 4

target            0 AACTT 5
                  0 |-.|| 5
query             0 A-ATT 4

target            0 AACTT 5
                  0 ||.-| 5
query             0 AAT-T 4

target            0 AACTT 5
                  0 ||.|- 5
query             0 AATT- 4

Using a pre-defined substitution matrix and gap scores

By default, a PairwiseAligner object is initialized with a match score of +1.0, a mismatch score of 0.0, and all gap scores equal to 0.0, While this has the benefit of being a simple scoring scheme, in general it does not give the best performance. Instead, you can use the argument scoring to select a predefined scoring scheme when initializing a PairwiseAligner object. Currently, the provided scoring schemes are blastn and megablast, which are suitable for nucleotide alignments, and blastp, which is suitable for protein alignments. Selecting these scoring schemes will initialize the PairwiseAligner object to the default scoring parameters used by BLASTN, MegaBLAST, and BLASTP, respectively.

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(scoring="blastn")
>>> print(aligner)  
Pairwise sequence aligner with parameters
  substitution_matrix: <Array object at ...>
  target_internal_open_gap_score: -7.000000
  target_internal_extend_gap_score: -2.000000
  target_left_open_gap_score: -7.000000
  target_left_extend_gap_score: -2.000000
  target_right_open_gap_score: -7.000000
  target_right_extend_gap_score: -2.000000
  query_internal_open_gap_score: -7.000000
  query_internal_extend_gap_score: -2.000000
  query_left_open_gap_score: -7.000000
  query_left_extend_gap_score: -2.000000
  query_right_open_gap_score: -7.000000
  query_right_extend_gap_score: -2.000000
  mode: global

>>> print(aligner.substitution_matrix[:, :])
     A    T    G    C    S    W    R    Y    K    M    B    V    H    D    N
A  2.0 -3.0 -3.0 -3.0 -3.0 -1.0 -1.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -2.0
T -3.0  2.0 -3.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -2.0
G -3.0 -3.0  2.0 -3.0 -1.0 -3.0 -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -3.0 -1.0 -2.0
C -3.0 -3.0 -3.0  2.0 -1.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -3.0 -2.0
S -3.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
W -1.0 -1.0 -3.0 -3.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
R -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
Y -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
K -3.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -2.0
M -1.0 -3.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
B -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
V -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
H -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
D -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
N -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0

Iterating over alignments

The alignments returned by aligner.align are a kind of immutable iterable objects (similar to range). While they appear similar to a tuple or list of Alignment objects, they are different in the sense that each Alignment object is created dynamically when it is needed. This approach was chosen because the number of alignments can be extremely large, in particular for poor alignments (see Section Examples for an example).

You can perform the following operations on alignments:

  • len(alignments) returns the number of alignments stored. This function returns quickly, even if the number of alignments is huge. If the number of alignments is extremely large (typically, larger than 9,223,372,036,854,775,807, which is the largest integer that can be stored as a long int on 64 bit machines), len(alignments) will raise an OverflowError. A large number of alignments suggests that the alignment quality is low.

    >>> from Bio import Align
    >>> aligner = Align.PairwiseAligner()
    >>> alignments = aligner.align("AAA", "AA")
    >>> len(alignments)
    3
    
  • You can extract a specific alignment by index:

    >>> from Bio import Align
    >>> aligner = Align.PairwiseAligner()
    >>> alignments = aligner.align("AAA", "AA")
    >>> print(alignments[2])
    target            0 AAA 3
                      0 -|| 3
    query             0 -AA 2
    
    >>> print(alignments[0])
    target            0 AAA 3
                      0 ||- 3
    query             0 AA- 2
    
  • You can iterate over alignments, for example as in

    >>> for alignment in alignments:
    ...     print(alignment)
    ...
    

    The alignments iterator can be converted into a list or tuple:

    >>> alignments = list(alignments)
    

    It is wise to check the number of alignments by calling len(alignments) before attempting to call list(alignments) to save all alignments as a list.

  • The alignment score (which has the same value for each alignment in alignments) is stored as an attribute. This allows you to check the alignment score before proceeding to extract individual alignments:

    >>> print(alignments.score)
    2.0
    

Aligning to the reverse strand

By default, the pairwise aligner aligns the forward strand of the query to the forward strand of the target. To calculate the alignment score for query to the reverse strand of target, use strand="-":

>>> from Bio import Align
>>> from Bio.Seq import reverse_complement
>>> target = "AAAACCC"
>>> query = "AACC"
>>> aligner = Align.PairwiseAligner()
>>> aligner.mismatch_score = -1
>>> aligner.internal_gap_score = -1
>>> aligner.score(target, query)  # strand is "+" by default
4.0
>>> aligner.score(target, reverse_complement(query), strand="-")
4.0
>>> aligner.score(target, query, strand="-")
0.0
>>> aligner.score(target, reverse_complement(query))
0.0

The alignments against the reverse strand can be obtained by specifying strand="-" when calling aligner.align:

>>> alignments = aligner.align(target, query)
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAACCC 7
                  0 --||||- 7
query             0 --AACC- 4

>>> print(alignments[0].format("bed"))  
target   2   6   query   4   +   2   6   0   1   4,   0,

>>> alignments = aligner.align(target, reverse_complement(query), strand="-")
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAACCC 7
                  0 --||||- 7
query             4 --AACC- 0

>>> print(alignments[0].format("bed"))  
target   2   6   query   4   -   2   6   0   1   4,   0,

>>> alignments = aligner.align(target, query, strand="-")
>>> len(alignments)
2
>>> print(alignments[0])
target            0 AAAACCC----  7
                  0 ----------- 11
query             4 -------GGTT  0

>>> print(alignments[1])
target            0 ----AAAACCC  7
                  0 ----------- 11
query             4 GGTT-------  0

Note that the score for aligning query to the reverse strand of target may be different from the score for aligning the reverse complement of query to the forward strand of target if the left and right gap scores are different:

>>> aligner.left_gap_score = -0.5
>>> aligner.right_gap_score = -0.2
>>> aligner.score(target, query)
2.8
>>> alignments = aligner.align(target, query)
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAACCC 7
                  0 --||||- 7
query             0 --AACC- 4

>>> aligner.score(target, reverse_complement(query), strand="-")
3.1
>>> alignments = aligner.align(target, reverse_complement(query), strand="-")
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAACCC 7
                  0 --||||- 7
query             4 --AACC- 0

Substitution matrices

Substitution matrices [Durbin1998] provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. Biopython provides a ton of common substitution matrices, including the famous PAM and BLOSUM series of matrices, and also provides functionality for creating your own substitution matrices.

Array objects

You can think of substitutions matrices as 2D arrays in which the indices are letters (nucleotides or amino acids) rather than integers. The Array class in Bio.Align.substitution_matrices is a subclass of numpy arrays that supports indexing both by integers and by specific strings. An Array instance can either be a one-dimensional array or a square two-dimensional arrays. A one-dimensional Array object can for example be used to store the nucleotide frequency of a DNA sequence, while a two-dimensional Array object can be used to represent a scoring matrix for sequence alignments.

To create a one-dimensional Array, only the alphabet of allowed letters needs to be specified:

>>> from Bio.Align.substitution_matrices import Array
>>> counts = Array("ACGT")
>>> print(counts)
A 0.0
C 0.0
G 0.0
T 0.0

The allowed letters are stored in the alphabet property:

>>> counts.alphabet
'ACGT'

This property is read-only; modifying the underlying _alphabet attribute may lead to unexpected results. Elements can be accessed both by letter and by integer index:

>>> counts["C"] = -3
>>> counts[2] = 7
>>> print(counts)
A  0.0
C -3.0
G  7.0
T  0.0

>>> counts[1]
-3.0

Using a letter that is not in the alphabet, or an index that is out of bounds, will cause a IndexError:

>>> counts["U"]
Traceback (most recent call last):
    ...
IndexError: 'U'
>>> counts["X"] = 6
Traceback (most recent call last):
    ...
IndexError: 'X'
>>> counts[7]
Traceback (most recent call last):
    ...
IndexError: index 7 is out of bounds for axis 0 with size 4

A two-dimensional Array can be created by specifying dims=2:

>>> from Bio.Align.substitution_matrices import Array
>>> counts = Array("ACGT", dims=2)
>>> print(counts)
    A   C   G   T
A 0.0 0.0 0.0 0.0
C 0.0 0.0 0.0 0.0
G 0.0 0.0 0.0 0.0
T 0.0 0.0 0.0 0.0

Again, both letters and integers can be used for indexing, and specifying a letter that is not in the alphabet will cause an IndexError:

>>> counts["A", "C"] = 12.0
>>> counts[2, 1] = 5.0
>>> counts[3, "T"] = -2
>>> print(counts)
    A    C   G    T
A 0.0 12.0 0.0  0.0
C 0.0  0.0 0.0  0.0
G 0.0  5.0 0.0  0.0
T 0.0  0.0 0.0 -2.0

>>> counts["X", 1]
Traceback (most recent call last):
    ...
IndexError: 'X'
>>> counts["A", 5]
Traceback (most recent call last):
    ...
IndexError: index 5 is out of bounds for axis 1 with size 4

Selecting a row or column from the two-dimensional array will return a one-dimensional Array:

>>> counts = Array("ACGT", dims=2)
>>> counts["A", "C"] = 12.0
>>> counts[2, 1] = 5.0
>>> counts[3, "T"] = -2
>>> counts["G"]
Array([0., 5., 0., 0.],
      alphabet='ACGT')
>>> counts[:, "C"]
Array([12.,  0.,  5.,  0.],
      alphabet='ACGT')

Array objects can thus be used as an array and as a dictionary. They can be converted to plain numpy arrays or plain dictionary objects:

>>> import numpy as np
>>> x = Array("ACGT")
>>> x["C"] = 5
>>> x
Array([0., 5., 0., 0.],
      alphabet='ACGT')
>>> a = np.array(x)  # create a plain numpy array
>>> a
array([0., 5., 0., 0.])
>>> d = dict(x)  # create a plain dictionary
>>> d
{'A': 0.0, 'C': 5.0, 'G': 0.0, 'T': 0.0}

While the alphabet of an Array is usually a string, you may also use a tuple of (immutable) objects. This is used for example for a codon substitution matrix (as in the substitution_matrices.load("SCHNEIDER") example shown later), where the keys are not individual nucleotides or amino acids but instead three-nucleotide codons.

While the alphabet property of an Array is immutable, you can create a new Array object by selecting the letters you are interested in from the alphabet. For example,

>>> a = Array("ABCD", dims=2, data=np.arange(16).reshape(4, 4))
>>> print(a)
     A    B    C    D
A  0.0  1.0  2.0  3.0
B  4.0  5.0  6.0  7.0
C  8.0  9.0 10.0 11.0
D 12.0 13.0 14.0 15.0

>>> b = a.select("CAD")
>>> print(b)
     C    A    D
C 10.0  8.0 11.0
A  2.0  0.0  3.0
D 14.0 12.0 15.0

Note that this also allows you to reorder the alphabet.

Data for letters that are not found in the alphabet are set to zero:

>>> c = a.select("DEC")
>>> print(c)
     D   E    C
D 15.0 0.0 14.0
E  0.0 0.0  0.0
C 11.0 0.0 10.0

As the Array class is a subclass of numpy array, it can be used as such. A ValueError is triggered if the Array objects appearing in a mathematical operation have different alphabets, for example

>>> from Bio.Align.substitution_matrices import Array
>>> d = Array("ACGT")
>>> r = Array("ACGU")
>>> d + r
Traceback (most recent call last):
    ...
ValueError: alphabets are inconsistent

Calculating a substitution matrix from a pairwise sequence alignment

As Array is a subclass of a numpy array, you can apply mathematical operations on an Array object in much the same way. Here, we illustrate this by calculating a scoring matrix from the alignment of the 16S ribosomal RNA gene sequences of Escherichia coli and Bacillus subtilis. First, we create a PairwiseAligner object (see Chapter Pairwise sequence alignment) and initialize it with the default scores used by blastn:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner(scoring="blastn")
>>> aligner.mode = "local"

Next, we read in the 16S ribosomal RNA gene sequence of Escherichia coli and Bacillus subtilis (provided in Tests/Align/ecoli.fa and Tests/Align/bsubtilis.fa), and align them to each other:

>>> from Bio import SeqIO
>>> sequence1 = SeqIO.read("ecoli.fa", "fasta")
>>> sequence2 = SeqIO.read("bsubtilis.fa", "fasta")
>>> alignments = aligner.align(sequence1, sequence2)

The number of alignments generated is very large:

>>> len(alignments)
1990656

However, as they only differ trivially from each other, we arbitrarily choose the first alignment, and count the number of each substitution:

>>> alignment = alignments[0]
>>> substitutions = alignment.substitutions
>>> print(substitutions)
      A     C     G     T
A 307.0  19.0  34.0  19.0
C  15.0 280.0  25.0  29.0
G  34.0  24.0 401.0  20.0
T  24.0  36.0  20.0 228.0

We normalize against the total number to find the probability of each substitution, and create a symmetric matrix of observed frequencies:

>>> observed_frequencies = substitutions / substitutions.sum()
>>> observed_frequencies = (observed_frequencies + observed_frequencies.transpose()) / 2.0
>>> print(format(observed_frequencies, "%.4f"))
       A      C      G      T
A 0.2026 0.0112 0.0224 0.0142
C 0.0112 0.1848 0.0162 0.0215
G 0.0224 0.0162 0.2647 0.0132
T 0.0142 0.0215 0.0132 0.1505

The background probability is the probability of finding an A, C, G, or T nucleotide in each sequence separately. This can be calculated as the sum of each row or column:

>>> background = observed_frequencies.sum(0)
>>> print(format(background, "%.4f"))
A 0.2505
C 0.2337
G 0.3165
T 0.1993

The number of substitutions expected at random is simply the product of the background distribution with itself:

>>> expected_frequencies = background[:, None].dot(background[None, :])
>>> print(format(expected_frequencies, "%.4f"))
       A      C      G      T
A 0.0627 0.0585 0.0793 0.0499
C 0.0585 0.0546 0.0740 0.0466
G 0.0793 0.0740 0.1002 0.0631
T 0.0499 0.0466 0.0631 0.0397

The scoring matrix can then be calculated as the logarithm of the odds-ratio of the observed and the expected probabilities:

>>> oddsratios = observed_frequencies / expected_frequencies
>>> import numpy as np
>>> scoring_matrix = np.log2(oddsratios)
>>> print(scoring_matrix)
     A    C    G    T
A  1.7 -2.4 -1.8 -1.8
C -2.4  1.8 -2.2 -1.1
G -1.8 -2.2  1.4 -2.3
T -1.8 -1.1 -2.3  1.9

The matrix can be used to set the substitution matrix for the pairwise aligner (see Chapter Pairwise sequence alignment):

>>> aligner.substitution_matrix = scoring_matrix

Calculating a substitution matrix from a multiple sequence alignment

In this example, we’ll first read a protein sequence alignment from the Clustalw file protein.aln (also available online here)

>>> from Bio import Align
>>> filename = "protein.aln"
>>> alignment = Align.read(filename, "clustal")

Section ClustalW contains more information on doing this.

The substitutions property of the alignment stores the number of times different residues substitute for each other:

>>> substitutions = alignment.substitutions

To make the example more readable, we’ll select only amino acids with polar charged side chains:

>>> substitutions = substitutions.select("DEHKR")
>>> print(substitutions)
       D      E      H      K      R
D 2360.0  270.0   15.0    1.0   48.0
E  241.0 3305.0   15.0   45.0    2.0
H    0.0   18.0 1235.0    8.0    0.0
K    0.0    9.0   24.0 3218.0  130.0
R    2.0    2.0   17.0  103.0 2079.0

Rows and columns for other amino acids were removed from the matrix.

Next, we normalize the matrix and make it symmetric.

>>> observed_frequencies = substitutions / substitutions.sum()
>>> observed_frequencies = (observed_frequencies + observed_frequencies.transpose()) / 2.0
>>> print(format(observed_frequencies, "%.4f"))
       D      E      H      K      R
D 0.1795 0.0194 0.0006 0.0000 0.0019
E 0.0194 0.2514 0.0013 0.0021 0.0002
H 0.0006 0.0013 0.0939 0.0012 0.0006
K 0.0000 0.0021 0.0012 0.2448 0.0089
R 0.0019 0.0002 0.0006 0.0089 0.1581

Summing over rows or columns gives the relative frequency of occurrence of each residue:

>>> background = observed_frequencies.sum(0)
>>> print(format(background, "%.4f"))
D 0.2015
E 0.2743
H 0.0976
K 0.2569
R 0.1697

>>> sum(background) == 1.0
True

The expected frequency of residue pairs is then

>>> expected_frequencies = background[:, None].dot(background[None, :])
>>> print(format(expected_frequencies, "%.4f"))
       D      E      H      K      R
D 0.0406 0.0553 0.0197 0.0518 0.0342
E 0.0553 0.0752 0.0268 0.0705 0.0465
H 0.0197 0.0268 0.0095 0.0251 0.0166
K 0.0518 0.0705 0.0251 0.0660 0.0436
R 0.0342 0.0465 0.0166 0.0436 0.0288

Here, background[:, None] creates a 2D array consisting of a single column with the values of expected_frequencies, and rxpected_frequencies[None, :] a 2D array with these values as a single row. Taking their dot product (inner product) creates a matrix of expected frequencies where each entry consists of two expected_frequencies values multiplied with each other. For example, expected_frequencies['D', 'E'] is equal to residue_frequencies['D'] * residue_frequencies['E'].

We can now calculate the log-odds matrix by dividing the observed frequencies by the expected frequencies and taking the logarithm:

>>> import numpy as np
>>> scoring_matrix = np.log2(observed_frequencies / expected_frequencies)
>>> print(scoring_matrix)
      D    E    H     K    R
D   2.1 -1.5 -5.1 -10.4 -4.2
E  -1.5  1.7 -4.4  -5.1 -8.3
H  -5.1 -4.4  3.3  -4.4 -4.7
K -10.4 -5.1 -4.4   1.9 -2.3
R  -4.2 -8.3 -4.7  -2.3  2.5

This matrix can be used as the substitution matrix when performing alignments. For example,

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.substitution_matrix = scoring_matrix
>>> aligner.gap_score = -3.0
>>> alignments = aligner.align("DEHEK", "DHHKK")
>>> print(alignments[0])
target            0 DEHEK 5
                  0 |.|.| 5
query             0 DHHKK 5

>>> print("%.2f" % alignments.score)
-2.18
>>> score = (
...     scoring_matrix["D", "D"]
...     + scoring_matrix["E", "H"]
...     + scoring_matrix["H", "H"]
...     + scoring_matrix["E", "K"]
...     + scoring_matrix["K", "K"]
... )
>>> print("%.2f" % score)
-2.18

(see Chapter Pairwise sequence alignment for details).

Reading Array objects from file

Bio.Align.substitution_matrices includes a parser to read one- and two-dimensional Array objects from file. One-dimensional arrays are represented by a simple two-column format, with the first column containing the key and the second column the corresponding value. For example, the file hg38.chrom.sizes (obtained from UCSC), available in the Tests/Align subdirectory of the Biopython distribution, contains the size in nucleotides of each chromosome in human genome assembly hg38:

chr1    248956422
chr2    242193529
chr3    198295559
chr4    190214555
...
chrUn_KI270385v1    990
chrUn_KI270423v1    981
chrUn_KI270392v1    971
chrUn_KI270394v1    970

To parse this file, use

>>> from Bio.Align import substitution_matrices
>>> with open("hg38.chrom.sizes") as handle:
...     table = substitution_matrices.read(handle)
...
>>> print(table)  
chr1 248956422.0
chr2 242193529.0
chr3 198295559.0
chr4 190214555.0
...
chrUn_KI270423v1       981.0
chrUn_KI270392v1       971.0
chrUn_KI270394v1       970.0

Use dtype=int to read the values as integers:

>>> with open("hg38.chrom.sizes") as handle:
...     table = substitution_matrices.read(handle, int)
...
>>> print(table)  
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
...
chrUn_KI270423v1       981
chrUn_KI270392v1       971
chrUn_KI270394v1       970

For two-dimensional arrays, we follow the file format of substitution matrices provided by NCBI. For example, the BLOSUM62 matrix, which is the default substitution matrix for NCBI’s protein-protein BLAST [Altschul1990] program blastp, is stored as follows:

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
...

This file is included in the Biopython distribution under Bio/Align/substitution_matrices/data. To parse this file, use

>>> from Bio.Align import substitution_matrices
>>> with open("BLOSUM62") as handle:
...     matrix = substitution_matrices.read(handle)
...
>>> print(matrix.alphabet)
ARNDCQEGHILKMFPSTWYVBZX*
>>> print(matrix["A", "D"])
-2.0

The header lines starting with # are stored in the attribute header:

>>> matrix.header[0]
'Matrix made by matblas from blosum62.iij'

We can now use this matrix as the substitution matrix on an aligner object:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.substitution_matrix = matrix

To save an Array object, create a string first:

>>> text = str(matrix)
>>> print(text)  
#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S ...
A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 -1.0  0.0 -2.0 -1.0 -1.0 -1.0 -1.0 -2.0 -1.0  1.0 ...
R -1.0  5.0  0.0 -2.0 -3.0  1.0  0.0 -2.0  0.0 -3.0 -2.0  2.0 -1.0 -3.0 -2.0 -1.0 ...
N -2.0  0.0  6.0  1.0 -3.0  0.0  0.0  0.0  1.0 -3.0 -3.0  0.0 -2.0 -3.0 -2.0  1.0 ...
D -2.0 -2.0  1.0  6.0 -3.0  0.0  2.0 -1.0 -1.0 -3.0 -4.0 -1.0 -3.0 -3.0 -1.0  0.0 ...
C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 -4.0 -3.0 -3.0 -1.0 -1.0 -3.0 -1.0 -2.0 -3.0 -1.0 ...
...

and write the text to a file.

Loading predefined substitution matrices

Biopython contains a large set of substitution matrices defined in the literature, including BLOSUM (Blocks Substitution Matrix) [Henikoff1992] and PAM (Point Accepted Mutation) matrices [Dayhoff1978]. These matrices are available as flat files in the Bio/Align/substitution_matrices/data directory, and can be loaded into Python using the load function in the substitution_matrices submodule. For example, the BLOSUM62 matrix can be loaded by running

>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("BLOSUM62")

This substitution matrix has an alphabet consisting of the 20 amino acids used in the genetic code, the three ambiguous amino acids B (asparagine or aspartic acid), Z (glutamine or glutamic acid), and X (representing any amino acid), and the stop codon represented by an asterisk:

>>> m.alphabet
'ARNDCQEGHILKMFPSTWYVBZX*'

To get a full list of available substitution matrices, use load without an argument:

>>> substitution_matrices.load()  
['BENNER22', 'BENNER6', 'BENNER74', 'BLASTN', 'BLASTP', 'BLOSUM45', 'BLOSUM50', ..., 'TRANS']

Note that the substitution matrix provided by Schneider et al. [Schneider2005] uses an alphabet consisting of three-nucleotide codons:

>>> m = substitution_matrices.load("SCHNEIDER")
>>> m.alphabet  
('AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', ..., 'TTG', 'TTT')

Examples

Suppose you want to do a global pairwise alignment between the same two hemoglobin sequences from above (HBA_HUMAN, HBB_HUMAN) stored in alpha.faa and beta.faa:

>>> from Bio import Align
>>> from Bio import SeqIO
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> aligner = Align.PairwiseAligner()
>>> score = aligner.score(seq1.seq, seq2.seq)
>>> print(score)
72.0

showing an alignment score of 72.0. To see the individual alignments, do

>>> alignments = aligner.align(seq1.seq, seq2.seq)

In this example, the total number of optimal alignments is huge (more than \(4 \times 10^{37}\)), and calling len(alignments) will raise an OverflowError:

>>> len(alignments)
Traceback (most recent call last):
...
OverflowError: number of optimal alignments is larger than 9223372036854775807

Let’s have a look at the first alignment:

>>> alignment = alignments[0]

The alignment object stores the alignment score, as well as the alignment itself:

>>> print(alignment.score)
72.0
>>> print(alignment)
target            0 MV-LS-PAD--KTN--VK-AA-WGKV-----GAHAGEYGAEALE-RMFLSF----P-TTK
                  0 ||-|--|----|----|--|--||||-----|---||--|--|--|--|------|-|--
query             0 MVHL-TP--EEK--SAV-TA-LWGKVNVDEVG---GE--A--L-GR--L--LVVYPWT--

target           41 TY--FPHF----DLSHGS---AQVK-G------HGKKV--A--DA-LTNAVAHV-DDMPN
                 60 ----|--|----|||------|-|--|------|||||--|--|--|--|--|--|---|
query            39 --QRF--FESFGDLS---TPDA-V-MGNPKVKAHGKKVLGAFSD-GL--A--H-LD---N

target           79 ALS----A-LSD-LHAH--KLR-VDPV-NFK-LLSHC---LLVT--LAAHLPA----EFT
                120 -|-----|-||--||----||--|||--||--||------|-|---||-|-------|||
query            81 -L-KGTFATLS-ELH--CDKL-HVDP-ENF-RLL---GNVL-V-CVLA-H---HFGKEFT

target          119 PA-VH-ASLDKFLAS---VSTV------LTS--KYR- 142
                180 |--|--|------|----|--|------|----||-- 217
query           124 P-PV-QA------A-YQKV--VAGVANAL--AHKY-H 147

Better alignments are usually obtained by penalizing gaps: higher costs for opening a gap and lower costs for extending an existing gap. For amino acid sequences match scores are usually encoded in matrices like PAM or BLOSUM. Thus, a more meaningful alignment for our example can be obtained by using the BLOSUM62 matrix, together with a gap open penalty of 10 and a gap extension penalty of 0.5:

>>> from Bio import Align
>>> from Bio import SeqIO
>>> from Bio.Align import substitution_matrices
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> aligner = Align.PairwiseAligner()
>>> aligner.open_gap_score = -10
>>> aligner.extend_gap_score = -0.5
>>> aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
>>> score = aligner.score(seq1.seq, seq2.seq)
>>> print(score)
292.5
>>> alignments = aligner.align(seq1.seq, seq2.seq)
>>> len(alignments)
2
>>> print(alignments[0].score)
292.5
>>> print(alignments[0])
target            0 MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGS
                  0 ||-|.|..|..|.|.||||--...|.|.|||.|.....|.|...|..|-|||-----.|.
query             0 MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN

target           53 AQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
                 60 ..||.|||||..|.....||.|........||.||..||.|||.||.||...|...||.|
query            58 PKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

target          113 LPAEFTPAVHASLDKFLASVSTVLTSKYR 142
                120 ...||||.|.|...|..|.|...|..||. 149
query           118 FGKEFTPPVQAAYQKVVAGVANALAHKYH 147

This alignment has the same score that we obtained earlier with EMBOSS needle using the same sequences and the same parameters.

To perform a local alignment, set aligner.mode to 'local':

>>> aligner.mode = "local"
>>> aligner.open_gap_score = -10
>>> aligner.extend_gap_score = -1
>>> alignments = aligner.align("LSPADKTNVKAA", "PEEKSAV")
>>> print(len(alignments))
1
>>> alignment = alignments[0]
>>> print(alignment)
target            2 PADKTNV 9
                  0 |..|..| 7
query             0 PEEKSAV 7

>>> print(alignment.score)
16.0

Generalized pairwise alignments

In most cases, PairwiseAligner is used to perform alignments of sequences (strings or Seq objects) consisting of single-letter nucleotides or amino acids. More generally, PairwiseAligner can also be applied to lists or tuples of arbitrary objects. This section will describe some examples of such generalized pairwise alignments.

Generalized pairwise alignments using a substitution matrix and alphabet

Schneider et al. [Schneider2005] created a substitution matrix for aligning three-nucleotide codons (see below in section Substitution matrices for more information). This substitution matrix is associated with an alphabet consisting of all three-letter codons:

>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("SCHNEIDER")
>>> m.alphabet  
('AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', ..., 'TTG', 'TTT')

We can use this matrix to align codon sequences to each other:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> aligner.substitution_matrix = m
>>> aligner.gap_score = -1.0
>>> s1 = ("AAT", "CTG", "TTT", "TTT")
>>> s2 = ("AAT", "TTA", "TTT")
>>> alignments = aligner.align(s1, s2)
>>> len(alignments)
2
>>> print(alignments[0])
AAT CTG TTT TTT
||| ... ||| ---
AAT TTA TTT ---

>>> print(alignments[1])
AAT CTG TTT TTT
||| ... --- |||
AAT TTA --- TTT

Note that aligning TTT to TTA, as in this example:

AAT CTG TTT TTT
||| --- ... |||
AAT --- TTA TTT

would get a much lower score:

>>> print(m["CTG", "TTA"])
7.6
>>> print(m["TTT", "TTA"])
-0.3

presumably because CTG and TTA both code for leucine, while TTT codes for phenylalanine. The three-letter codon substitution matrix also reveals a preference among codons representing the same amino acid. For example, TTA has a preference for CTG preferred compared to CTC, though all three code for leucine:

>>> s1 = ("AAT", "CTG", "CTC", "TTT")
>>> s2 = ("AAT", "TTA", "TTT")
>>> alignments = aligner.align(s1, s2)
>>> len(alignments)
1
>>> print(alignments[0])
AAT CTG CTC TTT
||| ... --- |||
AAT TTA --- TTT

>>> print(m["CTC", "TTA"])
6.5

Generalized pairwise alignments using match/mismatch scores and an alphabet

Using the three-letter amino acid symbols, the sequences above translate to

>>> s1 = ("Asn", "Leu", "Leu", "Phe")
>>> s2 = ("Asn", "Leu", "Phe")

We can align these sequences directly to each other by using a three-letter amino acid alphabet:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> aligner.alphabet = ['Ala', 'Arg', 'Asn', 'Asp', 'Cys',
...                     'Gln', 'Glu', 'Gly', 'His', 'Ile',
...                     'Leu', 'Lys', 'Met', 'Phe', 'Pro',
...                     'Ser', 'Thr', 'Trp', 'Tyr', 'Val']  # fmt: skip
...

We use +6/-1 match and mismatch scores as an approximation of the BLOSUM62 matrix, and align these sequences to each other:

>>> aligner.match = +6
>>> aligner.mismatch = -1
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
Asn Leu Leu Phe
||| ||| --- |||
Asn Leu --- Phe

>>> print(alignments[1])
Asn Leu Leu Phe
||| --- ||| |||
Asn --- Leu Phe

>>> print(alignments.score)
18.0

Generalized pairwise alignments using match/mismatch scores and integer sequences

Internally, the first step when performing an alignment is to replace the two sequences by integer arrays consisting of the indices of each letter in each sequence in the alphabet associated with the aligner. This step can be bypassed by passing integer arrays directly:

>>> import numpy as np
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> s1 = np.array([2, 10, 10, 13], np.int32)
>>> s2 = np.array([2, 10, 13], np.int32)
>>> aligner.match = +6
>>> aligner.mismatch = -1
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
2 10 10 13
| || -- ||
2 10 -- 13

>>> print(alignments[1])
2 10 10 13
| -- || ||
2 -- 10 13

>>> print(alignments.score)
18.0

Note that the indices should consist of 32-bit integers, as specified in this example by numpy.int32.

Unknown letters can again be included by defining a wildcard character, and using the corresponding Unicode code point number as the index:

>>> aligner.wildcard = "?"
>>> ord(aligner.wildcard)
63
>>> s2 = np.array([2, 63, 13], np.int32)
>>> aligner.gap_score = -3
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
2 10 10 13
| .. -- ||
2 63 -- 13

>>> print(alignments[1])
2 10 10 13
| -- .. ||
2 -- 63 13

>>> print(alignments.score)
9.0

Generalized pairwise alignments using a substitution matrix and integer sequences

Integer sequences can also be aligned using a substitution matrix, in this case a numpy square array without an alphabet associated with it. In this case, all index values must be non-negative, and smaller than the size of the substitution matrix:

>>> from Bio import Align
>>> import numpy as np
>>> aligner = Align.PairwiseAligner()
>>> m = np.eye(5)
>>> m[0, 1:] = m[1:, 0] = -2
>>> m[2, 2] = 3
>>> print(m)
[[ 1. -2. -2. -2. -2.]
 [-2.  1.  0.  0.  0.]
 [-2.  0.  3.  0.  0.]
 [-2.  0.  0.  1.  0.]
 [-2.  0.  0.  0.  1.]]
>>> aligner.substitution_matrix = m
>>> aligner.gap_score = -1
>>> s1 = np.array([0, 2, 3, 4], np.int32)
>>> s2 = np.array([0, 3, 2, 1], np.int32)
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
0 - 2 3 4
| - | . -
0 3 2 1 -

>>> print(alignments[1])
0 - 2 3 4
| - | - .
0 3 2 - 1

>>> print(alignments.score)
2.0

Codon alignments

The CodonAligner class in the Bio.Align module implements a specialized aligner for aligning a nucleotide sequence to the amino acid sequence it encodes. Such alignments are non-trivial if frameshifts occur during translation.

Aligning a nucleotide sequence to an amino acid sequence

To align a nucleotide sequence to an amino acid sequence, first create a CodonAligner object:

>>> from Bio import Align
>>> aligner = Align.CodonAligner()

The CodonAligner object aligner stores the alignment parameters to be used for the alignments:

>>> print(aligner)
Codon aligner with parameters
  wildcard: 'X'
  match_score: 1.0
  mismatch_score: 0.0
  frameshift_minus_two_score: -3.0
  frameshift_minus_one_score: -3.0
  frameshift_plus_one_score: -3.0
  frameshift_plus_two_score: -3.0

The wildcard, match_score, and mismatch_score parameters are defined in the same was as for the PairwiseAligner class described above (see Section The pairwise aligner object). The values specified by the frameshift_minus_two_score, frameshift_minus_one_score, frameshift_plus_one_score, and frameshift_plus_two_score parameters are added to the alignment score whenever a -2, -1, +1, or +2 frame shift, respectively, occurs in the alignment. By default, the frame shift scores are set to -3.0. Similar to the PairwiseAligner class (Table Meta-attributes of the pairwise aligner objects.), the CodonAligner class defines additional attributes that refer to a number of these values collectively, as shown in Table Meta-attributes of CodonAligner objects..

Table 2 Meta-attributes of CodonAligner objects.

Meta-attribute

Attributes it maps to

frameshift_minus_score

frameshift_minus_two_score, frameshift_minus_one_score

frameshift_plus_score

frameshift_plus_two_score, frameshift_plus_one_score

frameshift_two_score

frameshift_minus_two_score, frameshift_plus_two_score

frameshift_one_score

frameshift_minus_one_score, frameshift_plus_one_score

frameshift_score

frameshift_minus_two_score, frameshift_minus_one_score, frameshift_plus_one_score, frameshift_plus_two_score

Now let’s consider two nucleotide sequences and the amino acid sequences they encode:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> nuc1 = Seq("TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG")
>>> rna1 = SeqRecord(nuc1, id="rna1")
>>> nuc2 = Seq("TCAGGGACTTCGAGAACCAAGCGCTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG")
>>> rna2 = SeqRecord(nuc2, id="rna2")
>>> aa1 = Seq("SGTARTKLLLLLAALCAAGGALE")
>>> aa2 = Seq("SGTSRTKRLLLLAALGAAGGALE")
>>> pro1 = SeqRecord(aa1, id="pro1")
>>> pro2 = SeqRecord(aa2, id="pro2")

While the two protein sequences both consist of 23 amino acids, the first nucleotide sequence consists of \(3 \times 23 = 69\) nucleotides while the second nucleotide sequence tonsists of only 68 nucleotides:

>>> len(pro1)
23
>>> len(pro2)
23
>>> len(rna1)
69
>>> len(rna2)
68

This is due to a -1 frame shift event during translation of the second nucleotide sequence. Use CodonAligner.align to align rna1 to pro1, and rna2 to pro2, returning an iterator of Alignment objects:

>>> alignments1 = aligner.align(pro1, rna1)
>>> len(alignments1)
1
>>> alignment1 = next(alignments1)
>>> print(alignment1)
pro1              0 S  G  T  A  R  T  K  L  L  L  L  L  A  A  L  C  A  A  G  G
rna1              0 TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGG

pro1             20 A  L  E   23
rna1             60 GCGCTGGAG 69

>>> alignment1.coordinates
array([[ 0, 23],
       [ 0, 69]])
>>> alignment1[0]
'SGTARTKLLLLLAALCAAGGALE'
>>> alignment1[1]
'TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG'
>>> alignments2 = aligner.align(pro2, rna2)
>>> len(alignments2)
1
>>> alignment2 = next(alignments2)
>>> print(alignment2)
pro2              0 S  G  T  S  R  T  K  R   8
rna2              0 TCAGGGACTTCGAGAACCAAGCGC 24

pro2              8 L  L  L  L  A  A  L  G  A  A  G  G  A  L  E   23
rna2             23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68

>>> alignment2[0]
'SGTSRTKRLLLLAALGAAGGALE'
>>> alignment2[1]
'TCAGGGACTTCGAGAACCAAGCGCCTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG'
>>> alignment2.coordinates
array([[ 0,  8,  8, 23],
       [ 0, 24, 23, 68]])

While alignment1 is a continuous alignment of the 69 nucleotides to the 23 amino acids, in alignment2 we find a -1 frame shift after 24 nucleotides. As alignment2[1] contains the nucleotide sequence after applying the -1 frame shift, it is one nucleotide longer than nuc2 and can be translated directly, resulting in the amino acid sequence aa2:

>>> from Bio.Seq import translate
>>> len(nuc2)
68
>>> len(alignment2[1])
69
>>> translate(alignment2[1])
'SGTSRTKRLLLLAALGAAGGALE'
>>> _ == aa2
True

The alignment score is stored as an attribute on the alignments1 and alignments2 iterators, and on the individual alignments alignment1 and alignment2:

>>> alignments1.score
23.0
>>> alignment1.score
23.0
>>> alignments2.score
20.0
>>> alignment2.score
20.0

where the score of the rna1-pro1 alignment is equal to the number of aligned amino acids, and the score of the rna2-pro2 alignment is 3 less due to the penalty for the frame shift. To calculate the alignment score without calculating the alignment itself, the score method can be used:

>>> score = aligner.score(pro1, rna1)
>>> print(score)
23.0
>>> score = aligner.score(pro2, rna2)
>>> print(score)
20.0

Generating a multiple sequence alignment of codon sequences

Suppose we have a third related amino acid sequence and its associated nucleotide sequence:

>>> aa3 = Seq("MGTALLLLLAALCAAGGALE")
>>> pro3 = SeqRecord(aa3, id="pro3")
>>> nuc3 = Seq("ATGGGAACCGCGCTGCTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG")
>>> rna3 = SeqRecord(nuc3, id="rna3")
>>> nuc3.translate() == aa3
True

As above, we use the CodonAligner to align the nucleotide sequence to the amino acid sequence:

>>> alignments3 = aligner.align(pro3, rna3)
>>> len(alignments3)
1
>>> alignment3 = next(alignments3)
>>> print(alignment3)
pro3              0 M  G  T  A  L  L  L  L  L  A  A  L  C  A  A  G  G  A  L  E
rna3              0 ATGGGAACCGCGCTGCTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG

pro3             20
rna3             60

The three amino acid sequences can be aligned to each other, for example using ClustalW. Here, we create the alignment by hand:

>>> import numpy as np
>>> from Bio.Align import Alignment
>>> sequences = [pro1, pro2, pro3]
>>> protein_alignment = Alignment(
...     sequences, coordinates=np.array([[0, 4, 7, 23], [0, 4, 7, 23], [0, 4, 4, 20]])
... )
>>> print(protein_alignment)
pro1              0 SGTARTKLLLLLAALCAAGGALE 23
pro2              0 SGTSRTKRLLLLAALGAAGGALE 23
pro3              0 MGTA---LLLLLAALCAAGGALE 20

Now we can use the mapall method on the protein alignment, with the nucleotide-to-protein pairwise alignments as the argument, to obtain the corresponding codon alignment:

>>> codon_alignment = protein_alignment.mapall([alignment1, alignment2, alignment3])
>>> print(codon_alignment)
rna1              0 TCAGGGACTGCGAGAACCAAGCTA 24
rna2              0 TCAGGGACTTCGAGAACCAAGCGC 24
rna3              0 ATGGGAACCGCG---------CTG 15

rna1             24 CTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG 69
rna2             23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68
rna3             15 CTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG 60

Analyzing a codon alignment

Calculating the number of nonsynonymous and synonymous substitutions per site

The most important application of a codon alignment is to estimate the number of nonsynonymous substitutions per site (dN) and synonymous substitutions per site (dS). These can be calculated by the calculate_dn_ds function in Bio.Align.analysis. This function takes a pairwise codon alignment and input, as well as the optional arguments method specifying the calculation method, codon_table (defaulting to the Standard Code), the ratio k of the transition and transversion rates, and cfreq to specify the equilibrium codon frequency. Biopython currently supports three counting based methods (NG86, LWL85, YN00) as well as the maximum likelihood method (ML) to estimate dN and dS:

  • NG86: Nei and Gojobori (1986) [Nei1986] (default). With this method, you can also specify the ratio of the transition and transversion rates via the argument k, defaulting to 1.0.

  • LWL85: Li et al. (1985) [Li1985].

  • YN00: Yang and Nielsen (2000) [Yang2000].

  • ML: Goldman and Yang (1994) [Goldman1994]. With this method, you can also specify the equilibrium codon frequency via the cfreq argument, with the following options:

    • F1x4: count the nucleotide frequency in the provided codon sequences, and use it to calculate the background codon frequency;

    • F3x4: (default) count the nucleotide frequency separately for the first, second, and third position in the provided codons, and use it to calculate the background codon frequency;

    • F61: count the frequency of codons from the provided codon sequences, with a pseudocount of 0.1.

The calculate_dN_dS method can be applied to a pairwise codon alignment. In general, the different calculation methods will result in slightly different estimates for dN and dS:

>>> from Bio.Align import analysis
>>> pairwise_codon_alignment = codon_alignment[:2]
>>> print(pairwise_codon_alignment)
rna1              0 TCAGGGACTGCGAGAACCAAGCTA 24
                  0 |||||||||.||||||||||||..
rna2              0 TCAGGGACTTCGAGAACCAAGCGC 24

rna1             24 CTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG 69
                 24 ||.||||||||||||||||||.|||||||||||||.||.|||||| 69
rna2             23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68

>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="NG86")
>>> print(dN, dS)  
0.067715... 0.201197...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="LWL85")
>>> print(dN, dS)  
0.068728... 0.207551...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="YN00")
>>> print(dN, dS)  
0.081468... 0.127706...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="ML")
>>> print(dN, dS)  
0.069475... 0.205754...

For a multiple alignment of codon sequences, you can calculate a matrix of dN and dS values:

>>> dN, dS = analysis.calculate_dn_ds_matrix(codon_alignment, method="NG86")
>>> print(dN)
rna1    0.000000
rna2    0.067715    0.000000
rna3    0.060204    0.145469    0.000000
    rna1    rna2    rna3
>>> print(dS)
rna1    0.000000
rna2    0.201198    0.000000
rna3    0.664268    0.798957    0.000000
    rna1    rna2    rna3

The objects dN and dS returned by calculate_dn_ds_matrix are instances of the DistanceMatrix class in Bio.Phylo.TreeConstruction. This function only takes codon_table as an optional argument.

From these two sequences, you can create a dN tree and a dS tree using Bio.Phylo.TreeConstruction:

>>> from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
>>> dn_constructor = DistanceTreeConstructor()
>>> ds_constructor = DistanceTreeConstructor()
>>> dn_tree = dn_constructor.upgma(dN)
>>> ds_tree = ds_constructor.upgma(dS)
>>> print(type(dn_tree))
<class 'Bio.Phylo.BaseTree.Tree'>
>>> print(dn_tree)  
Tree(rooted=True)
    Clade(branch_length=0, name='Inner2')
        Clade(branch_length=0.053296..., name='rna2')
        Clade(branch_length=0.023194..., name='Inner1')
            Clade(branch_length=0.0301021..., name='rna3')
            Clade(branch_length=0.0301021..., name='rna1')
>>> print(ds_tree)  
Tree(rooted=True)
    Clade(branch_length=0, name='Inner2')
        Clade(branch_length=0.365806..., name='rna3')
        Clade(branch_length=0.265207..., name='Inner1')
            Clade(branch_length=0.100598..., name='rna2')
            Clade(branch_length=0.100598..., name='rna1')

Performing the McDonald-Kreitman test

The McDonald-Kreitman test assesses the amount of adaptive evolution by comparing the within species synonymous substitutions and nonsynonymous substitutions to the between species synonymous substitutions and nonsynonymous substitutions to see if they are from the same evolutionary process. The test requires gene sequences sampled from different individuals of the same species. In the following example, we will use Adh gene from fruit fly. The data includes 11 individuals from Drosophila melanogaster, 4 individuals from Drosophila simulans, and 12 individuals from Drosophila yakuba. The protein alignment data and the nucleotide sequences are available in the Tests/codonalign directory as the files adh.aln and drosophila.fasta, respectively, in the Biopython distribution. The function mktest in Bio.Align.analysis implements the Mcdonald-Kreitman test.

>>> from Bio import SeqIO
>>> from Bio import Align
>>> from Bio.Align import CodonAligner
>>> from Bio.Align.analysis import mktest
>>> aligner = CodonAligner()
>>> nucleotide_records = SeqIO.index("drosophila.fasta", "fasta")
>>> for nucleotide_record in nucleotide_records.values():
...     print(nucleotide_record.description)  
...
gi|9097|emb|X57361.1| Drosophila simulans (individual c) ...
gi|9099|emb|X57362.1| Drosophila simulans (individual d) ...
gi|9101|emb|X57363.1| Drosophila simulans (individual e) ...
gi|9103|emb|X57364.1| Drosophila simulans (individual f) ...
gi|9217|emb|X57365.1| Drosophila yakuba (individual a) ...
gi|9219|emb|X57366.1| Drosophila yakuba (individual b) ...
gi|9221|emb|X57367.1| Drosophila yakuba (individual c) ...
gi|9223|emb|X57368.1| Drosophila yakuba (individual d) ...
gi|9225|emb|X57369.1| Drosophila yakuba (individual e) ...
gi|9227|emb|X57370.1| Drosophila yakuba (individual f) ...
gi|9229|emb|X57371.1| Drosophila yakuba (individual g) ...
gi|9231|emb|X57372.1| Drosophila yakuba (individual h) ...
gi|9233|emb|X57373.1| Drosophila yakuba (individual i) ...
gi|9235|emb|X57374.1| Drosophila yakuba (individual j) ...
gi|9237|emb|X57375.1| Drosophila yakuba (individual k) ...
gi|9239|emb|X57376.1| Drosophila yakuba (individual l) ...
gi|156879|gb|M17837.1|DROADHCK D.melanogaster (strain Ja-F) ...
gi|156863|gb|M19547.1|DROADHCC D.melanogaster (strain Af-S) ...
gi|156877|gb|M17836.1|DROADHCJ D.melanogaster (strain Af-F) ...
gi|156875|gb|M17835.1|DROADHCI D.melanogaster (strain Wa-F) ...
gi|156873|gb|M17834.1|DROADHCH D.melanogaster (strain Fr-F) ...
gi|156871|gb|M17833.1|DROADHCG D.melanogaster (strain Fl-F) ...
gi|156869|gb|M17832.1|DROADHCF D.melanogaster (strain Ja-S) ...
gi|156867|gb|M17831.1|DROADHCE D.melanogaster (strain Fl-2S) ...
gi|156865|gb|M17830.1|DROADHCD D.melanogaster (strain Fr-S) ...
gi|156861|gb|M17828.1|DROADHCB D.melanogaster (strain Fl-1S) ...
gi|156859|gb|M17827.1|DROADHCA D.melanogaster (strain Wa-S) ...
>>> protein_alignment = Align.read("adh.aln", "clustal")
>>> len(protein_alignment)
27
>>> print(protein_alignment)  
gi|9217|e         0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
gi|9219|e         0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
gi|9221|e         0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
...
gi|156859         0 MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENPAAIAELKAINPKVTVT

...

gi|9217|e       240 GTLEAIQWSKHWDSGI 256
gi|9219|e       240 GTLEAIQWSKHWDSGI 256
gi|9221|e       240 GTLEAIQWSKHWDSGI 256
...
gi|156859       240 GTLEAIQWTKHWDSGI 256

>>> codon_alignments = []
>>> for protein_record in protein_alignment.sequences:
...     nucleotide_record = nucleotide_records[protein_record.id]
...     alignments = aligner.align(protein_record, nucleotide_record)
...     assert len(alignments) == 1
...     codon_alignment = next(alignments)
...     codon_alignments.append(codon_alignment)
...
>>> print(codon_alignment)  
gi|156859         0 M  S  F  T  L  T  N  K  N  V  I  F  V  A  G  L  G  G  I  G
gi|156859         0 ATGTCGTTTACTTTGACCAACAAGAACGTGATTTTCGTTGCCGGTCTGGGAGGCATTGGT

gi|156859        20 L  D  T  S  K  E  L  L  K  R  D  L  K  N  L  V  I  L  D  R
gi|156859        60 CTGGACACCAGCAAGGAGCTGCTCAAGCGCGATCTGAAGAACCTGGTGATCCTCGACCGC

...

gi|156859       240 G  T  L  E  A  I  Q  W  T  K  H  W  D  S  G  I   256
gi|156859       720 GGCACCCTGGAGGCCATCCAGTGGACCAAGCACTGGGACTCCGGCATC 768

>>> nucleotide_records.close()  # Close indexed FASTA file
>>> alignment = protein_alignment.mapall(codon_alignments)
>>> print(alignment)  
gi|9217|e         0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
gi|9219|e         0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
gi|9221|e         0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
...
gi|156859         0 ATGTCGTTTACTTTGACCAACAAGAACGTGATTTTCGTTGCCGGTCTGGGAGGCATTGGT

...

gi|9217|e       720 GGCACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
gi|9219|e       720 GGCACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
gi|9221|e       720 GGTACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
...
gi|156859       720 GGCACCCTGGAGGCCATCCAGTGGACCAAGCACTGGGACTCCGGCATC 768

>>> unique_species = ["Drosophila simulans", "Drosophila yakuba", "D.melanogaster"]
>>> species = []
>>> for record in alignment.sequences:
...     description = record.description
...     for s in unique_species:
...         if s in description:
...             break
...     else:
...         raise Exception(f"Failed to find species for {description}")
...     species.append(s)
...
>>> print(species)
['Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila simulans', 'Drosophila simulans', 'Drosophila simulans', 'Drosophila simulans', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster']
>>> pvalue = mktest(alignment, species)
>>> print(pvalue)  
0.00206457...

In addition to the multiple codon alignment, the function mktest takes as input the species to which each sequence in the alignment belongs to. The codon table can be provided as an optional argument codon_table.