Bio.Seq module

Provide objects to represent biological sequences.

See also the Seq wiki and the chapter in our tutorial:
class Bio.Seq.SequenceDataAbstractBaseClass

Bases: abc.ABC

Abstract base class for sequence content providers.

Most users will not need to use this class. It is used internally as a base class for sequence content provider classes such as _UndefinedSequenceData defined in this module, and _TwoBitSequenceData in Bio.SeqIO.TwoBitIO. Instances of these classes can be used instead of a bytes object as the data argument when creating a Seq object, and provide the sequence content only when requested via __getitem__. This allows lazy parsers to load and parse sequence data from a file only for the requested sequence regions, and _UndefinedSequenceData instances to raise an exception when undefined sequence data are requested.

Future implementations of lazy parsers that similarly provide on-demand parsing of sequence data should use a subclass of this abstract class and implement the abstract methods __len__ and __getitem__:

  • __len__ must return the sequence length;

  • __getitem__ must return

    • a bytes object for the requested region; or

    • a new instance of the subclass for the requested region; or

    • raise an UndefinedSequenceError.

    Calling __getitem__ for a sequence region of size zero should always return an empty bytes object. Calling __getitem__ for the full sequence (as in data[:]) should either return a bytes object with the full sequence, or raise an UndefinedSequenceError.

Subclasses of SequenceDataAbstractBaseClass must call super().__init__() as part of their __init__ method.

__slots__ = ()
__init__()

Check if __getitem__ returns a bytes-like object.

abstract __len__()
abstract __getitem__(key)
__bytes__()
__hash__()

Return hash(self).

__eq__(other)

Return self==value.

__lt__(other)

Return self<value.

__le__(other)

Return self<=value.

__gt__(other)

Return self>value.

__ge__(other)

Return self>=value.

__add__(other)
__radd__(other)
__mul__(other)
__contains__(item)
decode(encoding='utf-8')

Decode the data as bytes using the codec registered for encoding.

encoding

The encoding with which to decode the bytes.

count(sub, start=None, end=None)

Return the number of non-overlapping occurrences of sub in data[start:end].

Optional arguments start and end are interpreted as in slice notation. This method behaves as the count method of Python strings.

find(sub, start=None, end=None)

Return the lowest index in data where subsection sub is found.

Return the lowest index in data where subsection sub is found, such that sub is contained within data[start,end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

rfind(sub, start=None, end=None)

Return the highest index in data where subsection sub is found.

Return the highest index in data where subsection sub is found, such that sub is contained within data[start,end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

index(sub, start=None, end=None)

Return the lowest index in data where subsection sub is found.

Return the lowest index in data where subsection sub is found, such that sub is contained within data[start,end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the subsection is not found.

rindex(sub, start=None, end=None)

Return the highest index in data where subsection sub is found.

Return the highest index in data where subsection sub is found, such that sub is contained within data[start,end]. Optional arguments start and end are interpreted as in slice notation.

Raise ValueError when the subsection is not found.

startswith(prefix, start=None, end=None)

Return True if data starts with the specified prefix, False otherwise.

With optional start, test data beginning at that position. With optional end, stop comparing data at that position. prefix can also be a tuple of bytes to try.

endswith(suffix, start=None, end=None)

Return True if data ends with the specified suffix, False otherwise.

With optional start, test data beginning at that position. With optional end, stop comparing data at that position. suffix can also be a tuple of bytes to try.

split(sep=None, maxsplit=- 1)

Return a list of the sections in the data, using sep as the delimiter.

sep

The delimiter according which to split the data. None (the default value) means split on ASCII whitespace characters (space, tab, return, newline, formfeed, vertical tab).

maxsplit

Maximum number of splits to do. -1 (the default value) means no limit.

rsplit(sep=None, maxsplit=- 1)

Return a list of the sections in the data, using sep as the delimiter.

sep

The delimiter according which to split the data. None (the default value) means split on ASCII whitespace characters (space, tab, return, newline, formfeed, vertical tab).

maxsplit

Maximum number of splits to do. -1 (the default value) means no limit.

Splitting is done starting at the end of the data and working to the front.

strip(chars=None)

Strip leading and trailing characters contained in the argument.

If the argument is omitted or None, strip leading and trailing ASCII whitespace.

lstrip(chars=None)

Strip leading characters contained in the argument.

If the argument is omitted or None, strip leading ASCII whitespace.

rstrip(chars=None)

Strip trailing characters contained in the argument.

If the argument is omitted or None, strip trailing ASCII whitespace.

upper()

Return a copy of data with all ASCII characters converted to uppercase.

lower()

Return a copy of data with all ASCII characters converted to lowercase.

isupper()

Return True if all ASCII characters in data are uppercase.

If there are no cased characters, the method returns False.

islower()

Return True if all ASCII characters in data are lowercase.

If there are no cased characters, the method returns False.

replace(old, new)

Return a copy with all occurrences of substring old replaced by new.

translate(table, delete=b'')

Return a copy with each character mapped by the given translation table.

table

Translation table, which must be a bytes object of length 256.

All characters occurring in the optional argument delete are removed. The remaining characters are mapped through the given translation table.

property defined

Return True if the sequence is defined, False if undefined or partially defined.

Zero-length sequences are always considered to be defined.

property defined_ranges

Return a tuple of the ranges where the sequence contents is defined.

The return value has the format ((start1, end1), (start2, end2), …).

__abstractmethods__ = frozenset({'__getitem__', '__len__'})
class Bio.Seq.Seq(data, length=None)

Bases: Bio.Seq._SeqAbstractBaseClass

Read-only sequence object (essentially a string with biological methods).

Like normal python strings, our basic sequence object is immutable. This prevents you from doing my_seq[5] = “A” for example, but does allow Seq objects to be used as dictionary keys.

The Seq object provides a number of string like methods (such as count, find, split and strip).

The Seq object also provides some biological methods, such as complement, reverse_complement, transcribe, back_transcribe and translate (which are not applicable to protein sequences).

__init__(data, length=None)

Create a Seq object.

Arguments:
  • data - Sequence, required (string)

  • length - Sequence length, used only if data is None or a dictionary (integer)

You will typically use Bio.SeqIO to read in sequences from files as SeqRecord objects, whose sequence will be exposed as a Seq object via the seq property.

However, you can also create a Seq object directly:

>>> from Bio.Seq import Seq
>>> my_seq = Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF")
>>> my_seq
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')
>>> print(my_seq)
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

To create a Seq object with for a sequence of known length but unknown sequence contents, use None for the data argument and pass the sequence length for the length argument. Trying to access the sequence contents of a Seq object created in this way will raise an UndefinedSequenceError:

>>> my_undefined_sequence = Seq(None, 20)
>>> my_undefined_sequence
Seq(None, length=20)
>>> len(my_undefined_sequence)
20
>>> print(my_undefined_sequence)
Traceback (most recent call last):
...
Bio.Seq.UndefinedSequenceError: Sequence content is undefined

If the sequence contents is known for parts of the sequence only, use a dictionary for the data argument to pass the known sequence segments:

>>> my_partially_defined_sequence = Seq({3: "ACGT"}, 10)
>>> my_partially_defined_sequence
Seq({3: 'ACGT'}, length=10)
>>> len(my_partially_defined_sequence)
10
>>> print(my_partially_defined_sequence)
Traceback (most recent call last):
...
Bio.Seq.UndefinedSequenceError: Sequence content is only partially defined
>>> my_partially_defined_sequence[3:7]
Seq('ACGT')
>>> print(my_partially_defined_sequence[3:7])
ACGT
__hash__()

Hash of the sequence as a string for comparison.

See Seq object comparison documentation (method __eq__ in particular) as this has changed in Biopython 1.65. Older versions would hash on object identity.

ungap(gap='-')

Return a copy of the sequence without the gap character(s) (DEPRECATED).

The gap character now defaults to the minus sign, and can only be specified via the method argument. This is no longer possible via the sequence’s alphabet (as was possible up to Biopython 1.77):

>>> from Bio.Seq import Seq
>>> my_dna = Seq("-ATA--TGAAAT-TTGAAAA")
>>> my_dna
Seq('-ATA--TGAAAT-TTGAAAA')
>>> my_dna.ungap("-")
Seq('ATATGAAATTTGAAAA')

This method is DEPRECATED; please use my_dna.replace(gap, “”) instead.

__abstractmethods__ = frozenset({})
class Bio.Seq.MutableSeq(data)

Bases: Bio.Seq._SeqAbstractBaseClass

An editable sequence object.

Unlike normal python strings and our basic sequence object (the Seq class) which are immutable, the MutableSeq lets you edit the sequence in place. However, this means you cannot use a MutableSeq object as a dictionary key.

>>> from Bio.Seq import MutableSeq
>>> my_seq = MutableSeq("ACTCGTCGTCG")
>>> my_seq
MutableSeq('ACTCGTCGTCG')
>>> my_seq[5]
'T'
>>> my_seq[5] = "A"
>>> my_seq
MutableSeq('ACTCGACGTCG')
>>> my_seq[5]
'A'
>>> my_seq[5:8] = "NNN"
>>> my_seq
MutableSeq('ACTCGNNNTCG')
>>> len(my_seq)
11

Note that the MutableSeq object does not support as many string-like or biological methods as the Seq object.

__init__(data)

Create a MutableSeq object.

__setitem__(index, value)

Set a subsequence of single letter via value parameter.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq[0] = 'T'
>>> my_seq
MutableSeq('TCTCGACGTCG')
__delitem__(index)

Delete a subsequence of single letter.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> del my_seq[0]
>>> my_seq
MutableSeq('CTCGACGTCG')
append(c)

Add a subsequence to the mutable sequence object.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.append('A')
>>> my_seq
MutableSeq('ACTCGACGTCGA')

No return value.

insert(i, c)

Add a subsequence to the mutable sequence object at a given index.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.insert(0,'A')
>>> my_seq
MutableSeq('AACTCGACGTCG')
>>> my_seq.insert(8,'G')
>>> my_seq
MutableSeq('AACTCGACGGTCG')

No return value.

pop(i=- 1)

Remove a subsequence of a single letter at given index.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.pop()
'G'
>>> my_seq
MutableSeq('ACTCGACGTC')
>>> my_seq.pop()
'C'
>>> my_seq
MutableSeq('ACTCGACGT')

Returns the last character of the sequence.

remove(item)

Remove a subsequence of a single letter from mutable sequence.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.remove('C')
>>> my_seq
MutableSeq('ATCGACGTCG')
>>> my_seq.remove('A')
>>> my_seq
MutableSeq('TCGACGTCG')

No return value.

reverse()

Modify the mutable sequence to reverse itself.

No return value.

extend(other)

Add a sequence to the original mutable sequence object.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.extend('A')
>>> my_seq
MutableSeq('ACTCGACGTCGA')
>>> my_seq.extend('TTT')
>>> my_seq
MutableSeq('ACTCGACGTCGATTT')

No return value.

__abstractmethods__ = frozenset({})
exception Bio.Seq.UndefinedSequenceError

Bases: ValueError

Sequence contents is undefined.

Bio.Seq.transcribe(dna)

Transcribe a DNA sequence into RNA.

If given a string, returns a new string object.

Given a Seq or MutableSeq, returns a new Seq object.

e.g.

>>> transcribe("ACTGN")
'ACUGN'
Bio.Seq.back_transcribe(rna)

Return the RNA sequence back-transcribed into DNA.

If given a string, returns a new string object.

Given a Seq or MutableSeq, returns a new Seq object.

e.g.

>>> back_transcribe("ACUGN")
'ACTGN'
Bio.Seq.translate(sequence, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None)

Translate a nucleotide sequence into amino acids.

If given a string, returns a new string object. Given a Seq or MutableSeq, returns a Seq object.

Arguments:
  • table - Which codon table to use? This can be either a name (string), an NCBI identifier (integer), or a CodonTable object (useful for non-standard genetic codes). Defaults to the “Standard” table.

  • stop_symbol - Single character string, what to use for any terminators, defaults to the asterisk, “*”.

  • to_stop - Boolean, defaults to False meaning do a full translation continuing on past any stop codons (translated as the specified stop_symbol). If True, translation is terminated at the first in frame stop codon (and the stop_symbol is not appended to the returned protein sequence).

  • cds - Boolean, indicates this is a complete CDS. If True, this checks the sequence starts with a valid alternative start codon (which will be translated as methionine, M), that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end (this will be excluded from the protein sequence, regardless of the to_stop option). If these tests fail, an exception is raised.

  • gap - Single character string to denote symbol used for gaps. Defaults to None.

A simple string example using the default (standard) genetic code:

>>> coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
>>> translate(coding_dna)
'VAIVMGR*KGAR*'
>>> translate(coding_dna, stop_symbol="@")
'VAIVMGR@KGAR@'
>>> translate(coding_dna, to_stop=True)
'VAIVMGR'

Now using NCBI table 2, where TGA is not a stop codon:

>>> translate(coding_dna, table=2)
'VAIVMGRWKGAR*'
>>> translate(coding_dna, table=2, to_stop=True)
'VAIVMGRWKGAR'

In fact this example uses an alternative start codon valid under NCBI table 2, GTG, which means this example is a complete valid CDS which when translated should really start with methionine (not valine):

>>> translate(coding_dna, table=2, cds=True)
'MAIVMGRWKGAR'

Note that if the sequence has no in-frame stop codon, then the to_stop argument has no effect:

>>> coding_dna2 = "GTGGCCATTGTAATGGGCCGC"
>>> translate(coding_dna2)
'VAIVMGR'
>>> translate(coding_dna2, to_stop=True)
'VAIVMGR'

NOTE - Ambiguous codons like “TAN” or “NNN” could be an amino acid or a stop codon. These are translated as “X”. Any invalid codon (e.g. “TA?” or “T-A”) will throw a TranslationError.

It will however translate either DNA or RNA.

NOTE - Since version 1.71 Biopython contains codon tables with ‘ambiguous stop codons’. These are stop codons with unambiguous sequence but which have a context dependent coding as STOP or as amino acid. With these tables ‘to_stop’ must be False (otherwise a ValueError is raised). The dual coding codons will always be translated as amino acid, except for ‘cds=True’, where the last codon will be translated as STOP.

>>> coding_dna3 = "ATGGCACGGAAGTGA"
>>> translate(coding_dna3)
'MARK*'
>>> translate(coding_dna3, table=27)  # Table 27: TGA -> STOP or W
'MARKW'

It will however raise a BiopythonWarning (not shown).

>>> translate(coding_dna3, table=27, cds=True)
'MARK'
>>> translate(coding_dna3, table=27, to_stop=True)
Traceback (most recent call last):
   ...
ValueError: You cannot use 'to_stop=True' with this table ...
Bio.Seq.reverse_complement(sequence, inplace=None)

Return the reverse complement as a DNA sequence.

If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.

>>> my_seq = "CGA"
>>> reverse_complement(my_seq, inplace=False)
'TCG'
>>> my_seq = Seq("CGA")
>>> reverse_complement(my_seq, inplace=False)
Seq('TCG')
>>> my_seq = MutableSeq("CGA")
>>> reverse_complement(my_seq, inplace=False)
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')

Any U in the sequence is treated as a T:

>>> reverse_complement(Seq("CGAUT"), inplace=False)
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> reverse_complement_rna(Seq("CGAUT"))
Seq('AAUCG')

Supports and lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:

>>> reverse_complement("ACGTUacgtuXYZxyz", inplace=False)
'zrxZRXaacgtAACGT'

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> reverse_complement(my_seq, inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

Bio.Seq.reverse_complement_rna(sequence, inplace=False)

Return the reverse complement as an RNA sequence.

If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.

>>> my_seq = "CGA"
>>> reverse_complement_rna(my_seq)
'UCG'
>>> my_seq = Seq("CGA")
>>> reverse_complement_rna(my_seq)
Seq('UCG')
>>> my_seq = MutableSeq("CGA")
>>> reverse_complement_rna(my_seq)
MutableSeq('UCG')
>>> my_seq
MutableSeq('CGA')

Any T in the sequence is treated as a U:

>>> reverse_complement_rna(Seq("CGAUT"))
Seq('AAUCG')

In contrast, reverse_complement returns a DNA sequence:

>>> reverse_complement(Seq("CGAUT"), inplace=False)
Seq('AATCG')

Supports and lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:

>>> reverse_complement_rna("ACGTUacgtuXYZxyz")
'zrxZRXaacguAACGU'

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> reverse_complement_rna(my_seq, inplace=True)
MutableSeq('UCG')
>>> my_seq
MutableSeq('UCG')

As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

Bio.Seq.complement(sequence, inplace=None)

Return the complement as a DNA sequence.

If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.

>>> my_seq = "CGA"
>>> complement(my_seq, inplace=False)
'GCT'
>>> my_seq = Seq("CGA")
>>> complement(my_seq, inplace=False)
Seq('GCT')
>>> my_seq = MutableSeq("CGA")
>>> complement(my_seq, inplace=False)
MutableSeq('GCT')
>>> my_seq
MutableSeq('CGA')

Any U in the sequence is treated as a T:

>>> complement(Seq("CGAUT"), inplace=False)
Seq('GCTAA')

In contrast, complement_rna returns an RNA sequence:

>>> complement_rna(Seq("CGAUT"))
Seq('GCUAA')

Supports and lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:

>>> complement("ACGTUacgtuXYZxyz", inplace=False)
'TGCAAtgcaaXRZxrz'

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> complement(my_seq, inplace=True)
MutableSeq('GCT')
>>> my_seq
MutableSeq('GCT')

As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

Bio.Seq.complement_rna(sequence, inplace=False)

Return the complement as an RNA sequence.

If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.

>>> my_seq = "CGA"
>>> complement_rna(my_seq)
'GCU'
>>> my_seq = Seq("CGA")
>>> complement_rna(my_seq)
Seq('GCU')
>>> my_seq = MutableSeq("CGA")
>>> complement_rna(my_seq)
MutableSeq('GCU')
>>> my_seq
MutableSeq('CGA')

Any T in the sequence is treated as a U:

>>> complement_rna(Seq("CGAUT"))
Seq('GCUAA')

In contrast, complement returns a DNA sequence:

>>> complement(Seq("CGAUT"),inplace=False)
Seq('GCTAA')

Supports and lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:

>>> complement_rna("ACGTUacgtuXYZxyz")
'UGCAAugcaaXRZxrz'

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> complement(my_seq, inplace=True)
MutableSeq('GCT')
>>> my_seq
MutableSeq('GCT')

As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.