Bio.Seq module

Provide objects to represent biological sequences.

See also the Seq wiki and the chapter in our tutorial:
class Bio.Seq.SequenceDataAbstractBaseClass

Bases: abc.ABC

Abstract base class for sequence content providers.

Most users will not need to use this class. It is used internally as a base class for sequence content provider classes such as _UndefinedSequenceData defined in this module, and _TwoBitSequenceData in Bio.SeqIO.TwoBitIO. Instances of these classes can be used instead of a bytes object as the data argument when creating a Seq object, and provide the sequence content only when requested via __getitem__. This allows lazy parsers to load and parse sequence data from a file only for the requested sequence regions, and _UndefinedSequenceData instances to raise an exception when undefined sequence data are requested.

Future implementations of lazy parsers that similarly provide on-demand parsing of sequence data should use a subclass of this abstract class and implement the abstract methods __len__ and __getitem__:

  • __len__ must return the sequence length;

  • __getitem__ must return

    • a bytes object for the requested region; or

    • a new instance of the subclass for the requested region; or

    • raise an UndefinedSequenceError.

    Calling __getitem__ for a sequence region of size zero should always return an empty bytes object. Calling __getitem__ for the full sequence (as in data[:]) should either return a bytes object with the full sequence, or raise an UndefinedSequenceError.

Subclasses of SequenceDataAbstractBaseClass must call super().__init__() as part of their __init__ method.

__slots__ = ()
__init__()

Check if __getitem__ returns a bytes-like object.

abstract __len__()
abstract __getitem__(key)
__bytes__()
__hash__()

Return hash(self).

__eq__(other)

Return self==value.

__lt__(other)

Return self<value.

__le__(other)

Return self<=value.

__gt__(other)

Return self>value.

__ge__(other)

Return self>=value.

__add__(other)
__radd__(other)
__mul__(other)
__contains__(item)
decode(encoding='utf-8')

Decode the data as bytes using the codec registered for encoding.

encoding

The encoding with which to decode the bytes.

count(sub, start=None, end=None)

Return the number of non-overlapping occurrences of sub in data[start:end].

Optional arguments start and end are interpreted as in slice notation. This method behaves as the count method of Python strings.

find(sub, start=None, end=None)

Return the lowest index in data where subsection sub is found.

Return the lowest index in data where subsection sub is found, such that sub is contained within data[start,end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

rfind(sub, start=None, end=None)

Return the highest index in data where subsection sub is found.

Return the highest index in data where subsection sub is found, such that sub is contained within data[start,end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

index(sub, start=None, end=None)

Return the lowest index in data where subsection sub is found.

Return the lowest index in data where subsection sub is found, such that sub is contained within data[start,end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the subsection is not found.

rindex(sub, start=None, end=None)

Return the highest index in data where subsection sub is found.

Return the highest index in data where subsection sub is found, such that sub is contained within data[start,end]. Optional arguments start and end are interpreted as in slice notation.

Raise ValueError when the subsection is not found.

startswith(prefix, start=None, end=None)

Return True if data starts with the specified prefix, False otherwise.

With optional start, test data beginning at that position. With optional end, stop comparing data at that position. prefix can also be a tuple of bytes to try.

endswith(suffix, start=None, end=None)

Return True if data ends with the specified suffix, False otherwise.

With optional start, test data beginning at that position. With optional end, stop comparing data at that position. suffix can also be a tuple of bytes to try.

split(sep=None, maxsplit=- 1)

Return a list of the sections in the data, using sep as the delimiter.

sep

The delimiter according which to split the data. None (the default value) means split on ASCII whitespace characters (space, tab, return, newline, formfeed, vertical tab).

maxsplit

Maximum number of splits to do. -1 (the default value) means no limit.

rsplit(sep=None, maxsplit=- 1)

Return a list of the sections in the data, using sep as the delimiter.

sep

The delimiter according which to split the data. None (the default value) means split on ASCII whitespace characters (space, tab, return, newline, formfeed, vertical tab).

maxsplit

Maximum number of splits to do. -1 (the default value) means no limit.

Splitting is done starting at the end of the data and working to the front.

strip(chars=None)

Strip leading and trailing characters contained in the argument.

If the argument is omitted or None, strip leading and trailing ASCII whitespace.

lstrip(chars=None)

Strip leading characters contained in the argument.

If the argument is omitted or None, strip leading ASCII whitespace.

rstrip(chars=None)

Strip trailing characters contained in the argument.

If the argument is omitted or None, strip trailing ASCII whitespace.

upper()

Return a copy of data with all ASCII characters converted to uppercase.

lower()

Return a copy of data with all ASCII characters converted to lowercase.

isupper()

Return True if all ASCII characters in data are uppercase.

If there are no cased characters, the method returns False.

islower()

Return True if all ASCII characters in data are lowercase.

If there are no cased characters, the method returns False.

replace(old, new)

Return a copy with all occurrences of substring old replaced by new.

translate(table, delete=b'')

Return a copy with each character mapped by the given translation table.

table

Translation table, which must be a bytes object of length 256.

All characters occurring in the optional argument delete are removed. The remaining characters are mapped through the given translation table.

property defined

Return True if the sequence is defined, False if undefined or partially defined.

Zero-length sequences are always considered to be defined.

property defined_ranges

Return a tuple of the ranges where the sequence contents is defined.

The return value has the format ((start1, end1), (start2, end2), …).

__abstractmethods__ = frozenset({'__getitem__', '__len__'})
class Bio.Seq.Seq(data, length=None)

Bases: Bio.Seq._SeqAbstractBaseClass

Read-only sequence object (essentially a string with biological methods).

Like normal python strings, our basic sequence object is immutable. This prevents you from doing my_seq[5] = “A” for example, but does allow Seq objects to be used as dictionary keys.

The Seq object provides a number of string like methods (such as count, find, split and strip).

The Seq object also provides some biological methods, such as complement, reverse_complement, transcribe, back_transcribe and translate (which are not applicable to protein sequences).

__init__(data, length=None)

Create a Seq object.

Arguments:
  • data - Sequence, required (string)

  • length - Sequence length, used only if data is None or a dictionary (integer)

You will typically use Bio.SeqIO to read in sequences from files as SeqRecord objects, whose sequence will be exposed as a Seq object via the seq property.

However, you can also create a Seq object directly:

>>> from Bio.Seq import Seq
>>> my_seq = Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF")
>>> my_seq
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')
>>> print(my_seq)
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

To create a Seq object with for a sequence of known length but unknown sequence contents, use None for the data argument and pass the sequence length for the length argument. Trying to access the sequence contents of a Seq object created in this way will raise an UndefinedSequenceError:

>>> my_undefined_sequence = Seq(None, 20)
>>> my_undefined_sequence
Seq(None, length=20)
>>> len(my_undefined_sequence)
20
>>> print(my_undefined_sequence)
Traceback (most recent call last):
...
Bio.Seq.UndefinedSequenceError: Sequence content is undefined

If the sequence contents is known for parts of the sequence only, use a dictionary for the data argument to pass the known sequence segments:

>>> my_partially_defined_sequence = Seq({3: "ACGT"}, 10)
>>> my_partially_defined_sequence
Seq({3: 'ACGT'}, length=10)
>>> len(my_partially_defined_sequence)
10
>>> print(my_partially_defined_sequence)
Traceback (most recent call last):
...
Bio.Seq.UndefinedSequenceError: Sequence content is only partially defined
>>> my_partially_defined_sequence[3:7]
Seq('ACGT')
>>> print(my_partially_defined_sequence[3:7])
ACGT
__hash__()

Hash of the sequence as a string for comparison.

See Seq object comparison documentation (method __eq__ in particular) as this has changed in Biopython 1.65. Older versions would hash on object identity.

tomutable()

Return the full sequence as a MutableSeq object.

>>> from Bio.Seq import Seq
>>> my_seq = Seq("MKQHKAMIVALIVICITAVVAAL")
>>> my_seq
Seq('MKQHKAMIVALIVICITAVVAAL')
>>> my_seq.tomutable()
MutableSeq('MKQHKAMIVALIVICITAVVAAL')
encode(encoding='utf-8', errors='strict')

Return an encoded version of the sequence as a bytes object.

The Seq object aims to match the interface of a Python string.

This is essentially to save you doing str(my_seq).encode() when you need a bytes string, for example for computing a hash:

>>> from Bio.Seq import Seq
>>> Seq("ACGT").encode("ascii")
b'ACGT'
ungap(gap='-')

Return a copy of the sequence without the gap character(s) (DEPRECATED).

The gap character now defaults to the minus sign, and can only be specified via the method argument. This is no longer possible via the sequence’s alphabet (as was possible up to Biopython 1.77):

>>> from Bio.Seq import Seq
>>> my_dna = Seq("-ATA--TGAAAT-TTGAAAA")
>>> my_dna
Seq('-ATA--TGAAAT-TTGAAAA')
>>> my_dna.ungap("-")
Seq('ATATGAAATTTGAAAA')

This method is DEPRECATED; please use my_dna.replace(gap, “”) instead.

__abstractmethods__ = frozenset({})
class Bio.Seq.UnknownSeq(length, alphabet=None, character='?')

Bases: Bio.Seq.Seq

Read-only sequence object of known length but unknown contents (DEPRECATED).

If you have an unknown sequence, you can represent this with a normal Seq object, for example:

>>> my_seq = Seq("N"*5)
>>> my_seq
Seq('NNNNN')
>>> len(my_seq)
5
>>> print(my_seq)
NNNNN

However, this is rather wasteful of memory (especially for large sequences), which is where this class is most useful:

>>> unk_five = UnknownSeq(5)
>>> unk_five
UnknownSeq(5, character='?')
>>> len(unk_five)
5
>>> print(unk_five)
?????

You can add unknown sequence together. Provided the characters are the same, you get another memory saving UnknownSeq:

>>> unk_four = UnknownSeq(4)
>>> unk_four
UnknownSeq(4, character='?')
>>> unk_four + unk_five
UnknownSeq(9, character='?')

If the characters are different, addition gives an ordinary Seq object:

>>> unk_nnnn = UnknownSeq(4, character="N")
>>> unk_nnnn
UnknownSeq(4, character='N')
>>> unk_nnnn + unk_four
Seq('NNNN????')

Combining with a real Seq gives a new Seq object:

>>> known_seq = Seq("ACGT")
>>> unk_four + known_seq
Seq('????ACGT')
>>> known_seq + unk_four
Seq('ACGT????')

Although originally intended for unknown sequences (thus the class name), this can be used for homopolymer sequences like AAAAAA, and the biological methods will respect this:

>>> homopolymer = UnknownSeq(6, character="A")
>>> homopolymer.complement()
UnknownSeq(6, character='T')
>>> homopolymer.complement_rna()
UnknownSeq(6, character='U')
>>> homopolymer.translate()
UnknownSeq(2, character='K')
__init__(length, alphabet=None, character='?')

Create a new UnknownSeq object.

Arguments:
  • length - Integer, required.

  • alphabet - no longer used, must be None.

  • character - single letter string, default “?”. Typically “N” for nucleotides, “X” for proteins, and “?” otherwise.

__len__()

Return the stated length of the unknown sequence.

__bytes__()

Return the unknown sequence as full string of the given length.

__str__()

Return the unknown sequence as full string of the given length.

__repr__()

Return (truncated) representation of the sequence for debugging.

__add__(other)

Add another sequence or string to this sequence.

Adding two UnknownSeq objects returns another UnknownSeq object provided the character is the same.

>>> from Bio.Seq import UnknownSeq
>>> UnknownSeq(10, character='X') + UnknownSeq(5, character='X')
UnknownSeq(15, character='X')

If the characters differ, an UnknownSeq object cannot be used, so a Seq object is returned:

>>> from Bio.Seq import UnknownSeq
>>> UnknownSeq(10, character='X') + UnknownSeq(5, character="x")
Seq('XXXXXXXXXXxxxxx')

If adding a string to an UnknownSeq, a new Seq is returned:

>>> from Bio.Seq import UnknownSeq
>>> UnknownSeq(5, character='X') + "LV"
Seq('XXXXXLV')
__radd__(other)

Add a sequence on the left.

__mul__(other)

Multiply UnknownSeq by integer.

>>> from Bio.Seq import UnknownSeq
>>> UnknownSeq(3) * 2
UnknownSeq(6, character='?')
>>> UnknownSeq(3, character="N") * 2
UnknownSeq(6, character='N')
__rmul__(other)

Multiply integer by UnknownSeq.

>>> from Bio.Seq import UnknownSeq
>>> 2 * UnknownSeq(3)
UnknownSeq(6, character='?')
>>> 2 * UnknownSeq(3, character="N")
UnknownSeq(6, character='N')
__imul__(other)

Multiply UnknownSeq in-place.

>>> from Bio.Seq import UnknownSeq
>>> seq = UnknownSeq(3, character="N")
>>> seq *= 2
>>> seq
UnknownSeq(6, character='N')
__getitem__(index)

Get a subsequence from the UnknownSeq object.

>>> unk = UnknownSeq(8, character="N")
>>> print(unk[:])
NNNNNNNN
>>> print(unk[5:3])

>>> print(unk[1:-1])
NNNNNN
>>> print(unk[1:-1:2])
NNN
count(sub, start=None, end=None)

Return a non-overlapping count, like that of a python string.

This behaves like the python string (and Seq object) method of the same name, which does a non-overlapping count!

For an overlapping search use the newer count_overlap() method.

Returns an integer, the number of occurrences of substring argument sub in the (sub)sequence given by [start:end]. Optional arguments start and end are interpreted as in slice notation.

Arguments:
  • sub - a string or another Seq object to look for

  • start - optional integer, slice start

  • end - optional integer, slice end

>>> "NNNN".count("N")
4
>>> Seq("NNNN").count("N")
4
>>> UnknownSeq(4, character="N").count("N")
4
>>> UnknownSeq(4, character="N").count("A")
0
>>> UnknownSeq(4, character="N").count("AA")
0

HOWEVER, please note because that python strings and Seq objects (and MutableSeq objects) do a non-overlapping search, this may not give the answer you expect:

>>> UnknownSeq(4, character="N").count("NN")
2
>>> UnknownSeq(4, character="N").count("NNN")
1
count_overlap(sub, start=None, end=None)

Return an overlapping count.

For a non-overlapping search use the count() method.

Returns an integer, the number of occurrences of substring argument sub in the (sub)sequence given by [start:end]. Optional arguments start and end are interpreted as in slice notation.

Arguments:
  • sub - a string or another Seq object to look for

  • start - optional integer, slice start

  • end - optional integer, slice end

e.g.

>>> from Bio.Seq import UnknownSeq
>>> UnknownSeq(4, character="N").count_overlap("NN")
3
>>> UnknownSeq(4, character="N").count_overlap("NNN")
2

Where substrings do not overlap, should behave the same as the count() method:

>>> UnknownSeq(4, character="N").count_overlap("N")
4
>>> UnknownSeq(4, character="N").count_overlap("N") == UnknownSeq(4, character="N").count("N")
True
>>> UnknownSeq(4, character="N").count_overlap("A")
0
>>> UnknownSeq(4, character="N").count_overlap("A") == UnknownSeq(4, character="N").count("A")
True
>>> UnknownSeq(4, character="N").count_overlap("AA")
0
>>> UnknownSeq(4, character="N").count_overlap("AA") == UnknownSeq(4, character="N").count("AA")
True
complement()

Return the complement assuming it is DNA.

In typical usage this will return the same unknown sequence:

>>> my_nuc = UnknownSeq(8, character='N')
>>> my_nuc
UnknownSeq(8, character='N')
>>> print(my_nuc)
NNNNNNNN
>>> my_nuc.complement()
UnknownSeq(8, character='N')
>>> print(my_nuc.complement())
NNNNNNNN

If your sequence isn’t actually unknown, and has a nucleotide letter other than N, the appropriate DNA complement base is used:

>>> UnknownSeq(8, character="A").complement()
UnknownSeq(8, character='T')
complement_rna()

Return the complement assuming it is RNA.

In typical usage this will return the same unknown sequence. If your sequence isn’t actually unknown, the appropriate RNA complement base is used:

>>> UnknownSeq(8, character="A").complement_rna()
UnknownSeq(8, character='U')
reverse_complement()

Return the reverse complement assuming it is DNA.

In typical usage this will return the same unknown sequence:

>>> from Bio.Seq import UnknownSeq
>>> example = UnknownSeq(6, character="N")
>>> print(example)
NNNNNN
>>> print(example.reverse_complement())
NNNNNN

If your sequence isn’t actually unknown, the appropriate DNA complement base is used:

>>> UnknownSeq(8, character="A").reverse_complement()
UnknownSeq(8, character='T')
reverse_complement_rna()

Return the reverse complement assuming it is RNA.

In typical usage this will return the same unknown sequence. If your sequence isn’t actually unknown, the appropriate RNA complement base is used:

>>> UnknownSeq(8, character="A").reverse_complement_rna()
UnknownSeq(8, character='U')
transcribe()

Return an unknown RNA sequence from an unknown DNA sequence.

>>> my_dna = UnknownSeq(10, character="N")
>>> my_dna
UnknownSeq(10, character='N')
>>> print(my_dna)
NNNNNNNNNN
>>> my_rna = my_dna.transcribe()
>>> my_rna
UnknownSeq(10, character='N')
>>> print(my_rna)
NNNNNNNNNN

In typical usage this will return the same unknown sequence. If your sequence isn’t actually unknown, but a homopolymer of T, the standard DNA to RNA transcription is done, replacing T with U:

>>> UnknownSeq(9, character="t").transcribe()
UnknownSeq(9, character='u')
back_transcribe()

Return an unknown DNA sequence from an unknown RNA sequence.

>>> my_rna = UnknownSeq(20, character="N")
>>> my_rna
UnknownSeq(20, character='N')
>>> print(my_rna)
NNNNNNNNNNNNNNNNNNNN
>>> my_dna = my_rna.back_transcribe()
>>> my_dna
UnknownSeq(20, character='N')
>>> print(my_dna)
NNNNNNNNNNNNNNNNNNNN

In typical usage this will return the same unknown sequence. If your sequence is actually a U homopolymer, the standard RNA to DNA back translation applies, replacing U with T:

>>> UnknownSeq(9, character="U").back_transcribe()
UnknownSeq(9, character='T')
upper()

Return an upper case copy of the sequence.

>>> from Bio.Seq import UnknownSeq
>>> my_seq = UnknownSeq(20, character="n")
>>> my_seq
UnknownSeq(20, character='n')
>>> print(my_seq)
nnnnnnnnnnnnnnnnnnnn
>>> my_seq.upper()
UnknownSeq(20, character='N')
>>> print(my_seq.upper())
NNNNNNNNNNNNNNNNNNNN

See also the lower method.

lower()

Return a lower case copy of the sequence.

>>> from Bio.Seq import UnknownSeq
>>> my_seq = UnknownSeq(20, character="X")
>>> my_seq
UnknownSeq(20, character='X')
>>> print(my_seq)
XXXXXXXXXXXXXXXXXXXX
>>> my_seq.lower()
UnknownSeq(20, character='x')
>>> print(my_seq.lower())
xxxxxxxxxxxxxxxxxxxx

See also the upper method.

translate(table='Standard', stop_symbol='*', to_stop=False, cds=False, gap='-')

Translate an unknown nucleotide sequence into an unknown protein.

If your sequence makes sense as codons (e.g. a poly-A tail AAAAAA), it will be translated accordingly:

>>> UnknownSeq(7, character='A').translate()
UnknownSeq(2, character='K')

Otherwise, it will be translated as X for unknown amino acid:

>>> UnknownSeq(7).translate()
UnknownSeq(2, character='X')
ungap(gap='-')

Return a copy of the sequence without the gap character(s).

The gap character now defaults to the minus sign, and can only be specified via the method argument. This is no longer possible via the sequence’s alphabet (as was possible up to Biopython 1.77):

>>> from Bio.Seq import UnknownSeq
>>> my_dna = UnknownSeq(20, character='N')
>>> my_dna
UnknownSeq(20, character='N')
>>> my_dna.ungap()  # using default
UnknownSeq(20, character='N')
>>> my_dna.ungap("-")
UnknownSeq(20, character='N')

If the UnknownSeq is using the gap character, then an empty Seq is returned:

>>> my_gap = UnknownSeq(20, character="-")
>>> my_gap
UnknownSeq(20, character='-')
>>> my_gap.ungap()  # using default
Seq('')
>>> my_gap.ungap("-")
Seq('')
join(other)

Return a merge of the sequences in other, spaced by the sequence from self.

Accepts either a Seq or string (and iterates over the letters), or an iterable containing Seq or string objects. These arguments will be concatenated with the calling sequence as the spacer:

>>> concatenated = UnknownSeq(5).join([Seq("AAA"), Seq("TTT"), Seq("PPP")])
>>> concatenated
Seq('AAA?????TTT?????PPP')

If all the inputs are also UnknownSeq using the same character, then it returns a new UnknownSeq:

>>> UnknownSeq(5).join([UnknownSeq(3), UnknownSeq(3), UnknownSeq(3)])
UnknownSeq(19, character='?')

Examples taking a single sequence and joining the letters:

>>> UnknownSeq(3).join("ACGT")
Seq('A???C???G???T')
>>> UnknownSeq(3).join(UnknownSeq(4))
UnknownSeq(13, character='?')

Will only return an UnknownSeq object if all of the objects to be joined are also UnknownSeqs with the same character as the spacer, similar to how the addition of an UnknownSeq and another UnknownSeq would work.

property defined

Return True if the sequence is defined, False if undefined or partially defined.

Zero-length sequences are always considered to be defined.

property defined_ranges

Return a tuple of the ranges where the sequence contents is defined.

As the sequence contents of an UnknownSeq object is fully undefined, the return value is always an empty tuple.

__abstractmethods__ = frozenset({})
class Bio.Seq.MutableSeq(data)

Bases: Bio.Seq._SeqAbstractBaseClass

An editable sequence object.

Unlike normal python strings and our basic sequence object (the Seq class) which are immutable, the MutableSeq lets you edit the sequence in place. However, this means you cannot use a MutableSeq object as a dictionary key.

>>> from Bio.Seq import MutableSeq
>>> my_seq = MutableSeq("ACTCGTCGTCG")
>>> my_seq
MutableSeq('ACTCGTCGTCG')
>>> my_seq[5]
'T'
>>> my_seq[5] = "A"
>>> my_seq
MutableSeq('ACTCGACGTCG')
>>> my_seq[5]
'A'
>>> my_seq[5:8] = "NNN"
>>> my_seq
MutableSeq('ACTCGNNNTCG')
>>> len(my_seq)
11

Note that the MutableSeq object does not support as many string-like or biological methods as the Seq object.

__init__(data)

Create a MutableSeq object.

property data

Get the data.

__setitem__(index, value)

Set a subsequence of single letter via value parameter.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq[0] = 'T'
>>> my_seq
MutableSeq('TCTCGACGTCG')
__delitem__(index)

Delete a subsequence of single letter.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> del my_seq[0]
>>> my_seq
MutableSeq('CTCGACGTCG')
append(c)

Add a subsequence to the mutable sequence object.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.append('A')
>>> my_seq
MutableSeq('ACTCGACGTCGA')

No return value.

insert(i, c)

Add a subsequence to the mutable sequence object at a given index.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.insert(0,'A')
>>> my_seq
MutableSeq('AACTCGACGTCG')
>>> my_seq.insert(8,'G')
>>> my_seq
MutableSeq('AACTCGACGGTCG')

No return value.

pop(i=- 1)

Remove a subsequence of a single letter at given index.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.pop()
'G'
>>> my_seq
MutableSeq('ACTCGACGTC')
>>> my_seq.pop()
'C'
>>> my_seq
MutableSeq('ACTCGACGT')

Returns the last character of the sequence.

remove(item)

Remove a subsequence of a single letter from mutable sequence.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.remove('C')
>>> my_seq
MutableSeq('ATCGACGTCG')
>>> my_seq.remove('A')
>>> my_seq
MutableSeq('TCGACGTCG')

No return value.

reverse()

Modify the mutable sequence to reverse itself.

No return value.

extend(other)

Add a sequence to the original mutable sequence object.

>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.extend('A')
>>> my_seq
MutableSeq('ACTCGACGTCGA')
>>> my_seq.extend('TTT')
>>> my_seq
MutableSeq('ACTCGACGTCGATTT')

No return value.

toseq()

Return the full sequence as a new immutable Seq object.

>>> from Bio.Seq import MutableSeq
>>> my_mseq = MutableSeq("MKQHKAMIVALIVICITAVVAAL")
>>> my_mseq
MutableSeq('MKQHKAMIVALIVICITAVVAAL')
>>> my_mseq.toseq()
Seq('MKQHKAMIVALIVICITAVVAAL')
__abstractmethods__ = frozenset({})
exception Bio.Seq.UndefinedSequenceError

Bases: ValueError

Sequence contents is undefined.

Bio.Seq.transcribe(dna)

Transcribe a DNA sequence into RNA.

If given a string, returns a new string object.

Given a Seq or MutableSeq, returns a new Seq object.

e.g.

>>> transcribe("ACTGN")
'ACUGN'
Bio.Seq.back_transcribe(rna)

Return the RNA sequence back-transcribed into DNA.

If given a string, returns a new string object.

Given a Seq or MutableSeq, returns a new Seq object.

e.g.

>>> back_transcribe("ACUGN")
'ACTGN'
Bio.Seq.translate(sequence, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None)

Translate a nucleotide sequence into amino acids.

If given a string, returns a new string object. Given a Seq or MutableSeq, returns a Seq object.

Arguments:
  • table - Which codon table to use? This can be either a name (string), an NCBI identifier (integer), or a CodonTable object (useful for non-standard genetic codes). Defaults to the “Standard” table.

  • stop_symbol - Single character string, what to use for any terminators, defaults to the asterisk, “*”.

  • to_stop - Boolean, defaults to False meaning do a full translation continuing on past any stop codons (translated as the specified stop_symbol). If True, translation is terminated at the first in frame stop codon (and the stop_symbol is not appended to the returned protein sequence).

  • cds - Boolean, indicates this is a complete CDS. If True, this checks the sequence starts with a valid alternative start codon (which will be translated as methionine, M), that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end (this will be excluded from the protein sequence, regardless of the to_stop option). If these tests fail, an exception is raised.

  • gap - Single character string to denote symbol used for gaps. Defaults to None.

A simple string example using the default (standard) genetic code:

>>> coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
>>> translate(coding_dna)
'VAIVMGR*KGAR*'
>>> translate(coding_dna, stop_symbol="@")
'VAIVMGR@KGAR@'
>>> translate(coding_dna, to_stop=True)
'VAIVMGR'

Now using NCBI table 2, where TGA is not a stop codon:

>>> translate(coding_dna, table=2)
'VAIVMGRWKGAR*'
>>> translate(coding_dna, table=2, to_stop=True)
'VAIVMGRWKGAR'

In fact this example uses an alternative start codon valid under NCBI table 2, GTG, which means this example is a complete valid CDS which when translated should really start with methionine (not valine):

>>> translate(coding_dna, table=2, cds=True)
'MAIVMGRWKGAR'

Note that if the sequence has no in-frame stop codon, then the to_stop argument has no effect:

>>> coding_dna2 = "GTGGCCATTGTAATGGGCCGC"
>>> translate(coding_dna2)
'VAIVMGR'
>>> translate(coding_dna2, to_stop=True)
'VAIVMGR'

NOTE - Ambiguous codons like “TAN” or “NNN” could be an amino acid or a stop codon. These are translated as “X”. Any invalid codon (e.g. “TA?” or “T-A”) will throw a TranslationError.

It will however translate either DNA or RNA.

NOTE - Since version 1.71 Biopython contains codon tables with ‘ambiguous stop codons’. These are stop codons with unambiguous sequence but which have a context dependent coding as STOP or as amino acid. With these tables ‘to_stop’ must be False (otherwise a ValueError is raised). The dual coding codons will always be translated as amino acid, except for ‘cds=True’, where the last codon will be translated as STOP.

>>> coding_dna3 = "ATGGCACGGAAGTGA"
>>> translate(coding_dna3)
'MARK*'
>>> translate(coding_dna3, table=27)  # Table 27: TGA -> STOP or W
'MARKW'

It will however raise a BiopythonWarning (not shown).

>>> translate(coding_dna3, table=27, cds=True)
'MARK'
>>> translate(coding_dna3, table=27, to_stop=True)
Traceback (most recent call last):
   ...
ValueError: You cannot use 'to_stop=True' with this table ...
Bio.Seq.reverse_complement(sequence, inplace=None)

Return the reverse complement as a DNA sequence.

If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.

>>> my_seq = "CGA"
>>> reverse_complement(my_seq, inplace=False)
'TCG'
>>> my_seq = Seq("CGA")
>>> reverse_complement(my_seq, inplace=False)
Seq('TCG')
>>> my_seq = MutableSeq("CGA")
>>> reverse_complement(my_seq, inplace=False)
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')

Any U in the sequence is treated as a T:

>>> reverse_complement(Seq("CGAUT"), inplace=False)
Seq('AATCG')

In contrast, reverse_complement_rna returns an RNA sequence:

>>> reverse_complement_rna(Seq("CGAUT"))
Seq('AAUCG')

Supports and lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:

>>> reverse_complement("ACGTUacgtuXYZxyz", inplace=False)
'zrxZRXaacgtAACGT'

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> reverse_complement(my_seq, inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')

As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

Bio.Seq.reverse_complement_rna(sequence, inplace=False)

Return the reverse complement as an RNA sequence.

If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.

>>> my_seq = "CGA"
>>> reverse_complement_rna(my_seq)
'UCG'
>>> my_seq = Seq("CGA")
>>> reverse_complement_rna(my_seq)
Seq('UCG')
>>> my_seq = MutableSeq("CGA")
>>> reverse_complement_rna(my_seq)
MutableSeq('UCG')
>>> my_seq
MutableSeq('CGA')

Any T in the sequence is treated as a U:

>>> reverse_complement_rna(Seq("CGAUT"))
Seq('AAUCG')

In contrast, reverse_complement returns a DNA sequence:

>>> reverse_complement(Seq("CGAUT"), inplace=False)
Seq('AATCG')

Supports and lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:

>>> reverse_complement_rna("ACGTUacgtuXYZxyz")
'zrxZRXaacguAACGU'

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> reverse_complement_rna(my_seq, inplace=True)
MutableSeq('UCG')
>>> my_seq
MutableSeq('UCG')

As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

Bio.Seq.complement(sequence, inplace=None)

Return the complement as a DNA sequence.

If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.

>>> my_seq = "CGA"
>>> complement(my_seq, inplace=False)
'GCT'
>>> my_seq = Seq("CGA")
>>> complement(my_seq, inplace=False)
Seq('GCT')
>>> my_seq = MutableSeq("CGA")
>>> complement(my_seq, inplace=False)
MutableSeq('GCT')
>>> my_seq
MutableSeq('CGA')

Any U in the sequence is treated as a T:

>>> complement(Seq("CGAUT"), inplace=False)
Seq('GCTAA')

In contrast, complement_rna returns an RNA sequence:

>>> complement_rna(Seq("CGAUT"))
Seq('GCUAA')

Supports and lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:

>>> complement("ACGTUacgtuXYZxyz", inplace=False)
'TGCAAtgcaaXRZxrz'

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> complement(my_seq, inplace=True)
MutableSeq('GCT')
>>> my_seq
MutableSeq('GCT')

As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.

Bio.Seq.complement_rna(sequence, inplace=False)

Return the complement as an RNA sequence.

If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.

>>> my_seq = "CGA"
>>> complement_rna(my_seq)
'GCU'
>>> my_seq = Seq("CGA")
>>> complement_rna(my_seq)
Seq('GCU')
>>> my_seq = MutableSeq("CGA")
>>> complement_rna(my_seq)
MutableSeq('GCU')
>>> my_seq
MutableSeq('CGA')

Any T in the sequence is treated as a U:

>>> complement_rna(Seq("CGAUT"))
Seq('GCUAA')

In contrast, complement returns a DNA sequence:

>>> complement(Seq("CGAUT"),inplace=False)
Seq('GCTAA')

Supports and lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:

>>> complement_rna("ACGTUacgtuXYZxyz")
'UGCAAugcaaXRZxrz'

The sequence is modified in-place and returned if inplace is True:

>>> my_seq = MutableSeq("CGA")
>>> complement(my_seq, inplace=True)
MutableSeq('GCT')
>>> my_seq
MutableSeq('GCT')

As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.