Bio.Seq module
Provide objects to represent biological sequences.
- See also the Seq wiki and the chapter in our tutorial.
- class Bio.Seq.SequenceDataAbstractBaseClass
Bases: abc.ABC
Abstract base class for sequence content providers.
Most users will not need to use this class. It is used internally as a base class for sequence content provider classes such as _UndefinedSequenceData defined in this module, and _TwoBitSequenceData in Bio.SeqIO.TwoBitIO. Instances of these classes can be used instead of a bytes object as the data argument when creating a Seq object, and provide the sequence content only when requested via __getitem__. This allows lazy parsers to load and parse sequence data from a file only for the requested sequence regions, and _UndefinedSequenceData instances to raise an exception when undefined sequence data are requested.
Future implementations of lazy parsers that similarly provide on-demand parsing of sequence data should use a subclass of this abstract class and implement the abstract methods __len__ and __getitem__:
- __len__ must return the sequence length;
- __getitem__ must
  - return a bytes object for the requested region; or
  - return a new instance of the subclass for the requested region; or
  - raise an UndefinedSequenceError.
Calling __getitem__ for a sequence region of size zero should always return an empty bytes object. Calling __getitem__ for the full sequence (as in data[:]) should either return a bytes object with the full sequence, or raise an UndefinedSequenceError.
Subclasses of SequenceDataAbstractBaseClass must call super().__init__() as part of their __init__ method.
- __slots__ = ()
- __init__()
Check if
__getitem__
returns a bytes-like object.
- abstract __len__()
- abstract __getitem__(key)
- __bytes__()
- __hash__()
Return hash(self).
- __eq__(other)
Return self==value.
- __lt__(other)
Return self<value.
- __le__(other)
Return self<=value.
- __gt__(other)
Return self>value.
- __ge__(other)
Return self>=value.
- __add__(other)
- __radd__(other)
- __mul__(other)
- __contains__(item)
- decode(encoding='utf-8')
Decode the data as bytes using the codec registered for encoding.
- encoding
The encoding with which to decode the bytes.
- count(sub, start=None, end=None)
Return the number of non-overlapping occurrences of sub in data[start:end].
Optional arguments start and end are interpreted as in slice notation. This method behaves as the count method of Python strings.
- find(sub, start=None, end=None)
Return the lowest index in data where subsection sub is found.
Return the lowest index in data where subsection sub is found, such that sub is contained within data[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
- rfind(sub, start=None, end=None)
Return the highest index in data where subsection sub is found.
Return the highest index in data where subsection sub is found, such that sub is contained within data[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
- index(sub, start=None, end=None)
Return the lowest index in data where subsection sub is found.
Return the lowest index in data where subsection sub is found, such that sub is contained within data[start:end]. Optional arguments start and end are interpreted as in slice notation.
Raise ValueError when the subsection is not found.
- rindex(sub, start=None, end=None)
Return the highest index in data where subsection sub is found.
Return the highest index in data where subsection sub is found, such that sub is contained within data[start:end]. Optional arguments start and end are interpreted as in slice notation.
Raise ValueError when the subsection is not found.
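The difference between the find/rfind pair and the index/rindex pair can be seen on a plain bytes object. This is a sketch under the assumption that the provider methods mirror the bytes methods of the same name, as the descriptions above suggest; it does not use Biopython itself.

```python
data = b"ACGTTGCAACGT"

# find/rfind report a miss by returning -1.
first = data.find(b"ACGT")      # lowest matching index
last = data.rfind(b"ACGT")      # highest matching index
missing = data.find(b"GGG")     # no match

# start and end restrict the search window, as in data[start:end].
in_window = data.find(b"ACGT", 1)

# index/rindex raise ValueError instead of returning -1.
try:
    data.index(b"GGG")
except ValueError:
    raised = True
else:
    raised = False

print(first, last, missing, in_window, raised)
```

Choosing between the two styles is mostly a matter of whether a missing subsection is an expected outcome (find) or an error (index).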
- startswith(prefix, start=None, end=None)
Return True if data starts with the specified prefix, False otherwise.
With optional start, test data beginning at that position. With optional end, stop comparing data at that position. prefix can also be a tuple of bytes to try.
- endswith(suffix, start=None, end=None)
Return True if data ends with the specified suffix, False otherwise.
With optional start, test data beginning at that position. With optional end, stop comparing data at that position. suffix can also be a tuple of bytes to try.
- split(sep=None, maxsplit=-1)
Return a list of the sections in the data, using sep as the delimiter.
- sep
The delimiter according to which to split the data. None (the default value) means split on ASCII whitespace characters (space, tab, return, newline, formfeed, vertical tab).
- maxsplit
Maximum number of splits to do. -1 (the default value) means no limit.
- rsplit(sep=None, maxsplit=-1)
Return a list of the sections in the data, using sep as the delimiter.
- sep
The delimiter according to which to split the data. None (the default value) means split on ASCII whitespace characters (space, tab, return, newline, formfeed, vertical tab).
- maxsplit
Maximum number of splits to do. -1 (the default value) means no limit.
Splitting is done starting at the end of the data and working to the front.
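The effect of maxsplit, and the direction in which rsplit works, is easiest to see side by side. The sketch below uses plain bytes objects as a stand-in, assuming the methods documented here behave like their bytes counterparts.

```python
data = b"ACG-TTA-GGC-TAA"

# With maxsplit=1, split cuts at the first delimiter from the left ...
left = data.split(b"-", 1)

# ... while rsplit cuts at the first delimiter from the right.
right = data.rsplit(b"-", 1)

# sep=None splits on runs of ASCII whitespace and drops empty sections.
on_whitespace = b" ACG\tTTA\nGGC ".split()

print(left)
print(right)
print(on_whitespace)
```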
- strip(chars=None)
Strip leading and trailing characters contained in the argument.
If the argument is omitted or None, strip leading and trailing ASCII whitespace.
- lstrip(chars=None)
Strip leading characters contained in the argument.
If the argument is omitted or None, strip leading ASCII whitespace.
- rstrip(chars=None)
Strip trailing characters contained in the argument.
If the argument is omitted or None, strip trailing ASCII whitespace.
- upper()
Return a copy of data with all ASCII characters converted to uppercase.
- lower()
Return a copy of data with all ASCII characters converted to lowercase.
- isupper()
Return True if all ASCII characters in data are uppercase.
If there are no cased characters, the method returns False.
- islower()
Return True if all ASCII characters in data are lowercase.
If there are no cased characters, the method returns False.
- replace(old, new)
Return a copy with all occurrences of substring old replaced by new.
- translate(table, delete=b'')
Return a copy with each character mapped by the given translation table.
- table
Translation table, which must be a bytes object of length 256.
All characters occurring in the optional argument delete are removed. The remaining characters are mapped through the given translation table.
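A 256-byte translation table of the kind described above can be built with bytes.maketrans from the standard library. The following sketch applies such a table to a plain bytes object; the complement mapping and the choice of characters to delete are illustrative assumptions, not part of Bio.Seq.

```python
# Build a 256-byte table that swaps complementary bases and leaves
# every other byte unchanged.
table = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")

data = b"AC-GT ac-gt"

# Bytes listed in delete are removed first; the rest go through the table.
mapped = data.translate(table, delete=b"- ")

print(len(table), mapped)
```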
- property defined
Return True if the sequence is defined, False if undefined or partially defined.
Zero-length sequences are always considered to be defined.
- property defined_ranges
Return a tuple of the ranges where the sequence contents are defined.
The return value has the format ((start1, end1), (start2, end2), …).
- __abstractmethods__ = frozenset({'__getitem__', '__len__'})
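The subclassing contract described for SequenceDataAbstractBaseClass (implement __len__ and __getitem__, return empty bytes for zero-size regions, call super().__init__()) can be sketched with the standard library alone. Everything named here is hypothetical: SequenceDataContract is a stand-in built on abc.ABC rather than the real Biopython base class, and LazyChunkData with its loader callable only illustrates on-demand loading for slices with a step of 1.

```python
from abc import ABC, abstractmethod


class SequenceDataContract(ABC):
    """Stand-in for the contract only; NOT Bio.Seq.SequenceDataAbstractBaseClass."""

    @abstractmethod
    def __len__(self):
        """Return the sequence length."""

    @abstractmethod
    def __getitem__(self, key):
        """Return a bytes object for the requested region."""


class LazyChunkData(SequenceDataContract):
    """Hypothetical provider that loads bytes only when a region is requested."""

    def __init__(self, length, loader):
        self._length = length
        self._loader = loader  # hypothetical callable: (start, end) -> bytes
        self.calls = 0         # how many times data were actually loaded
        super().__init__()

    def __len__(self):
        return self._length

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        start, end, step = key.indices(self._length)
        if step != 1:
            raise ValueError("this sketch only supports a step of 1")
        if end <= start:
            return b""  # zero-size regions always give empty bytes
        self.calls += 1
        return self._loader(start, end)


backing = b"ACGTACGTAC"  # pretend this lives in a file
data = LazyChunkData(len(backing), lambda start, end: backing[start:end])
print(len(data), data[3:7], data.calls)
```

Because the length is stored separately, len(data) never triggers a load; only non-empty slices reach the loader.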
- class Bio.Seq.Seq(data, length=None)
Bases: Bio.Seq._SeqAbstractBaseClass
Read-only sequence object (essentially a string with biological methods).
Like normal Python strings, our basic sequence object is immutable. This prevents you from doing my_seq[5] = "A" for example, but does allow Seq objects to be used as dictionary keys.
The Seq object provides a number of string-like methods (such as count, find, split and strip).
The Seq object also provides some biological methods, such as complement, reverse_complement, transcribe, back_transcribe and translate (which are not applicable to protein sequences).
- __init__(data, length=None)
Create a Seq object.
- Arguments:
data - Sequence, required (string)
length - Sequence length, used only if data is None or a dictionary (integer)
You will typically use Bio.SeqIO to read in sequences from files as SeqRecord objects, whose sequence will be exposed as a Seq object via the seq property.
However, you can also create a Seq object directly:
>>> from Bio.Seq import Seq
>>> my_seq = Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF")
>>> my_seq
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')
>>> print(my_seq)
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
To create a Seq object for a sequence of known length but unknown sequence contents, use None for the data argument and pass the sequence length for the length argument. Trying to access the sequence contents of a Seq object created in this way will raise an UndefinedSequenceError:
>>> my_undefined_sequence = Seq(None, 20)
>>> my_undefined_sequence
Seq(None, length=20)
>>> len(my_undefined_sequence)
20
>>> print(my_undefined_sequence)
Traceback (most recent call last):
...
Bio.Seq.UndefinedSequenceError: Sequence content is undefined
If the sequence contents are known for parts of the sequence only, use a dictionary for the data argument to pass the known sequence segments:
>>> my_partially_defined_sequence = Seq({3: "ACGT"}, 10)
>>> my_partially_defined_sequence
Seq({3: 'ACGT'}, length=10)
>>> len(my_partially_defined_sequence)
10
>>> print(my_partially_defined_sequence)
Traceback (most recent call last):
...
Bio.Seq.UndefinedSequenceError: Sequence content is only partially defined
>>> my_partially_defined_sequence[3:7]
Seq('ACGT')
>>> print(my_partially_defined_sequence[3:7])
ACGT
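How a dictionary of known segments relates to ranges of the form ((start1, end1), (start2, end2), ...) can be sketched in a few lines of plain Python. The helper names below are hypothetical, and merging segments that touch is an assumption of this sketch; it is not Biopython's implementation.

```python
def defined_ranges(segments):
    """Return ((start, end), ...) for a {start: text} mapping.

    Segments that touch or overlap are merged (an assumption of this sketch).
    """
    ranges = []
    for start in sorted(segments):
        end = start + len(segments[start])
        if ranges and start <= ranges[-1][1]:
            # Extend the previous range instead of opening a new one.
            ranges[-1] = (ranges[-1][0], max(end, ranges[-1][1]))
        else:
            ranges.append((start, end))
    return tuple(ranges)


def is_defined(segments, start, end):
    """Return True if the region [start, end) lies inside one defined range."""
    return any(lo <= start and end <= hi for lo, hi in defined_ranges(segments))


segments = {3: "ACGT"}
print(defined_ranges(segments))
print(is_defined(segments, 3, 7), is_defined(segments, 0, 10))
```

In this picture, slicing a region for which is_defined is True corresponds to the my_partially_defined_sequence[3:7] case above, while asking for the whole sequence corresponds to the case that raises UndefinedSequenceError.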
- __hash__()
Hash of the sequence as a string for comparison.
See Seq object comparison documentation (method __eq__ in particular) as this has changed in Biopython 1.65. Older versions would hash on object identity.
- ungap(gap='-')
Return a copy of the sequence without the gap character(s) (DEPRECATED).
The gap character now defaults to the minus sign, and can only be specified via the method argument. This is no longer possible via the sequence’s alphabet (as was possible up to Biopython 1.77):
>>> from Bio.Seq import Seq
>>> my_dna = Seq("-ATA--TGAAAT-TTGAAAA")
>>> my_dna
Seq('-ATA--TGAAAT-TTGAAAA')
>>> my_dna.ungap("-")
Seq('ATATGAAATTTGAAAA')
This method is DEPRECATED; please use my_dna.replace(gap, "") instead.
- __abstractmethods__ = frozenset({})
- class Bio.Seq.MutableSeq(data)
Bases: Bio.Seq._SeqAbstractBaseClass
An editable sequence object.
Unlike normal Python strings and our basic sequence object (the Seq class) which are immutable, the MutableSeq lets you edit the sequence in place. However, this means you cannot use a MutableSeq object as a dictionary key.
>>> from Bio.Seq import MutableSeq
>>> my_seq = MutableSeq("ACTCGTCGTCG")
>>> my_seq
MutableSeq('ACTCGTCGTCG')
>>> my_seq[5]
'T'
>>> my_seq[5] = "A"
>>> my_seq
MutableSeq('ACTCGACGTCG')
>>> my_seq[5]
'A'
>>> my_seq[5:8] = "NNN"
>>> my_seq
MutableSeq('ACTCGNNNTCG')
>>> len(my_seq)
11
Note that the MutableSeq object does not support as many string-like or biological methods as the Seq object.
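The trade-off between immutability and use as a dictionary key is the same one Python makes between bytes and bytearray. The sketch below uses those built-in types purely as an analogy; Seq and MutableSeq are separate classes and are not used here.

```python
frozen = b"ACTCGTCGTCG"          # immutable, like Seq
editable = bytearray(frozen)     # editable in place, like MutableSeq

# In-place slice assignment works only on the mutable object.
editable[5:8] = b"NNN"

# The immutable object is hashable and can be a dictionary key ...
counts = {frozen: 1}

# ... while the mutable one cannot.
try:
    {editable: 1}
except TypeError:
    unhashable = True
else:
    unhashable = False

print(bytes(editable), unhashable)
```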
- __init__(data)
Create a MutableSeq object.
- __setitem__(index, value)
Set a subsequence or a single letter via the value parameter.
>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq[0] = 'T'
>>> my_seq
MutableSeq('TCTCGACGTCG')
- __delitem__(index)
Delete a subsequence or a single letter.
>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> del my_seq[0]
>>> my_seq
MutableSeq('CTCGACGTCG')
- append(c)
Add a subsequence to the mutable sequence object.
>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.append('A')
>>> my_seq
MutableSeq('ACTCGACGTCGA')
No return value.
- insert(i, c)
Add a subsequence to the mutable sequence object at a given index.
>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.insert(0, 'A')
>>> my_seq
MutableSeq('AACTCGACGTCG')
>>> my_seq.insert(8, 'G')
>>> my_seq
MutableSeq('AACTCGACGGTCG')
No return value.
- pop(i=-1)
Remove a single letter at the given index and return it.
>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.pop()
'G'
>>> my_seq
MutableSeq('ACTCGACGTC')
>>> my_seq.pop()
'C'
>>> my_seq
MutableSeq('ACTCGACGT')
Returns the removed character (by default, the last character of the sequence).
- remove(item)
Remove the first occurrence of a single letter from the mutable sequence.
>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.remove('C')
>>> my_seq
MutableSeq('ATCGACGTCG')
>>> my_seq.remove('A')
>>> my_seq
MutableSeq('TCGACGTCG')
No return value.
- reverse()
Modify the mutable sequence to reverse itself.
No return value.
- extend(other)
Add a sequence to the original mutable sequence object.
>>> my_seq = MutableSeq('ACTCGACGTCG')
>>> my_seq.extend('A')
>>> my_seq
MutableSeq('ACTCGACGTCGA')
>>> my_seq.extend('TTT')
>>> my_seq
MutableSeq('ACTCGACGTCGATTT')
No return value.
- __abstractmethods__ = frozenset({})
- exception Bio.Seq.UndefinedSequenceError
Bases: ValueError
Sequence contents is undefined.
- Bio.Seq.transcribe(dna)
Transcribe a DNA sequence into RNA.
If given a string, returns a new string object.
Given a Seq or MutableSeq, returns a new Seq object.
e.g.
>>> transcribe("ACTGN")
'ACUGN'
- Bio.Seq.back_transcribe(rna)
Return the RNA sequence back-transcribed into DNA.
If given a string, returns a new string object.
Given a Seq or MutableSeq, returns a new Seq object.
e.g.
>>> back_transcribe("ACUGN")
'ACTGN'
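For plain strings, the two examples above amount to swapping T and U while leaving every other letter alone. A minimal standard-library sketch of that idea follows; toy_transcribe and toy_back_transcribe are hypothetical names, and preserving letter case is an assumption of this sketch rather than a statement about Bio.Seq.

```python
def toy_transcribe(dna):
    """Replace T with U, keeping case; other letters are untouched."""
    return dna.replace("T", "U").replace("t", "u")


def toy_back_transcribe(rna):
    """Replace U with T, keeping case; other letters are untouched."""
    return rna.replace("U", "T").replace("u", "t")


rna = toy_transcribe("ACTGN")
dna = toy_back_transcribe(rna)
print(rna, dna)
```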
- Bio.Seq.translate(sequence, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None)
Translate a nucleotide sequence into amino acids.
If given a string, returns a new string object. Given a Seq or MutableSeq, returns a Seq object.
- Arguments:
table - Which codon table to use? This can be either a name (string), an NCBI identifier (integer), or a CodonTable object (useful for non-standard genetic codes). Defaults to the “Standard” table.
stop_symbol - Single character string, what to use for any terminators, defaults to the asterisk, “*”.
to_stop - Boolean, defaults to False meaning do a full translation continuing on past any stop codons (translated as the specified stop_symbol). If True, translation is terminated at the first in frame stop codon (and the stop_symbol is not appended to the returned protein sequence).
cds - Boolean, indicates this is a complete CDS. If True, this checks the sequence starts with a valid alternative start codon (which will be translated as methionine, M), that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end (this will be excluded from the protein sequence, regardless of the to_stop option). If these tests fail, an exception is raised.
gap - Single character string to denote symbol used for gaps. Defaults to None.
A simple string example using the default (standard) genetic code:
>>> coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
>>> translate(coding_dna)
'VAIVMGR*KGAR*'
>>> translate(coding_dna, stop_symbol="@")
'VAIVMGR@KGAR@'
>>> translate(coding_dna, to_stop=True)
'VAIVMGR'
Now using NCBI table 2, where TGA is not a stop codon:
>>> translate(coding_dna, table=2)
'VAIVMGRWKGAR*'
>>> translate(coding_dna, table=2, to_stop=True)
'VAIVMGRWKGAR'
In fact this example uses an alternative start codon valid under NCBI table 2, GTG, which means this example is a complete valid CDS which when translated should really start with methionine (not valine):
>>> translate(coding_dna, table=2, cds=True)
'MAIVMGRWKGAR'
Note that if the sequence has no in-frame stop codon, then the to_stop argument has no effect:
>>> coding_dna2 = "GTGGCCATTGTAATGGGCCGC"
>>> translate(coding_dna2)
'VAIVMGR'
>>> translate(coding_dna2, to_stop=True)
'VAIVMGR'
NOTE - Ambiguous codons like “TAN” or “NNN” could be an amino acid or a stop codon. These are translated as “X”. Any invalid codon (e.g. “TA?” or “T-A”) will throw a TranslationError.
It will however translate either DNA or RNA.
NOTE - Since version 1.71 Biopython contains codon tables with ‘ambiguous stop codons’. These are stop codons with unambiguous sequence but which have a context dependent coding as STOP or as amino acid. With these tables ‘to_stop’ must be False (otherwise a ValueError is raised). The dual coding codons will always be translated as amino acid, except for ‘cds=True’, where the last codon will be translated as STOP.
>>> coding_dna3 = "ATGGCACGGAAGTGA"
>>> translate(coding_dna3)
'MARK*'
>>> translate(coding_dna3, table=27)  # Table 27: TGA -> STOP or W
'MARKW'
It will however raise a BiopythonWarning (not shown).
>>> translate(coding_dna3, table=27, cds=True)
'MARK'
>>> translate(coding_dna3, table=27, to_stop=True)
Traceback (most recent call last):
...
ValueError: You cannot use 'to_stop=True' with this table ...
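The interplay of stop_symbol and to_stop described above can be traced with a toy translator. This is a sketch only: TOY_TABLE is a deliberately partial codon table covering just the codons in the first example, toy_translate is a hypothetical name, and none of the cds, gap, ambiguity or alternative-table handling of Bio.Seq.translate is reproduced.

```python
# Partial codon table (assumption): only the codons used in the example.
TOY_TABLE = {
    "GTG": "V", "GCC": "A", "ATT": "I", "GTA": "V", "ATG": "M",
    "GGC": "G", "CGC": "R", "AAG": "K", "GGT": "G", "CGA": "R",
}
TOY_STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard code


def toy_translate(dna, stop_symbol="*", to_stop=False):
    """Translate codon by codon; unknown codons raise KeyError in this sketch."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = dna[i:i + 3].upper()
        if codon in TOY_STOPS:
            if to_stop:
                break  # stop here; the stop symbol is not appended
            protein.append(stop_symbol)
        else:
            protein.append(TOY_TABLE[codon])
    return "".join(protein)


coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(toy_translate(coding_dna))
print(toy_translate(coding_dna, stop_symbol="@"))
print(toy_translate(coding_dna, to_stop=True))
```

With to_stop=False the loop keeps going past each stop codon and records the stop symbol; with to_stop=True it ends at the first in-frame stop.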
- Bio.Seq.reverse_complement(sequence, inplace=None)
Return the reverse complement as a DNA sequence.
If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.
>>> my_seq = "CGA"
>>> reverse_complement(my_seq, inplace=False)
'TCG'
>>> my_seq = Seq("CGA")
>>> reverse_complement(my_seq, inplace=False)
Seq('TCG')
>>> my_seq = MutableSeq("CGA")
>>> reverse_complement(my_seq, inplace=False)
MutableSeq('TCG')
>>> my_seq
MutableSeq('CGA')
Any U in the sequence is treated as a T:
>>> reverse_complement(Seq("CGAUT"), inplace=False)
Seq('AATCG')
In contrast, reverse_complement_rna returns an RNA sequence:
>>> reverse_complement_rna(Seq("CGAUT"))
Seq('AAUCG')
Supports lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:
>>> reverse_complement("ACGTUacgtuXYZxyz", inplace=False)
'zrxZRXaacgtAACGT'
The sequence is modified in-place and returned if inplace is True:
>>> my_seq = MutableSeq("CGA")
>>> reverse_complement(my_seq, inplace=True)
MutableSeq('TCG')
>>> my_seq
MutableSeq('TCG')
As strings and Seq objects are immutable, a TypeError is raised if reverse_complement is called on a Seq object with inplace=True.
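The behaviour shown in the examples above (case preserved, U treated as T, ambiguity codes complemented, other characters left alone) can be reproduced for plain strings with a translation table. The sketch below is a standard-library illustration with a hypothetical function name; the mapping follows the usual IUPAC nucleotide complements and is not taken from the Bio.Seq source.

```python
# IUPAC complements; U is complemented to A, so the result is DNA.
_COMPLEMENT = str.maketrans(
    "ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    "TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn",
)


def toy_reverse_complement(sequence):
    """Complement each letter, then reverse; unmapped characters pass through."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(toy_reverse_complement("CGA"))
print(toy_reverse_complement("CGAUT"))
print(toy_reverse_complement("ACGTUacgtuXYZxyz"))
```

Because str.translate leaves characters that are missing from the table unchanged, letters such as X and Z survive as they are, matching the "not converted" rule above.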
- Bio.Seq.reverse_complement_rna(sequence, inplace=False)
Return the reverse complement as an RNA sequence.
If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.
>>> my_seq = "CGA"
>>> reverse_complement_rna(my_seq)
'UCG'
>>> my_seq = Seq("CGA")
>>> reverse_complement_rna(my_seq)
Seq('UCG')
>>> my_seq = MutableSeq("CGA")
>>> reverse_complement_rna(my_seq)
MutableSeq('UCG')
>>> my_seq
MutableSeq('CGA')
Any T in the sequence is treated as a U:
>>> reverse_complement_rna(Seq("CGAUT"))
Seq('AAUCG')
In contrast, reverse_complement returns a DNA sequence:
>>> reverse_complement(Seq("CGAUT"), inplace=False)
Seq('AATCG')
Supports lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:
>>> reverse_complement_rna("ACGTUacgtuXYZxyz")
'zrxZRXaacguAACGU'
The sequence is modified in-place and returned if inplace is True:
>>> my_seq = MutableSeq("CGA")
>>> reverse_complement_rna(my_seq, inplace=True)
MutableSeq('UCG')
>>> my_seq
MutableSeq('UCG')
As strings and Seq objects are immutable, a TypeError is raised if reverse_complement_rna is called on a Seq object with inplace=True.
- Bio.Seq.complement(sequence, inplace=None)
Return the complement as a DNA sequence.
If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.
>>> my_seq = "CGA"
>>> complement(my_seq, inplace=False)
'GCT'
>>> my_seq = Seq("CGA")
>>> complement(my_seq, inplace=False)
Seq('GCT')
>>> my_seq = MutableSeq("CGA")
>>> complement(my_seq, inplace=False)
MutableSeq('GCT')
>>> my_seq
MutableSeq('CGA')
Any U in the sequence is treated as a T:
>>> complement(Seq("CGAUT"), inplace=False)
Seq('GCTAA')
In contrast, complement_rna returns an RNA sequence:
>>> complement_rna(Seq("CGAUT"))
Seq('GCUAA')
Supports lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:
>>> complement("ACGTUacgtuXYZxyz", inplace=False)
'TGCAAtgcaaXRZxrz'
The sequence is modified in-place and returned if inplace is True:
>>> my_seq = MutableSeq("CGA")
>>> complement(my_seq, inplace=True)
MutableSeq('GCT')
>>> my_seq
MutableSeq('GCT')
As strings and Seq objects are immutable, a TypeError is raised if complement is called on a Seq object with inplace=True.
- Bio.Seq.complement_rna(sequence, inplace=False)
Return the complement as an RNA sequence.
If given a string, returns a new string object. Given a Seq object, returns a new Seq object. Given a MutableSeq, returns a new MutableSeq object. Given a SeqRecord object, returns a new SeqRecord object.
>>> my_seq = "CGA"
>>> complement_rna(my_seq)
'GCU'
>>> my_seq = Seq("CGA")
>>> complement_rna(my_seq)
Seq('GCU')
>>> my_seq = MutableSeq("CGA")
>>> complement_rna(my_seq)
MutableSeq('GCU')
>>> my_seq
MutableSeq('CGA')
Any T in the sequence is treated as a U:
>>> complement_rna(Seq("CGAUT"))
Seq('GCUAA')
In contrast, complement returns a DNA sequence:
>>> complement(Seq("CGAUT"), inplace=False)
Seq('GCTAA')
Supports lower- and upper-case characters, and unambiguous and ambiguous nucleotides. All other characters are not converted:
>>> complement_rna("ACGTUacgtuXYZxyz")
'UGCAAugcaaXRZxrz'
The sequence is modified in-place and returned if inplace is True:
>>> my_seq = MutableSeq("CGA")
>>> complement_rna(my_seq, inplace=True)
MutableSeq('GCU')
>>> my_seq
MutableSeq('GCU')
As strings and Seq objects are immutable, a TypeError is raised if complement_rna is called on a Seq object with inplace=True.