Bio.SeqUtils package

Submodules

Module contents

Miscellaneous functions for dealing with sequences.

Bio.SeqUtils.gc_fraction(seq, ambiguous='remove')

Calculate G+C percentage in seq (float between 0 and 1).

Copes with mixed case sequences. Ambiguous Nucleotides in this context are those different from ATCGSW (S is G or C, and W is A or T).

If ambiguous equals “remove” (default), will only count GCS and will only include ACTGSW when calculating the sequence length. Equivalent to removing all characters in the set BDHKMNRVXY before calculating the GC content, as each of these ambiguous nucleotides can either be in (A,T) or in (C,G).

If ambiguous equals “ignore”, it will treat only unambiguous nucleotides (GCS) as counting towards the GC percentage, but will include all ambiguous and unambiguous nucleotides when calculating the sequence length.

If ambiguous equals “weighted”, will use a “mean” value when counting the ambiguous characters, for example, G and C will be counted as 1, N and X will be counted as 0.5, D will be counted as 0.33 etc. See Bio.SeqUtils._gc_values for a full list.

Will raise a ValueError for any other value of the ambiguous parameter.

>>> from Bio.SeqUtils import gc_fraction
>>> seq = "ACTG"
>>> print(f"GC content of {seq} : {gc_fraction(seq):.2f}")
GC content of ACTG : 0.50

S and W are ambiguous for the purposes of calculating the GC content.

>>> seq = "ACTGSSSS"
>>> gc = gc_fraction(seq, "remove")
>>> print(f"GC content of {seq} : {gc:.2f}")
GC content of ACTGSSSS : 0.75
>>> gc = gc_fraction(seq, "ignore")
>>> print(f"GC content of {seq} : {gc:.2f}")
GC content of ACTGSSSS : 0.75
>>> gc = gc_fraction(seq, "weighted")
>>> print(f"GC content with ambiguous counting: {gc:.2f}")
GC content with ambiguous counting: 0.75

Some examples with ambiguous nucleotides.

>>> seq = "ACTGN"
>>> gc = gc_fraction(seq, "ignore")
>>> print(f"GC content of {seq} : {gc:.2f}")
GC content of ACTGN : 0.40
>>> gc = gc_fraction(seq, "weighted")
>>> print(f"GC content with ambiguous counting: {gc:.2f}")
GC content with ambiguous counting: 0.50
>>> gc = gc_fraction(seq, "remove")
>>> print(f"GC content with ambiguous removing: {gc:.2f}")
GC content with ambiguous removing: 0.50

Ambiguous nucleotides are also removed from the length of the sequence.

>>> seq = "GDVV"
>>> gc = gc_fraction(seq, "ignore")
>>> print(f"GC content of {seq} : {gc:.2f}")
GC content of GDVV : 0.25
>>> gc = gc_fraction(seq, "weighted")
>>> print(f"GC content with ambiguous counting: {gc:.4f}")
GC content with ambiguous counting: 0.6667
>>> gc = gc_fraction(seq, "remove")
>>> print(f"GC content with ambiguous removing: {gc:.2f}")
GC content with ambiguous removing: 1.00

Note that this will return zero for an empty sequence.

Bio.SeqUtils.GC123(seq)

Calculate G+C content: total, for first, second and third positions.

Returns a tuple of four floats (percentages between 0 and 100) for the entire sequence, and the three codon positions. e.g.

>>> from Bio.SeqUtils import GC123
>>> GC123("ACTGTN")
(40.0, 50.0, 50.0, 0.0)

Copes with mixed case sequences, but does NOT deal with ambiguous nucleotides.

Bio.SeqUtils.GC_skew(seq, window=100)

Calculate GC skew (G-C)/(G+C) for multiple windows along the sequence.

Returns a list of ratios (floats), controlled by the length of the sequence and the size of the window.

Returns 0 for windows without any G/C by handling zero division errors.

Does NOT look at any ambiguous nucleotides.

Bio.SeqUtils.xGC_skew(seq, window=1000, zoom=100, r=300, px=100, py=100)

Calculate and plot normal and accumulated GC skew (GRAPHICS !!!).

Search for a DNA subseq in seq, return list of [subseq, positions].

Use ambiguous values (like N = A or T or C or G, R = A or G etc.), searches only on forward strand.

Bio.SeqUtils.seq3(seq, custom_map=None, undef_code='Xaa')

Convert protein sequence from one-letter to three-letter code.

The single required input argument ‘seq’ should be a protein sequence using single letter codes, either as a Python string or as a Seq or MutableSeq object.

This function returns the amino acid sequence as a string using the three letter amino acid codes. Output follows the IUPAC standard (including ambiguous characters B for “Asx”, J for “Xle” and X for “Xaa”, and also U for “Sel” and O for “Pyl”) plus “Ter” for a terminator given as an asterisk. Any unknown character (including possible gap characters), is changed into ‘Xaa’ by default.

e.g.

>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'

You can set a custom translation of the codon termination code using the dictionary “custom_map” argument (which defaults to {‘*’: ‘Ter’}), e.g.

>>> seq3("MAIVMGRWKGAR*", custom_map={"*": "***"})
'MetAlaIleValMetGlyArgTrpLysGlyAlaArg***'

You can also set a custom translation for non-amino acid characters, such as ‘-’, using the “undef_code” argument, e.g.

>>> seq3("MAIVMGRWKGA--R*", undef_code='---')
'MetAlaIleValMetGlyArgTrpLysGlyAla------ArgTer'

If not given, “undef_code” defaults to “Xaa”, e.g.

>>> seq3("MAIVMGRWKGA--R*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaXaaXaaArgTer'

This function was inspired by BioPerl’s seq3.

Bio.SeqUtils.seq1(seq, custom_map=None, undef_code='X')

Convert protein sequence from three-letter to one-letter code.

The single required input argument ‘seq’ should be a protein sequence using three-letter codes, either as a Python string or as a Seq or MutableSeq object.

This function returns the amino acid sequence as a string using the one letter amino acid codes. Output follows the IUPAC standard (including ambiguous characters “B” for “Asx”, “J” for “Xle”, “X” for “Xaa”, “U” for “Sel”, and “O” for “Pyl”) plus “*” for a terminator given the “Ter” code. Any unknown character (including possible gap characters), is changed into ‘-’ by default.

e.g.

>>> from Bio.SeqUtils import seq1
>>> seq1("MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer")
'MAIVMGRWKGAR*'

The input is case insensitive, e.g.

>>> from Bio.SeqUtils import seq1
>>> seq1("METalaIlEValMetGLYArgtRplysGlyAlaARGTer")
'MAIVMGRWKGAR*'

You can set a custom translation of the codon termination code using the dictionary “custom_map” argument (defaulting to {‘Ter’: ‘*’}), e.g.

>>> seq1("MetAlaIleValMetGlyArgTrpLysGlyAla***", custom_map={"***": "*"})
'MAIVMGRWKGA*'

You can also set a custom translation for non-amino acid characters, such as ‘-’, using the “undef_code” argument, e.g.

>>> seq1("MetAlaIleValMetGlyArgTrpLysGlyAla------ArgTer", undef_code='?')
'MAIVMGRWKGA??R*'

If not given, “undef_code” defaults to “X”, e.g.

>>> seq1("MetAlaIleValMetGlyArgTrpLysGlyAla------ArgTer")
'MAIVMGRWKGAXXR*'
Bio.SeqUtils.molecular_weight(seq, seq_type='DNA', double_stranded=False, circular=False, monoisotopic=False)

Calculate the molecular mass of DNA, RNA or protein sequences as float.

Only unambiguous letters are allowed. Nucleotide sequences are assumed to have a 5’ phosphate.

Arguments:
  • seq: string, Seq, or SeqRecord object.

  • seq_type: The default is to assume DNA; override this with a string “DNA”, “RNA”, or “protein”.

  • double_stranded: Calculate the mass for the double stranded molecule?

  • circular: Is the molecule circular (has no ends)?

  • monoisotopic: Use the monoisotopic mass tables?

>>> print("%0.2f" % molecular_weight("AGC"))
949.61
>>> print("%0.2f" % molecular_weight(Seq("AGC")))
949.61

However, it is better to be explicit - for example with strings:

>>> print("%0.2f" % molecular_weight("AGC", "DNA"))
949.61
>>> print("%0.2f" % molecular_weight("AGC", "RNA"))
997.61
>>> print("%0.2f" % molecular_weight("AGC", "protein"))
249.29
Bio.SeqUtils.six_frame_translations(seq, genetic_code=1)

Return pretty string showing the 6 frame translations and GC content.

Nice looking 6 frame translation with GC content - code from xbbtools similar to DNA Striders six-frame translation

>>> from Bio.SeqUtils import six_frame_translations
>>> print(six_frame_translations("AUGGCCAUUGUAAUGGGCCGCUGA"))
GC_Frame: a:5 t:0 g:8 c:5
Sequence: auggccauug ... gggccgcuga, 24 nt, 54.17 %GC


1/1
  G  H  C  N  G  P  L
 W  P  L  *  W  A  A
M  A  I  V  M  G  R  *
auggccauuguaaugggccgcuga   54 %
uaccgguaacauuacccggcgacu
A  M  T  I  P  R  Q
 H  G  N  Y  H  A  A  S
  P  W  Q  L  P  G  S

class Bio.SeqUtils.CodonAdaptationIndex(sequences, table=standard_dna_table)

Bases: dict

A codon adaptation index (CAI) implementation.

Implements the codon adaptation index (CAI) described by Sharp and Li (Nucleic Acids Res. 1987 Feb 11;15(3):1281-95).

__init__(sequences, table=standard_dna_table)

Generate a codon adaptiveness table from the coding DNA sequences.

This calculates the relative adaptiveness of each codon (w_ij) as defined by Sharp & Li (Nucleic Acids Research 15(3): 1281-1295 (1987)) from the provided codon DNA sequences.

Arguments:
  • sequences: An iterable over DNA sequences, which may be plain

    strings, Seq objects, MutableSeq objects, or SeqRecord objects.

  • table: A Bio.Data.CodonTable.CodonTable object defining the

    genetic code. By default, the standard genetic code is used.

calculate(sequence)

Calculate and return the CAI (float) for the provided DNA sequence.

optimize(sequence, seq_type='DNA', strict=True)

Return a new DNA sequence with preferred codons only.

Uses the codon adaptiveness table defined by the CodonAdaptationIndex object to generate DNA sequences with only preferred codons. May be useful when designing DNA sequences for transgenic protein expression or codon-optimized proteins like fluorophores.

Arguments:
  • sequence: DNA, RNA, or protein sequence to codon-optimize.

    Supplied as a str, Seq, or SeqRecord object.

  • seq_type: String specifying type of sequence provided.

    Options are “DNA”, “RNA”, and “protein”. Default is “DNA”.

  • strict: Determines whether an exception should be raised when

    two codons are equally prefered for a given amino acid.

Returns:

Seq object with DNA encoding the same protein as the sequence argument, but using only preferred codons as defined by the codon adaptation index. If multiple codons are equally preferred, a warning is issued and one codon is chosen for use in the optimzed sequence.

__str__()

Return str(self).