Bio.SwissProt package

Module contents

Code to work with the sprotXX.dat file from SwissProt.

  • Record Holds SwissProt data.

  • Reference Holds reference data from a SwissProt record.

  • read Read one SwissProt record

  • parse Read multiple SwissProt records

exception Bio.SwissProt.SwissProtParserError(*args, line=None)

Bases: ValueError

An error occurred while parsing a SwissProt file.

__init__(self, *args, line=None)

Create a SwissProtParserError object with the offending line.

class Bio.SwissProt.Record

Bases: object

Holds information from a SwissProt record.

  • entry_name Name of this entry, e.g. RL1_ECOLI.

  • data_class Either ‘STANDARD’ or ‘PRELIMINARY’.

  • molecule_type Type of molecule, ‘PRT’,

  • sequence_length Number of residues.

  • accessions List of the accession numbers, e.g. [‘P00321’]

  • created A tuple of (date, release).

  • sequence_update A tuple of (date, release).

  • annotation_update A tuple of (date, release).

  • description Free-format description.

  • gene_name Gene name. See userman.txt for description.

  • organism The source of the sequence.

  • organelle The origin of the sequence.

  • organism_classification The taxonomy classification. List of strings. (

  • taxonomy_id A list of NCBI taxonomy id’s.

  • host_organism A list of names of the hosts of a virus, if any.

  • host_taxonomy_id A list of NCBI taxonomy id’s of the hosts, if any.

  • references List of Reference objects.

  • comments List of strings.

  • cross_references List of tuples (db, id1[, id2][, id3]). See the docs.

  • keywords List of the keywords.

  • features List of tuples (key name, from, to, description). from and to can be either integers for the residue numbers, ‘<’, ‘>’, or ‘?’

  • protein_existence Numerical value describing the evidence for the existence of the protein.

  • seqinfo tuple of (length, molecular weight, CRC32 value)

  • sequence The sequence.


>>> from Bio import SwissProt
>>> example_filename = "SwissProt/sp008"
>>> with open(example_filename) as handle:
...     records = SwissProt.parse(handle)
...     for record in records:
...         print(record.entry_name)
...         print(",".join(record.accessions))
...         print(record.keywords)
...         print(repr(record.organism))
...         print(record.sequence[:20] + "...")
['MHC I', 'Transmembrane', 'Glycoprotein', 'Signal', 'Polymorphism', '3D-structure']
'Homo sapiens (Human).'

Initialize the class.

class Bio.SwissProt.Reference

Bases: object

Holds information from one reference in a SwissProt entry.

  • number Number of reference in an entry.

  • evidence Evidence code. List of strings.

  • positions Describes extent of work. List of strings.

  • comments Comments. List of (token, text).

  • references References. List of (dbname, identifier).

  • authors The authors of the work.

  • title Title of the work.

  • location A citation for the work.


Initialize the class.

class Bio.SwissProt.FeatureTable(location=None, type='', location_operator='', strand=None, id='<unknown id>', qualifiers=None, sub_features=None, ref=None, ref_db=None)

Bases: Bio.SeqFeature.SeqFeature

Stores feature annotations for specific regions of the sequence.

This is a subclass of SeqFeature, defined in Bio.SeqFeature, where the attributes are used as follows:

  • location: location of the feature on the canonical or isoform sequence; the location is stored as an instance of FeatureLocation, defined in Bio.SeqFeature, with the ref attribute set to the isoform ID referring to the canonical or isoform sequence on which the feature is defined

  • id: unique and stable identifier (FTId), only provided for features belonging to the types CARBOHYD, CHAIN, PEPTIDE, PROPEP, VARIANT, or VAR_SEQ

  • type: indicates the type of feature, as defined by the UniProt Knowledgebase documentation:

    • ACT_SITE: amino acid(s) involved in the activity of an enzyme

    • BINDING: binding site for any chemical group

    • CARBOHYD: glycosylation site; an FTId identifier to the GlyConnect database is provided if annotated there

    • CA_BIND: calcium-binding region

    • CHAIN: polypeptide chain in the mature protein

    • COILED: coiled-coil region

    • COMPBIAS: compositionally biased region

    • CONFLICT: different sources report differing sequences

    • CROSSLNK: posttransationally formed amino acid bond

    • DISULFID: disulfide bond

    • DNA_BIND: DNA-binding region

    • DOMAIN: domain, defined as a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold

    • INIT_MET: initiator methionine

    • INTRAMEM: region located in a membrane without crossing it

    • HELIX: alpha-, 3(10)-, or pi-helix secondary structure

    • LIPID: covalent binding of a lipid moiety

    • METAL: binding site for a metal ion

    • MOD_RES: posttranslational modification (PTM) of a residue, annotated by the controlled vocabulary defined by the ptmlist.txt document on the UniProt website

    • MOTIF: short sequence motif of biological interest

    • MUTAGEN: site experimentally altered by mutagenesis

    • NON_CONS: non-consecutive residues

    • NON_STD: non-standard amino acid

    • NON_TER: the residue at an extremity of the sequence is not the terminal residue

    • NP_BIND: nucleotide phosphate-binding region

    • PEPTIDE: released active mature polypeptide

    • PROPEP: any processed propeptide

    • REGION: region of interest in the sequence

    • REPEAT: internal sequence repetition

    • SIGNAL: signal sequence (prepeptide)

    • SITE: amino-acid site of interest not represented by another feature key

    • STRAND: beta-strand secondary structure; either a hydrogen-bonded extended beta strand or a residue in an isolated beta-bridge

    • TOPO_DOM: topological domain

    • TRANSIT: transit peptide (mitochondrion, chloroplast, thylakoid, cyanelle, peroxisome, etc.)

    • TRANSMEM: transmembrane region

    • TURN: H-bonded turn (3-, 4-, or 5-turn)

    • UNSURE: uncertainties in the sequence

    • VARIANT: sequence variant; an FTId is provided for protein sequence variants of Hominidae (great apes and humans)

    • VAR_SEQ: sequence variant produced by alternative splicing, alternative promoter usage, alternative initiation, or ribosomal frameshifting

    • ZN_FING: zinc finger region

  • qualifiers A dictionary of additional information, which may include the feature evidence and free-text notes. While SwissProt includes the feature identifier code (FTId) as a qualifier, it is stored as the attribute ID of the FeatureTable object.


Read multiple SwissProt records from file.

Argument source is a file-like object or a path to a file.

Returns a generator object which yields Bio.SwissProt.Record() objects.

Read one SwissProt record from file.

Argument source is a file-like object or a path to a file.

Returns a Record() object.