From Biopython
Revision as of 20:33, 14 December 2007 by Maubp (Talk | contribs)

This page describes the SeqRecord object used in Biopython to hold a sequence (as a Seq object) together with identifiers (ID and name), a description and, optionally, annotation and sub-features.

Most of the sequence file format parsers in Biopython can return SeqRecord objects (and may offer a format-specific record object too). The new SeqIO system returns only SeqRecord objects.

Extracting information from a SeqRecord

Let's look more closely at the well-annotated SeqRecord objects Biopython creates from a GenBank file, such as ls_orchid.gbk, which we'll load using the SeqIO module. This file contains 94 records:

from Bio import SeqIO
for index, record in enumerate(SeqIO.parse(open("ls_orchid.gbk"), "genbank")) :
    print "index %i, ID = %s, length %i, with %i features" \
          % (index, record.id, len(record.seq), len(record.features))

And this is some of the output. Remember that Python counts from zero, so the 94 records in this file have been labelled 0 to 93:

index 0, ID = Z78533.1, length 740, with 5 features
index 1, ID = Z78532.1, length 753, with 5 features
index 2, ID = Z78531.1, length 748, with 5 features
...
index 92, ID = Z78440.1, length 744, with 5 features
index 93, ID = Z78439.1, length 592, with 5 features
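
Two details of the loop above are worth spelling out: enumerate() numbers items from zero, and once a Python for loop has finished, its loop variables still hold the values from the final iteration, which is why record refers to the last record read. The sketch below shows both using a plain list holding the first three IDs from the output above as a stand-in for the parsed file. It is written with print() so that it also runs on current Python versions.

```python
# Stand-in data: the first three IDs from the output above, not a parsed file.
ids = ["Z78533.1", "Z78532.1", "Z78531.1"]

for index, identifier in enumerate(ids):
    # enumerate() yields (0, first item), (1, second item), ...
    print("index %i, ID = %s" % (index, identifier))

# The loop variables keep the values from the final iteration.
print("after the loop: index %i, ID = %s" % (index, identifier))
```

With three items the final line reports index 2, just as the 94 records in the real file end at index 93.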

Let's look in a little more detail at the final record:

print record

That should give you a hint of the sort of information held in this object:

ID: Z78439.1
Name: Z78439
Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[<Bio.SeqFeature.Reference ...>, <Bio.SeqFeature.Reference ...>]
/organism=Paphiopedilum barbatum

Let's look a little more closely, and use Python's dir() function to find out more about the SeqRecord object and what it does:

>>> dir(record)
[..., 'annotations', 'dbxrefs', 'description', 'features', 'id', 'name', 'seq']
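
Because dir() returns plain strings, the names beginning with an underscore can be filtered out with a list comprehension. The following is a minimal sketch using a hypothetical ToyRecord class as a stand-in; it is not Biopython's SeqRecord, and its three attributes simply reuse values shown in the output above.

```python
class ToyRecord(object):
    """Hypothetical stand-in for illustration; not Biopython's SeqRecord."""

    def __init__(self):
        self.id = "Z78439.1"
        self.name = "Z78439"
        self.description = "P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA."


toy = ToyRecord()

# dir() returns a sorted list of attribute names as strings;
# keep only those that do not start with an underscore.
public_names = [attr for attr in dir(toy) if not attr.startswith("_")]
print(public_names)
```

The same comprehension applied to a real SeqRecord should leave just the names listed after the "..." above, though the exact list may differ between Biopython versions.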

If you didn't already know, the dir() function returns a list of the names of all the methods and properties of an object (as strings). Names that start with underscores are "special" and we'll ignore them in this discussion. We'll start with the seq property:

>>> print record.seq
>>> print record.seq.__class__

This is a Seq object, another important object type in Biopython, and worthy of its own page in the wiki documentation.

The next three properties are all simple strings:

>>> print record.id
Z78439.1
>>> print record.name
Z78439
>>> print record.description
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.

Have a look at the raw GenBank file to see where these came from.

Next, we'll check the dbxrefs property, which holds any database cross references:

>>> print record.dbxrefs
[]
>>> print record.dbxrefs.__class__
<type 'list'>

An empty list? Disappointing...

How about the annotations property? This is a Python dictionary...

>>> print record.annotations
{'source': 'Paphiopedilum barbatum', 'taxonomy': ...}
>>> print record.annotations.__class__
<type 'dict'>
>>> print record.annotations["source"]
Paphiopedilum barbatum

In this case, most of the values in the dictionary are simple strings, but this isn't always so. Have a look at the references entry for this example, which is a list of Reference objects:

>>> print record.annotations["references"].__class__
<type 'list'>
>>> print len(record.annotations["references"])
>>> for ref in record.annotations["references"] : print ref.authors
Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W.
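
Since the annotation values are not all strings, code that walks the dictionary generically may want to check each value's type before printing it. The sketch below assumes a plain dictionary holding three entries copied from the output above as a stand-in for record.annotations; the summarise() helper is hypothetical and not part of Biopython.

```python
# Stand-in for record.annotations, with entries copied from the output above.
annotations = {
    "source": "Paphiopedilum barbatum",
    "organism": "Paphiopedilum barbatum",
    "keywords": ["5.8S ribosomal RNA", "5.8S rRNA gene",
                 "internal transcribed spacer", "ITS1", "ITS2"],
}


def summarise(value):
    """Hypothetical helper: return strings as-is, describe other containers."""
    if isinstance(value, str):
        return value
    return "%s with %i entries" % (type(value).__name__, len(value))


# Sort the keys so the output order is predictable.
for key in sorted(annotations):
    print("%s: %s" % (key, summarise(annotations[key])))
```

A real record may also hold other value types, so a helper like this would need extending before being used on arbitrary annotations.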

That brings us finally to features, another list property, which contains SeqFeature objects:

>>> print record.features.__class__
<type 'list'>
>>> print len(record.features)
5

SeqFeature objects are complicated enough to warrant their own page...
