SeqRecord
(typo) |
m (phrasing) |
||
| Line 14: | Line 14: | ||
</python> | </python> | ||
| − | And this is some of the output. Remember python likes to count from zero, so | + | And this is some of the output. Remember python likes to count from zero, so the 94 records in this file have been labelled 0 to 93: |
index 0, ID = Z78533.1, length 740, with 5 features | index 0, ID = Z78533.1, length 740, with 5 features | ||
Revision as of 11:57, 12 November 2007
This page will describe the SeqRecord object used in BioPython to hold a sequence (as a Seq object) with identifiers (ID and name), description and optionally annotation and sub-features.
Most of the sequence file format parsers in BioPython can return SeqRecord objects (and may offer a format specific record object too). The new SeqIO system will only return SeqRecord objects.
Extracting information from a SeqRecord
Lets look in closer detail at the well annotated SeqRecord objects Biopython creates from a GenBank file, such as [ls_orchid.gbk]. This file contains 94 records:
from Bio import SeqIO for index, record in enumerate(SeqIO.parse(open("ls_orchid.gbk"), "genbank")) : print "index %i, ID = %s, length %i, with %i features" \ % (index, record.id, len(record.seq), len(record.features))
And this is some of the output. Remember python likes to count from zero, so the 94 records in this file have been labelled 0 to 93:
index 0, ID = Z78533.1, length 740, with 5 features index 1, ID = Z78532.1, length 753, with 5 features index 2, ID = Z78531.1, length 748, with 5 features ... index 92, ID = Z78440.1, length 744, with 5 features index 93, ID = Z78439.1, length 592, with 5 features
Lets look in a little more detail at the final record:
print record
That should give you a hint of the sort of information held in this object:
ID: Z78439.1
Name: Z78439
Desription: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[<Bio.SeqFeature.Reference ...>, <Bio.SeqFeature.Reference ...>]
/data_file_division=PLN
/date=30-NOV-2006
/organism=Paphiopedilum barbatum
/gi=2765564
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
Lets look a little more closely... and use python's dir() function to find out more about the SeqRecord object and what it does:
>>> dir(record) [..., 'annotations', 'dbxrefs', 'description', 'features', 'id', 'name', 'seq']
If you didn't already know, the dir() function returns a list of all the methods and properties of an object (as strings). Those starting underscores in their name are "special" and we'll be ignoring them in this discussion. We'll start with the seq property:
>>> print record.seq Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA()) >>> print record.seq.__class__ Bio.Seq.Seq
This is a Seq object, another important object type in Biopython, and worth of its own page on the wiki documentation.
The next three properties are all simple strings:
>>> print record.id Z78439.1 >>> print record.name Z78439 >>> print record.description P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Have a look at the raw GenBank file to see where these came from.
Next, we'll check the dxrefs property, which holds any database cross references:
>>> print record.dbxrefs [] >>> print record.dbxrefs.__class__ <type 'list'>
An empty list? Disappointing...
How about the annotations property? This is a python dictionary...
>>> print record.annotations {'source': 'Paphiopedilum barbatum', 'taxonomy': ...} >>> print record.annotations.__class__ <type 'dict'> >>> print record.annotations["source"] Paphiopedilum barbatum
In this case, most of the values in the dictionary are simple strings, but this isn't always the case - have a look at the references entry for this example - its a list of Reference objects:
>>> print record.annotations["references"].__class__ <type 'list'> >>> print len(record.annotations["references"]) 2 >>> for ref in record.annotations["references"] : print ref.authors Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W. Cox,A.V.
That brings us finally to features which is another list property, and it contains SeqFeature objects:
>>> print record.features.__class__ <type 'list'> >>> print len(record.features) 5
SeqFeature objects are complicated enough to warrent their own page...