SeqRecord

(Difference between revisions)
Jump to: navigation, search
m (Extracting information from a SeqRecord)
(20 intermediate revisions by 4 users not shown)
Line 1: Line 1:
This page will describe the SeqRecord object used in BioPython to hold a sequence (as a [[Seq]] object) with identifiers (ID and name), description and optionally annotation and sub-features.
+
This page describes the '''SeqRecord''' object used in BioPython to hold a sequence (as a [[Seq]] object) with identifiers (ID and name), description and optionally annotation and sub-features.
  
Most of the sequence file format parsers in BioPython can return SeqRecord objects (and may offer a format specific record object too).  The new [[SeqIO]] system will only return SeqRecord objects.
+
Most of the sequence file format parsers in BioPython can return '''SeqRecord''' objects (and may offer a format specific record object too, see for example Bio.SwissProt).  The [[SeqIO]] system will ''only'' return SeqRecord objects.
  
Lets look in closer detail at the well annotated SeqRecord objects Biopython creates from a GenBank file, such as [[http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk ls_orchid.gbk]].  This file contains 94 records:
+
In addition to the '''SeqRecord''' object's [http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html API documentation], there is a whole chapter in the [http://biopython.org/DIST/docs/tutorial/Tutorial.html Tutorial] ([http://biopython.org/DIST/docs/tutorial/Tutorial.pdf PDF]), and the [[SeqIO]] page is also very relevant.
 +
 
 +
== Creating a SeqRecord object ==
 +
 
 +
Most of the time you'll create '''SeqRecord''' objects by parsing a sequence file with [[SeqIO|Bio.SeqIO]].  However, it is useful to know how to create a '''SeqRecord''' directly.  For example,
 +
<python>
 +
from Bio.Seq import Seq
 +
from Bio.SeqRecord import SeqRecord
 +
from Bio.Alphabet import IUPAC
 +
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
 +
                  IUPAC.protein),
 +
                  id="YP_025292.1", name="HokC",
 +
                  description="toxic membrane protein, small")
 +
print record
 +
</python>
 +
 
 +
This would give the following output:
 +
 
 +
ID: YP_025292.1
 +
Name: HokC
 +
Description: toxic membrane protein, small
 +
Number of features: 0
 +
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
 +
 
 +
== Extracting information from a SeqRecord ==
 +
 
 +
Lets look in closer detail at the well annotated '''SeqRecord''' objects Biopython creates from a GenBank file, such as [http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk ls_orchid.gbk], which we'll load using the [[SeqIO]] module.  This file contains 94 records:
  
 
<python>
 
<python>
Line 12: Line 38:
 
</python>
 
</python>
  
And this is some of the output. Remember python likes to count from zero, so for the 94 records in this file they have been labelled 0 to 93:
+
And this is some of the output. Remember python likes to count from zero, so the 94 records in this file have been labelled 0 to 93:
  
 
  index 0, ID = Z78533.1, length 740, with 5 features
 
  index 0, ID = Z78533.1, length 740, with 5 features
Line 31: Line 57:
 
  ID: Z78439.1
 
  ID: Z78439.1
 
  Name: Z78439
 
  Name: Z78439
  Desription: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
+
  Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
 +
Number of features: 5
 
  /source=Paphiopedilum barbatum
 
  /source=Paphiopedilum barbatum
 
  /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
 
  /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
Line 42: Line 69:
 
  Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
 
  Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
  
Lets look a little more closely... we'll start with the '''seq''' property:
+
Lets look a little more closely... and use python's '''dir()''' function to find out more about the SeqRecord object and what it does:
  
 
<python>
 
<python>
print record.seq
+
>>> dir(record)
 +
[..., 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'name', 'seq']
 
</python>
 
</python>
  
That should give:
+
If you didn't already know, the '''dir()''' function returns a list of all the methods and properties of an object (as strings). Those starting with underscores in their name are "special" and we'll be ignoring them in this discussion. We'll start with the '''seq''' property:
  
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
+
<python>
 +
>>> print record.seq
 +
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
 +
>>> print record.seq.__class__
 +
Bio.Seq.Seq
 +
</python>
  
 
This is a [[Seq]] object, another important object type in Biopython, and worth of its own page on the wiki documentation.
 
This is a [[Seq]] object, another important object type in Biopython, and worth of its own page on the wiki documentation.
  
The next three properties are all simple strings:
+
The following three properties are all simple strings:
  
 
<python>
 
<python>
print record.id
+
>>> print record.id
print record.name
+
Z78439.1
print record.description
+
>>> print record.name
 +
Z78439
 +
>>> print record.description
 +
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
 
</python>
 
</python>
 
Z78439.1
 
Z78439
 
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
 
  
 
Have a look at the raw GenBank file to see where these came from.
 
Have a look at the raw GenBank file to see where these came from.
 +
 +
Next, we'll check the '''dxrefs''' property, which holds any database cross references:
 +
 +
<python>
 +
>>> print record.dbxrefs
 +
[]
 +
>>> print record.dbxrefs.__class__
 +
<type 'list'>
 +
</python>
 +
 +
An empty list? Disappointing... if we'd used a more recent GenBank file the genome sequencing project reference would show up here.
 +
 +
How about the '''annotations''' property? This is a python dictionary...
 +
 +
<python>
 +
>>> print record.annotations
 +
{'source': 'Paphiopedilum barbatum', 'taxonomy': ...}
 +
>>> print record.annotations.__class__
 +
<type 'dict'>
 +
>>> print record.annotations["source"]
 +
Paphiopedilum barbatum
 +
</python>
 +
 +
In this case, most of the values in the dictionary are simple strings, but this isn't always the case - have a look at the references entry for this example - its a list of '''Reference''' objects:
 +
 +
<python>
 +
>>> print record.annotations["references"].__class__
 +
<type 'list'>
 +
>>> print len(record.annotations["references"])
 +
2
 +
>>> for ref in record.annotations["references"] : print ref.authors
 +
Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W.
 +
Cox,A.V.
 +
</python>
 +
 +
Next is '''features''' which is another list property, and it contains SeqFeature objects:
 +
 +
<python>
 +
>>> print record.features.__class__
 +
<type 'list'>
 +
>>> print len(record.features)
 +
5
 +
</python>
 +
 +
SeqFeature objects are complicated enough to warrant their own wiki page... for now please refer to the Tutorial.
 +
 +
If you are using Biopython 1.48 or later, there will be a '''format''' method.  This lets you convert the '''SeqRecord''' into a string using one of the output formats supported by [[SeqIO|Bio.SeqIO]], for example:
 +
 +
<python>
 +
>>> print record.format("fasta")
 +
</python>
 +
 +
This should give:
 +
 +
>Z78439.1 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
 +
CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC
 +
ACCCATGGGCATTTGCTGTTGAAGTGACCTAGATTTGCCATCGAGCCTCCTTGGGAGCTT
 +
TCTTGTTGGCGAGATCTAAACCCCTGCCCGGCGGAGTTGGGCGCCAAGTCATATGACACA
 +
TAATTGGTGAAGGGGGTGGTAATCCTGCCCTGACCCTCCCCAAATTATTTTTTTAACAAC
 +
TCTCAGCAACGGATATCTCGGCTCTTGCATCGATGAAGAACGCAGCGAAATGCGATAATG
 +
GTGTGAATTGCAGAATCCCGTGAACATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCA
 +
TCAGGCCAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTG
 +
TCCATACATACTGTTCAGCCGGTGCGGATGTGAGTTTGGCCCCTTGTTCTTTGGTACGGG
 +
GGGTCTAAGAGCTGCATGGGCTTTGGATGGTCCTAAATACGGAAAGAGGTGGACGAACTA
 +
TGCTACAACAAAATTGTTGTGCAAATGCCCCGGTTGGCCGTTTAGTTGGGCC
 +
 +
 +
If you are using Biopython 1.50 or later, there will also be a '''letter_annotations''' property.  Again this is a dictionary but for per-letter-annotation such as sequence quality scores or secondary structure predictions.  This kind of information isn't found in GenBank files, so in this case the dictionary is empty:
 +
 +
<python>
 +
>>> print record.letter_annotations
 +
{}
 +
</python>
 +
 +
Have a look at FASTQ or QUAL files to see how quality scores are represented.  Stockholm (PFAM) alignment files also often include per-letter-annotation.
 +
[[Category:Wiki Documentation]]

Revision as of 17:14, 5 August 2011

This page describes the SeqRecord object used in BioPython to hold a sequence (as a Seq object) with identifiers (ID and name), description and optionally annotation and sub-features.

Most of the sequence file format parsers in BioPython can return SeqRecord objects (and may offer a format specific record object too, see for example Bio.SwissProt). The SeqIO system will only return SeqRecord objects.

In addition to the SeqRecord object's API documentation, there is a whole chapter in the Tutorial (PDF), and the SeqIO page is also very relevant.

Creating a SeqRecord object

Most of the time you'll create SeqRecord objects by parsing a sequence file with Bio.SeqIO. However, it is useful to know how to create a SeqRecord directly. For example,

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                   IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print record

This would give the following output:

ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

Extracting information from a SeqRecord

Lets look in closer detail at the well annotated SeqRecord objects Biopython creates from a GenBank file, such as ls_orchid.gbk, which we'll load using the SeqIO module. This file contains 94 records:

from Bio import SeqIO
for index, record in enumerate(SeqIO.parse(open("ls_orchid.gbk"), "genbank")) :
    print "index %i, ID = %s, length %i, with %i features" \
          % (index, record.id, len(record.seq), len(record.features))

And this is some of the output. Remember python likes to count from zero, so the 94 records in this file have been labelled 0 to 93:

index 0, ID = Z78533.1, length 740, with 5 features
index 1, ID = Z78532.1, length 753, with 5 features
index 2, ID = Z78531.1, length 748, with 5 features
...
index 92, ID = Z78440.1, length 744, with 5 features
index 93, ID = Z78439.1, length 592, with 5 features

Lets look in a little more detail at the final record:

print record

That should give you a hint of the sort of information held in this object:

ID: Z78439.1
Name: Z78439
Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[<Bio.SeqFeature.Reference ...>, <Bio.SeqFeature.Reference ...>]
/data_file_division=PLN
/date=30-NOV-2006
/organism=Paphiopedilum barbatum
/gi=2765564
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())

Lets look a little more closely... and use python's dir() function to find out more about the SeqRecord object and what it does:

>>> dir(record)
[..., 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'name', 'seq']

If you didn't already know, the dir() function returns a list of all the methods and properties of an object (as strings). Those starting with underscores in their name are "special" and we'll be ignoring them in this discussion. We'll start with the seq property:

>>> print record.seq
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
>>> print record.seq.__class__
Bio.Seq.Seq

This is a Seq object, another important object type in Biopython, and worth of its own page on the wiki documentation.

The following three properties are all simple strings:

>>> print record.id
Z78439.1
>>> print record.name
Z78439
>>> print record.description
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.

Have a look at the raw GenBank file to see where these came from.

Next, we'll check the dxrefs property, which holds any database cross references:

>>> print record.dbxrefs
[]
>>> print record.dbxrefs.__class__
<type 'list'>

An empty list? Disappointing... if we'd used a more recent GenBank file the genome sequencing project reference would show up here.

How about the annotations property? This is a python dictionary...

>>> print record.annotations
{'source': 'Paphiopedilum barbatum', 'taxonomy': ...}
>>> print record.annotations.__class__
<type 'dict'>
>>> print record.annotations["source"]
Paphiopedilum barbatum

In this case, most of the values in the dictionary are simple strings, but this isn't always the case - have a look at the references entry for this example - its a list of Reference objects:

>>> print record.annotations["references"].__class__
<type 'list'>
>>> print len(record.annotations["references"])
2
>>> for ref in record.annotations["references"] : print ref.authors
Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W.
Cox,A.V.

Next is features which is another list property, and it contains SeqFeature objects:

>>> print record.features.__class__
<type 'list'>
>>> print len(record.features)
5

SeqFeature objects are complicated enough to warrant their own wiki page... for now please refer to the Tutorial.

If you are using Biopython 1.48 or later, there will be a format method. This lets you convert the SeqRecord into a string using one of the output formats supported by Bio.SeqIO, for example:

>>> print record.format("fasta")

This should give:

>Z78439.1 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC
ACCCATGGGCATTTGCTGTTGAAGTGACCTAGATTTGCCATCGAGCCTCCTTGGGAGCTT
TCTTGTTGGCGAGATCTAAACCCCTGCCCGGCGGAGTTGGGCGCCAAGTCATATGACACA
TAATTGGTGAAGGGGGTGGTAATCCTGCCCTGACCCTCCCCAAATTATTTTTTTAACAAC
TCTCAGCAACGGATATCTCGGCTCTTGCATCGATGAAGAACGCAGCGAAATGCGATAATG
GTGTGAATTGCAGAATCCCGTGAACATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCA
TCAGGCCAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTG
TCCATACATACTGTTCAGCCGGTGCGGATGTGAGTTTGGCCCCTTGTTCTTTGGTACGGG
GGGTCTAAGAGCTGCATGGGCTTTGGATGGTCCTAAATACGGAAAGAGGTGGACGAACTA
TGCTACAACAAAATTGTTGTGCAAATGCCCCGGTTGGCCGTTTAGTTGGGCC


If you are using Biopython 1.50 or later, there will also be a letter_annotations property. Again this is a dictionary but for per-letter-annotation such as sequence quality scores or secondary structure predictions. This kind of information isn't found in GenBank files, so in this case the dictionary is empty:

>>> print record.letter_annotations
{}

Have a look at FASTQ or QUAL files to see how quality scores are represented. Stockholm (PFAM) alignment files also often include per-letter-annotation.

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox