Converting sequence files

Revision as of 16:40, 12 May 2009

Problem

Many bioinformatics tools take different input file formats, so there is a common need to convert between sequence file formats. One useful option is the command-line tool seqret from EMBOSS (http://emboss.sourceforge.net/apps/cvs/emboss/apps/seqret.html), but here we'll show how to tackle this problem with Bio.SeqIO.

Solution

Suppose you have a GenBank file which you want to turn into a Fasta file. For example, let's consider the file 'cor6_6.gb' (which is included in the Biopython unit tests under the GenBank directory):

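from Bio import SeqIO

# Open both files in text mode; the "with" block closes them automatically.
with open("cor6_6.gb") as input_handle, open("cor6_6.fasta", "w") as output_handle:
    # SeqIO.parse() yields the GenBank records one at a time.
    sequences = SeqIO.parse(input_handle, "genbank")
    # SeqIO.write() returns the number of records written.
    count = SeqIO.write(sequences, output_handle, "fasta")

print("Converted %i records" % count)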

In this example the GenBank file contained six records and started like this:

LOCUS       ATCOR66M      513 bp    mRNA            PLN       02-MAR-1992
DEFINITION  A.thaliana cor6.6 mRNA.
ACCESSION   X55053
VERSION     X55053.1  GI:16229
...

The resulting Fasta file also contains all six records and looks like this:

>X55053.1 A.thaliana cor6.6 mRNA.
AACAAAACACACATCAAAAACGATTTTACAAGAAAAAAATA...
...

Note that a Fasta file can store only the identifier, description and sequence, so the other annotation and features in the GenBank records are not carried over.
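To make the layout of a Fasta record concrete, here is a minimal pure-Python sketch. The helper name format_fasta and its width parameter are hypothetical and chosen for illustration; this is not how Bio.SeqIO writes Fasta files internally. The sequence is only the first 32 letters of the record shown above.

```python
def format_fasta(identifier, description, sequence, width=60):
    """Return one Fasta record as text (hypothetical helper, not Biopython code)."""
    # The header line is ">" plus the identifier, optionally followed by a description.
    header = ">" + identifier
    if description:
        header += " " + description
    # The sequence follows on its own lines, wrapped to a fixed width.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Illustrative input: identifier and description from the example above,
# with a 32-letter prefix of its sequence and a short line width.
record = format_fasta(
    "X55053.1",
    "A.thaliana cor6.6 mRNA.",
    "AACAAAACACACATCAAAAACGATTTTACAAG",
    width=16,
)
print(record, end="")
```

Nothing else has a place in this layout, which is why the extra GenBank fields cannot survive the conversion.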

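By changing the format strings, the same code can convert between other supported file formats, provided the output format does not need information the input lacks (for example, a Fasta file has no quality scores, so it cannot be turned into FASTQ this way).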
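One way to reuse the idea for other formats is to pick the format strings from the file names. The sketch below assumes a small extension table; EXTENSION_TO_FORMAT, guess_format and convert_file are hypothetical names introduced here, and the extension choices are only a convention. Bio.SeqIO.convert is a Biopython function (added in Biopython 1.52) that combines the parse and write steps and returns the record count.

```python
import os

# Hypothetical mapping from file extension to a Bio.SeqIO format string.
EXTENSION_TO_FORMAT = {
    ".gb": "genbank",
    ".gbk": "genbank",
    ".fasta": "fasta",
    ".fa": "fasta",
    ".embl": "embl",
    ".fastq": "fastq",
}


def guess_format(filename):
    """Return the Bio.SeqIO format string for a file name, based on its extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return EXTENSION_TO_FORMAT[ext]
    except KeyError:
        raise ValueError("No format known for extension %r" % ext)


def convert_file(in_path, out_path):
    """Convert in_path to out_path, choosing both formats from the extensions."""
    # Imported here so guess_format() can be used without Biopython installed.
    from Bio import SeqIO

    # SeqIO.convert() returns the number of records converted.
    return SeqIO.convert(in_path, guess_format(in_path), out_path, guess_format(out_path))
```

With this in place, convert_file("cor6_6.gb", "cor6_6.fasta") would perform the same conversion as the example above and return the number of records.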

How it works

See the Bio.SeqIO page.
