Biopython Tutorial and CookbookJeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck, |
The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology. The web site http://www.biopython.org provides an online resource for modules, scripts, and web links for developers of Python-based software for life science research.
Basically, we just like to program in python and want to make it as easy as possible to use python for bioinformatics by creating high-quality, reusable modules and scripts.
The main Biopython releases have lots of functionality, including:
We hope this gives you plenty of reasons to download and start using Biopython!
All of the installation information for Biopython was separated from this document to make it easier to keep updated.
The short version is go to our downloads page (http://biopython.org/wiki/Download), download and install the listed dependencies, then download and install Biopython. For Windows we provide pre-compiled click-and-run installers, while for Unix and other operating systems you must install from source as described in the included README file. This is usually as simple as the standard command:
sudo python setup.py install
The longer version of our installation instructions covers installation of python, Biopython dependencies and Biopython itself. It is available in PDF (http://biopython.org/DIST/docs/install/Installation.pdf) and HTML formats (http://biopython.org/DIST/docs/install/Installation.html).
Seq object missing the (back) transcription & translation methods described in this Tutorial?Bio.Seq module functions described in Section 3.11.Bio.SeqIO work? It imports fine but there is no parse function etc.Bio.SeqIO name which has since been deprecated - and this is why the import “works”.Bio.SeqIO.read() work? The module imports fine but there is no read function!Bio.AlignIO present? The module import fails!Bio.SeqIO and Bio.AlignIO read and write?Bio.SeqIO and Bio.AlignIO input functions let me provide a sequence alphabet?str(...) give me the full sequence of a Seq object?str(my_seq), use my_seq.tostring() (which will also work on recent versions of Biopython).Bio.Blast work with the latest plain text NCBI blast output?Bio.Entrez.read() work? The module imports fine but there is no read function!Bio.PDB.MMCIFParser work? I see an import error about MMCIFlexBio.PDB.mmCIF.MMCIFlex module has
not been installed by default. It requires a third party tool called flex
(fast lexical analyzer generator). At the time of writing, you’ll have install
flex, then tweak your Biopython setup.py file and reinstall from source.__init__.py files. If you are not used to looking for code in this file this can be confusing. The reason we do this is to make the imports easier for users. For instance, instead of having to do a “repetitive” import like from Bio.GenBank import GenBank, you can just use from Bio import GenBank.This section is designed to get you started quickly with Biopython, and to give a general overview of what is available and how to use it. All of the examples in this section assume that you have some general working knowledge of python, and that you have successfully installed Biopython on your system. If you think you need to brush up on your python, the main python web site provides quite a bit of free documentation to get started with (http://www.python.org/doc/).
Since much biological work on the computer involves connecting with databases on the internet, some of the examples will also require a working internet connection in order to run.
Now that that is all out of the way, let’s get into what we can do with Biopython.
As mentioned in the introduction, Biopython is a set of libraries to provide the ability to deal with “things” of interest to biologists working on the computer. In general this means that you will need to have at least some programming experience (in python, of course!) or at least an interest in learning to program. Biopython’s job is to make your job easier as a programmer by supplying reusable libraries so that you can focus on answering your specific question of interest, instead of focusing on the internals of parsing a particular file format (of course, if you want to help by writing a parser that doesn’t exist and contributing it to Biopython, please go ahead!). So Biopython’s job is to make you happy!
One thing to note about Biopython is that it often provides multiple ways of “doing the same thing.” Things have improved in recent releases, but this can still be frustrating as in Python there should ideally be one right way to do something. However, this can also be a real benefit because it gives you lots of flexibility and control over the libraries. The tutorial helps to show you the common or easy ways to do things so that you can just make things work. To learn more about the alternative possibilities, look in the Cookbook (Chapter 12, this has some cools tricks and tips), the Advanced section (Chapter 13), the built in “docstrings” (via the python help command, or the API documentation) or ultimately the code itself.
Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we’ll start with a quick introduction to the Biopython mechanisms for dealing with sequences, the Seq object, which we’ll discuss in more detail in Chapter 3.
Most of the time when we think about sequences we have in my mind a string of letters like 'AGTACACTGGT'. You can create such Seq object with this sequence as follows - the “>>>” represents the python prompt followed by what you would type in:
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print my_seq
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()
What we have here is a sequence object with a generic alphabet - reflecting the fact we have not specified if this is a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and Threonines!). We’ll talk more about alphabets in Chapter 3.
In addition to having an alphabet, the Seq object differs from the python string in the methods it supports. You can’t do this with a plain string:
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
The next most important class is the SeqRecord or Sequence Record. This holds a sequence (as a Seq object) with additional annotation including an identifier, name and description. The Bio.SeqIO module for reading and writing sequence file formats works with SeqRecord objects, which will be introduced below and covered in more detail by Chapter 4.
This covers the basic features and uses of the Biopython sequence class. Now that you’ve got some idea of what it is like to interact with the Biopython libraries, it’s time to delve into the fun, fun world of dealing with biological file formats!
Before we jump right into parsers and everything else to do with Biopython, let’s set up an example to motivate everything we do and make life more interesting. After all, if there wasn’t any biology in this tutorial, why would you want you read it?
Since I love plants, I think we’re just going to have to have a plant based example (sorry to all the fans of other organisms out there!). Having just completed a recent trip to our local greenhouse, we’ve suddenly developed an incredible obsession with Lady Slipper Orchids (if you wonder why, have a look at some Lady Slipper Orchids photos on Flickr, or try a Google Image Search).
Of course, orchids are not only beautiful to look at, they are also extremely interesting for people studying evolution and systematics. So let’s suppose we’re thinking about writing a funding proposal to do a molecular study of Lady Slipper evolution, and would like to see what kind of research has already been done and how we can add to that.
After a little bit of reading up we discover that the Lady Slipper Orchids are in the Orchidaceae family and the Cypripedioideae sub-family and are made up of 5 genera: Cypripedium, Paphiopedilum, Phragmipedium, Selenipedium and Mexipedium.
That gives us enough to get started delving for more information. So, let’s look at how the Biopython tools can help us. We’ll start with sequence parsing in Section 2.4, but the orchids will be back later on as well - for example we’ll search PubMed for papers about orchids and extract sequence data from GenBank in Chapter 7, extract data from Swiss-Prot from certain orchid proteins in Chapter 8, and work with ClustalW multiple sequence alignments of orchid proteins in Section 12.2.1.
A large part of much bioinformatics work involves dealing with the many types of file formats designed to hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing these files into a format so that you can manipulate them with some kind of programming language. However the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that formats may contain small subtleties which can break even the most well designed parsers.
We are now going to briefly introduce the Bio.SeqIO module – you can find out more in Chapter 4. We’ll start with an online search for our friends, the lady slipper orchids. To keep this introduction simple, we’re just using the NCBI website by hand. Let’s just take a look through the nucleotide databases at NCBI, using an Entrez online search (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for everything mentioning the text Cypripedioideae (this is the subfamily of lady slipper orchids).
When this tutorial was originally written, this search gave us only 94 hits, which we saved as a FASTA formatted text file and as a GenBank formatted text file (files ls_orchid.fasta and ls_orchid.gbk, also included with the Biopython source code under docs/tutorial/examples/).
If you run the search today, you’ll get hundreds of results! When following the tutorial, if you want to see the same list of genes, just download the two files above or copy them from docs/examples/ in the Biopython source code. In Section 2.5 we will look at how to do a search like this from within python.
If you open the lady slipper orchids FASTA file ls_orchid.fasta in your favourite text editor, you’ll see that the file starts like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG ...
It contains 94 records, each has a line starting with “>” (greater-than symbol) followed by the sequence on one or more lines. Now try this in python:
from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
handle.close()
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has defaulted to the rather generic SingleLetterAlphabet() rather than something DNA specific.
Now let’s load the GenBank file ls_orchid.gbk instead - notice that the code to do this is almost identical to the snippet used above for the FASTA file - the only difference is we change the filename and the format string:
from Bio import SeqIO
handle = open("ls_orchid.gbk")
for seq_record in SeqIO.parse(handle, "genbank") :
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
handle.close()
This should give:
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
This time Bio.SeqIO has been able to choose a sensible alphabet, IUPAC Ambiguous DNA. You’ll also notice that a shorter string has been used as the seq_record.id in this case.
Biopython has a lot of parsers, and each has its own little special niches based on the sequence format it is parsing and all of that. Chapter 4 covers Bio.SeqIO in more detail, while Chapter 5 introduces Bio.AlignIO for sequence alignments.
While the most popular file formats have parsers integrated into Bio.SeqIO and/or Bio.AlignIO, for some of the rarer and unloved file formats there is either no parser at all, or an old parser which has not been linked in yet.
Please also check the wiki pages http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO for the latest information, or ask on the mailing list. The wiki pages should include an up to date list of supported file types, and some additional examples.
The next place to look for information about specific parsers and how to do cool things with them is in the Cookbook (Chapter 12 of this Tutorial). If you don’t find the information you are looking for, please consider helping out your poor overworked documentors and submitting a cookbook entry about it! (once you figure out how to do it, that is!)
One of the very common things that you need to do in bioinformatics is extract information from biological databases. It can be quite tedious to access these databases manually, especially if you have a lot of repetitive work to do. Biopython attempts to save you time and energy by making some on-line databases available from python scripts. Currently, Biopython has code to extract information from the following databases:
Bio.SCOP.search() function.
The code in these modules basically makes it easy to write python code that interact with the CGI scripts on these pages, so that you can get results in an easy to deal with format. In some cases, the results can be tightly integrated with the Biopython parsers to make it even easier to extract information.
Now that you’ve made it this far, you hopefully have a good understanding of the basics of Biopython and are ready to start using it for doing useful work. The best thing to do now is finish reading this tutorial, and then if you want start snooping around in the source code, and looking at the automatically generated documentation.
Once you get a picture of what you want to do, and what libraries in Biopython will do it, you should take a peak at the Cookbook (Chapter 12), which may have example code to do something similar to what you want to do.
If you know what you want to do, but can’t figure out how to do it, please feel free to post questions to the main biopython list (see http://biopython.org/wiki/Mailing_lists). This will not only help us answer your question, it will also allow us to improve the documentation so it can help the next person do what you want to do.
Enjoy the code!
Biological sequences are arguably the central object in Bioinformatics, and in this chapter we’ll introduce the Biopython mechanism for dealing with sequences, the Seq object.
In Chapter 4 on Sequence Input/Output (and Section 13.1), we’ll see that the Seq object is also used in the SeqRecord object, which combines the sequence information with any annotation.
Sequences are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the most common way that sequences are seen in biological file formats.
There are two important differences between Seq objects and standard python strings.
First of all, they have different methods. Although the Seq object supports many of the same methods as a plain string, its translate() method differs by doing biological translation, and there are also additional biologically relevant methods like reverse_complement().
Secondly, the Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence string “mean”, and how they should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that happens to be rich in Alanines, Glycines, Cysteines
and Threonines?
The alphabet object is perhaps the important thing that makes the Seq object more than just a string. The currently available alphabets for Biopython are defined in the Bio.Alphabet module. We’ll use the IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects: DNA, RNA and Proteins.
Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional elements “U” (or “Sec” for selenocysteine) and “O” (or “Pyl” for pyrrolysine), plus the ambiguous symbols “B” (or “Asx” for asparagine or aspartic acid), “Z” (or “Glx” for glutamine or glutamic acid), “J” (or “Xle” for leucine isoleucine) and “X” (or “Xxx” for an unknown amino acid). For DNA you’ve got choices of IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which provides for ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA.
The advantages of having an alphabet class are two fold. First, this gives an idea of the type of information the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type checking.
Now that we know what we are dealing with, let’s look at how to utilize this class to do interesting work. You can create an ambiguous sequence with the default generic alphabet like this:
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()
However, where possible you should specify the alphabet explicitly when creating your sequence objects - in this case an unambiguous DNA alphabet object:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
In many ways, we can deal with Seq objects as if they were normal python strings, for example getting the length, or iterating over the elements:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
for index, letter in enumerate(my_seq) :
print index, letter
print len(letter)
You can access elements of the sequence in the same way as for strings (but remember, python counts from zero!):
>>> print my_seq[0] #first element >>> print my_seq[2] #third element >>> print my_seq[-1] #last element
The Seq object has a .count() method, just like a string:
>>> len(my_seq)
32
>>> my_seq.count("G")
10
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
While you could use the above snippet of code to calculate a GC%, note that the Bio.SeqUtils module has several GC functions already built. For example:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
Note that using the Bio.SeqUtils.GC() function should automatically cope with mixed case sequences and the ambiguous nucleotide S which means G or C.
Also note that just like a normal python string, the Seq object is in some ways “read-only”. If you need to edit your sequence, for example simulating a point mutation, look at the Section 3.10 below which talks about the MutableSeq object.
A more complicated example, let’s get a slice of the sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
Two things are interesting to note. First, this follows the normal conventions for python strings. So the first element of the sequence is 0 (which is normal for computer science, but not so normal for biology). When you do a slice the first item is included (i.e. 4 in this case) and the last is excluded (12 in this case), which is the way things work in python, but of course not necessarily the way everyone in the world would expect. The main goal is to stay consistent with what python does.
The second thing to notice is that the slice is performed on the sequence data string, but the new object produced is another Seq object which retains the alphabet information from the original Seq object.
Also like a python string, you can do slices with a start, stop and stride (the step size, which defaults to one). For example, we can get the first, second and third codon positions of this DNA sequence:
>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())
Another stride trick you might have seen with a python string is the use of a -1 stride to reverse the string. You can do this with a Seq object too:
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
If you really do just need a plain string, for example to write to a file, or insert into a database, then this is very easy to get:
>>> str(my_seq) 'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Since calling str() on a Seq object returns the full sequence as a string,
you often don’t actually have to do this conversion explicitly.
Python does this automatically with a print statement:
>>> print my_seq GATCGATGGGCCTATATAGGATCGAAAATCGC
You can also use the Seq object directly with a %s placeholder when using the python string formatting or interpolation operator (%):
>>> fasta_format_string = ">Name\n%s\n" % my_seq >>> print fasta_format_string >Name GATCGATGGGCCTATATAGGATCGAAAATCGC
This line of code constructs a simple FASTA format record (without worrying about line wrapping). Reading and writing FASTA format sequence files is covered in Chapter 4, in particular Section 4.4.3 describes a neat way to get a FASTA formatted string.
NOTE: If you are using Biopython 1.44 or older, using str(my_seq)
will give just a truncated representation. Instead use my_seq.tostring()
(which is still available in the current Biopython releases for backwards compatibility):
>>> my_seq.tostring() 'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Naturally, you can in principle add any two Seq objects together - just like you can with python strings to concatenate them. However, you can’t add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence:
>>> protein_seq + dna_seq
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/local/lib/python2.4/site-packages/Bio/Seq.py", line 42, in __add__
raise TypeError, ("incompatable alphabets", str(self.alphabet),
TypeError: ('incompatable alphabets', 'IUPACProtein()', 'IUPACUnambiguousDNA()')
If you really wanted to do this, you’d have to first give both sequences generic alphabets:
>>> from Bio.Alphabet import generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('EVRNAKACGT', Alphabet())
Here is an example of adding a generic nucleotide sequence to an unambiguous IUPAC DNA sequence, resulting in an ambiguous nucleotide sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> from Bio.Alphabet import IUPAC
>>> nuc_seq = Seq("GATCGATGC", generic_nucleotide)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> nuc_seq
Seq('GATCGATGC', NucleotideAlphabet())
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> nuc_seq + dna_seq
Seq('GATCGATGCACGT', NucleotideAlphabet())
For nucleotide sequences, you can easily obtain the complement or reverse complement of a Seq object using its built-in methods:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
In all of these operations, the alphabet property is maintained. This is very useful in case you accidentally end up trying to do something weird like take the (reverse)complement of a protein sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> protein_seq.complement()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/local/lib/python2.4/site-packages/Bio/Seq.py", line 108, in complement
raise ValueError, "Proteins do not have complements!"
ValueError: Proteins do not have complements!
Before talking about transcription, I want to try and clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide:
| DNA coding strand (aka Crick strand, strand +1) | ||
| 5’ | ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG | 3’ |
| ||||||||||||||||||||||||||||||||||||||| | ||
| 3’ | TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC | 5’ |
| DNA template strand (aka Watson strand, strand −1) | ||
| | | ||
| Transcription | ||
| ↓ | ||
| 5’ | AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG | 3’ |
| Single stranded messenger RNA | ||
The actual biological transcription process works from the template strand, doing a reverse complement (TCAG → CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T → U.
Now let’s actually get down to doing a transcription in Biopython. First, let’s create Seq objects for the coding and template DNA strands:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())
These should match the figure above - remember by convention nucleotide sequences are normally read from the 5’ to 3’ direction, while in the figure the template strand is shown reversed.
Now let’s transcribe the coding strand into the corresponding mRNA, using the Seq object’s built in transcribe method:
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
As you can see, all this does is switch T → U, and adjust the alphabet.
If you do want to do a true biological transcription starting with the template strand, then this becomes a two-step process:
>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
The Seq object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U → T substitution and associated change of alphabet:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
Note: The Seq object’s transcribe and back_transcribe methods
are new in Biopython 1.49. For older releases you would have to use the Bio.Seq
module’s function instead, see Section 3.11.
Sticking with the same example discussed in the transcription section above,
now let’s translate this mRNA into the corresponding protein sequence - again taking
advantage of one of the Seq object’s biological methods:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also translate directly from the coding strand DNA sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).
The translation tables available in Biopython are based on those from the NCBI (see the next section of this tutorial). By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also specify the table using the NCBI table number which is shorter, and often included in the feature annotation of GenBank files:
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
Now, you may want to translate the nucleotides up to the first in frame stop codon, and then stop (as happens in nature):
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())
Notice that when you use the to_stop argument, the stop codon itself
is not translated - and the stop symbol is not included at the end of your protein
sequence.
Finally, you can even specify the stop symbol if you don’t like the default asterisk:
>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '@'))
Note: The Seq object’s translate method is new in Biopython 1.49.
For older releases you would have to use the Bio.Seq module’s translate
function instead, see Section 3.11.
In the previous sections we talked about the Seq object translation method (and mentioned the equivalent function in the Bio.Seq module – see
Section 3.11).
Internally these use codon table objects derived from the NCBI information at
ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt.
As before, let’s just focus on two choices: the Standard translation table, and the translation table for Vertebrate Mitochondrial DNA.
>>> from Bio.Data import CodonTable >>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"] >>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
Alternatively, these tables are labeled with ID numbers 1 and 2, respectively:
>>> from Bio.Data import CodonTable >>> standard_table = CodonTable.unambiguous_dna_by_id[1] >>> mito_table = CodonTable.unambiguous_dna_by_id[2]
You compare the actual tables visually by printing them:
>>> print standard_table Table 1 Standard, SGC0 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA Stop| A T | TTG L(s)| TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L(s)| CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I | ACT T | AAT N | AGT S | T A | ATC I | ACC T | AAC N | AGC S | C A | ATA I | ACA T | AAA K | AGA R | A A | ATG M(s)| ACG T | AAG K | AGG R | G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V | GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+--
and:
>>> print mito_table Table 2 Vertebrate Mitochondrial, SGC1 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA W | A T | TTG L | TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L | CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I(s)| ACT T | AAT N | AGT S | T A | ATC I(s)| ACC T | AAC N | AGC S | C A | ATA M(s)| ACA T | AAA K | AGA Stop| A A | ATG M(s)| ACG T | AAG K | AGG Stop| G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V(s)| GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+--
You may find these following properties useful – for example if you are trying to do your own gene finding:
>>> mito_table.stop_codons ['TAA', 'TAG', 'AGA', 'AGG'] >>> mito_table.start_codons ['ATT', 'ATC', 'ATA', 'ATG', 'GTG'] >>> mito_table.forward_table["ACG"] 'T'
Just like the normal python string, the Seq object is “read only”, or in python terminology, immutable. Apart from the wanting the Seq object to act like a string, this is also a useful default since in many biological applications you want to ensure you are not changing your sequence data:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
>>> my_seq[5] = "G"
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'Seq' instance has no attribute '__setitem__'
However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it:
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Alternatively, you can create a MutableSeq object directly from a string:
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
Either way will give you a sequence object which can be changed:
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "T"
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCATGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGTACCG', IUPACUnambiguousDNA())
Do note that unlike the Seq object, the MutableSeq object’s methods like reverse_complement() and reverse() act in-situ!
An important technical difference between mutable and immutable objects in python means that you can’t use a MutableSeq object as a dictionary key, but you can use a python string or a Seq object in this way.
Once you have finished editing your a MutableSeq object, it’s easy to get back to a read-only Seq object should you need to:
>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGTACCG', IUPACUnambiguousDNA())
You can also get a string from a MutableSeq object just like from a Seq object (Section 3.4).
To close this chapter, for those you who really don’t want to use the sequence
objects (or who prefer a functional programming style to an object orientated one),
there are module level functions in Bio.Seq will accept plain python strings,
Seq objects or MutableSeq objects:
>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate >>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG" >>> reverse_complement(my_string) 'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC' >>> transcribe(my_string) 'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG' >>> back_transcribe(my_string) 'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG' >>> translate(my_string) 'AVMGRWKGGRAAG*'
You are however, encouraged to work with Seq objects by default.
In this chapter we’ll discuss in more detail the Bio.SeqIO module, which was briefly introduced in Chapter 2. This aims to provide a simple interface for working with assorted sequence file formats in a uniform way.
The “catch” is that you have to work with SeqRecord objects - which contain a Seq object (as described in Chapter 3) plus annotation like an identifier and description. We’ll introduce the basics of SeqRecord object in this chapter, but see Section 13.1 for more details.
The workhorse function Bio.SeqIO.parse() is used to read in sequence data as SeqRecord objects. This function expects two arguments:
As of Biopython 1.49 there is an optional argument alphabet to specify the alphabet to be used. This is useful for file formats like FASTA where otherwise Bio.SeqIO will default to a generic alphabet.
The Bio.SeqIO.parse() function returns an iterator which gives SeqRecord objects. Iterators are typically used in a for loop as shown below.
Sometimes you’ll find yourself dealing with files which contain only a single record. For this situation Biopython 1.45 introduced the function Bio.SeqIO.read() which takes the same arguments. Provided there is one and only one record in the file, this is returned as a SeqRecord object. Otherwise an exception is raised.
In general Bio.SeqIO.parse() is used to read in sequence files as SeqRecord objects, and is typically used with a for loop like this:
from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
handle.close()
The above example is repeated from the introduction in Section 2.4, and will load the orchid DNA sequences in the FASTA format file ls_orchid.fasta. If instead you wanted to load a GenBank format file like ls_orchid.gbk then all you need to do is change the filename and the format string:
from Bio import SeqIO
handle = open("ls_orchid.gbk")
for seq_record in SeqIO.parse(handle, "genbank") :
print seq_record.id
print seq_record.seq
print len(seq_record)
handle.close()
Similarly, if you wanted to read in a file in another file format, then assuming Bio.SeqIO.parse() supports it you would just need to change the format string as appropriate, for example “swiss” for SwissProt files or “embl” for EMBL text files. There is a full listing on the wiki page (http://biopython.org/wiki/SeqIO).
Another very common way to use a python iterator is within a list comprehension (or a generator expression). For example, if all you wanted to extract from the file was a list of the record identifiers we can easily do this with the following list comprehension:
>>> from Bio import SeqIO
>>> handle = open("ls_orchid.gbk")
>>> identifiers = [seq_record.id for seq_record in SeqIO.parse(handle, "genbank")]
>>> handle.close()
>>> identifiers
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', ..., 'Z78439.1']
There are more examples using SeqIO.parse() in a list
comprehension like this in Section 12.1
(e.g. for plotting sequence lengths or GC%).
In the above examples, we have usually used a for loop to iterate over all the records one by one. You can use the for loop with all sorts of Python objects (including lists, tuples and strings) which support the iteration interface.
The object returned by Bio.SeqIO is actually an iterator which returns SeqRecord objects. You get to see each record in turn, but once and only once. The plus point is that an iterator can save you memory when dealing with large files.
Instead of using a for loop, can also use the .next() method of an iterator to step through the entries, like this:
from Bio import SeqIO
handle = open("ls_orchid.fasta")
record_iterator = SeqIO.parse(handle, "fasta")
first_record = record_iterator.next()
print first_record.id
print first_record.description
second_record = record_iterator.next()
print second_record.id
print second_record.description
handle.close()
Note that if you try and use .next() and there are no more results, you’ll either get back the special Python object None or a StopIteration exception.
One special case to consider is when your sequence files have multiple records, but you only want the first one. In this situation the following code is very concise:
from Bio import SeqIO
first_record = SeqIO.parse(open("ls_orchid.gbk"), "genbank").next()
A word of warning here – using the .next() method like this will silently ignore any additional records in the file.
If your files have one and only one record, like some of the online examples later in this chapter, or a GenBank file for a single chromosome, then use the new Bio.SeqIO.read() function instead.
This will check there are no extra unexpected records present.
In the previous section we talked about the fact that Bio.SeqIO.parse() gives you a SeqRecord iterator, and that you get the records one by one. Very often you need to be able to access the records in any order. The Python list data type is perfect for this, and we can turn the record iterator into a list of SeqRecord objects using the built-in Python function list() like so:
from Bio import SeqIO
handle = open("ls_orchid.gbk")
records = list(SeqIO.parse(handle, "genbank"))
handle.close()
print "Found %i records" % len(records)
print "The last record"
last_record = records[-1] #using Python's list tricks
print last_record.id
print repr(last_record.seq)
print len(last_record)
print "The first record"
first_record = records[0] #remember, Python counts from zero
print first_record.id
print repr(first_record.seq)
print len(first_record)
Giving:
Found 94 records
The last record
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
The first record
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
You can of course still use a for loop with a list of SeqRecord objects. Using a list is much more flexible than an iterator (for example, you can determine the number of records from the length of the list), but does need more memory because it will hold all the records in memory at once.
The SeqRecord object and its annotation structures are described more fully in
Section 13.1. For now, as an example of how annotations are stored, we’ll look at the output from parsing the first record in the GenBank file ls_orchid.gbk.
from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
first_record = record_iterator.next()
print first_record
That should give something like this:
ID: Z78533.1
Name: Z78533
Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/sequence_version=1
/source=Cypripedium irapeanum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', ..., 'Cypripedium']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', ..., 'ITS1', 'ITS2']
/references=[...]
/accessions=['Z78533']
/data_file_division=PLN
/date=30-NOV-2006
/organism=Cypripedium irapeanum
/gi=2765658
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
This gives a human readable summary of most of the annotation data for the SeqRecord. For any python object you can get some clues as to how to interact with it using the python built in function dir() which lists all the properties and methods
of an object:
>>> dir(first_record) [..., 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'name', 'seq']
In the above list, we’ve omitted the special cases whose names start with an underscore. You’ve already met the SeqRecord object’s .description, .id and .name properties (all simple strings) and the .seq property (the sequence itself as a Seq object).
The .dbxrefs property is a list of database cross-references (as strings).
We’ll talk about the .format() method later (Section 4.4.3).
That leaves two more properties, .features which is a list of SeqFeature objects (sequence annotation associated with a particular sub-sequence, like genes – see Section 13.1), and .annotations which is a python dictionary used to hold any other annotation (associated with the full sequence, for example a description or the species it is from).
The contents of this annotations dictionary were shown when we printed the record above. You can also print them out directly:
print first_record.annotations
Like any python dictionary, you can easily get a list of the keys:
print first_record.annotations.keys()
or values:
print first_record.annotations.values()
In general, the annotation values are strings, or lists of strings. One special case is any references in the file get stored as reference objects.
Suppose you wanted to extract a list of the species from the ls_orchid.gbk GenBank file. The information we want, Cypripedium irapeanum, is held in the annotations dictionary under ‘source’ and ‘organism’, which we can access like this:
>>> print first_record.annotations["source"] Cypripedium irapeanum
or:
>>> print first_record.annotations["organism"] Cypripedium irapeanum
In general, ‘organism’ is used for the scientific name (in Latin, e.g. Arabidopsis thaliana), while ‘source’ will often be the common name (e.g. thale cress). In this example, as is often the case, the two fields are identical.
Now let’s go through all the records, building up a list of the species each orchid sequence is from:
from Bio import SeqIO
handle = open("ls_orchid.gbk")
all_species = []
for seq_record in SeqIO.parse(handle, "genbank") :
all_species.append(seq_record.annotations["organism"])
handle.close()
print all_species
Another way of writing this code is to use a list comprehension:
from Bio import SeqIO
all_species = [seq_record.annotations["organism"] for seq_record in \
SeqIO.parse(open("ls_orchid.gbk"), "genbank")]
print all_species
In either case, the result is:
['Cypripedium irapeanum', 'Cypripedium californicum', ..., 'Paphiopedilum barbatum']
Great. That was pretty easy because GenBank files are annotated in a standardised way.
Now, let’s suppose you wanted to extract a list of the species from a FASTA file, rather than the GenBank file. The bad news is you will have to write some code to extract the data you want from the record’s description line - if the information is in the file in the first place! Our example FASTA format file ls_orchid.fasta starts like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG ...
You can check by hand, but for every record the species name is in the description line as the second word. This means if we break up each record’s .description at the spaces, then the species is there as field number one (field zero is the record identifier). That means we can do this:
from Bio import SeqIO
handle = open("ls_orchid.fasta")
all_species = []
for seq_record in SeqIO.parse(handle, "fasta") :
all_species.append(seq_record.description.split()[1])
handle.close()
print all_species
This gives:
['C.irapeanum', 'C.californicum', 'C.fasciculatum', 'C.margaritaceum', ..., 'P.barbatum']
The concise alternative using list comprehensions would be:
from Bio import SeqIO
all_species == [seq_record.description.split()[1] for seq_record in \
SeqIO.parse(open("ls_orchid.fasta"), "fasta")]
print all_species
In general, extracting information from the FASTA description line is not very nice. If you can get your sequences in a well annotated file format like GenBank or EMBL, then this sort of annotation information is much easier to deal with.
In the previous section, we looked at parsing sequence data from a file handle. We hinted that handles are not always from files, and in this section we’ll use handles to internet connections to download sequences.
Note that just because you can download sequence data and parse it into
a SeqRecord object in one go doesn’t mean this is a good idea.
In general, you should probably download sequences once and save them to
a file for reuse.
Section 7.6 talks about the Entrez EFetch interface in more detail, but for now let’s just connect to the NCBI and get a few Opuntia (prickly-pear) proteins from GenBank using their GI numbers.
First of all, let’s fetch just one record. Remember, when you expect the handle
to contain one and only one record, use the Bio.SeqIO.read() function:
from Bio import Entrez form Bio import SeqIO handle = Entrez.efetch(db="protein", rettype="genbank", id="6273291") seq_record = SeqIO.read(handle, "genbank") handle.close() print "%s with %i features" % (seq_record.id, len(seq_record.features))
Expected output:
gi|6273291|gb|AF191665.1|AF191665 with 3 features
The NCBI will also let you ask for the file in other formats, for example as a FASTA file. If you don’t care about the annotations and features in a GenBank file, this is a better choice to download because it’s a smaller file:
from Bio import Entrez form Bio import SeqIO handle = Entrez.efetch(db="protein", rettype="fasta", id="6273291") seq_record = SeqIO.read(handle, "fasta") handle.close() print "%s with %i features" % (seq_record.id, len(seq_record.features))
Expected output:
gi|6273291|gb|AF191665.1|AF191665 with 0 features
Now let’s fetch several records. This time the handle contains multiple records,
so we must use the Bio.SeqIO.parse() function:
from Bio import Entrez
form Bio import SeqIO
handle = Entrez.efetch(db="protein", rettype="genbank", \
id="6273291,6273290,6273289")
for seq_record in SeqIO.parse(handle, "genbank") :
print seq_record.id, seq_record.description[:50] + "..."
print "Sequence length %i," % len(seq_record),
print "%i features," % len(seq_record.features),
print "from: %s" % seq_record.annotations["source"]
handle.close()
That should give the following output:
AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c... Sequence length 902, 3 features, from: chloroplast Opuntia marenae AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c... Sequence length 899, 3 features, from: chloroplast Grusonia clavata AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for... Sequence length 899, 3 features, from: chloroplast Opuntia bradtianaa
See Chapter 7 for more about the Bio.Entrez module, and make sure to read about the NCBI guidelines for using Entrez (Section 7.1).
Now let’s use a handle to download a SwissProt file from ExPASy,
something covered in more depth in Chapter 8.
As mentioned above, when you expect the handle to contain one and only one record,
use the Bio.SeqIO.read() function:
from Bio import ExPASy
from Bio import SeqIO
handle = ExPASy.get_sprot_raw("O23729")
seq_record = SeqIO.read(handle, "swiss")
handle.close()
print seq_record.id
print seq_record.name
print seq_record.description
print repr(seq_record.seq)
print "Length %i" % len(seq_record)
print seq_record.annotations["keywords"]
Assuming your network connection is OK, you should get back:
O23729
CHS3_BROFI
Chalcone synthase 3 (EC 2.3.1.74) (Naringenin-chalcone synthase 3).
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE', ProteinAlphabet())
Length 394
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
The next thing that we’ll do with our ubiquitous orchid files is to show how to index them and access them like a database using the Python dictionary data type (like a hash in Perl). This is very useful for large files where you only need to access certain elements of the file, and makes for a nice quick ’n dirty database.
You can use the function Bio.SeqIO.to_dict() to make a SeqRecord dictionary (in memory). By default this will use each record’s identifier (i.e. the .id attribute) as the key. Let’s try this using our GenBank file:
from Bio import SeqIO
handle = open("ls_orchid.gbk")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
handle.close()
Since this variable orchid_dict is an ordinary Python dictionary, we can look at all of the keys we have available:
>>> print orchid_dict.keys() ['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
We can access a single SeqRecord object via the keys and manipulate the object as normal:
>>> seq_record = orchid_dict["Z78475.1"]
>>> print seq_record.description
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print repr(seq_record.seq)
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
So, it is very easy to create an in memory “database” of our GenBank records. Next we’ll try this for the FASTA file instead.
Note that those of you with prior python experience should all be able to construct a dictionary like this “by hand”. However, typical dictionary construction methods will not deal with the case of repeated keys very nicely. Using the Bio.SeqIO.to_dict() will explicitly check for duplicate keys, and raise an exception if any are found.
Using the same code as above, but for the FASTA file instead:
from Bio import SeqIO
handle = open("ls_orchid.fasta")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
print orchid_dict.keys()
This time the keys are:
['gi|2765596|emb|Z78471.1|PDZ78471', 'gi|2765646|emb|Z78521.1|CCZ78521', ... ..., 'gi|2765613|emb|Z78488.1|PTZ78488', 'gi|2765583|emb|Z78458.1|PHZ78458']
You should recognise these strings from when we parsed the FASTA file earlier in Section 2.4.1. Suppose you would rather have something else as the keys - like the accession numbers. This brings us nicely to SeqIO.to_dict()’s optional argument key_function, which lets you define what to use as the dictionary key for your records.
First you must write your own function to return the key you want (as a string) when given a SeqRecord object. In general, the details of function will depend on the sort of input records you are dealing with. But for our orchids, we can just split up the record’s identifier using the “pipe” character (the vertical line) and return the fourth entry (field three):
def get_accession(record) :
""""Given a SeqRecord, return the accession number as a string.
e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
"""
parts = record.id.split("|")
assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
return parts[3]
Then we can give this function to the SeqIO.to_dict() function to use in building the dictionary:
from Bio import SeqIO
handle = open("ls_orchid.fasta")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"), key_function=get_accession)
handle.close()
print orchid_dict.keys()
Finally, as desired, the new dictionary keys:
>>> print orchid_dict.keys() ['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
Not too complicated, I hope!
To give another example of working with dictionaries of SeqRecord objects, we’ll use the SEGUID checksum function. This is a relatively recent checksum, and collisions should be very rare (i.e. two different sequences with the same checksum), an improvement on the CRC64 checksum.
Once again, working with the orchids GenBank file:
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
for record in SeqIO.parse(open("ls_orchid.gbk"), "genbank") :
print record.id, seguid(record.seq)
This should give:
Z78533.1 JUEoWn6DPhgZ9nAyowsgtoD9TTo Z78532.1 MN/s0q9zDoCVEEc+k/IFwCNF2pY ... Z78439.1 H+JfaShya/4yyAj7IbMqgNkxdxQ
Now, recall the Bio.SeqIO.to_dict() function’s key_function argument expects a function which turns a SeqRecord into a string. We can’t use the seguid() function directly because it expects to be given a Seq object (or a string). However, we can use python’s lambda feature to create a “one off” function to give to Bio.SeqIO.to_dict() instead:
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
seguid_dict = SeqIO.to_dict(SeqIO.parse(open("ls_orchid.gbk"), "genbank"),
lambda rec : seguid(rec.seq))
record = seguid_dict["MN/s0q9zDoCVEEc+k/IFwCNF2pY"]
print record.id
print record.description
That should have retrieved the record Z78532.1, the second entry in the file.
We’ve talked about using Bio.SeqIO.parse() for sequence input (reading files), and now we’ll look at Bio.SeqIO.write() which is for sequence output (writing files). This is a function taking three arguments: some SeqRecord objects, a handle to write to, and a sequence format.
Here is an example, where we start by creating a few SeqRecord objects the hard way (by hand, rather than by loading them from a file):
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
+"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
+"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
+"SSAC", generic_protein),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]")
rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" \
+"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein),
id="gi|13919613|gb|AAK33142.1|",
description="chalcone synthase [Fragaria vesca subsp. bracteata]")
rec3 = SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC" \
+"EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP" \
+"KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN" \
+"NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV" \
+"SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW" \
+"IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT" \
+"TGEGLEWGVLFGFGPGLTVETVVLHSVAT", generic_protein),
id="gi|13925890|gb|AAK49457.1|",
description="chalcone synthase [Nicotiana tabacum]")
my_records = [rec1, rec2, rec3]
Now we have a list of SeqRecord objects, we’ll write them to a FASTA format file:
from Bio import SeqIO
handle = open("my_example.faa", "w")
SeqIO.write(my_records, handle, "fasta")
handle.close()
And if you open this file in your favourite text editor it should look like this:
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus] MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM SSAC >gi|13919613|gb|AAK33142.1| chalcone synthase [Fragaria vesca subsp. bracteata] YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ DMVVVEIPKLGKEAAVKAIKEWGQ >gi|13925890|gb|AAK49457.1| chalcone synthase [Nicotiana tabacum] MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT TGEGLEWGVLFGFGPGLTVETVVLHSVAT
Suppose you wanted to know how many records the Bio.SeqIO.write() function wrote to the handle?
If your records were in a list you could just use len(my_records), however you can’t do that when your records come from a generator/iterator. Therefore as of Biopython 1.49, the Bio.SeqIO.write() function returns the number of SeqRecord objects written to the file.
In previous example we used a list of SeqRecord objects as input to the Bio.SeqIO.write() function, but it will also accept a SeqRecord iterator like we get from Bio.SeqIO.parse() – this lets us do file conversion very succinctly. For this example we’ll read in the GenBank format file ls_orchid.gbk and write it out in FASTA format:
from Bio import SeqIO
in_handle = open("ls_orchid.gbk", "r")
out_handle = open("my_example.fasta", "w")
SeqIO.write(SeqIO.parse(in_handle, "genbank"), out_handle, "fasta")
in_handle.close()
out_handle.close()
You can in fact do this in one line, by being lazy about closing the file handles. This is arguably bad style, but it is very concise:
from Bio import SeqIO
SeqIO.write(SeqIO.parse(open("ls_orchid.gbk"), "genbank"), open("my_example.faa", "w"), "fasta")
Suppose you had a file of nucleotide sequences, and you wanted to turn it into a file containing their reverse complement sequences. This time a little bit of work is required to transform the SeqRecord objects we get from our input file into something suitable for saving to our output file.
To start with, we’ll use Bio.SeqIO.parse() to load some nucleotide
sequences from a file, then print out their reverse complements using
the Seq object’s built in .reverse_complement() method (see Section 3.6):
from Bio import SeqIO
in_handle = open("ls_orchid.gbk")
for record in SeqIO.parse(in_handle, "genbank") :
print record.id
print record.seq.reverse_complement()
in_handle.close()
Now, if we want to save these reverse complements to a file, we’ll need to make SeqRecord objects.
For this I think its most elegant to write our own function, where we can decide how to name our
new records:
from Bio.SeqRecord import SeqRecord
def make_rc_record(record) :
"""Returns a new SeqRecord with the reverse complement sequence."""
rc_rec = SeqRecord(seq = record.seq.reverse_complement(), \
id = "rc_" + record.id, \
name = "rc_" + record.name, \
description = "reverse complement")
return rc_rec
We can then use this to turn the input records into reverse complement records ready for output. If you don’t mind about having all the records in memory at once, then the python map() function is a very intuitive way of doing this:
from Bio import SeqIO
in_handle = open("ls_orchid.fasta", "r")
records = map(make_rc_record, SeqIO.parse(in_handle, "fasta"))
in_handle.close()
out_handle = open("rev_comp.fasta", "w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()
This is an excellent place to demonstrate the power of list comprehensions which in their simplest are a long-winded equivalent to using map(), like this:
records = [make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta")]
Now list comprehensions have a nice trick up their sleeves, you can add a conditional statement:
records = [make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta") if len(rec)<700]
That would create an in memory list of reverse complement records where the sequence length was under 700 base pairs. However, if you are using Python 2.4 or later, we can do exactly the same with a generator expression - but with the advantage that this does not create a list of all the records in memory at once:
records = (make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta") if len(rec)<700)
If you like compact code, and don’t mind being lax about closing file handles, we can reduce this to one long line:
from Bio import SeqIO
SeqIO.write((make_rc_record(rec) for rec in \
SeqIO.parse(open("ls_orchid.fasta", "r"), "fasta") if len(rec) < 700), \
open("rev_comp.fasta", "w"), "fasta")
Personally, I think the above snippet of code is a little too compact, and I find the following much easier to read:
from Bio import SeqIO
records = (make_rc_record(rec) for rec in \
SeqIO.parse(open("ls_orchid.fasta", "r"), "fasta") \
if len(rec) < 700)
SeqIO.write(records, open("rev_comp.fasta", "w"), "fasta")
or, for Python 2.3 with a list comprehension,
from Bio import SeqIO
records = [make_rc_record(rec) for rec in \
SeqIO.parse(open("ls_orchid.fasta", "r"), "fasta") \
if len(rec) < 700]
SeqIO.write(records, open("rev_comp.fasta", "w"), "fasta")
Suppose that you don’t really want to write your records to a file or handle – instead you want a string containing the records in a particular file format. The Bio.SeqIO interface is based on file handles, but python has a useful built in module which provides a string based handle.
For an example of how you might use this, let’s load in a bunch of SeqRecord objects from our orchids GenBank file, and create a string containing the records in FASTA format:
from Bio import SeqIO
from StringIO import StringIO
records = SeqIO.parse(open("ls_orchid.gbk", "r"), "genbank")
out_handle = StringIO()
SeqIO.write(records, out_handle, "fasta")
#Now need to "rewind" the StringIO handle back to the start...
out_handle.seek(0)
#...and we can read back what was written to it as a string:
fasta_data = out_handle.read()
print fasta_data
This isn’t entirely straightforward the first time you see it! On the bright side, for the special case where you would like a string containing a single record in a particular file format, Biopython 1.48 added a new format() method to the SeqRecord class:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
+"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
+"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
+"SSAC", generic_protein),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]")
print record.format("fasta")
which should give:
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus] MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM SSAC
This format method takes a single mandatory argument, a lower case string which is supported by Bio.SeqIO as an output format. However, some of the file formats Bio.SeqIO can write to require more than one record (typically the case for multiple sequence alignment formats), and thus won’t work via this format() method.
Note that although we don’t encourage it, you can use the format() method to write to a file, like this:
from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
out_handle = open("ls_orchid.tab", "w")
for record in record_iterator :
out_handle.write(record.format("tab"))
out_handle.close()
While this style of code will work for a simple sequential file format like FASTA or the simple tab separated format used in this example, it will not work for more complex or interlaced file formats. This is why we still recommend using Bio.SeqIO.write(), as in the following example:
from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
out_handle = open("ls_orchid.tab", "w")
SeqIO.write(record_iterator, out_handle, "tab")
out_handle.close()
In this chapter we’ll discuss the Bio.AlignIO module, which is very similar to the Bio.SeqIO module from the previous chapter, but deals with Alignment objects rather than SeqRecord objects.
This is new interface in Biopython 1.46, which aims to provide a simple interface for working with assorted sequence alignment file formats in a uniform way.
Note that both Bio.SeqIO and Bio.AlignIO can read and write sequence alignment files. The appropriate choice will depend largely on what you want to do with the data.
We have two functions for reading in sequence alignments, Bio.AlignIO.read() and Bio.AlignIO.parse() which following the convention introduced in Bio.SeqIO are for files containing one or multiple alignments respectively.
Using Bio.AlignIO.parse() will return an iterator which gives Alignment objects. Iterators are typically used in a for loop. Examples of situations where you will have multiple different alignments include resampled alignments from the PHYLIP tool seqboot, or multiple pairwise alignments from the EMBOSS tools water or needle, or Bill Pearson’s FASTA tools.
However, in many situations you will be dealing with files which contain only a single alignment. In this case, you should use the Bio.AlignIO.read() function which return a single Alignment object.
Both functions expect two mandatory arguments:
Bio.SeqIO we don’t try and guess the file format for you! See http://biopython.org/wiki/AlignIO for a full listing of supported formats.
There is also an optional seq_count argument which is discussed in Section 5.1.3 below for dealing with ambiguous file formats which may contain more than one alignment.
Biopython 1.49 introduced a further optional alphabet argument allowing you to specify the expected alphabet. This can be useful as many alignment file formats do not explicitly label the sequences as RNA, DNA or protein – which means Bio.AlignIO will default to using a generic alphabet.
As an example, consider the following annotation rich protein alignment in the PFAM or Stockholm file format:
# STOCKHOLM 1.0 #=GS COATB_BPIKE/30-81 AC P03620.1 #=GS COATB_BPIKE/30-81 DR PDB; 1ifl ; 1-52; #=GS Q9T0Q8_BPIKE/1-52 AC Q9T0Q8.1 #=GS COATB_BPI22/32-83 AC P15416.1 #=GS COATB_BPM13/24-72 AC P69541.1 #=GS COATB_BPM13/24-72 DR PDB; 2cpb ; 1-49; #=GS COATB_BPM13/24-72 DR PDB; 2cps ; 1-49; #=GS COATB_BPZJ2/1-49 AC P03618.1 #=GS Q9T0Q9_BPFD/1-49 AC Q9T0Q9.1 #=GS Q9T0Q9_BPFD/1-49 DR PDB; 1nh4 A; 1-49; #=GS COATB_BPIF1/22-73 AC P03619.2 #=GS COATB_BPIF1/22-73 DR PDB; 1ifk ; 1-50; COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA #=GR COATB_BPIKE/30-81 SS -HHHHHHHHHHHHHH--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHH---- Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA COATB_BPM13/24-72 AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA #=GR COATB_BPM13/24-72 SS ---S-T...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT-- COATB_BPZJ2/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA Q9T0Q9_BPFD/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA #=GR Q9T0Q9_BPFD/1-49 SS ------...-HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-- COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA #=GR COATB_BPIF1/22-73 SS XX-HHHH--HHHHHH--HHHHHHH--HHHHHHHHHHHHHHHHHHHHHHH--- #=GC SS_cons XHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC-- #=GC seq_cons AEssss...AptAhDSLpspAT-hIu.sWshVsslVsAsluIKLFKKFsSKA //
This is the seed alignment for the Phage_Coat_Gp8 (PF05371) PFAM entry, downloaded as a compressed archive from http://pfam.sanger.ac.uk/family/alignment/download/gzipped?acc=PF05371&alnType=seed. We can load this file as follows (assuming it has been saved to disk as “PF05371_seed.sth” in the current working directory):
from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
print alignment
This code will print out a summary of the alignment:
SingleLetterAlphabet() alignment with 7 rows and 52 columns AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
You’ll notice in the above output the sequences have been truncated. We could instead write our own code to format this as we please by iterating over the rows as SeqRecord objects:
from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
print "Alignment length %i" % alignment.get_alignment_length()
for record in alignment :
print "%s - %s" % (record.seq, record.id)
This will give the following output:
Alignment length 52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73
You could also use the alignment object’s format method to show it in a particular file format – see Section 5.2.2 for details.
Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure? Try this:
for record in alignment :
if record.dbxrefs :
print record.id, record.dbxrefs
giving:
COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;'] COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;'] Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;'] COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']
To have a look at all the sequence annotation, try this:
for record in alignment :
print record
Sanger provide a nice web interface at http://pfam.sanger.ac.uk/family?acc=PF05371 which will actually let you download this alignment in several other formats. This is what the file looks like in the FASTA file format:
>COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA >Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA >COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA >COATB_BPM13/24-72 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA >COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA >Q9T0Q9_BPFD/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA >COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
Note the website should have an option about showing gaps as periods (dots) or dashes, we’ve shown dashes above. Assuming you download and save this as file “PF05371_seed.faa” then you can load it with almost exactly the same code:
from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.faa"), "fasta")
print alignment
All that has changed in this code is the filename and the format string. You’ll get the same output as before, the sequences and record identifiers are the same.
However, as you should expect, if you check each SeqRecord there is no annotation nor database cross-references because these are not included in the FASTA file format.
Note that rather than using the Sanger website, you could have used Bio.AlignIO to convert the original Stockholm format file into a FASTA file yourself (see below).
With any supported file format, you can load an alignment in exactly the same way just by changing the format string. For example, use “phylip” for PHYLIP files, “nexus” for NEXUS files or “emboss” for the alignments output by the EMBOSS tools. There is a full listing on the wiki page (http://biopython.org/wiki/AlignIO).
The previous section focused on reading files containing a single alignment. In general however, files can contain more than one alignment, and to read these files we must use the Bio.AlignIO.parse() function.
Suppose you have a small alignment in PHYLIP format:
5 6 Alpha AACAAC Beta AACCCC Gamma ACCAAC Delta CCACCA Epsilon CCAAAC
If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools, one of the steps would be to create a set of many resampled alignments using the tool bootseq. This would give output something like this, which has been abbreviated for conciseness:
5 6
Alpha AAACCA
Beta AAACCC
Gamma ACCCCA
Delta CCCAAC
Epsilon CCCAAA
5 6
Alpha AAACAA
Beta AAACCC
Gamma ACCCAA
Delta CCCACC
Epsilon CCCAAA
5 6
Alpha AAAAAC
Beta AAACCC
Gamma AACAAC
Delta CCCCCA
Epsilon CCCAAC
...
5 6
Alpha AAAACC
Beta ACCCCC
Gamma AAAACC
Delta CCCCAA
Epsilon CAAACC
If you wanted to read this in using Bio.AlignIO you could use:
from Bio import AlignIO
alignments = AlignIO.parse(open("resampled.phy"), "phylip")
for alignment in alignments :
print alignment
print
This would give the following output, again abbreviated for display:
SingleLetterAlphabet() alignment with 5 rows and 6 columns AAACCA Alpha AAACCC Beta ACCCCA Gamma CCCAAC Delta CCCAAA Epsilon SingleLetterAlphabet() alignment with 5 rows and 6 columns AAACAA Alpha AAACCC Beta ACCCAA Gamma CCCACC Delta CCCAAA Epsilon SingleLetterAlphabet() alignment with 5 rows and 6 columns AAAAAC Alpha AAACCC Beta AACAAC Gamma CCCCCA Delta CCCAAC Epsilon ... SingleLetterAlphabet() alignment with 5 rows and 6 columns AAAACC Alpha ACCCCC Beta AAAACC Gamma CCCCAA Delta CAAACC Epsilon
As with the function Bio.SeqIO.parse(), using Bio.AlignIO.parse() returns an iterator.
If you want to keep all the alignments in memory at once, which will allow you to access them in any order, then turn the iterator into a list:
from Bio import AlignIO
alignments = list(AlignIO.parse(open("resampled.phy"), "phylip"))
last_align = alignments[-1]
first_align = alignments[0]
Many alignment file formats can explicitly store more than one alignment, and the division between each alignment is clear. However, when a general sequence file format has been used there is no such block structure. The most common such situation is when alignments have been saved in the FASTA file format. For example consider the following:
>Alpha ACTACGACTAGCTCAG--G >Beta ACTACCGCTAGCTCAGAAG >Gamma ACTACGGCTAGCACAGAAG >Alpha ACTACGACTAGCTCAGG-- >Beta ACTACCGCTAGCTCAGAAG >Gamma ACTACGGCTAGCACAGAAG
This could be a single alignment containing six sequences (with repeated identifiers). Or, judging from the identifiers, this is probably two different alignments each with three sequences, which happen to all have the same length.
What about this next example?
>Alpha ACTACGACTAGCTCAG--G >Beta ACTACCGCTAGCTCAGAAG >Alpha ACTACGACTAGCTCAGG-- >Gamma ACTACGGCTAGCACAGAAG >Alpha ACTACGACTAGCTCAGG-- >Delta ACTACGGCTAGCACAGAAG
Again, this could be a single alignment with six sequences. However this time based on the identifiers we might guess this is three pairwise alignments which by chance have all got the same lengths.
This final example is similar:
>Alpha ACTACGACTAGCTCAG