Matching the names in BioPerl, Biopython has a SeqIO
module for sequence file input/output, and AlignIO
for multiple sequence alignment input/output. The third member of the
BioPerl trio is SearchIO, and a Biopython equivalent was written during
summer 2012 by Google Summer of Code
student Wibowo Arindrarto.
This covers pairwise sequence search file input/output, for example from BLAST, HMMER, BLAT, or Bill Pearson’s FASTA suite. See the BioPerl SearchIO HOWTO for background.
This wiki describes the important bits with some small examples. For a full reference, consult the API documentation.
The table below lists all formats supported by
Bio.SearchIO. Note that
for writing support, the writer assumes that all the necessary
attributes of the objects being written are present. It is not possible,
for example, to write BLAST XML data to a HMMER 3.0 plain text output file.
|Format name||Reading||Writing||Indexing||Notes|
|blast-tab||1.61||1.61||1.61||BLAST+ tabular output (both the commented and uncommented variants).|
|blast-text||1.61||n/a||n/a||BLAST+ plain text output (up to version 2.2.26+). Newer versions may not always work.|
|blast-xml||1.61||1.61||1.61||BLAST+ XML output.|
|blat-psl||1.61||1.61||1.61||BLAT default output (PSL format). Variants with or without header are both supported. PSLX (PSL + sequences) is also supported.|
|exonerate-text||1.61||n/a||1.61||Exonerate plain text output. Due to the way Biopython stores its sequences, at the moment support is limited to text outputs without split codons (for protein queries). If you are parsing a text output of protein queries containing split codon alignments (for example, from the
|exonerate-cigar||1.61||n/a||1.61||Exonerate cigar string.|
|exonerate-vulgar||1.61||n/a||1.61||Exonerate vulgar string.|
|fasta-m10||1.61||n/a||1.61||Bill Pearson’s FASTA -m 10 output.|
|hmmer3-domtab||1.61||1.61||1.61||HMMER3.0 domain table output format. The name
|hmmer3-tab||1.61||1.61||1.61||HMMER 3.0 table output format.|
|hmmer3-text||1.61||n/a||1.61||HMMER 3.0 plain text output format.|
|hmmer2-text||1.61||n/a||1.61||HMMER 2.x plain text output format.|
Although mostly similar to Biopython’s SeqIO and
AlignIO modules, there
is a small difference in the main
Bio.SearchIO functions. Depending on
the file format being used, you may pass additional keyword arguments
that determine how the parser / indexer / writer behaves. The table below
lists the formats that accept extra keyword arguments.
|Format name||Argument name||Default value||Applicable for||Explanation|
|blast-tab||comments||False||Reading, writing, indexing||Boolean, whether the input/output file is the commented variant or not.|
|blast-tab||fields||Default BLAST tabular output field names||Reading, writing, indexing||Space-separated string, list of fields / columns in the input/output file.|
|blast-xml||encoding||"utf-8"||Writing||XML encoding name.|
|blast-xml||indent||" " (a single space)||Writing||Character(s) to use for indenting the XML.|
|blast-xml||increment||2||Writing||How many times the character defined in indent is repeated for each indentation level.|
|blat-psl||pslx||False||Reading, writing, indexing||Boolean, whether the input/output file contains sequences or not.|
|blat-psl||header||False||Writing||Boolean, whether to write PSL header or not.|
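As a rough illustration of how these keyword arguments are passed, the sketch below wraps two calls in small helper functions. The helper names and any file paths you pass in are hypothetical; only `SearchIO.parse`, `SearchIO.write`, and the `comments` and `header` arguments come from the table above. Biopython is imported inside each function so that the snippet can be loaded even where Biopython is not installed.

```python
def parse_blast_tab(path, commented=False):
    """Return all query results from a BLAST+ tabular file (hypothetical helper)."""
    # Imported lazily; requires Biopython 1.61 or later when actually called.
    from Bio import SearchIO

    # 'comments' tells the parser whether the file is the commented variant.
    return list(SearchIO.parse(path, "blast-tab", comments=commented))


def rewrite_psl_with_header(in_path, out_path):
    """Copy a PSL file, adding the PSL header on output (hypothetical helper)."""
    from Bio import SearchIO

    qresults = SearchIO.parse(in_path, "blat-psl")
    # 'header' is a writer-only argument for the blat-psl format.
    return SearchIO.write(qresults, out_path, "blat-psl", header=True)
```

Arguments that do not apply to a given operation (for example `header` when reading) should simply be left out.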
The main goal of creating
Bio.SearchIO is to have a common, easy-to-use
interface across different search output files. As such, we have also
created some conventions / standards for
Bio.SearchIO that extend beyond
the common object model. These conventions apply to all files parsed by
Bio.SearchIO, regardless of their individual formats.
When storing sequence coordinates (start and end values), Bio.SearchIO
uses the Python-style slice convention: zero-based and half-open
intervals. For example, if in a BLAST XML output file the start and end
coordinates of an HSP are 10 and 28, they would become 9 and 28 in
Bio.SearchIO. The start coordinate becomes 9 because Python indices
start from zero, while the end coordinate remains 28 as Python slices
omit the last item in an interval.
Besides giving you the benefits of standardization, this convention also makes the coordinates usable for slicing sequences. For example, given a full query sequence and the start and end coordinates of an HSP, one can use the coordinates to extract the part of the query sequence that produced the database hit.
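The conversion and slicing described above can be sketched in plain Python. This is only an illustration of the convention, not the parser's actual code: the function name and the query sequence are made up, and the input is assumed to use one-based, closed coordinates as in the BLAST XML example.

```python
def to_python_coords(start, end):
    """Convert one-based, closed coordinates to zero-based, half-open ones."""
    # Only the start shifts; the end stays the same because a Python slice
    # already excludes its end index.
    return start - 1, end


# Made-up query sequence, used only to demonstrate slicing.
query = "ACGTACGTAGCTAGCTAGGATCCGATCGATCGTAGC"

start, end = to_python_coords(10, 28)
hsp_region = query[start:end]

print(start, end)       # 9 28
print(len(hsp_region))  # 19, i.e. positions 10 through 28 in the file's numbering
```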
When these objects are written to an output file using
Bio.SearchIO.write, the coordinate values are restored to their
respective format’s convention. Using the example above, if the HSP
were written to an XML file, the start and end coordinates would
become 10 and 28 again.
Some search output formats reverse the order of the start and end coordinates according to the sequence’s strand. For example, in the BLAST plain text format, if the matching strand lies in the minus orientation, then the start coordinate will always be bigger than the end coordinate.
In Bio.SearchIO, start coordinates are always smaller than the end
coordinates, regardless of their originating strand. This ensures
consistency when using the coordinates to slice full sequences.
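A minimal sketch of this ordering rule, together with the reverse step applied on writing, might look as follows. Both function names are hypothetical, and inferring the strand from the coordinate order is a simplifying assumption made for the example; a real parser would take the strand from whatever the file format reports.

```python
def normalize(start, end):
    """Map one-based, closed, possibly reversed coordinates to
    (start, end, strand) with zero-based, half-open coordinates and start < end."""
    # Assumption for this sketch: reversed coordinates mean the minus strand.
    strand = 1 if start <= end else -1
    low, high = min(start, end), max(start, end)
    return low - 1, high, strand


def restore(start, end, strand):
    """Undo normalize(): return coordinates in the file's own convention."""
    low, high = start + 1, end
    # Minus-strand coordinates go back to the reversed order used by the file.
    return (low, high) if strand >= 0 else (high, low)


print(normalize(28, 10))    # (9, 28, -1)
print(restore(9, 28, -1))   # (28, 10)
```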
Note that this coordinate order convention is only enforced at the HSPFragment level. If an HSP object has several HSPFragment objects, each individual fragment will conform to this convention. But the order of the fragments within the HSP object follows what the search output file uses.
Similar to the coordinate style convention, the order of the start and end
coordinates is restored to that of their respective formats when the
objects are written using Bio.SearchIO.write.
Bio.SearchIO only allows -1, 0, 1 and
None as strand values. For frames,
the only allowed values are integers from -3 to 3 (inclusive) and
None. Both of these are standard Biopython conventions.
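The allowed values can be expressed as two small checks. These helpers are hypothetical and only restate the rule above; they are not part of Bio.SearchIO.

```python
ALLOWED_STRANDS = (-1, 0, 1, None)


def is_valid_strand(value):
    """True if value is one of -1, 0, 1 or None."""
    return value in ALLOWED_STRANDS


def is_valid_frame(value):
    """True if value is None or an integer from -3 to 3 (inclusive)."""
    return value is None or (isinstance(value, int) and -3 <= value <= 3)
```

For example, `is_valid_strand(2)` and `is_valid_frame(4)` are both false, while `None` passes both checks.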
Bio.SearchIO differ from
Both modules are based on completely different object models and are
not compatible with each other. Not only that, the underlying
parsers and writers are also different (indexing is not possible
Bio.SearchIO is planned to be the
Bio.SearchIO differ from
Again, they provide different object models. However,
currently uses the parser from
but that old module will be deprecated.