Module FastaIO

Bio.SearchIO support for Bill Pearson's FASTA tools.

This module adds support for parsing FASTA outputs. FASTA is a suite of
programs that finds regions of local or global similarity between protein
or nucleotide sequences, either by searching databases or identifying
local duplications.

Bio.SearchIO.FastaIO was tested on the following FASTA flavors and versions:

    - flavors: fasta, ssearch, tfastx
    - versions: 35, 36

Other flavors and/or versions may not parse correctly. If you run into such
problems, please file a report on Biopython's bug tracker.

More information on FASTA is available through these links:
  - Website: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml
  - User guide: http://fasta.bioch.virginia.edu/fasta_www2/fasta_guide.pdf


Supported Formats
=================

Bio.SearchIO.FastaIO supports parsing and indexing FASTA outputs produced with
the -m 10 flag. Output formats that mimic other programs (e.g. the BLAST tabular
format, produced with the -m 8 flag) may be parseable, but only with SearchIO's
other parsers (in this case, the 'blast-tab' parser).
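
As a rough illustration of the parsing entry point, the sketch below walks a
-m 10 file through Bio.SearchIO.parse using the 'fasta-m10' format name. The
function name summarize_m10 and the idea of collecting tuples are assumptions
made for this example only; the Biopython import is kept inside the function
so the snippet can be loaded even where Biopython is not installed.

```python
def summarize_m10(path):
    """Return (query ID, hit ID, number of HSPs) tuples from a FASTA -m 10 file.

    `path` is a hypothetical file name supplied by the caller.
    """
    # Imported lazily so this sketch loads without Biopython installed.
    from Bio import SearchIO

    rows = []
    # "fasta-m10" is the SearchIO format name for FASTA -m 10 output.
    for qresult in SearchIO.parse(path, "fasta-m10"):
        for hit in qresult:
            # len(hit) is the number of HSPs grouped under this hit.
            rows.append((qresult.id, hit.id, len(hit)))
    return rows
```

Calling summarize_m10 on a real -m 10 output file should yield one tuple per
hit of every query in that file.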


fasta-m10
=========

Note that in FASTA -m 10 outputs, HSPs from different strands are considered to
be from different hits. They are listed as two separate entries in the hit
table. FastaIO recognizes this and will group HSPs with the same hit ID into a
single Hit object, regardless of strand.
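
The grouping behaviour described above can be pictured with a small
stand-alone sketch. This is not FastaIO's actual implementation; the row
layout (hit ID, strand, score) and the function name group_by_hit_id are
assumptions chosen only to show hit-table entries that share an ID being
merged while keeping first-seen order.

```python
def group_by_hit_id(rows):
    """Group (hit_id, strand, score) rows by hit_id, keeping first-seen order."""
    grouped = {}
    for hit_id, strand, score in rows:
        # Entries with the same ID land in the same list, whatever their strand.
        grouped.setdefault(hit_id, []).append((strand, score))
    return grouped


# Made-up hit table: seqA appears once per strand, as in -m 10 output.
rows = [("seqA", "f", 120), ("seqB", "f", 95), ("seqA", "r", 80)]
grouped = group_by_hit_id(rows)
```

Here the two seqA entries end up under a single key, which mirrors how both
strands are reported through one Hit object.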

FASTA also sometimes outputs extra sequences adjacent to the HSP match. These
extra sequences are discarded by FastaIO. Only regions containing the actual
sequence match are extracted.

The following object attributes are provided:

+-----------------+-------------------------+----------------------------------+
| Object          | Attribute               | Value                            |
+=================+=========================+==================================+
| QueryResult     | description             | query sequence description       |
|                 +-------------------------+----------------------------------+
|                 | id                      | query sequence ID                |
|                 +-------------------------+----------------------------------+
|                 | program                 | FASTA flavor                     |
|                 +-------------------------+----------------------------------+
|                 | seq_len                 | full length of query sequence    |
|                 +-------------------------+----------------------------------+
|                 | target                  | target search database           |
|                 +-------------------------+----------------------------------+
|                 | version                 | FASTA version                    |
+-----------------+-------------------------+----------------------------------+
| Hit             | seq_len                 | full length of the hit sequence  |
+-----------------+-------------------------+----------------------------------+
| HSP             | bitscore                | *_bits line                      |
|                 +-------------------------+----------------------------------+
|                 | evalue                  | *_expect line                    |
|                 +-------------------------+----------------------------------+
|                 | ident_pct               | *_ident line                     |
|                 +-------------------------+----------------------------------+
|                 | init1_score             | *_init1 line                     |
|                 +-------------------------+----------------------------------+
|                 | initn_score             | *_initn line                     |
|                 +-------------------------+----------------------------------+
|                 | opt_score               | *_opt line, *_s-w opt line       |
|                 +-------------------------+----------------------------------+
|                 | pos_pct                 | *_sim line                       |
|                 +-------------------------+----------------------------------+
|                 | sw_score                | *_score line                     |
|                 +-------------------------+----------------------------------+
|                 | z_score                 | *_z-score line                   |
+-----------------+-------------------------+----------------------------------+
| HSPFragment     | aln_annotation          | al_cons block, if present        |
| (also via HSP)  +-------------------------+----------------------------------+
|                 | hit                     | hit sequence                     |
|                 +-------------------------+----------------------------------+
|                 | hit_end                 | hit sequence end coordinate      |
|                 +-------------------------+----------------------------------+
|                 | hit_start               | hit sequence start coordinate    |
|                 +-------------------------+----------------------------------+
|                 | hit_strand              | hit sequence strand              |
|                 +-------------------------+----------------------------------+
|                 | query                   | query sequence                   |
|                 +-------------------------+----------------------------------+
|                 | query_end               | query sequence end coordinate    |
|                 +-------------------------+----------------------------------+
|                 | query_start             | query sequence start coordinate  |
|                 +-------------------------+----------------------------------+
|                 | query_strand            | query sequence strand            |
+-----------------+-------------------------+----------------------------------+
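
One possible way to read the HSP attributes listed in the table is sketched
below. The helper hsp_summary and the HSP_FIELDS tuple are hypothetical names
for this example; since not every FASTA flavor emits every score line, the
sketch falls back to None for attributes that are not set. A stand-in object
with made-up values is used so the snippet runs without an input file.

```python
from types import SimpleNamespace

# A selection of HSP and HSPFragment attribute names from the table above.
HSP_FIELDS = (
    "bitscore", "evalue", "ident_pct", "pos_pct",
    "opt_score", "sw_score", "z_score",
    "query_start", "query_end", "hit_start", "hit_end",
)


def hsp_summary(hsp):
    """Collect the listed attributes into a dict, using None when one is absent."""
    return {name: getattr(hsp, name, None) for name in HSP_FIELDS}


# Stand-in with made-up values; a real HSP would come from a parsed -m 10 file.
fake_hsp = SimpleNamespace(bitscore=30.1, evalue=0.0012, query_start=0, query_end=107)
summary = hsp_summary(fake_hsp)
```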

Classes
  FastaM10Parser
Parser for Bill Pearson's FASTA suite's -m 10 output.
  FastaM10Indexer
Indexer class for Bill Pearson's FASTA suite's -m 10 output.
Functions
 
_set_qresult_hits(qresult, hit_rows=[])
Helper function for appending Hits without alignments into QueryResults.
 
_set_hsp_seqs(hsp, parsed, program)
Helper function for the main parsing code.
 
_get_aln_slice_coords(parsed_hsp)
Helper function for the main parsing code.
Variables
  _RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]...
  _PTR_ID_DESC_SEQLEN = '>>>(.+?)\\s+(.*?) *- (\\d+) (?:aa|nt)\\...
  _RE_ID_DESC_SEQLEN = re.compile(r'>>>(.+?)\s+(.*?) *- (\d+) (?...
  _RE_ID_DESC_SEQLEN_IDX = re.compile(r'>>>(.+?)\s+(.*?) *- (\d+...
  _RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$')
  _RE_START_EXC = re.compile(r'^-*')
  _RE_END_EXC = re.compile(r'-*$')
  _HSP_ATTR_MAP = {'_bits': ('bitscore', <type 'float'>), '_expe...
  _STATE_NONE = 0
  _STATE_QUERY_BLOCK = 1
  _STATE_HIT_BLOCK = 2
  _STATE_CONS_BLOCK = 3
  __package__ = 'Bio.SearchIO'
Function Details

_set_hsp_seqs(hsp, parsed, program)

Helper function for the main parsing code.

Arguments:
hsp -- HSP object whose properties are to be set.
parsed -- Dictionary containing parsed values for HSP attributes.
program -- String of program name.

_get_aln_slice_coords(parsed_hsp)

Helper function for the main parsing code.

To get the actual pairwise alignment sequences, we must first
translate the un-gapped sequence based coordinates into positions
in the gapped sequence (which may have a flanking region shown
using leading - characters).  To date, I have never seen any
trailing flanking region shown in the m10 file, but the
following code should also cope with that.

Note that this code seems to work fine even when the "sq_offset"
entries are present as a result of using the -X command line option.
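
The coordinate translation described above can be sketched roughly as
follows. This is a simplified illustration, not the body of
_get_aln_slice_coords: the function name ungapped_to_gapped_slice is
hypothetical, only the forward-strand case is handled, and it assumes that
display_start is the 1-based coordinate of the first residue shown after any
leading - characters.

```python
def ungapped_to_gapped_slice(seq, display_start, start, stop):
    """Map 1-based ungapped coordinates [start, stop] to a slice of gapped `seq`.

    Assumes forward-strand coordinates and that `display_start` numbers the
    first residue shown after any leading '-' flanking characters.
    """
    lead = len(seq) - len(seq.lstrip("-"))  # width of the leading flank
    body = seq[lead:]
    first = start - display_start  # residues to skip before the match begins
    last = stop - display_start    # residue index of the final matched residue
    residue = -1
    begin = end = None
    for i, ch in enumerate(body):
        if ch == "-":
            continue  # gap columns do not advance the ungapped coordinate
        residue += 1
        if residue == first and begin is None:
            begin = i
        if residue == last:
            end = i + 1
            break
    if begin is None or end is None:
        raise ValueError("coordinates fall outside the displayed sequence")
    return lead + begin, lead + end


# Made-up example: two leading flank characters and one internal gap.
seq = "--ACG-TACGT"
begin, end = ungapped_to_gapped_slice(seq, display_start=1, start=2, stop=5)
aligned = seq[begin:end]
```

Because the loop stops at the last matched residue, any trailing - characters
are ignored without special handling.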


Variables Details

_RE_FLAVS

Value:
re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search')
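
To get a feel for which program names this pattern accepts, the short check
below recompiles the same expression and applies it to a few names. The
helper flavor_of and the choice of test names are assumptions for this
example; how FastaIO itself applies the pattern is not shown here.

```python
import re

# Same expression as the _RE_FLAVS value shown above.
RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search')


def flavor_of(text):
    """Return the first FASTA flavor name found in `text`, or None."""
    match = RE_FLAVS.search(text)
    return match.group(0) if match else None


flavors = {name: flavor_of(name) for name in ("fasta", "tfastx", "ssearch", "ggsearch", "blastp")}
```

The three tested flavors (fasta, ssearch, tfastx) all match, while a
non-FASTA name such as blastp does not.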

_PTR_ID_DESC_SEQLEN

Value:
'>>>(.+?)\\s+(.*?) *- (\\d+) (?:aa|nt)\\s*$'

_RE_ID_DESC_SEQLEN

Value:
re.compile(r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$')

_RE_ID_DESC_SEQLEN_IDX

Value:
re.compile(r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$')
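
The three capture groups of this pattern correspond to an ID, a description
and a sequence length. The sketch below applies the same expression to two
made-up header lines; the lines and the helper name split_header are
illustrative assumptions, not excerpts from real FASTA output.

```python
import re

# Same expression as the _RE_ID_DESC_SEQLEN value shown above.
RE_ID_DESC_SEQLEN = re.compile(r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$')


def split_header(line):
    """Return (id, description, length) from a '>>>' header line, or None."""
    match = RE_ID_DESC_SEQLEN.search(line)
    if match is None:
        return None
    seq_id, desc, seq_len = match.groups()
    return seq_id, desc, int(seq_len)


# Made-up header lines, with and without a description.
with_desc = split_header(">>>query1 example protein - 107 aa")
without_desc = split_header(">>>query2 - 2300 nt")
```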

_HSP_ATTR_MAP

Value:
{'_bits': ('bitscore', <type 'float'>),
 '_expect': ('evalue', <type 'float'>),
 '_ident': ('ident_pct', <type 'float'>),
 '_init1': ('init1_score', <type 'int'>),
 '_initn': ('initn_score', <type 'int'>),
 '_opt': ('opt_score', <type 'int'>),
 '_s-w opt': ('opt_score', <type 'int'>),
 '_score': ('sw_score', <type 'int'>),
...
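
A mapping of this shape lends itself to a simple lookup-and-convert step. The
sketch below pairs the _RE_ATTR expression listed under Variables with only
the entries visible above (the full map is truncated, and its remaining
entries are not reproduced here). Python 3's float and int stand in for the
<type 'float'> and <type 'int'> values, and the function name
convert_attr_lines and the sample lines are assumptions for illustration.

```python
import re

# Same expression as the _RE_ATTR value listed under Variables.
RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$')

# Only the entries visible in the truncated listing above.
ATTR_MAP_SUBSET = {
    '_bits': ('bitscore', float),
    '_expect': ('evalue', float),
    '_ident': ('ident_pct', float),
    '_init1': ('init1_score', int),
    '_initn': ('initn_score', int),
    '_opt': ('opt_score', int),
    '_s-w opt': ('opt_score', int),
    '_score': ('sw_score', int),
}


def convert_attr_lines(lines):
    """Turn '; xx_name: value' lines into {attribute: typed value}."""
    values = {}
    for line in lines:
        match = RE_ATTR.match(line)
        if match is None:
            continue
        key, raw = match.groups()
        if key in ATTR_MAP_SUBSET:
            name, caster = ATTR_MAP_SUBSET[key]
            values[name] = caster(raw)
    return values


# Made-up attribute lines; '_overlap' is skipped because it is not in the subset.
example = ["; sw_s-w opt: 88", "; sw_bits: 30.1", "; sw_expect: 0.0012", "; sw_overlap: 107"]
converted = convert_attr_lines(example)
```

Note that both '_opt' and '_s-w opt' feed the same opt_score attribute, which
matches the table entry for opt_score earlier in this page.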