Module FastaIO

Bio.SearchIO support for Bill Pearson's FASTA tools.

This module adds support for parsing FASTA outputs. FASTA is a suite of programs that find regions of local or global similarity between protein or nucleotide sequences, either by searching databases or by identifying local duplications.

Bio.SearchIO.FastaIO was tested on the following FASTA flavors and versions:

Other flavors or versions may expose bugs. If you encounter such problems, please file a bug report on Biopython's bug tracker.

More information on FASTA is available through these links:

Supported Formats

Bio.SearchIO.FastaIO supports parsing and indexing FASTA outputs produced with the -m 10 flag. Output formats that mimic other programs (e.g. the BLAST tabular format produced with the -m 8 flag) may be parseable, but only with SearchIO's other parsers (in this case, the 'blast-tab' parser).

fasta-m10

Note that in FASTA -m 10 outputs, HSPs from different strands are considered to be from different hits. They are listed as two separate entries in the hit table. FastaIO recognizes this and will group HSPs with the same hit ID into a single Hit object, regardless of strand.
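
The grouping described above can be pictured with a minimal sketch. It is not FastaIO's own code: the function name and the (hit ID, strand, score) row layout are made up for illustration.

```python
def group_rows_by_hit_id(rows):
    """Collect hit-table rows under their hit ID, ignoring strand.

    `rows` is an iterable of (hit_id, strand, score) tuples; this layout
    is a hypothetical stand-in for the rows of a FASTA -m 10 hit table.
    """
    grouped = {}
    for hit_id, strand, score in rows:
        # Rows sharing a hit ID end up in one list, whatever their strand.
        grouped.setdefault(hit_id, []).append((strand, score))
    return grouped


# Two rows for "seqA" (forward and reverse strand) collapse into one entry.
rows = [("seqA", "f", 120), ("seqA", "r", 95), ("seqB", "f", 60)]
grouped = group_rows_by_hit_id(rows)
```

Here `grouped` holds two keys, with both strands of "seqA" stored together, which mirrors how a single Hit object ends up holding HSPs from both strands.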

FASTA also sometimes outputs extra sequences adjacent to the HSP match. These extra sequences are discarded by FastaIO. Only regions containing the actual sequence match are extracted.

The following object attributes are provided:

Object                       Attribute        Value
---------------------------  ---------------  -------------------------------
QueryResult                  description      query sequence description
                             id               query sequence ID
                             program          FASTA flavor
                             seq_len          full length of query sequence
                             target           target search database
                             version          FASTA version
Hit                          seq_len          full length of the hit sequence
HSP                          bitscore         *_bits line
                             evalue           *_expect line
                             ident_pct        *_ident line
                             init1_score      *_init1 line
                             initn_score      *_initn line
                             opt_score        *_opt line, *_s-w opt line
                             pos_pct          *_sim line
                             sw_score         *_score line
                             z_score          *_z-score line
HSPFragment (also via HSP)   aln_annotation   al_cons block, if present
                             hit              hit sequence
                             hit_end          hit sequence end coordinate
                             hit_start        hit sequence start coordinate
                             hit_strand       hit sequence strand
                             query            query sequence
                             query_end        query sequence end coordinate
                             query_start      query sequence start coordinate
                             query_strand     query sequence strand
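
As a rough usage sketch, the attributes listed above can be read by iterating over the objects returned by Bio.SearchIO.parse with the 'fasta-m10' format name. The helper name summarize_fasta_m10 is an assumption made for this example, and the attributes are fetched with getattr because a given output file may not contain every line.

```python
def summarize_fasta_m10(path):
    """Return (query ID, hit ID, e-value, bit score) tuples for every HSP."""
    # Imported here so that defining this helper does not itself
    # require Biopython to be installed.
    from Bio import SearchIO

    rows = []
    for qresult in SearchIO.parse(path, "fasta-m10"):
        for hit in qresult:        # a QueryResult iterates over its Hits
            for hsp in hit:        # a Hit iterates over its HSPs
                rows.append((
                    qresult.id,
                    hit.id,
                    getattr(hsp, "evalue", None),
                    getattr(hsp, "bitscore", None),
                ))
    return rows
```

Calling `summarize_fasta_m10("results.m10")`, where results.m10 is a hypothetical file written by a FASTA program run with -m 10, would return one tuple per HSP.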
Classes
  FastaM10Parser
Parser for Bill Pearson's FASTA suite's -m 10 output.
  FastaM10Indexer
Indexer class for Bill Pearson's FASTA suite's -m 10 output.
Functions
_set_qresult_hits(qresult, hit_rows=[])
Helper function for appending Hits without alignments into QueryResults.
_set_hsp_seqs(hsp, parsed, program)
Helper function for the main parsing code.
_get_aln_slice_coords(parsed_hsp)
Helper function for the main parsing code.
Variables
  _RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]...
  _PTR_ID_DESC_SEQLEN = '>>>(.+?)\\s+(.*?) *- (\\d+) (?:aa|nt)\\...
  _RE_ID_DESC_SEQLEN = re.compile(r'>>>(.+?)\s+(.*?) *- (\d+) (?...
  _RE_ID_DESC_SEQLEN_IDX = re.compile(r'>>>(.+?)\s+(.*?) *- (\d+...
  _RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$')
  _RE_START_EXC = re.compile(r'^-*')
  _RE_END_EXC = re.compile(r'-*$')
  _HSP_ATTR_MAP = {'_bits': ('bitscore', <type 'float'>), '_expe...
  _STATE_NONE = 0
  _STATE_QUERY_BLOCK = 1
  _STATE_HIT_BLOCK = 2
  _STATE_CONS_BLOCK = 3
  __package__ = 'Bio.SearchIO'
Function Details

_set_hsp_seqs(hsp, parsed, program)

Helper function for the main parsing code.
Parameters:
  • hsp (HSP) - HSP whose properties will be set
  • parsed (dictionary {string: object}) - parsed values of the HSP attributes
  • program (string) - program name

_get_aln_slice_coords(parsed_hsp)

Helper function for the main parsing code.

To get the actual pairwise alignment sequences, we must first translate the ungapped, sequence-based coordinates into positions in the gapped sequence (which may have a flanking region shown using leading - characters). To date, I have never seen a trailing flanking region in an m10 file, but the following code should also cope with that.

Note that this code seems to work even when "sq_offset" entries are present as a result of using the -X command line option.
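
The coordinate translation described above can be sketched as follows. This is a simplified illustration rather than the function's actual source: the names ungapped_to_gapped and alignment_slice are invented, only the forward strand is handled, and the sketch assumes that the first residue left after stripping the flanking - characters sits at the display start coordinate.

```python
def ungapped_to_gapped(seq, offset):
    """Return the index in gapped `seq` of its `offset`-th (0-based) residue."""
    seen = -1
    for index, char in enumerate(seq):
        if char != "-":
            seen += 1
            if seen == offset:
                return index
    raise ValueError("offset lies outside the residues of the sequence")


def alignment_slice(seq, display_start, al_start, al_stop):
    """Cut the aligned region out of a displayed, gapped sequence.

    Coordinates are 1-based and ungapped, with al_start <= al_stop
    (forward strand only in this sketch).
    """
    # Drop the leading/trailing '-' padding that marks flanking regions.
    core = seq.strip("-")
    # Translate ungapped offsets into indices in the gapped string.
    start = ungapped_to_gapped(core, al_start - display_start)
    stop = ungapped_to_gapped(core, al_stop - display_start) + 1
    return core[start:stop]


# Residues 3 to 9 of a sequence displayed from residue 1, with one internal gap.
aligned = alignment_slice("--ACGT-ACGTAA", display_start=1, al_start=3, al_stop=9)
```

The internal gap stays inside the returned slice, so `aligned` is one character longer than the seven residues it covers.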


Variables Details

_RE_FLAVS

Value:
re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search')
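
To see which program names this pattern accepts, the expression can be tried on a few names. The list of names below is chosen for illustration, and how the module itself applies the pattern is not shown here.

```python
import re

# Same expression as the _RE_FLAVS value shown above.
RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search')

names = ["fasta", "tfastx", "ssearch", "ggsearch", "lalign", "blastn"]
# Keep only the names in which the pattern finds a FASTA flavor.
recognized = [name for name in names if RE_FLAVS.search(name)]
```

Every name except "blastn" is kept.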

_PTR_ID_DESC_SEQLEN

Value:
'>>>(.+?)\\s+(.*?) *- (\\d+) (?:aa|nt)\\s*$'

_RE_ID_DESC_SEQLEN

Value:
re.compile(r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$')

_RE_ID_DESC_SEQLEN_IDX

Value:
re.compile(r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$')
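
The three groups of this pattern capture an ID, a description, and a sequence length. A small sketch, using a made-up header line rather than real FASTA output:

```python
import re

# Same expression as the _RE_ID_DESC_SEQLEN value shown above.
RE_ID_DESC_SEQLEN = re.compile(r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$')

# Hypothetical header line, written only to exercise the pattern.
line = ">>>query1 example protein - 107 aa"
match = RE_ID_DESC_SEQLEN.search(line)

query_id = match.group(1)       # text up to the first whitespace
description = match.group(2)    # text before the " - <length>" suffix
seq_len = int(match.group(3))   # length in residues (aa) or bases (nt)
```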

_HSP_ATTR_MAP

Value:
{'_bits': ('bitscore', <type 'float'>),
 '_expect': ('evalue', <type 'float'>),
 '_ident': ('ident_pct', <type 'float'>),
 '_init1': ('init1_score', <type 'int'>),
 '_initn': ('initn_score', <type 'int'>),
 '_opt': ('opt_score', <type 'int'>),
 '_s-w opt': ('opt_score', <type 'int'>),
 '_score': ('sw_score', <type 'int'>),
...
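
As an illustration of how _RE_ATTR and _HSP_ATTR_MAP might work together, the sketch below converts attribute lines into typed values. It is an assumption about usage, not the module's source: the map holds only some of the entries visible above (with Python 3 types), the helper name parse_hsp_attrs is invented, and the sample lines are made up.

```python
import re

# Same expression as the _RE_ATTR value listed under Variables.
RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$')

# Subset of the visible _HSP_ATTR_MAP entries; the elided part is not guessed.
HSP_ATTR_MAP = {
    "_bits": ("bitscore", float),
    "_expect": ("evalue", float),
    "_ident": ("ident_pct", float),
    "_s-w opt": ("opt_score", int),
    "_score": ("sw_score", int),
}


def parse_hsp_attrs(lines):
    """Map '; xx_key: value' lines to {attribute name: converted value}."""
    attrs = {}
    for line in lines:
        match = RE_ATTR.match(line)
        if match is None:
            continue
        key, raw = match.groups()
        if key in HSP_ATTR_MAP:            # keys outside the map are skipped
            name, caster = HSP_ATTR_MAP[key]
            attrs[name] = caster(raw)
    return attrs


sample = ["; fa_bits: 45.2", "; fa_expect: 1.2e-05", "; sw_s-w opt: 120", "; al_start: 1"]
attrs = parse_hsp_attrs(sample)
```

The program prefix (fa, sw) is dropped by the pattern, so the lookup depends only on the key that follows it.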