Source Code for Package Bio.SearchIO.BlastIO

  1  # Copyright 2012 by Wibowo Arindrarto.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Bio.SearchIO support for BLAST+ output formats. 
  7   
  8  This module adds support for parsing BLAST+ outputs. BLAST+ is a rewrite of 
  9  NCBI's legacy BLAST (Basic Local Alignment Search Tool), based on the NCBI 
 10  C++ toolkit. The BLAST+ suite is available as command line programs or on 
 11  NCBI's web page. 
 12   
 13  Bio.SearchIO.BlastIO was tested on the following BLAST+ flavors and versions: 
 14   
 15      - flavors: blastn, blastp, blastx, tblastn, tblastx 
 16      - versions: 2.2.22+, 2.2.26+ 
 17   
 18  You should also be able to parse outputs from a local BLAST+ search or from 
 19  NCBI's web interface. Although the module was not tested against all BLAST+
 20  versions, it should still be able to parse outputs from the untested ones.
 21  Please submit a bug report if you stumble upon an unparseable file.
 22   
 23  Some output formats from the BLAST legacy suite (BLAST+'s predecessor) may 
 24  still be parsed by this module. However, results are not guaranteed. You may 
 25  try to use the Bio.Blast module to parse them instead. 
 26   
 27  More information about BLAST is available through these links:
 28    - Publication: http://www.biomedcentral.com/1471-2105/10/421 
 29    - Web interface: http://blast.ncbi.nlm.nih.gov/ 
 30    - User guide: http://www.ncbi.nlm.nih.gov/books/NBK1762/ 
 31   
 32   
 33  Supported Formats 
 34  ================= 
 35   
 36  Bio.SearchIO.BlastIO supports the following BLAST+ output formats: 
 37   
 38    - XML        - 'blast-xml'  - parsing, indexing, writing 
 39    - Tabular    - 'blast-tab'  - parsing, indexing, writing 
 40    - Plain text - 'blast-text' - parsing 
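
The capability list above can be restated as a small lookup table. This is
only an illustrative sketch: ``SUPPORTED_OPERATIONS`` and ``supports`` are
hypothetical names and not part of Bio.SearchIO. Only the three format names
and their operations come from the list.

```python
# Hypothetical lookup table mirroring the list above; not a Bio.SearchIO API.
SUPPORTED_OPERATIONS = {
    "blast-xml": {"parse", "index", "write"},
    "blast-tab": {"parse", "index", "write"},
    "blast-text": {"parse"},
}


def supports(format_name, operation):
    # Return True if the listed format supports the given operation.
    return operation in SUPPORTED_OPERATIONS.get(format_name, set())


print(supports("blast-tab", "index"))
print(supports("blast-text", "write"))
```

A check like this could guard a call to a writer before attempting to write
plain text output, which the list marks as parse-only.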
 41   
 42   
 43  blast-xml 
 44  ========= 
 45   
 46  The blast-xml parser follows the BLAST XML DTD written here: 
 47  http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.mod.dtd 
 48   
 49  It provides the following attributes for each SearchIO object: 
 50   
 51  +----------------+-------------------------+-----------------------------+ 
 52  | Object         | Attribute               | XML Element                 | 
 53  +================+=========================+=============================+ 
 54  | QueryResult    | target                  | BlastOutput_db              | 
 55  |                +-------------------------+-----------------------------+ 
 56  |                | program                 | BlastOutput_program         | 
 57  |                +-------------------------+-----------------------------+ 
 58  |                | reference               | BlastOutput_reference       | 
 59  |                +-------------------------+-----------------------------+ 
 60  |                | version                 | BlastOutput_version[1]      | 
 61  |                +-------------------------+-----------------------------+ 
 62  |                | description             | Iteration_query-def         | 
 63  |                +-------------------------+-----------------------------+ 
 64  |                | id                      | Iteration_query-ID          | 
 65  |                +-------------------------+-----------------------------+ 
 66  |                | seq_len                 | Iteration_query-len         | 
 67  |                +-------------------------+-----------------------------+ 
 68  |                | param_evalue_threshold  | Parameters_expect           | 
 69  |                +-------------------------+-----------------------------+ 
 70  |                | param_entrez_query      | Parameters_entrez-query     | 
 71  |                +-------------------------+-----------------------------+ 
 72  |                | param_filter            | Parameters_filter           | 
 73  |                +-------------------------+-----------------------------+ 
 74  |                | param_gap_extend        | Parameters_gap-extend       | 
 75  |                +-------------------------+-----------------------------+ 
 76  |                | param_gap_open          | Parameters_gap-open         | 
 77  |                +-------------------------+-----------------------------+ 
 78  |                | param_include           | Parameters_include          | 
 79  |                +-------------------------+-----------------------------+ 
 80  |                | param_matrix            | Parameters_matrix           | 
 81  |                +-------------------------+-----------------------------+ 
 82  |                | param_pattern           | Parameters_pattern          | 
 83  |                +-------------------------+-----------------------------+ 
 84  |                | param_score_match       | Parameters_sc-match         | 
 85  |                +-------------------------+-----------------------------+ 
 86  |                | param_score_mismatch    | Parameters_sc-mismatch      | 
 87  |                +-------------------------+-----------------------------+ 
 88  |                | stat_db_num             | Statistics_db-num           | 
 89  |                +-------------------------+-----------------------------+ 
 90  |                | stat_db_len             | Statistics_db-len           | 
 91  |                +-------------------------+-----------------------------+ 
 92  |                | stat_eff_space          | Statistics_eff-space        | 
 93  |                +-------------------------+-----------------------------+ 
 94  |                | stat_entropy            | Statistics_entropy          | 
 95  |                +-------------------------+-----------------------------+ 
 96  |                | stat_hsp_len            | Statistics_hsp-len          | 
 97  |                +-------------------------+-----------------------------+ 
 98  |                | stat_kappa              | Statistics_kappa            | 
 99  |                +-------------------------+-----------------------------+ 
100  |                | stat_lambda             | Statistics_lambda           | 
101  +----------------+-------------------------+-----------------------------+ 
102  | Hit            | accession               | Hit_accession               | 
103  |                +-------------------------+-----------------------------+ 
104  |                | description             | Hit_def                     | 
105  |                +-------------------------+-----------------------------+ 
106  |                | id                      | Hit_id                      | 
107  |                +-------------------------+-----------------------------+ 
108  |                | seq_len                 | Hit_len                     | 
109  +----------------+-------------------------+-----------------------------+ 
110  | HSP            | bitscore                | Hsp_bit-score               | 
111  |                +-------------------------+-----------------------------+ 
112  |                | density                 | Hsp_density                 | 
113  |                +-------------------------+-----------------------------+ 
114  |                | evalue                  | Hsp_evalue                  | 
115  |                +-------------------------+-----------------------------+ 
116  |                | gap_num                 | Hsp_gaps                    | 
117  |                +-------------------------+-----------------------------+ 
118  |                | ident_num               | Hsp_identity                | 
119  |                +-------------------------+-----------------------------+ 
120  |                | pos_num                 | Hsp_positive                | 
121  |                +-------------------------+-----------------------------+ 
122  |                | bitscore_raw            | Hsp_score                   | 
123  +----------------+-------------------------+-----------------------------+ 
124  | HSPFragment    | aln_span                | Hsp_align-len               | 
125  | (also via      +-------------------------+-----------------------------+ 
126  | HSP)           | hit_frame               | Hsp_hit-frame               | 
127  |                +-------------------------+-----------------------------+ 
128  |                | hit_start               | Hsp_hit-from                | 
129  |                +-------------------------+-----------------------------+ 
130  |                | hit_end                 | Hsp_hit-to                  | 
131  |                +-------------------------+-----------------------------+ 
132  |                | hit                     | Hsp_hseq                    | 
133  |                +-------------------------+-----------------------------+ 
134  |                | aln_annotation          | Hsp_midline                 | 
135  |                +-------------------------+-----------------------------+ 
136  |                | pattern_start           | Hsp_pattern-from            | 
137  |                +-------------------------+-----------------------------+ 
138  |                | pattern_end             | Hsp_pattern-to              | 
139  |                +-------------------------+-----------------------------+ 
140  |                | query_frame             | Hsp_query-frame             | 
141  |                +-------------------------+-----------------------------+ 
142  |                | query_start             | Hsp_query-from              | 
143  |                +-------------------------+-----------------------------+ 
144  |                | query_end               | Hsp_query-to                | 
145  |                +-------------------------+-----------------------------+ 
146  |                | query                   | Hsp_qseq                    | 
147  +----------------+-------------------------+-----------------------------+ 
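
To make the table concrete, the sketch below reads a few HSP elements from a
made-up, heavily trimmed XML fragment and stores them under the attribute
names listed above. It uses only the standard library and is not the
blast-xml parser itself. ``SAMPLE_HSP``, ``HSP_ELEMENTS`` and
``hsp_attributes`` are hypothetical names, and the type converters are
assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed <Hsp> element for illustration only.
SAMPLE_HSP = (
    "<Hsp>"
    "<Hsp_bit-score>95.3</Hsp_bit-score>"
    "<Hsp_score>105</Hsp_score>"
    "<Hsp_evalue>2e-20</Hsp_evalue>"
    "<Hsp_identity>48</Hsp_identity>"
    "<Hsp_gaps>0</Hsp_gaps>"
    "</Hsp>"
)

# Subset of the HSP rows in the table: XML element -> (attribute, converter).
# The converters are assumptions made for this sketch.
HSP_ELEMENTS = {
    "Hsp_bit-score": ("bitscore", float),
    "Hsp_score": ("bitscore_raw", int),
    "Hsp_evalue": ("evalue", float),
    "Hsp_identity": ("ident_num", int),
    "Hsp_gaps": ("gap_num", int),
}


def hsp_attributes(xml_text):
    # Collect the mapped attributes from one <Hsp> element.
    hsp = ET.fromstring(xml_text)
    attrs = {}
    for child in hsp:
        if child.tag in HSP_ELEMENTS:
            name, convert = HSP_ELEMENTS[child.tag]
            attrs[name] = convert(child.text)
    return attrs


print(hsp_attributes(SAMPLE_HSP))
```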
148   
149  You may notice that in BLAST XML files, BLAST sometimes replaces your true
150  sequence ID with its own generated ID. For example, the query IDs become
151  'Query_1', 'Query_2', and so on, while the hit IDs sometimes become
152  'gnl|BL_ORD_ID|1', 'gnl|BL_ORD_ID|2', and so on. In these cases, BLAST lumps the
153  true sequence IDs together with their descriptions.
154   
155  The blast-xml parser is aware of these modifications and will attempt to extract 
156  the true sequence IDs out of the descriptions. So when accessing QueryResult or 
157  Hit objects, you will use the non-BLAST-generated IDs. 
158   
159  This behavior on the query IDs can be disabled using the 'use_raw_query_ids' 
160  parameter while the behavior on the hit IDs can be disabled using the 
161  'use_raw_hit_ids' parameter. Both are boolean values that can be supplied 
162  to SearchIO.read or SearchIO.parse, with the default values set to 'False'. 
163   
164  In any case, the raw BLAST IDs can always be accessed using the query or hit 
165  object's 'blast_id' attribute. 
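
The ID recovery described in this section can be pictured with a short
sketch. It assumes that, when BLAST has generated the ID, the true ID is the
first whitespace-delimited token of the description. ``restore_true_id`` is a
hypothetical helper and the sample description is invented. The parser's
actual logic may differ.

```python
def restore_true_id(blast_id, description,
                    generated_prefixes=("Query_", "gnl|BL_ORD_ID|")):
    # If BLAST generated the ID, take the first token of the description
    # as the true ID and keep the remainder as the description.
    if blast_id.startswith(generated_prefixes) and description:
        true_id, _, rest = description.partition(" ")
        return true_id, rest
    # Otherwise the ID is already the user's own and is left untouched.
    return blast_id, description


print(restore_true_id("Query_1",
                      "gi|16080617|ref|NP_391444.1| example description"))
print(restore_true_id("gi|11464971:4-101", "example description"))
```

Keeping the BLAST-generated value alongside the restored one is what makes an
attribute such as 'blast_id' useful when the output has to be matched back to
the raw file.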
166   
167  The blast-xml write function also accepts 'use_raw_query_ids' and 
168  'use_raw_hit_ids' parameters. However, note that the default values for the 
169  writer are set to 'True'. This is because the writer is meant to mimic native
170  BLAST results as much as possible.
171   
172   
173  blast-tab 
174  ========= 
175   
176  The default format for blast-tab support is the variant without comments (the
177  '-outfmt 6' option in BLAST+). Commented BLAST tabular files ('-outfmt 7') may
178  be parsed, indexed, or written using the keyword argument 'comments' set to True:
179   
180      # blast-tab defaults to parsing uncommented files 
181      >>> from Bio import SearchIO 
182      >>> uncommented = 'Blast/tab_2226_tblastn_004.txt' 
183      >>> qresult = SearchIO.read(uncommented, 'blast-tab') 
184      >>> qresult 
185      QueryResult(id='gi|11464971:4-101', 5 hits) 
186   
187      # set the keyword argument to parse commented files 
188      >>> commented = 'Blast/tab_2226_tblastn_008.txt' 
189      >>> qresult = SearchIO.read(commented, 'blast-tab', comments=True) 
190      >>> qresult 
191      QueryResult(id='gi|11464971:4-101', 5 hits) 
192   
193  For uncommented files, the parser defaults to using BLAST's default column 
194  ordering: 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send 
195  evalue bitscore'. 
196   
197  If you want to parse an uncommented file with a customized column order, you can
198  use the 'fields' keyword argument to pass the custom column order. The names of
199  the columns follow BLAST's naming. For example, 'qseqid' is the column for the
200  query sequence ID. These names may be passed either as a Python list or as a
201  space-separated string.
202   
203      # pass the custom column names as a Python list 
204      >>> fname = 'Blast/tab_2226_tblastn_009.txt' 
205      >>> custom_fields = ['qseqid', 'sseqid'] 
206      >>> qresult = next(SearchIO.parse(fname, 'blast-tab', fields=custom_fields)) 
207      >>> qresult 
208      QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits) 
209   
210      # pass the custom column names as a space-separated string 
211      >>> fname = 'Blast/tab_2226_tblastn_009.txt' 
212      >>> custom_fields = 'qseqid sseqid' 
213      >>> qresult = next(SearchIO.parse(fname, 'blast-tab', fields=custom_fields)) 
214      >>> qresult 
215      QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits) 
216   
217  You may also use the 'std' field name as an alias to BLAST's default 12 columns, 
218  just like when you run a command line BLAST search. 
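
As a rough illustration of how a 'fields' value could be normalized, the
sketch below accepts either a list or a space-separated string and expands
'std' into the 12 default columns. ``DEFAULT_FIELDS`` and
``normalize_fields`` are hypothetical names. This is not the code used by the
blast-tab parser.

```python
# BLAST's default 12 columns, in the order given above.
DEFAULT_FIELDS = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()


def normalize_fields(fields):
    # Accept a list or a space-separated string; expand 'std' in place.
    if isinstance(fields, str):
        fields = fields.split()
    expanded = []
    for name in fields:
        if name == "std":
            expanded.extend(DEFAULT_FIELDS)
        else:
            expanded.append(name)
    return expanded


print(normalize_fields("qseqid sseqid"))
print(len(normalize_fields(["std", "qlen"])))
```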
219   
220  Note that the 'fields' keyword argument will be ignored if the parsed file is 
221  commented. Commented files have their column ordering stated explicitly in the 
222  file, so there is no need to specify it again in SearchIO. 
223   
224  'comments' and 'fields' keyword arguments are both applicable for parsing, 
225  indexing, and writing. 
226   
227  blast-tab provides the following attributes for each SearchIO object:
228   
229  +-------------+-------------------+--------------+ 
230  | Object      | Attribute         | Column name  | 
231  +=============+===================+==============+ 
232  | QueryResult | accession         | qacc         | 
233  |             +-------------------+--------------+ 
234  |             | accession_version | qaccver      | 
235  |             +-------------------+--------------+ 
236  |             | gi                | qgi          | 
237  |             +-------------------+--------------+ 
238  |             | seq_len           | qlen         | 
239  |             +-------------------+--------------+ 
240  |             | id                | qseqid       | 
241  +-------------+-------------------+--------------+ 
242  | Hit         | accession         | sacc         | 
243  |             +-------------------+--------------+ 
244  |             | accession_version | saccver      |
245  |             +-------------------+--------------+ 
246  |             | gi                | sgi          | 
247  |             +-------------------+--------------+ 
248  |             | gi_all            | sallgi       | 
249  |             +-------------------+--------------+ 
250  |             | id_all            | sallseqid    | 
251  |             +-------------------+--------------+ 
252  |             | seq_len           | slen         | 
253  |             +-------------------+--------------+ 
254  |             | id                | sseqid       | 
255  +-------------+-------------------+--------------+ 
256  | HSP         | bitscore          | bitscore     | 
257  |             +-------------------+--------------+ 
258  |             | btop              | btop         | 
259  |             +-------------------+--------------+ 
260  |             | evalue            | evalue       | 
261  |             +-------------------+--------------+ 
262  |             | gapopen_num       | gapopen      | 
263  |             +-------------------+--------------+ 
264  |             | gap_num           | gaps         | 
265  |             +-------------------+--------------+ 
266  |             | ident_num         | nident       |
267  |             +-------------------+--------------+
268  |             | ident_pct         | pident       |
269  |             +-------------------+--------------+ 
270  |             | mismatch_num      | mismatch     | 
271  |             +-------------------+--------------+ 
272  |             | pos_pct           | ppos         | 
273  |             +-------------------+--------------+ 
274  |             | pos_num           | positive     | 
275  |             +-------------------+--------------+ 
276  |             | bitscore_raw      | score        | 
277  +-------------+-------------------+--------------+ 
278  | HSPFragment | frames            | frames[2]    | 
279  | (also via   +-------------------+--------------+ 
280  | HSP)        | aln_span          | length       | 
281  |             +-------------------+--------------+ 
282  |             | query_end         | qend         | 
283  |             +-------------------+--------------+ 
284  |             | query_frame       | qframe       | 
285  |             +-------------------+--------------+ 
286  |             | query             | qseq         | 
287  |             +-------------------+--------------+ 
288  |             | query_start       | qstart       | 
289  |             +-------------------+--------------+ 
290  |             | hit_end           | send         | 
291  |             +-------------------+--------------+ 
292  |             | hit_frame         | sframe       | 
293  |             +-------------------+--------------+ 
294  |             | hit               | sseq         | 
295  |             +-------------------+--------------+ 
296  |             | hit_start         | sstart       | 
297  +-------------+-------------------+--------------+ 
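
Two pairs of HSP columns are easy to confuse: in BLAST's naming, 'nident' and
'positive' are counts, while 'pident' and 'ppos' are percentages. The sketch
below shows the arithmetic relating a count to its percentage, assuming the
percentage is taken over the alignment length (the 'length' column). BLAST's
own rounding may differ, and ``percent_of_alignment`` is a hypothetical
helper.

```python
def percent_of_alignment(count, length):
    # Percentage of alignment columns that fall into the counted category.
    if length <= 0:
        raise ValueError("alignment length must be positive")
    return 100.0 * count / length


# 48 identical positions in an alignment of 96 columns.
print(percent_of_alignment(48, 96))
```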
298   
299  If the parsed file is commented, the following attributes may be available as 
300  well: 
301   
302  +--------------+---------------+----------------------------+ 
303  | Object       | Attribute     | Value                      | 
304  +==============+===============+============================+ 
305  | QueryResult  | description   | query description          | 
306  |              +---------------+----------------------------+ 
307  |              | fields        | columns in the output file | 
308  |              +---------------+----------------------------+ 
309  |              | program       | BLAST flavor               | 
310  |              +---------------+----------------------------+ 
311  |              | rid           | remote search ID           | 
312  |              +---------------+----------------------------+ 
313  |              | target        | target database            | 
314  |              +---------------+----------------------------+ 
315  |              | version       | BLAST version              | 
316  +--------------+---------------+----------------------------+ 
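
The extra attributes come from the comment lines that precede each result
block. The sketch below shows one way such lines could be read, using an
invented header written in the style of commented BLAST+ tabular output. The
database name, description and field list are made up, and
``parse_comment_header`` is a hypothetical helper. Unlike the real parser, it
keeps the field labels exactly as written in the header.

```python
# Illustrative header in the style of commented BLAST+ tabular output.
SAMPLE_HEADER = [
    "# TBLASTN 2.2.26+",
    "# Query: gi|11464971:4-101 example description",
    "# Database: example_db",
    "# Fields: query id, subject id, evalue, bit score",
    "# 5 hits found",
]


def parse_comment_header(lines):
    # Collect header values keyed by the attribute names in the table.
    info = {}
    for line in lines:
        text = line.lstrip("#").strip()
        if text.startswith("Query: "):
            query_id, _, desc = text[len("Query: "):].partition(" ")
            info["id"] = query_id
            info["description"] = desc
        elif text.startswith("Database: "):
            info["target"] = text[len("Database: "):]
        elif text.startswith("RID: "):
            info["rid"] = text[len("RID: "):]
        elif text.startswith("Fields: "):
            info["fields"] = text[len("Fields: "):].split(", ")
        elif text.endswith("hits found"):
            continue
        elif "program" not in info:
            # First unlabelled line: program name followed by version.
            program, _, version = text.partition(" ")
            info["program"] = program.lower()
            info["version"] = version
    return info


print(parse_comment_header(SAMPLE_HEADER))
```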
317   
318   
319  blast-text 
320  ========== 
321  The BLAST plain text output format has been known to change considerably between 
322  BLAST versions. NCBI itself has recommended that users not rely on the plain 
323  text output for parsing-related work. 
324   
325  However, in some cases parsing the plain text output may still be useful. 
326  SearchIO provides parsing support for the plain text output, but guarantees only 
327  a minimum level of support. Writing a parser that fully supports plain text 
328  output for all BLAST versions is not a priority at the moment. 
329   
330  If you do have a BLAST plain text file that cannot be parsed and would like to
331  submit a patch, we are more than happy to accept it.
332   
333  The blast-text parser provides the following object attributes: 
334   
335  +-----------------+-------------------------+----------------------------------+ 
336  | Object          | Attribute               | Value                            | 
337  +=================+=========================+==================================+ 
338  | QueryResult     | description             | query sequence description       | 
339  |                 +-------------------------+----------------------------------+ 
340  |                 | id                      | query sequence ID                | 
341  |                 +-------------------------+----------------------------------+ 
342  |                 | program                 | BLAST flavor                     | 
343  |                 +-------------------------+----------------------------------+ 
344  |                 | seq_len                 | full length of query sequence    | 
345  |                 +-------------------------+----------------------------------+ 
346  |                 | target                  | target database of the search    | 
347  |                 +-------------------------+----------------------------------+ 
348  |                 | version                 | BLAST version                    | 
349  +-----------------+-------------------------+----------------------------------+ 
350  | Hit             | evalue                  | hit-level evalue, from the hit   | 
351  |                 |                         | table                            | 
352  |                 +-------------------------+----------------------------------+ 
353  |                 | id                      | hit sequence ID                  | 
354  |                 +-------------------------+----------------------------------+ 
355  |                 | description             | hit sequence description         | 
356  |                 +-------------------------+----------------------------------+ 
357  |                 | score                   | hit-level score, from the hit    | 
358  |                 |                         | table                            | 
359  |                 +-------------------------+----------------------------------+ 
360  |                 | seq_len                 | full length of hit sequence      | 
361  +-----------------+-------------------------+----------------------------------+ 
362  | HSP             | evalue                  | hsp-level evalue                 | 
363  |                 +-------------------------+----------------------------------+ 
364  |                 | bitscore                | hsp-level bit score              | 
365  |                 +-------------------------+----------------------------------+ 
366  |                 | bitscore_raw            | hsp-level score                  | 
367  |                 +-------------------------+----------------------------------+ 
368  |                 | gap_num                 | number of gaps in alignment      | 
369  |                 +-------------------------+----------------------------------+ 
370  |                 | ident_num               | number of identical residues     | 
371  |                 |                         | in alignment                     | 
372  |                 +-------------------------+----------------------------------+ 
373  |                 | pos_num                 | number of positive matches in    | 
374  |                 |                         | alignment                        | 
375  +-----------------+-------------------------+----------------------------------+ 
376  | HSPFragment     | aln_annotation          | alignment similarity string      | 
377  | (also via       +-------------------------+----------------------------------+ 
378  | HSP)            | aln_span                | length of alignment fragment     | 
379  |                 +-------------------------+----------------------------------+ 
380  |                 | hit                     | hit sequence                     | 
381  |                 +-------------------------+----------------------------------+ 
382  |                 | hit_end                 | hit sequence end coordinate      | 
383  |                 +-------------------------+----------------------------------+ 
384  |                 | hit_frame               | hit sequence reading frame       | 
385  |                 +-------------------------+----------------------------------+ 
386  |                 | hit_start               | hit sequence start coordinate    | 
387  |                 +-------------------------+----------------------------------+ 
388  |                 | hit_strand              | hit sequence strand              | 
389  |                 +-------------------------+----------------------------------+ 
390  |                 | query                   | query sequence                   | 
391  |                 +-------------------------+----------------------------------+ 
392  |                 | query_end               | query sequence end coordinate    | 
393  |                 +-------------------------+----------------------------------+ 
394  |                 | query_frame             | query sequence reading frame     | 
395  |                 +-------------------------+----------------------------------+ 
396  |                 | query_start             | query sequence start coordinate  | 
397  |                 +-------------------------+----------------------------------+ 
398  |                 | query_strand            | query sequence strand            | 
399  +-----------------+-------------------------+----------------------------------+ 
400   
401   
402  .. [1] may be modified 
403   
404  .. [2] When 'frames' is present, both ``query_frame`` and ``hit_frame`` will be 
405     present as well. It is recommended that you use these instead of 'frames' directly. 
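
As an illustration of footnote [2], the sketch below splits a 'frames' value
into separate query and hit frames. It assumes the column holds the two
frames separated by a slash, for example '1/-2'. ``split_frames`` is a
hypothetical helper, not part of the module.

```python
def split_frames(frames):
    # Split a 'query/hit' frames value into two integers.
    query_frame, hit_frame = frames.split("/")
    return int(query_frame), int(hit_frame)


print(split_frames("1/-2"))
```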
406   
407  """ 
408   
409  from .blast_tab import BlastTabParser, BlastTabIndexer, BlastTabWriter 
410  from .blast_xml import BlastXmlParser, BlastXmlIndexer, BlastXmlWriter 
411  from .blast_text import BlastTextParser 
412   
413   
414  # if not used as a module, run the doctest 
415  if __name__ == "__main__": 
416      from Bio._utils import run_doctest 
417      run_doctest() 
418