Source Code for Package Bio.SearchIO

# Copyright 2012 by Wibowo Arindrarto.  All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.

"""Biopython interface for sequence search program outputs.

The SearchIO submodule provides parsers, indexers, and writers for outputs from
various sequence search programs. It provides an API similar to SeqIO and
AlignIO, with the following main functions: `parse`, `read`, `to_dict`, `index`,
`index_db`, `write`, and `convert`.

SearchIO parses a search output file's contents into a hierarchy of four nested
objects: QueryResult, Hit, HSP, and HSPFragment. Each of them models a part of
the search output file:

    - QueryResult represents a search query. This is the main object returned
      by the input functions and it contains all other objects.
    - Hit represents a database hit.
    - HSP represents high-scoring alignment region(s) in the hit.
    - HSPFragment represents a contiguous alignment within the HSP.

In addition to the four objects above, SearchIO is also tightly integrated with
the SeqRecord objects (see SeqIO) and MultipleSeqAlignment objects (see
AlignIO). SeqRecord objects are used to store the actual matching hit and query
sequences, while MultipleSeqAlignment objects store the alignment between them.

Detailed descriptions of these objects' features and examples of their usage
are available in their respective documentation.


Input
=====
The main function for parsing search output files is Bio.SearchIO.parse(...).
This function parses a given search output file and returns a generator object
that yields one QueryResult object per iteration.

`parse` takes two arguments: 1) a file handle or a filename of the input file
(the search output file) and 2) the format name.

    >>> from Bio import SearchIO
    >>> for qresult in SearchIO.parse('Blast/mirna.xml', 'blast-xml'):
    ...     print("%s %s" % (qresult.id, qresult.description))
    ...
    33211 mir_1
    33212 mir_2
    33213 mir_3

SearchIO also provides the Bio.SearchIO.read(...) function, which is intended
for use on search output files containing only one query. `read` returns one
QueryResult object and will raise an exception if the source file contains more
than one query:

    >>> qresult = SearchIO.read('Blast/xml_2226_blastp_004.xml', 'blast-xml')
    >>> print("%s %s" % (qresult.id, qresult.description))
    ...
    gi|11464971:4-101 pleckstrin [Mus musculus]

    >>> SearchIO.read('Blast/mirna.xml', 'blast-xml')
    Traceback (most recent call last):
    ...
    ValueError: ...

For accessing search results of large output files, you may use the indexing
functions Bio.SearchIO.index(...) or Bio.SearchIO.index_db(...). They have a
similar interface to their counterparts in SeqIO and AlignIO, with the addition
of optional, format-specific keyword arguments.
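
As a rough illustration of why indexing helps with large files, the sketch
below records the byte offset at which each record starts and seeks to that
offset on demand, instead of parsing everything up front. It is a toy and not
the SearchIO implementation: the '>' record marker, the in-memory file, and
the names `build_offset_index` and `fetch_record` are assumptions made only
for this example.

```python
import io


def build_offset_index(handle, marker=b">"):
    # Map each record ID to the byte offset where its record starts.
    offsets = {}
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(marker):
            record_id = line[len(marker):].split()[0].decode("ascii")
            if record_id in offsets:
                raise ValueError("Duplicate key %r" % record_id)
            offsets[record_id] = position
    return offsets


def fetch_record(handle, offsets, record_id, marker=b">"):
    # Jump to the stored offset and read until the next record begins.
    handle.seek(offsets[record_id])
    lines = [handle.readline()]
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line or line.startswith(marker):
            handle.seek(position)
            break
        lines.append(line)
    return b"".join(lines).decode("ascii")


# Hypothetical two-record file, held in memory for the example.
data = io.BytesIO(b'''>q1 first query
hit_a 10 28
hit_b 5 9
>q2 second query
hit_c 1 4
''')

offsets = build_offset_index(data)
print(sorted(offsets))
print(fetch_record(data, offsets, "q2"))
```

Only the small ID-to-offset dictionary stays in memory; each record is read
and decoded at the moment it is requested.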


Output
======
SearchIO has writing support for several formats, accessible from the
Bio.SearchIO.write(...) function. This function returns a tuple of four
numbers: the numbers of QueryResult, Hit, HSP, and HSPFragment objects
written::

    qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml')
    SearchIO.write(qresults, 'results.tab', 'blast-tab')
    <stdout> (3, 239, 277, 277)

Note that different writers may require different attribute values of the
SearchIO objects. This limits the scope of writable search results to search
results possessing the required attributes.

For example, the writer for HMMER domain table output requires
the conditional e-value attribute from each HSP object, among others. If you
try to write to the HMMER domain table format and your HSPs do not have this
attribute, an exception will be raised.


Conversion
==========
SearchIO provides a shortcut function Bio.SearchIO.convert(...) to convert a
given file into another format. Under the hood, `convert` simply parses a given
output file and writes it to another using the `parse` and `write` functions.

Note that the same restrictions found in Bio.SearchIO.write(...) apply to the
convert function as well.
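
The parse-then-write shape described above can be sketched with two toy
formats. Nothing here is SearchIO code: the whitespace-separated input, the
comma-separated output, and the names `parse_pairs`, `write_csv`, and
`convert_pairs` are invented for illustration. The point is only that the
writer consumes the parser's generator, so records stream through one at a
time and the writer reports how many it wrote.

```python
import io


def parse_pairs(handle):
    # Toy input format: one "query hit" pair per line, whitespace separated.
    for line in handle:
        if line.strip():
            query_id, hit_id = line.split()
            yield query_id, hit_id


def write_csv(pairs, handle):
    # Toy output format: one "query,hit" pair per line; returns the count.
    count = 0
    for query_id, hit_id in pairs:
        print("%s,%s" % (query_id, hit_id), file=handle)
        count += 1
    return count


def convert_pairs(in_handle, out_handle):
    # Same shape as a parse/write shortcut: the generator is passed straight
    # to the writer, so no intermediate list is built.
    return write_csv(parse_pairs(in_handle), out_handle)


source = io.StringIO('''q1 hit_a
q1 hit_b
q2 hit_c
''')
target = io.StringIO()
written = convert_pairs(source, target)
print(written)
print(target.getvalue())
```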


Conventions
===========
The main goal of creating SearchIO is to have a common, easy to use interface
across different search output files. As such, we have also created some
conventions / standards for SearchIO that extend beyond the common object model.
These conventions apply to all files parsed by SearchIO, regardless of their
individual formats.

Python-style sequence coordinates
---------------------------------

When storing sequence coordinates (start and end values), SearchIO uses
the Python-style slice convention: zero-based and half-open intervals. For
example, if in a BLAST XML output file the start and end coordinates of an
HSP are 10 and 28, they would become 9 and 28 in SearchIO. The start
coordinate becomes 9 because Python indices start from zero, while the end
coordinate remains 28 as Python slices omit the last item in an interval.

Besides giving you the benefits of standardization, this convention also
makes the coordinates usable for slicing sequences. For example, given a
full query sequence and the start and end coordinates of an HSP, one can
use the coordinates to extract the part of the query sequence that produced
the database hit.

When these objects are written to an output file using
SearchIO.write(...), the coordinate values are restored to their
respective format's convention. Using the example above, if the HSP were
written to an XML file, the start and end coordinates would become 10
and 28 again.
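
The arithmetic behind this convention is small enough to show directly. The
helper names and the 32-letter query string below are made up for the
example; only the 10/28 to 9/28 mapping comes from the text above.

```python
def to_python_coords(start, end):
    # 1-based, inclusive (start, end) -> 0-based, half-open (start, end).
    return start - 1, end


def from_python_coords(start, end):
    # 0-based, half-open (start, end) -> 1-based, inclusive (start, end).
    return start + 1, end


# Hypothetical query sequence; any string of at least 28 letters would do.
query_seq = "ACGTACGTACGTACGTACGTACGTACGTACGT"

start, end = to_python_coords(10, 28)
fragment = query_seq[start:end]

print(start, end)
print(len(fragment))
print(from_python_coords(start, end))
```

The slice length (end minus start) equals the number of residues covered by
the original inclusive interval, which is why the end value does not change.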

Sequence coordinate order
-------------------------

Some search output formats reverse the start and end coordinates
according to the sequence's strand. For example, in BLAST plain text
format if the matching strand lies in the minus orientation, then the
start coordinate will always be bigger than the end coordinate.

In SearchIO, start coordinates are always smaller than the end
coordinates, regardless of their originating strand. This ensures
consistency when using the coordinates to slice full sequences.

Note that this coordinate order convention is only enforced at the
HSPFragment level. If an HSP object has several HSPFragment objects, each
individual fragment will conform to this convention. But the order of the
fragments within the HSP object follows what the search output file uses.

Similar to the coordinate style convention, the order of the start and end
coordinates is restored to that of the respective format when the objects
are written using Bio.SearchIO.write(...).
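
A minimal sketch of the ordering rule, combined with the zero-based
conversion from the previous section. It assumes an input format that is
1-based, inclusive, and swaps start and end on the minus strand, as in the
BLAST plain text example; `normalize_span` and `restore_span` are
hypothetical names, not SearchIO functions.

```python
def normalize_span(start, end):
    # Accept 1-based inclusive coordinates in either order and return a
    # 0-based half-open (start, end) pair with start <= end.
    low, high = min(start, end), max(start, end)
    return low - 1, high


def restore_span(start, end, strand):
    # Reverse of normalize_span for a format that swaps the coordinates
    # when the strand is -1.
    low, high = start + 1, end
    if strand == -1:
        return high, low
    return low, high


print(normalize_span(28, 10))
print(normalize_span(10, 28))
print(restore_span(9, 28, -1))
print(restore_span(9, 28, 1))
```

Because the order alone no longer tells you the orientation after
normalizing, the strand has to be kept as a separate value for the round
trip to work.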

Frames and strand values
------------------------

SearchIO only allows -1, 0, 1 and None as strand values. For frames, the
only allowed values are integers from -3 to 3 (inclusive) and None. Both
of these are standard Biopython conventions.
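
The allowed values can be expressed as a simple membership check. This is
only a sketch of the rule as stated; the names `check_strand` and
`check_frame` and the error messages are assumptions, and the real objects
may validate differently.

```python
ALLOWED_STRANDS = (-1, 0, 1, None)
ALLOWED_FRAMES = (-3, -2, -1, 0, 1, 2, 3, None)


def check_strand(value):
    # Return the value unchanged if it is an allowed strand, else raise.
    if value not in ALLOWED_STRANDS:
        raise ValueError("Strand should be -1, 0, 1, or None; not %r" % value)
    return value


def check_frame(value):
    # Return the value unchanged if it is an allowed frame, else raise.
    if value not in ALLOWED_FRAMES:
        raise ValueError("Frame should be an integer from -3 to 3, or None; "
                         "not %r" % value)
    return value


print(check_strand(-1))
print(check_frame(None))
```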


Supported Formats
=================
Below is a list of search program output formats supported by SearchIO.

Support for parsing, indexing, and writing:

 - blast-tab        - BLAST+ tabular output. Both variants without comments
                      (-outfmt 6 flag) and with comments (-outfmt 7 flag) are
                      supported.
 - blast-xml        - BLAST+ XML output.
 - blat-psl         - The default output of BLAT (PSL format). Variants with or
                      without header are both supported. PSLX (PSL + sequences)
                      is also supported.
 - hmmer3-tab       - HMMER3 table output.
 - hmmer3-domtab    - HMMER3 domain table output. When using this format, the
                      program name has to be specified. For example, for parsing
                      hmmscan output, the name would be 'hmmscan3-domtab'.

Support for parsing and indexing:

 - exonerate-text   - Exonerate plain text output.
 - exonerate-vulgar - Exonerate vulgar line.
 - exonerate-cigar  - Exonerate cigar line.
 - fasta-m10        - Bill Pearson's FASTA -m 10 output.
 - hmmer3-text      - HMMER3 regular text output format. Supported HMMER3
                      subprograms are hmmscan, hmmsearch, and phmmer.
 - hmmer2-text      - HMMER2 regular text output format. Supported HMMER2
                      subprograms are hmmpfam, hmmsearch.

Support for parsing:

 - blast-text       - BLAST+ plain text output.

Each of these formats has different keyword arguments available for use with
the main SearchIO functions. More details and examples are available in each
format's documentation.

"""

from __future__ import print_function
from Bio._py3k import basestring

import sys
import warnings

from Bio import BiopythonExperimentalWarning
from Bio.File import as_handle
from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment
from Bio.SearchIO._utils import get_processor


warnings.warn('Bio.SearchIO is an experimental submodule which may undergo '
        'significant changes prior to its future official release.',
        BiopythonExperimentalWarning)


__all__ = ['read', 'parse', 'to_dict', 'index', 'index_db', 'write', 'convert']


# dictionary of supported formats for parse() and read()
_ITERATOR_MAP = {
        'blast-tab': ('BlastIO', 'BlastTabParser'),
        'blast-text': ('BlastIO', 'BlastTextParser'),
        'blast-xml': ('BlastIO', 'BlastXmlParser'),
        'blat-psl': ('BlatIO', 'BlatPslParser'),
        'exonerate-cigar': ('ExonerateIO', 'ExonerateCigarParser'),
        'exonerate-text': ('ExonerateIO', 'ExonerateTextParser'),
        'exonerate-vulgar': ('ExonerateIO', 'ExonerateVulgarParser'),
        'fasta-m10': ('FastaIO', 'FastaM10Parser'),
        'hmmer2-text': ('HmmerIO', 'Hmmer2TextParser'),
        'hmmer3-text': ('HmmerIO', 'Hmmer3TextParser'),
        'hmmer3-tab': ('HmmerIO', 'Hmmer3TabParser'),
        # for hmmer3-domtab, the specific program is part of the format name
        # as we need it to distinguish hit / target coordinates
        'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitParser'),
        'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryParser'),
        'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryParser'),
}

# dictionary of supported formats for index()
_INDEXER_MAP = {
        'blast-tab': ('BlastIO', 'BlastTabIndexer'),
        'blast-xml': ('BlastIO', 'BlastXmlIndexer'),
        'blat-psl': ('BlatIO', 'BlatPslIndexer'),
        'exonerate-cigar': ('ExonerateIO', 'ExonerateCigarIndexer'),
        'exonerate-text': ('ExonerateIO', 'ExonerateTextIndexer'),
        'exonerate-vulgar': ('ExonerateIO', 'ExonerateVulgarIndexer'),
        'fasta-m10': ('FastaIO', 'FastaM10Indexer'),
        'hmmer2-text': ('HmmerIO', 'Hmmer2TextIndexer'),
        'hmmer3-text': ('HmmerIO', 'Hmmer3TextIndexer'),
        'hmmer3-tab': ('HmmerIO', 'Hmmer3TabIndexer'),
        'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitIndexer'),
        'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryIndexer'),
        'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryIndexer'),
}

# dictionary of supported formats for write()
_WRITER_MAP = {
        'blast-tab': ('BlastIO', 'BlastTabWriter'),
        'blast-xml': ('BlastIO', 'BlastXmlWriter'),
        'blat-psl': ('BlatIO', 'BlatPslWriter'),
        'hmmer3-tab': ('HmmerIO', 'Hmmer3TabWriter'),
        'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitWriter'),
        'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryWriter'),
        'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryWriter'),
}


def parse(handle, format=None, **kwargs):
    """Turns a search output file into a generator that yields QueryResult
    objects.

     - handle - Handle to the file, or the filename as a string.
     - format - Lower case string denoting one of the supported formats.
     - kwargs - Format-specific keyword arguments.

    This function is used to iterate over each query in a given search output
    file:

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml')
    >>> qresults
    <generator object ...>
    >>> for qresult in qresults:
    ...     print("Search %s has %i hits" % (qresult.id, len(qresult)))
    ...
    Search 33211 has 100 hits
    Search 33212 has 44 hits
    Search 33213 has 95 hits

    Depending on the file format, `parse` may also accept additional keyword
    argument(s) that modify the behavior of the format parser. Here is a
    simple example, where the keyword argument enables parsing of a commented
    BLAST tabular output file:

    >>> from Bio import SearchIO
    >>> for qresult in SearchIO.parse('Blast/mirna.tab', 'blast-tab', comments=True):
    ...     print("Search %s has %i hits" % (qresult.id, len(qresult)))
    ...
    Search 33211 has 100 hits
    Search 33212 has 44 hits
    Search 33213 has 95 hits

    """
    # get the iterator object and do error checking
    iterator = get_processor(format, _ITERATOR_MAP)

    # HACK: force BLAST XML decoding to use utf-8
    handle_kwargs = {}
    if format == 'blast-xml' and sys.version_info[0] > 2:
        handle_kwargs['encoding'] = 'utf-8'

    # and start iterating
    with as_handle(handle, 'rU', **handle_kwargs) as source_file:
        generator = iterator(source_file, **kwargs)

        for qresult in generator:
            yield qresult


def read(handle, format=None, **kwargs):
    """Turns a search output file containing one query into a single QueryResult.

     - handle - Handle to the file, or the filename as a string.
     - format - Lower case string denoting one of the supported formats.
     - kwargs - Format-specific keyword arguments.

    `read` is used for parsing search output files containing exactly one query:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/xml_2226_blastp_004.xml', 'blast-xml')
    >>> print("%s %s" % (qresult.id, qresult.description))
    ...
    gi|11464971:4-101 pleckstrin [Mus musculus]

    If the given handle has no results, an exception will be raised:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/tab_2226_tblastn_002.txt', 'blast-tab')
    Traceback (most recent call last):
    ...
    ValueError: No query results found in handle

    Similarly, if the given handle has more than one result, an exception will
    also be raised:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/tab_2226_tblastn_001.txt', 'blast-tab')
    Traceback (most recent call last):
    ...
    ValueError: More than one query results found in handle

    Like `parse`, `read` may also accept keyword argument(s) depending on the
    search output file format.

    """
    generator = parse(handle, format, **kwargs)

    try:
        first = next(generator)
    except StopIteration:
        raise ValueError("No query results found in handle")
    else:
        try:
            second = next(generator)
        except StopIteration:
            second = None

    if second is not None:
        raise ValueError("More than one query results found in handle")

    return first


def to_dict(qresults, key_function=lambda rec: rec.id):
    """Turns a QueryResult iterator or list into a dictionary.

     - qresults     - Iterable returning QueryResult objects.
     - key_function - Optional callback function which when given a
                      QueryResult object should return a unique key for the
                      dictionary.

    This function enables access of QueryResult objects from a single search
    output file using its identifier.

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/wnts.xml', 'blast-xml')
    >>> search_dict = SearchIO.to_dict(qresults)
    >>> sorted(search_dict)
    ['gi|156630997:105-1160', ..., 'gi|371502086:108-1205', 'gi|53729353:216-1313']
    >>> search_dict['gi|156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)

    By default, the dictionary key is the QueryResult's string ID. This may be
    changed by supplying a callback function that returns the desired identifier.
    Here is an example using a function that removes the 'gi|' part in the
    beginning of the QueryResult ID.

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/wnts.xml', 'blast-xml')
    >>> key_func = lambda qresult: qresult.id.split('|')[1]
    >>> search_dict = SearchIO.to_dict(qresults, key_func)
    >>> sorted(search_dict)
    ['156630997:105-1160', ..., '371502086:108-1205', '53729353:216-1313']
    >>> search_dict['156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)

    Note that the callback function does not change the QueryResult's ID value.
    It only changes the key value used to retrieve the associated QueryResult.

    As this function loads all QueryResult objects into memory, it may be
    unsuitable for dealing with files containing many queries. In that case, it
    is recommended that you use either `index` or `index_db`.

    """
    qdict = {}
    for qresult in qresults:
        key = key_function(qresult)
        if key in qdict:
            raise ValueError("Duplicate key %r" % key)
        qdict[key] = qresult
    return qdict


def index(filename, format=None, key_function=None, **kwargs):
    """Indexes a search output file and returns a dictionary-like object.

     - filename     - string giving name of file to be indexed
     - format       - Lower case string denoting one of the supported formats.
     - key_function - Optional callback function which when given a
                      QueryResult identifier string should return a unique
                      key for the dictionary.
     - kwargs       - Format-specific keyword arguments.

    Index returns a pseudo-dictionary object with QueryResult objects as its
    values and a string identifier as its keys. The function is mainly useful
    for dealing with large search output files, as it enables access to any
    given QueryResult object much faster than using parse or read.

    Index works by storing in memory the start locations of all queries in a
    file. When a user requests access to a query, this function will jump
    to its start position, parse the whole query, and return it as a
    QueryResult object:

    >>> from Bio import SearchIO
    >>> search_idx = SearchIO.index('Blast/wnts.xml', 'blast-xml')
    >>> search_idx
    SearchIO.index('Blast/wnts.xml', 'blast-xml', key_function=None)
    >>> sorted(search_idx)
    ['gi|156630997:105-1160', 'gi|195230749:301-1383', ..., 'gi|53729353:216-1313']
    >>> search_idx['gi|195230749:301-1383']
    QueryResult(id='gi|195230749:301-1383', 5 hits)
    >>> search_idx.close()

    If the file is BGZF compressed, this is detected automatically. Ordinary
    GZIP files are not supported:

    >>> from Bio import SearchIO
    >>> search_idx = SearchIO.index('Blast/wnts.xml.bgz', 'blast-xml')
    >>> search_idx
    SearchIO.index('Blast/wnts.xml.bgz', 'blast-xml', key_function=None)
    >>> search_idx['gi|195230749:301-1383']
    QueryResult(id='gi|195230749:301-1383', 5 hits)
    >>> search_idx.close()

    You can supply a custom callback function to alter the default identifier
    string. This function should accept as its input the QueryResult ID string
    and return a modified version of it.

    >>> from Bio import SearchIO
    >>> key_func = lambda id: id.split('|')[1]
    >>> search_idx = SearchIO.index('Blast/wnts.xml', 'blast-xml', key_func)
    >>> search_idx
    SearchIO.index('Blast/wnts.xml', 'blast-xml', key_function=<function <lambda> at ...>)
    >>> sorted(search_idx)
    ['156630997:105-1160', ..., '371502086:108-1205', '53729353:216-1313']
    >>> search_idx['156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)
    >>> search_idx.close()

    Note that the callback function does not change the QueryResult's ID value.
    It only changes the key value used to retrieve the associated QueryResult.

    """
    if not isinstance(filename, basestring):
        raise TypeError("Need a filename (not a handle)")

    from Bio.File import _IndexedSeqFileDict
    proxy_class = get_processor(format, _INDEXER_MAP)
    repr = "SearchIO.index(%r, %r, key_function=%r)" \
        % (filename, format, key_function)
    return _IndexedSeqFileDict(proxy_class(filename, **kwargs),
                               key_function, repr, "QueryResult")


def index_db(index_filename, filenames=None, format=None,
             key_function=None, **kwargs):
    """Indexes several search output files into an SQLite database.

     - index_filename - The SQLite filename.
     - filenames    - List of strings specifying file(s) to be indexed, or when
                      indexing a single file this can be given as a string.
                      (optional if reloading an existing index, but must match)
     - format       - Lower case string denoting one of the supported formats.
                      (optional if reloading an existing index, but must match)
     - key_function - Optional callback function which when given a
                      QueryResult identifier string should return a unique
                      key for the dictionary.
     - kwargs       - Format-specific keyword arguments.

    The `index_db` function is similar to `index` in that it indexes the start
    position of all queries from search output files. The main difference is
    instead of storing these indices in-memory, they are written to disk as an
    SQLite database file. This allows the indices to persist between Python
    sessions. This enables access to any queries in the file without any
    indexing overhead, provided it has been indexed at least once.

    >>> from Bio import SearchIO
    >>> idx_filename = ":memory:" # Use a real filename, this is in RAM only!
    >>> db_idx = SearchIO.index_db(idx_filename, 'Blast/mirna.xml', 'blast-xml')
    >>> sorted(db_idx)
    ['33211', '33212', '33213']
    >>> db_idx['33212']
    QueryResult(id='33212', 44 hits)
    >>> db_idx.close()

    `index_db` can also index multiple files and store them in the same
    database, making it easier to group multiple search files and access them
    from a single interface.

    >>> from Bio import SearchIO
    >>> idx_filename = ":memory:" # Use a real filename, this is in RAM only!
    >>> files = ['Blast/mirna.xml', 'Blast/wnts.xml']
    >>> db_idx = SearchIO.index_db(idx_filename, files, 'blast-xml')
    >>> sorted(db_idx)
    ['33211', '33212', '33213', 'gi|156630997:105-1160', ..., 'gi|53729353:216-1313']
    >>> db_idx['33212']
    QueryResult(id='33212', 44 hits)
    >>> db_idx.close()

    One common example where this is helpful is if you had a large set of
    query sequences (say ten thousand) which you split into ten query files
    of one thousand sequences each in order to run as ten separate BLAST jobs
    on a cluster. You could use `index_db` to index the ten BLAST output
    files together for seamless access to all the results as one dictionary.

    Note that ':memory:' rather than an index filename tells SQLite to hold
    the index database in memory. This is useful for quick tests, but using
    the Bio.SearchIO.index(...) function instead would use less memory.

    BGZF compressed files are supported, and detected automatically. Ordinary
    GZIP compressed files are not supported.

    See also Bio.SearchIO.index(), Bio.SearchIO.to_dict(), and the Python module
    glob which is useful for building lists of files.
    """
    # cast filenames to list if it's a string
    # (can we check if it's a string or a generator?)
    if isinstance(filenames, basestring):
        filenames = [filenames]

    from Bio.File import _SQLiteManySeqFilesDict
    repr = "SearchIO.index_db(%r, filenames=%r, format=%r, key_function=%r, ...)" \
        % (index_filename, filenames, format, key_function)

    def proxy_factory(format, filename=None):
        """Given a filename returns proxy object, else boolean if format OK."""
        if filename:
            return get_processor(format, _INDEXER_MAP)(filename, **kwargs)
        else:
            return format in _INDEXER_MAP

    return _SQLiteManySeqFilesDict(index_filename, filenames,
                                   proxy_factory, format,
                                   key_function, repr)


def write(qresults, handle, format=None, **kwargs):
    """Writes QueryResult objects to a file in the given format.

     - qresults - An iterator returning QueryResult objects or a single
                  QueryResult object.
     - handle   - Handle to the file, or the filename as a string.
     - format   - Lower case string denoting one of the supported formats.
     - kwargs   - Format-specific keyword arguments.

    The `write` function writes QueryResult object(s) into the given output
    handle / filename. You can supply it with a single QueryResult object or an
    iterable returning one or more QueryResult objects. In both cases, the
    function will return a tuple of four values: the number of QueryResult, Hit,
    HSP, and HSPFragment objects it writes to the output file::

        from Bio import SearchIO
        qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml')
        SearchIO.write(qresults, 'results.tab', 'blast-tab')
        <stdout> (3, 239, 277, 277)

    The output of different formats may be adjusted using the format-specific
    keyword arguments. Here is an example that writes a BLAT PSL output file
    with a header::

        from Bio import SearchIO
        qresults = SearchIO.parse('Blat/psl_34_001.psl', 'blat-psl')
        SearchIO.write(qresults, 'results.tab', 'blat-psl', header=True)
        <stdout> (2, 13, 22, 26)

    """
    # turn qresults into an iterator if it's a single QueryResult object
    if isinstance(qresults, QueryResult):
        qresults = iter([qresults])
    else:
        qresults = iter(qresults)

    # get the writer object and do error checking
    writer_class = get_processor(format, _WRITER_MAP)

    # write to the handle
    with as_handle(handle, 'w') as target_file:
        writer = writer_class(target_file, **kwargs)
        # count how many qresults, hits, and hsps
        qresult_count, hit_count, hsp_count, frag_count = \
            writer.write_file(qresults)

    return qresult_count, hit_count, hsp_count, frag_count


def convert(in_file, in_format, out_file, out_format, in_kwargs=None,
            out_kwargs=None):
    """Convert between two search output formats, return number of records.

     - in_file    - Handle to the input file, or the filename as string.
     - in_format  - Lower case string denoting the format of the input file.
     - out_file   - Handle to the output file, or the filename as string.
     - out_format - Lower case string denoting the format of the output file.
     - in_kwargs  - Dictionary of keyword arguments for the input function.
     - out_kwargs - Dictionary of keyword arguments for the output function.

    The convert function is a shortcut function for `parse` and `write`. It has
    the same return type as `write`. Format-specific arguments may be passed to
    the convert function, but only as dictionaries.

    Here is an example of using `convert` to convert from a BLAST+ XML file
    into a tabular file with comments::

        from Bio import SearchIO
        in_file = 'Blast/mirna.xml'
        in_fmt = 'blast-xml'
        out_file = 'results.tab'
        out_fmt = 'blast-tab'
        out_kwarg = {'comments': True}
        SearchIO.convert(in_file, in_fmt, out_file, out_fmt, out_kwargs=out_kwarg)
        <stdout> (3, 239, 277, 277)

    Given that different search output files provide different statistics and
    different levels of detail, the convert function is limited to converting
    between formats that have the same statistics, and to formats with the
    same level of detail or less.

    For example, converting from a BLAST+ XML output to a HMMER table file
    is not possible, as these are two search programs with different kinds of
    statistics. In theory, you may provide the necessary values required by the
    HMMER table file (e.g. conditional e-values, envelope coordinates, etc).
    However, these values are likely to hold little meaning as they are not true
    HMMER-computed values.

    Another example is converting from BLAST+ XML to BLAST+ tabular file. This
    is possible, as BLAST+ XML provides all the values necessary to create a
    BLAST+ tabular file. However, the reverse conversion may not be possible.
    There are more details covered in the XML file that are not found in a
    tabular file (e.g. the lambda and kappa values).

    """
    if in_kwargs is None:
        in_kwargs = {}
    if out_kwargs is None:
        out_kwargs = {}

    qresults = parse(in_file, in_format, **in_kwargs)
    return write(qresults, out_file, out_format, **out_kwargs)


# if not used as a module, run the doctest
if __name__ == "__main__":
    from Bio._utils import run_doctest
    run_doctest()