Source Code for Package Bio.SearchIO

  1  # Copyright 2012 by Wibowo Arindrarto.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Biopython interface for sequence search program outputs. 
  7   
  8  The SearchIO submodule provides parsers, indexers, and writers for outputs from 
  9  various sequence search programs. It provides an API similar to SeqIO and 
 10  AlignIO, with the following main functions: `parse`, `read`, `to_dict`, `index`, 
 11  `index_db`, `write`, and `convert`. 
 12   
 13  SearchIO parses a search output file's contents into a hierarchy of four nested 
 14  objects: QueryResult, Hit, HSP, and HSPFragment. Each of them models a part of 
 15  the search output file: 
 16   
 17      - QueryResult represents a search query. This is the main object returned 
 18        by the input functions and it contains all other objects. 
 19      - Hit represents a database hit.
 20      - HSP represents high-scoring alignment region(s) in the hit.
 21      - HSPFragment represents a contiguous alignment within the HSP.
 22   
 23  In addition to the four objects above, SearchIO is also tightly integrated with 
 24  the SeqRecord objects (see SeqIO) and MultipleSeqAlignment objects (see 
 25  AlignIO). SeqRecord objects are used to store the actual matching hit and query 
 26  sequences, while MultipleSeqAlignment objects store the alignment between them.
 27   
 28  Detailed descriptions of these objects' features and usage examples are
 29  available in their respective documentation.
 30   
 31   
 32  Input 
 33  ===== 
 34  The main function for parsing search output files is Bio.SearchIO.parse(...). 
 35  This function parses a given search output file and returns a generator object 
 36  that yields one QueryResult object per iteration. 
 37   
 38  `parse` takes two arguments: 1) a file handle or a filename of the input file 
 39  (the search output file) and 2) the format name. 
 40   
 41      >>> from Bio import SearchIO 
 42      >>> for qresult in SearchIO.parse('Blast/mirna.xml', 'blast-xml'): 
 43      ...     print("%s %s" % (qresult.id, qresult.description)) 
 44      ... 
 45      33211 mir_1 
 46      33212 mir_2 
 47      33213 mir_3 
 48   
 49  SearchIO also provides the Bio.SearchIO.read(...) function, which is intended 
 50  for use on search output files containing only one query. `read` returns one 
 51  QueryResult object and will raise an exception if the source file contains more 
 52  than one query:
 53   
 54      >>> qresult = SearchIO.read('Blast/xml_2226_blastp_004.xml', 'blast-xml') 
 55      >>> print("%s %s" % (qresult.id, qresult.description)) 
 56      ... 
 57      gi|11464971:4-101 pleckstrin [Mus musculus] 
 58   
 59      >>> SearchIO.read('Blast/mirna.xml', 'blast-xml') 
 60      Traceback (most recent call last): 
 61      ... 
 62      ValueError: ... 
 63   
 64  For accessing search results of large output files, you may use the indexing 
 65  functions Bio.SearchIO.index(...) or Bio.SearchIO.index_db(...). They have a 
 66  similar interface to their counterparts in SeqIO and AlignIO, with the addition 
 67  of optional, format-specific keyword arguments. 
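
The idea behind offset-based indexing (storing where each query starts, then
jumping there on demand, as the `index` function's documentation describes)
can be sketched with the standard library alone. This is only an illustration
under stated assumptions: the record format (a line starting with '>') is a
made-up toy format, and `build_offset_index` and `fetch_record` are
hypothetical helpers, not part of SearchIO.

```python
import io


def build_offset_index(handle):
    # Map each record ID to the byte offset where its record starts.
    offsets = {}
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            offsets[line[1:].strip().decode()] = position
    return offsets


def fetch_record(handle, offsets, key):
    # Jump to the stored offset and read until the next record begins.
    handle.seek(offsets[key])
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode()


# Toy "file" with two records; only the offsets are kept in memory.
toy_file = io.BytesIO(b">q1\nhit A\nhit B\n>q2\nhit C\n")
offsets = build_offset_index(toy_file)
print(fetch_record(toy_file, offsets, "q2"))
```

Only the small offset dictionary stays in memory; each record is parsed when
it is requested, which is why this approach suits large files.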
 68   
 69   
 70  Output 
 71  ====== 
 72  SearchIO has writing support for several formats, accessible from the 
 73  Bio.SearchIO.write(...) function. This function returns a tuple of four 
 74  numbers: the number of QueryResult, Hit, HSP, and HSPFragment written:: 
 75   
 76      qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml') 
 77      SearchIO.write(qresults, 'results.tab', 'blast-tab') 
 78      <stdout> (3, 239, 277, 277) 
 79   
 80  Note that different writers may require different attribute values of the 
 81  SearchIO objects. This limits the scope of writable search results to search 
 82  results possessing the required attributes. 
 83   
 84  For example, the writer for HMMER domain table output requires 
 85  the conditional e-value attribute from each HSP object, among others. If you 
 86  try to write to the HMMER domain table format and your HSPs do not have this 
 87  attribute, an exception will be raised. 
 88   
 89   
 90  Conversion 
 91  ========== 
 92  SearchIO provides a shortcut function Bio.SearchIO.convert(...) to convert a 
 93  given file into another format. Under the hood, `convert` simply parses a given 
 94  output file and writes it to another using the `parse` and `write` functions. 
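
The parse-then-write composition described above can be sketched with toy
stand-ins. The functions `toy_parse`, `toy_write`, and `toy_convert`, and the
two text formats they handle, are hypothetical and exist only for this
illustration; they are not SearchIO functions or formats.

```python
import io


def toy_parse(handle):
    # Lazily yield (query_id, hits) pairs from lines like "q1<TAB>hitA,hitB".
    for line in handle:
        query_id, hit_field = line.rstrip("\n").split("\t")
        yield query_id, hit_field.split(",")


def toy_write(records, handle):
    # Write one "query_id,hit" row per hit; return (query count, hit count).
    query_count = hit_count = 0
    for query_id, hits in records:
        query_count += 1
        for hit in hits:
            handle.write("%s,%s\n" % (query_id, hit))
            hit_count += 1
    return query_count, hit_count


def toy_convert(in_handle, out_handle):
    # Feed the parser's generator straight into the writer.
    return toy_write(toy_parse(in_handle), out_handle)


source = io.StringIO("q1\thitA,hitB\nq2\thitC\n")
target = io.StringIO()
counts = toy_convert(source, target)
print(counts)
```

Because the parser is a generator, records flow to the writer one at a time,
so the whole input never has to be held in memory.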
 95   
 96  Note that the same restrictions found in Bio.SearchIO.write(...) apply to the
 97  convert function as well.
 98   
 99   
100  Conventions 
101  =========== 
102  The main goal of creating SearchIO is to have a common, easy-to-use interface
103  across different search output files. As such, we have also created some 
104  conventions / standards for SearchIO that extend beyond the common object model. 
105  These conventions apply to all files parsed by SearchIO, regardless of their 
106  individual formats. 
107   
108  Python-style sequence coordinates 
109  --------------------------------- 
110   
111  When storing sequence coordinates (start and end values), SearchIO uses 
112  the Python-style slice convention: zero-based and half-open intervals. For 
113  example, if in a BLAST XML output file the start and end coordinates of an 
114  HSP are 10 and 28, they would become 9 and 28 in SearchIO. The start 
115  coordinate becomes 9 because Python indices start from zero, while the end 
116  coordinate remains 28 as Python slices omit the last item in an interval. 
117   
118  Besides giving you the benefits of standardization, this convention also
119  makes the coordinates usable for slicing sequences. For example, given a 
120  full query sequence and the start and end coordinates of an HSP, one can 
121  use the coordinates to extract part of the query sequence that results in 
122  the database hit. 
123   
124  When these objects are written to an output file using 
125  SearchIO.write(...), the coordinate values are restored to their 
126  respective format's convention. Using the example above, if the HSP would 
127  be written to an XML file, the start and end coordinates would become 10 
128  and 28 again. 
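
The conversion described in this section can be sketched as follows. The
helper names are hypothetical and this is not SearchIO's actual
implementation; it assumes a source format that uses one-based, closed
coordinates, as in the BLAST XML example above.

```python
# Hypothetical helpers illustrating the coordinate convention; these are
# not part of the SearchIO API.

def to_python_coords(start, end):
    # Convert one-based, closed coordinates to zero-based, half-open ones.
    return start - 1, end


def to_one_based_coords(start, end):
    # Reverse the conversion, e.g. before writing to a one-based format.
    return start + 1, end


query = "ACGT" * 10                      # toy 40-letter query sequence
start, end = to_python_coords(10, 28)    # the example from the text
fragment = query[start:end]              # residues 10 through 28, inclusive
print(start, end, len(fragment))
```

With half-open intervals, `end - start` is the length of the region, and the
pair can be passed directly to a slice.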
129   
130  Sequence coordinate order 
131  ------------------------- 
132   
133  Some search output formats reverse the start and end coordinates
134  according to the sequence's strand. For example, in BLAST plain text
135  format if the matching strand lies in the minus orientation, then the 
136  start coordinate will always be bigger than the end coordinate. 
137   
138  In SearchIO, start coordinates are always smaller than the end 
139  coordinates, regardless of their originating strand. This ensures 
140  consistency when using the coordinates to slice full sequences. 
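
A minimal sketch of this normalization, assuming a source format that reports
minus-strand matches with the start larger than the end. The function name
and the strand inference are assumptions made for illustration only; SearchIO
parsers obtain strand information in format-specific ways.

```python
def normalize_coords(start, end):
    # Return (start, end, strand) with start <= end.
    # Assumption for this sketch: coordinates given in descending order
    # denote the minus strand (-1); otherwise the plus strand (1).
    if start > end:
        return end, start, -1
    return start, end, 1


print(normalize_coords(120, 45))
print(normalize_coords(45, 120))
```

Keeping the strand as a separate value means no information is lost when the
coordinates are reordered.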
141   
142  Note that this coordinate order convention is only enforced at the
143  HSPFragment level. If an HSP object has several HSPFragment objects, each 
144  individual fragment will conform to this convention. But the order of the 
145  fragments within the HSP object follows what the search output file uses. 
146   
147  Similar to the coordinate style convention, the order of the start and end
148  coordinates is restored to each format's own convention when the objects are
149  written using Bio.SearchIO.write(...).
150   
151  Frames and strand values 
152  ------------------------ 
153   
154  SearchIO only allows -1, 0, 1 and None as strand values. For frames, the 
155  only allowed values are integers from -3 to 3 (inclusive) and None. Both 
156  of these are standard Biopython conventions. 
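
The allowed values can be expressed as a simple membership check. The names
below are hypothetical and only illustrate the rule stated above; they are
not how SearchIO itself validates these attributes.

```python
# Allowed values as stated in the text above.
ALLOWED_STRANDS = (-1, 0, 1, None)
ALLOWED_FRAMES = (-3, -2, -1, 0, 1, 2, 3, None)


def check_strand(value):
    # Return the value unchanged if it is an allowed strand, else raise.
    if value not in ALLOWED_STRANDS:
        raise ValueError("Strand must be -1, 0, 1, or None; got %r" % (value,))
    return value


def check_frame(value):
    # Return the value unchanged if it is an allowed frame, else raise.
    if value not in ALLOWED_FRAMES:
        raise ValueError("Frame must be from -3 to 3, or None; got %r" % (value,))
    return value


print(check_strand(-1), check_frame(3), check_frame(None))
```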
157   
158   
159  Supported Formats 
160  ================= 
161  Below is a list of search program output formats supported by SearchIO. 
162   
163  Support for parsing, indexing, and writing: 
164   
165   - blast-tab        - BLAST+ tabular output. Both variants without comments
166                        (-outfmt 6) and with comments (-outfmt 7) are supported.
167   - blast-xml        - BLAST+ XML output. 
168   - blat-psl         - The default output of BLAT (PSL format). Variants with or 
169                        without header are both supported. PSLX (PSL + sequences) 
170                        is also supported. 
171   - hmmer3-tab       - HMMER3 table output. 
172   - hmmer3-domtab    - HMMER3 domain table output. When using this format, the 
173                        program name has to be specified. For example, for parsing 
174                        hmmscan output, the name would be 'hmmscan3-domtab'.
175   
176  Support for parsing and indexing: 
177   
178   - exonerate-text   - Exonerate plain text output. 
179   - exonerate-vulgar - Exonerate vulgar line. 
180   - exonerate-cigar  - Exonerate cigar line. 
181   - fasta-m10        - Bill Pearson's FASTA -m 10 output. 
182   - hmmer3-text      - HMMER3 regular text output format. Supported HMMER3 
183                        subprograms are hmmscan, hmmsearch, and phmmer. 
184   - hmmer2-text      - HMMER2 regular text output format. Supported HMMER2 
185                        subprograms are hmmpfam, hmmsearch. 
186   
187  Support for parsing: 
188   
189   - blast-text       - BLAST+ plain text output.
      - interproscan-xml - InterProScan XML output.
190   
191  Each of these formats has different keyword arguments available for use with
192  the main SearchIO functions. More details and examples are available in each
193  format's documentation.
194   
195  """ 
196   
197  from __future__ import print_function 
198  from Bio._py3k import basestring 
199   
200  import sys 
201   
202  from Bio.File import as_handle 
203  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
204  from Bio.SearchIO._utils import get_processor 
205   
206   
207  __all__ = ('read', 'parse', 'to_dict', 'index', 'index_db', 'write', 'convert') 
208   
209   
210  # dictionary of supported formats for parse() and read() 
211  _ITERATOR_MAP = { 
212          'blast-tab': ('BlastIO', 'BlastTabParser'), 
213          'blast-text': ('BlastIO', 'BlastTextParser'), 
214          'blast-xml': ('BlastIO', 'BlastXmlParser'), 
215          'blat-psl': ('BlatIO', 'BlatPslParser'), 
216          'exonerate-cigar': ('ExonerateIO', 'ExonerateCigarParser'), 
217          'exonerate-text': ('ExonerateIO', 'ExonerateTextParser'), 
218          'exonerate-vulgar': ('ExonerateIO', 'ExonerateVulgarParser'), 
219          'fasta-m10': ('FastaIO', 'FastaM10Parser'), 
220          'hmmer2-text': ('HmmerIO', 'Hmmer2TextParser'), 
221          'hmmer3-text': ('HmmerIO', 'Hmmer3TextParser'), 
222          'hmmer3-tab': ('HmmerIO', 'Hmmer3TabParser'), 
223          # for hmmer3-domtab, the specific program is part of the format name 
224          # as we need it to distinguish hit / target coordinates
225          'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitParser'), 
226          'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryParser'), 
227          'interproscan-xml': ('InterproscanIO', 'InterproscanXmlParser'), 
228          'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryParser'), 
229  } 
230   
231  # dictionary of supported formats for index() 
232  _INDEXER_MAP = { 
233          'blast-tab': ('BlastIO', 'BlastTabIndexer'), 
234          'blast-xml': ('BlastIO', 'BlastXmlIndexer'), 
235          'blat-psl': ('BlatIO', 'BlatPslIndexer'), 
236          'exonerate-cigar': ('ExonerateIO', 'ExonerateCigarIndexer'), 
237          'exonerate-text': ('ExonerateIO', 'ExonerateTextIndexer'), 
238          'exonerate-vulgar': ('ExonerateIO', 'ExonerateVulgarIndexer'), 
239          'fasta-m10': ('FastaIO', 'FastaM10Indexer'), 
240          'hmmer2-text': ('HmmerIO', 'Hmmer2TextIndexer'), 
241          'hmmer3-text': ('HmmerIO', 'Hmmer3TextIndexer'), 
242          'hmmer3-tab': ('HmmerIO', 'Hmmer3TabIndexer'), 
243          'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitIndexer'), 
244          'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryIndexer'), 
245          'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryIndexer'), 
246  } 
247   
248  # dictionary of supported formats for write() 
249  _WRITER_MAP = { 
250          'blast-tab': ('BlastIO', 'BlastTabWriter'), 
251          'blast-xml': ('BlastIO', 'BlastXmlWriter'), 
252          'blat-psl': ('BlatIO', 'BlatPslWriter'), 
253          'hmmer3-tab': ('HmmerIO', 'Hmmer3TabWriter'), 
254          'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitWriter'), 
255          'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryWriter'), 
256          'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryWriter'), 
257  } 
258   
259   
def parse(handle, format=None, **kwargs):
    """Iterate over search tool output file as QueryResult objects.

    Arguments:
     - handle - Handle to the file, or the filename as a string.
     - format - Lower case string denoting one of the supported formats.
     - kwargs - Format-specific keyword arguments.

    This function is used to iterate over each query in a given search output
    file:

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml')
    >>> qresults
    <generator object ...>
    >>> for qresult in qresults:
    ...     print("Search %s has %i hits" % (qresult.id, len(qresult)))
    ...
    Search 33211 has 100 hits
    Search 33212 has 44 hits
    Search 33213 has 95 hits

    Depending on the file format, `parse` may also accept additional keyword
    argument(s) that modifies the behavior of the format parser. Here is a
    simple example, where the keyword argument enables parsing of a commented
    BLAST tabular output file:

    >>> from Bio import SearchIO
    >>> for qresult in SearchIO.parse('Blast/mirna.tab', 'blast-tab', comments=True):
    ...     print("Search %s has %i hits" % (qresult.id, len(qresult)))
    ...
    Search 33211 has 100 hits
    Search 33212 has 44 hits
    Search 33213 has 95 hits

    """
    # get the iterator object and do error checking
    iterator = get_processor(format, _ITERATOR_MAP)

    # HACK: force BLAST XML decoding to use utf-8
    handle_kwargs = {}
    if format == 'blast-xml' and sys.version_info[0] > 2:
        handle_kwargs['encoding'] = 'utf-8'

    # and start iterating
    with as_handle(handle, 'rU', **handle_kwargs) as source_file:
        generator = iterator(source_file, **kwargs)

        for qresult in generator:
            yield qresult


def read(handle, format=None, **kwargs):
    """Turn a search output file containing one query into a single QueryResult.

     - handle - Handle to the file, or the filename as a string.
     - format - Lower case string denoting one of the supported formats.
     - kwargs - Format-specific keyword arguments.

    `read` is used for parsing search output files containing exactly one query:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/xml_2226_blastp_004.xml', 'blast-xml')
    >>> print("%s %s" % (qresult.id, qresult.description))
    ...
    gi|11464971:4-101 pleckstrin [Mus musculus]

    If the given handle has no results, an exception will be raised:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/tab_2226_tblastn_002.txt', 'blast-tab')
    Traceback (most recent call last):
    ...
    ValueError: No query results found in handle

    Similarly, if the given handle has more than one result, an exception will
    also be raised:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/tab_2226_tblastn_001.txt', 'blast-tab')
    Traceback (most recent call last):
    ...
    ValueError: More than one query results found in handle

    Like `parse`, `read` may also accept keyword argument(s) depending on the
    search output file format.

    """
    generator = parse(handle, format, **kwargs)

    try:
        first = next(generator)
    except StopIteration:
        raise ValueError("No query results found in handle")
    else:
        try:
            second = next(generator)
        except StopIteration:
            second = None

        if second is not None:
            raise ValueError("More than one query results found in handle")

    return first


def to_dict(qresults, key_function=lambda rec: rec.id):
    """Turn a QueryResult iterator or list into a dictionary.

     - qresults     - Iterable returning QueryResult objects.
     - key_function - Optional callback function which when given a
                      QueryResult object should return a unique key for the
                      dictionary.

    This function enables access of QueryResult objects from a single search
    output file using its identifier.

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/wnts.xml', 'blast-xml')
    >>> search_dict = SearchIO.to_dict(qresults)
    >>> sorted(search_dict)
    ['gi|156630997:105-1160', ..., 'gi|371502086:108-1205', 'gi|53729353:216-1313']
    >>> search_dict['gi|156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)

    By default, the dictionary key is the QueryResult's string ID. This may be
    changed by supplying a callback function that returns the desired identifier.
    Here is an example using a function that removes the 'gi|' part in the
    beginning of the QueryResult ID.

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/wnts.xml', 'blast-xml')
    >>> key_func = lambda qresult: qresult.id.split('|')[1]
    >>> search_dict = SearchIO.to_dict(qresults, key_func)
    >>> sorted(search_dict)
    ['156630997:105-1160', ..., '371502086:108-1205', '53729353:216-1313']
    >>> search_dict['156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)

    Note that the callback function does not change the QueryResult's ID value.
    It only changes the key value used to retrieve the associated QueryResult.

    As this function loads all QueryResult objects into memory, it may be
    unsuitable for dealing with files containing many queries. In that case, it
    is recommended that you use either `index` or `index_db`.

    """
    qdict = {}
    for qresult in qresults:
        key = key_function(qresult)
        if key in qdict:
            raise ValueError("Duplicate key %r" % key)
        qdict[key] = qresult
    return qdict


def index(filename, format=None, key_function=None, **kwargs):
    """Indexes a search output file and returns a dictionary-like object.

     - filename     - string giving name of file to be indexed
     - format       - Lower case string denoting one of the supported formats.
     - key_function - Optional callback function which when given a
                      QueryResult should return a unique key for the dictionary.
     - kwargs       - Format-specific keyword arguments.

    Index returns a pseudo-dictionary object with QueryResult objects as its
    values and a string identifier as its keys. The function is mainly useful
    for dealing with large search output files, as it enables access to any
    given QueryResult object much faster than using parse or read.

    Index works by storing in memory the start locations of all queries in a
    file. When a user requests access to a query, the index jumps to its
    start position, parses the whole query, and returns it as a QueryResult
    object:

    >>> from Bio import SearchIO
    >>> search_idx = SearchIO.index('Blast/wnts.xml', 'blast-xml')
    >>> search_idx
    SearchIO.index('Blast/wnts.xml', 'blast-xml', key_function=None)
    >>> sorted(search_idx)
    ['gi|156630997:105-1160', 'gi|195230749:301-1383', ..., 'gi|53729353:216-1313']
    >>> search_idx['gi|195230749:301-1383']
    QueryResult(id='gi|195230749:301-1383', 5 hits)
    >>> search_idx.close()

    If the file is BGZF compressed, this is detected automatically. Ordinary
    GZIP files are not supported:

    >>> from Bio import SearchIO
    >>> search_idx = SearchIO.index('Blast/wnts.xml.bgz', 'blast-xml')
    >>> search_idx
    SearchIO.index('Blast/wnts.xml.bgz', 'blast-xml', key_function=None)
    >>> search_idx['gi|195230749:301-1383']
    QueryResult(id='gi|195230749:301-1383', 5 hits)
    >>> search_idx.close()

    You can supply a custom callback function to alter the default identifier
    string. This function should accept as its input the QueryResult ID string
    and return a modified version of it.

    >>> from Bio import SearchIO
    >>> key_func = lambda id: id.split('|')[1]
    >>> search_idx = SearchIO.index('Blast/wnts.xml', 'blast-xml', key_func)
    >>> search_idx
    SearchIO.index('Blast/wnts.xml', 'blast-xml', key_function=<function <lambda> at ...>)
    >>> sorted(search_idx)
    ['156630997:105-1160', ..., '371502086:108-1205', '53729353:216-1313']
    >>> search_idx['156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)
    >>> search_idx.close()

    Note that the callback function does not change the QueryResult's ID value.
    It only changes the key value used to retrieve the associated QueryResult.

    """
    if not isinstance(filename, basestring):
        raise TypeError("Need a filename (not a handle)")

    from Bio.File import _IndexedSeqFileDict
    proxy_class = get_processor(format, _INDEXER_MAP)
    repr = "SearchIO.index(%r, %r, key_function=%r)" \
        % (filename, format, key_function)
    return _IndexedSeqFileDict(proxy_class(filename, **kwargs),
                               key_function, repr, "QueryResult")


def index_db(index_filename, filenames=None, format=None,
             key_function=None, **kwargs):
    """Indexes several search output files into an SQLite database.

     - index_filename - The SQLite filename.
     - filenames    - List of strings specifying file(s) to be indexed, or when
                      indexing a single file this can be given as a string.
                      (optional if reloading an existing index, but must match)
     - format       - Lower case string denoting one of the supported formats.
                      (optional if reloading an existing index, but must match)
     - key_function - Optional callback function which when given a
                      QueryResult identifier string should return a unique
                      key for the dictionary.
     - kwargs       - Format-specific keyword arguments.

    The `index_db` function is similar to `index` in that it indexes the start
    position of all queries from search output files. The main difference is
    instead of storing these indices in-memory, they are written to disk as an
    SQLite database file. This allows the indices to persist between Python
    sessions. This enables access to any queries in the file without any
    indexing overhead, provided it has been indexed at least once.

    >>> from Bio import SearchIO
    >>> idx_filename = ":memory:" # Use a real filename, this is in RAM only!
    >>> db_idx = SearchIO.index_db(idx_filename, 'Blast/mirna.xml', 'blast-xml')
    >>> sorted(db_idx)
    ['33211', '33212', '33213']
    >>> db_idx['33212']
    QueryResult(id='33212', 44 hits)
    >>> db_idx.close()

    `index_db` can also index multiple files and store them in the same
    database, making it easier to group multiple search files and access them
    from a single interface.

    >>> from Bio import SearchIO
    >>> idx_filename = ":memory:" # Use a real filename, this is in RAM only!
    >>> files = ['Blast/mirna.xml', 'Blast/wnts.xml']
    >>> db_idx = SearchIO.index_db(idx_filename, files, 'blast-xml')
    >>> sorted(db_idx)
    ['33211', '33212', '33213', 'gi|156630997:105-1160', ..., 'gi|53729353:216-1313']
    >>> db_idx['33212']
    QueryResult(id='33212', 44 hits)
    >>> db_idx.close()

    One common example where this is helpful is if you had a large set of
    query sequences (say ten thousand) which you split into ten query files
    of one thousand sequences each in order to run as ten separate BLAST jobs
    on a cluster. You could use `index_db` to index the ten BLAST output
    files together for seamless access to all the results as one dictionary.

    Note that ':memory:' rather than an index filename tells SQLite to hold
    the index database in memory. This is useful for quick tests, but using
    the Bio.SearchIO.index(...) function instead would use less memory.

    BGZF compressed files are supported, and detected automatically. Ordinary
    GZIP compressed files are not supported.

    See also Bio.SearchIO.index(), Bio.SearchIO.to_dict(), and the Python module
    glob which is useful for building lists of files.
    """
    # cast filenames to list if it's a string
    # (can we check if it's a string or a generator?)
    if isinstance(filenames, basestring):
        filenames = [filenames]

    from Bio.File import _SQLiteManySeqFilesDict
    repr = ("SearchIO.index_db(%r, filenames=%r, format=%r, key_function=%r, ...)"
            % (index_filename, filenames, format, key_function))

    def proxy_factory(format, filename=None):
        """Given a filename returns proxy object, else boolean if format OK."""
        if filename:
            return get_processor(format, _INDEXER_MAP)(filename, **kwargs)
        else:
            return format in _INDEXER_MAP

    return _SQLiteManySeqFilesDict(index_filename, filenames,
                                   proxy_factory, format,
                                   key_function, repr)


def write(qresults, handle, format=None, **kwargs):
    """Write QueryResult objects to a file in the given format.

     - qresults - An iterator returning QueryResult objects or a single
                  QueryResult object.
     - handle   - Handle to the file, or the filename as a string.
     - format   - Lower case string denoting one of the supported formats.
     - kwargs   - Format-specific keyword arguments.

    The `write` function writes QueryResult object(s) into the given output
    handle / filename. You can supply it with a single QueryResult object or an
    iterable returning one or more QueryResult objects. In both cases, the
    function will return a tuple of four values: the number of QueryResult, Hit,
    HSP, and HSPFragment objects it writes to the output file::

        from Bio import SearchIO
        qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml')
        SearchIO.write(qresults, 'results.tab', 'blast-tab')
        <stdout> (3, 239, 277, 277)

    The output of different formats may be adjusted using the format-specific
    keyword arguments. Here is an example that writes BLAT PSL output file with
    a header::

        from Bio import SearchIO
        qresults = SearchIO.parse('Blat/psl_34_001.psl', 'blat-psl')
        SearchIO.write(qresults, 'results.tab', 'blat-psl', header=True)
        <stdout> (2, 13, 22, 26)

    """
    # turn qresults into an iterator if it's a single QueryResult object
    if isinstance(qresults, QueryResult):
        qresults = iter([qresults])
    else:
        qresults = iter(qresults)

    # get the writer object and do error checking
    writer_class = get_processor(format, _WRITER_MAP)

    # write to the handle
    with as_handle(handle, 'w') as target_file:
        writer = writer_class(target_file, **kwargs)
        # count how many qresults, hits, and hsps
        qresult_count, hit_count, hsp_count, frag_count = writer.write_file(qresults)

    return qresult_count, hit_count, hsp_count, frag_count


def convert(in_file, in_format, out_file, out_format, in_kwargs=None,
            out_kwargs=None):
    """Convert between two search output formats, return number of records.

     - in_file    - Handle to the input file, or the filename as string.
     - in_format  - Lower case string denoting the format of the input file.
     - out_file   - Handle to the output file, or the filename as string.
     - out_format - Lower case string denoting the format of the output file.
     - in_kwargs  - Dictionary of keyword arguments for the input function.
     - out_kwargs - Dictionary of keyword arguments for the output function.

    The convert function is a shortcut function for `parse` and `write`. It has
    the same return type as `write`. Format-specific arguments may be passed to
    the convert function, but only as dictionaries.

    Here is an example of using `convert` to convert from a BLAST+ XML file
    into a tabular file with comments::

        from Bio import SearchIO
        in_file = 'Blast/mirna.xml'
        in_fmt = 'blast-xml'
        out_file = 'results.tab'
        out_fmt = 'blast-tab'
        out_kwarg = {'comments': True}
        SearchIO.convert(in_file, in_fmt, out_file, out_fmt, out_kwargs=out_kwarg)
        <stdout> (3, 239, 277, 277)

    Given that different search output files provide different statistics and
    different levels of detail, the convert function is limited to converting
    between formats that have the same statistics, and to target formats with
    the same level of detail or less.

    For example, converting from a BLAST+ XML output to a HMMER table file
    is not possible, as these are two search programs with different kinds of
    statistics. In theory, you may provide the necessary values required by the
    HMMER table file (e.g. conditional e-values, envelope coordinates, etc).
    However, these values are likely to hold little meaning as they are not true
    HMMER-computed values.

    Another example is converting from BLAST+ XML to BLAST+ tabular file. This
    is possible, as BLAST+ XML provides all the values necessary to create a
    BLAST+ tabular file. However, the reverse conversion may not be possible.
    There are more details covered in the XML file that are not found in a
    tabular file (e.g. the lambda and kappa values).

    """
    if in_kwargs is None:
        in_kwargs = {}
    if out_kwargs is None:
        out_kwargs = {}

    qresults = parse(in_file, in_format, **in_kwargs)
    return write(qresults, out_file, out_format, **out_kwargs)


# if not used as a module, run the doctest
if __name__ == "__main__":
    from Bio._utils import run_doctest
    run_doctest()