Source Code for Package Bio.SearchIO

# Copyright 2012 by Wibowo Arindrarto.  All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.

"""Biopython interface for sequence search program outputs.

The SearchIO submodule provides parsers, indexers, and writers for outputs from
various sequence search programs. It provides an API similar to SeqIO and
AlignIO, with the following main functions: `parse`, `read`, `to_dict`, `index`,
`index_db`, `write`, and `convert`.

SearchIO parses a search output file's contents into a hierarchy of four nested
objects: QueryResult, Hit, HSP, and HSPFragment. Each of them models a part of
the search output file:

    - QueryResult represents a search query. This is the main object returned
      by the input functions and it contains all other objects.
    - Hit represents a database hit,
    - HSP represents high-scoring alignment region(s) in the hit,
    - HSPFragment represents a contiguous alignment within the HSP

In addition to the four objects above, SearchIO is also tightly integrated with
the SeqRecord objects (see SeqIO) and MultipleSeqAlignment objects (see
AlignIO). SeqRecord objects are used to store the actual matching hit and query
sequences, while MultipleSeqAlignment objects store the alignment between them.

A detailed description of these objects' features, along with usage examples,
is available in their respective documentation.


Input
=====
The main function for parsing search output files is Bio.SearchIO.parse(...).
This function parses a given search output file and returns a generator object
that yields one QueryResult object per iteration.

`parse` takes two arguments: 1) a file handle or a filename of the input file
(the search output file) and 2) the format name.

    >>> from Bio import SearchIO
    >>> for qresult in SearchIO.parse('Blast/mirna.xml', 'blast-xml'):
    ...     print("%s %s" % (qresult.id, qresult.description))
    ...
    33211 mir_1
    33212 mir_2
    33213 mir_3

SearchIO also provides the Bio.SearchIO.read(...) function, which is intended
for use on search output files containing only one query. `read` returns one
QueryResult object and will raise an exception if the source file contains no
query or more than one query:

    >>> qresult = SearchIO.read('Blast/xml_2226_blastp_004.xml', 'blast-xml')
    >>> print("%s %s" % (qresult.id, qresult.description))
    ...
    gi|11464971:4-101 pleckstrin [Mus musculus]

    >>> SearchIO.read('Blast/mirna.xml', 'blast-xml')
    Traceback (most recent call last):
    ...
    ValueError: ...
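
As a rough, stand-alone illustration of this exactly-one-query rule, the sketch
below applies the same check to any iterable. The helper name `read_single` and
its error messages are hypothetical and are not part of the SearchIO API.

```python
def read_single(items):
    """Return the only item of an iterable, or raise ValueError."""
    iterator = iter(items)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No items found")
    # Reaching a second item means the input held more than one record.
    for _ in iterator:
        raise ValueError("More than one item found")
    return first


only = read_single(["query_1"])  # a single-item input is returned as-is
```

Passing an empty list, or a list with two or more items, raises ValueError in
this sketch.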

For accessing search results of large output files, you may use the indexing
functions Bio.SearchIO.index(...) or Bio.SearchIO.index_db(...). They have a
similar interface to their counterparts in SeqIO and AlignIO, with the addition
of optional, format-specific keyword arguments.
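
The docstring of `index` describes the approach as remembering where each query
starts and jumping there on demand. The sketch below pictures that idea on a toy
format. The record layout (a line starting with `>` opens a record) and the
helper names `build_offset_index` and `fetch_record` are assumptions made for
this illustration only; the real indexers are format-specific.

```python
import io


def build_offset_index(handle):
    """Map each record ID to the byte offset where the record starts."""
    offsets = {}
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            offsets[line[1:].strip().decode()] = position
    return offsets


def fetch_record(handle, offsets, key):
    """Seek to a record's start and read lines until the next record."""
    handle.seek(offsets[key])
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode()


# Hypothetical two-record input held in memory.
data = io.BytesIO(b">q1\nhit_a\nhit_b\n>q2\nhit_c\n")
offsets = build_offset_index(data)
record = fetch_record(data, offsets, "q2")
```

Only the small `offsets` dictionary stays in memory; each record is read from
the file when it is requested.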


Output
======
SearchIO has writing support for several formats, accessible from the
Bio.SearchIO.write(...) function. This function returns a tuple of four
numbers: the number of QueryResult, Hit, HSP, and HSPFragment objects written::

    qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml')
    SearchIO.write(qresults, 'results.tab', 'blast-tab')
    <stdout> (3, 239, 277, 277)

Note that different writers may require different attribute values of the
SearchIO objects. This limits the scope of writable search results to search
results possessing the required attributes.

For example, the writer for HMMER domain table output requires
the conditional e-value attribute from each HSP object, among others. If you
try to write to the HMMER domain table format and your HSPs do not have this
attribute, an exception will be raised.


Conversion
==========
SearchIO provides a shortcut function Bio.SearchIO.convert(...) to convert a
given file into another format. Under the hood, `convert` simply parses a given
output file and writes it to another using the `parse` and `write` functions.

Note that the same restrictions found in Bio.SearchIO.write(...) apply to the
convert function as well.


Conventions
===========
The main goal of creating SearchIO is to have a common, easy to use interface
across different search output files. As such, we have also created some
conventions / standards for SearchIO that extend beyond the common object model.
These conventions apply to all files parsed by SearchIO, regardless of their
individual formats.

Python-style sequence coordinates
---------------------------------

When storing sequence coordinates (start and end values), SearchIO uses
the Python-style slice convention: zero-based and half-open intervals. For
example, if in a BLAST XML output file the start and end coordinates of an
HSP are 10 and 28, they would become 9 and 28 in SearchIO. The start
coordinate becomes 9 because Python indices start from zero, while the end
coordinate remains 28 as Python slices omit the last item in an interval.

Besides giving you the benefits of standardization, this convention also
makes the coordinates usable for slicing sequences. For example, given a
full query sequence and the start and end coordinates of an HSP, one can
use the coordinates to extract the part of the query sequence that produced
the database hit.

When these objects are written to an output file using
SearchIO.write(...), the coordinate values are restored to their
respective format's convention. Using the example above, if the HSP were
written to an XML file, the start and end coordinates would become 10
and 28 again.
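
The conversion described in this subsection amounts to shifting only the start
coordinate. A minimal sketch follows; the function names and the 40-letter
query sequence are made up for illustration and are not SearchIO functions.

```python
def to_python_coords(start, end):
    """Convert 1-based, closed coordinates to 0-based, half-open ones."""
    return start - 1, end


def to_one_based_coords(start, end):
    """Convert 0-based, half-open coordinates back to 1-based, closed ones."""
    return start + 1, end


# The BLAST XML example from the text: 10 and 28 become 9 and 28.
start, end = to_python_coords(10, 28)

# Hypothetical query sequence; the coordinates can be used as a slice directly.
query = "ACGT" * 10
aligned_part = query[start:end]
```

The slice holds 19 letters, the same count as the closed interval from 10 to 28,
and `to_one_based_coords(start, end)` gives back 10 and 28.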

Sequence coordinate order
-------------------------

Some search output formats reverse the start and end coordinates
according to the sequence's strand. For example, in BLAST plain text
format if the matching strand lies in the minus orientation, then the
start coordinate will always be bigger than the end coordinate.

In SearchIO, start coordinates are always smaller than the end
coordinates, regardless of their originating strand. This ensures
consistency when using the coordinates to slice full sequences.

Note that this coordinate order convention is only enforced at the
HSPFragment level. If an HSP object has several HSPFragment objects, each
individual fragment will conform to this convention. But the order of the
fragments within the HSP object follows what the search output file uses.

Similar to the coordinate style convention, the order of the start and end
coordinates is restored to that of the respective format when the objects
are written using Bio.SearchIO.write(...).
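
One way to picture this ordering rule is sketched below. The helper names are
hypothetical, the strand is inferred purely from the coordinate order (an
assumption of this sketch, since real parsers read it from the file), and the
zero-based shift from the previous subsection is left out to keep it short.

```python
def normalize_coords(start, end):
    """Return (start, end, strand) with start <= end.

    Assumption of this sketch: reversed coordinates mean the minus strand.
    """
    if start > end:
        return end, start, -1
    return start, end, 1


def restore_coords(start, end, strand):
    """Put minus-strand coordinates back into their reversed order."""
    if strand == -1:
        return end, start
    return start, end


# A minus-strand match reported as 120..35 is stored as 35..120.
stored = normalize_coords(120, 35)
written = restore_coords(*stored)
```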

Frames and strand values
------------------------

SearchIO only allows -1, 0, 1 and None as strand values. For frames, the
only allowed values are integers from -3 to 3 (inclusive) and None. Both
of these are standard Biopython conventions.
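
The allowed values can be summarized as a small validation sketch. The names
below and the choice of ValueError are assumptions for illustration; the text
above does not say how SearchIO itself reports an invalid value.

```python
ALLOWED_STRANDS = (-1, 0, 1, None)
ALLOWED_FRAMES = (-3, -2, -1, 0, 1, 2, 3, None)


def check_strand(value):
    """Return the value if it is an allowed strand, else raise ValueError."""
    if value not in ALLOWED_STRANDS:
        raise ValueError("Strand must be -1, 0, 1, or None; got %r" % (value,))
    return value


def check_frame(value):
    """Return the value if it is an allowed frame, else raise ValueError."""
    if value not in ALLOWED_FRAMES:
        raise ValueError("Frame must be in -3..3 or None; got %r" % (value,))
    return value
```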


Supported Formats
=================
Below is a list of search program output formats supported by SearchIO.

Support for parsing, indexing, and writing:

 - blast-tab        - BLAST+ tabular output. Both variants without comments
                      (-outfmt 6 flag) and with comments (-outfmt 7 flag) are
                      supported.
 - blast-xml        - BLAST+ XML output.
 - blat-psl         - The default output of BLAT (PSL format). Variants with or
                      without header are both supported. PSLX (PSL + sequences)
                      is also supported.
 - hmmer3-tab       - HMMER3 table output.
 - hmmer3-domtab    - HMMER3 domain table output. When using this format, the
                      program name has to be specified. For example, for parsing
                      hmmscan output, the name would be 'hmmscan3-domtab'.

Support for parsing and indexing:

 - exonerate-text   - Exonerate plain text output.
 - exonerate-vulgar - Exonerate vulgar line.
 - exonerate-cigar  - Exonerate cigar line.
 - fasta-m10        - Bill Pearson's FASTA -m 10 output.
 - hmmer3-text      - HMMER3 regular text output format. Supported HMMER3
                      subprograms are hmmscan, hmmsearch, and phmmer.
 - hmmer2-text      - HMMER2 regular text output format. Supported HMMER2
                      subprograms are hmmpfam and hmmsearch.

Support for parsing:

 - blast-text       - BLAST+ plain text output.

Each of these formats has different keyword arguments available for use with
the main SearchIO functions. More details and examples are available in each
format's documentation.

"""

from __future__ import print_function
from Bio._py3k import basestring

import sys
import warnings

from Bio import BiopythonExperimentalWarning
from Bio.File import as_handle
from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment
from Bio.SearchIO._utils import get_processor


warnings.warn('Bio.SearchIO is an experimental submodule which may undergo '
        'significant changes prior to its future official release.',
        BiopythonExperimentalWarning)


__all__ = ['read', 'parse', 'to_dict', 'index', 'index_db', 'write', 'convert']

__docformat__ = "restructuredtext en"


# dictionary of supported formats for parse() and read()
_ITERATOR_MAP = {
        'blast-tab': ('BlastIO', 'BlastTabParser'),
        'blast-text': ('BlastIO', 'BlastTextParser'),
        'blast-xml': ('BlastIO', 'BlastXmlParser'),
        'blat-psl': ('BlatIO', 'BlatPslParser'),
        'exonerate-cigar': ('ExonerateIO', 'ExonerateCigarParser'),
        'exonerate-text': ('ExonerateIO', 'ExonerateTextParser'),
        'exonerate-vulgar': ('ExonerateIO', 'ExonerateVulgarParser'),
        'fasta-m10': ('FastaIO', 'FastaM10Parser'),
        'hmmer2-text': ('HmmerIO', 'Hmmer2TextParser'),
        'hmmer3-text': ('HmmerIO', 'Hmmer3TextParser'),
        'hmmer3-tab': ('HmmerIO', 'Hmmer3TabParser'),
        # for hmmer3-domtab, the specific program is part of the format name
        # as we need it to distinguish hit / target coordinates
        'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitParser'),
        'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryParser'),
        'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryParser'),
}

# dictionary of supported formats for index()
_INDEXER_MAP = {
        'blast-tab': ('BlastIO', 'BlastTabIndexer'),
        'blast-xml': ('BlastIO', 'BlastXmlIndexer'),
        'blat-psl': ('BlatIO', 'BlatPslIndexer'),
        'exonerate-cigar': ('ExonerateIO', 'ExonerateCigarIndexer'),
        'exonerate-text': ('ExonerateIO', 'ExonerateTextIndexer'),
        'exonerate-vulgar': ('ExonerateIO', 'ExonerateVulgarIndexer'),
        'fasta-m10': ('FastaIO', 'FastaM10Indexer'),
        'hmmer2-text': ('HmmerIO', 'Hmmer2TextIndexer'),
        'hmmer3-text': ('HmmerIO', 'Hmmer3TextIndexer'),
        'hmmer3-tab': ('HmmerIO', 'Hmmer3TabIndexer'),
        'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitIndexer'),
        'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryIndexer'),
        'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryIndexer'),
}

# dictionary of supported formats for write()
_WRITER_MAP = {
        'blast-tab': ('BlastIO', 'BlastTabWriter'),
        'blast-xml': ('BlastIO', 'BlastXmlWriter'),
        'blat-psl': ('BlatIO', 'BlatPslWriter'),
        'hmmer3-tab': ('HmmerIO', 'Hmmer3TabWriter'),
        'hmmscan3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmhitWriter'),
        'hmmsearch3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryWriter'),
        'phmmer3-domtab': ('HmmerIO', 'Hmmer3DomtabHmmqueryWriter'),
}


def parse(handle, format=None, **kwargs):
    """Turns a search output file into a generator that yields QueryResult
    objects.

     - handle - Handle to the file, or the filename as a string.
     - format - Lower case string denoting one of the supported formats.
     - kwargs - Format-specific keyword arguments.

    This function is used to iterate over each query in a given search output
    file:

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml')
    >>> qresults
    <generator object ...>
    >>> for qresult in qresults:
    ...     print("Search %s has %i hits" % (qresult.id, len(qresult)))
    ...
    Search 33211 has 100 hits
    Search 33212 has 44 hits
    Search 33213 has 95 hits

    Depending on the file format, `parse` may also accept additional keyword
    argument(s) that modify the behavior of the format parser. Here is a
    simple example, where the keyword argument enables parsing of a commented
    BLAST tabular output file:

    >>> from Bio import SearchIO
    >>> for qresult in SearchIO.parse('Blast/mirna.tab', 'blast-tab', comments=True):
    ...     print("Search %s has %i hits" % (qresult.id, len(qresult)))
    ...
    Search 33211 has 100 hits
    Search 33212 has 44 hits
    Search 33213 has 95 hits

    """
    # get the iterator object and do error checking
    iterator = get_processor(format, _ITERATOR_MAP)

    # HACK: force BLAST XML decoding to use utf-8
    handle_kwargs = {}
    if format == 'blast-xml' and sys.version_info[0] > 2:
        handle_kwargs['encoding'] = 'utf-8'

    # and start iterating
    with as_handle(handle, 'rU', **handle_kwargs) as source_file:
        generator = iterator(source_file, **kwargs)

        for qresult in generator:
            yield qresult


def read(handle, format=None, **kwargs):
    """Turns a search output file containing one query into a single QueryResult.

     - handle - Handle to the file, or the filename as a string.
     - format - Lower case string denoting one of the supported formats.
     - kwargs - Format-specific keyword arguments.

    `read` is used for parsing search output files containing exactly one query:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/xml_2226_blastp_004.xml', 'blast-xml')
    >>> print("%s %s" % (qresult.id, qresult.description))
    ...
    gi|11464971:4-101 pleckstrin [Mus musculus]

    If the given handle has no results, an exception will be raised:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/tab_2226_tblastn_002.txt', 'blast-tab')
    Traceback (most recent call last):
    ...
    ValueError: No query results found in handle

    Similarly, if the given handle has more than one result, an exception will
    also be raised:

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read('Blast/tab_2226_tblastn_001.txt', 'blast-tab')
    Traceback (most recent call last):
    ...
    ValueError: More than one query results found in handle

    Like `parse`, `read` may also accept keyword argument(s) depending on the
    search output file format.

    """
    generator = parse(handle, format, **kwargs)

    try:
        first = next(generator)
    except StopIteration:
        raise ValueError("No query results found in handle")
    else:
        try:
            second = next(generator)
        except StopIteration:
            second = None

    if second is not None:
        raise ValueError("More than one query results found in handle")

    return first


def to_dict(qresults, key_function=lambda rec: rec.id):
    """Turns a QueryResult iterator or list into a dictionary.

     - qresults     - Iterable returning QueryResult objects.
     - key_function - Optional callback function which when given a
                      QueryResult object should return a unique key for the
                      dictionary.

    This function enables access to QueryResult objects from a single search
    output file using their identifiers.

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/wnts.xml', 'blast-xml')
    >>> search_dict = SearchIO.to_dict(qresults)
    >>> sorted(search_dict)
    ['gi|156630997:105-1160', ..., 'gi|371502086:108-1205', 'gi|53729353:216-1313']
    >>> search_dict['gi|156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)

    By default, the dictionary key is the QueryResult's string ID. This may be
    changed by supplying a callback function that returns the desired identifier.
    Here is an example using a function that removes the 'gi|' part at the
    beginning of the QueryResult ID.

    >>> from Bio import SearchIO
    >>> qresults = SearchIO.parse('Blast/wnts.xml', 'blast-xml')
    >>> key_func = lambda qresult: qresult.id.split('|')[1]
    >>> search_dict = SearchIO.to_dict(qresults, key_func)
    >>> sorted(search_dict)
    ['156630997:105-1160', ..., '371502086:108-1205', '53729353:216-1313']
    >>> search_dict['156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)

    Note that the callback function does not change the QueryResult's ID value.
    It only changes the key value used to retrieve the associated QueryResult.

    As this function loads all QueryResult objects into memory, it may be
    unsuitable for dealing with files containing many queries. In that case, it
    is recommended that you use either `index` or `index_db`.

    """
    qdict = {}
    for qresult in qresults:
        key = key_function(qresult)
        if key in qdict:
            raise ValueError("Duplicate key %r" % key)
        qdict[key] = qresult
    return qdict


def index(filename, format=None, key_function=None, **kwargs):
    """Indexes a search output file and returns a dictionary-like object.

     - filename     - string giving name of file to be indexed
     - format       - Lower case string denoting one of the supported formats.
     - key_function - Optional callback function which when given a
                      QueryResult identifier string should return a unique
                      key for the dictionary.
     - kwargs       - Format-specific keyword arguments.

    Index returns a pseudo-dictionary object with QueryResult objects as its
    values and a string identifier as its keys. The function is mainly useful
    for dealing with large search output files, as it enables access to any
    given QueryResult object much faster than using parse or read.

    Index works by storing in-memory the start locations of all queries in a
    file. When a user requests access to a query, this function will jump
    to its start position, parse the whole query, and return it as a
    QueryResult object:

    >>> from Bio import SearchIO
    >>> search_idx = SearchIO.index('Blast/wnts.xml', 'blast-xml')
    >>> search_idx
    SearchIO.index('Blast/wnts.xml', 'blast-xml', key_function=None)
    >>> sorted(search_idx)
    ['gi|156630997:105-1160', 'gi|195230749:301-1383', ..., 'gi|53729353:216-1313']
    >>> search_idx['gi|195230749:301-1383']
    QueryResult(id='gi|195230749:301-1383', 5 hits)
    >>> search_idx.close()

    If the file is BGZF compressed, this is detected automatically. Ordinary
    GZIP files are not supported:

    >>> from Bio import SearchIO
    >>> search_idx = SearchIO.index('Blast/wnts.xml.bgz', 'blast-xml')
    >>> search_idx
    SearchIO.index('Blast/wnts.xml.bgz', 'blast-xml', key_function=None)
    >>> search_idx['gi|195230749:301-1383']
    QueryResult(id='gi|195230749:301-1383', 5 hits)
    >>> search_idx.close()

    You can supply a custom callback function to alter the default identifier
    string. This function should accept as its input the QueryResult ID string
    and return a modified version of it.

    >>> from Bio import SearchIO
    >>> key_func = lambda id: id.split('|')[1]
    >>> search_idx = SearchIO.index('Blast/wnts.xml', 'blast-xml', key_func)
    >>> search_idx
    SearchIO.index('Blast/wnts.xml', 'blast-xml', key_function=<function <lambda> at ...>)
    >>> sorted(search_idx)
    ['156630997:105-1160', ..., '371502086:108-1205', '53729353:216-1313']
    >>> search_idx['156630997:105-1160']
    QueryResult(id='gi|156630997:105-1160', 5 hits)
    >>> search_idx.close()

    Note that the callback function does not change the QueryResult's ID value.
    It only changes the key value used to retrieve the associated QueryResult.

    """
    if not isinstance(filename, basestring):
        raise TypeError("Need a filename (not a handle)")

    from Bio.File import _IndexedSeqFileDict
    proxy_class = get_processor(format, _INDEXER_MAP)
    repr = "SearchIO.index(%r, %r, key_function=%r)" \
        % (filename, format, key_function)
    return _IndexedSeqFileDict(proxy_class(filename, **kwargs),
                               key_function, repr, "QueryResult")


def index_db(index_filename, filenames=None, format=None,
             key_function=None, **kwargs):
    """Indexes several search output files into an SQLite database.

     - index_filename - The SQLite filename.
     - filenames    - List of strings specifying file(s) to be indexed, or when
                      indexing a single file this can be given as a string.
                      (optional if reloading an existing index, but must match)
     - format       - Lower case string denoting one of the supported formats.
                      (optional if reloading an existing index, but must match)
     - key_function - Optional callback function which when given a
                      QueryResult identifier string should return a unique
                      key for the dictionary.
     - kwargs       - Format-specific keyword arguments.

    The `index_db` function is similar to `index` in that it indexes the start
    position of all queries from search output files. The main difference is
    that instead of storing these indices in-memory, they are written to disk
    as an SQLite database file. This allows the indices to persist between
    Python sessions. This enables access to any queries in the file without any
    indexing overhead, provided it has been indexed at least once.

    >>> from Bio import SearchIO
    >>> idx_filename = ":memory:" # Use a real filename, this is in RAM only!
    >>> db_idx = SearchIO.index_db(idx_filename, 'Blast/mirna.xml', 'blast-xml')
    >>> sorted(db_idx)
    ['33211', '33212', '33213']
    >>> db_idx['33212']
    QueryResult(id='33212', 44 hits)
    >>> db_idx.close()

    `index_db` can also index multiple files and store them in the same
    database, making it easier to group multiple search files and access them
    from a single interface.

    >>> from Bio import SearchIO
    >>> idx_filename = ":memory:" # Use a real filename, this is in RAM only!
    >>> files = ['Blast/mirna.xml', 'Blast/wnts.xml']
    >>> db_idx = SearchIO.index_db(idx_filename, files, 'blast-xml')
    >>> sorted(db_idx)
    ['33211', '33212', '33213', 'gi|156630997:105-1160', ..., 'gi|53729353:216-1313']
    >>> db_idx['33212']
    QueryResult(id='33212', 44 hits)
    >>> db_idx.close()

    One common example where this is helpful is if you had a large set of
    query sequences (say ten thousand) which you split into ten query files
    of one thousand sequences each in order to run as ten separate BLAST jobs
    on a cluster. You could use `index_db` to index the ten BLAST output
    files together for seamless access to all the results as one dictionary.

    Note that ':memory:' rather than an index filename tells SQLite to hold
    the index database in memory. This is useful for quick tests, but using
    the Bio.SearchIO.index(...) function instead would use less memory.

    BGZF compressed files are supported, and detected automatically. Ordinary
    GZIP compressed files are not supported.

    See also Bio.SearchIO.index(), Bio.SearchIO.to_dict(), and the Python module
    glob which is useful for building lists of files.
    """
    # cast filenames to list if it's a string
    # (can we check if it's a string or a generator?)
    if isinstance(filenames, basestring):
        filenames = [filenames]

    from Bio.File import _SQLiteManySeqFilesDict
    repr = "SearchIO.index_db(%r, filenames=%r, format=%r, key_function=%r, ...)" \
        % (index_filename, filenames, format, key_function)

    def proxy_factory(format, filename=None):
        """Given a filename returns proxy object, else boolean if format OK."""
        if filename:
            return get_processor(format, _INDEXER_MAP)(filename, **kwargs)
        else:
            return format in _INDEXER_MAP

    return _SQLiteManySeqFilesDict(index_filename, filenames,
                                   proxy_factory, format,
                                   key_function, repr)


def write(qresults, handle, format=None, **kwargs):
    """Writes QueryResult objects to a file in the given format.

     - qresults - An iterator returning QueryResult objects or a single
                  QueryResult object.
     - handle   - Handle to the file, or the filename as a string.
     - format   - Lower case string denoting one of the supported formats.
     - kwargs   - Format-specific keyword arguments.

    The `write` function writes QueryResult object(s) into the given output
    handle / filename. You can supply it with a single QueryResult object or an
    iterable returning one or more QueryResult objects. In both cases, the
    function will return a tuple of four values: the number of QueryResult, Hit,
    HSP, and HSPFragment objects it writes to the output file::

        from Bio import SearchIO
        qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml')
        SearchIO.write(qresults, 'results.tab', 'blast-tab')
        <stdout> (3, 239, 277, 277)

    The output of different formats may be adjusted using the format-specific
    keyword arguments. Here is an example that writes a BLAT PSL output file
    with a header::

        from Bio import SearchIO
        qresults = SearchIO.parse('Blat/psl_34_001.psl', 'blat-psl')
        SearchIO.write(qresults, 'results.tab', 'blat-psl', header=True)
        <stdout> (2, 13, 22, 26)

    """
    # turn qresults into an iterator if it's a single QueryResult object
    if isinstance(qresults, QueryResult):
        qresults = iter([qresults])
    else:
        qresults = iter(qresults)

    # get the writer object and do error checking
    writer_class = get_processor(format, _WRITER_MAP)

    # write to the handle
    with as_handle(handle, 'w') as target_file:
        writer = writer_class(target_file, **kwargs)
        # count how many qresults, hits, and hsps
        qresult_count, hit_count, hsp_count, frag_count = \
            writer.write_file(qresults)

    return qresult_count, hit_count, hsp_count, frag_count


def convert(in_file, in_format, out_file, out_format, in_kwargs=None,
            out_kwargs=None):
    """Convert between two search output formats, return number of records.

     - in_file    - Handle to the input file, or the filename as string.
     - in_format  - Lower case string denoting the format of the input file.
     - out_file   - Handle to the output file, or the filename as string.
     - out_format - Lower case string denoting the format of the output file.
     - in_kwargs  - Dictionary of keyword arguments for the input function.
     - out_kwargs - Dictionary of keyword arguments for the output function.

    The convert function is a shortcut function for `parse` and `write`. It has
    the same return type as `write`. Format-specific arguments may be passed to
    the convert function, but only as dictionaries.

    Here is an example of using `convert` to convert from a BLAST+ XML file
    into a tabular file with comments::

        from Bio import SearchIO
        in_file = 'Blast/mirna.xml'
        in_fmt = 'blast-xml'
        out_file = 'results.tab'
        out_fmt = 'blast-tab'
        out_kwarg = {'comments': True}
        SearchIO.convert(in_file, in_fmt, out_file, out_fmt, out_kwargs=out_kwarg)
        <stdout> (3, 239, 277, 277)

    Given that different search output files provide different statistics and
    different levels of detail, the convert function is limited to converting
    between formats that have the same statistics, and to converting to
    formats with the same level of detail or less.

    For example, converting from a BLAST+ XML output to a HMMER table file
    is not possible, as these are two search programs with different kinds of
    statistics. In theory, you may provide the necessary values required by the
    HMMER table file (e.g. conditional e-values, envelope coordinates, etc).
    However, these values are likely to hold little meaning as they are not true
    HMMER-computed values.

    Another example is converting from BLAST+ XML to BLAST+ tabular file. This
    is possible, as BLAST+ XML provides all the values necessary to create a
    BLAST+ tabular file. However, the reverse conversion may not be possible.
    There are more details covered in the XML file that are not found in a
    tabular file (e.g. the lambda and kappa values).

    """
    if in_kwargs is None:
        in_kwargs = {}
    if out_kwargs is None:
        out_kwargs = {}

    qresults = parse(in_file, in_format, **in_kwargs)
    return write(qresults, out_file, out_format, **out_kwargs)


# if not used as a module, run the doctest
if __name__ == "__main__":
    from Bio._utils import run_doctest
    run_doctest()