Package Bio :: Package SearchIO :: Module FastaIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.FastaIO

  1  # Adapted from Bio.AlignIO.FastaIO copyright 2008-2011 by Peter Cock. 
  2  # Copyright 2012 by Wibowo Arindrarto. 
  3  # All rights reserved. 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """Bio.SearchIO support for Bill Pearson's FASTA tools. 
  9   
 10  This module adds support for parsing FASTA outputs. FASTA is a suite of 
 11  programs that finds regions of local or global similarity between protein 
 12  or nucleotide sequences, either by searching databases or identifying 
 13  local duplications. 
 14   
 15  Bio.SearchIO.FastaIO was tested on the following FASTA flavors and versions: 
 16   
 17      - flavors: fasta, ssearch, tfastx 
 18      - versions: 35, 36 
 19   
 20  Other flavors and/or versions may introduce some bugs. Please file a bug report 
 21  if you see such problems to Biopython's bug tracker. 
 22   
 23  More information on FASTA are available through these links: 
 24    - Website: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml 
 25    - User guide: http://fasta.bioch.virginia.edu/fasta_www2/fasta_guide.pdf 
 26   
 27   
 28  Supported Formats 
 29  ================= 
 30   
 31  Bio.SearchIO.FastaIO supports parsing and indexing FASTA outputs triggered by 
 32  the -m 10 flag. Other formats that mimic other programs (e.g. the BLAST tabular 
 33  format using the -m 8 flag) may be parseable but using SearchIO's other parsers 
 34  (in this case, using the 'blast-tab' parser). 
 35   
 36   
 37  fasta-m10 
 38  ========= 
 39   
 40  Note that in FASTA -m 10 outputs, HSPs from different strands are considered to 
 41  be from different hits. They are listed as two separate entries in the hit 
 42  table. FastaIO recognizes this and will group HSPs with the same hit ID into a 
 43  single Hit object, regardless of strand. 
 44   
 45  FASTA also sometimes output extra sequences adjacent to the HSP match. These 
 46  extra sequences are discarded by FastaIO. Only regions containing the actual 
 47  sequence match are extracted. 
 48   
 49  The following object attributes are provided: 
 50   
 51  +-----------------+-------------------------+----------------------------------+ 
 52  | Object          | Attribute               | Value                            | 
 53  +=================+=========================+==================================+ 
 54  | QueryResult     | description             | query sequence description       | 
 55  |                 +-------------------------+----------------------------------+ 
 56  |                 | id                      | query sequence ID                | 
 57  |                 +-------------------------+----------------------------------+ 
 58  |                 | program                 | FASTA flavor                     | 
 59  |                 +-------------------------+----------------------------------+ 
 60  |                 | seq_len                 | full length of query sequence    | 
 61  |                 +-------------------------+----------------------------------+ 
 62  |                 | target                  | target search database           | 
 63  |                 +-------------------------+----------------------------------+ 
 64  |                 | version                 | FASTA version                    | 
 65  +-----------------+-------------------------+----------------------------------+ 
 66  | Hit             | seq_len                 | full length of the hit sequence  | 
 67  +-----------------+-------------------------+----------------------------------+ 
 68  | HSP             | bitscore                | *_bits line                      | 
 69  |                 +-------------------------+----------------------------------+ 
 70  |                 | evalue                  | *_expect line                    | 
 71  |                 +-------------------------+----------------------------------+ 
 72  |                 | ident_pct               | *_ident line                     | 
 73  |                 +-------------------------+----------------------------------+ 
 74  |                 | init1_score             | *_init1 line                     | 
 75  |                 +-------------------------+----------------------------------+ 
 76  |                 | initn_score             | *_initn line                     | 
 77  |                 +-------------------------+----------------------------------+ 
 78  |                 | opt_score               | *_opt line, *_s-w opt line       | 
 79  |                 +-------------------------+----------------------------------+ 
 80  |                 | pos_pct                 | *_sim line                       | 
 81  |                 +-------------------------+----------------------------------+ 
 82  |                 | sw_score                | *_score line                     | 
 83  |                 +-------------------------+----------------------------------+ 
 84  |                 | z_score                 | *_z-score line                   | 
 85  +-----------------+-------------------------+----------------------------------+ 
 86  | HSPFragment     | aln_annotation          | al_cons block, if present        | 
 87  | (also via HSP)  +-------------------------+----------------------------------+ 
 88  |                 | hit                     | hit sequence                     | 
 89  |                 +-------------------------+----------------------------------+ 
 90  |                 | hit_end                 | hit sequence end coordinate      | 
 91  |                 +-------------------------+----------------------------------+ 
 92  |                 | hit_start               | hit sequence start coordinate    | 
 93  |                 +-------------------------+----------------------------------+ 
 94  |                 | hit_strand              | hit sequence strand              | 
 95  |                 +-------------------------+----------------------------------+ 
 96  |                 | query                   | query sequence                   | 
 97  |                 +-------------------------+----------------------------------+ 
 98  |                 | query_end               | query sequence end coordinate    | 
 99  |                 +-------------------------+----------------------------------+ 
100  |                 | query_start             | query sequence start coordinate  | 
101  |                 +-------------------------+----------------------------------+ 
102  |                 | query_strand            | query sequence strand            | 
103  +-----------------+-------------------------+----------------------------------+ 
104   
105  """ 
106   
107  import re 
108   
109  from Bio._py3k import _as_bytes, _bytes_to_string 
110  from Bio.Alphabet import generic_dna, generic_protein 
111  from Bio.File import UndoHandle 
112  from Bio.SearchIO._index import SearchIndexer 
113  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
114   
115   
116  __all__ = ['FastaM10Parser', 'FastaM10Indexer'] 
117   
118   
119  # precompile regex patterns 
120  # regex for program name 
121  _RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search') 
122  # regex for sequence ID and length ~ deals with both \n and \r\n 
123  _PTR_ID_DESC_SEQLEN = r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$' 
124  _RE_ID_DESC_SEQLEN = re.compile(_PTR_ID_DESC_SEQLEN) 
125  _RE_ID_DESC_SEQLEN_IDX = re.compile(_as_bytes(_PTR_ID_DESC_SEQLEN)) 
126  # regex for qresult, hit, or hsp attribute value 
127  _RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$') 
128  # regex for capturing excess start and end sequences in alignments 
129  _RE_START_EXC = re.compile(r'^-*') 
130  _RE_END_EXC = re.compile(r'-*$') 
131   
132  # attribute name mappings 
133  _HSP_ATTR_MAP = { 
134      '_initn': ('initn_score', int), 
135      '_init1': ('init1_score', int), 
136      '_opt': ('opt_score', int), 
137      '_s-w opt': ('opt_score', int), 
138      '_z-score': ('z_score', float), 
139      '_bits': ('bitscore', float), 
140      '_expect': ('evalue', float), 
141      '_score': ('sw_score', int), 
142      '_ident': ('ident_pct', float), 
143      '_sim': ('pos_pct', float), 
144  } 
145   
146  # state flags 
147  _STATE_NONE = 0 
148  _STATE_QUERY_BLOCK = 1 
149  _STATE_HIT_BLOCK = 2 
150  _STATE_CONS_BLOCK = 3 
151   
152   
153 -def _set_qresult_hits(qresult, hit_rows=[]):
154 """Helper function for appending Hits without alignments into QueryResults.""" 155 for hit_row in hit_rows: 156 hit_id, remainder = hit_row.split(' ', 1) 157 # TODO: parse hit and hsp properties properly; by dealing with: 158 # - any character in the description (brackets, spaces, etc.) 159 # - possible [f] or [r] presence (for frame info) 160 # - possible presence of E2() column 161 # - possible incomplete hit_id due to column length limit 162 # The current method only looks at the Hit ID, none of the things above 163 if hit_id not in qresult: 164 frag = HSPFragment(hit_id, qresult.id) 165 hsp = HSP([frag]) 166 hit = Hit([hsp]) 167 qresult.append(hit) 168 169 return qresult
170 171
172 -def _set_hsp_seqs(hsp, parsed, program):
173 """Helper function for the main parsing code. 174 175 Arguments: 176 hsp -- HSP object whose properties are to be set. 177 parsed -- Dictionary containing parsed values for HSP attributes. 178 program -- String of program name. 179 180 """ 181 # get aligned sequences and check if they have equal lengths 182 start = 0 183 for seq_type in ('hit', 'query'): 184 if 'tfast' not in program: 185 pseq = parsed[seq_type] 186 # adjust start and end coordinates based on the amount of 187 # filler characters 188 start, stop = _get_aln_slice_coords(pseq) 189 start_adj = len(re.search(_RE_START_EXC, pseq['seq']).group(0)) 190 stop_adj = len(re.search(_RE_END_EXC, pseq['seq']).group(0)) 191 start = start + start_adj 192 stop = stop + start_adj - stop_adj 193 parsed[seq_type]['seq'] = pseq['seq'][start:stop] 194 assert len(parsed['query']['seq']) == len(parsed['hit']['seq']), "%r %r" \ 195 % (len(parsed['query']['seq']), len(parsed['hit']['seq'])) 196 if 'similarity' in hsp.aln_annotation: 197 # only using 'start' since FASTA seems to have trimmed the 'excess' 198 # end part 199 hsp.aln_annotation['similarity'] = hsp.aln_annotation['similarity'][start:] 200 # hit or query works equally well here 201 assert len(hsp.aln_annotation['similarity']) == len(parsed['hit']['seq']) 202 203 # query and hit sequence types must be the same 204 assert parsed['query']['_type'] == parsed['hit']['_type'] 205 type_val = parsed['query']['_type'] # hit works fine too 206 alphabet = generic_dna if type_val == 'D' else generic_protein 207 setattr(hsp.fragment, 'alphabet', alphabet) 208 209 for seq_type in ('hit', 'query'): 210 # get and set start and end coordinates 211 start = int(parsed[seq_type]['_start']) 212 end = int(parsed[seq_type]['_stop']) 213 214 setattr(hsp.fragment, seq_type + '_start', min(start, end) - 1) 215 setattr(hsp.fragment, seq_type + '_end', max(start, end)) 216 # set seq and alphabet 217 setattr(hsp.fragment, seq_type, parsed[seq_type]['seq']) 218 219 if alphabet is not generic_protein: 220 # get strand from coordinate; start <= end is plus 221 # start > end is minus 222 if start <= end: 223 setattr(hsp.fragment, seq_type + '_strand', 1) 224 else: 225 setattr(hsp.fragment, seq_type + '_strand', -1) 226 else: 227 setattr(hsp.fragment, seq_type + '_strand', 0)
228 229
230 -def _get_aln_slice_coords(parsed_hsp):
231 """Helper function for the main parsing code. 232 233 To get the actual pairwise alignment sequences, we must first 234 translate the un-gapped sequence based coordinates into positions 235 in the gapped sequence (which may have a flanking region shown 236 using leading - characters). To date, I have never seen any 237 trailing flanking region shown in the m10 file, but the 238 following code should also cope with that. 239 240 Note that this code seems to work fine even when the "sq_offset" 241 entries are prsent as a result of using the -X command line option. 242 """ 243 seq = parsed_hsp['seq'] 244 seq_stripped = seq.strip('-') 245 disp_start = int(parsed_hsp['_display_start']) 246 start = int(parsed_hsp['_start']) 247 stop = int(parsed_hsp['_stop']) 248 249 if start <= stop: 250 start = start - disp_start 251 stop = stop - disp_start + 1 252 else: 253 start = disp_start - start 254 stop = disp_start - stop + 1 255 stop += seq_stripped.count('-') 256 assert 0 <= start and start < stop and stop <= len(seq_stripped), \ 257 "Problem with sequence start/stop,\n%s[%i:%i]\n%s" \ 258 % (seq, start, stop, parsed_hsp) 259 return start, stop
260 261
262 -class FastaM10Parser(object):
263 """Parser for Bill Pearson's FASTA suite's -m 10 output.""" 264
265 - def __init__(self, handle, __parse_hit_table=False):
266 self.handle = UndoHandle(handle) 267 self._preamble = self._parse_preamble()
268
269 - def __iter__(self):
270 for qresult in self._parse_qresult(): 271 # re-set desc, for hsp query description 272 qresult.description = qresult.description 273 yield qresult
274
275 - def _parse_preamble(self):
276 """Parses the Fasta preamble for Fasta flavor and version.""" 277 preamble = {} 278 while True: 279 self.line = self.handle.readline() 280 # this should be the line just before the first qresult 281 if self.line.startswith('Query'): 282 break 283 # try to match for version line 284 elif self.line.startswith(' version'): 285 preamble['version'] = self.line.split(' ')[2] 286 else: 287 # try to match for flavor line 288 flav_match = re.match(_RE_FLAVS, self.line.lower()) 289 if flav_match: 290 preamble['program'] = flav_match.group(0) 291 292 return preamble
293
294 - def __parse_hit_table(self):
295 """Parses hit table rows.""" 296 # move to the first row 297 self.line = self.handle.readline() 298 # parse hit table until we see an empty line 299 hit_rows = [] 300 while self.line and not self.line.strip(): 301 hit_rows.append(self.line.strip()) 302 self.line = self.handle.readline() 303 return hit_rows
304
305 - def _parse_qresult(self):
306 # initial qresult value 307 qresult = None 308 hit_rows = [] 309 # state values 310 state_QRES_NEW = 1 311 state_QRES_HITTAB = 3 312 state_QRES_CONTENT = 5 313 state_QRES_END = 7 314 315 while True: 316 317 # one line before the hit table 318 if self.line.startswith('The best scores are:'): 319 qres_state = state_QRES_HITTAB 320 # the end of a query or the file altogether 321 elif self.line.strip() == '>>>///' or not self.line: 322 qres_state = state_QRES_END 323 # the beginning of a new query 324 elif not self.line.startswith('>>>') and '>>>' in self.line: 325 qres_state = state_QRES_NEW 326 # the beginning of the query info and its hits + hsps 327 elif self.line.startswith('>>>') and not \ 328 self.line.strip() == '>>><<<': 329 qres_state = state_QRES_CONTENT 330 # default qres mark 331 else: 332 qres_state = None 333 334 if qres_state is not None: 335 if qres_state == state_QRES_HITTAB: 336 # parse hit table if flag is set 337 hit_rows = self.__parse_hit_table() 338 339 elif qres_state == state_QRES_END: 340 yield _set_qresult_hits(qresult, hit_rows) 341 break 342 343 elif qres_state == state_QRES_NEW: 344 # if qresult is filled, yield it first 345 if qresult is not None: 346 yield _set_qresult_hits(qresult, hit_rows) 347 regx = re.search(_RE_ID_DESC_SEQLEN, self.line) 348 query_id = regx.group(1) 349 seq_len = regx.group(3) 350 desc = regx.group(2) 351 qresult = QueryResult(id=query_id) 352 qresult.seq_len = int(seq_len) 353 # get target from the next line 354 self.line = self.handle.readline() 355 qresult.target = [x for x in self.line.split(' ') if x][1].strip() 356 if desc is not None: 357 qresult.description = desc 358 # set values from preamble 359 for key, value in self._preamble.items(): 360 setattr(qresult, key, value) 361 362 elif qres_state == state_QRES_CONTENT: 363 assert self.line[3:].startswith(qresult.id), self.line 364 for hit, strand in self._parse_hit(query_id): 365 # HACK: re-set desc, for hsp hit and query description 366 hit.description = hit.description 367 hit.query_description = qresult.description 368 # if hit is not in qresult, append it 369 if hit.id not in qresult: 370 qresult.append(hit) 371 # otherwise, it might be the same hit with a different strand 372 else: 373 # make sure strand is different and then append hsp to 374 # existing hit 375 for hsp in hit.hsps: 376 assert strand != hsp.query_strand 377 qresult[hit.id].append(hsp) 378 379 self.line = self.handle.readline()
380
381 - def _parse_hit(self, query_id):
382 while True: 383 self.line = self.handle.readline() 384 if self.line.startswith('>>'): 385 break 386 387 strand = None 388 hsp_list = [] 389 while True: 390 peekline = self.handle.peekline() 391 # yield hit if we've reached the start of a new query or 392 # the end of the search 393 if peekline.strip() in [">>><<<", ">>>///"] or \ 394 (not peekline.startswith('>>>') and '>>>' in peekline): 395 # append last parsed_hsp['hit']['seq'] line 396 if state == _STATE_HIT_BLOCK: 397 parsed_hsp['hit']['seq'] += self.line.strip() 398 elif state == _STATE_CONS_BLOCK: 399 hsp.aln_annotation['similarity'] += \ 400 self.line.strip('\r\n') 401 # process HSP alignment and coordinates 402 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 403 hit = Hit(hsp_list) 404 hit.description = hit_desc 405 hit.seq_len = seq_len 406 yield hit, strand 407 hsp_list = [] 408 break 409 # yield hit and create a new one if we're still in the same query 410 elif self.line.startswith('>>'): 411 # try yielding, if we have hsps 412 if hsp_list: 413 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 414 hit = Hit(hsp_list) 415 hit.description = hit_desc 416 hit.seq_len = seq_len 417 yield hit, strand 418 hsp_list = [] 419 # try to get the hit id and desc, and handle cases without descs 420 try: 421 hit_id, hit_desc = self.line[2:].strip().split(' ', 1) 422 except ValueError: 423 hit_id = self.line[2:].strip().split(' ', 1)[0] 424 hit_desc = '' 425 # create the HSP object for Hit 426 frag = HSPFragment(hit_id, query_id) 427 hsp = HSP([frag]) 428 hsp_list.append(hsp) 429 # set or reset the state to none 430 state = _STATE_NONE 431 parsed_hsp = {'query':{}, 'hit': {}} 432 # create and append a new HSP if line starts with '>--' 433 elif self.line.startswith('>--'): 434 # set seq attributes of previous hsp 435 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 436 # and create a new one 437 frag = HSPFragment(hit_id, query_id) 438 hsp = HSP([frag]) 439 hsp_list.append(hsp) 440 # set the state ~ none yet 441 state = _STATE_NONE 442 parsed_hsp = {'query':{}, 'hit': {}} 443 # this is either query or hit data in the HSP, depending on the state 444 elif self.line.startswith('>'): 445 if state == _STATE_NONE: 446 # make sure it's the correct query 447 assert query_id.startswith(self.line[1:].split(' ')[0]), \ 448 "%r vs %r" % (query_id, self.line) 449 state = _STATE_QUERY_BLOCK 450 parsed_hsp['query']['seq'] = '' 451 elif state == _STATE_QUERY_BLOCK: 452 # make sure it's the correct hit 453 assert hit_id.startswith(self.line[1:].split(' ')[0]) 454 state = _STATE_HIT_BLOCK 455 parsed_hsp['hit']['seq'] = '' 456 # check for conservation block 457 elif self.line.startswith('; al_cons'): 458 state = _STATE_CONS_BLOCK 459 hsp.fragment.aln_annotation['similarity'] = '' 460 elif self.line.startswith(';'): 461 # Fasta outputs do not make a clear distinction between Hit 462 # and HSPs, so we check the attribute names to determine 463 # whether it belongs to a Hit or HSP 464 regx = re.search(_RE_ATTR, self.line.strip()) 465 name = regx.group(1) 466 value = regx.group(2) 467 468 # for values before the '>...' query block 469 if state == _STATE_NONE: 470 if name in _HSP_ATTR_MAP: 471 attr_name, caster = _HSP_ATTR_MAP[name] 472 if caster is not str: 473 value = caster(value) 474 if name in ['_ident', '_sim']: 475 value *= 100 476 setattr(hsp, attr_name, value) 477 # otherwise, pool the values for processing later 478 elif state == _STATE_QUERY_BLOCK: 479 parsed_hsp['query'][name] = value 480 elif state == _STATE_HIT_BLOCK: 481 if name == '_len': 482 seq_len = int(value) 483 else: 484 parsed_hsp['hit'][name] = value 485 # for values in the hit block 486 else: 487 raise ValueError("Unexpected line: %r" % self.line) 488 # otherwise, it must be lines containing the sequences 489 else: 490 assert '>' not in self.line 491 # if we're in hit, parse into hsp.hit 492 if state == _STATE_HIT_BLOCK: 493 parsed_hsp['hit']['seq'] += self.line.strip() 494 elif state == _STATE_QUERY_BLOCK: 495 parsed_hsp['query']['seq'] += self.line.strip() 496 elif state == _STATE_CONS_BLOCK: 497 hsp.fragment.aln_annotation['similarity'] += \ 498 self.line.strip('\r\n') 499 # we should not get here! 500 else: 501 raise ValueError("Unexpected line: %r" % self.line) 502 503 self.line = self.handle.readline()
504 505
506 -class FastaM10Indexer(SearchIndexer):
507 """Indexer class for Bill Pearson's FASTA suite's -m 10 output.""" 508 509 _parser = FastaM10Parser 510
511 - def __init__(self, filename):
512 SearchIndexer.__init__(self, filename) 513 self._handle = UndoHandle(self._handle)
514
515 - def __iter__(self):
516 handle = self._handle 517 handle.seek(0) 518 start_offset = handle.tell() 519 qresult_key = None 520 query_mark = _as_bytes('>>>') 521 522 while True: 523 line = handle.readline() 524 peekline = handle.peekline() 525 end_offset = handle.tell() 526 527 if not line.startswith(query_mark) and query_mark in line: 528 regx = re.search(_RE_ID_DESC_SEQLEN_IDX, line) 529 qresult_key = _bytes_to_string(regx.group(1)) 530 start_offset = end_offset - len(line) 531 # yield whenever we encounter a new query or at the end of the file 532 if qresult_key is not None: 533 if (not peekline.startswith(query_mark) 534 and query_mark in peekline) or not line: 535 yield qresult_key, start_offset, end_offset - start_offset 536 if not line: 537 break 538 start_offset = end_offset
539
540 - def get_raw(self, offset):
541 handle = self._handle 542 qresult_raw = _as_bytes('') 543 query_mark = _as_bytes('>>>') 544 545 # read header first 546 handle.seek(0) 547 while True: 548 line = handle.readline() 549 peekline = handle.peekline() 550 qresult_raw += line 551 if not peekline.startswith(query_mark) and query_mark in peekline: 552 break 553 554 # and read the qresult raw string 555 handle.seek(offset) 556 while True: 557 # preserve whitespace, don't use read_forward 558 line = handle.readline() 559 peekline = handle.peekline() 560 qresult_raw += line 561 562 # break when we've reached qresult end 563 if (not peekline.startswith(query_mark) and query_mark in peekline) or \ 564 not line: 565 break 566 567 # append mock end marker to qresult_raw, since it's not always present 568 return qresult_raw + _as_bytes('>>><<<\n')
569 570 571 # if not used as a module, run the doctest 572 if __name__ == "__main__": 573 from Bio._utils import run_doctest 574 run_doctest() 575