Package Bio :: Package SearchIO :: Module FastaIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.FastaIO

  1  # Adapted from Bio.AlignIO.FastaIO copyright 2008-2011 by Peter Cock. 
  2  # Copyright 2012 by Wibowo Arindrarto. 
  3  # All rights reserved. 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """Bio.SearchIO support for Bill Pearson's FASTA tools. 
  9   
 10  This module adds support for parsing FASTA outputs. FASTA is a suite of 
 11  programs that finds regions of local or global similarity between protein 
 12  or nucleotide sequences, either by searching databases or identifying 
 13  local duplications. 
 14   
 15  Bio.SearchIO.FastaIO was tested on the following FASTA flavors and versions: 
 16   
 17      - flavors: fasta, ssearch, tfastx 
 18      - versions: 35, 36 
 19   
 20  Other flavors and/or versions may introduce some bugs. Please file a bug report 
 21  if you see such problems to Biopython's bug tracker. 
 22   
 23  More information on FASTA are available through these links: 
 24   
 25      - Website: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml 
 26      - User guide: http://fasta.bioch.virginia.edu/fasta_www2/fasta_guide.pdf 
 27   
 28   
 29  Supported Formats 
 30  ================= 
 31   
 32  Bio.SearchIO.FastaIO supports parsing and indexing FASTA outputs triggered by 
 33  the -m 10 flag. Other formats that mimic other programs (e.g. the BLAST tabular 
 34  format using the -m 8 flag) may be parseable but using SearchIO's other parsers 
 35  (in this case, using the 'blast-tab' parser). 
 36   
 37   
 38  fasta-m10 
 39  ========= 
 40   
 41  Note that in FASTA -m 10 outputs, HSPs from different strands are considered to 
 42  be from different hits. They are listed as two separate entries in the hit 
 43  table. FastaIO recognizes this and will group HSPs with the same hit ID into a 
 44  single Hit object, regardless of strand. 
 45   
 46  FASTA also sometimes output extra sequences adjacent to the HSP match. These 
 47  extra sequences are discarded by FastaIO. Only regions containing the actual 
 48  sequence match are extracted. 
 49   
 50  The following object attributes are provided: 
 51   
 52  +-----------------+-------------------------+----------------------------------+ 
 53  | Object          | Attribute               | Value                            | 
 54  +=================+=========================+==================================+ 
 55  | QueryResult     | description             | query sequence description       | 
 56  |                 +-------------------------+----------------------------------+ 
 57  |                 | id                      | query sequence ID                | 
 58  |                 +-------------------------+----------------------------------+ 
 59  |                 | program                 | FASTA flavor                     | 
 60  |                 +-------------------------+----------------------------------+ 
 61  |                 | seq_len                 | full length of query sequence    | 
 62  |                 +-------------------------+----------------------------------+ 
 63  |                 | target                  | target search database           | 
 64  |                 +-------------------------+----------------------------------+ 
 65  |                 | version                 | FASTA version                    | 
 66  +-----------------+-------------------------+----------------------------------+ 
 67  | Hit             | seq_len                 | full length of the hit sequence  | 
 68  +-----------------+-------------------------+----------------------------------+ 
 69  | HSP             | bitscore                | \*_bits line                     | 
 70  |                 +-------------------------+----------------------------------+ 
 71  |                 | evalue                  | \*_expect line                   | 
 72  |                 +-------------------------+----------------------------------+ 
 73  |                 | ident_pct               | \*_ident line                    | 
 74  |                 +-------------------------+----------------------------------+ 
 75  |                 | init1_score             | \*_init1 line                    | 
 76  |                 +-------------------------+----------------------------------+ 
 77  |                 | initn_score             | \*_initn line                    | 
 78  |                 +-------------------------+----------------------------------+ 
 79  |                 | opt_score               | \*_opt line, \*_s-w opt line     | 
 80  |                 +-------------------------+----------------------------------+ 
 81  |                 | pos_pct                 | \*_sim line                      | 
 82  |                 +-------------------------+----------------------------------+ 
 83  |                 | sw_score                | \*_score line                    | 
 84  |                 +-------------------------+----------------------------------+ 
 85  |                 | z_score                 | \*_z-score line                  | 
 86  +-----------------+-------------------------+----------------------------------+ 
 87  | HSPFragment     | aln_annotation          | al_cons block, if present        | 
 88  | (also via HSP)  +-------------------------+----------------------------------+ 
 89  |                 | hit                     | hit sequence                     | 
 90  |                 +-------------------------+----------------------------------+ 
 91  |                 | hit_end                 | hit sequence end coordinate      | 
 92  |                 +-------------------------+----------------------------------+ 
 93  |                 | hit_start               | hit sequence start coordinate    | 
 94  |                 +-------------------------+----------------------------------+ 
 95  |                 | hit_strand              | hit sequence strand              | 
 96  |                 +-------------------------+----------------------------------+ 
 97  |                 | query                   | query sequence                   | 
 98  |                 +-------------------------+----------------------------------+ 
 99  |                 | query_end               | query sequence end coordinate    | 
100  |                 +-------------------------+----------------------------------+ 
101  |                 | query_start             | query sequence start coordinate  | 
102  |                 +-------------------------+----------------------------------+ 
103  |                 | query_strand            | query sequence strand            | 
104  +-----------------+-------------------------+----------------------------------+ 
105   
106  """ 
107   
108  import re 
109   
110  from Bio._py3k import _as_bytes, _bytes_to_string 
111  from Bio.Alphabet import generic_dna, generic_protein 
112  from Bio.File import UndoHandle 
113  from Bio.SearchIO._index import SearchIndexer 
114  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
115   
116   
117  __all__ = ['FastaM10Parser', 'FastaM10Indexer'] 
118   
119   
120  # precompile regex patterns 
121  # regex for program name 
122  _RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search') 
123  # regex for sequence ID and length ~ deals with both \n and \r\n 
124  _PTR_ID_DESC_SEQLEN = r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$' 
125  _RE_ID_DESC_SEQLEN = re.compile(_PTR_ID_DESC_SEQLEN) 
126  _RE_ID_DESC_SEQLEN_IDX = re.compile(_as_bytes(_PTR_ID_DESC_SEQLEN)) 
127  # regex for qresult, hit, or hsp attribute value 
128  _RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$') 
129  # regex for capturing excess start and end sequences in alignments 
130  _RE_START_EXC = re.compile(r'^-*') 
131  _RE_END_EXC = re.compile(r'-*$') 
132   
133  # attribute name mappings 
134  _HSP_ATTR_MAP = { 
135      '_initn': ('initn_score', int), 
136      '_init1': ('init1_score', int), 
137      '_opt': ('opt_score', int), 
138      '_s-w opt': ('opt_score', int), 
139      '_z-score': ('z_score', float), 
140      '_bits': ('bitscore', float), 
141      '_expect': ('evalue', float), 
142      '_score': ('sw_score', int), 
143      '_ident': ('ident_pct', float), 
144      '_sim': ('pos_pct', float), 
145  } 
146   
147  # state flags 
148  _STATE_NONE = 0 
149  _STATE_QUERY_BLOCK = 1 
150  _STATE_HIT_BLOCK = 2 
151  _STATE_CONS_BLOCK = 3 
152   
153   
154 -def _set_qresult_hits(qresult, hit_rows=()):
155 """Helper function for appending Hits without alignments into QueryResults.""" 156 for hit_row in hit_rows: 157 hit_id, remainder = hit_row.split(' ', 1) 158 # TODO: parse hit and hsp properties properly; by dealing with: 159 # - any character in the description (brackets, spaces, etc.) 160 # - possible [f] or [r] presence (for frame info) 161 # - possible presence of E2() column 162 # - possible incomplete hit_id due to column length limit 163 # The current method only looks at the Hit ID, none of the things above 164 if hit_id not in qresult: 165 frag = HSPFragment(hit_id, qresult.id) 166 hsp = HSP([frag]) 167 hit = Hit([hsp]) 168 qresult.append(hit) 169 170 return qresult
171 172
173 -def _set_hsp_seqs(hsp, parsed, program):
174 """Helper function for the main parsing code. 175 176 :param hsp: HSP whose properties will be set 177 :type hsp: HSP 178 :param parsed: parsed values of the HSP attributes 179 :type parsed: dictionary {string: object} 180 :param program: program name 181 :type program: string 182 183 """ 184 # get aligned sequences and check if they have equal lengths 185 start = 0 186 for seq_type in ('hit', 'query'): 187 if 'tfast' not in program: 188 pseq = parsed[seq_type] 189 # adjust start and end coordinates based on the amount of 190 # filler characters 191 start, stop = _get_aln_slice_coords(pseq) 192 start_adj = len(re.search(_RE_START_EXC, pseq['seq']).group(0)) 193 stop_adj = len(re.search(_RE_END_EXC, pseq['seq']).group(0)) 194 start = start + start_adj 195 stop = stop + start_adj - stop_adj 196 parsed[seq_type]['seq'] = pseq['seq'][start:stop] 197 assert len(parsed['query']['seq']) == len(parsed['hit']['seq']), "%r %r" \ 198 % (len(parsed['query']['seq']), len(parsed['hit']['seq'])) 199 if 'similarity' in hsp.aln_annotation: 200 # only using 'start' since FASTA seems to have trimmed the 'excess' 201 # end part 202 hsp.aln_annotation['similarity'] = hsp.aln_annotation['similarity'][start:] 203 # hit or query works equally well here 204 assert len(hsp.aln_annotation['similarity']) == len(parsed['hit']['seq']) 205 206 # query and hit sequence types must be the same 207 assert parsed['query']['_type'] == parsed['hit']['_type'] 208 type_val = parsed['query']['_type'] # hit works fine too 209 alphabet = generic_dna if type_val == 'D' else generic_protein 210 setattr(hsp.fragment, 'alphabet', alphabet) 211 212 for seq_type in ('hit', 'query'): 213 # get and set start and end coordinates 214 start = int(parsed[seq_type]['_start']) 215 end = int(parsed[seq_type]['_stop']) 216 217 setattr(hsp.fragment, seq_type + '_start', min(start, end) - 1) 218 setattr(hsp.fragment, seq_type + '_end', max(start, end)) 219 # set seq and alphabet 220 setattr(hsp.fragment, seq_type, parsed[seq_type]['seq']) 221 222 if alphabet is not generic_protein: 223 # get strand from coordinate; start <= end is plus 224 # start > end is minus 225 if start <= end: 226 setattr(hsp.fragment, seq_type + '_strand', 1) 227 else: 228 setattr(hsp.fragment, seq_type + '_strand', -1) 229 else: 230 setattr(hsp.fragment, seq_type + '_strand', 0)
231 232
233 -def _get_aln_slice_coords(parsed_hsp):
234 """Helper function for the main parsing code. 235 236 To get the actual pairwise alignment sequences, we must first 237 translate the un-gapped sequence based coordinates into positions 238 in the gapped sequence (which may have a flanking region shown 239 using leading - characters). To date, I have never seen any 240 trailing flanking region shown in the m10 file, but the 241 following code should also cope with that. 242 243 Note that this code seems to work fine even when the "sq_offset" 244 entries are prsent as a result of using the -X command line option. 245 """ 246 seq = parsed_hsp['seq'] 247 seq_stripped = seq.strip('-') 248 disp_start = int(parsed_hsp['_display_start']) 249 start = int(parsed_hsp['_start']) 250 stop = int(parsed_hsp['_stop']) 251 252 if start <= stop: 253 start = start - disp_start 254 stop = stop - disp_start + 1 255 else: 256 start = disp_start - start 257 stop = disp_start - stop + 1 258 stop += seq_stripped.count('-') 259 assert 0 <= start and start < stop and stop <= len(seq_stripped), \ 260 "Problem with sequence start/stop,\n%s[%i:%i]\n%s" \ 261 % (seq, start, stop, parsed_hsp) 262 return start, stop
263 264
265 -class FastaM10Parser(object):
266 """Parser for Bill Pearson's FASTA suite's -m 10 output.""" 267
268 - def __init__(self, handle, __parse_hit_table=False):
269 self.handle = UndoHandle(handle) 270 self._preamble = self._parse_preamble()
271
272 - def __iter__(self):
273 for qresult in self._parse_qresult(): 274 # re-set desc, for hsp query description 275 qresult.description = qresult.description 276 yield qresult
277
278 - def _parse_preamble(self):
279 """Parses the Fasta preamble for Fasta flavor and version.""" 280 preamble = {} 281 while True: 282 self.line = self.handle.readline() 283 # this should be the line just before the first qresult 284 if self.line.startswith('Query'): 285 break 286 # try to match for version line 287 elif self.line.startswith(' version'): 288 preamble['version'] = self.line.split(' ')[2] 289 else: 290 # try to match for flavor line 291 flav_match = re.match(_RE_FLAVS, self.line.lower()) 292 if flav_match: 293 preamble['program'] = flav_match.group(0) 294 295 return preamble
296
297 - def __parse_hit_table(self):
298 """Parses hit table rows.""" 299 # move to the first row 300 self.line = self.handle.readline() 301 # parse hit table until we see an empty line 302 hit_rows = [] 303 while self.line and not self.line.strip(): 304 hit_rows.append(self.line.strip()) 305 self.line = self.handle.readline() 306 return hit_rows
307
308 - def _parse_qresult(self):
309 # initial qresult value 310 qresult = None 311 hit_rows = [] 312 # state values 313 state_QRES_NEW = 1 314 state_QRES_HITTAB = 3 315 state_QRES_CONTENT = 5 316 state_QRES_END = 7 317 318 while True: 319 320 # one line before the hit table 321 if self.line.startswith('The best scores are:'): 322 qres_state = state_QRES_HITTAB 323 # the end of a query or the file altogether 324 elif self.line.strip() == '>>>///' or not self.line: 325 qres_state = state_QRES_END 326 # the beginning of a new query 327 elif not self.line.startswith('>>>') and '>>>' in self.line: 328 qres_state = state_QRES_NEW 329 # the beginning of the query info and its hits + hsps 330 elif self.line.startswith('>>>') and not \ 331 self.line.strip() == '>>><<<': 332 qres_state = state_QRES_CONTENT 333 # default qres mark 334 else: 335 qres_state = None 336 337 if qres_state is not None: 338 if qres_state == state_QRES_HITTAB: 339 # parse hit table if flag is set 340 hit_rows = self.__parse_hit_table() 341 342 elif qres_state == state_QRES_END: 343 yield _set_qresult_hits(qresult, hit_rows) 344 break 345 346 elif qres_state == state_QRES_NEW: 347 # if qresult is filled, yield it first 348 if qresult is not None: 349 yield _set_qresult_hits(qresult, hit_rows) 350 regx = re.search(_RE_ID_DESC_SEQLEN, self.line) 351 query_id = regx.group(1) 352 seq_len = regx.group(3) 353 desc = regx.group(2) 354 qresult = QueryResult(id=query_id) 355 qresult.seq_len = int(seq_len) 356 # get target from the next line 357 self.line = self.handle.readline() 358 qresult.target = [x for x in self.line.split(' ') if x][1].strip() 359 if desc is not None: 360 qresult.description = desc 361 # set values from preamble 362 for key, value in self._preamble.items(): 363 setattr(qresult, key, value) 364 365 elif qres_state == state_QRES_CONTENT: 366 assert self.line[3:].startswith(qresult.id), self.line 367 for hit, strand in self._parse_hit(query_id): 368 # HACK: re-set desc, for hsp hit and query description 369 hit.description = hit.description 370 hit.query_description = qresult.description 371 # if hit is not in qresult, append it 372 if hit.id not in qresult: 373 qresult.append(hit) 374 # otherwise, it might be the same hit with a different strand 375 else: 376 # make sure strand is different and then append hsp to 377 # existing hit 378 for hsp in hit.hsps: 379 assert strand != hsp.query_strand 380 qresult[hit.id].append(hsp) 381 382 self.line = self.handle.readline()
383
384 - def _parse_hit(self, query_id):
385 while True: 386 self.line = self.handle.readline() 387 if self.line.startswith('>>'): 388 break 389 390 state = _STATE_NONE 391 strand = None 392 hsp_list = [] 393 while True: 394 peekline = self.handle.peekline() 395 # yield hit if we've reached the start of a new query or 396 # the end of the search 397 if peekline.strip() in [">>><<<", ">>>///"] or \ 398 (not peekline.startswith('>>>') and '>>>' in peekline): 399 # append last parsed_hsp['hit']['seq'] line 400 if state == _STATE_HIT_BLOCK: 401 parsed_hsp['hit']['seq'] += self.line.strip() 402 elif state == _STATE_CONS_BLOCK: 403 hsp.aln_annotation['similarity'] += \ 404 self.line.strip('\r\n') 405 # process HSP alignment and coordinates 406 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 407 hit = Hit(hsp_list) 408 hit.description = hit_desc 409 hit.seq_len = seq_len 410 yield hit, strand 411 hsp_list = [] 412 break 413 # yield hit and create a new one if we're still in the same query 414 elif self.line.startswith('>>'): 415 # try yielding, if we have hsps 416 if hsp_list: 417 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 418 hit = Hit(hsp_list) 419 hit.description = hit_desc 420 hit.seq_len = seq_len 421 yield hit, strand 422 hsp_list = [] 423 # try to get the hit id and desc, and handle cases without descs 424 try: 425 hit_id, hit_desc = self.line[2:].strip().split(' ', 1) 426 except ValueError: 427 hit_id = self.line[2:].strip().split(' ', 1)[0] 428 hit_desc = '' 429 # create the HSP object for Hit 430 frag = HSPFragment(hit_id, query_id) 431 hsp = HSP([frag]) 432 hsp_list.append(hsp) 433 # set or reset the state to none 434 state = _STATE_NONE 435 parsed_hsp = {'query': {}, 'hit': {}} 436 # create and append a new HSP if line starts with '>--' 437 elif self.line.startswith('>--'): 438 # set seq attributes of previous hsp 439 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 440 # and create a new one 441 frag = HSPFragment(hit_id, query_id) 442 hsp = HSP([frag]) 443 hsp_list.append(hsp) 444 # set the state ~ none yet 445 state = _STATE_NONE 446 parsed_hsp = {'query': {}, 'hit': {}} 447 # this is either query or hit data in the HSP, depending on the state 448 elif self.line.startswith('>'): 449 if state == _STATE_NONE: 450 # make sure it's the correct query 451 assert query_id.startswith(self.line[1:].split(' ')[0]), \ 452 "%r vs %r" % (query_id, self.line) 453 state = _STATE_QUERY_BLOCK 454 parsed_hsp['query']['seq'] = '' 455 elif state == _STATE_QUERY_BLOCK: 456 # make sure it's the correct hit 457 assert hit_id.startswith(self.line[1:].split(' ')[0]) 458 state = _STATE_HIT_BLOCK 459 parsed_hsp['hit']['seq'] = '' 460 # check for conservation block 461 elif self.line.startswith('; al_cons'): 462 state = _STATE_CONS_BLOCK 463 hsp.fragment.aln_annotation['similarity'] = '' 464 elif self.line.startswith(';'): 465 # Fasta outputs do not make a clear distinction between Hit 466 # and HSPs, so we check the attribute names to determine 467 # whether it belongs to a Hit or HSP 468 regx = re.search(_RE_ATTR, self.line.strip()) 469 name = regx.group(1) 470 value = regx.group(2) 471 472 # for values before the '>...' query block 473 if state == _STATE_NONE: 474 if name in _HSP_ATTR_MAP: 475 attr_name, caster = _HSP_ATTR_MAP[name] 476 if caster is not str: 477 value = caster(value) 478 if name in ['_ident', '_sim']: 479 value *= 100 480 setattr(hsp, attr_name, value) 481 # otherwise, pool the values for processing later 482 elif state == _STATE_QUERY_BLOCK: 483 parsed_hsp['query'][name] = value 484 elif state == _STATE_HIT_BLOCK: 485 if name == '_len': 486 seq_len = int(value) 487 else: 488 parsed_hsp['hit'][name] = value 489 # for values in the hit block 490 else: 491 raise ValueError("Unexpected line: %r" % self.line) 492 # otherwise, it must be lines containing the sequences 493 else: 494 assert '>' not in self.line 495 # if we're in hit, parse into hsp.hit 496 if state == _STATE_HIT_BLOCK: 497 parsed_hsp['hit']['seq'] += self.line.strip() 498 elif state == _STATE_QUERY_BLOCK: 499 parsed_hsp['query']['seq'] += self.line.strip() 500 elif state == _STATE_CONS_BLOCK: 501 hsp.fragment.aln_annotation['similarity'] += \ 502 self.line.strip('\r\n') 503 # we should not get here! 504 else: 505 raise ValueError("Unexpected line: %r" % self.line) 506 507 self.line = self.handle.readline()
508 509
510 -class FastaM10Indexer(SearchIndexer):
511 """Indexer class for Bill Pearson's FASTA suite's -m 10 output.""" 512 513 _parser = FastaM10Parser 514
515 - def __init__(self, filename):
516 SearchIndexer.__init__(self, filename) 517 self._handle = UndoHandle(self._handle)
518
519 - def __iter__(self):
520 handle = self._handle 521 handle.seek(0) 522 start_offset = handle.tell() 523 qresult_key = None 524 query_mark = _as_bytes('>>>') 525 526 while True: 527 line = handle.readline() 528 peekline = handle.peekline() 529 end_offset = handle.tell() 530 531 if not line.startswith(query_mark) and query_mark in line: 532 regx = re.search(_RE_ID_DESC_SEQLEN_IDX, line) 533 qresult_key = _bytes_to_string(regx.group(1)) 534 start_offset = end_offset - len(line) 535 # yield whenever we encounter a new query or at the end of the file 536 if qresult_key is not None: 537 if (not peekline.startswith(query_mark) and 538 query_mark in peekline) or not line: 539 yield qresult_key, start_offset, end_offset - start_offset 540 if not line: 541 break 542 start_offset = end_offset
543
544 - def get_raw(self, offset):
545 """Return the raw record from the file as a bytes string.""" 546 handle = self._handle 547 qresult_raw = _as_bytes('') 548 query_mark = _as_bytes('>>>') 549 550 # read header first 551 handle.seek(0) 552 while True: 553 line = handle.readline() 554 peekline = handle.peekline() 555 qresult_raw += line 556 if not peekline.startswith(query_mark) and query_mark in peekline: 557 break 558 559 # and read the qresult raw string 560 handle.seek(offset) 561 while True: 562 # preserve whitespace, don't use read_forward 563 line = handle.readline() 564 peekline = handle.peekline() 565 qresult_raw += line 566 567 # break when we've reached qresult end 568 if (not peekline.startswith(query_mark) and query_mark in peekline) or \ 569 not line: 570 break 571 572 # append mock end marker to qresult_raw, since it's not always present 573 return qresult_raw + _as_bytes('>>><<<\n')
574 575 576 # if not used as a module, run the doctest 577 if __name__ == "__main__": 578 from Bio._utils import run_doctest 579 run_doctest() 580