Package Bio :: Package SearchIO :: Module FastaIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.FastaIO

  1  # Adapted from Bio.AlignIO.FastaIO copyright 2008-2011 by Peter Cock. 
  2  # Copyright 2012 by Wibowo Arindrarto. 
  3  # All rights reserved. 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """Bio.SearchIO support for Bill Pearson's FASTA tools. 
  9   
 10  This module adds support for parsing FASTA outputs. FASTA is a suite of 
 11  programs that finds regions of local or global similarity between protein 
 12  or nucleotide sequences, either by searching databases or identifying 
 13  local duplications. 
 14   
 15  Bio.SearchIO.FastaIO was tested on the following FASTA flavors and versions: 
 16   
 17      - flavors: fasta, ssearch, tfastx 
 18      - versions: 35, 36 
 19   
 20  Other flavors and/or versions may introduce some bugs. Please file a bug report 
 21  if you see such problems to Biopython's bug tracker. 
 22   
 23  More information on FASTA are available through these links: 
 24   
 25      - Website: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml 
 26      - User guide: http://fasta.bioch.virginia.edu/fasta_www2/fasta_guide.pdf 
 27   
 28   
 29  Supported Formats 
 30  ================= 
 31   
 32  Bio.SearchIO.FastaIO supports parsing and indexing FASTA outputs triggered by 
 33  the -m 10 flag. Other formats that mimic other programs (e.g. the BLAST tabular 
 34  format using the -m 8 flag) may be parseable but using SearchIO's other parsers 
 35  (in this case, using the 'blast-tab' parser). 
 36   
 37   
 38  fasta-m10 
 39  ========= 
 40   
 41  Note that in FASTA -m 10 outputs, HSPs from different strands are considered to 
 42  be from different hits. They are listed as two separate entries in the hit 
 43  table. FastaIO recognizes this and will group HSPs with the same hit ID into a 
 44  single Hit object, regardless of strand. 
 45   
 46  FASTA also sometimes output extra sequences adjacent to the HSP match. These 
 47  extra sequences are discarded by FastaIO. Only regions containing the actual 
 48  sequence match are extracted. 
 49   
 50  The following object attributes are provided: 
 51   
 52  +-----------------+-------------------------+----------------------------------+ 
 53  | Object          | Attribute               | Value                            | 
 54  +=================+=========================+==================================+ 
 55  | QueryResult     | description             | query sequence description       | 
 56  |                 +-------------------------+----------------------------------+ 
 57  |                 | id                      | query sequence ID                | 
 58  |                 +-------------------------+----------------------------------+ 
 59  |                 | program                 | FASTA flavor                     | 
 60  |                 +-------------------------+----------------------------------+ 
 61  |                 | seq_len                 | full length of query sequence    | 
 62  |                 +-------------------------+----------------------------------+ 
 63  |                 | target                  | target search database           | 
 64  |                 +-------------------------+----------------------------------+ 
 65  |                 | version                 | FASTA version                    | 
 66  +-----------------+-------------------------+----------------------------------+ 
 67  | Hit             | seq_len                 | full length of the hit sequence  | 
 68  +-----------------+-------------------------+----------------------------------+ 
 69  | HSP             | bitscore                | \*_bits line                     | 
 70  |                 +-------------------------+----------------------------------+ 
 71  |                 | evalue                  | \*_expect line                   | 
 72  |                 +-------------------------+----------------------------------+ 
 73  |                 | ident_pct               | \*_ident line                    | 
 74  |                 +-------------------------+----------------------------------+ 
 75  |                 | init1_score             | \*_init1 line                    | 
 76  |                 +-------------------------+----------------------------------+ 
 77  |                 | initn_score             | \*_initn line                    | 
 78  |                 +-------------------------+----------------------------------+ 
 79  |                 | opt_score               | \*_opt line, \*_s-w opt line     | 
 80  |                 +-------------------------+----------------------------------+ 
 81  |                 | pos_pct                 | \*_sim line                      | 
 82  |                 +-------------------------+----------------------------------+ 
 83  |                 | sw_score                | \*_score line                    | 
 84  |                 +-------------------------+----------------------------------+ 
 85  |                 | z_score                 | \*_z-score line                  | 
 86  +-----------------+-------------------------+----------------------------------+ 
 87  | HSPFragment     | aln_annotation          | al_cons block, if present        | 
 88  | (also via HSP)  +-------------------------+----------------------------------+ 
 89  |                 | hit                     | hit sequence                     | 
 90  |                 +-------------------------+----------------------------------+ 
 91  |                 | hit_end                 | hit sequence end coordinate      | 
 92  |                 +-------------------------+----------------------------------+ 
 93  |                 | hit_start               | hit sequence start coordinate    | 
 94  |                 +-------------------------+----------------------------------+ 
 95  |                 | hit_strand              | hit sequence strand              | 
 96  |                 +-------------------------+----------------------------------+ 
 97  |                 | query                   | query sequence                   | 
 98  |                 +-------------------------+----------------------------------+ 
 99  |                 | query_end               | query sequence end coordinate    | 
100  |                 +-------------------------+----------------------------------+ 
101  |                 | query_start             | query sequence start coordinate  | 
102  |                 +-------------------------+----------------------------------+ 
103  |                 | query_strand            | query sequence strand            | 
104  +-----------------+-------------------------+----------------------------------+ 
105   
106  """ 
107   
108  import re 
109   
110  from Bio._py3k import _as_bytes, _bytes_to_string 
111  from Bio.Alphabet import generic_dna, generic_protein 
112  from Bio.File import UndoHandle 
113  from Bio.SearchIO._index import SearchIndexer 
114  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
115   
116   
117  __all__ = ('FastaM10Parser', 'FastaM10Indexer') 
118   
119   
120  # precompile regex patterns 
121  # regex for program name 
122  _RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search') 
123  # regex for sequence ID and length ~ deals with both \n and \r\n 
124  _PTR_ID_DESC_SEQLEN = r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$' 
125  _RE_ID_DESC_SEQLEN = re.compile(_PTR_ID_DESC_SEQLEN) 
126  _RE_ID_DESC_SEQLEN_IDX = re.compile(_as_bytes(_PTR_ID_DESC_SEQLEN)) 
127  # regex for qresult, hit, or hsp attribute value 
128  _RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$') 
129  # regex for capturing excess start and end sequences in alignments 
130  _RE_START_EXC = re.compile(r'^-*') 
131  _RE_END_EXC = re.compile(r'-*$') 
132   
133  # attribute name mappings 
134  _HSP_ATTR_MAP = { 
135      '_initn': ('initn_score', int), 
136      '_init1': ('init1_score', int), 
137      '_opt': ('opt_score', int), 
138      '_s-w opt': ('opt_score', int), 
139      '_z-score': ('z_score', float), 
140      '_bits': ('bitscore', float), 
141      '_expect': ('evalue', float), 
142      '_score': ('sw_score', int), 
143      '_ident': ('ident_pct', float), 
144      '_sim': ('pos_pct', float), 
145  } 
146   
147  # state flags 
148  _STATE_NONE = 0 
149  _STATE_QUERY_BLOCK = 1 
150  _STATE_HIT_BLOCK = 2 
151  _STATE_CONS_BLOCK = 3 
152   
153   
154 -def _set_qresult_hits(qresult, hit_rows=()):
155 """Helper function for appending Hits without alignments into QueryResults.""" 156 for hit_row in hit_rows: 157 hit_id, remainder = hit_row.split(' ', 1) 158 # TODO: parse hit and hsp properties properly; by dealing with: 159 # - any character in the description (brackets, spaces, etc.) 160 # - possible [f] or [r] presence (for frame info) 161 # - possible presence of E2() column 162 # - possible incomplete hit_id due to column length limit 163 # The current method only looks at the Hit ID, none of the things above 164 if hit_id not in qresult: 165 frag = HSPFragment(hit_id, qresult.id) 166 hsp = HSP([frag]) 167 hit = Hit([hsp]) 168 qresult.append(hit) 169 170 return qresult
171 172
173 -def _set_hsp_seqs(hsp, parsed, program):
174 """Helper function for the main parsing code. 175 176 :param hsp: HSP whose properties will be set 177 :type hsp: HSP 178 :param parsed: parsed values of the HSP attributes 179 :type parsed: dictionary {string: object} 180 :param program: program name 181 :type program: string 182 183 """ 184 # get aligned sequences and check if they have equal lengths 185 start = 0 186 for seq_type in ('hit', 'query'): 187 if 'tfast' not in program: 188 pseq = parsed[seq_type] 189 # adjust start and end coordinates based on the amount of 190 # filler characters 191 start, stop = _get_aln_slice_coords(pseq) 192 start_adj = len(re.search(_RE_START_EXC, pseq['seq']).group(0)) 193 stop_adj = len(re.search(_RE_END_EXC, pseq['seq']).group(0)) 194 start = start + start_adj 195 stop = stop + start_adj - stop_adj 196 parsed[seq_type]['seq'] = pseq['seq'][start:stop] 197 assert len(parsed['query']['seq']) == len(parsed['hit']['seq']), "%r %r" \ 198 % (len(parsed['query']['seq']), len(parsed['hit']['seq'])) 199 if 'similarity' in hsp.aln_annotation: 200 # only using 'start' since FASTA seems to have trimmed the 'excess' 201 # end part 202 hsp.aln_annotation['similarity'] = hsp.aln_annotation['similarity'][start:] 203 # hit or query works equally well here 204 assert len(hsp.aln_annotation['similarity']) == len(parsed['hit']['seq']) 205 206 # query and hit sequence types must be the same 207 assert parsed['query']['_type'] == parsed['hit']['_type'] 208 type_val = parsed['query']['_type'] # hit works fine too 209 alphabet = generic_dna if type_val == 'D' else generic_protein 210 setattr(hsp.fragment, 'alphabet', alphabet) 211 212 for seq_type in ('hit', 'query'): 213 # get and set start and end coordinates 214 start = int(parsed[seq_type]['_start']) 215 end = int(parsed[seq_type]['_stop']) 216 217 setattr(hsp.fragment, seq_type + '_start', min(start, end) - 1) 218 setattr(hsp.fragment, seq_type + '_end', max(start, end)) 219 # set seq and alphabet 220 setattr(hsp.fragment, seq_type, parsed[seq_type]['seq']) 221 222 if alphabet is not generic_protein: 223 # get strand from coordinate; start <= end is plus 224 # start > end is minus 225 if start <= end: 226 setattr(hsp.fragment, seq_type + '_strand', 1) 227 else: 228 setattr(hsp.fragment, seq_type + '_strand', -1) 229 else: 230 setattr(hsp.fragment, seq_type + '_strand', 0)
231 232
233 -def _get_aln_slice_coords(parsed_hsp):
234 """Helper function for the main parsing code. 235 236 To get the actual pairwise alignment sequences, we must first 237 translate the un-gapped sequence based coordinates into positions 238 in the gapped sequence (which may have a flanking region shown 239 using leading - characters). To date, I have never seen any 240 trailing flanking region shown in the m10 file, but the 241 following code should also cope with that. 242 243 Note that this code seems to work fine even when the "sq_offset" 244 entries are prsent as a result of using the -X command line option. 245 """ 246 seq = parsed_hsp['seq'] 247 seq_stripped = seq.strip('-') 248 disp_start = int(parsed_hsp['_display_start']) 249 start = int(parsed_hsp['_start']) 250 stop = int(parsed_hsp['_stop']) 251 252 if start <= stop: 253 start = start - disp_start 254 stop = stop - disp_start + 1 255 else: 256 start = disp_start - start 257 stop = disp_start - stop + 1 258 stop += seq_stripped.count('-') 259 assert 0 <= start and start < stop and stop <= len(seq_stripped), \ 260 "Problem with sequence start/stop,\n%s[%i:%i]\n%s" \ 261 % (seq, start, stop, parsed_hsp) 262 return start, stop
263 264
265 -class FastaM10Parser(object):
266 """Parser for Bill Pearson's FASTA suite's -m 10 output.""" 267
268 - def __init__(self, handle, __parse_hit_table=False):
269 self.handle = UndoHandle(handle) 270 self._preamble = self._parse_preamble()
271
272 - def __iter__(self):
273 for qresult in self._parse_qresult(): 274 # re-set desc, for hsp query description 275 qresult.description = qresult.description 276 yield qresult
277
278 - def _parse_preamble(self):
279 """Parses the Fasta preamble for Fasta flavor and version.""" 280 preamble = {} 281 while True: 282 self.line = self.handle.readline() 283 # this should be the line just before the first qresult 284 if self.line.startswith('Query'): 285 break 286 # try to match for version line 287 elif self.line.startswith(' version'): 288 preamble['version'] = self.line.split(' ')[2] 289 else: 290 # try to match for flavor line 291 flav_match = re.match(_RE_FLAVS, self.line.lower()) 292 if flav_match: 293 preamble['program'] = flav_match.group(0) 294 295 return preamble
296
297 - def __parse_hit_table(self):
298 """Parses hit table rows.""" 299 # move to the first row 300 self.line = self.handle.readline() 301 # parse hit table until we see an empty line 302 hit_rows = [] 303 while self.line and not self.line.strip(): 304 hit_rows.append(self.line.strip()) 305 self.line = self.handle.readline() 306 return hit_rows
307
308 - def _parse_qresult(self):
309 # initial qresult value 310 qresult = None 311 hit_rows = [] 312 # state values 313 state_QRES_NEW = 1 314 state_QRES_HITTAB = 3 315 state_QRES_CONTENT = 5 316 state_QRES_END = 7 317 318 while True: 319 320 # one line before the hit table 321 if self.line.startswith('The best scores are:'): 322 qres_state = state_QRES_HITTAB 323 # the end of a query or the file altogether 324 elif self.line.strip() == '>>>///' or not self.line: 325 qres_state = state_QRES_END 326 # the beginning of a new query 327 elif not self.line.startswith('>>>') and '>>>' in self.line: 328 qres_state = state_QRES_NEW 329 # the beginning of the query info and its hits + hsps 330 elif self.line.startswith('>>>') and not \ 331 self.line.strip() == '>>><<<': 332 qres_state = state_QRES_CONTENT 333 # default qres mark 334 else: 335 qres_state = None 336 337 if qres_state is not None: 338 if qres_state == state_QRES_HITTAB: 339 # parse hit table if flag is set 340 hit_rows = self.__parse_hit_table() 341 342 elif qres_state == state_QRES_END: 343 yield _set_qresult_hits(qresult, hit_rows) 344 break 345 346 elif qres_state == state_QRES_NEW: 347 # if qresult is filled, yield it first 348 if qresult is not None: 349 yield _set_qresult_hits(qresult, hit_rows) 350 regx = re.search(_RE_ID_DESC_SEQLEN, self.line) 351 query_id = regx.group(1) 352 seq_len = regx.group(3) 353 desc = regx.group(2) 354 qresult = QueryResult(id=query_id) 355 qresult.seq_len = int(seq_len) 356 # get target from the next line 357 self.line = self.handle.readline() 358 qresult.target = [x for x in self.line.split(' ') if x][1].strip() 359 if desc is not None: 360 qresult.description = desc 361 # set values from preamble 362 for key, value in self._preamble.items(): 363 setattr(qresult, key, value) 364 365 elif qres_state == state_QRES_CONTENT: 366 assert self.line[3:].startswith(qresult.id), self.line 367 for hit, strand in self._parse_hit(query_id): 368 # HACK: re-set desc, for hsp hit and query description 369 hit.description = hit.description 370 hit.query_description = qresult.description 371 # if hit is not in qresult, append it 372 if hit.id not in qresult: 373 qresult.append(hit) 374 # otherwise, it might be the same hit with a different strand 375 else: 376 # make sure strand is different and then append hsp to 377 # existing hit 378 for hsp in hit.hsps: 379 assert strand != hsp.query_strand 380 qresult[hit.id].append(hsp) 381 382 self.line = self.handle.readline()
383
384 - def _parse_hit(self, query_id):
385 while True: 386 self.line = self.handle.readline() 387 if self.line.startswith('>>'): 388 break 389 390 state = _STATE_NONE 391 strand = None 392 hsp_list = [] 393 hsp = None 394 parsed_hsp = None 395 hit_desc = None 396 seq_len = None 397 while True: 398 peekline = self.handle.peekline() 399 # yield hit if we've reached the start of a new query or 400 # the end of the search 401 if peekline.strip() in [">>><<<", ">>>///"] or \ 402 (not peekline.startswith('>>>') and '>>>' in peekline): 403 # append last parsed_hsp['hit']['seq'] line 404 if state == _STATE_HIT_BLOCK: 405 parsed_hsp['hit']['seq'] += self.line.strip() 406 elif state == _STATE_CONS_BLOCK: 407 hsp.aln_annotation['similarity'] += \ 408 self.line.strip('\r\n') 409 # process HSP alignment and coordinates 410 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 411 hit = Hit(hsp_list) 412 hit.description = hit_desc 413 hit.seq_len = seq_len 414 yield hit, strand 415 hsp_list = [] 416 break 417 # yield hit and create a new one if we're still in the same query 418 elif self.line.startswith('>>'): 419 # try yielding, if we have hsps 420 if hsp_list: 421 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 422 hit = Hit(hsp_list) 423 hit.description = hit_desc 424 hit.seq_len = seq_len 425 yield hit, strand 426 hsp_list = [] 427 # try to get the hit id and desc, and handle cases without descs 428 try: 429 hit_id, hit_desc = self.line[2:].strip().split(' ', 1) 430 except ValueError: 431 hit_id = self.line[2:].strip().split(' ', 1)[0] 432 hit_desc = '' 433 # create the HSP object for Hit 434 frag = HSPFragment(hit_id, query_id) 435 hsp = HSP([frag]) 436 hsp_list.append(hsp) 437 # set or reset the state to none 438 state = _STATE_NONE 439 parsed_hsp = {'query': {}, 'hit': {}} 440 # create and append a new HSP if line starts with '>--' 441 elif self.line.startswith('>--'): 442 # set seq attributes of previous hsp 443 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 444 # and create a new one 445 frag = HSPFragment(hit_id, query_id) 446 hsp = HSP([frag]) 447 hsp_list.append(hsp) 448 # set the state ~ none yet 449 state = _STATE_NONE 450 parsed_hsp = {'query': {}, 'hit': {}} 451 # this is either query or hit data in the HSP, depending on the state 452 elif self.line.startswith('>'): 453 if state == _STATE_NONE: 454 # make sure it's the correct query 455 assert query_id.startswith(self.line[1:].split(' ')[0]), \ 456 "%r vs %r" % (query_id, self.line) 457 state = _STATE_QUERY_BLOCK 458 parsed_hsp['query']['seq'] = '' 459 elif state == _STATE_QUERY_BLOCK: 460 # make sure it's the correct hit 461 assert hit_id.startswith(self.line[1:].split(' ')[0]) 462 state = _STATE_HIT_BLOCK 463 parsed_hsp['hit']['seq'] = '' 464 # check for conservation block 465 elif self.line.startswith('; al_cons'): 466 state = _STATE_CONS_BLOCK 467 hsp.fragment.aln_annotation['similarity'] = '' 468 elif self.line.startswith(';'): 469 # Fasta outputs do not make a clear distinction between Hit 470 # and HSPs, so we check the attribute names to determine 471 # whether it belongs to a Hit or HSP 472 regx = re.search(_RE_ATTR, self.line.strip()) 473 name = regx.group(1) 474 value = regx.group(2) 475 476 # for values before the '>...' query block 477 if state == _STATE_NONE: 478 if name in _HSP_ATTR_MAP: 479 attr_name, caster = _HSP_ATTR_MAP[name] 480 if caster is not str: 481 value = caster(value) 482 if name in ['_ident', '_sim']: 483 value *= 100 484 setattr(hsp, attr_name, value) 485 # otherwise, pool the values for processing later 486 elif state == _STATE_QUERY_BLOCK: 487 parsed_hsp['query'][name] = value 488 elif state == _STATE_HIT_BLOCK: 489 if name == '_len': 490 seq_len = int(value) 491 else: 492 parsed_hsp['hit'][name] = value 493 # for values in the hit block 494 else: 495 raise ValueError("Unexpected line: %r" % self.line) 496 # otherwise, it must be lines containing the sequences 497 else: 498 assert '>' not in self.line 499 # if we're in hit, parse into hsp.hit 500 if state == _STATE_HIT_BLOCK: 501 parsed_hsp['hit']['seq'] += self.line.strip() 502 elif state == _STATE_QUERY_BLOCK: 503 parsed_hsp['query']['seq'] += self.line.strip() 504 elif state == _STATE_CONS_BLOCK: 505 hsp.fragment.aln_annotation['similarity'] += \ 506 self.line.strip('\r\n') 507 # we should not get here! 508 else: 509 raise ValueError("Unexpected line: %r" % self.line) 510 511 self.line = self.handle.readline()
512 513
514 -class FastaM10Indexer(SearchIndexer):
515 """Indexer class for Bill Pearson's FASTA suite's -m 10 output.""" 516 517 _parser = FastaM10Parser 518
519 - def __init__(self, filename):
520 SearchIndexer.__init__(self, filename) 521 self._handle = UndoHandle(self._handle)
522
523 - def __iter__(self):
524 handle = self._handle 525 handle.seek(0) 526 start_offset = handle.tell() 527 qresult_key = None 528 query_mark = b">>>" 529 530 while True: 531 line = handle.readline() 532 peekline = handle.peekline() 533 end_offset = handle.tell() 534 535 if not line.startswith(query_mark) and query_mark in line: 536 regx = re.search(_RE_ID_DESC_SEQLEN_IDX, line) 537 qresult_key = _bytes_to_string(regx.group(1)) 538 start_offset = end_offset - len(line) 539 # yield whenever we encounter a new query or at the end of the file 540 if qresult_key is not None: 541 if (not peekline.startswith(query_mark) and query_mark in peekline) or not line: 542 yield qresult_key, start_offset, end_offset - start_offset 543 if not line: 544 break 545 start_offset = end_offset
546
547 - def get_raw(self, offset):
548 """Return the raw record from the file as a bytes string.""" 549 handle = self._handle 550 qresult_raw = b"" 551 query_mark = b">>>" 552 553 # read header first 554 handle.seek(0) 555 while True: 556 line = handle.readline() 557 peekline = handle.peekline() 558 qresult_raw += line 559 if not peekline.startswith(query_mark) and query_mark in peekline: 560 break 561 562 # and read the qresult raw string 563 handle.seek(offset) 564 while True: 565 # preserve whitespace, don't use read_forward 566 line = handle.readline() 567 peekline = handle.peekline() 568 qresult_raw += line 569 570 # break when we've reached qresult end 571 if (not peekline.startswith(query_mark) and query_mark in peekline) or \ 572 not line: 573 break 574 575 # append mock end marker to qresult_raw, since it's not always present 576 return qresult_raw + b">>><<<\n"
577 578 579 # if not used as a module, run the doctest 580 if __name__ == "__main__": 581 from Bio._utils import run_doctest 582 run_doctest() 583