Package Bio :: Package SearchIO :: Module BlatIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.BlatIO

  1  # Copyright 2012 by Wibowo Arindrarto.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Bio.SearchIO parser for BLAT output formats. 
  7   
  8  This module adds support for parsing BLAT outputs. BLAT (BLAST-Like Alignment 
  9  Tool) is a sequence similarity search program initially built for annotating 
 10  the human genome. 
 11   
 12  Bio.SearchIO.BlastIO was tested using standalone BLAT version 34, psLayout 
 13  version 3. It should be able to parse psLayout version 4 without problems. 
 14   
 15  More information on BLAT is available from these sites: 
 16   
 17      - Publication: http://genome.cshlp.org/content/12/4/656 
 18      - User guide: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 
 19      - Source download: http://www.soe.ucsc.edu/~kent/src 
 20      - Executable download: http://hgdownload.cse.ucsc.edu/admin/exe/ 
 21      - Blat score calculation: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 
 22   
 23   
 24  Supported Formats 
 25  ================= 
 26   
 27  BlatIO supports parsing, indexing, and writing for both PSL and PSLX output 
 28  formats, with or without header. To parse, index, or write PSLX files, use the 
 29  'pslx' keyword argument and set it to True. 
 30   
 31      # blat-psl defaults to PSL files 
 32      >>> from Bio import SearchIO 
 33      >>> psl = 'Blat/psl_34_004.psl' 
 34      >>> qresult = SearchIO.read(psl, 'blat-psl') 
 35      >>> qresult 
 36      QueryResult(id='hg19_dna', 10 hits) 
 37   
 38      # set the pslx flag to parse PSLX files 
 39      >>> pslx = 'Blat/pslx_34_004.pslx' 
 40      >>> qresult = SearchIO.read(pslx, 'blat-psl', pslx=True) 
 41      >>> qresult 
 42      QueryResult(id='hg19_dna', 10 hits) 
 43   
 44  For parsing and indexing, you do not need to specify whether the file has a 
 45  header or not. For writing, if you want to write a header, you can set the 
 46  'header' keyword argument to True. This will write a 'psLayout version 3' header 
 47  to your output file. 
 48   
 49      from Bio import SearchIO 
 50      qresult = SearchIO.read(psl, 'blat-psl') 
 51      SearchIO.write(qresult, 'header.psl', header=True) 
 52      <stdout> (1, 10, 19, 23) 
 53   
 54  Note that the number of HSPFragments written may exceed the number of HSP 
 55  objects. This is because in PSL files, it is possible to have single matches 
 56  consisting of noncontiguous sequence fragments. This is where the HSPFragment 
 57  object comes into play. These fragments are grouped into a single HSP because 
 58  they share the same statistics (e.g. match numbers, BLAT score, etc.). However, 
 59  they do not share the same sequence attributes, such as the start and end 
 60  coordinates, making them distinct objects. 
 61   
 62  In addition to parsing PSL(X) files, BlatIO also computes the percent identities 
 63  and scores of your search results. This is done using the calculation formula 
 64  posted here: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4. It mimics the score 
 65  and percent identity calculation done by UCSC's web BLAT service. 
 66   
 67  Since BlatIO parses the file in a single pass, it expects all results from 
 68  the same query to be in consecutive rows. If the results from one query are 
 69  spread in nonconsecutive rows, BlatIO will consider them to be separate 
 70  QueryResult objects. 
 71   
 72  In most cases, the PSL(X) format uses the same coordinate system as Python 
 73  (zero-based, half open). These coordinates are anchored on the plus strand. 
 74  However, if the query aligns on the minus strand, BLAT will anchor the qStarts 
 75  coordinates on the minus strand instead. BlatIO is aware of this, and will 
 76  re-anchor the qStarts coordinates to the plus strand whenever it sees a minus 
 77  strand query match. Conversely, when you write out to a PSL(X) file, BlatIO will 
 78  reanchor qStarts to the minus strand again. 
 79   
 80  BlatIO provides the following attribute-column mapping: 
 81   
 82  +----------------+-------------------------+-----------------------------------+ 
 83  | Object         | Attribute               | Column Name, Value                | 
 84  +================+=========================+===================================+ 
 85  | QueryResutl    | id                      | Q name, query sequence ID         | 
 86  |                +-------------------------+-----------------------------------+ 
 87  |                | seq_len                 | Q size, query sequence full       | 
 88  |                |                         | length                            | 
 89  +----------------+-------------------------+-----------------------------------+ 
 90  | Hit            | id                      | T name, hit sequence ID           | 
 91  |                +-------------------------+-----------------------------------+ 
 92  |                | seq_len                 | T size, hit sequence full length  | 
 93  +----------------+-------------------------+-----------------------------------+ 
 94  | HSP            | hit_end                 | T end, end coordinate of the last | 
 95  |                |                         | hit fragment                      | 
 96  |                +-------------------------+-----------------------------------+ 
 97  |                | hit_gap_num             | T gap bases, number of bases      | 
 98  |                |                         | inserted in hit                   | 
 99  |                +-------------------------+-----------------------------------+ 
100  |                | hit_gapopen_num         | T gap count, number of hit gap    | 
101  |                |                         | inserts                           | 
102  |                +-------------------------+-----------------------------------+ 
103  |                | hit_span_all            | blockSizes, sizes of each         | 
104  |                |                         | fragment                          | 
105  |                +-------------------------+-----------------------------------+ 
106  |                | hit_start               | T start, start coordinate of the  | 
107  |                |                         | first hit fragment                | 
108  |                +-------------------------+-----------------------------------+ 
109  |                | hit_start_all           | tStarts, start coordinate of each | 
110  |                |                         | hit fragment                      | 
111  |                +-------------------------+-----------------------------------+ 
112  |                | match_num               | match, number of non-repeat       | 
113  |                |                         | matches                           | 
114  |                +-------------------------+-----------------------------------+ 
115  |                | mismatch_num            | mismatch, number of mismatches    | 
116  |                +-------------------------+-----------------------------------+ 
117  |                | match_rep_num           | rep. match, number of matches     | 
118  |                |                         | that are part of repeats          | 
119  |                +-------------------------+-----------------------------------+ 
120  |                | n_num                   | N's, number of N bases            | 
121  |                +-------------------------+-----------------------------------+ 
122  |                | query_end               | Q end, end coordinate of the last | 
123  |                +-------------------------+-----------------------------------+ 
124  |                |                         | query fragment                    | 
125  |                | query_gap_num           | Q gap bases, number of bases      | 
126  |                |                         | inserted in query                 | 
127  |                +-------------------------+-----------------------------------+ 
128  |                | query_gapopen_num       | Q gap count, number of query gap  | 
129  |                |                         | inserts                           | 
130  |                +-------------------------+-----------------------------------+ 
131  |                | query_span_all          | blockSizes, sizes of each         | 
132  |                |                         | fragment                          | 
133  |                +-------------------------+-----------------------------------+ 
134  |                | query_start             | Q start, start coordinate of the  | 
135  |                |                         | first query block                 | 
136  |                +-------------------------+-----------------------------------+ 
137  |                | query_start_all         | qStarts, start coordinate of each | 
138  |                |                         | query fragment                    | 
139  |                +-------------------------+-----------------------------------+ 
140  |                | len [1]                 | block count, the number of blocks | 
141  |                |                         | in the alignment                  | 
142  +----------------+-------------------------+-----------------------------------+ 
143  | HSPFragment    | hit                     | hit sequence, if present          | 
144  |                +-------------------------+-----------------------------------+ 
145  |                | hit_strand              | strand, hit sequence strand       | 
146  |                +-------------------------+-----------------------------------+ 
147  |                | query                   | query sequence, if present        | 
148  |                +-------------------------+-----------------------------------+ 
149  |                | query_strand            | strand, query sequence strand     | 
150  +----------------+-------------------------+-----------------------------------+ 
151   
152  In addition to the column mappings above, BlatIO also provides the following 
153  object attributes: 
154   
155  +----------------+-------------------------+-----------------------------------+ 
156  | Object         | Attribute               | Value                             | 
157  +================+=========================+===================================+ 
158  | HSP            | gapopen_num             | Q gap count + T gap count, total  | 
159  |                |                         |  number of gap openings           | 
160  |                +-------------------------+-----------------------------------+ 
161  |                | ident_num               | matches + repmatches, total       | 
162  |                |                         | number of identical residues      | 
163  |                +-------------------------+-----------------------------------+ 
164  |                | ident_pct               | percent identity, calculated      | 
165  |                |                         | using UCSC's formula              | 
166  |                +-------------------------+-----------------------------------+ 
167  |                | query_is_protein        | boolean, whether the query        | 
168  |                |                         | sequence is a protein             | 
169  |                +-------------------------+-----------------------------------+ 
170  |                | score                   | HSP score, calculated using       | 
171  |                |                         | UCSC's formula                    | 
172  +----------------+-------------------------+-----------------------------------+ 
173   
174  Finally, the default HSP and HSPFragment properties are also provided. See the 
175  HSP and HSPFragment documentation for more details on these properties. 
176   
177   
178  .. [1] You can obtain the number of blocks / fragments in the HSP by invoking 
179     ``len`` on the HSP 
180   
181  """ 
182  import re 
183  from math import log 
184   
185  from Bio._py3k import _as_bytes, _bytes_to_string 
186  from Bio._py3k import zip 
187   
188  from Bio.Alphabet import generic_dna 
189  from Bio.SearchIO._index import SearchIndexer 
190  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
191   
192   
193  __all__ = ('BlatPslParser', 'BlatPslIndexer', 'BlatPslWriter') 
194   
195   
196  # precompile regex patterns 
197  _PTR_ROW_CHECK = r'^\d+\s+\d+\s+\d+\s+\d+' 
198  _RE_ROW_CHECK = re.compile(_PTR_ROW_CHECK) 
199  _RE_ROW_CHECK_IDX = re.compile(_as_bytes(_PTR_ROW_CHECK)) 
200   
201   
202 -def _list_from_csv(csv_string, caster=None):
203 """Transforms the given comma-separated string into a list. 204 205 :param csv_string: comma-separated input string 206 :type csv_string: string 207 :param caster: function used to cast each item in the input string 208 to its intended type 209 :type caster: callable, accepts string, returns object 210 211 """ 212 if caster is None: 213 return [x for x in csv_string.split(',') if x] 214 else: 215 return [caster(x) for x in csv_string.split(',') if x]
216 217
218 -def _reorient_starts(starts, blksizes, seqlen, strand):
219 """Reorients block starts into the opposite strand's coordinates. 220 221 :param starts: start coordinates 222 :type starts: list [int] 223 :param blksizes: block sizes 224 :type blksizes: list [int] 225 :param seqlen: sequence length 226 :type seqlen: int 227 :param strand: sequence strand 228 :type strand: int, choice of -1, 0, or 1 229 230 """ 231 assert len(starts) == len(blksizes), \ 232 "Unequal start coordinates and block sizes list (%r vs %r)" \ 233 % (len(starts), len(blksizes)) 234 # see: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 235 # no need to reorient if it's already the positive strand 236 if strand >= 0: 237 return starts 238 else: 239 # the plus-oriented coordinate is calculated by this: 240 # plus_coord = length - minus_coord - block_size 241 return [seqlen - start - blksize for 242 start, blksize in zip(starts, blksizes)]
243 244
245 -def _is_protein(psl):
246 # check if query is protein or not 247 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 248 if len(psl['strand']) == 2: 249 if psl['strand'][1] == '+': 250 return psl['tend'] == psl['tstarts'][-1] + \ 251 3 * psl['blocksizes'][-1] 252 elif psl['strand'][1] == '-': 253 return psl['tstart'] == psl['tsize'] - \ 254 (psl['tstarts'][-1] + 3 * psl['blocksizes'][-1]) 255 256 return False
257 258
259 -def _calc_millibad(psl, is_protein):
260 # calculates millibad 261 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 262 size_mul = 3 if is_protein else 1 263 millibad = 0 264 265 qali_size = size_mul * (psl['qend'] - psl['qstart']) 266 tali_size = psl['tend'] - psl['tstart'] 267 ali_size = min(qali_size, tali_size) 268 if ali_size <= 0: 269 return 0 270 271 size_dif = qali_size - tali_size 272 size_dif = 0 if size_dif < 0 else size_dif 273 274 total = size_mul * (psl['matches'] + psl['repmatches'] + psl['mismatches']) 275 if total != 0: 276 millibad = (1000 * (psl['mismatches'] * size_mul + psl['qnuminsert'] + 277 round(3 * log(1 + size_dif)))) / total 278 279 return millibad
280 281
282 -def _calc_score(psl, is_protein):
283 # calculates score 284 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 285 size_mul = 3 if is_protein else 1 286 return size_mul * (psl['matches'] + (psl['repmatches'] >> 1)) - \ 287 size_mul * psl['mismatches'] - psl['qnuminsert'] - psl['tnuminsert']
288 289
290 -def _create_hsp(hid, qid, psl):
291 # protein flag 292 is_protein = _is_protein(psl) 293 # strand 294 # if query is protein, strand is 0 295 if is_protein: 296 qstrand = 0 297 else: 298 qstrand = 1 if psl['strand'][0] == '+' else -1 299 # try to get hit strand, if it exists 300 try: 301 hstrand = 1 if psl['strand'][1] == '+' else -1 302 except IndexError: 303 hstrand = 1 # hit strand defaults to plus 304 305 blocksize_multiplier = 3 if is_protein else 1 306 # query block starts 307 qstarts = _reorient_starts(psl['qstarts'], 308 psl['blocksizes'], psl['qsize'], qstrand) 309 # hit block starts 310 if len(psl['strand']) == 2: 311 hstarts = _reorient_starts(psl['tstarts'], 312 [blocksize_multiplier * i for i in psl['blocksizes']], 313 psl['tsize'], hstrand) 314 else: 315 hstarts = psl['tstarts'] 316 # set query and hit coords 317 # this assumes each block has no gaps (which seems to be the case) 318 assert len(qstarts) == len(hstarts) == len(psl['blocksizes']) 319 query_range_all = list(zip(qstarts, [x + y for x, y in 320 zip(qstarts, psl['blocksizes'])])) 321 hit_range_all = list(zip(hstarts, [x + y * blocksize_multiplier for x, y in 322 zip(hstarts, psl['blocksizes'])])) 323 # check length of sequences and coordinates, all must match 324 if 'tseqs' in psl and 'qseqs' in psl: 325 assert len(psl['tseqs']) == len(psl['qseqs']) == \ 326 len(query_range_all) == len(hit_range_all) 327 else: 328 assert len(query_range_all) == len(hit_range_all) 329 330 frags = [] 331 # iterating over query_range_all, but hit_range_all works just as well 332 for idx, qcoords in enumerate(query_range_all): 333 hseqlist = psl.get('tseqs') 334 hseq = '' if not hseqlist else hseqlist[idx] 335 qseqlist = psl.get('qseqs') 336 qseq = '' if not qseqlist else qseqlist[idx] 337 frag = HSPFragment(hid, qid, hit=hseq, query=qseq) 338 # set alphabet 339 frag.alphabet = generic_dna 340 # set coordinates 341 frag.query_start = qcoords[0] 342 frag.query_end = qcoords[1] 343 frag.hit_start = hit_range_all[idx][0] 344 frag.hit_end = hit_range_all[idx][1] 345 # and strands 346 frag.query_strand = qstrand 347 frag.hit_strand = hstrand 348 frags.append(frag) 349 350 # create hsp object 351 hsp = HSP(frags) 352 # check if start and end are set correctly 353 assert hsp.query_start == psl['qstart'] 354 assert hsp.query_end == psl['qend'] 355 assert hsp.hit_start == psl['tstart'] 356 assert hsp.hit_end == psl['tend'] 357 # and check block spans as well 358 hit_spans = [span / blocksize_multiplier for span in hsp.hit_span_all] 359 assert hit_spans == hsp.query_span_all == psl['blocksizes'] 360 # set its attributes 361 hsp.match_num = psl['matches'] 362 hsp.mismatch_num = psl['mismatches'] 363 hsp.match_rep_num = psl['repmatches'] 364 hsp.n_num = psl['ncount'] 365 hsp.query_gapopen_num = psl['qnuminsert'] 366 hsp.query_gap_num = psl['qbaseinsert'] 367 hsp.hit_gapopen_num = psl['tnuminsert'] 368 hsp.hit_gap_num = psl['tbaseinsert'] 369 370 hsp.ident_num = psl['matches'] + psl['repmatches'] 371 hsp.gapopen_num = psl['qnuminsert'] + psl['tnuminsert'] 372 hsp.gap_num = psl['qbaseinsert'] + psl['tbaseinsert'] 373 hsp.query_is_protein = is_protein 374 hsp.ident_pct = 100.0 - _calc_millibad(psl, is_protein) * 0.1 375 hsp.score = _calc_score(psl, is_protein) 376 # helper flag, for writing 377 hsp._has_hit_strand = len(psl['strand']) == 2 378 379 return hsp
380 381
382 -class BlatPslParser(object):
383 """Parser for the BLAT PSL format.""" 384
385 - def __init__(self, handle, pslx=False):
386 self.handle = handle 387 self.line = self.handle.readline() 388 self.pslx = pslx
389
390 - def __iter__(self):
391 # break out if it's an empty file 392 if not self.line: 393 raise StopIteration 394 395 # read through header 396 # this assumes that the result row match the regex 397 while not re.search(_RE_ROW_CHECK, self.line.strip()): 398 self.line = self.handle.readline() 399 if not self.line: 400 raise StopIteration 401 402 # parse into query results 403 for qresult in self._parse_qresult(): 404 qresult.program = 'blat' 405 yield qresult
406
407 - def _parse_row(self):
408 """Returns a dictionary of parsed column values.""" 409 assert self.line 410 cols = [x for x in self.line.strip().split('\t') if x] 411 self._validate_cols(cols) 412 413 psl = {} 414 psl['qname'] = cols[9] # qName 415 psl['qsize'] = int(cols[10]) # qSize 416 psl['tname'] = cols[13] # tName 417 psl['tsize'] = int(cols[14]) # tSize 418 psl['matches'] = int(cols[0]) # matches 419 psl['mismatches'] = int(cols[1]) # misMatches 420 psl['repmatches'] = int(cols[2]) # repMatches 421 psl['ncount'] = int(cols[3]) # nCount 422 psl['qnuminsert'] = int(cols[4]) # qNumInsert 423 psl['qbaseinsert'] = int(cols[5]) # qBaseInsert 424 psl['tnuminsert'] = int(cols[6]) # tNumInsert 425 psl['tbaseinsert'] = int(cols[7]) # tBaseInsert 426 psl['strand'] = cols[8] # strand 427 psl['qstart'] = int(cols[11]) # qStart 428 psl['qend'] = int(cols[12]) # qEnd 429 psl['tstart'] = int(cols[15]) # tStart 430 psl['tend'] = int(cols[16]) # tEnd 431 psl['blockcount'] = int(cols[17]) # blockCount 432 psl['blocksizes'] = _list_from_csv(cols[18], int) # blockSizes 433 psl['qstarts'] = _list_from_csv(cols[19], int) # qStarts 434 psl['tstarts'] = _list_from_csv(cols[20], int) # tStarts 435 if self.pslx: 436 psl['qseqs'] = _list_from_csv(cols[21]) # query sequence 437 psl['tseqs'] = _list_from_csv(cols[22]) # hit sequence 438 439 return psl
440
441 - def _validate_cols(self, cols):
442 if not self.pslx: 443 assert len(cols) == 21, "Invalid PSL line: %r. " \ 444 "Expected 21 tab-separated columns, found %i" % (self.line, len(cols)) 445 else: 446 assert len(cols) == 23, "Invalid PSLX line: %r. " \ 447 "Expected 23 tab-separated columns, found %i" % (self.line, len(cols))
448
449 - def _parse_qresult(self):
450 """Generator function that returns QueryResult objects.""" 451 # state values, determines what to do for each line 452 state_EOF = 0 453 state_QRES_NEW = 1 454 state_QRES_SAME = 3 455 state_HIT_NEW = 2 456 state_HIT_SAME = 4 457 # initial dummy values 458 qres_state = None 459 file_state = None 460 cur_qid, cur_hid = None, None 461 prev_qid, prev_hid = None, None 462 cur, prev = None, None 463 hit_list, hsp_list = [], [] 464 465 while True: 466 # store previous line's parsed values for all lines after the first 467 if cur is not None: 468 prev = cur 469 prev_qid = cur_qid 470 prev_hid = cur_hid 471 # only parse the result row if it's not EOF 472 if self.line: 473 cur = self._parse_row() 474 cur_qid = cur['qname'] 475 cur_hid = cur['tname'] 476 else: 477 file_state = state_EOF 478 # mock values, since we have nothing to parse 479 cur_qid, cur_hid = None, None 480 481 # get the state of hit and qresult 482 if prev_qid != cur_qid: 483 qres_state = state_QRES_NEW 484 else: 485 qres_state = state_QRES_SAME 486 # new hits are hits with different ids or hits in a new qresult 487 if prev_hid != cur_hid or qres_state == state_QRES_NEW: 488 hit_state = state_HIT_NEW 489 else: 490 hit_state = state_HIT_SAME 491 492 if prev is not None: 493 # create fragment and HSP and set their attributes 494 hsp = _create_hsp(prev_hid, prev_qid, prev) 495 hsp_list.append(hsp) 496 497 if hit_state == state_HIT_NEW: 498 # create Hit and set its attributes 499 hit = Hit(hsp_list) 500 hit.seq_len = prev['tsize'] 501 hit_list.append(hit) 502 hsp_list = [] 503 504 # create qresult and yield if we're at a new qresult or at EOF 505 if qres_state == state_QRES_NEW or file_state == state_EOF: 506 qresult = QueryResult(id=prev_qid) 507 for hit in hit_list: 508 qresult.absorb(hit) 509 qresult.seq_len = prev['qsize'] 510 yield qresult 511 # if we're at EOF, break 512 if file_state == state_EOF: 513 break 514 hit_list = [] 515 516 self.line = self.handle.readline()
517 518
519 -class BlatPslIndexer(SearchIndexer):
520 """Indexer class for BLAT PSL output.""" 521 522 _parser = BlatPslParser 523
524 - def __init__(self, filename, pslx=False):
525 SearchIndexer.__init__(self, filename, pslx=pslx)
526
527 - def __iter__(self):
528 """Iterates over the file handle; yields key, start offset, and length.""" 529 handle = self._handle 530 handle.seek(0) 531 # denotes column location for query identifier 532 query_id_idx = 9 533 qresult_key = None 534 tab_char = b"\t" 535 536 start_offset = handle.tell() 537 line = handle.readline() 538 # read through header 539 # this assumes that the result row match the regex 540 while not re.search(_RE_ROW_CHECK_IDX, line.strip()): 541 start_offset = handle.tell() 542 line = handle.readline() 543 if not line: 544 raise StopIteration 545 546 # and index the qresults 547 while True: 548 end_offset = handle.tell() 549 550 cols = [x for x in line.strip().split(tab_char) if x] 551 if qresult_key is None: 552 qresult_key = cols[query_id_idx] 553 else: 554 curr_key = cols[query_id_idx] 555 556 if curr_key != qresult_key: 557 yield _bytes_to_string(qresult_key), start_offset, \ 558 end_offset - start_offset 559 qresult_key = curr_key 560 start_offset = end_offset - len(line) 561 562 line = handle.readline() 563 if not line: 564 yield _bytes_to_string(qresult_key), start_offset, \ 565 end_offset - start_offset 566 break
567
568 - def get_raw(self, offset):
569 """Returns raw bytes string of a QueryResult object from the given offset.""" 570 handle = self._handle 571 handle.seek(offset) 572 query_id_idx = 9 573 qresult_key = None 574 qresult_raw = b"" 575 tab_char = b"\t" 576 577 while True: 578 line = handle.readline() 579 if not line: 580 break 581 cols = [x for x in line.strip().split(tab_char) if x] 582 if qresult_key is None: 583 qresult_key = cols[query_id_idx] 584 else: 585 curr_key = cols[query_id_idx] 586 if curr_key != qresult_key: 587 break 588 qresult_raw += line 589 590 return qresult_raw
591 592
593 -class BlatPslWriter(object):
594 """Writer for the blat-psl format.""" 595
596 - def __init__(self, handle, header=False, pslx=False):
597 self.handle = handle 598 # flag for writing header or not 599 self.header = header 600 self.pslx = pslx
601
602 - def write_file(self, qresults):
603 handle = self.handle 604 qresult_counter, hit_counter, hsp_counter, frag_counter = 0, 0, 0, 0 605 606 if self.header: 607 handle.write(self._build_header()) 608 609 for qresult in qresults: 610 if qresult: 611 handle.write(self._build_row(qresult)) 612 qresult_counter += 1 613 hit_counter += len(qresult) 614 hsp_counter += sum(len(hit) for hit in qresult) 615 frag_counter += sum(len(hit.fragments) for hit in qresult) 616 617 return qresult_counter, hit_counter, hsp_counter, frag_counter
618
619 - def _build_header(self):
620 # for now, always use the psLayout version 3 621 header = 'psLayout version 3\n' 622 623 # adapted from BLAT's source: lib/psl.c#L496 624 header += "\nmatch\tmis- \trep. \tN's\tQ gap\tQ gap\tT gap\tT " 625 "gap\tstrand\tQ \tQ \tQ \tQ \tT \tT \tT " 626 "\tT \tblock\tblockSizes \tqStarts\t tStarts\n " \ 627 "\tmatch\tmatch\t \tcount\tbases\tcount\tbases\t \tname " 628 "\tsize\tstart\tend\tname \tsize\tstart\tend\tcount" 629 "\n%s\n" % ('-' * 159) 630 631 return header
632
633 - def _build_row(self, qresult):
634 """Returns a string or one row or more of the QueryResult object.""" 635 # For now, our writer writes the row according to the order in 636 # the QueryResult and Hit objects. 637 # This is different from BLAT's native output, where the rows are 638 # grouped by strand. 639 # Should we tweak the behavior to better mimic the native output? 640 qresult_lines = [] 641 642 for hit in qresult: 643 for hsp in hit.hsps: 644 645 query_is_protein = getattr(hsp, "query_is_protein", False) 646 blocksize_multiplier = 3 if query_is_protein else 1 647 648 line = [] 649 line.append(hsp.match_num) 650 line.append(hsp.mismatch_num) 651 line.append(hsp.match_rep_num) 652 line.append(hsp.n_num) 653 line.append(hsp.query_gapopen_num) 654 line.append(hsp.query_gap_num) 655 line.append(hsp.hit_gapopen_num) 656 line.append(hsp.hit_gap_num) 657 658 # check spans 659 eff_query_spans = [blocksize_multiplier * s for s in hsp.query_span_all] 660 if hsp.hit_span_all != eff_query_spans: 661 raise ValueError("HSP hit span and query span values do not match.") 662 block_sizes = hsp.query_span_all 663 664 # set strand and starts 665 if hsp[0].query_strand >= 0: # since it may be a protein seq 666 strand = '+' 667 else: 668 strand = '-' 669 qstarts = _reorient_starts([x[0] for x in hsp.query_range_all], 670 hsp.query_span_all, qresult.seq_len, hsp[0].query_strand) 671 672 if hsp[0].hit_strand == 1: 673 hstrand = 1 674 # only write hit strand if it was present in the source file 675 if hsp._has_hit_strand: 676 strand += '+' 677 else: 678 hstrand = -1 679 strand += '-' 680 hstarts = _reorient_starts([x[0] for x in hsp.hit_range_all], 681 hsp.hit_span_all, hit.seq_len, hstrand) 682 683 line.append(strand) 684 line.append(qresult.id) 685 line.append(qresult.seq_len) 686 line.append(hsp.query_start) 687 line.append(hsp.query_end) 688 line.append(hit.id) 689 line.append(hit.seq_len) 690 line.append(hsp.hit_start) 691 line.append(hsp.hit_end) 692 line.append(len(hsp)) 693 line.append(','.join((str(x) for x in block_sizes)) + ',') 694 line.append(','.join((str(x) for x in qstarts)) + ',') 695 line.append(','.join((str(x) for x in hstarts)) + ',') 696 697 if self.pslx: 698 line.append(','.join((str(x.seq) for x in hsp.query_all)) + ',') 699 line.append(','.join((str(x.seq) for x in hsp.hit_all)) + ',') 700 701 qresult_lines.append('\t'.join((str(x) for x in line))) 702 703 return '\n'.join(qresult_lines) + '\n'
704 705 706 # if not used as a module, run the doctest 707 if __name__ == "__main__": 708 from Bio._utils import run_doctest 709 run_doctest() 710