Package Bio :: Package SearchIO :: Module BlatIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.BlatIO

  1  # Copyright 2012 by Wibowo Arindrarto.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Bio.SearchIO parser for BLAT output formats. 
  7   
  8  This module adds support for parsing BLAT outputs. BLAT (BLAST-Like Alignment 
  9  Tool) is a sequence similarity search program initially built for annotating 
 10  the human genome. 
 11   
 12  Bio.SearchIO.BlastIO was tested using standalone BLAT version 34, psLayout 
 13  version 3. It should be able to parse psLayout version 4 without problems. 
 14   
 15  More information on BLAT is available from these sites: 
 16   
 17      - Publication: http://genome.cshlp.org/content/12/4/656 
 18      - User guide: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 
 19      - Source download: http://www.soe.ucsc.edu/~kent/src 
 20      - Executable download: http://hgdownload.cse.ucsc.edu/admin/exe/ 
 21      - Blat score calculation: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 
 22   
 23   
 24  Supported Formats 
 25  ================= 
 26   
 27  BlatIO supports parsing, indexing, and writing for both PSL and PSLX output 
 28  formats, with or without header. To parse, index, or write PSLX files, use the 
 29  'pslx' keyword argument and set it to True. 
 30   
 31      # blat-psl defaults to PSL files 
 32      >>> from Bio import SearchIO 
 33      >>> psl = 'Blat/psl_34_004.psl' 
 34      >>> qresult = SearchIO.read(psl, 'blat-psl') 
 35      >>> qresult 
 36      QueryResult(id='hg19_dna', 10 hits) 
 37   
 38      # set the pslx flag to parse PSLX files 
 39      >>> pslx = 'Blat/pslx_34_004.pslx' 
 40      >>> qresult = SearchIO.read(pslx, 'blat-psl', pslx=True) 
 41      >>> qresult 
 42      QueryResult(id='hg19_dna', 10 hits) 
 43   
 44  For parsing and indexing, you do not need to specify whether the file has a 
 45  header or not. For writing, if you want to write a header, you can set the 
 46  'header' keyword argument to True. This will write a 'psLayout version 3' header 
 47  to your output file. 
 48   
 49      from Bio import SearchIO 
 50      qresult = SearchIO.read(psl, 'blat-psl') 
 51      SearchIO.write(qresult, 'header.psl', header=True) 
 52      <stdout> (1, 10, 19, 23) 
 53   
 54  Note that the number of HSPFragments written may exceed the number of HSP 
 55  objects. This is because in PSL files, it is possible to have single matches 
 56  consisting of noncontiguous sequence fragments. This is where the HSPFragment 
 57  object comes into play. These fragments are grouped into a single HSP because 
 58  they share the same statistics (e.g. match numbers, BLAT score, etc.). However, 
 59  they do not share the same sequence attributes, such as the start and end 
 60  coordinates, making them distinct objects. 
 61   
 62  In addition to parsing PSL(X) files, BlatIO also computes the percent identities 
 63  and scores of your search results. This is done using the calculation formula 
 64  posted here: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4. It mimics the score 
 65  and percent identity calculation done by UCSC's web BLAT service. 
 66   
 67  Since BlatIO parses the file in a single pass, it expects all results from 
 68  the same query to be in consecutive rows. If the results from one query are 
 69  spread in nonconsecutive rows, BlatIO will consider them to be separate 
 70  QueryResult objects. 
 71   
 72  In most cases, the PSL(X) format uses the same coordinate system as Python 
 73  (zero-based, half open). These coordinates are anchored on the plus strand. 
 74  However, if the query aligns on the minus strand, BLAT will anchor the qStarts 
 75  coordinates on the minus strand instead. BlatIO is aware of this, and will 
 76  re-anchor the qStarts coordinates to the plus strand whenever it sees a minus 
 77  strand query match. Conversely, when you write out to a PSL(X) file, BlatIO will 
 78  reanchor qStarts to the minus strand again. 
 79   
 80  BlatIO provides the following attribute-column mapping: 
 81   
 82  +----------------+-------------------------+-----------------------------------+ 
 83  | Object         | Attribute               | Column Name, Value                | 
 84  +================+=========================+===================================+ 
 85  | QueryResutl    | id                      | Q name, query sequence ID         | 
 86  |                +-------------------------+-----------------------------------+ 
 87  |                | seq_len                 | Q size, query sequence full       | 
 88  |                |                         | length                            | 
 89  +----------------+-------------------------+-----------------------------------+ 
 90  | Hit            | id                      | T name, hit sequence ID           | 
 91  |                +-------------------------+-----------------------------------+ 
 92  |                | seq_len                 | T size, hit sequence full length  | 
 93  +----------------+-------------------------+-----------------------------------+ 
 94  | HSP            | hit_end                 | T end, end coordinate of the last | 
 95  |                |                         | hit fragment                      | 
 96  |                +-------------------------+-----------------------------------+ 
 97  |                | hit_gap_num             | T gap bases, number of bases      | 
 98  |                |                         | inserted in hit                   | 
 99  |                +-------------------------+-----------------------------------+ 
100  |                | hit_gapopen_num         | T gap count, number of hit gap    | 
101  |                |                         | inserts                           | 
102  |                +-------------------------+-----------------------------------+ 
103  |                | hit_span_all            | blockSizes, sizes of each         | 
104  |                |                         | fragment                          | 
105  |                +-------------------------+-----------------------------------+ 
106  |                | hit_start               | T start, start coordinate of the  | 
107  |                |                         | first hit fragment                | 
108  |                +-------------------------+-----------------------------------+ 
109  |                | hit_start_all           | tStarts, start coordinate of each | 
110  |                |                         | hit fragment                      | 
111  |                +-------------------------+-----------------------------------+ 
112  |                | match_num               | match, number of non-repeat       | 
113  |                |                         | matches                           | 
114  |                +-------------------------+-----------------------------------+ 
115  |                | mismatch_num            | mismatch, number of mismatches    | 
116  |                +-------------------------+-----------------------------------+ 
117  |                | match_rep_num           | rep. match, number of matches     | 
118  |                |                         | that are part of repeats          | 
119  |                +-------------------------+-----------------------------------+ 
120  |                | n_num                   | N's, number of N bases            | 
121  |                +-------------------------+-----------------------------------+ 
122  |                | query_end               | Q end, end coordinate of the last | 
123  |                +-------------------------+-----------------------------------+ 
124  |                |                         | query fragment                    | 
125  |                | query_gap_num           | Q gap bases, number of bases      | 
126  |                |                         | inserted in query                 | 
127  |                +-------------------------+-----------------------------------+ 
128  |                | query_gapopen_num       | Q gap count, number of query gap  | 
129  |                |                         | inserts                           | 
130  |                +-------------------------+-----------------------------------+ 
131  |                | query_span_all          | blockSizes, sizes of each         | 
132  |                |                         | fragment                          | 
133  |                +-------------------------+-----------------------------------+ 
134  |                | query_start             | Q start, start coordinate of the  | 
135  |                |                         | first query block                 | 
136  |                +-------------------------+-----------------------------------+ 
137  |                | query_start_all         | qStarts, start coordinate of each | 
138  |                |                         | query fragment                    | 
139  |                +-------------------------+-----------------------------------+ 
140  |                | len [1]                 | block count, the number of blocks | 
141  |                |                         | in the alignment                  | 
142  +----------------+-------------------------+-----------------------------------+ 
143  | HSPFragment    | hit                     | hit sequence, if present          | 
144  |                +-------------------------+-----------------------------------+ 
145  |                | hit_strand              | strand, hit sequence strand       | 
146  |                +-------------------------+-----------------------------------+ 
147  |                | query                   | query sequence, if present        | 
148  |                +-------------------------+-----------------------------------+ 
149  |                | query_strand            | strand, query sequence strand     | 
150  +----------------+-------------------------+-----------------------------------+ 
151   
152  In addition to the column mappings above, BlatIO also provides the following 
153  object attributes: 
154   
155  +----------------+-------------------------+-----------------------------------+ 
156  | Object         | Attribute               | Value                             | 
157  +================+=========================+===================================+ 
158  | HSP            | gapopen_num             | Q gap count + T gap count, total  | 
159  |                |                         |  number of gap openings           | 
160  |                +-------------------------+-----------------------------------+ 
161  |                | ident_num               | matches + repmatches, total       | 
162  |                |                         | number of identical residues      | 
163  |                +-------------------------+-----------------------------------+ 
164  |                | ident_pct               | percent identity, calculated      | 
165  |                |                         | using UCSC's formula              | 
166  |                +-------------------------+-----------------------------------+ 
167  |                | query_is_protein        | boolean, whether the query        | 
168  |                |                         | sequence is a protein             | 
169  |                +-------------------------+-----------------------------------+ 
170  |                | score                   | HSP score, calculated using       | 
171  |                |                         | UCSC's formula                    | 
172  +----------------+-------------------------+-----------------------------------+ 
173   
174  Finally, the default HSP and HSPFragment properties are also provided. See the 
175  HSP and HSPFragment documentation for more details on these properties. 
176   
177   
178  .. [1] You can obtain the number of blocks / fragments in the HSP by invoking 
179     ``len`` on the HSP 
180   
181  """ 
182  import re 
183  from math import log 
184   
185  from Bio._py3k import _as_bytes, _bytes_to_string 
186  from Bio._py3k import zip 
187   
188  from Bio.Alphabet import generic_dna 
189  from Bio.SearchIO._index import SearchIndexer 
190  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
191   
192   
193  __all__ = ['BlatPslParser', 'BlatPslIndexer', 'BlatPslWriter'] 
194   
195   
196  # precompile regex patterns 
197  _PTR_ROW_CHECK = r'^\d+\s+\d+\s+\d+\s+\d+' 
198  _RE_ROW_CHECK = re.compile(_PTR_ROW_CHECK) 
199  _RE_ROW_CHECK_IDX = re.compile(_as_bytes(_PTR_ROW_CHECK)) 
200   
201   
202 -def _list_from_csv(csv_string, caster=None):
203 """Transforms the given comma-separated string into a list. 204 205 :param csv_string: comma-separated input string 206 :type csv_string: string 207 :param caster: function used to cast each item in the input string 208 to its intended type 209 :type caster: callable, accepts string, returns object 210 211 """ 212 if caster is None: 213 return [x for x in csv_string.split(',') if x] 214 else: 215 return [caster(x) for x in csv_string.split(',') if x]
216 217
218 -def _reorient_starts(starts, blksizes, seqlen, strand):
219 """Reorients block starts into the opposite strand's coordinates. 220 221 :param starts: start coordinates 222 :type starts: list [int] 223 :param blksizes: block sizes 224 :type blksizes: list [int] 225 :param seqlen: sequence length 226 :type seqlen: int 227 :param strand: sequence strand 228 :type strand: int, choice of -1, 0, or 1 229 230 """ 231 assert len(starts) == len(blksizes), \ 232 "Unequal start coordinates and block sizes list (%r vs %r)" \ 233 % (len(starts), len(blksizes)) 234 # see: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 235 # no need to reorient if it's already the positive strand 236 if strand >= 0: 237 return starts 238 else: 239 # the plus-oriented coordinate is calculated by this: 240 # plus_coord = length - minus_coord - block_size 241 return [seqlen - start - blksize for 242 start, blksize in zip(starts, blksizes)]
243 244
245 -def _is_protein(psl):
246 # check if query is protein or not 247 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 248 if len(psl['strand']) == 2: 249 if psl['strand'][1] == '+': 250 return psl['tend'] == psl['tstarts'][-1] + \ 251 3 * psl['blocksizes'][-1] 252 elif psl['strand'][1] == '-': 253 return psl['tstart'] == psl['tsize'] - \ 254 (psl['tstarts'][-1] + 3 * psl['blocksizes'][-1]) 255 256 return False
257 258
259 -def _calc_millibad(psl, is_protein):
260 # calculates millibad 261 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 262 size_mul = 3 if is_protein else 1 263 millibad = 0 264 265 qali_size = size_mul * (psl['qend'] - psl['qstart']) 266 tali_size = psl['tend'] - psl['tstart'] 267 ali_size = min(qali_size, tali_size) 268 if ali_size <= 0: 269 return 0 270 271 size_dif = qali_size - tali_size 272 size_dif = 0 if size_dif < 0 else size_dif 273 274 total = size_mul * (psl['matches'] + psl['repmatches'] + psl['mismatches']) 275 if total != 0: 276 millibad = (1000 * (psl['mismatches'] * size_mul + psl['qnuminsert'] + 277 round(3 * log(1 + size_dif)))) / total 278 279 return millibad
280 281
282 -def _calc_score(psl, is_protein):
283 # calculates score 284 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 285 size_mul = 3 if is_protein else 1 286 return size_mul * (psl['matches'] + (psl['repmatches'] >> 1)) - \ 287 size_mul * psl['mismatches'] - psl['qnuminsert'] - psl['tnuminsert']
288 289
290 -def _create_hsp(hid, qid, psl):
291 # protein flag 292 is_protein = _is_protein(psl) 293 # strand 294 # if query is protein, strand is 0 295 if is_protein: 296 qstrand = 0 297 else: 298 qstrand = 1 if psl['strand'][0] == '+' else -1 299 # try to get hit strand, if it exists 300 try: 301 hstrand = 1 if psl['strand'][1] == '+' else -1 302 except IndexError: 303 hstrand = 1 # hit strand defaults to plus 304 305 # query block starts 306 qstarts = _reorient_starts(psl['qstarts'], 307 psl['blocksizes'], psl['qsize'], qstrand) 308 # hit block starts 309 if len(psl['strand']) == 2: 310 hstarts = _reorient_starts(psl['tstarts'], 311 psl['blocksizes'], psl['tsize'], hstrand) 312 else: 313 hstarts = psl['tstarts'] 314 # set query and hit coords 315 # this assumes each block has no gaps (which seems to be the case) 316 assert len(qstarts) == len(hstarts) == len(psl['blocksizes']) 317 query_range_all = list(zip(qstarts, [x + y for x, y in 318 zip(qstarts, psl['blocksizes'])])) 319 hit_range_all = list(zip(hstarts, [x + y for x, y in 320 zip(hstarts, psl['blocksizes'])])) 321 # check length of sequences and coordinates, all must match 322 if 'tseqs' in psl and 'qseqs' in psl: 323 assert len(psl['tseqs']) == len(psl['qseqs']) == \ 324 len(query_range_all) == len(hit_range_all) 325 else: 326 assert len(query_range_all) == len(hit_range_all) 327 328 frags = [] 329 # iterating over query_range_all, but hit_range_all works just as well 330 for idx, qcoords in enumerate(query_range_all): 331 hseqlist = psl.get('tseqs') 332 hseq = '' if not hseqlist else hseqlist[idx] 333 qseqlist = psl.get('qseqs') 334 qseq = '' if not qseqlist else qseqlist[idx] 335 frag = HSPFragment(hid, qid, hit=hseq, query=qseq) 336 # set alphabet 337 frag.alphabet = generic_dna 338 # set coordinates 339 frag.query_start = qcoords[0] 340 frag.query_end = qcoords[1] 341 frag.hit_start = hit_range_all[idx][0] 342 frag.hit_end = hit_range_all[idx][1] 343 # and strands 344 frag.query_strand = qstrand 345 frag.hit_strand = hstrand 346 frags.append(frag) 347 348 # create hsp object 349 hsp = HSP(frags) 350 # check if start and end are set correctly 351 assert hsp.query_start == psl['qstart'] 352 assert hsp.query_end == psl['qend'] 353 assert hsp.hit_start == psl['tstart'] 354 assert hsp.hit_end == psl['tend'] 355 # and check block spans as well 356 assert hsp.query_span_all == hsp.hit_span_all == psl['blocksizes'] 357 # set its attributes 358 hsp.match_num = psl['matches'] 359 hsp.mismatch_num = psl['mismatches'] 360 hsp.match_rep_num = psl['repmatches'] 361 hsp.n_num = psl['ncount'] 362 hsp.query_gapopen_num = psl['qnuminsert'] 363 hsp.query_gap_num = psl['qbaseinsert'] 364 hsp.hit_gapopen_num = psl['tnuminsert'] 365 hsp.hit_gap_num = psl['tbaseinsert'] 366 367 hsp.ident_num = psl['matches'] + psl['repmatches'] 368 hsp.gapopen_num = psl['qnuminsert'] + psl['tnuminsert'] 369 hsp.gap_num = psl['qbaseinsert'] + psl['tbaseinsert'] 370 hsp.query_is_protein = is_protein 371 hsp.ident_pct = 100.0 - _calc_millibad(psl, is_protein) * 0.1 372 hsp.score = _calc_score(psl, is_protein) 373 # helper flag, for writing 374 hsp._has_hit_strand = len(psl['strand']) == 2 375 376 return hsp
377 378
379 -class BlatPslParser(object):
380 381 """Parser for the BLAT PSL format.""" 382
383 - def __init__(self, handle, pslx=False):
384 self.handle = handle 385 self.line = self.handle.readline() 386 self.pslx = pslx
387
388 - def __iter__(self):
389 # break out if it's an empty file 390 if not self.line: 391 raise StopIteration 392 393 # read through header 394 # this assumes that the result row match the regex 395 while not re.search(_RE_ROW_CHECK, self.line.strip()): 396 self.line = self.handle.readline() 397 if not self.line: 398 raise StopIteration 399 400 # parse into query results 401 for qresult in self._parse_qresult(): 402 qresult.program = 'blat' 403 yield qresult
404
405 - def _parse_row(self):
406 """Returns a dictionary of parsed column values.""" 407 assert self.line 408 cols = [x for x in self.line.strip().split('\t') if x] 409 self._validate_cols(cols) 410 411 psl = {} 412 psl['qname'] = cols[9] # qName 413 psl['qsize'] = int(cols[10]) # qSize 414 psl['tname'] = cols[13] # tName 415 psl['tsize'] = int(cols[14]) # tSize 416 psl['matches'] = int(cols[0]) # matches 417 psl['mismatches'] = int(cols[1]) # misMatches 418 psl['repmatches'] = int(cols[2]) # repMatches 419 psl['ncount'] = int(cols[3]) # nCount 420 psl['qnuminsert'] = int(cols[4]) # qNumInsert 421 psl['qbaseinsert'] = int(cols[5]) # qBaseInsert 422 psl['tnuminsert'] = int(cols[6]) # tNumInsert 423 psl['tbaseinsert'] = int(cols[7]) # tBaseInsert 424 psl['strand'] = cols[8] # strand 425 psl['qstart'] = int(cols[11]) # qStart 426 psl['qend'] = int(cols[12]) # qEnd 427 psl['tstart'] = int(cols[15]) # tStart 428 psl['tend'] = int(cols[16]) # tEnd 429 psl['blockcount'] = int(cols[17]) # blockCount 430 psl['blocksizes'] = _list_from_csv(cols[18], int) # blockSizes 431 psl['qstarts'] = _list_from_csv(cols[19], int) # qStarts 432 psl['tstarts'] = _list_from_csv(cols[20], int) # tStarts 433 if self.pslx: 434 psl['qseqs'] = _list_from_csv(cols[21]) # query sequence 435 psl['tseqs'] = _list_from_csv(cols[22]) # hit sequence 436 437 return psl
438
439 - def _validate_cols(self, cols):
440 if not self.pslx: 441 assert len(cols) == 21, "Invalid PSL line: %r. " \ 442 "Expected 21 tab-separated columns, found %i" % (self.line, len(cols)) 443 else: 444 assert len(cols) == 23, "Invalid PSLX line: %r. " \ 445 "Expected 23 tab-separated columns, found %i" % (self.line, len(cols))
446
447 - def _parse_qresult(self):
448 """Generator function that returns QueryResult objects.""" 449 # state values, determines what to do for each line 450 state_EOF = 0 451 state_QRES_NEW = 1 452 state_QRES_SAME = 3 453 state_HIT_NEW = 2 454 state_HIT_SAME = 4 455 # initial dummy values 456 qres_state = None 457 file_state = None 458 cur_qid, cur_hid = None, None 459 prev_qid, prev_hid = None, None 460 cur, prev = None, None 461 hit_list, hsp_list = [], [] 462 463 while True: 464 # store previous line's parsed values for all lines after the first 465 if cur is not None: 466 prev = cur 467 prev_qid = cur_qid 468 prev_hid = cur_hid 469 # only parse the result row if it's not EOF 470 if self.line: 471 cur = self._parse_row() 472 cur_qid = cur['qname'] 473 cur_hid = cur['tname'] 474 else: 475 file_state = state_EOF 476 # mock values, since we have nothing to parse 477 cur_qid, cur_hid = None, None 478 479 # get the state of hit and qresult 480 if prev_qid != cur_qid: 481 qres_state = state_QRES_NEW 482 else: 483 qres_state = state_QRES_SAME 484 # new hits are hits with different ids or hits in a new qresult 485 if prev_hid != cur_hid or qres_state == state_QRES_NEW: 486 hit_state = state_HIT_NEW 487 else: 488 hit_state = state_HIT_SAME 489 490 if prev is not None: 491 # create fragment and HSP and set their attributes 492 hsp = _create_hsp(prev_hid, prev_qid, prev) 493 hsp_list.append(hsp) 494 495 if hit_state == state_HIT_NEW: 496 # create Hit and set its attributes 497 hit = Hit(hsp_list) 498 hit.seq_len = prev['tsize'] 499 hit_list.append(hit) 500 hsp_list = [] 501 502 # create qresult and yield if we're at a new qresult or at EOF 503 if qres_state == state_QRES_NEW or file_state == state_EOF: 504 qresult = QueryResult(id=prev_qid) 505 for hit in hit_list: 506 qresult.absorb(hit) 507 qresult.seq_len = prev['qsize'] 508 yield qresult 509 # if we're at EOF, break 510 if file_state == state_EOF: 511 break 512 hit_list = [] 513 514 self.line = self.handle.readline()
515 516
517 -class BlatPslIndexer(SearchIndexer):
518 519 """Indexer class for BLAT PSL output.""" 520 521 _parser = BlatPslParser 522
523 - def __init__(self, filename, pslx=False):
524 SearchIndexer.__init__(self, filename, pslx=pslx)
525
526 - def __iter__(self):
527 """Iterates over the file handle; yields key, start offset, and length.""" 528 handle = self._handle 529 handle.seek(0) 530 # denotes column location for query identifier 531 query_id_idx = 9 532 qresult_key = None 533 tab_char = _as_bytes('\t') 534 535 start_offset = handle.tell() 536 line = handle.readline() 537 # read through header 538 # this assumes that the result row match the regex 539 while not re.search(_RE_ROW_CHECK_IDX, line.strip()): 540 start_offset = handle.tell() 541 line = handle.readline() 542 if not line: 543 raise StopIteration 544 545 # and index the qresults 546 while True: 547 end_offset = handle.tell() 548 549 cols = [x for x in line.strip().split(tab_char) if x] 550 if qresult_key is None: 551 qresult_key = cols[query_id_idx] 552 else: 553 curr_key = cols[query_id_idx] 554 555 if curr_key != qresult_key: 556 yield _bytes_to_string(qresult_key), start_offset, \ 557 end_offset - start_offset 558 qresult_key = curr_key 559 start_offset = end_offset - len(line) 560 561 line = handle.readline() 562 if not line: 563 yield _bytes_to_string(qresult_key), start_offset, \ 564 end_offset - start_offset 565 break
566
567 - def get_raw(self, offset):
568 """Returns raw bytes string of a QueryResult object from the given offset.""" 569 handle = self._handle 570 handle.seek(offset) 571 query_id_idx = 9 572 qresult_key = None 573 qresult_raw = _as_bytes('') 574 tab_char = _as_bytes('\t') 575 576 while True: 577 line = handle.readline() 578 if not line: 579 break 580 cols = [x for x in line.strip().split(tab_char) if x] 581 if qresult_key is None: 582 qresult_key = cols[query_id_idx] 583 else: 584 curr_key = cols[query_id_idx] 585 if curr_key != qresult_key: 586 break 587 qresult_raw += line 588 589 return qresult_raw
590 591
592 -class BlatPslWriter(object):
593 594 """Writer for the blat-psl format.""" 595
596 - def __init__(self, handle, header=False, pslx=False):
597 self.handle = handle 598 # flag for writing header or not 599 self.header = header 600 self.pslx = pslx
601
602 - def write_file(self, qresults):
603 handle = self.handle 604 qresult_counter, hit_counter, hsp_counter, frag_counter = 0, 0, 0, 0 605 606 if self.header: 607 handle.write(self._build_header()) 608 609 for qresult in qresults: 610 if qresult: 611 handle.write(self._build_row(qresult)) 612 qresult_counter += 1 613 hit_counter += len(qresult) 614 hsp_counter += sum(len(hit) for hit in qresult) 615 frag_counter += sum(len(hit.fragments) for hit in qresult) 616 617 return qresult_counter, hit_counter, hsp_counter, frag_counter
618
619 - def _build_header(self):
620 # for now, always use the psLayout version 3 621 header = 'psLayout version 3\n' 622 623 # adapted from BLAT's source: lib/psl.c#L496 624 header += "\nmatch\tmis- \trep. \tN's\tQ gap\tQ gap\tT gap\tT " 625 "gap\tstrand\tQ \tQ \tQ \tQ \tT \tT \tT " 626 "\tT \tblock\tblockSizes \tqStarts\t tStarts\n " \ 627 "\tmatch\tmatch\t \tcount\tbases\tcount\tbases\t \tname " 628 "\tsize\tstart\tend\tname \tsize\tstart\tend\tcount" 629 "\n%s\n" % ('-' * 159) 630 631 return header
632
633 - def _build_row(self, qresult):
634 """Returns a string or one row or more of the QueryResult object.""" 635 # For now, our writer writes the row according to the order in 636 # the QueryResult and Hit objects. 637 # This is different from BLAT's native output, where the rows are 638 # grouped by strand. 639 # Should we tweak the behavior to better mimic the native output? 640 qresult_lines = [] 641 642 for hit in qresult: 643 for hsp in hit.hsps: 644 645 line = [] 646 line.append(hsp.match_num) 647 line.append(hsp.mismatch_num) 648 line.append(hsp.match_rep_num) 649 line.append(hsp.n_num) 650 line.append(hsp.query_gapopen_num) 651 line.append(hsp.query_gap_num) 652 line.append(hsp.hit_gapopen_num) 653 line.append(hsp.hit_gap_num) 654 655 # check spans 656 assert hsp.query_span_all == hsp.hit_span_all 657 block_sizes = hsp.query_span_all 658 659 # set strand and starts 660 if hsp[0].query_strand >= 0: # since it may be a protein seq 661 strand = '+' 662 else: 663 strand = '-' 664 qstarts = _reorient_starts([x[0] for x in hsp.query_range_all], 665 hsp.query_span_all, qresult.seq_len, hsp[0].query_strand) 666 667 if hsp[0].hit_strand == 1: 668 hstrand = 1 669 # only write hit strand if it was present in the source file 670 if hsp._has_hit_strand: 671 strand += '+' 672 else: 673 hstrand = -1 674 strand += '-' 675 hstarts = _reorient_starts([x[0] for x in hsp.hit_range_all], 676 hsp.hit_span_all, hit.seq_len, hstrand) 677 678 line.append(strand) 679 line.append(qresult.id) 680 line.append(qresult.seq_len) 681 line.append(hsp.query_start) 682 line.append(hsp.query_end) 683 line.append(hit.id) 684 line.append(hit.seq_len) 685 line.append(hsp.hit_start) 686 line.append(hsp.hit_end) 687 line.append(len(hsp)) 688 line.append(','.join((str(x) for x in block_sizes)) + ',') 689 line.append(','.join((str(x) for x in qstarts)) + ',') 690 line.append(','.join((str(x) for x in hstarts)) + ',') 691 692 if self.pslx: 693 line.append(','.join((str(x.seq) for x in hsp.query_all)) + ',') 694 line.append(','.join((str(x.seq) for x in hsp.hit_all)) + ',') 695 696 qresult_lines.append('\t'.join((str(x) for x in line))) 697 698 return '\n'.join(qresult_lines) + '\n'
699 700 701 # if not used as a module, run the doctest 702 if __name__ == "__main__": 703 from Bio._utils import run_doctest 704 run_doctest() 705