Package Bio :: Package SearchIO :: Module BlatIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.BlatIO

  1  # Copyright 2012 by Wibowo Arindrarto.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Bio.SearchIO parser for BLAT output formats. 
  7   
  8  This module adds support for parsing BLAT outputs. BLAT (BLAST-Like Alignment 
  9  Tool) is a sequence similarity search program initially built for annotating 
 10  the human genome. 
 11   
 12  Bio.SearchIO.BlastIO was tested using standalone BLAT version 34, psLayout 
 13  version 3. It should be able to parse psLayout version 4 without problems. 
 14   
 15  More information on BLAT is available from these sites: 
 16   
 17      - Publication: http://genome.cshlp.org/content/12/4/656 
 18      - User guide: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 
 19      - Source download: http://www.soe.ucsc.edu/~kent/src 
 20      - Executable download: http://hgdownload.cse.ucsc.edu/admin/exe/ 
 21      - Blat score calculation: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 
 22   
 23   
 24  Supported Formats 
 25  ================= 
 26   
 27  BlatIO supports parsing, indexing, and writing for both PSL and PSLX output 
 28  formats, with or without header. To parse, index, or write PSLX files, use the 
 29  'pslx' keyword argument and set it to True. 
 30   
 31      # blat-psl defaults to PSL files 
 32      >>> from Bio import SearchIO 
 33      >>> psl = 'Blat/psl_34_004.psl' 
 34      >>> qresult = SearchIO.read(psl, 'blat-psl') 
 35      >>> qresult 
 36      QueryResult(id='hg19_dna', 10 hits) 
 37   
 38      # set the pslx flag to parse PSLX files 
 39      >>> pslx = 'Blat/pslx_34_004.pslx' 
 40      >>> qresult = SearchIO.read(pslx, 'blat-psl', pslx=True) 
 41      >>> qresult 
 42      QueryResult(id='hg19_dna', 10 hits) 
 43   
 44  For parsing and indexing, you do not need to specify whether the file has a 
 45  header or not. For writing, if you want to write a header, you can set the 
 46  'header' keyword argument to True. This will write a 'psLayout version 3' header 
 47  to your output file. 
 48   
 49      from Bio import SearchIO 
 50      qresult = SearchIO.read(psl, 'blat-psl') 
 51      SearchIO.write(qresult, 'header.psl', header=True) 
 52      <stdout> (1, 10, 19, 23) 
 53   
 54  Note that the number of HSPFragments written may exceed the number of HSP 
 55  objects. This is because in PSL files, it is possible to have single matches 
 56  consisting of noncontiguous sequence fragments. This is where the HSPFragment 
 57  object comes into play. These fragments are grouped into a single HSP because 
 58  they share the same statistics (e.g. match numbers, BLAT score, etc.). However, 
 59  they do not share the same sequence attributes, such as the start and end 
 60  coordinates, making them distinct objects. 
 61   
 62  In addition to parsing PSL(X) files, BlatIO also computes the percent identities 
 63  and scores of your search results. This is done using the calculation formula 
 64  posted here: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4. It mimics the score 
 65  and percent identity calculation done by UCSC's web BLAT service. 
 66   
 67  Since BlatIO parses the file in a single pass, it expects all results from 
 68  the same query to be in consecutive rows. If the results from one query are 
 69  spread in nonconsecutive rows, BlatIO will consider them to be separate 
 70  QueryResult objects. 
 71   
 72  In most cases, the PSL(X) format uses the same coordinate system as Python 
 73  (zero-based, half open). These coordinates are anchored on the plus strand. 
 74  However, if the query aligns on the minus strand, BLAT will anchor the qStarts 
 75  coordinates on the minus strand instead. BlatIO is aware of this, and will 
 76  re-anchor the qStarts coordinates to the plus strand whenever it sees a minus 
 77  strand query match. Conversely, when you write out to a PSL(X) file, BlatIO will 
 78  reanchor qStarts to the minus strand again. 
 79   
 80  BlatIO provides the following attribute-column mapping: 
 81   
 82  +----------------+-------------------------+-----------------------------------+ 
 83  | Object         | Attribute               | Column Name, Value                | 
 84  +================+=========================+===================================+ 
 85  | QueryResutl    | id                      | Q name, query sequence ID         | 
 86  |                +-------------------------+-----------------------------------+ 
 87  |                | seq_len                 | Q size, query sequence full       | 
 88  |                |                         | length                            | 
 89  +----------------+-------------------------+-----------------------------------+ 
 90  | Hit            | id                      | T name, hit sequence ID           | 
 91  |                +-------------------------+-----------------------------------+ 
 92  |                | seq_len                 | T size, hit sequence full length  | 
 93  +----------------+-------------------------+-----------------------------------+ 
 94  | HSP            | hit_end                 | T end, end coordinate of the last | 
 95  |                |                         | hit fragment                      | 
 96  |                +-------------------------+-----------------------------------+ 
 97  |                | hit_gap_num             | T gap bases, number of bases      | 
 98  |                |                         | inserted in hit                   | 
 99  |                +-------------------------+-----------------------------------+ 
100  |                | hit_gapopen_num         | T gap count, number of hit gap    | 
101  |                |                         | inserts                           | 
102  |                +-------------------------+-----------------------------------+ 
103  |                | hit_span_all            | blockSizes, sizes of each         | 
104  |                |                         | fragment                          | 
105  |                +-------------------------+-----------------------------------+ 
106  |                | hit_start               | T start, start coordinate of the  | 
107  |                |                         | first hit fragment                | 
108  |                +-------------------------+-----------------------------------+ 
109  |                | hit_start_all           | tStarts, start coordinate of each | 
110  |                |                         | hit fragment                      | 
111  |                +-------------------------+-----------------------------------+ 
112  |                | match_num               | match, number of non-repeat       | 
113  |                |                         | matches                           | 
114  |                +-------------------------+-----------------------------------+ 
115  |                | mismatch_num            | mismatch, number of mismatches    | 
116  |                +-------------------------+-----------------------------------+ 
117  |                | match_rep_num           | rep. match, number of matches     | 
118  |                |                         | that are part of repeats          | 
119  |                +-------------------------+-----------------------------------+ 
120  |                | n_num                   | N's, number of N bases            | 
121  |                +-------------------------+-----------------------------------+ 
122  |                | query_end               | Q end, end coordinate of the last | 
123  |                +-------------------------+-----------------------------------+ 
124  |                |                         | query fragment                    | 
125  |                | query_gap_num           | Q gap bases, number of bases      | 
126  |                |                         | inserted in query                 | 
127  |                +-------------------------+-----------------------------------+ 
128  |                | query_gapopen_num       | Q gap count, number of query gap  | 
129  |                |                         | inserts                           | 
130  |                +-------------------------+-----------------------------------+ 
131  |                | query_span_all          | blockSizes, sizes of each         | 
132  |                |                         | fragment                          | 
133  |                +-------------------------+-----------------------------------+ 
134  |                | query_start             | Q start, start coordinate of the  | 
135  |                |                         | first query block                 | 
136  |                +-------------------------+-----------------------------------+ 
137  |                | query_start_all         | qStarts, start coordinate of each | 
138  |                |                         | query fragment                    | 
139  |                +-------------------------+-----------------------------------+ 
140  |                | len [*]_                | block count, the number of blocks | 
141  |                |                         | in the alignment                  | 
142  +----------------+-------------------------+-----------------------------------+ 
143  | HSPFragment    | hit                     | hit sequence, if present          | 
144  |                +-------------------------+-----------------------------------+ 
145  |                | hit_strand              | strand, hit sequence strand       | 
146  |                +-------------------------+-----------------------------------+ 
147  |                | query                   | query sequence, if present        | 
148  |                +-------------------------+-----------------------------------+ 
149  |                | query_strand            | strand, query sequence strand     | 
150  +----------------+-------------------------+-----------------------------------+ 
151   
152  In addition to the column mappings above, BlatIO also provides the following 
153  object attributes: 
154   
155  +----------------+-------------------------+-----------------------------------+ 
156  | Object         | Attribute               | Value                             | 
157  +================+=========================+===================================+ 
158  | HSP            | gapopen_num             | Q gap count + T gap count, total  | 
159  |                |                         |  number of gap openings           | 
160  |                +-------------------------+-----------------------------------+ 
161  |                | ident_num               | matches + repmatches, total       | 
162  |                |                         | number of identical residues      | 
163  |                +-------------------------+-----------------------------------+ 
164  |                | ident_pct               | percent identity, calculated      | 
165  |                |                         | using UCSC's formula              | 
166  |                +-------------------------+-----------------------------------+ 
167  |                | query_is_protein        | boolean, whether the query        | 
168  |                |                         | sequence is a protein             | 
169  |                +-------------------------+-----------------------------------+ 
170  |                | score                   | HSP score, calculated using       | 
171  |                |                         | UCSC's formula                    | 
172  +----------------+-------------------------+-----------------------------------+ 
173   
174  Finally, the default HSP and HSPFragment properties are also provided. See the 
175  HSP and HSPFragment documentation for more details on these properties. 
176   
177   
178  .. [*] You can obtain the number of blocks / fragments in the HSP by invoking 
179     ``len`` on the HSP 
180   
181  """ 
182  import re 
183  from math import log 
184   
185  from Bio._py3k import _as_bytes, _bytes_to_string 
186  from Bio._py3k import zip 
187   
188  from Bio.Alphabet import generic_dna 
189  from Bio.SearchIO._index import SearchIndexer 
190  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
191   
192   
193  __all__ = ('BlatPslParser', 'BlatPslIndexer', 'BlatPslWriter') 
194   
195   
196  # precompile regex patterns 
197  _PTR_ROW_CHECK = r'^\d+\s+\d+\s+\d+\s+\d+' 
198  _RE_ROW_CHECK = re.compile(_PTR_ROW_CHECK) 
199  _RE_ROW_CHECK_IDX = re.compile(_as_bytes(_PTR_ROW_CHECK)) 
200   
201   
202 -def _list_from_csv(csv_string, caster=None):
203 """Transform the given comma-separated string into a list (PRIVATE). 204 205 :param csv_string: comma-separated input string 206 :type csv_string: string 207 :param caster: function used to cast each item in the input string 208 to its intended type 209 :type caster: callable, accepts string, returns object 210 211 """ 212 if caster is None: 213 return [x for x in csv_string.split(',') if x] 214 else: 215 return [caster(x) for x in csv_string.split(',') if x]
216 217
218 -def _reorient_starts(starts, blksizes, seqlen, strand):
219 """Reorients block starts into the opposite strand's coordinates (PRIVATE). 220 221 :param starts: start coordinates 222 :type starts: list [int] 223 :param blksizes: block sizes 224 :type blksizes: list [int] 225 :param seqlen: sequence length 226 :type seqlen: int 227 :param strand: sequence strand 228 :type strand: int, choice of -1, 0, or 1 229 230 """ 231 assert len(starts) == len(blksizes), \ 232 "Unequal start coordinates and block sizes list (%r vs %r)" \ 233 % (len(starts), len(blksizes)) 234 # see: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 235 # no need to reorient if it's already the positive strand 236 if strand >= 0: 237 return starts 238 else: 239 # the plus-oriented coordinate is calculated by this: 240 # plus_coord = length - minus_coord - block_size 241 return [seqlen - start - blksize for 242 start, blksize in zip(starts, blksizes)]
243 244
245 -def _is_protein(psl):
246 # check if query is protein or not 247 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 248 if len(psl['strand']) == 2: 249 if psl['strand'][1] == '+': 250 return psl['tend'] == psl['tstarts'][-1] + \ 251 3 * psl['blocksizes'][-1] 252 elif psl['strand'][1] == '-': 253 return psl['tstart'] == psl['tsize'] - \ 254 (psl['tstarts'][-1] + 3 * psl['blocksizes'][-1]) 255 256 return False
257 258
259 -def _calc_millibad(psl, is_protein):
260 # calculates millibad 261 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 262 size_mul = 3 if is_protein else 1 263 millibad = 0 264 265 qali_size = size_mul * (psl['qend'] - psl['qstart']) 266 tali_size = psl['tend'] - psl['tstart'] 267 ali_size = min(qali_size, tali_size) 268 if ali_size <= 0: 269 return 0 270 271 size_dif = qali_size - tali_size 272 size_dif = 0 if size_dif < 0 else size_dif 273 274 total = size_mul * (psl['matches'] + psl['repmatches'] + psl['mismatches']) 275 if total != 0: 276 millibad = (1000 * (psl['mismatches'] * size_mul + psl['qnuminsert'] + 277 round(3 * log(1 + size_dif)))) / total 278 279 return millibad
280 281
282 -def _calc_score(psl, is_protein):
283 # calculates score 284 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 285 size_mul = 3 if is_protein else 1 286 return (size_mul * (psl['matches'] + (psl['repmatches'] >> 1)) 287 - size_mul * psl['mismatches'] 288 - psl['qnuminsert'] 289 - psl['tnuminsert'])
290 291
292 -def _create_hsp(hid, qid, psl):
293 # protein flag 294 is_protein = _is_protein(psl) 295 # strand 296 # if query is protein, strand is 0 297 if is_protein: 298 qstrand = 0 299 else: 300 qstrand = 1 if psl['strand'][0] == '+' else -1 301 # try to get hit strand, if it exists 302 try: 303 hstrand = 1 if psl['strand'][1] == '+' else -1 304 except IndexError: 305 hstrand = 1 # hit strand defaults to plus 306 307 blocksize_multiplier = 3 if is_protein else 1 308 # query block starts 309 qstarts = _reorient_starts(psl['qstarts'], 310 psl['blocksizes'], psl['qsize'], qstrand) 311 # hit block starts 312 if len(psl['strand']) == 2: 313 hstarts = _reorient_starts(psl['tstarts'], 314 [blocksize_multiplier * i for i in psl['blocksizes']], 315 psl['tsize'], hstrand) 316 else: 317 hstarts = psl['tstarts'] 318 # set query and hit coords 319 # this assumes each block has no gaps (which seems to be the case) 320 assert len(qstarts) == len(hstarts) == len(psl['blocksizes']) 321 query_range_all = list(zip(qstarts, [x + y for x, y in 322 zip(qstarts, psl['blocksizes'])])) 323 hit_range_all = list(zip(hstarts, [x + y * blocksize_multiplier for x, y in 324 zip(hstarts, psl['blocksizes'])])) 325 # check length of sequences and coordinates, all must match 326 if 'tseqs' in psl and 'qseqs' in psl: 327 assert len(psl['tseqs']) == len(psl['qseqs']) == \ 328 len(query_range_all) == len(hit_range_all) 329 else: 330 assert len(query_range_all) == len(hit_range_all) 331 332 frags = [] 333 # iterating over query_range_all, but hit_range_all works just as well 334 for idx, qcoords in enumerate(query_range_all): 335 hseqlist = psl.get('tseqs') 336 hseq = '' if not hseqlist else hseqlist[idx] 337 qseqlist = psl.get('qseqs') 338 qseq = '' if not qseqlist else qseqlist[idx] 339 frag = HSPFragment(hid, qid, hit=hseq, query=qseq) 340 # set alphabet 341 frag.alphabet = generic_dna 342 # set coordinates 343 frag.query_start = qcoords[0] 344 frag.query_end = qcoords[1] 345 frag.hit_start = hit_range_all[idx][0] 346 frag.hit_end = hit_range_all[idx][1] 347 # and strands 348 frag.query_strand = qstrand 349 frag.hit_strand = hstrand 350 frags.append(frag) 351 352 # create hsp object 353 hsp = HSP(frags) 354 # check if start and end are set correctly 355 assert hsp.query_start == psl['qstart'] 356 assert hsp.query_end == psl['qend'] 357 assert hsp.hit_start == psl['tstart'] 358 assert hsp.hit_end == psl['tend'] 359 # and check block spans as well 360 hit_spans = [span / blocksize_multiplier for span in hsp.hit_span_all] 361 assert hit_spans == hsp.query_span_all == psl['blocksizes'] 362 # set its attributes 363 hsp.match_num = psl['matches'] 364 hsp.mismatch_num = psl['mismatches'] 365 hsp.match_rep_num = psl['repmatches'] 366 hsp.n_num = psl['ncount'] 367 hsp.query_gapopen_num = psl['qnuminsert'] 368 hsp.query_gap_num = psl['qbaseinsert'] 369 hsp.hit_gapopen_num = psl['tnuminsert'] 370 hsp.hit_gap_num = psl['tbaseinsert'] 371 372 hsp.ident_num = psl['matches'] + psl['repmatches'] 373 hsp.gapopen_num = psl['qnuminsert'] + psl['tnuminsert'] 374 hsp.gap_num = psl['qbaseinsert'] + psl['tbaseinsert'] 375 hsp.query_is_protein = is_protein 376 hsp.ident_pct = 100.0 - _calc_millibad(psl, is_protein) * 0.1 377 hsp.score = _calc_score(psl, is_protein) 378 # helper flag, for writing 379 hsp._has_hit_strand = len(psl['strand']) == 2 380 381 return hsp
382 383
384 -class BlatPslParser(object):
385 """Parser for the BLAT PSL format.""" 386
387 - def __init__(self, handle, pslx=False):
388 """Initialize the class.""" 389 self.handle = handle 390 self.line = self.handle.readline() 391 self.pslx = pslx
392
393 - def __iter__(self):
394 # break out if it's an empty file 395 if not self.line: 396 return 397 398 # read through header 399 # this assumes that the result row match the regex 400 while not re.search(_RE_ROW_CHECK, self.line.strip()): 401 self.line = self.handle.readline() 402 if not self.line: 403 return 404 405 # parse into query results 406 for qresult in self._parse_qresult(): 407 qresult.program = 'blat' 408 yield qresult
409
410 - def _parse_row(self):
411 """Return a dictionary of parsed column values (PRIVATE).""" 412 assert self.line 413 cols = [x for x in self.line.strip().split('\t') if x] 414 self._validate_cols(cols) 415 416 psl = {} 417 psl['qname'] = cols[9] # qName 418 psl['qsize'] = int(cols[10]) # qSize 419 psl['tname'] = cols[13] # tName 420 psl['tsize'] = int(cols[14]) # tSize 421 psl['matches'] = int(cols[0]) # matches 422 psl['mismatches'] = int(cols[1]) # misMatches 423 psl['repmatches'] = int(cols[2]) # repMatches 424 psl['ncount'] = int(cols[3]) # nCount 425 psl['qnuminsert'] = int(cols[4]) # qNumInsert 426 psl['qbaseinsert'] = int(cols[5]) # qBaseInsert 427 psl['tnuminsert'] = int(cols[6]) # tNumInsert 428 psl['tbaseinsert'] = int(cols[7]) # tBaseInsert 429 psl['strand'] = cols[8] # strand 430 psl['qstart'] = int(cols[11]) # qStart 431 psl['qend'] = int(cols[12]) # qEnd 432 psl['tstart'] = int(cols[15]) # tStart 433 psl['tend'] = int(cols[16]) # tEnd 434 psl['blockcount'] = int(cols[17]) # blockCount 435 psl['blocksizes'] = _list_from_csv(cols[18], int) # blockSizes 436 psl['qstarts'] = _list_from_csv(cols[19], int) # qStarts 437 psl['tstarts'] = _list_from_csv(cols[20], int) # tStarts 438 if self.pslx: 439 psl['qseqs'] = _list_from_csv(cols[21]) # query sequence 440 psl['tseqs'] = _list_from_csv(cols[22]) # hit sequence 441 442 return psl
443
444 - def _validate_cols(self, cols):
445 if not self.pslx: 446 assert len(cols) == 21, "Invalid PSL line: %r. " \ 447 "Expected 21 tab-separated columns, found %i" % (self.line, len(cols)) 448 else: 449 assert len(cols) == 23, "Invalid PSLX line: %r. " \ 450 "Expected 23 tab-separated columns, found %i" % (self.line, len(cols))
451
452 - def _parse_qresult(self):
453 """Yield QueryResult objects (PRIVATE).""" 454 # state values, determines what to do for each line 455 state_EOF = 0 456 state_QRES_NEW = 1 457 state_QRES_SAME = 3 458 state_HIT_NEW = 2 459 state_HIT_SAME = 4 460 # initial dummy values 461 qres_state = None 462 file_state = None 463 cur_qid, cur_hid = None, None 464 prev_qid, prev_hid = None, None 465 cur, prev = None, None 466 hit_list, hsp_list = [], [] 467 468 while True: 469 # store previous line's parsed values for all lines after the first 470 if cur is not None: 471 prev = cur 472 prev_qid = cur_qid 473 prev_hid = cur_hid 474 # only parse the result row if it's not EOF 475 if self.line: 476 cur = self._parse_row() 477 cur_qid = cur['qname'] 478 cur_hid = cur['tname'] 479 else: 480 file_state = state_EOF 481 # mock values, since we have nothing to parse 482 cur_qid, cur_hid = None, None 483 484 # get the state of hit and qresult 485 if prev_qid != cur_qid: 486 qres_state = state_QRES_NEW 487 else: 488 qres_state = state_QRES_SAME 489 # new hits are hits with different ids or hits in a new qresult 490 if prev_hid != cur_hid or qres_state == state_QRES_NEW: 491 hit_state = state_HIT_NEW 492 else: 493 hit_state = state_HIT_SAME 494 495 if prev is not None: 496 # create fragment and HSP and set their attributes 497 hsp = _create_hsp(prev_hid, prev_qid, prev) 498 hsp_list.append(hsp) 499 500 if hit_state == state_HIT_NEW: 501 # create Hit and set its attributes 502 hit = Hit(hsp_list) 503 hit.seq_len = prev['tsize'] 504 hit_list.append(hit) 505 hsp_list = [] 506 507 # create qresult and yield if we're at a new qresult or at EOF 508 if qres_state == state_QRES_NEW or file_state == state_EOF: 509 qresult = QueryResult(id=prev_qid) 510 for hit in hit_list: 511 qresult.absorb(hit) 512 qresult.seq_len = prev['qsize'] 513 yield qresult 514 # if we're at EOF, break 515 if file_state == state_EOF: 516 break 517 hit_list = [] 518 519 self.line = self.handle.readline()
520 521
522 -class BlatPslIndexer(SearchIndexer):
523 """Indexer class for BLAT PSL output.""" 524 525 _parser = BlatPslParser 526
527 - def __init__(self, filename, pslx=False):
528 """Initialize the class.""" 529 SearchIndexer.__init__(self, filename, pslx=pslx)
530
531 - def __iter__(self):
532 """Iterate over the file handle; yields key, start offset, and length.""" 533 handle = self._handle 534 handle.seek(0) 535 # denotes column location for query identifier 536 query_id_idx = 9 537 qresult_key = None 538 tab_char = b"\t" 539 540 start_offset = handle.tell() 541 line = handle.readline() 542 # read through header 543 # this assumes that the result row match the regex 544 while not re.search(_RE_ROW_CHECK_IDX, line.strip()): 545 start_offset = handle.tell() 546 line = handle.readline() 547 if not line: 548 return 549 550 # and index the qresults 551 while True: 552 end_offset = handle.tell() 553 554 cols = [x for x in line.strip().split(tab_char) if x] 555 if qresult_key is None: 556 qresult_key = cols[query_id_idx] 557 else: 558 curr_key = cols[query_id_idx] 559 560 if curr_key != qresult_key: 561 yield _bytes_to_string(qresult_key), start_offset, \ 562 end_offset - start_offset 563 qresult_key = curr_key 564 start_offset = end_offset - len(line) 565 566 line = handle.readline() 567 if not line: 568 yield _bytes_to_string(qresult_key), start_offset, \ 569 end_offset - start_offset 570 break
571
572 - def get_raw(self, offset):
573 """Return raw bytes string of a QueryResult object from the given offset.""" 574 handle = self._handle 575 handle.seek(offset) 576 query_id_idx = 9 577 qresult_key = None 578 qresult_raw = b"" 579 tab_char = b"\t" 580 581 while True: 582 line = handle.readline() 583 if not line: 584 break 585 cols = [x for x in line.strip().split(tab_char) if x] 586 if qresult_key is None: 587 qresult_key = cols[query_id_idx] 588 else: 589 curr_key = cols[query_id_idx] 590 if curr_key != qresult_key: 591 break 592 qresult_raw += line 593 594 return qresult_raw
595 596
597 -class BlatPslWriter(object):
598 """Writer for the blat-psl format.""" 599
600 - def __init__(self, handle, header=False, pslx=False):
601 """Initialize the class.""" 602 self.handle = handle 603 # flag for writing header or not 604 self.header = header 605 self.pslx = pslx
606
607 - def write_file(self, qresults):
608 handle = self.handle 609 qresult_counter, hit_counter, hsp_counter, frag_counter = 0, 0, 0, 0 610 611 if self.header: 612 handle.write(self._build_header()) 613 614 for qresult in qresults: 615 if qresult: 616 handle.write(self._build_row(qresult)) 617 qresult_counter += 1 618 hit_counter += len(qresult) 619 hsp_counter += sum(len(hit) for hit in qresult) 620 frag_counter += sum(len(hit.fragments) for hit in qresult) 621 622 return qresult_counter, hit_counter, hsp_counter, frag_counter
623
624 - def _build_header(self):
625 # for now, always use the psLayout version 3 626 header = 'psLayout version 3\n' 627 628 # adapted from BLAT's source: lib/psl.c#L496 629 header += "\nmatch\tmis- \trep. \tN's\tQ gap\tQ gap\tT gap\tT " 630 "gap\tstrand\tQ \tQ \tQ \tQ \tT \tT \tT " 631 "\tT \tblock\tblockSizes \tqStarts\t tStarts\n " \ 632 "\tmatch\tmatch\t \tcount\tbases\tcount\tbases\t \tname " 633 "\tsize\tstart\tend\tname \tsize\tstart\tend\tcount" 634 "\n%s\n" % ('-' * 159) 635 636 return header
637
638 - def _build_row(self, qresult):
639 """Return a string or one row or more of the QueryResult object (PRIVATE).""" 640 # For now, our writer writes the row according to the order in 641 # the QueryResult and Hit objects. 642 # This is different from BLAT's native output, where the rows are 643 # grouped by strand. 644 # Should we tweak the behavior to better mimic the native output? 645 qresult_lines = [] 646 647 for hit in qresult: 648 for hsp in hit.hsps: 649 650 query_is_protein = getattr(hsp, "query_is_protein", False) 651 blocksize_multiplier = 3 if query_is_protein else 1 652 653 line = [] 654 line.append(hsp.match_num) 655 line.append(hsp.mismatch_num) 656 line.append(hsp.match_rep_num) 657 line.append(hsp.n_num) 658 line.append(hsp.query_gapopen_num) 659 line.append(hsp.query_gap_num) 660 line.append(hsp.hit_gapopen_num) 661 line.append(hsp.hit_gap_num) 662 663 # check spans 664 eff_query_spans = [blocksize_multiplier * s for s in hsp.query_span_all] 665 if hsp.hit_span_all != eff_query_spans: 666 raise ValueError("HSP hit span and query span values do not match.") 667 block_sizes = hsp.query_span_all 668 669 # set strand and starts 670 if hsp[0].query_strand >= 0: # since it may be a protein seq 671 strand = '+' 672 else: 673 strand = '-' 674 qstarts = _reorient_starts([x[0] for x in hsp.query_range_all], 675 hsp.query_span_all, qresult.seq_len, hsp[0].query_strand) 676 677 if hsp[0].hit_strand == 1: 678 hstrand = 1 679 # only write hit strand if it was present in the source file 680 if hsp._has_hit_strand: 681 strand += '+' 682 else: 683 hstrand = -1 684 strand += '-' 685 hstarts = _reorient_starts([x[0] for x in hsp.hit_range_all], 686 hsp.hit_span_all, hit.seq_len, hstrand) 687 688 line.append(strand) 689 line.append(qresult.id) 690 line.append(qresult.seq_len) 691 line.append(hsp.query_start) 692 line.append(hsp.query_end) 693 line.append(hit.id) 694 line.append(hit.seq_len) 695 line.append(hsp.hit_start) 696 line.append(hsp.hit_end) 697 line.append(len(hsp)) 698 line.append(','.join((str(x) for x in block_sizes)) + ',') 699 line.append(','.join((str(x) for x in qstarts)) + ',') 700 line.append(','.join((str(x) for x in hstarts)) + ',') 701 702 if self.pslx: 703 line.append(','.join((str(x.seq) for x in hsp.query_all)) + ',') 704 line.append(','.join((str(x.seq) for x in hsp.hit_all)) + ',') 705 706 qresult_lines.append('\t'.join((str(x) for x in line))) 707 708 return '\n'.join(qresult_lines) + '\n'
709 710 711 # if not used as a module, run the doctest 712 if __name__ == "__main__": 713 from Bio._utils import run_doctest 714 run_doctest() 715