Package Bio :: Package SearchIO :: Module BlatIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.BlatIO

  1  # Copyright 2012 by Wibowo Arindrarto.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Bio.SearchIO parser for BLAT output formats. 
  7   
  8  This module adds support for parsing BLAT outputs. BLAT (BLAST-Like Alignment 
  9  Tool) is a sequence similarity search program initially built for annotating 
 10  the human genome. 
 11   
 12  Bio.SearchIO.BlastIO was tested using standalone BLAT version 34, psLayout 
 13  version 3. It should be able to parse psLayout version 4 without problems. 
 14   
 15  More information on BLAT is available from these sites: 
 16   
 17      - Publication: http://genome.cshlp.org/content/12/4/656 
 18      - User guide: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 
 19      - Source download: http://www.soe.ucsc.edu/~kent/src 
 20      - Executable download: http://hgdownload.cse.ucsc.edu/admin/exe/ 
 21      - Blat score calculation: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 
 22   
 23   
 24  Supported Formats 
 25  ================= 
 26   
 27  BlatIO supports parsing, indexing, and writing for both PSL and PSLX output 
 28  formats, with or without header. To parse, index, or write PSLX files, use the 
 29  'pslx' keyword argument and set it to True. 
 30   
 31      # blat-psl defaults to PSL files 
 32      >>> from Bio import SearchIO 
 33      >>> psl = 'Blat/psl_34_004.psl' 
 34      >>> qresult = SearchIO.read(psl, 'blat-psl') 
 35      >>> qresult 
 36      QueryResult(id='hg19_dna', 10 hits) 
 37   
 38      # set the pslx flag to parse PSLX files 
 39      >>> pslx = 'Blat/pslx_34_004.pslx' 
 40      >>> qresult = SearchIO.read(pslx, 'blat-psl', pslx=True) 
 41      >>> qresult 
 42      QueryResult(id='hg19_dna', 10 hits) 
 43   
 44  For parsing and indexing, you do not need to specify whether the file has a 
 45  header or not. For writing, if you want to write a header, you can set the 
 46  'header' keyword argument to True. This will write a 'psLayout version 3' header 
 47  to your output file. 
 48   
 49      from Bio import SearchIO 
 50      qresult = SearchIO.read(psl, 'blat-psl') 
 51      SearchIO.write(qresult, 'header.psl', header=True) 
 52      <stdout> (1, 10, 19, 23) 
 53   
 54  Note that the number of HSPFragments written may exceed the number of HSP 
 55  objects. This is because in PSL files, it is possible to have single matches 
 56  consisting of noncontiguous sequence fragments. This is where the HSPFragment 
 57  object comes into play. These fragments are grouped into a single HSP because 
 58  they share the same statistics (e.g. match numbers, BLAT score, etc.). However, 
 59  they do not share the same sequence attributes, such as the start and end 
 60  coordinates, making them distinct objects. 
 61   
 62  In addition to parsing PSL(X) files, BlatIO also computes the percent identities 
 63  and scores of your search results. This is done using the calculation formula 
 64  posted here: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4. It mimics the score 
 65  and percent identity calculation done by UCSC's web BLAT service. 
 66   
 67  Since BlatIO parses the file in a single pass, it expects all results from 
 68  the same query to be in consecutive rows. If the results from one query are 
 69  spread in nonconsecutive rows, BlatIO will consider them to be separate 
 70  QueryResult objects. 
 71   
 72  In most cases, the PSL(X) format uses the same coordinate system as Python 
 73  (zero-based, half open). These coordinates are anchored on the plus strand. 
 74  However, if the query aligns on the minus strand, BLAT will anchor the qStarts 
 75  coordinates on the minus strand instead. BlatIO is aware of this, and will 
 76  re-anchor the qStarts coordinates to the plus strand whenever it sees a minus 
 77  strand query match. Conversely, when you write out to a PSL(X) file, BlatIO will 
 78  reanchor qStarts to the minus strand again. 
 79   
 80  BlatIO provides the following attribute-column mapping: 
 81   
 82  +----------------+-------------------------+-----------------------------------+ 
 83  | Object         | Attribute               | Column Name, Value                | 
 84  +================+=========================+===================================+ 
 85  | QueryResutl    | id                      | Q name, query sequence ID         | 
 86  |                +-------------------------+-----------------------------------+ 
 87  |                | seq_len                 | Q size, query sequence full       | 
 88  |                |                         | length                            | 
 89  +----------------+-------------------------+-----------------------------------+ 
 90  | Hit            | id                      | T name, hit sequence ID           | 
 91  |                +-------------------------+-----------------------------------+ 
 92  |                | seq_len                 | T size, hit sequence full length  | 
 93  +----------------+-------------------------+-----------------------------------+ 
 94  | HSP            | hit_end                 | T end, end coordinate of the last | 
 95  |                |                         | hit fragment                      | 
 96  |                +-------------------------+-----------------------------------+ 
 97  |                | hit_gap_num             | T gap bases, number of bases      | 
 98  |                |                         | inserted in hit                   | 
 99  |                +-------------------------+-----------------------------------+ 
100  |                | hit_gapopen_num         | T gap count, number of hit gap    | 
101  |                |                         | inserts                           | 
102  |                +-------------------------+-----------------------------------+ 
103  |                | hit_span_all            | blockSizes, sizes of each         | 
104  |                |                         | fragment                          | 
105  |                +-------------------------+-----------------------------------+ 
106  |                | hit_start               | T start, start coordinate of the  | 
107  |                |                         | first hit fragment                | 
108  |                +-------------------------+-----------------------------------+ 
109  |                | hit_start_all           | tStarts, start coordinate of each | 
110  |                |                         | hit fragment                      | 
111  |                +-------------------------+-----------------------------------+ 
112  |                | match_num               | match, number of non-repeat       | 
113  |                |                         | matches                           | 
114  |                +-------------------------+-----------------------------------+ 
115  |                | mismatch_num            | mismatch, number of mismatches    | 
116  |                +-------------------------+-----------------------------------+ 
117  |                | match_rep_num           | rep. match, number of matches     | 
118  |                |                         | that are part of repeats          | 
119  |                +-------------------------+-----------------------------------+ 
120  |                | n_num                   | N's, number of N bases            | 
121  |                +-------------------------+-----------------------------------+ 
122  |                | query_end               | Q end, end coordinate of the last | 
123  |                +-------------------------+-----------------------------------+ 
124  |                |                         | query fragment                    | 
125  |                | query_gap_num           | Q gap bases, number of bases      | 
126  |                |                         | inserted in query                 | 
127  |                +-------------------------+-----------------------------------+ 
128  |                | query_gapopen_num       | Q gap count, number of query gap  | 
129  |                |                         | inserts                           | 
130  |                +-------------------------+-----------------------------------+ 
131  |                | query_span_all          | blockSizes, sizes of each         | 
132  |                |                         | fragment                          | 
133  |                +-------------------------+-----------------------------------+ 
134  |                | query_start             | Q start, start coordinate of the  | 
135  |                |                         | first query block                 | 
136  |                +-------------------------+-----------------------------------+ 
137  |                | query_start_all         | qStarts, start coordinate of each | 
138  |                |                         | query fragment                    | 
139  |                +-------------------------+-----------------------------------+ 
140  |                | len [1]                 | block count, the number of blocks | 
141  |                |                         | in the alignment                  | 
142  +----------------+-------------------------+-----------------------------------+ 
143  | HSPFragment    | hit                     | hit sequence, if present          | 
144  |                +-------------------------+-----------------------------------+ 
145  |                | hit_strand              | strand, hit sequence strand       | 
146  |                +-------------------------+-----------------------------------+ 
147  |                | query                   | query sequence, if present        | 
148  |                +-------------------------+-----------------------------------+ 
149  |                | query_strand            | strand, query sequence strand     | 
150  +----------------+-------------------------+-----------------------------------+ 
151   
152  In addition to the column mappings above, BlatIO also provides the following 
153  object attributes: 
154   
155  +----------------+-------------------------+-----------------------------------+ 
156  | Object         | Attribute               | Value                             | 
157  +================+=========================+===================================+ 
158  | HSP            | gapopen_num             | Q gap count + T gap count, total  | 
159  |                |                         |  number of gap openings           | 
160  |                +-------------------------+-----------------------------------+ 
161  |                | ident_num               | matches + repmatches, total       | 
162  |                |                         | number of identical residues      | 
163  |                +-------------------------+-----------------------------------+ 
164  |                | ident_pct               | percent identity, calculated      | 
165  |                |                         | using UCSC's formula              | 
166  |                +-------------------------+-----------------------------------+ 
167  |                | query_is_protein        | boolean, whether the query        | 
168  |                |                         | sequence is a protein             | 
169  |                +-------------------------+-----------------------------------+ 
170  |                | score                   | HSP score, calculated using       | 
171  |                |                         | UCSC's formula                    | 
172  +----------------+-------------------------+-----------------------------------+ 
173   
174  Finally, the default HSP and HSPFragment properties are also provided. See the 
175  HSP and HSPFragment documentation for more details on these properties. 
176   
177   
178  .. [1] You can obtain the number of blocks / fragments in the HSP by invoking 
179     ``len`` on the HSP 
180   
181  """ 
182  import re 
183  from math import log 
184   
185  from Bio._py3k import _as_bytes, _bytes_to_string 
186  from Bio._py3k import zip 
187   
188  from Bio.Alphabet import generic_dna 
189  from Bio.SearchIO._index import SearchIndexer 
190  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
191   
192   
193  __all__ = ['BlatPslParser', 'BlatPslIndexer', 'BlatPslWriter'] 
194   
195  __docformat__ = "restructuredtext en" 
196   
197   
198  # precompile regex patterns 
199  _PTR_ROW_CHECK = r'^\d+\s+\d+\s+\d+\s+\d+' 
200  _RE_ROW_CHECK = re.compile(_PTR_ROW_CHECK) 
201  _RE_ROW_CHECK_IDX = re.compile(_as_bytes(_PTR_ROW_CHECK)) 
202   
203   
204 -def _list_from_csv(csv_string, caster=None):
205 """Transforms the given comma-separated string into a list. 206 207 :param csv_string: comma-separated input string 208 :type csv_string: string 209 :param caster: function used to cast each item in the input string 210 to its intended type 211 :type caster: callable, accepts string, returns object 212 213 """ 214 if caster is None: 215 return [x for x in csv_string.split(',') if x] 216 else: 217 return [caster(x) for x in csv_string.split(',') if x]
218 219
220 -def _reorient_starts(starts, blksizes, seqlen, strand):
221 """Reorients block starts into the opposite strand's coordinates. 222 223 :param starts: start coordinates 224 :type starts: list [int] 225 :param blksizes: block sizes 226 :type blksizes: list [int] 227 :param seqlen: sequence length 228 :type seqlen: int 229 :param strand: sequence strand 230 :type strand: int, choice of -1, 0, or 1 231 232 """ 233 assert len(starts) == len(blksizes), \ 234 "Unequal start coordinates and block sizes list (%r vs %r)" \ 235 % (len(starts), len(blksizes)) 236 # see: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 237 # no need to reorient if it's already the positive strand 238 if strand >= 0: 239 return starts 240 else: 241 # the plus-oriented coordinate is calculated by this: 242 # plus_coord = length - minus_coord - block_size 243 return [seqlen - start - blksize for 244 start, blksize in zip(starts, blksizes)]
245 246
247 -def _is_protein(psl):
248 # check if query is protein or not 249 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 250 if len(psl['strand']) == 2: 251 if psl['strand'][1] == '+': 252 return psl['tend'] == psl['tstarts'][-1] + \ 253 3 * psl['blocksizes'][-1] 254 elif psl['strand'][1] == '-': 255 return psl['tstart'] == psl['tsize'] - \ 256 (psl['tstarts'][-1] + 3 * psl['blocksizes'][-1]) 257 258 return False
259 260
261 -def _calc_millibad(psl, is_protein):
262 # calculates millibad 263 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 264 size_mul = 3 if is_protein else 1 265 millibad = 0 266 267 qali_size = size_mul * (psl['qend'] - psl['qstart']) 268 tali_size = psl['tend'] - psl['tstart'] 269 ali_size = min(qali_size, tali_size) 270 if ali_size <= 0: 271 return 0 272 273 size_dif = qali_size - tali_size 274 size_dif = 0 if size_dif < 0 else size_dif 275 276 total = size_mul * (psl['matches'] + psl['repmatches'] + psl['mismatches']) 277 if total != 0: 278 millibad = (1000 * (psl['mismatches'] * size_mul + psl['qnuminsert'] + 279 round(3 * log(1 + size_dif)))) / total 280 281 return millibad
282 283
284 -def _calc_score(psl, is_protein):
285 # calculates score 286 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 287 size_mul = 3 if is_protein else 1 288 return size_mul * (psl['matches'] + (psl['repmatches'] >> 1)) - \ 289 size_mul * psl['mismatches'] - psl['qnuminsert'] - psl['tnuminsert']
290 291
292 -def _create_hsp(hid, qid, psl):
293 # protein flag 294 is_protein = _is_protein(psl) 295 # strand 296 # if query is protein, strand is 0 297 if is_protein: 298 qstrand = 0 299 else: 300 qstrand = 1 if psl['strand'][0] == '+' else -1 301 # try to get hit strand, if it exists 302 try: 303 hstrand = 1 if psl['strand'][1] == '+' else -1 304 except IndexError: 305 hstrand = 1 # hit strand defaults to plus 306 307 # query block starts 308 qstarts = _reorient_starts(psl['qstarts'], 309 psl['blocksizes'], psl['qsize'], qstrand) 310 # hit block starts 311 if len(psl['strand']) == 2: 312 hstarts = _reorient_starts(psl['tstarts'], 313 psl['blocksizes'], psl['tsize'], hstrand) 314 else: 315 hstarts = psl['tstarts'] 316 # set query and hit coords 317 # this assumes each block has no gaps (which seems to be the case) 318 assert len(qstarts) == len(hstarts) == len(psl['blocksizes']) 319 query_range_all = list(zip(qstarts, [x + y for x, y in 320 zip(qstarts, psl['blocksizes'])])) 321 hit_range_all = list(zip(hstarts, [x + y for x, y in 322 zip(hstarts, psl['blocksizes'])])) 323 # check length of sequences and coordinates, all must match 324 if 'tseqs' in psl and 'qseqs' in psl: 325 assert len(psl['tseqs']) == len(psl['qseqs']) == \ 326 len(query_range_all) == len(hit_range_all) 327 else: 328 assert len(query_range_all) == len(hit_range_all) 329 330 frags = [] 331 # iterating over query_range_all, but hit_range_all works just as well 332 for idx, qcoords in enumerate(query_range_all): 333 hseqlist = psl.get('tseqs') 334 hseq = '' if not hseqlist else hseqlist[idx] 335 qseqlist = psl.get('qseqs') 336 qseq = '' if not qseqlist else qseqlist[idx] 337 frag = HSPFragment(hid, qid, hit=hseq, query=qseq) 338 # set alphabet 339 frag.alphabet = generic_dna 340 # set coordinates 341 frag.query_start = qcoords[0] 342 frag.query_end = qcoords[1] 343 frag.hit_start = hit_range_all[idx][0] 344 frag.hit_end = hit_range_all[idx][1] 345 # and strands 346 frag.query_strand = qstrand 347 frag.hit_strand = hstrand 348 frags.append(frag) 349 350 # create hsp object 351 hsp = HSP(frags) 352 # check if start and end are set correctly 353 assert hsp.query_start == psl['qstart'] 354 assert hsp.query_end == psl['qend'] 355 assert hsp.hit_start == psl['tstart'] 356 assert hsp.hit_end == psl['tend'] 357 # and check block spans as well 358 assert hsp.query_span_all == hsp.hit_span_all == psl['blocksizes'] 359 # set its attributes 360 hsp.match_num = psl['matches'] 361 hsp.mismatch_num = psl['mismatches'] 362 hsp.match_rep_num = psl['repmatches'] 363 hsp.n_num = psl['ncount'] 364 hsp.query_gapopen_num = psl['qnuminsert'] 365 hsp.query_gap_num = psl['qbaseinsert'] 366 hsp.hit_gapopen_num = psl['tnuminsert'] 367 hsp.hit_gap_num = psl['tbaseinsert'] 368 369 hsp.ident_num = psl['matches'] + psl['repmatches'] 370 hsp.gapopen_num = psl['qnuminsert'] + psl['tnuminsert'] 371 hsp.gap_num = psl['qbaseinsert'] + psl['tbaseinsert'] 372 hsp.query_is_protein = is_protein 373 hsp.ident_pct = 100.0 - _calc_millibad(psl, is_protein) * 0.1 374 hsp.score = _calc_score(psl, is_protein) 375 # helper flag, for writing 376 hsp._has_hit_strand = len(psl['strand']) == 2 377 378 return hsp
379 380
381 -class BlatPslParser(object):
382 383 """Parser for the BLAT PSL format.""" 384
385 - def __init__(self, handle, pslx=False):
386 self.handle = handle 387 self.line = self.handle.readline() 388 self.pslx = pslx
389
390 - def __iter__(self):
391 # break out if it's an empty file 392 if not self.line: 393 raise StopIteration 394 395 # read through header 396 # this assumes that the result row match the regex 397 while not re.search(_RE_ROW_CHECK, self.line.strip()): 398 self.line = self.handle.readline() 399 if not self.line: 400 raise StopIteration 401 402 # parse into query results 403 for qresult in self._parse_qresult(): 404 qresult.program = 'blat' 405 yield qresult
406
407 - def _parse_row(self):
408 """Returns a dictionary of parsed column values.""" 409 assert self.line 410 cols = [x for x in self.line.strip().split('\t') if x] 411 self._validate_cols(cols) 412 413 psl = {} 414 psl['qname'] = cols[9] # qName 415 psl['qsize'] = int(cols[10]) # qSize 416 psl['tname'] = cols[13] # tName 417 psl['tsize'] = int(cols[14]) # tSize 418 psl['matches'] = int(cols[0]) # matches 419 psl['mismatches'] = int(cols[1]) # misMatches 420 psl['repmatches'] = int(cols[2]) # repMatches 421 psl['ncount'] = int(cols[3]) # nCount 422 psl['qnuminsert'] = int(cols[4]) # qNumInsert 423 psl['qbaseinsert'] = int(cols[5]) # qBaseInsert 424 psl['tnuminsert'] = int(cols[6]) # tNumInsert 425 psl['tbaseinsert'] = int(cols[7]) # tBaseInsert 426 psl['strand'] = cols[8] # strand 427 psl['qstart'] = int(cols[11]) # qStart 428 psl['qend'] = int(cols[12]) # qEnd 429 psl['tstart'] = int(cols[15]) # tStart 430 psl['tend'] = int(cols[16]) # tEnd 431 psl['blockcount'] = int(cols[17]) # blockCount 432 psl['blocksizes'] = _list_from_csv(cols[18], int) # blockSizes 433 psl['qstarts'] = _list_from_csv(cols[19], int) # qStarts 434 psl['tstarts'] = _list_from_csv(cols[20], int) # tStarts 435 if self.pslx: 436 psl['qseqs'] = _list_from_csv(cols[21]) # query sequence 437 psl['tseqs'] = _list_from_csv(cols[22]) # hit sequence 438 439 return psl
440
441 - def _validate_cols(self, cols):
442 if not self.pslx: 443 assert len(cols) == 21, "Invalid PSL line: %r. " \ 444 "Expected 21 tab-separated columns, found %i" % (self.line, len(cols)) 445 else: 446 assert len(cols) == 23, "Invalid PSLX line: %r. " \ 447 "Expected 23 tab-separated columns, found %i" % (self.line, len(cols))
448
449 - def _parse_qresult(self):
450 """Generator function that returns QueryResult objects.""" 451 # state values, determines what to do for each line 452 state_EOF = 0 453 state_QRES_NEW = 1 454 state_QRES_SAME = 3 455 state_HIT_NEW = 2 456 state_HIT_SAME = 4 457 # initial dummy values 458 qres_state = None 459 file_state = None 460 prev_qid, prev_hid = None, None 461 cur, prev = None, None 462 hit_list, hsp_list = [], [] 463 464 while True: 465 # store previous line's parsed values for all lines after the first 466 if cur is not None: 467 prev = cur 468 prev_qid = cur_qid 469 prev_hid = cur_hid 470 # only parse the result row if it's not EOF 471 if self.line: 472 cur = self._parse_row() 473 cur_qid = cur['qname'] 474 cur_hid = cur['tname'] 475 else: 476 file_state = state_EOF 477 # mock values, since we have nothing to parse 478 cur_qid, cur_hid = None, None 479 480 # get the state of hit and qresult 481 if prev_qid != cur_qid: 482 qres_state = state_QRES_NEW 483 else: 484 qres_state = state_QRES_SAME 485 # new hits are hits with different ids or hits in a new qresult 486 if prev_hid != cur_hid or qres_state == state_QRES_NEW: 487 hit_state = state_HIT_NEW 488 else: 489 hit_state = state_HIT_SAME 490 491 if prev is not None: 492 # create fragment and HSP and set their attributes 493 hsp = _create_hsp(prev_hid, prev_qid, prev) 494 hsp_list.append(hsp) 495 496 if hit_state == state_HIT_NEW: 497 # create Hit and set its attributes 498 hit = Hit(hsp_list) 499 hit.seq_len = prev['tsize'] 500 hit_list.append(hit) 501 hsp_list = [] 502 503 # create qresult and yield if we're at a new qresult or at EOF 504 if qres_state == state_QRES_NEW or file_state == state_EOF: 505 qresult = QueryResult(id=prev_qid) 506 for hit in hit_list: 507 qresult.absorb(hit) 508 qresult.seq_len = prev['qsize'] 509 yield qresult 510 # if we're at EOF, break 511 if file_state == state_EOF: 512 break 513 hit_list = [] 514 515 self.line = self.handle.readline()
516 517
518 -class BlatPslIndexer(SearchIndexer):
519 520 """Indexer class for BLAT PSL output.""" 521 522 _parser = BlatPslParser 523
524 - def __init__(self, filename, pslx=False):
525 SearchIndexer.__init__(self, filename, pslx=pslx)
526
527 - def __iter__(self):
528 """Iterates over the file handle; yields key, start offset, and length.""" 529 handle = self._handle 530 handle.seek(0) 531 # denotes column location for query identifier 532 query_id_idx = 9 533 qresult_key = None 534 tab_char = _as_bytes('\t') 535 536 start_offset = handle.tell() 537 line = handle.readline() 538 # read through header 539 # this assumes that the result row match the regex 540 while not re.search(_RE_ROW_CHECK_IDX, line.strip()): 541 start_offset = handle.tell() 542 line = handle.readline() 543 if not line: 544 raise StopIteration 545 546 # and index the qresults 547 while True: 548 end_offset = handle.tell() 549 550 cols = [x for x in line.strip().split(tab_char) if x] 551 if qresult_key is None: 552 qresult_key = cols[query_id_idx] 553 else: 554 curr_key = cols[query_id_idx] 555 556 if curr_key != qresult_key: 557 yield _bytes_to_string(qresult_key), start_offset, \ 558 end_offset - start_offset 559 qresult_key = curr_key 560 start_offset = end_offset - len(line) 561 562 line = handle.readline() 563 if not line: 564 yield _bytes_to_string(qresult_key), start_offset, \ 565 end_offset - start_offset 566 break
567
568 - def get_raw(self, offset):
569 """Returns the raw string of a QueryResult object from the given offset.""" 570 handle = self._handle 571 handle.seek(offset) 572 query_id_idx = 9 573 qresult_key = None 574 qresult_raw = _as_bytes('') 575 tab_char = _as_bytes('\t') 576 577 while True: 578 line = handle.readline() 579 if not line: 580 break 581 cols = [x for x in line.strip().split(tab_char) if x] 582 if qresult_key is None: 583 qresult_key = cols[query_id_idx] 584 else: 585 curr_key = cols[query_id_idx] 586 if curr_key != qresult_key: 587 break 588 qresult_raw += line 589 590 return qresult_raw
591 592
593 -class BlatPslWriter(object):
594 595 """Writer for the blat-psl format.""" 596
597 - def __init__(self, handle, header=False, pslx=False):
598 self.handle = handle 599 # flag for writing header or not 600 self.header = header 601 self.pslx = pslx
602
603 - def write_file(self, qresults):
604 handle = self.handle 605 qresult_counter, hit_counter, hsp_counter, frag_counter = 0, 0, 0, 0 606 607 if self.header: 608 handle.write(self._build_header()) 609 610 for qresult in qresults: 611 if qresult: 612 handle.write(self._build_row(qresult)) 613 qresult_counter += 1 614 hit_counter += len(qresult) 615 hsp_counter += sum(len(hit) for hit in qresult) 616 frag_counter += sum(len(hit.fragments) for hit in qresult) 617 618 return qresult_counter, hit_counter, hsp_counter, frag_counter
619
620 - def _build_header(self):
621 # for now, always use the psLayout version 3 622 header = 'psLayout version 3\n' 623 624 # adapted from BLAT's source: lib/psl.c#L496 625 header += "\nmatch\tmis- \trep. \tN's\tQ gap\tQ gap\tT gap\tT " 626 "gap\tstrand\tQ \tQ \tQ \tQ \tT \tT \tT " 627 "\tT \tblock\tblockSizes \tqStarts\t tStarts\n " \ 628 "\tmatch\tmatch\t \tcount\tbases\tcount\tbases\t \tname " 629 "\tsize\tstart\tend\tname \tsize\tstart\tend\tcount" 630 "\n%s\n" % ('-' * 159) 631 632 return header
633
634 - def _build_row(self, qresult):
635 """Returns a string or one row or more of the QueryResult object.""" 636 # For now, our writer writes the row according to the order in 637 # the QueryResult and Hit objects. 638 # This is different from BLAT's native output, where the rows are 639 # grouped by strand. 640 # Should we tweak the behavior to better mimic the native output? 641 qresult_lines = [] 642 643 for hit in qresult: 644 for hsp in hit.hsps: 645 646 line = [] 647 line.append(hsp.match_num) 648 line.append(hsp.mismatch_num) 649 line.append(hsp.match_rep_num) 650 line.append(hsp.n_num) 651 line.append(hsp.query_gapopen_num) 652 line.append(hsp.query_gap_num) 653 line.append(hsp.hit_gapopen_num) 654 line.append(hsp.hit_gap_num) 655 656 # check spans 657 assert hsp.query_span_all == hsp.hit_span_all 658 block_sizes = hsp.query_span_all 659 660 # set strand and starts 661 if hsp[0].query_strand >= 0: # since it may be a protein seq 662 strand = '+' 663 else: 664 strand = '-' 665 qstarts = _reorient_starts([x[0] for x in hsp.query_range_all], 666 hsp.query_span_all, qresult.seq_len, hsp[0].query_strand) 667 668 if hsp[0].hit_strand == 1: 669 hstrand = 1 670 # only write hit strand if it was present in the source file 671 if hsp._has_hit_strand: 672 strand += '+' 673 else: 674 hstrand = -1 675 strand += '-' 676 hstarts = _reorient_starts([x[0] for x in hsp.hit_range_all], 677 hsp.hit_span_all, hit.seq_len, hstrand) 678 679 line.append(strand) 680 line.append(qresult.id) 681 line.append(qresult.seq_len) 682 line.append(hsp.query_start) 683 line.append(hsp.query_end) 684 line.append(hit.id) 685 line.append(hit.seq_len) 686 line.append(hsp.hit_start) 687 line.append(hsp.hit_end) 688 line.append(len(hsp)) 689 line.append(','.join((str(x) for x in block_sizes)) + ',') 690 line.append(','.join((str(x) for x in qstarts)) + ',') 691 line.append(','.join((str(x) for x in hstarts)) + ',') 692 693 if self.pslx: 694 line.append(','.join((str(x.seq) for x in hsp.query_all)) + ',') 695 line.append(','.join((str(x.seq) for x in hsp.hit_all)) + ',') 696 697 qresult_lines.append('\t'.join((str(x) for x in line))) 698 699 return '\n'.join(qresult_lines) + '\n'
700 701 702 # if not used as a module, run the doctest 703 if __name__ == "__main__": 704 from Bio._utils import run_doctest 705 run_doctest() 706