Package Bio :: Module pairwise2
[hide private]
[frames] | no frames]

Source Code for Module Bio.pairwise2

  1  # Copyright 2002 by Jeffrey Chang. 
  2  # Copyright 2016 by Markus Piotrowski. 
  3  # All rights reserved. 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """This package implements pairwise sequence alignment using a dynamic 
  9  programming algorithm. 
 10   
 11  This provides functions to get global and local alignments between two 
 12  sequences. A global alignment finds the best concordance between all 
 13  characters in two sequences. A local alignment finds just the 
 14  subsequences that align the best. 
 15   
 16  When doing alignments, you can specify the match score and gap 
 17  penalties.  The match score indicates the compatibility between an 
 18  alignment of two characters in the sequences. Highly compatible 
 19  characters should be given positive scores, and incompatible ones 
 20  should be given negative scores or 0.  The gap penalties should be 
 21  negative. 
 22   
 23  The names of the alignment functions in this module follow the 
 24  convention 
 25  <alignment type>XX 
 26  where <alignment type> is either "global" or "local" and XX is a 2 
 27  character code indicating the parameters it takes.  The first 
 28  character indicates the parameters for matches (and mismatches), and 
 29  the second indicates the parameters for gap penalties. 
 30   
 31  The match parameters are:: 
 32   
 33      CODE  DESCRIPTION 
 34      x     No parameters. Identical characters have score of 1, otherwise 0. 
 35      m     A match score is the score of identical chars, otherwise mismatch 
 36            score. 
 37      d     A dictionary returns the score of any pair of characters. 
 38      c     A callback function returns scores. 
 39   
 40  The gap penalty parameters are:: 
 41   
 42      CODE  DESCRIPTION 
 43      x     No gap penalties. 
 44      s     Same open and extend gap penalties for both sequences. 
 45      d     The sequences have different open and extend gap penalties. 
 46      c     A callback function returns the gap penalties. 
 47   
 48  All the different alignment functions are contained in an object 
 49  ``align``. For example: 
 50   
 51      >>> from Bio import pairwise2 
 52      >>> alignments = pairwise2.align.globalxx("ACCGT", "ACG") 
 53   
 54  will return a list of the alignments between the two strings. For a nice 
 55  printout, use the ``format_alignment`` method of the module: 
 56   
 57      >>> from Bio.pairwise2 import format_alignment 
 58      >>> print(format_alignment(*alignments[0])) 
 59      ACCGT 
 60      ||||| 
 61      A-CG- 
 62        Score=3 
 63      <BLANKLINE> 
 64   
 65  All alignment functions have the following arguments: 
 66   
 67  - Two sequences: strings, Biopython sequence objects or lists. 
 68    Lists are useful for suppling sequences which contain residues that are 
 69    encoded by more than one letter. 
 70   
 71  - ``penalize_extend_when_opening``: boolean (default: False). 
 72    Whether to count an extension penalty when opening a gap. If false, a gap of 
 73    1 is only penalized an "open" penalty, otherwise it is penalized 
 74    "open+extend". 
 75   
 76  - ``penalize_end_gaps``: boolean. 
 77    Whether to count the gaps at the ends of an alignment. By default, they are 
 78    counted for global alignments but not for local ones. Setting 
 79    ``penalize_end_gaps`` to (boolean, boolean) allows you to specify for the 
 80    two sequences separately whether gaps at the end of the alignment should be 
 81    counted. 
 82   
 83  - ``gap_char``: string (default: ``'-'``). 
 84    Which character to use as a gap character in the alignment returned. If your 
 85    input sequences are lists, you must change this to ``['-']``. 
 86   
 87  - ``force_generic``: boolean (default: False). 
 88    Always use the generic, non-cached, dynamic programming function (slow!). 
 89    For debugging. 
 90   
 91  - ``score_only``: boolean (default: False). 
 92    Only get the best score, don't recover any alignments. The return value of 
 93    the function is the score. Faster and uses less memory. 
 94   
 95  - ``one_alignment_only``: boolean (default: False). 
 96    Only recover one alignment. 
 97   
 98  The other parameters of the alignment function depend on the function called. 
 99  Some examples: 
100   
101  - Find the best global alignment between the two sequences. Identical 
102    characters are given 1 point. No points are deducted for mismatches or gaps. 
103   
104      >>> for a in pairwise2.align.globalxx("ACCGT", "ACG"): 
105      ...     print(format_alignment(*a)) 
106      ACCGT 
107      ||||| 
108      A-CG- 
109        Score=3 
110      <BLANKLINE> 
111      ACCGT 
112      ||||| 
113      AC-G- 
114        Score=3 
115      <BLANKLINE> 
116   
117  - Same thing as before, but with a local alignment. 
118   
119      >>> for a in pairwise2.align.localxx("ACCGT", "ACG"): 
120      ...     print(format_alignment(*a)) 
121      ACCGT 
122      |||| 
123      A-CG- 
124        Score=3 
125      <BLANKLINE> 
126      ACCGT 
127      |||| 
128      AC-G- 
129        Score=3 
130      <BLANKLINE> 
131   
132  - Do a global alignment. Identical characters are given 2 points, 1 point is 
133    deducted for each non-identical character. Don't penalize gaps. 
134   
135      >>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1): 
136      ...     print(format_alignment(*a)) 
137      ACCGT 
138      ||||| 
139      A-CG- 
140        Score=6 
141      <BLANKLINE> 
142      ACCGT 
143      ||||| 
144      AC-G- 
145        Score=6 
146      <BLANKLINE> 
147   
148  - Same as above, except now 0.5 points are deducted when opening a gap, and 
149    0.1 points are deducted when extending it. 
150   
151      >>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1): 
152      ...     print(format_alignment(*a)) 
153      ACCGT 
154      ||||| 
155      A-CG- 
156        Score=5 
157      <BLANKLINE> 
158      ACCGT 
159      ||||| 
160      AC-G- 
161        Score=5 
162      <BLANKLINE> 
163   
164  - Depending on the penalties, a gap in one sequence may be followed by a gap in 
165    the other sequence.If you don't like this behaviour, increase the gap-open 
166    penalty: 
167   
168      >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1): 
169      ...     print(format_alignment(*a)) 
170      A- 
171      || 
172      -T 
173        Score=-2 
174      <BLANKLINE> 
175      >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1): 
176      ...     print(format_alignment(*a)) 
177      A 
178      | 
179      T 
180        Score=-4 
181      <BLANKLINE> 
182   
183  - The alignment function can also use known matrices already included in 
184    Biopython (``MatrixInfo`` from ``Bio.SubsMat``): 
185   
186      >>> from Bio.SubsMat import MatrixInfo as matlist 
187      >>> matrix = matlist.blosum62 
188      >>> for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix): 
189      ...     print(format_alignment(*a)) 
190      KEVLA 
191      ||||| 
192      -EVL- 
193        Score=13 
194      <BLANKLINE> 
195   
196  - With the parameter ``c`` you can define your own match- and gap functions. 
197    E.g. to define an affine logarithmic gap function and using it: 
198   
199      >>> from math import log 
200      >>> def gap_function(x, y):  # x is gap position in seq, y is gap length 
201      ...     if y == 0:  # No gap 
202      ...         return 0 
203      ...     elif y == 1:  # Gap open penalty 
204      ...         return -2 
205      ...     return - (2 + y/4.0 + log(y)/2.0) 
206      ... 
207      >>> alignment = pairwise2.align.globalmc("ACCCCCGT", "ACG", 5, -4, 
208      ...                                      gap_function, gap_function) 
209   
210    You can define different gap functions for each sequence. 
211    Self-defined match functions must take the two residues to be compared and 
212    return a score. 
213   
214  To see a description of the parameters for a function, please look at 
215  the docstring for the function via the help function, e.g. 
216  type ``help(pairwise2.align.localds``) at the Python prompt. 
217   
218  """ 
219  from __future__ import print_function 
220   
221  import warnings 
222   
223  from Bio import BiopythonWarning 
224   
225   
226  MAX_ALIGNMENTS = 1000   # maximum alignments recovered in traceback 
227   
228   
229 -class align(object):
230 """This class provides functions that do alignments.""" 231
232 - class alignment_function(object):
233 """This class is callable impersonates an alignment function. 234 The constructor takes the name of the function. This class 235 will decode the name of the function to figure out how to 236 interpret the parameters. 237 238 """ 239 # match code -> tuple of (parameters, docstring) 240 match2args = { 241 'x': ([], ''), 242 'm': (['match', 'mismatch'], 243 "match is the score to given to identical characters. " 244 "mismatch is the score given to non-identical ones."), 245 'd': (['match_dict'], 246 "match_dict is a dictionary where the keys are tuples " 247 "of pairs of characters and the values are the scores, " 248 "e.g. ('A', 'C') : 2.5."), 249 'c': (['match_fn'], 250 "match_fn is a callback function that takes two " 251 "characters and returns the score between them."), 252 } 253 # penalty code -> tuple of (parameters, docstring) 254 penalty2args = { 255 'x': ([], ''), 256 's': (['open', 'extend'], 257 "open and extend are the gap penalties when a gap is " 258 "opened and extended. They should be negative."), 259 'd': (['openA', 'extendA', 'openB', 'extendB'], 260 "openA and extendA are the gap penalties for sequenceA, " 261 "and openB and extendB for sequeneB. The penalties " 262 "should be negative."), 263 'c': (['gap_A_fn', 'gap_B_fn'], 264 "gap_A_fn and gap_B_fn are callback functions that takes " 265 "(1) the index where the gap is opened, and (2) the length " 266 "of the gap. They should return a gap penalty."), 267 } 268
269 - def __init__(self, name):
270 # Check to make sure the name of the function is 271 # reasonable. 272 if name.startswith("global"): 273 if len(name) != 8: 274 raise AttributeError("function should be globalXX") 275 elif name.startswith("local"): 276 if len(name) != 7: 277 raise AttributeError("function should be localXX") 278 else: 279 raise AttributeError(name) 280 align_type, match_type, penalty_type = \ 281 name[:-2], name[-2], name[-1] 282 try: 283 match_args, match_doc = self.match2args[match_type] 284 except KeyError: 285 raise AttributeError("unknown match type %r" % match_type) 286 try: 287 penalty_args, penalty_doc = self.penalty2args[penalty_type] 288 except KeyError: 289 raise AttributeError("unknown penalty type %r" % penalty_type) 290 291 # Now get the names of the parameters to this function. 292 param_names = ['sequenceA', 'sequenceB'] 293 param_names.extend(match_args) 294 param_names.extend(penalty_args) 295 self.function_name = name 296 self.align_type = align_type 297 self.param_names = param_names 298 299 self.__name__ = self.function_name 300 # Set the doc string. 301 doc = "%s(%s) -> alignments\n" % ( 302 self.__name__, ', '.join(self.param_names)) 303 if match_doc: 304 doc += "\n%s\n" % match_doc 305 if penalty_doc: 306 doc += "\n%s\n" % penalty_doc 307 doc += ("""\ 308 \nalignments is a list of tuples (seqA, seqB, score, begin, end). 309 seqA and seqB are strings showing the alignment between the 310 sequences. score is the score of the alignment. begin and end 311 are indexes into seqA and seqB that indicate the where the 312 alignment occurs. 313 """) 314 self.__doc__ = doc
315
316 - def decode(self, *args, **keywds):
317 # Decode the arguments for the _align function. keywds 318 # will get passed to it, so translate the arguments to 319 # this function into forms appropriate for _align. 320 keywds = keywds.copy() 321 if len(args) != len(self.param_names): 322 raise TypeError("%s takes exactly %d argument (%d given)" 323 % (self.function_name, len(self.param_names), 324 len(args))) 325 i = 0 326 while i < len(self.param_names): 327 if self.param_names[i] in [ 328 'sequenceA', 'sequenceB', 329 'gap_A_fn', 'gap_B_fn', 'match_fn']: 330 keywds[self.param_names[i]] = args[i] 331 i += 1 332 elif self.param_names[i] == 'match': 333 assert self.param_names[i + 1] == 'mismatch' 334 match, mismatch = args[i], args[i + 1] 335 keywds['match_fn'] = identity_match(match, mismatch) 336 i += 2 337 elif self.param_names[i] == 'match_dict': 338 keywds['match_fn'] = dictionary_match(args[i]) 339 i += 1 340 elif self.param_names[i] == 'open': 341 assert self.param_names[i + 1] == 'extend' 342 open, extend = args[i], args[i + 1] 343 pe = keywds.get('penalize_extend_when_opening', 0) 344 keywds['gap_A_fn'] = affine_penalty(open, extend, pe) 345 keywds['gap_B_fn'] = affine_penalty(open, extend, pe) 346 i += 2 347 elif self.param_names[i] == 'openA': 348 assert self.param_names[i + 3] == 'extendB' 349 openA, extendA, openB, extendB = args[i:i + 4] 350 pe = keywds.get('penalize_extend_when_opening', 0) 351 keywds['gap_A_fn'] = affine_penalty(openA, extendA, pe) 352 keywds['gap_B_fn'] = affine_penalty(openB, extendB, pe) 353 i += 4 354 else: 355 raise ValueError("unknown parameter %r" 356 % self.param_names[i]) 357 358 # Here are the default parameters for _align. Assign 359 # these to keywds, unless already specified. 360 pe = keywds.get('penalize_extend_when_opening', 0) 361 default_params = [ 362 ('match_fn', identity_match(1, 0)), 363 ('gap_A_fn', affine_penalty(0, 0, pe)), 364 ('gap_B_fn', affine_penalty(0, 0, pe)), 365 ('penalize_extend_when_opening', 0), 366 ('penalize_end_gaps', self.align_type == 'global'), 367 ('align_globally', self.align_type == 'global'), 368 ('gap_char', '-'), 369 ('force_generic', 0), 370 ('score_only', 0), 371 ('one_alignment_only', 0), 372 ] 373 for name, default in default_params: 374 keywds[name] = keywds.get(name, default) 375 value = keywds['penalize_end_gaps'] 376 try: 377 n = len(value) 378 except TypeError: 379 keywds['penalize_end_gaps'] = tuple([value] * 2) 380 else: 381 assert n == 2 382 return keywds
383
384 - def __call__(self, *args, **keywds):
385 keywds = self.decode(*args, **keywds) 386 return _align(**keywds)
387
388 - def __getattr__(self, attr):
389 return self.alignment_function(attr)
390 align = align() 391 392
393 -def _align(sequenceA, sequenceB, match_fn, gap_A_fn, gap_B_fn, 394 penalize_extend_when_opening, penalize_end_gaps, 395 align_globally, gap_char, force_generic, score_only, 396 one_alignment_only):
397 """Return a list of alignments between two sequences or its score""" 398 if not sequenceA or not sequenceB: 399 return [] 400 try: 401 sequenceA + gap_char 402 sequenceB + gap_char 403 except TypeError: 404 raise TypeError('both sequences must be of the same type, either ' + 405 'string/sequence object or list. Gap character must ' + 406 'fit the sequence type (string or list)') 407 408 if not isinstance(sequenceA, list): 409 sequenceA = str(sequenceA) 410 if not isinstance(sequenceB, list): 411 sequenceB = str(sequenceB) 412 413 if (not force_generic) and isinstance(gap_A_fn, affine_penalty) \ 414 and isinstance(gap_B_fn, affine_penalty): 415 open_A, extend_A = gap_A_fn.open, gap_A_fn.extend 416 open_B, extend_B = gap_B_fn.open, gap_B_fn.extend 417 x = _make_score_matrix_fast( 418 sequenceA, sequenceB, match_fn, open_A, extend_A, open_B, 419 extend_B, penalize_extend_when_opening, penalize_end_gaps, 420 align_globally, score_only) 421 else: 422 x = _make_score_matrix_generic( 423 sequenceA, sequenceB, match_fn, gap_A_fn, gap_B_fn, 424 penalize_end_gaps, align_globally, score_only) 425 score_matrix, trace_matrix = x 426 427 # print("SCORE %s" % print_matrix(score_matrix)) 428 # print("TRACEBACK %s" % print_matrix(trace_matrix)) 429 430 # Look for the proper starting point. Get a list of all possible 431 # starting points. 432 starts = _find_start(score_matrix, align_globally) 433 # Find the highest score. 434 best_score = max([x[0] for x in starts]) 435 436 # If they only want the score, then return it. 437 if score_only: 438 return best_score 439 440 tolerance = 0 # XXX do anything with this? 441 # Now find all the positions within some tolerance of the best 442 # score. 443 starts = [(score, pos) for score, pos in starts 444 if rint(abs(score - best_score)) <= rint(tolerance)] 445 446 # Recover the alignments and return them. 447 return _recover_alignments(sequenceA, sequenceB, starts, score_matrix, 448 trace_matrix, align_globally, gap_char, 449 one_alignment_only, gap_A_fn, gap_B_fn)
450 451
452 -def _make_score_matrix_generic(sequenceA, sequenceB, match_fn, gap_A_fn, 453 gap_B_fn, penalize_end_gaps, align_globally, 454 score_only):
455 """Generate a score and traceback matrix according to Needleman-Wunsch 456 457 This implementation allows the usage of general gap functions and is rather 458 slow. It is automatically called if you define your own gap functions. You 459 can force the usage of this method with ``force_generic=True``. 460 """ 461 # Create the score and traceback matrices. These should be in the 462 # shape: 463 # sequenceA (down) x sequenceB (across) 464 lenA, lenB = len(sequenceA), len(sequenceB) 465 score_matrix, trace_matrix = [], [] 466 for i in range(lenA + 1): 467 score_matrix.append([None] * (lenB + 1)) 468 if not score_only: 469 trace_matrix.append([None] * (lenB + 1)) 470 471 # Initialize first row and column with gap scores. This is like opening up 472 # i gaps at the beginning of sequence A or B. 473 for i in range(lenA + 1): 474 if penalize_end_gaps[1]: # [1]:gap in sequence B 475 score = gap_B_fn(0, i) 476 else: 477 score = 0 478 score_matrix[i][0] = score 479 for i in range(lenB + 1): 480 if penalize_end_gaps[0]: # [0]:gap in sequence A 481 score = gap_A_fn(0, i) 482 else: 483 score = 0 484 score_matrix[0][i] = score 485 486 # Fill in the score matrix. Each position in the matrix 487 # represents an alignment between a character from sequence A to 488 # one in sequence B. As I iterate through the matrix, find the 489 # alignment by choose the best of: 490 # 1) extending a previous alignment without gaps 491 # 2) adding a gap in sequenceA 492 # 3) adding a gap in sequenceB 493 for row in range(1, lenA + 1): 494 for col in range(1, lenB + 1): 495 # First, calculate the score that would occur by extending 496 # the alignment without gaps. 497 nogap_score = score_matrix[row - 1][col - 1] + \ 498 match_fn(sequenceA[row - 1], sequenceB[col - 1]) 499 500 # Try to find a better score by opening gaps in sequenceA. 501 # Do this by checking alignments from each column in the row. 502 # Each column represents a different character to align from, 503 # and thus a different length gap. 504 # Although the gap function does not distinguish between opening 505 # and extending a gap, we distinguish them for the backtrace. 506 if not penalize_end_gaps[0] and row == lenA: 507 row_open = score_matrix[row][col - 1] 508 row_extend = max([score_matrix[row][x] for x in range(col)]) 509 else: 510 row_open = score_matrix[row][col - 1] + gap_A_fn(row, 1) 511 row_extend = max([score_matrix[row][x] + gap_A_fn(row, col - x) 512 for x in range(col)]) 513 514 # Try to find a better score by opening gaps in sequenceB. 515 if not penalize_end_gaps[1] and col == lenB: 516 col_open = score_matrix[row - 1][col] 517 col_extend = max([score_matrix[x][col] for x in range(row)]) 518 else: 519 col_open = score_matrix[row - 1][col] + gap_B_fn(col, 1) 520 col_extend = max([score_matrix[x][col] + gap_B_fn(col, row - x) 521 for x in range(row)]) 522 523 best_score = max(nogap_score, row_open, row_extend, col_open, 524 col_extend) 525 if not align_globally and best_score < 0: 526 score_matrix[row][col] = 0 527 else: 528 score_matrix[row][col] = best_score 529 530 # The backtrace is encoded binary. See _make_score_matrix_fast 531 # for details. 532 if not score_only: 533 trace_score = 0 534 if rint(nogap_score) == rint(best_score): 535 trace_score += 2 536 if rint(row_open) == rint(best_score): 537 trace_score += 1 538 if rint(row_extend) == rint(best_score): 539 trace_score += 8 540 if rint(col_open) == rint(best_score): 541 trace_score += 4 542 if rint(col_extend) == rint(best_score): 543 trace_score += 16 544 trace_matrix[row][col] = trace_score 545 546 return score_matrix, trace_matrix
547 548
549 -def _make_score_matrix_fast(sequenceA, sequenceB, match_fn, open_A, extend_A, 550 open_B, extend_B, penalize_extend_when_opening, 551 penalize_end_gaps, align_globally, score_only):
552 """Generate a score and traceback matrix according to Gotoh""" 553 # This is an implementation of the Needleman-Wunsch dynamic programming 554 # algorithm as modified by Gotoh, implementing affine gap penalties. 555 # In short, we have three matrices, holding scores for alignments ending 556 # in (1) a match/mismatch, (2) a gap in sequence A, and (3) a gap in 557 # sequence B, respectively. However, we can combine them in one matrix, 558 # which holds the best scores, and store only those values from the 559 # other matrices that are actually used for the next step of calculation. 560 # The traceback matrix holds the positions for backtracing the alignment. 561 562 first_A_gap = calc_affine_penalty(1, open_A, extend_A, 563 penalize_extend_when_opening) 564 first_B_gap = calc_affine_penalty(1, open_B, extend_B, 565 penalize_extend_when_opening) 566 567 # Create the score and traceback matrices. These should be in the 568 # shape: 569 # sequenceA (down) x sequenceB (across) 570 lenA, lenB = len(sequenceA), len(sequenceB) 571 score_matrix, trace_matrix = [], [] 572 for i in range(lenA + 1): 573 score_matrix.append([None] * (lenB + 1)) 574 if not score_only: 575 trace_matrix.append([None] * (lenB + 1)) 576 577 # Initialize first row and column with gap scores. This is like opening up 578 # i gaps at the beginning of sequence A or B. 579 for i in range(lenA + 1): 580 if penalize_end_gaps[1]: # [1]:gap in sequence B 581 score = calc_affine_penalty(i, open_B, extend_B, 582 penalize_extend_when_opening) 583 else: 584 score = 0 585 score_matrix[i][0] = score 586 for i in range(lenB + 1): 587 if penalize_end_gaps[0]: # [0]:gap in sequence A 588 score = calc_affine_penalty(i, open_A, extend_A, 589 penalize_extend_when_opening) 590 else: 591 score = 0 592 score_matrix[0][i] = score 593 594 # Now initialize the col 'matrix'. Actually this is only a one dimensional 595 # list, since we only need the col scores from the last row. 596 col_score = [0] # Best score, if actual alignment ends with gap in seqB 597 for i in range(1, lenB + 1): 598 col_score.append(calc_affine_penalty(i, 2 * open_B, extend_B, 599 penalize_extend_when_opening)) 600 601 # The row 'matrix' is calculated on the fly. Here we only need the actual 602 # score. 603 # Now, filling up the score and traceback matrices: 604 for row in range(1, lenA + 1): 605 row_score = calc_affine_penalty(row, 2 * open_A, extend_A, 606 penalize_extend_when_opening) 607 for col in range(1, lenB + 1): 608 # Calculate the score that would occur by extending the 609 # alignment without gaps. 610 nogap_score = score_matrix[row - 1][col - 1] + \ 611 match_fn(sequenceA[row - 1], sequenceB[col - 1]) 612 613 # Check the score that would occur if there were a gap in 614 # sequence A. This could come from opening a new gap or 615 # extending an existing one. 616 # A gap in sequence A can also be opened if it follows a gap in 617 # sequence B: A- 618 # -B 619 if not penalize_end_gaps[0] and row == lenA: 620 row_open = score_matrix[row][col - 1] 621 row_extend = row_score 622 else: 623 row_open = score_matrix[row][col - 1] + first_A_gap 624 row_extend = row_score + extend_A 625 row_score = max(row_open, row_extend) 626 627 # The same for sequence B: 628 if not penalize_end_gaps[1] and col == lenB: 629 col_open = score_matrix[row - 1][col] 630 col_extend = col_score[col] 631 else: 632 col_open = score_matrix[row - 1][col] + first_B_gap 633 col_extend = col_score[col] + extend_B 634 col_score[col] = max(col_open, col_extend) 635 636 best_score = max(nogap_score, col_score[col], row_score) 637 if not align_globally and best_score < 0: 638 score_matrix[row][col] = 0 639 else: 640 score_matrix[row][col] = best_score 641 642 # Now the trace_matrix. The edges of the backtrace are encoded 643 # binary: 1 = open gap in seqA, 2 = match/mismatch of seqA and 644 # seqB, 4 = open gap in seqB, 8 = extend gap in seqA, and 645 # 16 = extend gap in seqA. This values can be summed up. 646 # Thus, the trace score 7 means that the best score can either 647 # come from opening a gap in seqA (=1), pairing two characters 648 # of seqA and seqB (+2=3) or opening a gap in seqB (+4=7). 649 # However, if we only want the score we don't care about the trace. 650 if not score_only: 651 row_score_rint = rint(row_score) 652 col_score_rint = rint(col_score[col]) 653 row_trace_score = 0 654 col_trace_score = 0 655 if rint(row_open) == row_score_rint: 656 row_trace_score += 1 # Open gap in seqA 657 if rint(row_extend) == row_score_rint: 658 row_trace_score += 8 # Extend gap in seqA 659 if rint(col_open) == col_score_rint: 660 col_trace_score += 4 # Open gap in seqB 661 if rint(col_extend) == col_score_rint: 662 col_trace_score += 16 # Extend gap in seqB 663 664 trace_score = 0 665 best_score_rint = rint(best_score) 666 if rint(nogap_score) == best_score_rint: 667 trace_score += 2 # Align seqA with seqB 668 if row_score_rint == best_score_rint: 669 trace_score += row_trace_score 670 if col_score_rint == best_score_rint: 671 trace_score += col_trace_score 672 trace_matrix[row][col] = trace_score 673 674 return score_matrix, trace_matrix
675 676
677 -def _recover_alignments(sequenceA, sequenceB, starts, score_matrix, 678 trace_matrix, align_globally, gap_char, 679 one_alignment_only, gap_A_fn, gap_B_fn):
680 """Do the backtracing and return a list of alignments""" 681 # Recover the alignments by following the traceback matrix. This 682 # is a recursive procedure, but it's implemented here iteratively 683 # with a stack. 684 lenA, lenB = len(sequenceA), len(sequenceB) 685 ali_seqA, ali_seqB = sequenceA[0:0], sequenceB[0:0] 686 tracebacks = [] 687 in_process = [] 688 689 for start in starts: 690 score, (row, col) = start 691 begin = 0 692 if align_globally: 693 end = None 694 else: 695 # Local alignments should start with a positive score! 696 if score <= 0: 697 continue 698 # Local alignments should not end with a gap!: 699 trace = trace_matrix[row][col] 700 if (trace - trace % 2) % 4 == 2: # Trace contains 'nogap', fine! 701 trace_matrix[row][col] = 2 702 # If not, don't start here! 703 else: 704 continue 705 end = -max(lenA - row, lenB - col) 706 if not end: 707 end = None 708 col_distance = lenB - col 709 row_distance = lenA - row 710 ali_seqA = ((col_distance - row_distance) * gap_char + 711 sequenceA[lenA - 1:row - 1:-1]) 712 ali_seqB = ((row_distance - col_distance) * gap_char + 713 sequenceB[lenB - 1:col - 1:-1]) 714 in_process += [(ali_seqA, ali_seqB, end, row, col, False, 715 trace_matrix[row][col])] 716 while in_process and len(tracebacks) < MAX_ALIGNMENTS: 717 # Although we allow a gap in seqB to be followed by a gap in seqA, 718 # we don't want to allow it the other way round, since this would 719 # give redundant alignments of type: A- vs. -A 720 # -B B- 721 # Thus we need to keep track if a gap in seqA was opened (col_gap) 722 # and stop the backtrace (dead_end) if a gap in seqB follows. 723 dead_end = False 724 ali_seqA, ali_seqB, end, row, col, col_gap, trace = in_process.pop() 725 726 while (row > 0 or col > 0) and not dead_end: 727 cache = (ali_seqA[:], ali_seqB[:], end, row, col, col_gap) 728 729 # If trace is empty we have reached at least one border of the 730 # matrix or the end of a local aligment. Just add the rest of 731 # the sequence(s) and fill with gaps if neccessary. 732 if not trace: 733 if col and col_gap: 734 dead_end = True 735 else: 736 ali_seqA, ali_seqB = _finish_backtrace( 737 sequenceA, sequenceB, ali_seqA, ali_seqB, 738 row, col, gap_char) 739 break 740 elif trace % 2 == 1: # = row open = open gap in seqA 741 trace -= 1 742 if col_gap: 743 dead_end = True 744 else: 745 col -= 1 746 ali_seqA += gap_char 747 ali_seqB += sequenceB[col] 748 col_gap = False 749 elif trace % 4 == 2: # = match/mismatch of seqA with seqB 750 trace -= 2 751 row -= 1 752 col -= 1 753 ali_seqA += sequenceA[row] 754 ali_seqB += sequenceB[col] 755 col_gap = False 756 elif trace % 8 == 4: # = col open = open gap in seqB 757 trace -= 4 758 row -= 1 759 ali_seqA += sequenceA[row] 760 ali_seqB += gap_char 761 col_gap = True 762 elif trace in (8, 24): # = row extend = extend gap in seqA 763 trace -= 8 764 if col_gap: 765 dead_end = True 766 else: 767 col_gap = False 768 # We need to find the starting point of the extended gap 769 x = _find_gap_open(sequenceA, sequenceB, ali_seqA, 770 ali_seqB, end, row, col, col_gap, 771 gap_char, score_matrix, trace_matrix, 772 in_process, gap_A_fn, col, row, 'col') 773 ali_seqA, ali_seqB, row, col, in_process, dead_end = x 774 elif trace == 16: # = col extend = extend gap in seqB 775 trace -= 16 776 col_gap = True 777 x = _find_gap_open(sequenceA, sequenceB, ali_seqA, ali_seqB, 778 end, row, col, col_gap, gap_char, 779 score_matrix, trace_matrix, in_process, 780 gap_B_fn, row, col, 'row') 781 ali_seqA, ali_seqB, row, col, in_process, dead_end = x 782 783 if trace: # There is another path to follow... 784 cache += (trace,) 785 in_process.append(cache) 786 trace = trace_matrix[row][col] 787 if not align_globally and score_matrix[row][col] <= 0: 788 begin = max(row, col) 789 trace = 0 790 if not dead_end: 791 tracebacks.append((ali_seqA[::-1], ali_seqB[::-1], score, begin, 792 end)) 793 if one_alignment_only: 794 break 795 return _clean_alignments(tracebacks)
796 797
798 -def _find_start(score_matrix, align_globally):
799 """Return a list of starting points (score, (row, col)). 800 801 Indicating every possible place to start the tracebacks. 802 """ 803 nrows, ncols = len(score_matrix), len(score_matrix[0]) 804 # In this implementation of the global algorithm, the start will always be 805 # the bottom right corner of the matrix. 806 if align_globally: 807 starts = [(score_matrix[-1][-1], (nrows - 1, ncols - 1))] 808 else: 809 starts = [] 810 for row in range(nrows): 811 for col in range(ncols): 812 score = score_matrix[row][col] 813 starts.append((score, (row, col))) 814 return starts
815 816
817 -def _clean_alignments(alignments):
818 """Take a list of alignments and return a cleaned version.""" 819 # Remove duplicates, make sure begin and end are set correctly, remove 820 # empty alignments. 821 unique_alignments = [] 822 for align in alignments: 823 if align not in unique_alignments: 824 unique_alignments.append(align) 825 i = 0 826 while i < len(unique_alignments): 827 seqA, seqB, score, begin, end = unique_alignments[i] 828 # Make sure end is set reasonably. 829 if end is None: # global alignment 830 end = len(seqA) 831 elif end < 0: 832 end = end + len(seqA) 833 # If there's no alignment here, get rid of it. 834 if begin >= end: 835 del unique_alignments[i] 836 continue 837 unique_alignments[i] = seqA, seqB, score, begin, end 838 i += 1 839 return unique_alignments
840 841
842 -def _finish_backtrace(sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, 843 gap_char):
844 """Add remaining sequences and fill with gaps if neccessary""" 845 if row: 846 ali_seqA += sequenceA[row - 1::-1] 847 if col: 848 ali_seqB += sequenceB[col - 1::-1] 849 if row > col: 850 ali_seqB += gap_char * (len(ali_seqA) - len(ali_seqB)) 851 elif col > row: 852 ali_seqA += gap_char * (len(ali_seqB) - len(ali_seqA)) 853 return ali_seqA, ali_seqB
854 855
856 -def _find_gap_open(sequenceA, sequenceB, ali_seqA, ali_seqB, end, row, col, 857 col_gap, gap_char, score_matrix, trace_matrix, in_process, 858 gap_fn, target, index, direction):
859 """Find the starting point(s) of the extended gap""" 860 dead_end = False 861 target_score = score_matrix[row][col] 862 for n in range(target): 863 if direction == 'col': 864 col -= 1 865 ali_seqA += gap_char 866 ali_seqB += sequenceB[col] 867 else: 868 row -= 1 869 ali_seqA += sequenceA[row] 870 ali_seqB += gap_char 871 actual_score = score_matrix[row][col] + gap_fn(index, n + 1) 872 if rint(actual_score) == rint(target_score) and n > 0: 873 if not trace_matrix[row][col]: 874 break 875 else: 876 in_process.append((ali_seqA[:], ali_seqB[:], end, row, col, 877 col_gap, trace_matrix[row][col])) 878 if not trace_matrix[row][col]: 879 dead_end = True 880 return ali_seqA, ali_seqB, row, col, in_process, dead_end
881 882 883 _PRECISION = 1000 884 885
886 -def rint(x, precision=_PRECISION):
887 return int(x * precision + 0.5)
888 889
890 -class identity_match(object):
891 """identity_match([match][, mismatch]) -> match_fn 892 893 Create a match function for use in an alignment. match and 894 mismatch are the scores to give when two residues are equal or 895 unequal. By default, match is 1 and mismatch is 0. 896 """
897 - def __init__(self, match=1, mismatch=0):
898 self.match = match 899 self.mismatch = mismatch
900
901 - def __call__(self, charA, charB):
902 if charA == charB: 903 return self.match 904 return self.mismatch
905 906
907 -class dictionary_match(object):
908 """dictionary_match(score_dict[, symmetric]) -> match_fn 909 910 Create a match function for use in an alignment. score_dict is a 911 dictionary where the keys are tuples (residue 1, residue 2) and 912 the values are the match scores between those residues. symmetric 913 is a flag that indicates whether the scores are symmetric. If 914 true, then if (res 1, res 2) doesn't exist, I will use the score 915 at (res 2, res 1). 916 """
917 - def __init__(self, score_dict, symmetric=1):
918 self.score_dict = score_dict 919 self.symmetric = symmetric
920
921 - def __call__(self, charA, charB):
922 if self.symmetric and (charA, charB) not in self.score_dict: 923 # If the score dictionary is symmetric, then look up the 924 # score both ways. 925 charB, charA = charA, charB 926 return self.score_dict[(charA, charB)]
927 928
929 -class affine_penalty(object):
930 """affine_penalty(open, extend[, penalize_extend_when_opening]) -> gap_fn 931 932 Create a gap function for use in an alignment. 933 """
934 - def __init__(self, open, extend, penalize_extend_when_opening=0):
935 if open > 0 or extend > 0: 936 raise ValueError("Gap penalties should be non-positive.") 937 if not penalize_extend_when_opening and (extend < open): 938 raise ValueError("Gap opening penalty should be higher than " + 939 "gap extension penalty (or equal)") 940 self.open, self.extend = open, extend 941 self.penalize_extend_when_opening = penalize_extend_when_opening
942
943 - def __call__(self, index, length):
944 return calc_affine_penalty( 945 length, self.open, self.extend, self.penalize_extend_when_opening)
946 947
948 -def calc_affine_penalty(length, open, extend, penalize_extend_when_opening):
949 if length <= 0: 950 return 0 951 penalty = open + extend * length 952 if not penalize_extend_when_opening: 953 penalty -= extend 954 return penalty
955 956 972 973
974 -def format_alignment(align1, align2, score, begin, end):
975 """format_alignment(align1, align2, score, begin, end) -> string 976 977 Format the alignment prettily into a string. 978 """ 979 s = [] 980 s.append("%s\n" % align1) 981 s.append("%s%s\n" % (" " * begin, "|" * (end - begin))) 982 s.append("%s\n" % align2) 983 s.append(" Score=%g\n" % score) 984 return ''.join(s)
985 986 987 # Try and load C implementations of functions. If I can't, 988 # then throw a warning and use the pure Python implementations. 989 try: 990 from .cpairwise2 import rint, _make_score_matrix_fast 991 except ImportError: 992 warnings.warn('Import of C module failed. Falling back to pure Python ' + 993 'implementation. This may be slooow...', BiopythonWarning) 994