Package Bio :: Module pairwise2
[hide private]
[frames] | no frames]

Source Code for Module Bio.pairwise2

   1  # Copyright 2002 by Jeffrey Chang. 
   2  # Copyright 2016 by Markus Piotrowski. 
   3  # All rights reserved. 
   4  # 
   5  # This file is part of the Biopython distribution and governed by your 
   6  # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". 
   7  # Please see the LICENSE file that should have been included as part of this 
   8  # package. 
   9  """Pairwise sequence alignment using a dynamic programming algorithm. 
  10   
  11  This provides functions to get global and local alignments between two 
  12  sequences. A global alignment finds the best concordance between all 
  13  characters in two sequences. A local alignment finds just the 
  14  subsequences that align the best. 
  15   
  16  When doing alignments, you can specify the match score and gap 
  17  penalties.  The match score indicates the compatibility between an 
  18  alignment of two characters in the sequences. Highly compatible 
  19  characters should be given positive scores, and incompatible ones 
  20  should be given negative scores or 0.  The gap penalties should be 
  21  negative. 
  22   
  23  The names of the alignment functions in this module follow the 
  24  convention 
  25  <alignment type>XX 
  26  where <alignment type> is either "global" or "local" and XX is a 2 
  27  character code indicating the parameters it takes.  The first 
  28  character indicates the parameters for matches (and mismatches), and 
  29  the second indicates the parameters for gap penalties. 
  30   
  31  The match parameters are:: 
  32   
  33      CODE  DESCRIPTION 
  34      x     No parameters. Identical characters have score of 1, otherwise 0. 
  35      m     A match score is the score of identical chars, otherwise mismatch 
  36            score. 
  37      d     A dictionary returns the score of any pair of characters. 
  38      c     A callback function returns scores. 
  39   
  40  The gap penalty parameters are:: 
  41   
  42      CODE  DESCRIPTION 
  43      x     No gap penalties. 
  44      s     Same open and extend gap penalties for both sequences. 
  45      d     The sequences have different open and extend gap penalties. 
  46      c     A callback function returns the gap penalties. 
  47   
  48  All the different alignment functions are contained in an object 
  49  ``align``. For example: 
  50   
  51      >>> from Bio import pairwise2 
  52      >>> alignments = pairwise2.align.globalxx("ACCGT", "ACG") 
  53   
  54  will return a list of the alignments between the two strings. For a nice 
  55  printout, use the ``format_alignment`` method of the module: 
  56   
  57      >>> from Bio.pairwise2 import format_alignment 
  58      >>> print(format_alignment(*alignments[0])) 
  59      ACCGT 
  60      | ||  
  61      A-CG- 
  62        Score=3 
  63      <BLANKLINE> 
  64   
  65  All alignment functions have the following arguments: 
  66   
  67  - Two sequences: strings, Biopython sequence objects or lists. 
  68    Lists are useful for supplying sequences which contain residues that are 
  69    encoded by more than one letter. 
  70   
  71  - ``penalize_extend_when_opening``: boolean (default: False). 
  72    Whether to count an extension penalty when opening a gap. If false, a gap of 
  73    1 is only penalized an "open" penalty, otherwise it is penalized 
  74    "open+extend". 
  75   
  76  - ``penalize_end_gaps``: boolean. 
  77    Whether to count the gaps at the ends of an alignment. By default, they are 
  78    counted for global alignments but not for local ones. Setting 
  79    ``penalize_end_gaps`` to (boolean, boolean) allows you to specify for the 
  80    two sequences separately whether gaps at the end of the alignment should be 
  81    counted. 
  82   
  83  - ``gap_char``: string (default: ``'-'``). 
  84    Which character to use as a gap character in the alignment returned. If your 
  85    input sequences are lists, you must change this to ``['-']``. 
  86   
  87  - ``force_generic``: boolean (default: False). 
  88    Always use the generic, non-cached, dynamic programming function (slow!). 
  89    For debugging. 
  90   
  91  - ``score_only``: boolean (default: False). 
  92    Only get the best score, don't recover any alignments. The return value of 
  93    the function is the score. Faster and uses less memory. 
  94   
  95  - ``one_alignment_only``: boolean (default: False). 
  96    Only recover one alignment. 
  97   
  98  The other parameters of the alignment function depend on the function called. 
  99  Some examples: 
 100   
 101  - Find the best global alignment between the two sequences. Identical 
 102    characters are given 1 point. No points are deducted for mismatches or gaps. 
 103   
 104      >>> for a in pairwise2.align.globalxx("ACCGT", "ACG"): 
 105      ...     print(format_alignment(*a)) 
 106      ACCGT 
 107      | ||  
 108      A-CG- 
 109        Score=3 
 110      <BLANKLINE> 
 111      ACCGT 
 112      || |  
 113      AC-G- 
 114        Score=3 
 115      <BLANKLINE> 
 116   
 117  - Same thing as before, but with a local alignment. 
 118   
 119      >>> for a in pairwise2.align.localxx("ACCGT", "ACG"): 
 120      ...     print(format_alignment(*a)) 
 121      ACCGT 
 122      | || 
 123      A-CG- 
 124        Score=3 
 125      <BLANKLINE> 
 126      ACCGT 
 127      || | 
 128      AC-G- 
 129        Score=3 
 130      <BLANKLINE> 
 131   
 132  - Do a global alignment. Identical characters are given 2 points, 1 point is 
 133    deducted for each non-identical character. Don't penalize gaps. 
 134   
 135      >>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1): 
 136      ...     print(format_alignment(*a)) 
 137      ACCGT 
 138      | ||  
 139      A-CG- 
 140        Score=6 
 141      <BLANKLINE> 
 142      ACCGT 
 143      || |  
 144      AC-G- 
 145        Score=6 
 146      <BLANKLINE> 
 147   
 148  - Same as above, except now 0.5 points are deducted when opening a gap, and 
 149    0.1 points are deducted when extending it. 
 150   
 151      >>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1): 
 152      ...     print(format_alignment(*a)) 
 153      ACCGT 
 154      | ||  
 155      A-CG- 
 156        Score=5 
 157      <BLANKLINE> 
 158      ACCGT 
 159      || |  
 160      AC-G- 
 161        Score=5 
 162      <BLANKLINE> 
 163   
 164  - Depending on the penalties, a gap in one sequence may be followed by a gap in 
 165    the other sequence.If you don't like this behaviour, increase the gap-open 
 166    penalty: 
 167   
 168      >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1): 
 169      ...     print(format_alignment(*a)) 
 170      A- 
 171      <BLANKLINE> 
 172      -T 
 173        Score=-2 
 174      <BLANKLINE> 
 175      >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1): 
 176      ...     print(format_alignment(*a)) 
 177      A 
 178      . 
 179      T 
 180        Score=-4 
 181      <BLANKLINE> 
 182   
 183  - The alignment function can also use known matrices already included in 
 184    Biopython (``MatrixInfo`` from ``Bio.SubsMat``): 
 185   
 186      >>> from Bio.SubsMat import MatrixInfo as matlist 
 187      >>> matrix = matlist.blosum62 
 188      >>> for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix): 
 189      ...     print(format_alignment(*a)) 
 190      KEVLA 
 191       |||  
 192      -EVL- 
 193        Score=13 
 194      <BLANKLINE> 
 195   
 196  - With the parameter ``c`` you can define your own match- and gap functions. 
 197    E.g. to define an affine logarithmic gap function and using it: 
 198   
 199      >>> from math import log 
 200      >>> def gap_function(x, y):  # x is gap position in seq, y is gap length 
 201      ...     if y == 0:  # No gap 
 202      ...         return 0 
 203      ...     elif y == 1:  # Gap open penalty 
 204      ...         return -2 
 205      ...     return - (2 + y/4.0 + log(y)/2.0) 
 206      ... 
 207      >>> alignment = pairwise2.align.globalmc("ACCCCCGT", "ACG", 5, -4, 
 208      ...                                      gap_function, gap_function) 
 209   
 210    You can define different gap functions for each sequence. 
 211    Self-defined match functions must take the two residues to be compared and 
 212    return a score. 
 213   
 214  To see a description of the parameters for a function, please look at 
 215  the docstring for the function via the help function, e.g. 
 216  type ``help(pairwise2.align.localds``) at the Python prompt. 
 217   
 218  """  # noqa: W291 
 219  from __future__ import print_function 
 220   
 221  import warnings 
 222   
 223  from Bio import BiopythonWarning 
 224   
 225   
 226  MAX_ALIGNMENTS = 1000   # maximum alignments recovered in traceback 
 227   
 228   
229 -class align(object):
230 """Provide functions that do alignments. 231 232 Alignment functions are called as: 233 234 pairwise2.align.globalXX 235 236 or 237 238 pairwise2.align.localXX 239 240 Where XX is a 2 character code indicating the match/mismatch parameters 241 (first character, either x, m, d or c) and the gap penalty parameters 242 (second character, either x, s, d, or c). 243 244 For a detailed description read the main module's docstring (e.g., 245 type ``help(pairwise2)``). 246 To see a description of the parameters for a function, please 247 look at the docstring for the function, e.g. type 248 ``help(pairwise2.align.localds``) at the Python prompt. 249 """ 250
251 - class alignment_function(object):
252 """Callable class which impersonates an alignment function. 253 254 The constructor takes the name of the function. This class 255 will decode the name of the function to figure out how to 256 interpret the parameters. 257 """ 258 259 # match code -> tuple of (parameters, docstring) 260 match2args = { 261 'x': ([], ''), 262 'm': (['match', 'mismatch'], 263 "match is the score to given to identical characters.\n" 264 "mismatch is the score given to non-identical ones."), 265 'd': (['match_dict'], 266 "match_dict is a dictionary where the keys are tuples\n" 267 "of pairs of characters and the values are the scores,\n" 268 "e.g. ('A', 'C') : 2.5."), 269 'c': (['match_fn'], 270 "match_fn is a callback function that takes two " 271 "characters and returns the score between them."), 272 } 273 # penalty code -> tuple of (parameters, docstring) 274 penalty2args = { 275 'x': ([], ''), 276 's': (['open', 'extend'], 277 "open and extend are the gap penalties when a gap is\n" 278 "opened and extended. They should be negative."), 279 'd': (['openA', 'extendA', 'openB', 'extendB'], 280 "openA and extendA are the gap penalties for sequenceA,\n" 281 "and openB and extendB for sequenceB. The penalties\n" 282 "should be negative."), 283 'c': (['gap_A_fn', 'gap_B_fn'], 284 "gap_A_fn and gap_B_fn are callback functions that takes\n" 285 "(1) the index where the gap is opened, and (2) the length\n" 286 "of the gap. They should return a gap penalty."), 287 } 288
289 - def __init__(self, name):
290 """Check to make sure the name of the function is reasonable.""" 291 if name.startswith("global"): 292 if len(name) != 8: 293 raise AttributeError("function should be globalXX") 294 elif name.startswith("local"): 295 if len(name) != 7: 296 raise AttributeError("function should be localXX") 297 else: 298 raise AttributeError(name) 299 align_type, match_type, penalty_type = \ 300 name[:-2], name[-2], name[-1] 301 try: 302 match_args, match_doc = self.match2args[match_type] 303 except KeyError: 304 raise AttributeError("unknown match type %r" % match_type) 305 try: 306 penalty_args, penalty_doc = self.penalty2args[penalty_type] 307 except KeyError: 308 raise AttributeError("unknown penalty type %r" % penalty_type) 309 310 # Now get the names of the parameters to this function. 311 param_names = ['sequenceA', 'sequenceB'] 312 param_names.extend(match_args) 313 param_names.extend(penalty_args) 314 self.function_name = name 315 self.align_type = align_type 316 self.param_names = param_names 317 318 self.__name__ = self.function_name 319 # Set the doc string. 320 doc = "%s(%s) -> alignments\n" % ( 321 self.__name__, ', '.join(self.param_names)) 322 if match_doc: 323 doc += "\n%s\n" % match_doc 324 if penalty_doc: 325 doc += "\n%s\n" % penalty_doc 326 doc += ("""\ 327 \nalignments is a list of tuples (seqA, seqB, score, begin, end). 328 seqA and seqB are strings showing the alignment between the 329 sequences. score is the score of the alignment. begin and end 330 are indexes into seqA and seqB that indicate the where the 331 alignment occurs. 332 """) 333 self.__doc__ = doc
334
335 - def decode(self, *args, **keywds):
336 """Decode the arguments for the _align function. 337 338 keywds will get passed to it, so translate the arguments 339 to this function into forms appropriate for _align. 340 """ 341 keywds = keywds.copy() 342 if len(args) != len(self.param_names): 343 raise TypeError("%s takes exactly %d argument (%d given)" 344 % (self.function_name, len(self.param_names), 345 len(args))) 346 i = 0 347 while i < len(self.param_names): 348 if self.param_names[i] in [ 349 'sequenceA', 'sequenceB', 350 'gap_A_fn', 'gap_B_fn', 'match_fn']: 351 keywds[self.param_names[i]] = args[i] 352 i += 1 353 elif self.param_names[i] == 'match': 354 assert self.param_names[i + 1] == 'mismatch' 355 match, mismatch = args[i], args[i + 1] 356 keywds['match_fn'] = identity_match(match, mismatch) 357 i += 2 358 elif self.param_names[i] == 'match_dict': 359 keywds['match_fn'] = dictionary_match(args[i]) 360 i += 1 361 elif self.param_names[i] == 'open': 362 assert self.param_names[i + 1] == 'extend' 363 open, extend = args[i], args[i + 1] 364 pe = keywds.get('penalize_extend_when_opening', 0) 365 keywds['gap_A_fn'] = affine_penalty(open, extend, pe) 366 keywds['gap_B_fn'] = affine_penalty(open, extend, pe) 367 i += 2 368 elif self.param_names[i] == 'openA': 369 assert self.param_names[i + 3] == 'extendB' 370 openA, extendA, openB, extendB = args[i:i + 4] 371 pe = keywds.get('penalize_extend_when_opening', 0) 372 keywds['gap_A_fn'] = affine_penalty(openA, extendA, pe) 373 keywds['gap_B_fn'] = affine_penalty(openB, extendB, pe) 374 i += 4 375 else: 376 raise ValueError("unknown parameter %r" 377 % self.param_names[i]) 378 379 # Here are the default parameters for _align. Assign 380 # these to keywds, unless already specified. 381 pe = keywds.get('penalize_extend_when_opening', 0) 382 default_params = [ 383 ('match_fn', identity_match(1, 0)), 384 ('gap_A_fn', affine_penalty(0, 0, pe)), 385 ('gap_B_fn', affine_penalty(0, 0, pe)), 386 ('penalize_extend_when_opening', 0), 387 ('penalize_end_gaps', self.align_type == 'global'), 388 ('align_globally', self.align_type == 'global'), 389 ('gap_char', '-'), 390 ('force_generic', 0), 391 ('score_only', 0), 392 ('one_alignment_only', 0), 393 ] 394 for name, default in default_params: 395 keywds[name] = keywds.get(name, default) 396 value = keywds['penalize_end_gaps'] 397 try: 398 n = len(value) 399 except TypeError: 400 keywds['penalize_end_gaps'] = tuple([value] * 2) 401 else: 402 assert n == 2 403 return keywds
404
405 - def __call__(self, *args, **keywds):
406 """Call the alignment instance already created.""" 407 keywds = self.decode(*args, **keywds) 408 return _align(**keywds)
409
410 - def __getattr__(self, attr):
411 """Call alignment_function() to check and decode the attributes.""" 412 # The following 'magic' is needed to rewrite the class docstring 413 # dynamically: 414 wrapper = self.alignment_function(attr) 415 wrapper_type = type(wrapper) 416 wrapper_dict = wrapper_type.__dict__.copy() 417 wrapper_dict['__doc__'] = wrapper.__doc__ 418 new_alignment_function = type('alignment_function', (object,), 419 wrapper_dict) 420 421 return new_alignment_function(attr)
422 423 424 align = align() 425 426
427 -def _align(sequenceA, sequenceB, match_fn, gap_A_fn, gap_B_fn, 428 penalize_extend_when_opening, penalize_end_gaps, 429 align_globally, gap_char, force_generic, score_only, 430 one_alignment_only):
431 """Return optimal alignments between two sequences (PRIVATE). 432 433 This method either returns a list of optimal alignments (with the same 434 score) or just the optimal score. 435 """ 436 if not sequenceA or not sequenceB: 437 return [] 438 try: 439 sequenceA + gap_char 440 sequenceB + gap_char 441 except TypeError: 442 raise TypeError('both sequences must be of the same type, either ' 443 'string/sequence object or list. Gap character must ' 444 'fit the sequence type (string or list)') 445 446 if not isinstance(sequenceA, list): 447 sequenceA = str(sequenceA) 448 if not isinstance(sequenceB, list): 449 sequenceB = str(sequenceB) 450 if not align_globally and (penalize_end_gaps[0] or penalize_end_gaps[1]): 451 warnings.warn('"penalize_end_gaps" should not be used in local ' 452 'alignments. The resulting score may be wrong.', 453 BiopythonWarning) 454 455 if (not force_generic) and isinstance(gap_A_fn, affine_penalty) \ 456 and isinstance(gap_B_fn, affine_penalty): 457 open_A, extend_A = gap_A_fn.open, gap_A_fn.extend 458 open_B, extend_B = gap_B_fn.open, gap_B_fn.extend 459 matrices = _make_score_matrix_fast( 460 sequenceA, sequenceB, match_fn, open_A, extend_A, open_B, 461 extend_B, penalize_extend_when_opening, penalize_end_gaps, 462 align_globally, score_only) 463 else: 464 matrices = _make_score_matrix_generic( 465 sequenceA, sequenceB, match_fn, gap_A_fn, gap_B_fn, 466 penalize_end_gaps, align_globally, score_only) 467 score_matrix, trace_matrix = matrices 468 469 # print("SCORE %s" % print_matrix(score_matrix)) 470 # print("TRACEBACK %s" % print_matrix(trace_matrix)) 471 472 # Look for the proper starting point. Get a list of all possible 473 # starting points. 474 starts = _find_start(score_matrix, align_globally) 475 # Find the highest score. 476 best_score = max([_[0] for _ in starts]) 477 478 # If they only want the score, then return it. 479 if score_only: 480 return best_score 481 482 tolerance = 0 # XXX do anything with this? 483 # Now find all the positions within some tolerance of the best 484 # score. 485 starts = [(score, pos) for score, pos in starts 486 if rint(abs(score - best_score)) <= rint(tolerance)] 487 488 # Recover the alignments and return them. 489 alignments = _recover_alignments(sequenceA, sequenceB, starts, 490 score_matrix, trace_matrix, 491 align_globally, gap_char, 492 one_alignment_only, gap_A_fn, gap_B_fn) 493 if not alignments: 494 # This may happen, see recover_alignments for explanation 495 score_matrix, trace_matrix = _reverse_matrices(score_matrix, 496 trace_matrix) 497 starts = [(z, (y, x)) for z, (x, y) in starts] 498 alignments = _recover_alignments(sequenceB, sequenceA, starts, 499 score_matrix, trace_matrix, 500 align_globally, gap_char, 501 one_alignment_only, gap_B_fn, 502 gap_A_fn, reverse=True) 503 return alignments
504 505
506 -def _make_score_matrix_generic(sequenceA, sequenceB, match_fn, gap_A_fn, 507 gap_B_fn, penalize_end_gaps, align_globally, 508 score_only):
509 """Generate a score and traceback matrix (PRIVATE). 510 511 This implementation according to Needleman-Wunsch allows the usage of 512 general gap functions and is rather slow. It is automatically called if 513 you define your own gap functions. You can force the usage of this method 514 with ``force_generic=True``. 515 """ 516 # Create the score and traceback matrices. These should be in the 517 # shape: 518 # sequenceA (down) x sequenceB (across) 519 lenA, lenB = len(sequenceA), len(sequenceB) 520 score_matrix, trace_matrix = [], [] 521 for i in range(lenA + 1): 522 score_matrix.append([None] * (lenB + 1)) 523 if not score_only: 524 trace_matrix.append([None] * (lenB + 1)) 525 526 # Initialize first row and column with gap scores. This is like opening up 527 # i gaps at the beginning of sequence A or B. 528 for i in range(lenA + 1): 529 if penalize_end_gaps[1]: # [1]:gap in sequence B 530 score = gap_B_fn(0, i) 531 else: 532 score = 0 533 score_matrix[i][0] = score 534 for i in range(lenB + 1): 535 if penalize_end_gaps[0]: # [0]:gap in sequence A 536 score = gap_A_fn(0, i) 537 else: 538 score = 0 539 score_matrix[0][i] = score 540 541 # Fill in the score matrix. Each position in the matrix 542 # represents an alignment between a character from sequence A to 543 # one in sequence B. As I iterate through the matrix, find the 544 # alignment by choose the best of: 545 # 1) extending a previous alignment without gaps 546 # 2) adding a gap in sequenceA 547 # 3) adding a gap in sequenceB 548 for row in range(1, lenA + 1): 549 for col in range(1, lenB + 1): 550 # First, calculate the score that would occur by extending 551 # the alignment without gaps. 552 nogap_score = score_matrix[row - 1][col - 1] + \ 553 match_fn(sequenceA[row - 1], sequenceB[col - 1]) 554 555 # Try to find a better score by opening gaps in sequenceA. 556 # Do this by checking alignments from each column in the row. 557 # Each column represents a different character to align from, 558 # and thus a different length gap. 559 # Although the gap function does not distinguish between opening 560 # and extending a gap, we distinguish them for the backtrace. 561 if not penalize_end_gaps[0] and row == lenA: 562 row_open = score_matrix[row][col - 1] 563 row_extend = max([score_matrix[row][x] for x in range(col)]) 564 else: 565 row_open = score_matrix[row][col - 1] + gap_A_fn(row, 1) 566 row_extend = max([score_matrix[row][x] + gap_A_fn(row, col - x) 567 for x in range(col)]) 568 569 # Try to find a better score by opening gaps in sequenceB. 570 if not penalize_end_gaps[1] and col == lenB: 571 col_open = score_matrix[row - 1][col] 572 col_extend = max([score_matrix[x][col] for x in range(row)]) 573 else: 574 col_open = score_matrix[row - 1][col] + gap_B_fn(col, 1) 575 col_extend = max([score_matrix[x][col] + gap_B_fn(col, row - x) 576 for x in range(row)]) 577 578 best_score = max(nogap_score, row_open, row_extend, col_open, 579 col_extend) 580 if not align_globally and best_score < 0: 581 score_matrix[row][col] = 0 582 else: 583 score_matrix[row][col] = best_score 584 585 # The backtrace is encoded binary. See _make_score_matrix_fast 586 # for details. 587 if not score_only: 588 trace_score = 0 589 if rint(nogap_score) == rint(best_score): 590 trace_score += 2 591 if rint(row_open) == rint(best_score): 592 trace_score += 1 593 if rint(row_extend) == rint(best_score): 594 trace_score += 8 595 if rint(col_open) == rint(best_score): 596 trace_score += 4 597 if rint(col_extend) == rint(best_score): 598 trace_score += 16 599 trace_matrix[row][col] = trace_score 600 601 return score_matrix, trace_matrix
602 603
604 -def _make_score_matrix_fast(sequenceA, sequenceB, match_fn, open_A, extend_A, 605 open_B, extend_B, penalize_extend_when_opening, 606 penalize_end_gaps, align_globally, score_only):
607 """Generate a score and traceback matrix according to Gotoh (PRIVATE). 608 609 This is an implementation of the Needleman-Wunsch dynamic programming 610 algorithm as modified by Gotoh, implementing affine gap penalties. 611 In short, we have three matrices, holding scores for alignments ending 612 in (1) a match/mismatch, (2) a gap in sequence A, and (3) a gap in 613 sequence B, respectively. However, we can combine them in one matrix, 614 which holds the best scores, and store only those values from the 615 other matrices that are actually used for the next step of calculation. 616 The traceback matrix holds the positions for backtracing the alignment. 617 """ 618 first_A_gap = calc_affine_penalty(1, open_A, extend_A, 619 penalize_extend_when_opening) 620 first_B_gap = calc_affine_penalty(1, open_B, extend_B, 621 penalize_extend_when_opening) 622 623 # Create the score and traceback matrices. These should be in the 624 # shape: 625 # sequenceA (down) x sequenceB (across) 626 lenA, lenB = len(sequenceA), len(sequenceB) 627 score_matrix, trace_matrix = [], [] 628 for i in range(lenA + 1): 629 score_matrix.append([None] * (lenB + 1)) 630 if not score_only: 631 trace_matrix.append([None] * (lenB + 1)) 632 633 # Initialize first row and column with gap scores. This is like opening up 634 # i gaps at the beginning of sequence A or B. 635 for i in range(lenA + 1): 636 if penalize_end_gaps[1]: # [1]:gap in sequence B 637 score = calc_affine_penalty(i, open_B, extend_B, 638 penalize_extend_when_opening) 639 else: 640 score = 0 641 score_matrix[i][0] = score 642 for i in range(lenB + 1): 643 if penalize_end_gaps[0]: # [0]:gap in sequence A 644 score = calc_affine_penalty(i, open_A, extend_A, 645 penalize_extend_when_opening) 646 else: 647 score = 0 648 score_matrix[0][i] = score 649 650 # Now initialize the col 'matrix'. Actually this is only a one dimensional 651 # list, since we only need the col scores from the last row. 652 col_score = [0] # Best score, if actual alignment ends with gap in seqB 653 for i in range(1, lenB + 1): 654 col_score.append(calc_affine_penalty(i, 2 * open_B, extend_B, 655 penalize_extend_when_opening)) 656 657 # The row 'matrix' is calculated on the fly. Here we only need the actual 658 # score. 659 # Now, filling up the score and traceback matrices: 660 for row in range(1, lenA + 1): 661 row_score = calc_affine_penalty(row, 2 * open_A, extend_A, 662 penalize_extend_when_opening) 663 for col in range(1, lenB + 1): 664 # Calculate the score that would occur by extending the 665 # alignment without gaps. 666 nogap_score = score_matrix[row - 1][col - 1] + \ 667 match_fn(sequenceA[row - 1], sequenceB[col - 1]) 668 669 # Check the score that would occur if there were a gap in 670 # sequence A. This could come from opening a new gap or 671 # extending an existing one. 672 # A gap in sequence A can also be opened if it follows a gap in 673 # sequence B: A- 674 # -B 675 if not penalize_end_gaps[0] and row == lenA: 676 row_open = score_matrix[row][col - 1] 677 row_extend = row_score 678 else: 679 row_open = score_matrix[row][col - 1] + first_A_gap 680 row_extend = row_score + extend_A 681 row_score = max(row_open, row_extend) 682 683 # The same for sequence B: 684 if not penalize_end_gaps[1] and col == lenB: 685 col_open = score_matrix[row - 1][col] 686 col_extend = col_score[col] 687 else: 688 col_open = score_matrix[row - 1][col] + first_B_gap 689 col_extend = col_score[col] + extend_B 690 col_score[col] = max(col_open, col_extend) 691 692 best_score = max(nogap_score, col_score[col], row_score) 693 if not align_globally and best_score < 0: 694 score_matrix[row][col] = 0 695 else: 696 score_matrix[row][col] = best_score 697 698 # Now the trace_matrix. The edges of the backtrace are encoded 699 # binary: 1 = open gap in seqA, 2 = match/mismatch of seqA and 700 # seqB, 4 = open gap in seqB, 8 = extend gap in seqA, and 701 # 16 = extend gap in seqB. This values can be summed up. 702 # Thus, the trace score 7 means that the best score can either 703 # come from opening a gap in seqA (=1), pairing two characters 704 # of seqA and seqB (+2=3) or opening a gap in seqB (+4=7). 705 # However, if we only want the score we don't care about the trace. 706 if not score_only: 707 row_score_rint = rint(row_score) 708 col_score_rint = rint(col_score[col]) 709 row_trace_score = 0 710 col_trace_score = 0 711 if rint(row_open) == row_score_rint: 712 row_trace_score += 1 # Open gap in seqA 713 if rint(row_extend) == row_score_rint: 714 row_trace_score += 8 # Extend gap in seqA 715 if rint(col_open) == col_score_rint: 716 col_trace_score += 4 # Open gap in seqB 717 if rint(col_extend) == col_score_rint: 718 col_trace_score += 16 # Extend gap in seqB 719 720 trace_score = 0 721 best_score_rint = rint(best_score) 722 if rint(nogap_score) == best_score_rint: 723 trace_score += 2 # Align seqA with seqB 724 if row_score_rint == best_score_rint: 725 trace_score += row_trace_score 726 if col_score_rint == best_score_rint: 727 trace_score += col_trace_score 728 trace_matrix[row][col] = trace_score 729 730 return score_matrix, trace_matrix
731 732
733 -def _recover_alignments(sequenceA, sequenceB, starts, score_matrix, 734 trace_matrix, align_globally, gap_char, 735 one_alignment_only, gap_A_fn, gap_B_fn, reverse=False):
736 """Do the backtracing and return a list of alignments (PRIVATE). 737 738 Recover the alignments by following the traceback matrix. This 739 is a recursive procedure, but it's implemented here iteratively 740 with a stack. 741 742 sequenceA and sequenceB may be sequences, including strings, 743 lists, or list-like objects. In order to preserve the type of 744 the object, we need to use slices on the sequences instead of 745 indexes. For example, sequenceA[row] may return a type that's 746 not compatible with sequenceA, e.g. if sequenceA is a list and 747 sequenceA[row] is a string. Thus, avoid using indexes and use 748 slices, e.g. sequenceA[row:row+1]. Assume that client-defined 749 sequence classes preserve these semantics. 750 """ 751 lenA, lenB = len(sequenceA), len(sequenceB) 752 ali_seqA, ali_seqB = sequenceA[0:0], sequenceB[0:0] 753 tracebacks = [] 754 in_process = [] 755 756 for start in starts: 757 score, (row, col) = start 758 begin = 0 759 if align_globally: 760 end = None 761 else: 762 # Local alignments should start with a positive score! 763 if score <= 0: 764 continue 765 # Local alignments should not end with a gap!: 766 trace = trace_matrix[row][col] 767 if (trace - trace % 2) % 4 == 2: # Trace contains 'nogap', fine! 768 trace_matrix[row][col] = 2 769 # If not, don't start here! 770 else: 771 continue 772 end = -max(lenA - row, lenB - col) 773 if not end: 774 end = None 775 col_distance = lenB - col 776 row_distance = lenA - row 777 ali_seqA = ((col_distance - row_distance) * gap_char + 778 sequenceA[lenA - 1:row - 1:-1]) 779 ali_seqB = ((row_distance - col_distance) * gap_char + 780 sequenceB[lenB - 1:col - 1:-1]) 781 in_process += [(ali_seqA, ali_seqB, end, row, col, False, 782 trace_matrix[row][col])] 783 while in_process and len(tracebacks) < MAX_ALIGNMENTS: 784 # Although we allow a gap in seqB to be followed by a gap in seqA, 785 # we don't want to allow it the other way round, since this would 786 # give redundant alignments of type: A- vs. -A 787 # -B B- 788 # Thus we need to keep track if a gap in seqA was opened (col_gap) 789 # and stop the backtrace (dead_end) if a gap in seqB follows. 790 # 791 # Attention: This may fail, if the gap-penalties for both strands are 792 # different. In this case the second aligment may be the only optimal 793 # alignment. Thus it can happen that no alignment is returned. For 794 # this case a workaround was implemented, which reverses the input and 795 # the matrices (this happens in _reverse_matrices) and repeats the 796 # backtrace. The variable 'reverse' keeps track of this. 797 dead_end = False 798 ali_seqA, ali_seqB, end, row, col, col_gap, trace = in_process.pop() 799 800 while (row > 0 or col > 0) and not dead_end: 801 cache = (ali_seqA[:], ali_seqB[:], end, row, col, col_gap) 802 803 # If trace is empty we have reached at least one border of the 804 # matrix or the end of a local aligment. Just add the rest of 805 # the sequence(s) and fill with gaps if necessary. 806 if not trace: 807 if col and col_gap: 808 dead_end = True 809 else: 810 ali_seqA, ali_seqB = _finish_backtrace( 811 sequenceA, sequenceB, ali_seqA, ali_seqB, 812 row, col, gap_char) 813 break 814 elif trace % 2 == 1: # = row open = open gap in seqA 815 trace -= 1 816 if col_gap: 817 dead_end = True 818 else: 819 col -= 1 820 ali_seqA += gap_char 821 ali_seqB += sequenceB[col:col + 1] 822 col_gap = False 823 elif trace % 4 == 2: # = match/mismatch of seqA with seqB 824 trace -= 2 825 row -= 1 826 col -= 1 827 ali_seqA += sequenceA[row:row + 1] 828 ali_seqB += sequenceB[col:col + 1] 829 col_gap = False 830 elif trace % 8 == 4: # = col open = open gap in seqB 831 trace -= 4 832 row -= 1 833 ali_seqA += sequenceA[row:row + 1] 834 ali_seqB += gap_char 835 col_gap = True 836 elif trace in (8, 24): # = row extend = extend gap in seqA 837 trace -= 8 838 if col_gap: 839 dead_end = True 840 else: 841 col_gap = False 842 # We need to find the starting point of the extended gap 843 x = _find_gap_open(sequenceA, sequenceB, ali_seqA, 844 ali_seqB, end, row, col, col_gap, 845 gap_char, score_matrix, trace_matrix, 846 in_process, gap_A_fn, col, row, 'col') 847 ali_seqA, ali_seqB, row, col, in_process, dead_end = x 848 elif trace == 16: # = col extend = extend gap in seqB 849 trace -= 16 850 col_gap = True 851 x = _find_gap_open(sequenceA, sequenceB, ali_seqA, ali_seqB, 852 end, row, col, col_gap, gap_char, 853 score_matrix, trace_matrix, in_process, 854 gap_B_fn, row, col, 'row') 855 ali_seqA, ali_seqB, row, col, in_process, dead_end = x 856 857 if trace: # There is another path to follow... 858 cache += (trace,) 859 in_process.append(cache) 860 trace = trace_matrix[row][col] 861 if not align_globally and score_matrix[row][col] <= 0: 862 begin = max(row, col) 863 trace = 0 864 if not dead_end: 865 if not reverse: 866 tracebacks.append((ali_seqA[::-1], ali_seqB[::-1], score, 867 begin, end)) 868 else: 869 tracebacks.append((ali_seqB[::-1], ali_seqA[::-1], score, 870 begin, end)) 871 if one_alignment_only: 872 break 873 return _clean_alignments(tracebacks)
874 875
876 -def _find_start(score_matrix, align_globally):
877 """Return a list of starting points (score, (row, col)) (PRIVATE). 878 879 Indicating every possible place to start the tracebacks. 880 """ 881 nrows, ncols = len(score_matrix), len(score_matrix[0]) 882 # In this implementation of the global algorithm, the start will always be 883 # the bottom right corner of the matrix. 884 if align_globally: 885 starts = [(score_matrix[-1][-1], (nrows - 1, ncols - 1))] 886 else: 887 starts = [] 888 for row in range(nrows): 889 for col in range(ncols): 890 score = score_matrix[row][col] 891 starts.append((score, (row, col))) 892 return starts
893 894
895 -def _reverse_matrices(score_matrix, trace_matrix):
896 """Reverse score and trace matrices (PRIVATE).""" 897 reverse_score_matrix = [] 898 reverse_trace_matrix = [] 899 reverse_trace = {1: 4, 2: 2, 3: 6, 4: 1, 5: 5, 6: 3, 7: 7, 8: 16, 9: 20, 900 10: 18, 11: 22, 12: 17, 13: 21, 14: 19, 15: 23, 16: 8, 901 17: 12, 18: 10, 19: 14, 20: 9, 21: 13, 22: 11, 23: 15, 902 24: 24, 25: 28, 26: 26, 27: 30, 28: 25, 29: 29, 30: 27, 903 31: 31, None: None} 904 for col in range(len(score_matrix[0])): 905 new_score_row = [] 906 new_trace_row = [] 907 for row in range(len(score_matrix)): 908 new_score_row.append(score_matrix[row][col]) 909 new_trace_row.append(reverse_trace[trace_matrix[row][col]]) 910 reverse_score_matrix.append(new_score_row) 911 reverse_trace_matrix.append(new_trace_row) 912 return reverse_score_matrix, reverse_trace_matrix
913 914
915 -def _clean_alignments(alignments):
916 """Take a list of alignments and return a cleaned version (PRIVATE). 917 918 Remove duplicates, make sure begin and end are set correctly, remove 919 empty alignments. 920 """ 921 unique_alignments = [] 922 for align in alignments: 923 if align not in unique_alignments: 924 unique_alignments.append(align) 925 i = 0 926 while i < len(unique_alignments): 927 seqA, seqB, score, begin, end = unique_alignments[i] 928 # Make sure end is set reasonably. 929 if end is None: # global alignment 930 end = len(seqA) 931 elif end < 0: 932 end = end + len(seqA) 933 # If there's no alignment here, get rid of it. 934 if begin >= end: 935 del unique_alignments[i] 936 continue 937 unique_alignments[i] = seqA, seqB, score, begin, end 938 i += 1 939 return unique_alignments
940 941
942 -def _finish_backtrace(sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, 943 gap_char):
944 """Add remaining sequences and fill with gaps if necessary (PRIVATE).""" 945 if row: 946 ali_seqA += sequenceA[row - 1::-1] 947 if col: 948 ali_seqB += sequenceB[col - 1::-1] 949 if row > col: 950 ali_seqB += gap_char * (len(ali_seqA) - len(ali_seqB)) 951 elif col > row: 952 ali_seqA += gap_char * (len(ali_seqB) - len(ali_seqA)) 953 return ali_seqA, ali_seqB
954 955
956 -def _find_gap_open(sequenceA, sequenceB, ali_seqA, ali_seqB, end, row, col, 957 col_gap, gap_char, score_matrix, trace_matrix, in_process, 958 gap_fn, target, index, direction):
959 """Find the starting point(s) of the extended gap (PRIVATE).""" 960 dead_end = False 961 target_score = score_matrix[row][col] 962 for n in range(target): 963 if direction == 'col': 964 col -= 1 965 ali_seqA += gap_char 966 ali_seqB += sequenceB[col:col + 1] 967 else: 968 row -= 1 969 ali_seqA += sequenceA[row:row + 1] 970 ali_seqB += gap_char 971 actual_score = score_matrix[row][col] + gap_fn(index, n + 1) 972 if rint(actual_score) == rint(target_score) and n > 0: 973 if not trace_matrix[row][col]: 974 break 975 else: 976 in_process.append((ali_seqA[:], ali_seqB[:], end, row, col, 977 col_gap, trace_matrix[row][col])) 978 if not trace_matrix[row][col]: 979 dead_end = True 980 return ali_seqA, ali_seqB, row, col, in_process, dead_end
981 982 983 _PRECISION = 1000 984 985
986 -def rint(x, precision=_PRECISION):
987 """Print number with declared precision.""" 988 return int(x * precision + 0.5)
989 990
991 -class identity_match(object):
992 """Create a match function for use in an alignment. 993 994 match and mismatch are the scores to give when two residues are equal 995 or unequal. By default, match is 1 and mismatch is 0. 996 """ 997
998 - def __init__(self, match=1, mismatch=0):
999 """Initialize the class.""" 1000 self.match = match 1001 self.mismatch = mismatch
1002
1003 - def __call__(self, charA, charB):
1004 """Call a match function instance already created.""" 1005 if charA == charB: 1006 return self.match 1007 return self.mismatch
1008 1009
1010 -class dictionary_match(object):
1011 """Create a match function for use in an alignment. 1012 1013 Attributes: 1014 - score_dict - A dictionary where the keys are tuples (residue 1, 1015 residue 2) and the values are the match scores between those residues. 1016 - symmetric - A flag that indicates whether the scores are symmetric. 1017 1018 """ 1019
1020 - def __init__(self, score_dict, symmetric=1):
1021 """Initialize the class.""" 1022 self.score_dict = score_dict 1023 self.symmetric = symmetric
1024
1025 - def __call__(self, charA, charB):
1026 """Call a dictionary match instance already created.""" 1027 if self.symmetric and (charA, charB) not in self.score_dict: 1028 # If the score dictionary is symmetric, then look up the 1029 # score both ways. 1030 charB, charA = charA, charB 1031 return self.score_dict[(charA, charB)]
1032 1033
1034 -class affine_penalty(object):
1035 """Create a gap function for use in an alignment.""" 1036
1037 - def __init__(self, open, extend, penalize_extend_when_opening=0):
1038 """Initialize the class.""" 1039 if open > 0 or extend > 0: 1040 raise ValueError("Gap penalties should be non-positive.") 1041 if not penalize_extend_when_opening and (extend < open): 1042 raise ValueError("Gap opening penalty should be higher than " + 1043 "gap extension penalty (or equal)") 1044 self.open, self.extend = open, extend 1045 self.penalize_extend_when_opening = penalize_extend_when_opening
1046
1047 - def __call__(self, index, length):
1048 """Call a gap function instance already created.""" 1049 return calc_affine_penalty( 1050 length, self.open, self.extend, self.penalize_extend_when_opening)
1051 1052
1053 -def calc_affine_penalty(length, open, extend, penalize_extend_when_opening):
1054 """Calculate a penality score for the gap function.""" 1055 if length <= 0: 1056 return 0 1057 penalty = open + extend * length 1058 if not penalize_extend_when_opening: 1059 penalty -= extend 1060 return penalty
1061 1062 1075 1076
1077 -def format_alignment(align1, align2, score, begin, end):
1078 """Format the alignment prettily into a string. 1079 1080 Since Biopython 1.71 identical matches are shown with a pipe 1081 character, mismatches as a dot, and gaps as a space. 1082 1083 Note that spaces are also used at the start/end of a local 1084 alignment. 1085 1086 Prior releases just used the pipe character to indicate the 1087 aligned region (matches, mismatches and gaps). 1088 """ 1089 s = [] 1090 s.append("%s\n" % align1) 1091 s.append(" " * begin) 1092 for a, b in zip(align1[begin:end], align2[begin:end]): 1093 if a == b: 1094 s.append("|") # match 1095 elif a == "-" or b == "-": 1096 s.append(" ") # gap 1097 else: 1098 s.append(".") # mismatch 1099 s.append("\n") 1100 s.append("%s\n" % align2) 1101 s.append(" Score=%g\n" % score) 1102 return ''.join(s)
1103 1104 1105 # Try and load C implementations of functions. If I can't, 1106 # then throw a warning and use the pure Python implementations. 1107 # The redefinition is deliberate, thus the no quality assurance 1108 # flag for when using flake8: 1109 try: 1110 from .cpairwise2 import rint, _make_score_matrix_fast # noqa 1111 except ImportError: 1112 warnings.warn('Import of C module failed. Falling back to pure Python ' + 1113 'implementation. This may be slooow...', BiopythonWarning) 1114 1115 if __name__ == "__main__": 1116 from Bio._utils import run_doctest 1117 run_doctest() 1118