Package Bio :: Module pairwise2
[hide private]
[frames] | no frames]

Source Code for Module Bio.pairwise2

   1  # Copyright 2002 by Jeffrey Chang. 
   2  # Copyright 2016 by Markus Piotrowski. 
   3  # All rights reserved. 
   4  # This code is part of the Biopython distribution and governed by its 
   5  # license.  Please see the LICENSE file that should have been included 
   6  # as part of this package. 
   7   
   8  """Pairwise sequence alignment using a dynamic programming algorithm. 
   9   
  10  This provides functions to get global and local alignments between two 
  11  sequences. A global alignment finds the best concordance between all 
  12  characters in two sequences. A local alignment finds just the 
  13  subsequences that align the best. 
  14   
  15  When doing alignments, you can specify the match score and gap 
  16  penalties.  The match score indicates the compatibility between an 
  17  alignment of two characters in the sequences. Highly compatible 
  18  characters should be given positive scores, and incompatible ones 
  19  should be given negative scores or 0.  The gap penalties should be 
  20  negative. 
  21   
  22  The names of the alignment functions in this module follow the 
  23  convention 
  24  <alignment type>XX 
  25  where <alignment type> is either "global" or "local" and XX is a 2 
  26  character code indicating the parameters it takes.  The first 
  27  character indicates the parameters for matches (and mismatches), and 
  28  the second indicates the parameters for gap penalties. 
  29   
  30  The match parameters are:: 
  31   
  32      CODE  DESCRIPTION 
  33      x     No parameters. Identical characters have score of 1, otherwise 0. 
  34      m     A match score is the score of identical chars, otherwise mismatch 
  35            score. 
  36      d     A dictionary returns the score of any pair of characters. 
  37      c     A callback function returns scores. 
  38   
  39  The gap penalty parameters are:: 
  40   
  41      CODE  DESCRIPTION 
  42      x     No gap penalties. 
  43      s     Same open and extend gap penalties for both sequences. 
  44      d     The sequences have different open and extend gap penalties. 
  45      c     A callback function returns the gap penalties. 
  46   
  47  All the different alignment functions are contained in an object 
  48  ``align``. For example: 
  49   
  50      >>> from Bio import pairwise2 
  51      >>> alignments = pairwise2.align.globalxx("ACCGT", "ACG") 
  52   
  53  will return a list of the alignments between the two strings. For a nice 
  54  printout, use the ``format_alignment`` method of the module: 
  55   
  56      >>> from Bio.pairwise2 import format_alignment 
  57      >>> print(format_alignment(*alignments[0])) 
  58      ACCGT 
  59      ||||| 
  60      A-CG- 
  61        Score=3 
  62      <BLANKLINE> 
  63   
  64  All alignment functions have the following arguments: 
  65   
  66  - Two sequences: strings, Biopython sequence objects or lists. 
  67    Lists are useful for suppling sequences which contain residues that are 
  68    encoded by more than one letter. 
  69   
  70  - ``penalize_extend_when_opening``: boolean (default: False). 
  71    Whether to count an extension penalty when opening a gap. If false, a gap of 
  72    1 is only penalized an "open" penalty, otherwise it is penalized 
  73    "open+extend". 
  74   
  75  - ``penalize_end_gaps``: boolean. 
  76    Whether to count the gaps at the ends of an alignment. By default, they are 
  77    counted for global alignments but not for local ones. Setting 
  78    ``penalize_end_gaps`` to (boolean, boolean) allows you to specify for the 
  79    two sequences separately whether gaps at the end of the alignment should be 
  80    counted. 
  81   
  82  - ``gap_char``: string (default: ``'-'``). 
  83    Which character to use as a gap character in the alignment returned. If your 
  84    input sequences are lists, you must change this to ``['-']``. 
  85   
  86  - ``force_generic``: boolean (default: False). 
  87    Always use the generic, non-cached, dynamic programming function (slow!). 
  88    For debugging. 
  89   
  90  - ``score_only``: boolean (default: False). 
  91    Only get the best score, don't recover any alignments. The return value of 
  92    the function is the score. Faster and uses less memory. 
  93   
  94  - ``one_alignment_only``: boolean (default: False). 
  95    Only recover one alignment. 
  96   
  97  The other parameters of the alignment function depend on the function called. 
  98  Some examples: 
  99   
 100  - Find the best global alignment between the two sequences. Identical 
 101    characters are given 1 point. No points are deducted for mismatches or gaps. 
 102   
 103      >>> for a in pairwise2.align.globalxx("ACCGT", "ACG"): 
 104      ...     print(format_alignment(*a)) 
 105      ACCGT 
 106      ||||| 
 107      A-CG- 
 108        Score=3 
 109      <BLANKLINE> 
 110      ACCGT 
 111      ||||| 
 112      AC-G- 
 113        Score=3 
 114      <BLANKLINE> 
 115   
 116  - Same thing as before, but with a local alignment. 
 117   
 118      >>> for a in pairwise2.align.localxx("ACCGT", "ACG"): 
 119      ...     print(format_alignment(*a)) 
 120      ACCGT 
 121      |||| 
 122      A-CG- 
 123        Score=3 
 124      <BLANKLINE> 
 125      ACCGT 
 126      |||| 
 127      AC-G- 
 128        Score=3 
 129      <BLANKLINE> 
 130   
 131  - Do a global alignment. Identical characters are given 2 points, 1 point is 
 132    deducted for each non-identical character. Don't penalize gaps. 
 133   
 134      >>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1): 
 135      ...     print(format_alignment(*a)) 
 136      ACCGT 
 137      ||||| 
 138      A-CG- 
 139        Score=6 
 140      <BLANKLINE> 
 141      ACCGT 
 142      ||||| 
 143      AC-G- 
 144        Score=6 
 145      <BLANKLINE> 
 146   
 147  - Same as above, except now 0.5 points are deducted when opening a gap, and 
 148    0.1 points are deducted when extending it. 
 149   
 150      >>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1): 
 151      ...     print(format_alignment(*a)) 
 152      ACCGT 
 153      ||||| 
 154      A-CG- 
 155        Score=5 
 156      <BLANKLINE> 
 157      ACCGT 
 158      ||||| 
 159      AC-G- 
 160        Score=5 
 161      <BLANKLINE> 
 162   
 163  - Depending on the penalties, a gap in one sequence may be followed by a gap in 
 164    the other sequence.If you don't like this behaviour, increase the gap-open 
 165    penalty: 
 166   
 167      >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1): 
 168      ...     print(format_alignment(*a)) 
 169      A- 
 170      || 
 171      -T 
 172        Score=-2 
 173      <BLANKLINE> 
 174      >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1): 
 175      ...     print(format_alignment(*a)) 
 176      A 
 177      | 
 178      T 
 179        Score=-4 
 180      <BLANKLINE> 
 181   
 182  - The alignment function can also use known matrices already included in 
 183    Biopython (``MatrixInfo`` from ``Bio.SubsMat``): 
 184   
 185      >>> from Bio.SubsMat import MatrixInfo as matlist 
 186      >>> matrix = matlist.blosum62 
 187      >>> for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix): 
 188      ...     print(format_alignment(*a)) 
 189      KEVLA 
 190      ||||| 
 191      -EVL- 
 192        Score=13 
 193      <BLANKLINE> 
 194   
 195  - With the parameter ``c`` you can define your own match- and gap functions. 
 196    E.g. to define an affine logarithmic gap function and using it: 
 197   
 198      >>> from math import log 
 199      >>> def gap_function(x, y):  # x is gap position in seq, y is gap length 
 200      ...     if y == 0:  # No gap 
 201      ...         return 0 
 202      ...     elif y == 1:  # Gap open penalty 
 203      ...         return -2 
 204      ...     return - (2 + y/4.0 + log(y)/2.0) 
 205      ... 
 206      >>> alignment = pairwise2.align.globalmc("ACCCCCGT", "ACG", 5, -4, 
 207      ...                                      gap_function, gap_function) 
 208   
 209    You can define different gap functions for each sequence. 
 210    Self-defined match functions must take the two residues to be compared and 
 211    return a score. 
 212   
 213  To see a description of the parameters for a function, please look at 
 214  the docstring for the function via the help function, e.g. 
 215  type ``help(pairwise2.align.localds``) at the Python prompt. 
 216   
 217  """ 
 218  from __future__ import print_function 
 219   
 220  import warnings 
 221   
 222  from Bio import BiopythonWarning 
 223   
 224   
 225  MAX_ALIGNMENTS = 1000   # maximum alignments recovered in traceback 
 226   
 227   
228 -class align(object):
229 """Provide functions that do alignments. 230 231 Alignment functions are called as: 232 233 pairwise2.align.globalXX 234 235 or 236 237 pairwise2.align.localXX 238 239 Where XX is a 2 character code indicating the match/mismatch parameters 240 (first character, either x, m, d or c) and the gap penalty parameters 241 (second character, either x, s, d, or c). 242 243 For a detailed description read the main module's docstring (e.g., 244 type ``help(pairwise2)``). 245 To see a description of the parameters for a function, please 246 look at the docstring for the function, e.g. type 247 ``help(pairwise2.align.localds``) at the Python prompt. 248 """ 249
250 - class alignment_function(object):
251 """Callable class which impersonates an alignment function. 252 253 The constructor takes the name of the function. This class 254 will decode the name of the function to figure out how to 255 interpret the parameters. 256 """ 257 258 # match code -> tuple of (parameters, docstring) 259 match2args = { 260 'x': ([], ''), 261 'm': (['match', 'mismatch'], 262 "match is the score to given to identical characters.\n" 263 "mismatch is the score given to non-identical ones."), 264 'd': (['match_dict'], 265 "match_dict is a dictionary where the keys are tuples\n" 266 "of pairs of characters and the values are the scores,\n" 267 "e.g. ('A', 'C') : 2.5."), 268 'c': (['match_fn'], 269 "match_fn is a callback function that takes two " 270 "characters and returns the score between them."), 271 } 272 # penalty code -> tuple of (parameters, docstring) 273 penalty2args = { 274 'x': ([], ''), 275 's': (['open', 'extend'], 276 "open and extend are the gap penalties when a gap is\n" 277 "opened and extended. They should be negative."), 278 'd': (['openA', 'extendA', 'openB', 'extendB'], 279 "openA and extendA are the gap penalties for sequenceA,\n" 280 "and openB and extendB for sequenceB. The penalties\n" 281 "should be negative."), 282 'c': (['gap_A_fn', 'gap_B_fn'], 283 "gap_A_fn and gap_B_fn are callback functions that takes\n" 284 "(1) the index where the gap is opened, and (2) the length\n" 285 "of the gap. They should return a gap penalty."), 286 } 287
288 - def __init__(self, name):
289 """Check to make sure the name of the function is reasonable.""" 290 if name.startswith("global"): 291 if len(name) != 8: 292 raise AttributeError("function should be globalXX") 293 elif name.startswith("local"): 294 if len(name) != 7: 295 raise AttributeError("function should be localXX") 296 else: 297 raise AttributeError(name) 298 align_type, match_type, penalty_type = \ 299 name[:-2], name[-2], name[-1] 300 try: 301 match_args, match_doc = self.match2args[match_type] 302 except KeyError: 303 raise AttributeError("unknown match type %r" % match_type) 304 try: 305 penalty_args, penalty_doc = self.penalty2args[penalty_type] 306 except KeyError: 307 raise AttributeError("unknown penalty type %r" % penalty_type) 308 309 # Now get the names of the parameters to this function. 310 param_names = ['sequenceA', 'sequenceB'] 311 param_names.extend(match_args) 312 param_names.extend(penalty_args) 313 self.function_name = name 314 self.align_type = align_type 315 self.param_names = param_names 316 317 self.__name__ = self.function_name 318 # Set the doc string. 319 doc = "%s(%s) -> alignments\n" % ( 320 self.__name__, ', '.join(self.param_names)) 321 if match_doc: 322 doc += "\n%s\n" % match_doc 323 if penalty_doc: 324 doc += "\n%s\n" % penalty_doc 325 doc += ("""\ 326 \nalignments is a list of tuples (seqA, seqB, score, begin, end). 327 seqA and seqB are strings showing the alignment between the 328 sequences. score is the score of the alignment. begin and end 329 are indexes into seqA and seqB that indicate the where the 330 alignment occurs. 331 """) 332 self.__doc__ = doc
333
334 - def decode(self, *args, **keywds):
335 """Decode the arguments for the _align function. 336 337 keywds will get passed to it, so translate the arguments 338 to this function into forms appropriate for _align. 339 """ 340 keywds = keywds.copy() 341 if len(args) != len(self.param_names): 342 raise TypeError("%s takes exactly %d argument (%d given)" 343 % (self.function_name, len(self.param_names), 344 len(args))) 345 i = 0 346 while i < len(self.param_names): 347 if self.param_names[i] in [ 348 'sequenceA', 'sequenceB', 349 'gap_A_fn', 'gap_B_fn', 'match_fn']: 350 keywds[self.param_names[i]] = args[i] 351 i += 1 352 elif self.param_names[i] == 'match': 353 assert self.param_names[i + 1] == 'mismatch' 354 match, mismatch = args[i], args[i + 1] 355 keywds['match_fn'] = identity_match(match, mismatch) 356 i += 2 357 elif self.param_names[i] == 'match_dict': 358 keywds['match_fn'] = dictionary_match(args[i]) 359 i += 1 360 elif self.param_names[i] == 'open': 361 assert self.param_names[i + 1] == 'extend' 362 open, extend = args[i], args[i + 1] 363 pe = keywds.get('penalize_extend_when_opening', 0) 364 keywds['gap_A_fn'] = affine_penalty(open, extend, pe) 365 keywds['gap_B_fn'] = affine_penalty(open, extend, pe) 366 i += 2 367 elif self.param_names[i] == 'openA': 368 assert self.param_names[i + 3] == 'extendB' 369 openA, extendA, openB, extendB = args[i:i + 4] 370 pe = keywds.get('penalize_extend_when_opening', 0) 371 keywds['gap_A_fn'] = affine_penalty(openA, extendA, pe) 372 keywds['gap_B_fn'] = affine_penalty(openB, extendB, pe) 373 i += 4 374 else: 375 raise ValueError("unknown parameter %r" 376 % self.param_names[i]) 377 378 # Here are the default parameters for _align. Assign 379 # these to keywds, unless already specified. 380 pe = keywds.get('penalize_extend_when_opening', 0) 381 default_params = [ 382 ('match_fn', identity_match(1, 0)), 383 ('gap_A_fn', affine_penalty(0, 0, pe)), 384 ('gap_B_fn', affine_penalty(0, 0, pe)), 385 ('penalize_extend_when_opening', 0), 386 ('penalize_end_gaps', self.align_type == 'global'), 387 ('align_globally', self.align_type == 'global'), 388 ('gap_char', '-'), 389 ('force_generic', 0), 390 ('score_only', 0), 391 ('one_alignment_only', 0), 392 ] 393 for name, default in default_params: 394 keywds[name] = keywds.get(name, default) 395 value = keywds['penalize_end_gaps'] 396 try: 397 n = len(value) 398 except TypeError: 399 keywds['penalize_end_gaps'] = tuple([value] * 2) 400 else: 401 assert n == 2 402 return keywds
403
404 - def __call__(self, *args, **keywds):
405 """Call the alignment instance already created.""" 406 keywds = self.decode(*args, **keywds) 407 return _align(**keywds)
408
409 - def __getattr__(self, attr):
410 """Call alignment_function() to check and decode the attributes.""" 411 # The following 'magic' is needed to rewrite the class docstring 412 # dynamically: 413 wrapper = self.alignment_function(attr) 414 wrapper_type = type(wrapper) 415 wrapper_dict = wrapper_type.__dict__.copy() 416 wrapper_dict['__doc__'] = wrapper.__doc__ 417 new_alignment_function = type('alignment_function', (object,), 418 wrapper_dict) 419 420 return new_alignment_function(attr)
421 422 423 align = align() 424 425
426 -def _align(sequenceA, sequenceB, match_fn, gap_A_fn, gap_B_fn, 427 penalize_extend_when_opening, penalize_end_gaps, 428 align_globally, gap_char, force_generic, score_only, 429 one_alignment_only):
430 """Return a list of alignments between two sequences or its score (PRIVATE).""" 431 if not sequenceA or not sequenceB: 432 return [] 433 try: 434 sequenceA + gap_char 435 sequenceB + gap_char 436 except TypeError: 437 raise TypeError('both sequences must be of the same type, either ' 438 'string/sequence object or list. Gap character must ' 439 'fit the sequence type (string or list)') 440 441 if not isinstance(sequenceA, list): 442 sequenceA = str(sequenceA) 443 if not isinstance(sequenceB, list): 444 sequenceB = str(sequenceB) 445 if not align_globally and (penalize_end_gaps[0] or penalize_end_gaps[1]): 446 warnings.warn('"penalize_end_gaps" should not be used in local ' 447 'alignments. The resulting score may be wrong.', 448 BiopythonWarning) 449 450 if (not force_generic) and isinstance(gap_A_fn, affine_penalty) \ 451 and isinstance(gap_B_fn, affine_penalty): 452 open_A, extend_A = gap_A_fn.open, gap_A_fn.extend 453 open_B, extend_B = gap_B_fn.open, gap_B_fn.extend 454 x = _make_score_matrix_fast( 455 sequenceA, sequenceB, match_fn, open_A, extend_A, open_B, 456 extend_B, penalize_extend_when_opening, penalize_end_gaps, 457 align_globally, score_only) 458 else: 459 x = _make_score_matrix_generic( 460 sequenceA, sequenceB, match_fn, gap_A_fn, gap_B_fn, 461 penalize_end_gaps, align_globally, score_only) 462 score_matrix, trace_matrix = x 463 464 # print("SCORE %s" % print_matrix(score_matrix)) 465 # print("TRACEBACK %s" % print_matrix(trace_matrix)) 466 467 # Look for the proper starting point. Get a list of all possible 468 # starting points. 469 starts = _find_start(score_matrix, align_globally) 470 # Find the highest score. 471 best_score = max([x[0] for x in starts]) 472 473 # If they only want the score, then return it. 474 if score_only: 475 return best_score 476 477 tolerance = 0 # XXX do anything with this? 478 # Now find all the positions within some tolerance of the best 479 # score. 480 starts = [(score, pos) for score, pos in starts 481 if rint(abs(score - best_score)) <= rint(tolerance)] 482 483 # Recover the alignments and return them. 484 return _recover_alignments(sequenceA, sequenceB, starts, score_matrix, 485 trace_matrix, align_globally, gap_char, 486 one_alignment_only, gap_A_fn, gap_B_fn)
487 488
489 -def _make_score_matrix_generic(sequenceA, sequenceB, match_fn, gap_A_fn, 490 gap_B_fn, penalize_end_gaps, align_globally, 491 score_only):
492 """Generate a score and traceback matrix according to Needleman-Wunsch (PRIVATE). 493 494 This implementation allows the usage of general gap functions and is rather 495 slow. It is automatically called if you define your own gap functions. You 496 can force the usage of this method with ``force_generic=True``. 497 """ 498 # Create the score and traceback matrices. These should be in the 499 # shape: 500 # sequenceA (down) x sequenceB (across) 501 lenA, lenB = len(sequenceA), len(sequenceB) 502 score_matrix, trace_matrix = [], [] 503 for i in range(lenA + 1): 504 score_matrix.append([None] * (lenB + 1)) 505 if not score_only: 506 trace_matrix.append([None] * (lenB + 1)) 507 508 # Initialize first row and column with gap scores. This is like opening up 509 # i gaps at the beginning of sequence A or B. 510 for i in range(lenA + 1): 511 if penalize_end_gaps[1]: # [1]:gap in sequence B 512 score = gap_B_fn(0, i) 513 else: 514 score = 0 515 score_matrix[i][0] = score 516 for i in range(lenB + 1): 517 if penalize_end_gaps[0]: # [0]:gap in sequence A 518 score = gap_A_fn(0, i) 519 else: 520 score = 0 521 score_matrix[0][i] = score 522 523 # Fill in the score matrix. Each position in the matrix 524 # represents an alignment between a character from sequence A to 525 # one in sequence B. As I iterate through the matrix, find the 526 # alignment by choose the best of: 527 # 1) extending a previous alignment without gaps 528 # 2) adding a gap in sequenceA 529 # 3) adding a gap in sequenceB 530 for row in range(1, lenA + 1): 531 for col in range(1, lenB + 1): 532 # First, calculate the score that would occur by extending 533 # the alignment without gaps. 534 nogap_score = score_matrix[row - 1][col - 1] + \ 535 match_fn(sequenceA[row - 1], sequenceB[col - 1]) 536 537 # Try to find a better score by opening gaps in sequenceA. 538 # Do this by checking alignments from each column in the row. 539 # Each column represents a different character to align from, 540 # and thus a different length gap. 541 # Although the gap function does not distinguish between opening 542 # and extending a gap, we distinguish them for the backtrace. 543 if not penalize_end_gaps[0] and row == lenA: 544 row_open = score_matrix[row][col - 1] 545 row_extend = max([score_matrix[row][x] for x in range(col)]) 546 else: 547 row_open = score_matrix[row][col - 1] + gap_A_fn(row, 1) 548 row_extend = max([score_matrix[row][x] + gap_A_fn(row, col - x) 549 for x in range(col)]) 550 551 # Try to find a better score by opening gaps in sequenceB. 552 if not penalize_end_gaps[1] and col == lenB: 553 col_open = score_matrix[row - 1][col] 554 col_extend = max([score_matrix[x][col] for x in range(row)]) 555 else: 556 col_open = score_matrix[row - 1][col] + gap_B_fn(col, 1) 557 col_extend = max([score_matrix[x][col] + gap_B_fn(col, row - x) 558 for x in range(row)]) 559 560 best_score = max(nogap_score, row_open, row_extend, col_open, 561 col_extend) 562 if not align_globally and best_score < 0: 563 score_matrix[row][col] = 0 564 else: 565 score_matrix[row][col] = best_score 566 567 # The backtrace is encoded binary. See _make_score_matrix_fast 568 # for details. 569 if not score_only: 570 trace_score = 0 571 if rint(nogap_score) == rint(best_score): 572 trace_score += 2 573 if rint(row_open) == rint(best_score): 574 trace_score += 1 575 if rint(row_extend) == rint(best_score): 576 trace_score += 8 577 if rint(col_open) == rint(best_score): 578 trace_score += 4 579 if rint(col_extend) == rint(best_score): 580 trace_score += 16 581 trace_matrix[row][col] = trace_score 582 583 return score_matrix, trace_matrix
584 585
586 -def _make_score_matrix_fast(sequenceA, sequenceB, match_fn, open_A, extend_A, 587 open_B, extend_B, penalize_extend_when_opening, 588 penalize_end_gaps, align_globally, score_only):
589 """Generate a score and traceback matrix according to Gotoh (PRIVATE). 590 591 This is an implementation of the Needleman-Wunsch dynamic programming 592 algorithm as modified by Gotoh, implementing affine gap penalties. 593 In short, we have three matrices, holding scores for alignments ending 594 in (1) a match/mismatch, (2) a gap in sequence A, and (3) a gap in 595 sequence B, respectively. However, we can combine them in one matrix, 596 which holds the best scores, and store only those values from the 597 other matrices that are actually used for the next step of calculation. 598 The traceback matrix holds the positions for backtracing the alignment. 599 """ 600 first_A_gap = calc_affine_penalty(1, open_A, extend_A, 601 penalize_extend_when_opening) 602 first_B_gap = calc_affine_penalty(1, open_B, extend_B, 603 penalize_extend_when_opening) 604 605 # Create the score and traceback matrices. These should be in the 606 # shape: 607 # sequenceA (down) x sequenceB (across) 608 lenA, lenB = len(sequenceA), len(sequenceB) 609 score_matrix, trace_matrix = [], [] 610 for i in range(lenA + 1): 611 score_matrix.append([None] * (lenB + 1)) 612 if not score_only: 613 trace_matrix.append([None] * (lenB + 1)) 614 615 # Initialize first row and column with gap scores. This is like opening up 616 # i gaps at the beginning of sequence A or B. 617 for i in range(lenA + 1): 618 if penalize_end_gaps[1]: # [1]:gap in sequence B 619 score = calc_affine_penalty(i, open_B, extend_B, 620 penalize_extend_when_opening) 621 else: 622 score = 0 623 score_matrix[i][0] = score 624 for i in range(lenB + 1): 625 if penalize_end_gaps[0]: # [0]:gap in sequence A 626 score = calc_affine_penalty(i, open_A, extend_A, 627 penalize_extend_when_opening) 628 else: 629 score = 0 630 score_matrix[0][i] = score 631 632 # Now initialize the col 'matrix'. Actually this is only a one dimensional 633 # list, since we only need the col scores from the last row. 634 col_score = [0] # Best score, if actual alignment ends with gap in seqB 635 for i in range(1, lenB + 1): 636 col_score.append(calc_affine_penalty(i, 2 * open_B, extend_B, 637 penalize_extend_when_opening)) 638 639 # The row 'matrix' is calculated on the fly. Here we only need the actual 640 # score. 641 # Now, filling up the score and traceback matrices: 642 for row in range(1, lenA + 1): 643 row_score = calc_affine_penalty(row, 2 * open_A, extend_A, 644 penalize_extend_when_opening) 645 for col in range(1, lenB + 1): 646 # Calculate the score that would occur by extending the 647 # alignment without gaps. 648 nogap_score = score_matrix[row - 1][col - 1] + \ 649 match_fn(sequenceA[row - 1], sequenceB[col - 1]) 650 651 # Check the score that would occur if there were a gap in 652 # sequence A. This could come from opening a new gap or 653 # extending an existing one. 654 # A gap in sequence A can also be opened if it follows a gap in 655 # sequence B: A- 656 # -B 657 if not penalize_end_gaps[0] and row == lenA: 658 row_open = score_matrix[row][col - 1] 659 row_extend = row_score 660 else: 661 row_open = score_matrix[row][col - 1] + first_A_gap 662 row_extend = row_score + extend_A 663 row_score = max(row_open, row_extend) 664 665 # The same for sequence B: 666 if not penalize_end_gaps[1] and col == lenB: 667 col_open = score_matrix[row - 1][col] 668 col_extend = col_score[col] 669 else: 670 col_open = score_matrix[row - 1][col] + first_B_gap 671 col_extend = col_score[col] + extend_B 672 col_score[col] = max(col_open, col_extend) 673 674 best_score = max(nogap_score, col_score[col], row_score) 675 if not align_globally and best_score < 0: 676 score_matrix[row][col] = 0 677 else: 678 score_matrix[row][col] = best_score 679 680 # Now the trace_matrix. The edges of the backtrace are encoded 681 # binary: 1 = open gap in seqA, 2 = match/mismatch of seqA and 682 # seqB, 4 = open gap in seqB, 8 = extend gap in seqA, and 683 # 16 = extend gap in seqA. This values can be summed up. 684 # Thus, the trace score 7 means that the best score can either 685 # come from opening a gap in seqA (=1), pairing two characters 686 # of seqA and seqB (+2=3) or opening a gap in seqB (+4=7). 687 # However, if we only want the score we don't care about the trace. 688 if not score_only: 689 row_score_rint = rint(row_score) 690 col_score_rint = rint(col_score[col]) 691 row_trace_score = 0 692 col_trace_score = 0 693 if rint(row_open) == row_score_rint: 694 row_trace_score += 1 # Open gap in seqA 695 if rint(row_extend) == row_score_rint: 696 row_trace_score += 8 # Extend gap in seqA 697 if rint(col_open) == col_score_rint: 698 col_trace_score += 4 # Open gap in seqB 699 if rint(col_extend) == col_score_rint: 700 col_trace_score += 16 # Extend gap in seqB 701 702 trace_score = 0 703 best_score_rint = rint(best_score) 704 if rint(nogap_score) == best_score_rint: 705 trace_score += 2 # Align seqA with seqB 706 if row_score_rint == best_score_rint: 707 trace_score += row_trace_score 708 if col_score_rint == best_score_rint: 709 trace_score += col_trace_score 710 trace_matrix[row][col] = trace_score 711 712 return score_matrix, trace_matrix
713 714
715 -def _recover_alignments(sequenceA, sequenceB, starts, score_matrix, 716 trace_matrix, align_globally, gap_char, 717 one_alignment_only, gap_A_fn, gap_B_fn):
718 """Do the backtracing and return a list of alignments (PRIVATE). 719 720 Recover the alignments by following the traceback matrix. This 721 is a recursive procedure, but it's implemented here iteratively 722 with a stack. 723 724 sequenceA and sequenceB may be sequences, including strings, 725 lists, or list-like objects. In order to preserve the type of 726 the object, we need to use slices on the sequences instead of 727 indexes. For example, sequenceA[row] may return a type that's 728 not compatible with sequenceA, e.g. if sequenceA is a list and 729 sequenceA[row] is a string. Thus, avoid using indexes and use 730 slices, e.g. sequenceA[row:row+1]. Assume that client-defined 731 sequence classes preserve these semantics. 732 """ 733 lenA, lenB = len(sequenceA), len(sequenceB) 734 ali_seqA, ali_seqB = sequenceA[0:0], sequenceB[0:0] 735 tracebacks = [] 736 in_process = [] 737 738 for start in starts: 739 score, (row, col) = start 740 begin = 0 741 if align_globally: 742 end = None 743 else: 744 # Local alignments should start with a positive score! 745 if score <= 0: 746 continue 747 # Local alignments should not end with a gap!: 748 trace = trace_matrix[row][col] 749 if (trace - trace % 2) % 4 == 2: # Trace contains 'nogap', fine! 750 trace_matrix[row][col] = 2 751 # If not, don't start here! 752 else: 753 continue 754 end = -max(lenA - row, lenB - col) 755 if not end: 756 end = None 757 col_distance = lenB - col 758 row_distance = lenA - row 759 ali_seqA = ((col_distance - row_distance) * gap_char + 760 sequenceA[lenA - 1:row - 1:-1]) 761 ali_seqB = ((row_distance - col_distance) * gap_char + 762 sequenceB[lenB - 1:col - 1:-1]) 763 in_process += [(ali_seqA, ali_seqB, end, row, col, False, 764 trace_matrix[row][col])] 765 while in_process and len(tracebacks) < MAX_ALIGNMENTS: 766 # Although we allow a gap in seqB to be followed by a gap in seqA, 767 # we don't want to allow it the other way round, since this would 768 # give redundant alignments of type: A- vs. -A 769 # -B B- 770 # Thus we need to keep track if a gap in seqA was opened (col_gap) 771 # and stop the backtrace (dead_end) if a gap in seqB follows. 772 dead_end = False 773 ali_seqA, ali_seqB, end, row, col, col_gap, trace = in_process.pop() 774 775 while (row > 0 or col > 0) and not dead_end: 776 cache = (ali_seqA[:], ali_seqB[:], end, row, col, col_gap) 777 778 # If trace is empty we have reached at least one border of the 779 # matrix or the end of a local aligment. Just add the rest of 780 # the sequence(s) and fill with gaps if neccessary. 781 if not trace: 782 if col and col_gap: 783 dead_end = True 784 else: 785 ali_seqA, ali_seqB = _finish_backtrace( 786 sequenceA, sequenceB, ali_seqA, ali_seqB, 787 row, col, gap_char) 788 break 789 elif trace % 2 == 1: # = row open = open gap in seqA 790 trace -= 1 791 if col_gap: 792 dead_end = True 793 else: 794 col -= 1 795 ali_seqA += gap_char 796 ali_seqB += sequenceB[col:col + 1] 797 col_gap = False 798 elif trace % 4 == 2: # = match/mismatch of seqA with seqB 799 trace -= 2 800 row -= 1 801 col -= 1 802 ali_seqA += sequenceA[row:row + 1] 803 ali_seqB += sequenceB[col:col + 1] 804 col_gap = False 805 elif trace % 8 == 4: # = col open = open gap in seqB 806 trace -= 4 807 row -= 1 808 ali_seqA += sequenceA[row:row + 1] 809 ali_seqB += gap_char 810 col_gap = True 811 elif trace in (8, 24): # = row extend = extend gap in seqA 812 trace -= 8 813 if col_gap: 814 dead_end = True 815 else: 816 col_gap = False 817 # We need to find the starting point of the extended gap 818 x = _find_gap_open(sequenceA, sequenceB, ali_seqA, 819 ali_seqB, end, row, col, col_gap, 820 gap_char, score_matrix, trace_matrix, 821 in_process, gap_A_fn, col, row, 'col') 822 ali_seqA, ali_seqB, row, col, in_process, dead_end = x 823 elif trace == 16: # = col extend = extend gap in seqB 824 trace -= 16 825 col_gap = True 826 x = _find_gap_open(sequenceA, sequenceB, ali_seqA, ali_seqB, 827 end, row, col, col_gap, gap_char, 828 score_matrix, trace_matrix, in_process, 829 gap_B_fn, row, col, 'row') 830 ali_seqA, ali_seqB, row, col, in_process, dead_end = x 831 832 if trace: # There is another path to follow... 833 cache += (trace,) 834 in_process.append(cache) 835 trace = trace_matrix[row][col] 836 if not align_globally and score_matrix[row][col] <= 0: 837 begin = max(row, col) 838 trace = 0 839 if not dead_end: 840 tracebacks.append((ali_seqA[::-1], ali_seqB[::-1], score, begin, 841 end)) 842 if one_alignment_only: 843 break 844 return _clean_alignments(tracebacks)
845 846
847 -def _find_start(score_matrix, align_globally):
848 """Return a list of starting points (score, (row, col)) (PRIVATE). 849 850 Indicating every possible place to start the tracebacks. 851 """ 852 nrows, ncols = len(score_matrix), len(score_matrix[0]) 853 # In this implementation of the global algorithm, the start will always be 854 # the bottom right corner of the matrix. 855 if align_globally: 856 starts = [(score_matrix[-1][-1], (nrows - 1, ncols - 1))] 857 else: 858 starts = [] 859 for row in range(nrows): 860 for col in range(ncols): 861 score = score_matrix[row][col] 862 starts.append((score, (row, col))) 863 return starts
864 865
866 -def _clean_alignments(alignments):
867 """Take a list of alignments and return a cleaned version (PRIVATE). 868 869 Remove duplicates, make sure begin and end are set correctly, remove 870 empty alignments. 871 """ 872 unique_alignments = [] 873 for align in alignments: 874 if align not in unique_alignments: 875 unique_alignments.append(align) 876 i = 0 877 while i < len(unique_alignments): 878 seqA, seqB, score, begin, end = unique_alignments[i] 879 # Make sure end is set reasonably. 880 if end is None: # global alignment 881 end = len(seqA) 882 elif end < 0: 883 end = end + len(seqA) 884 # If there's no alignment here, get rid of it. 885 if begin >= end: 886 del unique_alignments[i] 887 continue 888 unique_alignments[i] = seqA, seqB, score, begin, end 889 i += 1 890 return unique_alignments
891 892
893 -def _finish_backtrace(sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, 894 gap_char):
895 """Add remaining sequences and fill with gaps if neccessary (PRIVATE).""" 896 if row: 897 ali_seqA += sequenceA[row - 1::-1] 898 if col: 899 ali_seqB += sequenceB[col - 1::-1] 900 if row > col: 901 ali_seqB += gap_char * (len(ali_seqA) - len(ali_seqB)) 902 elif col > row: 903 ali_seqA += gap_char * (len(ali_seqB) - len(ali_seqA)) 904 return ali_seqA, ali_seqB
905 906
907 -def _find_gap_open(sequenceA, sequenceB, ali_seqA, ali_seqB, end, row, col, 908 col_gap, gap_char, score_matrix, trace_matrix, in_process, 909 gap_fn, target, index, direction):
910 """Find the starting point(s) of the extended gap (PRIVATE).""" 911 dead_end = False 912 target_score = score_matrix[row][col] 913 for n in range(target): 914 if direction == 'col': 915 col -= 1 916 ali_seqA += gap_char 917 ali_seqB += sequenceB[col:col + 1] 918 else: 919 row -= 1 920 ali_seqA += sequenceA[row:row + 1] 921 ali_seqB += gap_char 922 actual_score = score_matrix[row][col] + gap_fn(index, n + 1) 923 if rint(actual_score) == rint(target_score) and n > 0: 924 if not trace_matrix[row][col]: 925 break 926 else: 927 in_process.append((ali_seqA[:], ali_seqB[:], end, row, col, 928 col_gap, trace_matrix[row][col])) 929 if not trace_matrix[row][col]: 930 dead_end = True 931 return ali_seqA, ali_seqB, row, col, in_process, dead_end
932 933 934 _PRECISION = 1000 935 936
937 -def rint(x, precision=_PRECISION):
938 """Print number with declared precision.""" 939 return int(x * precision + 0.5)
940 941
942 -class identity_match(object):
943 """Create a match function for use in an alignment. 944 945 match and mismatch are the scores to give when two residues are equal 946 or unequal. By default, match is 1 and mismatch is 0. 947 """ 948
949 - def __init__(self, match=1, mismatch=0):
950 """Initialize the class.""" 951 self.match = match 952 self.mismatch = mismatch
953
954 - def __call__(self, charA, charB):
955 """Call a match function instance already created.""" 956 if charA == charB: 957 return self.match 958 return self.mismatch
959 960
961 -class dictionary_match(object):
962 """Create a match function for use in an alignment. 963 964 Attributes: 965 - score_dict - A dictionary where the keys are tuples (residue 1, 966 residue 2) and the values are the match scores between those residues. 967 - symmetric - A flag that indicates whether the scores are symmetric. 968 969 """ 970
971 - def __init__(self, score_dict, symmetric=1):
972 """Initialize the class.""" 973 self.score_dict = score_dict 974 self.symmetric = symmetric
975
976 - def __call__(self, charA, charB):
977 """Call a dictionary match instance already created.""" 978 if self.symmetric and (charA, charB) not in self.score_dict: 979 # If the score dictionary is symmetric, then look up the 980 # score both ways. 981 charB, charA = charA, charB 982 return self.score_dict[(charA, charB)]
983 984
985 -class affine_penalty(object):
986 """Create a gap function for use in an alignment.""" 987
988 - def __init__(self, open, extend, penalize_extend_when_opening=0):
989 """Initialize the class.""" 990 if open > 0 or extend > 0: 991 raise ValueError("Gap penalties should be non-positive.") 992 if not penalize_extend_when_opening and (extend < open): 993 raise ValueError("Gap opening penalty should be higher than " + 994 "gap extension penalty (or equal)") 995 self.open, self.extend = open, extend 996 self.penalize_extend_when_opening = penalize_extend_when_opening
997
998 - def __call__(self, index, length):
999 """Call a gap function instance already created.""" 1000 return calc_affine_penalty( 1001 length, self.open, self.extend, self.penalize_extend_when_opening)
1002 1003
1004 -def calc_affine_penalty(length, open, extend, penalize_extend_when_opening):
1005 """Calculate a penality score for the gap function.""" 1006 if length <= 0: 1007 return 0 1008 penalty = open + extend * length 1009 if not penalize_extend_when_opening: 1010 penalty -= extend 1011 return penalty
1012 1013 1026 1027
1028 -def format_alignment(align1, align2, score, begin, end):
1029 """Format the alignment prettily into a string.""" 1030 s = [] 1031 s.append("%s\n" % align1) 1032 s.append("%s%s\n" % (" " * begin, "|" * (end - begin))) 1033 s.append("%s\n" % align2) 1034 s.append(" Score=%g\n" % score) 1035 return ''.join(s)
1036 1037 1038 # Try and load C implementations of functions. If I can't, 1039 # then throw a warning and use the pure Python implementations. 1040 # The redefinition is deliberate, thus the no quality assurance 1041 # flag for when using flake8: 1042 try: 1043 from .cpairwise2 import rint, _make_score_matrix_fast # noqa 1044 except ImportError: 1045 warnings.warn('Import of C module failed. Falling back to pure Python ' + 1046 'implementation. This may be slooow...', BiopythonWarning) 1047