Package Bio :: Module pairwise2
[hide private]
[frames] | no frames]

Source Code for Module Bio.pairwise2

  1  # Copyright 2002 by Jeffrey Chang. 
  2  # Copyright 2016 by Markus Piotrowski. 
  3  # All rights reserved. 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """This package implements pairwise sequence alignment using a dynamic 
  9  programming algorithm. 
 10   
 11  This provides functions to get global and local alignments between two 
 12  sequences. A global alignment finds the best concordance between all 
 13  characters in two sequences. A local alignment finds just the 
 14  subsequences that align the best. 
 15   
 16  When doing alignments, you can specify the match score and gap 
 17  penalties.  The match score indicates the compatibility between an 
 18  alignment of two characters in the sequences. Highly compatible 
 19  characters should be given positive scores, and incompatible ones 
 20  should be given negative scores or 0.  The gap penalties should be 
 21  negative. 
 22   
 23  The names of the alignment functions in this module follow the 
 24  convention 
 25  <alignment type>XX 
 26  where <alignment type> is either "global" or "local" and XX is a 2 
 27  character code indicating the parameters it takes.  The first 
 28  character indicates the parameters for matches (and mismatches), and 
 29  the second indicates the parameters for gap penalties. 
 30   
 31  The match parameters are:: 
 32   
 33      CODE  DESCRIPTION 
 34      x     No parameters. Identical characters have score of 1, otherwise 0. 
 35      m     A match score is the score of identical chars, otherwise mismatch 
 36            score. 
 37      d     A dictionary returns the score of any pair of characters. 
 38      c     A callback function returns scores. 
 39   
 40  The gap penalty parameters are:: 
 41   
 42      CODE  DESCRIPTION 
 43      x     No gap penalties. 
 44      s     Same open and extend gap penalties for both sequences. 
 45      d     The sequences have different open and extend gap penalties. 
 46      c     A callback function returns the gap penalties. 
 47   
 48  All the different alignment functions are contained in an object 
 49  ``align``. For example: 
 50   
 51      >>> from Bio import pairwise2 
 52      >>> alignments = pairwise2.align.globalxx("ACCGT", "ACG") 
 53   
 54  will return a list of the alignments between the two strings. For a nice 
 55  printout, use the ``format_alignment`` method of the module: 
 56   
 57      >>> from Bio.pairwise2 import format_alignment 
 58      >>> print(format_alignment(*alignments[0])) 
 59      ACCGT 
 60      ||||| 
 61      A-CG- 
 62        Score=3 
 63      <BLANKLINE> 
 64   
 65  All alignment functions have the following arguments: 
 66   
 67  - Two sequences: strings, Biopython sequence objects or lists. 
 68    Lists are useful for suppling sequences which contain residues that are 
 69    encoded by more than one letter. 
 70   
 71  - ``penalize_extend_when_opening``: boolean (default: False). 
 72    Whether to count an extension penalty when opening a gap. If false, a gap of 
 73    1 is only penalized an "open" penalty, otherwise it is penalized 
 74    "open+extend". 
 75   
 76  - ``penalize_end_gaps``: boolean. 
 77    Whether to count the gaps at the ends of an alignment. By default, they are 
 78    counted for global alignments but not for local ones. Setting 
 79    ``penalize_end_gaps`` to (boolean, boolean) allows you to specify for the 
 80    two sequences separately whether gaps at the end of the alignment should be 
 81    counted. 
 82   
 83  - ``gap_char``: string (default: ``'-'``). 
 84    Which character to use as a gap character in the alignment returned. If your 
 85    input sequences are lists, you must change this to ``['-']``. 
 86   
 87  - ``force_generic``: boolean (default: False). 
 88    Always use the generic, non-cached, dynamic programming function (slow!). 
 89    For debugging. 
 90   
 91  - ``score_only``: boolean (default: False). 
 92    Only get the best score, don't recover any alignments. The return value of 
 93    the function is the score. Faster and uses less memory. 
 94   
 95  - ``one_alignment_only``: boolean (default: False). 
 96    Only recover one alignment. 
 97   
 98  The other parameters of the alignment function depend on the function called. 
 99  Some examples: 
100   
101  - Find the best global alignment between the two sequences. Identical 
102    characters are given 1 point. No points are deducted for mismatches or gaps. 
103   
104      >>> for a in pairwise2.align.globalxx("ACCGT", "ACG"): 
105      ...     print(format_alignment(*a)) 
106      ACCGT 
107      ||||| 
108      A-CG- 
109        Score=3 
110      <BLANKLINE> 
111      ACCGT 
112      ||||| 
113      AC-G- 
114        Score=3 
115      <BLANKLINE> 
116   
117  - Same thing as before, but with a local alignment. 
118   
119      >>> for a in pairwise2.align.localxx("ACCGT", "ACG"): 
120      ...     print(format_alignment(*a)) 
121      ACCGT 
122      |||| 
123      A-CG- 
124        Score=3 
125      <BLANKLINE> 
126      ACCGT 
127      |||| 
128      AC-G- 
129        Score=3 
130      <BLANKLINE> 
131   
132  - Do a global alignment. Identical characters are given 2 points, 1 point is 
133    deducted for each non-identical character. Don't penalize gaps. 
134   
135      >>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1): 
136      ...     print(format_alignment(*a)) 
137      ACCGT 
138      ||||| 
139      A-CG- 
140        Score=6 
141      <BLANKLINE> 
142      ACCGT 
143      ||||| 
144      AC-G- 
145        Score=6 
146      <BLANKLINE> 
147   
148  - Same as above, except now 0.5 points are deducted when opening a gap, and 
149    0.1 points are deducted when extending it. 
150   
151      >>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1): 
152      ...     print(format_alignment(*a)) 
153      ACCGT 
154      ||||| 
155      A-CG- 
156        Score=5 
157      <BLANKLINE> 
158      ACCGT 
159      ||||| 
160      AC-G- 
161        Score=5 
162      <BLANKLINE> 
163   
164  - Depending on the penalties, a gap in one sequence may be followed by a gap in 
165    the other sequence.If you don't like this behaviour, increase the gap-open 
166    penalty: 
167   
168      >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1): 
169      ...     print(format_alignment(*a)) 
170      A- 
171      || 
172      -T 
173        Score=-2 
174      <BLANKLINE> 
175      >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1): 
176      ...     print(format_alignment(*a)) 
177      A 
178      | 
179      T 
180        Score=-4 
181      <BLANKLINE> 
182   
183  - The alignment function can also use known matrices already included in 
184    Biopython (``MatrixInfo`` from ``Bio.SubsMat``): 
185   
186      >>> from Bio.SubsMat import MatrixInfo as matlist 
187      >>> matrix = matlist.blosum62 
188      >>> for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix): 
189      ...     print(format_alignment(*a)) 
190      KEVLA 
191      ||||| 
192      -EVL- 
193        Score=13 
194      <BLANKLINE> 
195   
196  - With the parameter ``c`` you can define your own match- and gap functions. 
197    E.g. to define an affine logarithmic gap function and using it: 
198   
199      >>> from math import log 
200      >>> def gap_function(x, y):  # x is gap position in seq, y is gap length 
201      ...     if y == 0:  # No gap 
202      ...         return 0 
203      ...     elif y == 1:  # Gap open penalty 
204      ...         return -2 
205      ...     return - (2 + y/4.0 + log(y)/2.0) 
206      ... 
207      >>> alignment = pairwise2.align.globalmc("ACCCCCGT", "ACG", 5, -4, 
208      ...                                      gap_function, gap_function) 
209   
210    You can define different gap functions for each sequence. 
211    Self-defined match functions must take the two residues to be compared and 
212    return a score. 
213   
214  To see a description of the parameters for a function, please look at 
215  the docstring for the function via the help function, e.g. 
216  type ``help(pairwise2.align.localds``) at the Python prompt. 
217   
218  """ 
219  from __future__ import print_function 
220   
221  import warnings 
222   
223  from Bio import BiopythonWarning 
224   
225   
226  MAX_ALIGNMENTS = 1000   # maximum alignments recovered in traceback 
227   
228   
229 -class align(object):
230 """This class provides functions that do alignments.""" 231
232 - class alignment_function(object):
233 """This class is callable impersonates an alignment function. 234 The constructor takes the name of the function. This class 235 will decode the name of the function to figure out how to 236 interpret the parameters. 237 238 """ 239 # match code -> tuple of (parameters, docstring) 240 match2args = { 241 'x': ([], ''), 242 'm': (['match', 'mismatch'], 243 "match is the score to given to identical characters. " 244 "mismatch is the score given to non-identical ones."), 245 'd': (['match_dict'], 246 "match_dict is a dictionary where the keys are tuples " 247 "of pairs of characters and the values are the scores, " 248 "e.g. ('A', 'C') : 2.5."), 249 'c': (['match_fn'], 250 "match_fn is a callback function that takes two " 251 "characters and returns the score between them."), 252 } 253 # penalty code -> tuple of (parameters, docstring) 254 penalty2args = { 255 'x': ([], ''), 256 's': (['open', 'extend'], 257 "open and extend are the gap penalties when a gap is " 258 "opened and extended. They should be negative."), 259 'd': (['openA', 'extendA', 'openB', 'extendB'], 260 "openA and extendA are the gap penalties for sequenceA, " 261 "and openB and extendB for sequeneB. The penalties " 262 "should be negative."), 263 'c': (['gap_A_fn', 'gap_B_fn'], 264 "gap_A_fn and gap_B_fn are callback functions that takes " 265 "(1) the index where the gap is opened, and (2) the length " 266 "of the gap. They should return a gap penalty."), 267 } 268
269 - def __init__(self, name):
270 # Check to make sure the name of the function is 271 # reasonable. 272 if name.startswith("global"): 273 if len(name) != 8: 274 raise AttributeError("function should be globalXX") 275 elif name.startswith("local"): 276 if len(name) != 7: 277 raise AttributeError("function should be localXX") 278 else: 279 raise AttributeError(name) 280 align_type, match_type, penalty_type = \ 281 name[:-2], name[-2], name[-1] 282 try: 283 match_args, match_doc = self.match2args[match_type] 284 except KeyError: 285 raise AttributeError("unknown match type %r" % match_type) 286 try: 287 penalty_args, penalty_doc = self.penalty2args[penalty_type] 288 except KeyError: 289 raise AttributeError("unknown penalty type %r" % penalty_type) 290 291 # Now get the names of the parameters to this function. 292 param_names = ['sequenceA', 'sequenceB'] 293 param_names.extend(match_args) 294 param_names.extend(penalty_args) 295 self.function_name = name 296 self.align_type = align_type 297 self.param_names = param_names 298 299 self.__name__ = self.function_name 300 # Set the doc string. 301 doc = "%s(%s) -> alignments\n" % ( 302 self.__name__, ', '.join(self.param_names)) 303 if match_doc: 304 doc += "\n%s\n" % match_doc 305 if penalty_doc: 306 doc += "\n%s\n" % penalty_doc 307 doc += ("""\ 308 \nalignments is a list of tuples (seqA, seqB, score, begin, end). 309 seqA and seqB are strings showing the alignment between the 310 sequences. score is the score of the alignment. begin and end 311 are indexes into seqA and seqB that indicate the where the 312 alignment occurs. 313 """) 314 self.__doc__ = doc
315
316 - def decode(self, *args, **keywds):
317 # Decode the arguments for the _align function. keywds 318 # will get passed to it, so translate the arguments to 319 # this function into forms appropriate for _align. 320 keywds = keywds.copy() 321 if len(args) != len(self.param_names): 322 raise TypeError("%s takes exactly %d argument (%d given)" 323 % (self.function_name, len(self.param_names), 324 len(args))) 325 i = 0 326 while i < len(self.param_names): 327 if self.param_names[i] in [ 328 'sequenceA', 'sequenceB', 329 'gap_A_fn', 'gap_B_fn', 'match_fn']: 330 keywds[self.param_names[i]] = args[i] 331 i += 1 332 elif self.param_names[i] == 'match': 333 assert self.param_names[i + 1] == 'mismatch' 334 match, mismatch = args[i], args[i + 1] 335 keywds['match_fn'] = identity_match(match, mismatch) 336 i += 2 337 elif self.param_names[i] == 'match_dict': 338 keywds['match_fn'] = dictionary_match(args[i]) 339 i += 1 340 elif self.param_names[i] == 'open': 341 assert self.param_names[i + 1] == 'extend' 342 open, extend = args[i], args[i + 1] 343 pe = keywds.get('penalize_extend_when_opening', 0) 344 keywds['gap_A_fn'] = affine_penalty(open, extend, pe) 345 keywds['gap_B_fn'] = affine_penalty(open, extend, pe) 346 i += 2 347 elif self.param_names[i] == 'openA': 348 assert self.param_names[i + 3] == 'extendB' 349 openA, extendA, openB, extendB = args[i:i + 4] 350 pe = keywds.get('penalize_extend_when_opening', 0) 351 keywds['gap_A_fn'] = affine_penalty(openA, extendA, pe) 352 keywds['gap_B_fn'] = affine_penalty(openB, extendB, pe) 353 i += 4 354 else: 355 raise ValueError("unknown parameter %r" 356 % self.param_names[i]) 357 358 # Here are the default parameters for _align. Assign 359 # these to keywds, unless already specified. 360 pe = keywds.get('penalize_extend_when_opening', 0) 361 default_params = [ 362 ('match_fn', identity_match(1, 0)), 363 ('gap_A_fn', affine_penalty(0, 0, pe)), 364 ('gap_B_fn', affine_penalty(0, 0, pe)), 365 ('penalize_extend_when_opening', 0), 366 ('penalize_end_gaps', self.align_type == 'global'), 367 ('align_globally', self.align_type == 'global'), 368 ('gap_char', '-'), 369 ('force_generic', 0), 370 ('score_only', 0), 371 ('one_alignment_only', 0), 372 ] 373 for name, default in default_params: 374 keywds[name] = keywds.get(name, default) 375 value = keywds['penalize_end_gaps'] 376 try: 377 n = len(value) 378 except TypeError: 379 keywds['penalize_end_gaps'] = tuple([value] * 2) 380 else: 381 assert n == 2 382 return keywds
383
384 - def __call__(self, *args, **keywds):
385 keywds = self.decode(*args, **keywds) 386 return _align(**keywds)
387
388 - def __getattr__(self, attr):
389 return self.alignment_function(attr)
390 391 392 align = align() 393 394
395 -def _align(sequenceA, sequenceB, match_fn, gap_A_fn, gap_B_fn, 396 penalize_extend_when_opening, penalize_end_gaps, 397 align_globally, gap_char, force_generic, score_only, 398 one_alignment_only):
399 """Return a list of alignments between two sequences or its score""" 400 if not sequenceA or not sequenceB: 401 return [] 402 try: 403 sequenceA + gap_char 404 sequenceB + gap_char 405 except TypeError: 406 raise TypeError('both sequences must be of the same type, either ' + 407 'string/sequence object or list. Gap character must ' + 408 'fit the sequence type (string or list)') 409 410 if not isinstance(sequenceA, list): 411 sequenceA = str(sequenceA) 412 if not isinstance(sequenceB, list): 413 sequenceB = str(sequenceB) 414 415 if (not force_generic) and isinstance(gap_A_fn, affine_penalty) \ 416 and isinstance(gap_B_fn, affine_penalty): 417 open_A, extend_A = gap_A_fn.open, gap_A_fn.extend 418 open_B, extend_B = gap_B_fn.open, gap_B_fn.extend 419 x = _make_score_matrix_fast( 420 sequenceA, sequenceB, match_fn, open_A, extend_A, open_B, 421 extend_B, penalize_extend_when_opening, penalize_end_gaps, 422 align_globally, score_only) 423 else: 424 x = _make_score_matrix_generic( 425 sequenceA, sequenceB, match_fn, gap_A_fn, gap_B_fn, 426 penalize_end_gaps, align_globally, score_only) 427 score_matrix, trace_matrix = x 428 429 # print("SCORE %s" % print_matrix(score_matrix)) 430 # print("TRACEBACK %s" % print_matrix(trace_matrix)) 431 432 # Look for the proper starting point. Get a list of all possible 433 # starting points. 434 starts = _find_start(score_matrix, align_globally) 435 # Find the highest score. 436 best_score = max([x[0] for x in starts]) 437 438 # If they only want the score, then return it. 439 if score_only: 440 return best_score 441 442 tolerance = 0 # XXX do anything with this? 443 # Now find all the positions within some tolerance of the best 444 # score. 445 starts = [(score, pos) for score, pos in starts 446 if rint(abs(score - best_score)) <= rint(tolerance)] 447 448 # Recover the alignments and return them. 449 return _recover_alignments(sequenceA, sequenceB, starts, score_matrix, 450 trace_matrix, align_globally, gap_char, 451 one_alignment_only, gap_A_fn, gap_B_fn)
452 453
454 -def _make_score_matrix_generic(sequenceA, sequenceB, match_fn, gap_A_fn, 455 gap_B_fn, penalize_end_gaps, align_globally, 456 score_only):
457 """Generate a score and traceback matrix according to Needleman-Wunsch 458 459 This implementation allows the usage of general gap functions and is rather 460 slow. It is automatically called if you define your own gap functions. You 461 can force the usage of this method with ``force_generic=True``. 462 """ 463 # Create the score and traceback matrices. These should be in the 464 # shape: 465 # sequenceA (down) x sequenceB (across) 466 lenA, lenB = len(sequenceA), len(sequenceB) 467 score_matrix, trace_matrix = [], [] 468 for i in range(lenA + 1): 469 score_matrix.append([None] * (lenB + 1)) 470 if not score_only: 471 trace_matrix.append([None] * (lenB + 1)) 472 473 # Initialize first row and column with gap scores. This is like opening up 474 # i gaps at the beginning of sequence A or B. 475 for i in range(lenA + 1): 476 if penalize_end_gaps[1]: # [1]:gap in sequence B 477 score = gap_B_fn(0, i) 478 else: 479 score = 0 480 score_matrix[i][0] = score 481 for i in range(lenB + 1): 482 if penalize_end_gaps[0]: # [0]:gap in sequence A 483 score = gap_A_fn(0, i) 484 else: 485 score = 0 486 score_matrix[0][i] = score 487 488 # Fill in the score matrix. Each position in the matrix 489 # represents an alignment between a character from sequence A to 490 # one in sequence B. As I iterate through the matrix, find the 491 # alignment by choose the best of: 492 # 1) extending a previous alignment without gaps 493 # 2) adding a gap in sequenceA 494 # 3) adding a gap in sequenceB 495 for row in range(1, lenA + 1): 496 for col in range(1, lenB + 1): 497 # First, calculate the score that would occur by extending 498 # the alignment without gaps. 499 nogap_score = score_matrix[row - 1][col - 1] + \ 500 match_fn(sequenceA[row - 1], sequenceB[col - 1]) 501 502 # Try to find a better score by opening gaps in sequenceA. 503 # Do this by checking alignments from each column in the row. 504 # Each column represents a different character to align from, 505 # and thus a different length gap. 506 # Although the gap function does not distinguish between opening 507 # and extending a gap, we distinguish them for the backtrace. 508 if not penalize_end_gaps[0] and row == lenA: 509 row_open = score_matrix[row][col - 1] 510 row_extend = max([score_matrix[row][x] for x in range(col)]) 511 else: 512 row_open = score_matrix[row][col - 1] + gap_A_fn(row, 1) 513 row_extend = max([score_matrix[row][x] + gap_A_fn(row, col - x) 514 for x in range(col)]) 515 516 # Try to find a better score by opening gaps in sequenceB. 517 if not penalize_end_gaps[1] and col == lenB: 518 col_open = score_matrix[row - 1][col] 519 col_extend = max([score_matrix[x][col] for x in range(row)]) 520 else: 521 col_open = score_matrix[row - 1][col] + gap_B_fn(col, 1) 522 col_extend = max([score_matrix[x][col] + gap_B_fn(col, row - x) 523 for x in range(row)]) 524 525 best_score = max(nogap_score, row_open, row_extend, col_open, 526 col_extend) 527 if not align_globally and best_score < 0: 528 score_matrix[row][col] = 0 529 else: 530 score_matrix[row][col] = best_score 531 532 # The backtrace is encoded binary. See _make_score_matrix_fast 533 # for details. 534 if not score_only: 535 trace_score = 0 536 if rint(nogap_score) == rint(best_score): 537 trace_score += 2 538 if rint(row_open) == rint(best_score): 539 trace_score += 1 540 if rint(row_extend) == rint(best_score): 541 trace_score += 8 542 if rint(col_open) == rint(best_score): 543 trace_score += 4 544 if rint(col_extend) == rint(best_score): 545 trace_score += 16 546 trace_matrix[row][col] = trace_score 547 548 return score_matrix, trace_matrix
549 550
551 -def _make_score_matrix_fast(sequenceA, sequenceB, match_fn, open_A, extend_A, 552 open_B, extend_B, penalize_extend_when_opening, 553 penalize_end_gaps, align_globally, score_only):
554 """Generate a score and traceback matrix according to Gotoh""" 555 # This is an implementation of the Needleman-Wunsch dynamic programming 556 # algorithm as modified by Gotoh, implementing affine gap penalties. 557 # In short, we have three matrices, holding scores for alignments ending 558 # in (1) a match/mismatch, (2) a gap in sequence A, and (3) a gap in 559 # sequence B, respectively. However, we can combine them in one matrix, 560 # which holds the best scores, and store only those values from the 561 # other matrices that are actually used for the next step of calculation. 562 # The traceback matrix holds the positions for backtracing the alignment. 563 564 first_A_gap = calc_affine_penalty(1, open_A, extend_A, 565 penalize_extend_when_opening) 566 first_B_gap = calc_affine_penalty(1, open_B, extend_B, 567 penalize_extend_when_opening) 568 569 # Create the score and traceback matrices. These should be in the 570 # shape: 571 # sequenceA (down) x sequenceB (across) 572 lenA, lenB = len(sequenceA), len(sequenceB) 573 score_matrix, trace_matrix = [], [] 574 for i in range(lenA + 1): 575 score_matrix.append([None] * (lenB + 1)) 576 if not score_only: 577 trace_matrix.append([None] * (lenB + 1)) 578 579 # Initialize first row and column with gap scores. This is like opening up 580 # i gaps at the beginning of sequence A or B. 581 for i in range(lenA + 1): 582 if penalize_end_gaps[1]: # [1]:gap in sequence B 583 score = calc_affine_penalty(i, open_B, extend_B, 584 penalize_extend_when_opening) 585 else: 586 score = 0 587 score_matrix[i][0] = score 588 for i in range(lenB + 1): 589 if penalize_end_gaps[0]: # [0]:gap in sequence A 590 score = calc_affine_penalty(i, open_A, extend_A, 591 penalize_extend_when_opening) 592 else: 593 score = 0 594 score_matrix[0][i] = score 595 596 # Now initialize the col 'matrix'. Actually this is only a one dimensional 597 # list, since we only need the col scores from the last row. 598 col_score = [0] # Best score, if actual alignment ends with gap in seqB 599 for i in range(1, lenB + 1): 600 col_score.append(calc_affine_penalty(i, 2 * open_B, extend_B, 601 penalize_extend_when_opening)) 602 603 # The row 'matrix' is calculated on the fly. Here we only need the actual 604 # score. 605 # Now, filling up the score and traceback matrices: 606 for row in range(1, lenA + 1): 607 row_score = calc_affine_penalty(row, 2 * open_A, extend_A, 608 penalize_extend_when_opening) 609 for col in range(1, lenB + 1): 610 # Calculate the score that would occur by extending the 611 # alignment without gaps. 612 nogap_score = score_matrix[row - 1][col - 1] + \ 613 match_fn(sequenceA[row - 1], sequenceB[col - 1]) 614 615 # Check the score that would occur if there were a gap in 616 # sequence A. This could come from opening a new gap or 617 # extending an existing one. 618 # A gap in sequence A can also be opened if it follows a gap in 619 # sequence B: A- 620 # -B 621 if not penalize_end_gaps[0] and row == lenA: 622 row_open = score_matrix[row][col - 1] 623 row_extend = row_score 624 else: 625 row_open = score_matrix[row][col - 1] + first_A_gap 626 row_extend = row_score + extend_A 627 row_score = max(row_open, row_extend) 628 629 # The same for sequence B: 630 if not penalize_end_gaps[1] and col == lenB: 631 col_open = score_matrix[row - 1][col] 632 col_extend = col_score[col] 633 else: 634 col_open = score_matrix[row - 1][col] + first_B_gap 635 col_extend = col_score[col] + extend_B 636 col_score[col] = max(col_open, col_extend) 637 638 best_score = max(nogap_score, col_score[col], row_score) 639 if not align_globally and best_score < 0: 640 score_matrix[row][col] = 0 641 else: 642 score_matrix[row][col] = best_score 643 644 # Now the trace_matrix. The edges of the backtrace are encoded 645 # binary: 1 = open gap in seqA, 2 = match/mismatch of seqA and 646 # seqB, 4 = open gap in seqB, 8 = extend gap in seqA, and 647 # 16 = extend gap in seqA. This values can be summed up. 648 # Thus, the trace score 7 means that the best score can either 649 # come from opening a gap in seqA (=1), pairing two characters 650 # of seqA and seqB (+2=3) or opening a gap in seqB (+4=7). 651 # However, if we only want the score we don't care about the trace. 652 if not score_only: 653 row_score_rint = rint(row_score) 654 col_score_rint = rint(col_score[col]) 655 row_trace_score = 0 656 col_trace_score = 0 657 if rint(row_open) == row_score_rint: 658 row_trace_score += 1 # Open gap in seqA 659 if rint(row_extend) == row_score_rint: 660 row_trace_score += 8 # Extend gap in seqA 661 if rint(col_open) == col_score_rint: 662 col_trace_score += 4 # Open gap in seqB 663 if rint(col_extend) == col_score_rint: 664 col_trace_score += 16 # Extend gap in seqB 665 666 trace_score = 0 667 best_score_rint = rint(best_score) 668 if rint(nogap_score) == best_score_rint: 669 trace_score += 2 # Align seqA with seqB 670 if row_score_rint == best_score_rint: 671 trace_score += row_trace_score 672 if col_score_rint == best_score_rint: 673 trace_score += col_trace_score 674 trace_matrix[row][col] = trace_score 675 676 return score_matrix, trace_matrix
677 678
679 -def _recover_alignments(sequenceA, sequenceB, starts, score_matrix, 680 trace_matrix, align_globally, gap_char, 681 one_alignment_only, gap_A_fn, gap_B_fn):
682 """Do the backtracing and return a list of alignments""" 683 # Recover the alignments by following the traceback matrix. This 684 # is a recursive procedure, but it's implemented here iteratively 685 # with a stack. 686 lenA, lenB = len(sequenceA), len(sequenceB) 687 ali_seqA, ali_seqB = sequenceA[0:0], sequenceB[0:0] 688 tracebacks = [] 689 in_process = [] 690 691 for start in starts: 692 score, (row, col) = start 693 begin = 0 694 if align_globally: 695 end = None 696 else: 697 # Local alignments should start with a positive score! 698 if score <= 0: 699 continue 700 # Local alignments should not end with a gap!: 701 trace = trace_matrix[row][col] 702 if (trace - trace % 2) % 4 == 2: # Trace contains 'nogap', fine! 703 trace_matrix[row][col] = 2 704 # If not, don't start here! 705 else: 706 continue 707 end = -max(lenA - row, lenB - col) 708 if not end: 709 end = None 710 col_distance = lenB - col 711 row_distance = lenA - row 712 ali_seqA = ((col_distance - row_distance) * gap_char + 713 sequenceA[lenA - 1:row - 1:-1]) 714 ali_seqB = ((row_distance - col_distance) * gap_char + 715 sequenceB[lenB - 1:col - 1:-1]) 716 in_process += [(ali_seqA, ali_seqB, end, row, col, False, 717 trace_matrix[row][col])] 718 while in_process and len(tracebacks) < MAX_ALIGNMENTS: 719 # Although we allow a gap in seqB to be followed by a gap in seqA, 720 # we don't want to allow it the other way round, since this would 721 # give redundant alignments of type: A- vs. -A 722 # -B B- 723 # Thus we need to keep track if a gap in seqA was opened (col_gap) 724 # and stop the backtrace (dead_end) if a gap in seqB follows. 725 dead_end = False 726 ali_seqA, ali_seqB, end, row, col, col_gap, trace = in_process.pop() 727 728 while (row > 0 or col > 0) and not dead_end: 729 cache = (ali_seqA[:], ali_seqB[:], end, row, col, col_gap) 730 731 # If trace is empty we have reached at least one border of the 732 # matrix or the end of a local aligment. Just add the rest of 733 # the sequence(s) and fill with gaps if neccessary. 734 if not trace: 735 if col and col_gap: 736 dead_end = True 737 else: 738 ali_seqA, ali_seqB = _finish_backtrace( 739 sequenceA, sequenceB, ali_seqA, ali_seqB, 740 row, col, gap_char) 741 break 742 elif trace % 2 == 1: # = row open = open gap in seqA 743 trace -= 1 744 if col_gap: 745 dead_end = True 746 else: 747 col -= 1 748 ali_seqA += gap_char 749 ali_seqB += sequenceB[col] 750 col_gap = False 751 elif trace % 4 == 2: # = match/mismatch of seqA with seqB 752 trace -= 2 753 row -= 1 754 col -= 1 755 ali_seqA += sequenceA[row] 756 ali_seqB += sequenceB[col] 757 col_gap = False 758 elif trace % 8 == 4: # = col open = open gap in seqB 759 trace -= 4 760 row -= 1 761 ali_seqA += sequenceA[row] 762 ali_seqB += gap_char 763 col_gap = True 764 elif trace in (8, 24): # = row extend = extend gap in seqA 765 trace -= 8 766 if col_gap: 767 dead_end = True 768 else: 769 col_gap = False 770 # We need to find the starting point of the extended gap 771 x = _find_gap_open(sequenceA, sequenceB, ali_seqA, 772 ali_seqB, end, row, col, col_gap, 773 gap_char, score_matrix, trace_matrix, 774 in_process, gap_A_fn, col, row, 'col') 775 ali_seqA, ali_seqB, row, col, in_process, dead_end = x 776 elif trace == 16: # = col extend = extend gap in seqB 777 trace -= 16 778 col_gap = True 779 x = _find_gap_open(sequenceA, sequenceB, ali_seqA, ali_seqB, 780 end, row, col, col_gap, gap_char, 781 score_matrix, trace_matrix, in_process, 782 gap_B_fn, row, col, 'row') 783 ali_seqA, ali_seqB, row, col, in_process, dead_end = x 784 785 if trace: # There is another path to follow... 786 cache += (trace,) 787 in_process.append(cache) 788 trace = trace_matrix[row][col] 789 if not align_globally and score_matrix[row][col] <= 0: 790 begin = max(row, col) 791 trace = 0 792 if not dead_end: 793 tracebacks.append((ali_seqA[::-1], ali_seqB[::-1], score, begin, 794 end)) 795 if one_alignment_only: 796 break 797 return _clean_alignments(tracebacks)
798 799
800 -def _find_start(score_matrix, align_globally):
801 """Return a list of starting points (score, (row, col)). 802 803 Indicating every possible place to start the tracebacks. 804 """ 805 nrows, ncols = len(score_matrix), len(score_matrix[0]) 806 # In this implementation of the global algorithm, the start will always be 807 # the bottom right corner of the matrix. 808 if align_globally: 809 starts = [(score_matrix[-1][-1], (nrows - 1, ncols - 1))] 810 else: 811 starts = [] 812 for row in range(nrows): 813 for col in range(ncols): 814 score = score_matrix[row][col] 815 starts.append((score, (row, col))) 816 return starts
817 818
819 -def _clean_alignments(alignments):
820 """Take a list of alignments and return a cleaned version.""" 821 # Remove duplicates, make sure begin and end are set correctly, remove 822 # empty alignments. 823 unique_alignments = [] 824 for align in alignments: 825 if align not in unique_alignments: 826 unique_alignments.append(align) 827 i = 0 828 while i < len(unique_alignments): 829 seqA, seqB, score, begin, end = unique_alignments[i] 830 # Make sure end is set reasonably. 831 if end is None: # global alignment 832 end = len(seqA) 833 elif end < 0: 834 end = end + len(seqA) 835 # If there's no alignment here, get rid of it. 836 if begin >= end: 837 del unique_alignments[i] 838 continue 839 unique_alignments[i] = seqA, seqB, score, begin, end 840 i += 1 841 return unique_alignments
842 843
844 -def _finish_backtrace(sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, 845 gap_char):
846 """Add remaining sequences and fill with gaps if neccessary""" 847 if row: 848 ali_seqA += sequenceA[row - 1::-1] 849 if col: 850 ali_seqB += sequenceB[col - 1::-1] 851 if row > col: 852 ali_seqB += gap_char * (len(ali_seqA) - len(ali_seqB)) 853 elif col > row: 854 ali_seqA += gap_char * (len(ali_seqB) - len(ali_seqA)) 855 return ali_seqA, ali_seqB
856 857
858 -def _find_gap_open(sequenceA, sequenceB, ali_seqA, ali_seqB, end, row, col, 859 col_gap, gap_char, score_matrix, trace_matrix, in_process, 860 gap_fn, target, index, direction):
861 """Find the starting point(s) of the extended gap""" 862 dead_end = False 863 target_score = score_matrix[row][col] 864 for n in range(target): 865 if direction == 'col': 866 col -= 1 867 ali_seqA += gap_char 868 ali_seqB += sequenceB[col] 869 else: 870 row -= 1 871 ali_seqA += sequenceA[row] 872 ali_seqB += gap_char 873 actual_score = score_matrix[row][col] + gap_fn(index, n + 1) 874 if rint(actual_score) == rint(target_score) and n > 0: 875 if not trace_matrix[row][col]: 876 break 877 else: 878 in_process.append((ali_seqA[:], ali_seqB[:], end, row, col, 879 col_gap, trace_matrix[row][col])) 880 if not trace_matrix[row][col]: 881 dead_end = True 882 return ali_seqA, ali_seqB, row, col, in_process, dead_end
883 884 885 _PRECISION = 1000 886 887
888 -def rint(x, precision=_PRECISION):
889 return int(x * precision + 0.5)
890 891
892 -class identity_match(object):
893 """identity_match([match][, mismatch]) -> match_fn 894 895 Create a match function for use in an alignment. match and 896 mismatch are the scores to give when two residues are equal or 897 unequal. By default, match is 1 and mismatch is 0. 898 """
899 - def __init__(self, match=1, mismatch=0):
900 self.match = match 901 self.mismatch = mismatch
902
903 - def __call__(self, charA, charB):
904 if charA == charB: 905 return self.match 906 return self.mismatch
907 908
909 -class dictionary_match(object):
910 """dictionary_match(score_dict[, symmetric]) -> match_fn 911 912 Create a match function for use in an alignment. score_dict is a 913 dictionary where the keys are tuples (residue 1, residue 2) and 914 the values are the match scores between those residues. symmetric 915 is a flag that indicates whether the scores are symmetric. If 916 true, then if (res 1, res 2) doesn't exist, I will use the score 917 at (res 2, res 1). 918 """
919 - def __init__(self, score_dict, symmetric=1):
920 self.score_dict = score_dict 921 self.symmetric = symmetric
922
923 - def __call__(self, charA, charB):
924 if self.symmetric and (charA, charB) not in self.score_dict: 925 # If the score dictionary is symmetric, then look up the 926 # score both ways. 927 charB, charA = charA, charB 928 return self.score_dict[(charA, charB)]
929 930
931 -class affine_penalty(object):
932 """affine_penalty(open, extend[, penalize_extend_when_opening]) -> gap_fn 933 934 Create a gap function for use in an alignment. 935 """
936 - def __init__(self, open, extend, penalize_extend_when_opening=0):
937 if open > 0 or extend > 0: 938 raise ValueError("Gap penalties should be non-positive.") 939 if not penalize_extend_when_opening and (extend < open): 940 raise ValueError("Gap opening penalty should be higher than " + 941 "gap extension penalty (or equal)") 942 self.open, self.extend = open, extend 943 self.penalize_extend_when_opening = penalize_extend_when_opening
944
945 - def __call__(self, index, length):
946 return calc_affine_penalty( 947 length, self.open, self.extend, self.penalize_extend_when_opening)
948 949
950 -def calc_affine_penalty(length, open, extend, penalize_extend_when_opening):
951 if length <= 0: 952 return 0 953 penalty = open + extend * length 954 if not penalize_extend_when_opening: 955 penalty -= extend 956 return penalty
957 958 974 975
976 -def format_alignment(align1, align2, score, begin, end):
977 """format_alignment(align1, align2, score, begin, end) -> string 978 979 Format the alignment prettily into a string. 980 """ 981 s = [] 982 s.append("%s\n" % align1) 983 s.append("%s%s\n" % (" " * begin, "|" * (end - begin))) 984 s.append("%s\n" % align2) 985 s.append(" Score=%g\n" % score) 986 return ''.join(s)
987 988 989 # Try and load C implementations of functions. If I can't, 990 # then throw a warning and use the pure Python implementations. 991 # The redefinition is deliberate, thus the no quality assurance 992 # flag for when using flake8: 993 try: 994 from .cpairwise2 import rint, _make_score_matrix_fast # noqa 995 except ImportError: 996 warnings.warn('Import of C module failed. Falling back to pure Python ' + 997 'implementation. This may be slooow...', BiopythonWarning) 998