Package Bio :: Package AlignIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.AlignIO

  1  # Copyright 2008-2018 by Peter Cock.  All rights reserved. 
  2  # 
  3  # This file is part of the Biopython distribution and governed by your 
  4  # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". 
  5  # Please see the LICENSE file that should have been included as part of this 
  6  # package. 
  7   
  8  """Multiple sequence alignment input/output as alignment objects. 
  9   
 10  The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in 
 11  fact the two are connected internally.  Both modules use the same set of file 
 12  format names (lower case strings).  From the user's perspective, you can read 
 13  in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you 
 14  can read in the sequences within these alignmenta using Bio.SeqIO. 
 15   
 16  Bio.AlignIO is also documented at http://biopython.org/wiki/AlignIO and by 
 17  a whole chapter in our tutorial: 
 18   
 19  * `HTML Tutorial`_ 
 20  * `PDF Tutorial`_ 
 21   
 22  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
 23  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
 24   
 25  Input 
 26  ----- 
 27  For the typical special case when your file or handle contains one and only 
 28  one alignment, use the function Bio.AlignIO.read().  This takes an input file 
 29  handle (or in recent versions of Biopython a filename as a string), format 
 30  string and optional number of sequences per alignment.  It will return a single 
 31  MultipleSeqAlignment object (or raise an exception if there isn't just one 
 32  alignment): 
 33   
 34  >>> from Bio import AlignIO 
 35  >>> align = AlignIO.read("Phylip/interlaced.phy", "phylip") 
 36  >>> print(align) 
 37  SingleLetterAlphabet() alignment with 3 rows and 384 columns 
 38  -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI 
 39  MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU 
 40  ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN 
 41   
 42  For the general case, when the handle could contain any number of alignments, 
 43  use the function Bio.AlignIO.parse(...) which takes the same arguments, but 
 44  returns an iterator giving MultipleSeqAlignment objects (typically used in a 
 45  for loop). If you want random access to the alignments by number, turn this 
 46  into a list: 
 47   
 48  >>> from Bio import AlignIO 
 49  >>> alignments = list(AlignIO.parse("Emboss/needle.txt", "emboss")) 
 50  >>> print(alignments[2]) 
 51  SingleLetterAlphabet() alignment with 2 rows and 120 columns 
 52  -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec 
 53  LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver 
 54   
 55  Most alignment file formats can be concatenated so as to hold as many 
 56  different multiple sequence alignments as possible.  One common example 
 57  is the output of the tool seqboot in the PHLYIP suite.  Sometimes there 
 58  can be a file header and footer, as seen in the EMBOSS alignment output. 
 59   
 60  Output 
 61  ------ 
 62  Use the function Bio.AlignIO.write(...), which takes a complete set of 
 63  Alignment objects (either as a list, or an iterator), an output file handle 
 64  (or filename in recent versions of Biopython) and of course the file format:: 
 65   
 66      from Bio import AlignIO 
 67      alignments = ... 
 68      count = SeqIO.write(alignments, "example.faa", "fasta") 
 69   
 70  If using a handle make sure to close it to flush the data to the disk:: 
 71   
 72      from Bio import AlignIO 
 73      alignments = ... 
 74      with open("example.faa", "w") as handle: 
 75          count = SeqIO.write(alignments, handle, "fasta") 
 76   
 77  In general, you are expected to call this function once (with all your 
 78  alignments) and then close the file handle.  However, for file formats 
 79  like PHYLIP where multiple alignments are stored sequentially (with no file 
 80  header and footer), then multiple calls to the write function should work as 
 81  expected when using handles. 
 82   
 83  If you are using a filename, the repeated calls to the write functions will 
 84  overwrite the existing file each time. 
 85   
 86  Conversion 
 87  ---------- 
 88  The Bio.AlignIO.convert(...) function allows an easy interface for simple 
 89  alignment file format conversions. Additionally, it may use file format 
 90  specific optimisations so this should be the fastest way too. 
 91   
 92  In general however, you can combine the Bio.AlignIO.parse(...) function with 
 93  the Bio.AlignIO.write(...) function for sequence file conversion. Using 
 94  generator expressions provides a memory efficient way to perform filtering or 
 95  other extra operations as part of the process. 
 96   
 97  File Formats 
 98  ------------ 
 99  When specifying the file format, use lowercase strings.  The same format 
100  names are also used in Bio.SeqIO and include the following: 
101   
102    - clustal -   Output from Clustal W or X, see also the module Bio.Clustalw 
103      which can be used to run the command line tool from Biopython. 
104    - emboss    - EMBOSS tools' "pairs" and "simple" alignment formats. 
105    - fasta     - The generic sequence file format where each record starts with 
106      an identifer line starting with a ">" character, followed by 
107      lines of sequence. 
108    - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA 
109      tools when used with the -m 10 command line option for machine 
110      readable output. 
111    - ig        - The IntelliGenetics file format, apparently the same as the 
112      MASE alignment format. 
113    - nexus     - Output from NEXUS, see also the module Bio.Nexus which can also 
114      read any phylogenetic trees in these files. 
115    - phylip    - Interlaced PHYLIP, as used by the PHLIP tools. 
116    - phylip-sequential - Sequential PHYLIP. 
117    - phylip-relaxed - PHYLIP like format allowing longer names. 
118    - stockholm - A richly annotated alignment file format used by PFAM. 
119    - mauve - Output from progressiveMauve/Mauve 
120   
121  Note that while Bio.AlignIO can read all the above file formats, it cannot 
122  write to all of them. 
123   
124  You can also use any file format supported by Bio.SeqIO, such as "fasta" or 
125  "ig" (which are listed above), PROVIDED the sequences in your file are all the 
126  same length. 
127  """ 
128   
129   
130  from __future__ import print_function 
131  from Bio._py3k import basestring 
132   
133  # TODO 
134  # - define policy on reading aligned sequences with gaps in 
135  #   (e.g. - and . characters) including how the alphabet interacts 
136  # 
137  # - Can we build the to_alignment(...) functionality 
138  #   into the generic Alignment class instead? 
139  # 
140  # - How best to handle unique/non unique record.id when writing. 
141  #   For most file formats reading such files is fine; The stockholm 
142  #   parser would fail. 
143  # 
144  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
145  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
146   
147  from Bio.Align import MultipleSeqAlignment 
148  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
149  from Bio.File import as_handle 
150   
151  from . import StockholmIO 
152  from . import ClustalIO 
153  from . import NexusIO 
154  from . import PhylipIO 
155  from . import EmbossIO 
156  from . import FastaIO 
157  from . import MafIO 
158  from . import MauveIO 
159   
160  # Convention for format names is "mainname-subtype" in lower case. 
161  # Please use the same names as BioPerl and EMBOSS where possible. 
162   
163  _FormatToIterator = {  # "fasta" is done via Bio.SeqIO 
164                       "clustal": ClustalIO.ClustalIterator, 
165                       "emboss": EmbossIO.EmbossIterator, 
166                       "fasta-m10": FastaIO.FastaM10Iterator, 
167                       "maf": MafIO.MafIterator, 
168                       "mauve": MauveIO.MauveIterator, 
169                       "nexus": NexusIO.NexusIterator, 
170                       "phylip": PhylipIO.PhylipIterator, 
171                       "phylip-sequential": PhylipIO.SequentialPhylipIterator, 
172                       "phylip-relaxed": PhylipIO.RelaxedPhylipIterator, 
173                       "stockholm": StockholmIO.StockholmIterator, 
174                       } 
175   
176  _FormatToWriter = {  # "fasta" is done via Bio.SeqIO 
177                       # "emboss" : EmbossIO.EmbossWriter, (unfinished) 
178                     "clustal": ClustalIO.ClustalWriter, 
179                     "maf": MafIO.MafWriter, 
180                     "mauve": MauveIO.MauveWriter, 
181                     "nexus": NexusIO.NexusWriter, 
182                     "phylip": PhylipIO.PhylipWriter, 
183                     "phylip-sequential": PhylipIO.SequentialPhylipWriter, 
184                     "phylip-relaxed": PhylipIO.RelaxedPhylipWriter, 
185                     "stockholm": StockholmIO.StockholmWriter, 
186                     } 
187   
188   
189 -def write(alignments, handle, format):
190 """Write complete set of alignments to a file. 191 192 Arguments: 193 - alignments - A list (or iterator) of MultipleSeqAlignment objects, 194 or a single alignment object. 195 - handle - File handle object to write to, or filename as string 196 (note older versions of Biopython only took a handle). 197 - format - lower case string describing the file format to write. 198 199 You should close the handle after calling this function. 200 201 Returns the number of alignments written (as an integer). 202 """ 203 from Bio import SeqIO 204 205 # Try and give helpful error messages: 206 if not isinstance(format, basestring): 207 raise TypeError("Need a string for the file format (lower case)") 208 if not format: 209 raise ValueError("Format required (lower case string)") 210 if format != format.lower(): 211 raise ValueError("Format string '%s' should be lower case" % format) 212 213 if isinstance(alignments, MultipleSeqAlignment): 214 # This raised an exception in older versions of Biopython 215 alignments = [alignments] 216 217 with as_handle(handle, 'w') as fp: 218 # Map the file format to a writer class 219 if format in _FormatToWriter: 220 writer_class = _FormatToWriter[format] 221 count = writer_class(fp).write_file(alignments) 222 elif format in SeqIO._FormatToWriter: 223 # Exploit the existing SeqIO parser to do the dirty work! 224 # TODO - Can we make one call to SeqIO.write() and count the alignments? 225 count = 0 226 for alignment in alignments: 227 if not isinstance(alignment, MultipleSeqAlignment): 228 raise TypeError("Expect a list or iterator of MultipleSeqAlignment " 229 "objects, got: %r" % alignment) 230 SeqIO.write(alignment, fp, format) 231 count += 1 232 elif format in _FormatToIterator or format in SeqIO._FormatToIterator: 233 raise ValueError("Reading format '%s' is supported, but not writing" 234 % format) 235 else: 236 raise ValueError("Unknown format '%s'" % format) 237 238 if not isinstance(count, int): 239 raise RuntimeError("Internal error - the underlying %s " 240 "writer should have returned the alignment count, not %s" 241 % (format, repr(count))) 242 243 return count
244 245 246 # This is a generator function!
247 -def _SeqIO_to_alignment_iterator(handle, format, alphabet=None, seq_count=None):
248 """Use Bio.SeqIO to create an MultipleSeqAlignment iterator (PRIVATE). 249 250 Arguments: 251 - handle - handle to the file. 252 - format - string describing the file format. 253 - alphabet - optional Alphabet object, useful when the sequence type 254 cannot be automatically inferred from the file itself 255 (e.g. fasta, phylip, clustal) 256 - seq_count - Optional integer, number of sequences expected in each 257 alignment. Recommended for fasta format files. 258 259 If count is omitted (default) then all the sequences in the file are 260 combined into a single MultipleSeqAlignment. 261 """ 262 from Bio import SeqIO 263 264 if format not in SeqIO._FormatToIterator: 265 raise ValueError("Unknown format '%s'" % format) 266 267 if seq_count: 268 # Use the count to split the records into batches. 269 seq_record_iterator = SeqIO.parse(handle, format, alphabet) 270 271 records = [] 272 for record in seq_record_iterator: 273 records.append(record) 274 if len(records) == seq_count: 275 yield MultipleSeqAlignment(records, alphabet) 276 records = [] 277 if records: 278 raise ValueError("Check seq_count argument, not enough sequences?") 279 else: 280 # Must assume that there is a single alignment using all 281 # the SeqRecord objects: 282 records = list(SeqIO.parse(handle, format, alphabet)) 283 if records: 284 yield MultipleSeqAlignment(records, alphabet)
285 286
287 -def _force_alphabet(alignment_iterator, alphabet):
288 """Iterate over alignments, over-riding the alphabet (PRIVATE).""" 289 # Assume the alphabet argument has been pre-validated 290 given_base_class = _get_base_alphabet(alphabet).__class__ 291 for align in alignment_iterator: 292 if not isinstance(_get_base_alphabet(align._alphabet), 293 given_base_class): 294 raise ValueError("Specified alphabet %s clashes with " 295 "that determined from the file, %s" 296 % (repr(alphabet), repr(align._alphabet))) 297 for record in align: 298 if not isinstance(_get_base_alphabet(record.seq.alphabet), 299 given_base_class): 300 raise ValueError("Specified alphabet %s clashes with " 301 "that determined from the file, %s" 302 % (repr(alphabet), repr(record.seq.alphabet))) 303 record.seq.alphabet = alphabet 304 align._alphabet = alphabet 305 yield align
306 307
308 -def parse(handle, format, seq_count=None, alphabet=None):
309 """Iterate over an alignment file as MultipleSeqAlignment objects. 310 311 Arguments: 312 - handle - handle to the file, or the filename as a string 313 (note older versions of Biopython only took a handle). 314 - format - string describing the file format. 315 - alphabet - optional Alphabet object, useful when the sequence type 316 cannot be automatically inferred from the file itself 317 (e.g. fasta, phylip, clustal) 318 - seq_count - Optional integer, number of sequences expected in each 319 alignment. Recommended for fasta format files. 320 321 If you have the file name in a string 'filename', use: 322 323 >>> from Bio import AlignIO 324 >>> filename = "Emboss/needle.txt" 325 >>> format = "emboss" 326 >>> for alignment in AlignIO.parse(filename, format): 327 ... print("Alignment of length %i" % alignment.get_alignment_length()) 328 Alignment of length 124 329 Alignment of length 119 330 Alignment of length 120 331 Alignment of length 118 332 Alignment of length 125 333 334 If you have a string 'data' containing the file contents, use:: 335 336 from Bio import AlignIO 337 from StringIO import StringIO 338 my_iterator = AlignIO.parse(StringIO(data), format) 339 340 Use the Bio.AlignIO.read() function when you expect a single record only. 341 """ 342 from Bio import SeqIO 343 344 # Try and give helpful error messages: 345 if not isinstance(format, basestring): 346 raise TypeError("Need a string for the file format (lower case)") 347 if not format: 348 raise ValueError("Format required (lower case string)") 349 if format != format.lower(): 350 raise ValueError("Format string '%s' should be lower case" % format) 351 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 352 isinstance(alphabet, AlphabetEncoder)): 353 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 354 if seq_count is not None and not isinstance(seq_count, int): 355 raise TypeError("Need integer for seq_count (sequences per alignment)") 356 357 with as_handle(handle, 'rU') as fp: 358 # Map the file format to a sequence iterator: 359 if format in _FormatToIterator: 360 iterator_generator = _FormatToIterator[format] 361 if alphabet is None: 362 i = iterator_generator(fp, seq_count) 363 else: 364 try: 365 # Initially assume the optional alphabet argument is supported 366 i = iterator_generator(fp, seq_count, alphabet=alphabet) 367 except TypeError: 368 # It isn't supported. 369 i = _force_alphabet(iterator_generator(fp, seq_count), 370 alphabet) 371 372 elif format in SeqIO._FormatToIterator: 373 # Exploit the existing SeqIO parser to the dirty work! 374 i = _SeqIO_to_alignment_iterator(fp, format, 375 alphabet=alphabet, 376 seq_count=seq_count) 377 else: 378 raise ValueError("Unknown format '%s'" % format) 379 380 # TODO: As of Python 3.3, can write "yield from i" instead. See PEP380. 381 # For loop imposes some overhead... wait until we drop Python 2.7 to fix it. 382 for a in i: 383 yield a
384 385
386 -def read(handle, format, seq_count=None, alphabet=None):
387 """Turn an alignment file into a single MultipleSeqAlignment object. 388 389 Arguments: 390 - handle - handle to the file, or the filename as a string 391 (note older versions of Biopython only took a handle). 392 - format - string describing the file format. 393 - alphabet - optional Alphabet object, useful when the sequence type 394 cannot be automatically inferred from the file itself 395 (e.g. fasta, phylip, clustal) 396 - seq_count - Optional integer, number of sequences expected in each 397 alignment. Recommended for fasta format files. 398 399 If the handle contains no alignments, or more than one alignment, 400 an exception is raised. For example, using a PFAM/Stockholm file 401 containing one alignment: 402 403 >>> from Bio import AlignIO 404 >>> filename = "Clustalw/protein.aln" 405 >>> format = "clustal" 406 >>> alignment = AlignIO.read(filename, format) 407 >>> print("Alignment of length %i" % alignment.get_alignment_length()) 408 Alignment of length 411 409 410 If however you want the first alignment from a file containing 411 multiple alignments this function would raise an exception. 412 413 >>> from Bio import AlignIO 414 >>> filename = "Emboss/needle.txt" 415 >>> format = "emboss" 416 >>> alignment = AlignIO.read(filename, format) 417 Traceback (most recent call last): 418 ... 419 ValueError: More than one record found in handle 420 421 Instead use: 422 423 >>> from Bio import AlignIO 424 >>> filename = "Emboss/needle.txt" 425 >>> format = "emboss" 426 >>> alignment = next(AlignIO.parse(filename, format)) 427 >>> print("First alignment has length %i" % alignment.get_alignment_length()) 428 First alignment has length 124 429 430 You must use the Bio.AlignIO.parse() function if you want to read multiple 431 records from the handle. 432 """ 433 iterator = parse(handle, format, seq_count, alphabet) 434 try: 435 first = next(iterator) 436 except StopIteration: 437 first = None 438 if first is None: 439 raise ValueError("No records found in handle") 440 try: 441 second = next(iterator) 442 except StopIteration: 443 second = None 444 if second is not None: 445 raise ValueError("More than one record found in handle") 446 if seq_count: 447 if len(first) != seq_count: 448 raise RuntimeError("More sequences found in alignment than specified in seq_count: %s." % seq_count) 449 return first
450 451
452 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
453 """Convert between two alignment files, returns number of alignments. 454 455 Arguments: 456 - in_file - an input handle or filename 457 - in_format - input file format, lower case string 458 - output - an output handle or filename 459 - out_file - output file format, lower case string 460 - alphabet - optional alphabet to assume 461 462 **NOTE** - If you provide an output filename, it will be opened which will 463 overwrite any existing file without warning. This may happen if even the 464 conversion is aborted (e.g. an invalid out_format name is given). 465 """ 466 # TODO - Add optimised versions of important conversions 467 # For now just off load the work to SeqIO parse/write 468 with as_handle(in_file, 'rU') as in_handle: 469 # Don't open the output file until we've checked the input is OK: 470 alignments = parse(in_handle, in_format, None, alphabet) 471 472 # This will check the arguments and issue error messages, 473 # after we have opened the file which is a shame. 474 with as_handle(out_file, 'w') as out_handle: 475 count = write(alignments, out_handle, out_format) 476 477 return count
478 479 480 if __name__ == "__main__": 481 from Bio._utils import run_doctest 482 run_doctest() 483