Package Bio :: Package AlignIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.AlignIO

  1  # Copyright 2008-2017 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Multiple sequence alignment input/output as alignment objects. 
  7   
  8  The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in 
  9  fact the two are connected internally.  Both modules use the same set of file 
 10  format names (lower case strings).  From the user's perspective, you can read 
 11  in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you 
 12  can read in the sequences within these alignmenta using Bio.SeqIO. 
 13   
 14  Bio.AlignIO is also documented at http://biopython.org/wiki/AlignIO and by 
 15  a whole chapter in our tutorial: 
 16   
 17  * `HTML Tutorial`_ 
 18  * `PDF Tutorial`_ 
 19   
 20  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
 21  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
 22   
 23  Input 
 24  ----- 
 25  For the typical special case when your file or handle contains one and only 
 26  one alignment, use the function Bio.AlignIO.read().  This takes an input file 
 27  handle (or in recent versions of Biopython a filename as a string), format 
 28  string and optional number of sequences per alignment.  It will return a single 
 29  MultipleSeqAlignment object (or raise an exception if there isn't just one 
 30  alignment): 
 31   
 32  >>> from Bio import AlignIO 
 33  >>> align = AlignIO.read("Phylip/interlaced.phy", "phylip") 
 34  >>> print(align) 
 35  SingleLetterAlphabet() alignment with 3 rows and 384 columns 
 36  -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI 
 37  MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU 
 38  ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN 
 39   
 40  For the general case, when the handle could contain any number of alignments, 
 41  use the function Bio.AlignIO.parse(...) which takes the same arguments, but 
 42  returns an iterator giving MultipleSeqAlignment objects (typically used in a 
 43  for loop). If you want random access to the alignments by number, turn this 
 44  into a list: 
 45   
 46  >>> from Bio import AlignIO 
 47  >>> alignments = list(AlignIO.parse("Emboss/needle.txt", "emboss")) 
 48  >>> print(alignments[2]) 
 49  SingleLetterAlphabet() alignment with 2 rows and 120 columns 
 50  -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec 
 51  LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver 
 52   
 53  Most alignment file formats can be concatenated so as to hold as many 
 54  different multiple sequence alignments as possible.  One common example 
 55  is the output of the tool seqboot in the PHLYIP suite.  Sometimes there 
 56  can be a file header and footer, as seen in the EMBOSS alignment output. 
 57   
 58  Output 
 59  ------ 
 60  Use the function Bio.AlignIO.write(...), which takes a complete set of 
 61  Alignment objects (either as a list, or an iterator), an output file handle 
 62  (or filename in recent versions of Biopython) and of course the file format:: 
 63   
 64    from Bio import AlignIO 
 65    alignments = ... 
 66    count = SeqIO.write(alignments, "example.faa", "fasta") 
 67   
 68  If using a handle make sure to close it to flush the data to the disk:: 
 69   
 70    from Bio import AlignIO 
 71    alignments = ... 
 72    with open("example.faa", "w") as handle: 
 73      count = SeqIO.write(alignments, handle, "fasta") 
 74   
 75  In general, you are expected to call this function once (with all your 
 76  alignments) and then close the file handle.  However, for file formats 
 77  like PHYLIP where multiple alignments are stored sequentially (with no file 
 78  header and footer), then multiple calls to the write function should work as 
 79  expected when using handles. 
 80   
 81  If you are using a filename, the repeated calls to the write functions will 
 82  overwrite the existing file each time. 
 83   
 84  Conversion 
 85  ---------- 
 86  The Bio.AlignIO.convert(...) function allows an easy interface for simple 
 87  alignment file format conversions. Additionally, it may use file format 
 88  specific optimisations so this should be the fastest way too. 
 89   
 90  In general however, you can combine the Bio.AlignIO.parse(...) function with 
 91  the Bio.AlignIO.write(...) function for sequence file conversion. Using 
 92  generator expressions provides a memory efficient way to perform filtering or 
 93  other extra operations as part of the process. 
 94   
 95  File Formats 
 96  ------------ 
 97  When specifying the file format, use lowercase strings.  The same format 
 98  names are also used in Bio.SeqIO and include the following: 
 99   
100    - clustal -   Output from Clustal W or X, see also the module Bio.Clustalw 
101      which can be used to run the command line tool from Biopython. 
102    - emboss    - EMBOSS tools' "pairs" and "simple" alignment formats. 
103    - fasta     - The generic sequence file format where each record starts with 
104      an identifer line starting with a ">" character, followed by 
105      lines of sequence. 
106    - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA 
107      tools when used with the -m 10 command line option for machine 
108      readable output. 
109    - ig        - The IntelliGenetics file format, apparently the same as the 
110      MASE alignment format. 
111    - nexus     - Output from NEXUS, see also the module Bio.Nexus which can also 
112      read any phylogenetic trees in these files. 
113    - phylip    - Interlaced PHYLIP, as used by the PHLIP tools. 
114    - phylip-sequential - Sequential PHYLIP. 
115    - phylip-relaxed - PHYLIP like format allowing longer names. 
116    - stockholm - A richly annotated alignment file format used by PFAM. 
117    - mauve - Output from progressiveMauve/Mauve 
118   
119  Note that while Bio.AlignIO can read all the above file formats, it cannot 
120  write to all of them. 
121   
122  You can also use any file format supported by Bio.SeqIO, such as "fasta" or 
123  "ig" (which are listed above), PROVIDED the sequences in your file are all the 
124  same length. 
125  """ 
126   
127   
128  from __future__ import print_function 
129  from Bio._py3k import basestring 
130   
131  # TODO 
132  # - define policy on reading aligned sequences with gaps in 
133  #   (e.g. - and . characters) including how the alphabet interacts 
134  # 
135  # - Can we build the to_alignment(...) functionality 
136  #   into the generic Alignment class instead? 
137  # 
138  # - How best to handle unique/non unique record.id when writing. 
139  #   For most file formats reading such files is fine; The stockholm 
140  #   parser would fail. 
141  # 
142  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
143  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
144   
145  from Bio.Align import MultipleSeqAlignment 
146  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
147  from Bio.File import as_handle 
148   
149  from . import StockholmIO 
150  from . import ClustalIO 
151  from . import NexusIO 
152  from . import PhylipIO 
153  from . import EmbossIO 
154  from . import FastaIO 
155  from . import MafIO 
156  from . import MauveIO 
157   
158  # Convention for format names is "mainname-subtype" in lower case. 
159  # Please use the same names as BioPerl and EMBOSS where possible. 
160   
161  _FormatToIterator = {  # "fasta" is done via Bio.SeqIO 
162                       "clustal": ClustalIO.ClustalIterator, 
163                       "emboss": EmbossIO.EmbossIterator, 
164                       "fasta-m10": FastaIO.FastaM10Iterator, 
165                       "maf": MafIO.MafIterator, 
166                       "mauve": MauveIO.MauveIterator, 
167                       "nexus": NexusIO.NexusIterator, 
168                       "phylip": PhylipIO.PhylipIterator, 
169                       "phylip-sequential": PhylipIO.SequentialPhylipIterator, 
170                       "phylip-relaxed": PhylipIO.RelaxedPhylipIterator, 
171                       "stockholm": StockholmIO.StockholmIterator, 
172                       } 
173   
174  _FormatToWriter = {  # "fasta" is done via Bio.SeqIO 
175                       # "emboss" : EmbossIO.EmbossWriter, (unfinished) 
176                     "clustal": ClustalIO.ClustalWriter, 
177                     "maf": MafIO.MafWriter, 
178                     "mauve": MauveIO.MauveWriter, 
179                     "nexus": NexusIO.NexusWriter, 
180                     "phylip": PhylipIO.PhylipWriter, 
181                     "phylip-sequential": PhylipIO.SequentialPhylipWriter, 
182                     "phylip-relaxed": PhylipIO.RelaxedPhylipWriter, 
183                     "stockholm": StockholmIO.StockholmWriter, 
184                     } 
185   
186   
187 -def write(alignments, handle, format):
188 """Write complete set of alignments to a file. 189 190 Arguments: 191 - alignments - A list (or iterator) of MultipleSeqAlignment objects, 192 or a single alignment object. 193 - handle - File handle object to write to, or filename as string 194 (note older versions of Biopython only took a handle). 195 - format - lower case string describing the file format to write. 196 197 You should close the handle after calling this function. 198 199 Returns the number of alignments written (as an integer). 200 """ 201 from Bio import SeqIO 202 203 # Try and give helpful error messages: 204 if not isinstance(format, basestring): 205 raise TypeError("Need a string for the file format (lower case)") 206 if not format: 207 raise ValueError("Format required (lower case string)") 208 if format != format.lower(): 209 raise ValueError("Format string '%s' should be lower case" % format) 210 211 if isinstance(alignments, MultipleSeqAlignment): 212 # This raised an exception in older versions of Biopython 213 alignments = [alignments] 214 215 with as_handle(handle, 'w') as fp: 216 # Map the file format to a writer class 217 if format in _FormatToWriter: 218 writer_class = _FormatToWriter[format] 219 count = writer_class(fp).write_file(alignments) 220 elif format in SeqIO._FormatToWriter: 221 # Exploit the existing SeqIO parser to do the dirty work! 222 # TODO - Can we make one call to SeqIO.write() and count the alignments? 223 count = 0 224 for alignment in alignments: 225 if not isinstance(alignment, MultipleSeqAlignment): 226 raise TypeError("Expect a list or iterator of MultipleSeqAlignment " 227 "objects, got: %r" % alignment) 228 SeqIO.write(alignment, fp, format) 229 count += 1 230 elif format in _FormatToIterator or format in SeqIO._FormatToIterator: 231 raise ValueError("Reading format '%s' is supported, but not writing" 232 % format) 233 else: 234 raise ValueError("Unknown format '%s'" % format) 235 236 assert isinstance(count, int), "Internal error - the underlying %s " \ 237 "writer should have returned the alignment count, not %s" \ 238 % (format, repr(count)) 239 240 return count
241 242 243 # This is a generator function!
244 -def _SeqIO_to_alignment_iterator(handle, format, alphabet=None, seq_count=None):
245 """Use Bio.SeqIO to create an MultipleSeqAlignment iterator (PRIVATE). 246 247 Arguments: 248 - handle - handle to the file. 249 - format - string describing the file format. 250 - alphabet - optional Alphabet object, useful when the sequence type 251 cannot be automatically inferred from the file itself 252 (e.g. fasta, phylip, clustal) 253 - seq_count - Optional integer, number of sequences expected in each 254 alignment. Recommended for fasta format files. 255 256 If count is omitted (default) then all the sequences in the file are 257 combined into a single MultipleSeqAlignment. 258 """ 259 from Bio import SeqIO 260 assert format in SeqIO._FormatToIterator 261 262 if seq_count: 263 # Use the count to split the records into batches. 264 seq_record_iterator = SeqIO.parse(handle, format, alphabet) 265 266 records = [] 267 for record in seq_record_iterator: 268 records.append(record) 269 if len(records) == seq_count: 270 yield MultipleSeqAlignment(records, alphabet) 271 records = [] 272 if records: 273 raise ValueError("Check seq_count argument, not enough sequences?") 274 else: 275 # Must assume that there is a single alignment using all 276 # the SeqRecord objects: 277 records = list(SeqIO.parse(handle, format, alphabet)) 278 if records: 279 yield MultipleSeqAlignment(records, alphabet)
280 281
282 -def _force_alphabet(alignment_iterator, alphabet):
283 """Iterate over alignments, over-riding the alphabet (PRIVATE).""" 284 # Assume the alphabet argument has been pre-validated 285 given_base_class = _get_base_alphabet(alphabet).__class__ 286 for align in alignment_iterator: 287 if not isinstance(_get_base_alphabet(align._alphabet), 288 given_base_class): 289 raise ValueError("Specified alphabet %s clashes with " 290 "that determined from the file, %s" 291 % (repr(alphabet), repr(align._alphabet))) 292 for record in align: 293 if not isinstance(_get_base_alphabet(record.seq.alphabet), 294 given_base_class): 295 raise ValueError("Specified alphabet %s clashes with " 296 "that determined from the file, %s" 297 % (repr(alphabet), repr(record.seq.alphabet))) 298 record.seq.alphabet = alphabet 299 align._alphabet = alphabet 300 yield align
301 302
303 -def parse(handle, format, seq_count=None, alphabet=None):
304 """Iterate over an alignment file as MultipleSeqAlignment objects. 305 306 Arguments: 307 - handle - handle to the file, or the filename as a string 308 (note older versions of Biopython only took a handle). 309 - format - string describing the file format. 310 - alphabet - optional Alphabet object, useful when the sequence type 311 cannot be automatically inferred from the file itself 312 (e.g. fasta, phylip, clustal) 313 - seq_count - Optional integer, number of sequences expected in each 314 alignment. Recommended for fasta format files. 315 316 If you have the file name in a string 'filename', use: 317 318 >>> from Bio import AlignIO 319 >>> filename = "Emboss/needle.txt" 320 >>> format = "emboss" 321 >>> for alignment in AlignIO.parse(filename, format): 322 ... print("Alignment of length %i" % alignment.get_alignment_length()) 323 Alignment of length 124 324 Alignment of length 119 325 Alignment of length 120 326 Alignment of length 118 327 Alignment of length 125 328 329 If you have a string 'data' containing the file contents, use:: 330 331 from Bio import AlignIO 332 from StringIO import StringIO 333 my_iterator = AlignIO.parse(StringIO(data), format) 334 335 Use the Bio.AlignIO.read() function when you expect a single record only. 336 """ 337 from Bio import SeqIO 338 339 # Try and give helpful error messages: 340 if not isinstance(format, basestring): 341 raise TypeError("Need a string for the file format (lower case)") 342 if not format: 343 raise ValueError("Format required (lower case string)") 344 if format != format.lower(): 345 raise ValueError("Format string '%s' should be lower case" % format) 346 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 347 isinstance(alphabet, AlphabetEncoder)): 348 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 349 if seq_count is not None and not isinstance(seq_count, int): 350 raise TypeError("Need integer for seq_count (sequences per alignment)") 351 352 with as_handle(handle, 'rU') as fp: 353 # Map the file format to a sequence iterator: 354 if format in _FormatToIterator: 355 iterator_generator = _FormatToIterator[format] 356 if alphabet is None: 357 i = iterator_generator(fp, seq_count) 358 else: 359 try: 360 # Initially assume the optional alphabet argument is supported 361 i = iterator_generator(fp, seq_count, alphabet=alphabet) 362 except TypeError: 363 # It isn't supported. 364 i = _force_alphabet(iterator_generator(fp, seq_count), 365 alphabet) 366 367 elif format in SeqIO._FormatToIterator: 368 # Exploit the existing SeqIO parser to the dirty work! 369 i = _SeqIO_to_alignment_iterator(fp, format, 370 alphabet=alphabet, 371 seq_count=seq_count) 372 else: 373 raise ValueError("Unknown format '%s'" % format) 374 375 # This imposes some overhead... wait until we drop Python 2.4 to fix it 376 for a in i: 377 yield a
378 379
380 -def read(handle, format, seq_count=None, alphabet=None):
381 """Turn an alignment file into a single MultipleSeqAlignment object. 382 383 Arguments: 384 - handle - handle to the file, or the filename as a string 385 (note older versions of Biopython only took a handle). 386 - format - string describing the file format. 387 - alphabet - optional Alphabet object, useful when the sequence type 388 cannot be automatically inferred from the file itself 389 (e.g. fasta, phylip, clustal) 390 - seq_count - Optional integer, number of sequences expected in each 391 alignment. Recommended for fasta format files. 392 393 If the handle contains no alignments, or more than one alignment, 394 an exception is raised. For example, using a PFAM/Stockholm file 395 containing one alignment: 396 397 >>> from Bio import AlignIO 398 >>> filename = "Clustalw/protein.aln" 399 >>> format = "clustal" 400 >>> alignment = AlignIO.read(filename, format) 401 >>> print("Alignment of length %i" % alignment.get_alignment_length()) 402 Alignment of length 411 403 404 If however you want the first alignment from a file containing 405 multiple alignments this function would raise an exception. 406 407 >>> from Bio import AlignIO 408 >>> filename = "Emboss/needle.txt" 409 >>> format = "emboss" 410 >>> alignment = AlignIO.read(filename, format) 411 Traceback (most recent call last): 412 ... 413 ValueError: More than one record found in handle 414 415 Instead use: 416 417 >>> from Bio import AlignIO 418 >>> filename = "Emboss/needle.txt" 419 >>> format = "emboss" 420 >>> alignment = next(AlignIO.parse(filename, format)) 421 >>> print("First alignment has length %i" % alignment.get_alignment_length()) 422 First alignment has length 124 423 424 You must use the Bio.AlignIO.parse() function if you want to read multiple 425 records from the handle. 426 """ 427 iterator = parse(handle, format, seq_count, alphabet) 428 try: 429 first = next(iterator) 430 except StopIteration: 431 first = None 432 if first is None: 433 raise ValueError("No records found in handle") 434 try: 435 second = next(iterator) 436 except StopIteration: 437 second = None 438 if second is not None: 439 raise ValueError("More than one record found in handle") 440 if seq_count: 441 assert len(first) == seq_count 442 return first
443 444
445 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
446 """Convert between two alignment files, returns number of alignments. 447 448 Arguments: 449 - in_file - an input handle or filename 450 - in_format - input file format, lower case string 451 - output - an output handle or filename 452 - out_file - output file format, lower case string 453 - alphabet - optional alphabet to assume 454 455 **NOTE** - If you provide an output filename, it will be opened which will 456 overwrite any existing file without warning. This may happen if even the 457 conversion is aborted (e.g. an invalid out_format name is given). 458 """ 459 # TODO - Add optimised versions of important conversions 460 # For now just off load the work to SeqIO parse/write 461 with as_handle(in_file, 'rU') as in_handle: 462 # Don't open the output file until we've checked the input is OK: 463 alignments = parse(in_handle, in_format, None, alphabet) 464 465 # This will check the arguments and issue error messages, 466 # after we have opened the file which is a shame. 467 with as_handle(out_file, 'w') as out_handle: 468 count = write(alignments, out_handle, out_format) 469 470 return count
471 472 473 if __name__ == "__main__": 474 from Bio._utils import run_doctest 475 run_doctest() 476