1
2
3
4
5
6 """Multiple sequence alignment input/output as alignment objects.
7
8 The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in
9 fact the two are connected internally. Both modules use the same set of file
10 format names (lower case strings). From the user's perspective, you can read
11 in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you
12 can read in the sequences within these alignmenta using Bio.SeqIO.
13
14 Bio.AlignIO is also documented at U{http://biopython.org/wiki/AlignIO} and by
15 a whole chapter in our tutorial:
16 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html}
17 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}
18
19 Input
20 =====
21 For the typical special case when your file or handle contains one and only
22 one alignment, use the function Bio.AlignIO.read(). This takes an input file
23 handle (or in recent versions of Biopython a filename as a string), format
24 string and optional number of sequences per alignment. It will return a single
25 MultipleSeqAlignment object (or raise an exception if there isn't just one
26 alignment):
27
28 >>> from Bio import AlignIO
29 >>> align = AlignIO.read("Phylip/interlaced.phy", "phylip")
30 >>> print align
31 SingleLetterAlphabet() alignment with 3 rows and 384 columns
32 -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI
33 MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU
34 ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN
35
36 For the general case, when the handle could contain any number of alignments,
37 use the function Bio.AlignIO.parse(...) which takes the same arguments, but
38 returns an iterator giving MultipleSeqAlignment objects (typically used in a
39 for loop). If you want random access to the alignments by number, turn this
40 into a list:
41
42 >>> from Bio import AlignIO
43 >>> alignments = list(AlignIO.parse("Emboss/needle.txt", "emboss"))
44 >>> print alignments[2]
45 SingleLetterAlphabet() alignment with 2 rows and 120 columns
46 -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec
47 LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver
48
49 Most alignment file formats can be concatenated so as to hold as many
50 different multiple sequence alignments as possible. One common example
51 is the output of the tool seqboot in the PHLYIP suite. Sometimes there
52 can be a file header and footer, as seen in the EMBOSS alignment output.
53
54 Output
55 ======
56 Use the function Bio.AlignIO.write(...), which takes a complete set of
57 Alignment objects (either as a list, or an iterator), an output file handle
58 (or filename in recent versions of Biopython) and of course the file format::
59
60 from Bio import AlignIO
61 alignments = ...
62 count = SeqIO.write(alignments, "example.faa", "fasta")
63
64 If using a handle make sure to close it to flush the data to the disk::
65
66 from Bio import AlignIO
67 alignments = ...
68 handle = open("example.faa", "w")
69 count = SeqIO.write(alignments, handle, "fasta")
70 handle.close()
71
72 In general, you are expected to call this function once (with all your
73 alignments) and then close the file handle. However, for file formats
74 like PHYLIP where multiple alignments are stored sequentially (with no file
75 header and footer), then multiple calls to the write function should work as
76 expected when using handles.
77
78 If you are using a filename, the repeated calls to the write functions will
79 overwrite the existing file each time.
80
81 Conversion
82 ==========
83 The Bio.AlignIO.convert(...) function allows an easy interface for simple
84 alignnment file format conversions. Additionally, it may use file format
85 specific optimisations so this should be the fastest way too.
86
87 In general however, you can combine the Bio.AlignIO.parse(...) function with
88 the Bio.AlignIO.write(...) function for sequence file conversion. Using
89 generator expressions provides a memory efficient way to perform filtering or
90 other extra operations as part of the process.
91
92 File Formats
93 ============
94 When specifying the file format, use lowercase strings. The same format
95 names are also used in Bio.SeqIO and include the following:
96
97 - clustal - Output from Clustal W or X, see also the module Bio.Clustalw
98 which can be used to run the command line tool from Biopython.
99 - emboss - EMBOSS tools' "pairs" and "simple" alignment formats.
100 - fasta - The generic sequence file format where each record starts with
101 an identifer line starting with a ">" character, followed by
102 lines of sequence.
103 - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA
104 tools when used with the -m 10 command line option for machine
105 readable output.
106 - ig - The IntelliGenetics file format, apparently the same as the
107 MASE alignment format.
108 - nexus - Output from NEXUS, see also the module Bio.Nexus which can also
109 read any phylogenetic trees in these files.
110 - phylip - Interlaced PHYLIP, as used by the PHLIP tools.
111 - phylip-sequential - Sequential PHYLIP.
112 - phylip-relaxed - PHYLIP like format allowing longer names.
113 - stockholm - A richly annotated alignment file format used by PFAM.
114
115 Note that while Bio.AlignIO can read all the above file formats, it cannot
116 write to all of them.
117
118 You can also use any file format supported by Bio.SeqIO, such as "fasta" or
119 "ig" (which are listed above), PROVIDED the sequences in your file are all the
120 same length.
121 """
122
123
124 from __future__ import with_statement
125
126 __docformat__ = "epytext en"
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142 from Bio.Align import MultipleSeqAlignment
143 from Bio.Align.Generic import Alignment
144 from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
145 from Bio.File import as_handle
146
147 import StockholmIO
148 import ClustalIO
149 import NexusIO
150 import PhylipIO
151 import EmbossIO
152 import FastaIO
153
154
155
156
157 _FormatToIterator = {
158 "clustal": ClustalIO.ClustalIterator,
159 "emboss": EmbossIO.EmbossIterator,
160 "fasta-m10": FastaIO.FastaM10Iterator,
161 "nexus": NexusIO.NexusIterator,
162 "phylip": PhylipIO.PhylipIterator,
163 "phylip-sequential": PhylipIO.SequentialPhylipIterator,
164 "phylip-relaxed": PhylipIO.RelaxedPhylipIterator,
165 "stockholm": StockholmIO.StockholmIterator,
166 }
167
168 _FormatToWriter = {
169
170 "nexus": NexusIO.NexusWriter,
171 "phylip": PhylipIO.PhylipWriter,
172 "phylip-sequential": PhylipIO.SequentialPhylipWriter,
173 "phylip-relaxed": PhylipIO.RelaxedPhylipWriter,
174 "stockholm": StockholmIO.StockholmWriter,
175 "clustal": ClustalIO.ClustalWriter,
176 }
177
178
179 -def write(alignments, handle, format):
180 """Write complete set of alignments to a file.
181
182 Arguments:
183 - alignments - A list (or iterator) of Alignment objects (ideally the
184 new MultipleSeqAlignment objects), or (if using Biopython
185 1.54 or later) a single alignment object.
186 - handle - File handle object to write to, or filename as string
187 (note older versions of Biopython only took a handle).
188 - format - lower case string describing the file format to write.
189
190 You should close the handle after calling this function.
191
192 Returns the number of alignments written (as an integer).
193 """
194 from Bio import SeqIO
195
196
197 if not isinstance(format, basestring):
198 raise TypeError("Need a string for the file format (lower case)")
199 if not format:
200 raise ValueError("Format required (lower case string)")
201 if format != format.lower():
202 raise ValueError("Format string '%s' should be lower case" % format)
203
204 if isinstance(alignments, Alignment):
205
206 alignments = [alignments]
207
208 with as_handle(handle, 'w') as fp:
209
210 if format in _FormatToWriter:
211 writer_class = _FormatToWriter[format]
212 count = writer_class(fp).write_file(alignments)
213 elif format in SeqIO._FormatToWriter:
214
215
216 count = 0
217 for alignment in alignments:
218 if not isinstance(alignment, Alignment):
219 raise TypeError(
220 "Expect a list or iterator of Alignment objects.")
221 SeqIO.write(alignment, fp, format)
222 count += 1
223 elif format in _FormatToIterator or format in SeqIO._FormatToIterator:
224 raise ValueError("Reading format '%s' is supported, but not writing"
225 % format)
226 else:
227 raise ValueError("Unknown format '%s'" % format)
228
229 assert isinstance(count, int), "Internal error - the underlying %s " \
230 "writer should have returned the alignment count, not %s" \
231 % (format, repr(count))
232
233 return count
234
235
236
238 """Uses Bio.SeqIO to create an MultipleSeqAlignment iterator (PRIVATE).
239
240 Arguments:
241 - handle - handle to the file.
242 - format - string describing the file format.
243 - alphabet - optional Alphabet object, useful when the sequence type
244 cannot be automatically inferred from the file itself
245 (e.g. fasta, phylip, clustal)
246 - seq_count - Optional integer, number of sequences expected in each
247 alignment. Recommended for fasta format files.
248
249 If count is omitted (default) then all the sequences in the file are
250 combined into a single MultipleSeqAlignment.
251 """
252 from Bio import SeqIO
253 assert format in SeqIO._FormatToIterator
254
255 if seq_count:
256
257 seq_record_iterator = SeqIO.parse(handle, format, alphabet)
258
259 records = []
260 for record in seq_record_iterator:
261 records.append(record)
262 if len(records) == seq_count:
263 yield MultipleSeqAlignment(records, alphabet)
264 records = []
265 if len(records) > 0:
266 raise ValueError("Check seq_count argument, not enough sequences?")
267 else:
268
269
270 records = list(SeqIO.parse(handle, format, alphabet))
271 if records:
272 yield MultipleSeqAlignment(records, alphabet)
273 raise StopIteration
274
275
295
296
297 -def parse(handle, format, seq_count=None, alphabet=None):
298 """Iterate over an alignment file as MultipleSeqAlignment objects.
299
300 Arguments:
301 - handle - handle to the file, or the filename as a string
302 (note older versions of Biopython only took a handle).
303 - format - string describing the file format.
304 - alphabet - optional Alphabet object, useful when the sequence type
305 cannot be automatically inferred from the file itself
306 (e.g. fasta, phylip, clustal)
307 - seq_count - Optional integer, number of sequences expected in each
308 alignment. Recommended for fasta format files.
309
310 If you have the file name in a string 'filename', use:
311
312 >>> from Bio import AlignIO
313 >>> filename = "Emboss/needle.txt"
314 >>> format = "emboss"
315 >>> for alignment in AlignIO.parse(filename, format):
316 ... print "Alignment of length", alignment.get_alignment_length()
317 Alignment of length 124
318 Alignment of length 119
319 Alignment of length 120
320 Alignment of length 118
321 Alignment of length 125
322
323 If you have a string 'data' containing the file contents, use:
324
325 from Bio import AlignIO
326 from StringIO import StringIO
327 my_iterator = AlignIO.parse(StringIO(data), format)
328
329 Use the Bio.AlignIO.read() function when you expect a single record only.
330 """
331 from Bio import SeqIO
332
333
334 if not isinstance(format, basestring):
335 raise TypeError("Need a string for the file format (lower case)")
336 if not format:
337 raise ValueError("Format required (lower case string)")
338 if format != format.lower():
339 raise ValueError("Format string '%s' should be lower case" % format)
340 if alphabet is not None and not (isinstance(alphabet, Alphabet) or
341 isinstance(alphabet, AlphabetEncoder)):
342 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
343 if seq_count is not None and not isinstance(seq_count, int):
344 raise TypeError("Need integer for seq_count (sequences per alignment)")
345
346 with as_handle(handle, 'rU') as fp:
347
348 if format in _FormatToIterator:
349 iterator_generator = _FormatToIterator[format]
350 if alphabet is None:
351 i = iterator_generator(fp, seq_count)
352 else:
353 try:
354
355 i = iterator_generator(fp, seq_count, alphabet=alphabet)
356 except TypeError:
357
358 i = _force_alphabet(iterator_generator(fp, seq_count),
359 alphabet)
360
361 elif format in SeqIO._FormatToIterator:
362
363 i = _SeqIO_to_alignment_iterator(fp, format,
364 alphabet=alphabet,
365 seq_count=seq_count)
366 else:
367 raise ValueError("Unknown format '%s'" % format)
368
369
370 for a in i:
371 yield a
372
373
374 -def read(handle, format, seq_count=None, alphabet=None):
375 """Turns an alignment file into a single MultipleSeqAlignment object.
376
377 Arguments:
378 - handle - handle to the file, or the filename as a string
379 (note older versions of Biopython only took a handle).
380 - format - string describing the file format.
381 - alphabet - optional Alphabet object, useful when the sequence type
382 cannot be automatically inferred from the file itself
383 (e.g. fasta, phylip, clustal)
384 - seq_count - Optional integer, number of sequences expected in each
385 alignment. Recommended for fasta format files.
386
387 If the handle contains no alignments, or more than one alignment,
388 an exception is raised. For example, using a PFAM/Stockholm file
389 containing one alignment:
390
391 >>> from Bio import AlignIO
392 >>> filename = "Clustalw/protein.aln"
393 >>> format = "clustal"
394 >>> alignment = AlignIO.read(filename, format)
395 >>> print "Alignment of length", alignment.get_alignment_length()
396 Alignment of length 411
397
398 If however you want the first alignment from a file containing
399 multiple alignments this function would raise an exception.
400
401 >>> from Bio import AlignIO
402 >>> filename = "Emboss/needle.txt"
403 >>> format = "emboss"
404 >>> alignment = AlignIO.read(filename, format)
405 Traceback (most recent call last):
406 ...
407 ValueError: More than one record found in handle
408
409 Instead use:
410
411 >>> from Bio import AlignIO
412 >>> filename = "Emboss/needle.txt"
413 >>> format = "emboss"
414 >>> alignment = AlignIO.parse(filename, format).next()
415 >>> print "First alignment has length", alignment.get_alignment_length()
416 First alignment has length 124
417
418 You must use the Bio.AlignIO.parse() function if you want to read multiple
419 records from the handle.
420 """
421 iterator = parse(handle, format, seq_count, alphabet)
422 try:
423 first = iterator.next()
424 except StopIteration:
425 first = None
426 if first is None:
427 raise ValueError("No records found in handle")
428 try:
429 second = iterator.next()
430 except StopIteration:
431 second = None
432 if second is not None:
433 raise ValueError("More than one record found in handle")
434 if seq_count:
435 assert len(first) == seq_count
436 return first
437
438
439 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
440 """Convert between two alignment files, returns number of alignments.
441
442 - in_file - an input handle or filename
443 - in_format - input file format, lower case string
444 - output - an output handle or filename
445 - out_file - output file format, lower case string
446 - alphabet - optional alphabet to assume
447
448 NOTE - If you provide an output filename, it will be opened which will
449 overwrite any existing file without warning. This may happen if even the
450 conversion is aborted (e.g. an invalid out_format name is given).
451 """
452
453
454 with as_handle(in_file, 'rU') as in_handle:
455
456 alignments = parse(in_handle, in_format, None, alphabet)
457
458
459
460 with as_handle(out_file, 'w') as out_handle:
461 count = write(alignments, out_handle, out_format)
462
463 return count
464
465
466 if __name__ == "__main__":
467 from Bio._utils import run_doctest
468 run_doctest()
469