1
2
3
4
5
6 """Bio.SearchIO support for Exonerate output formats.
7
8 This module adds support for handling Exonerate outputs. Exonerate is a generic
9 tool for pairwise sequence comparison that allows you to align sequences using
10 several different models.
11
12 Bio.SearchIO.ExonerateIO was tested on the following Exonerate versions and
13 models:
14
15 - version: 2.2
16 - models:
17 - affine:local - cdna2genome
18 - coding2coding - est2genome
19 - genome2genome - ner
20 - protein2dna - protein2genome
21 - ungapped - ungapped:translated
22
23 Although model testing were not exhaustive, ExonerateIO should be able to cope
24 with all Exonerate models. Please file a bug report if you stumble upon an
25 unparseable file.
26
27 More information on Exonerate is available on its home page at
28 www.ebi.ac.uk/~guy/exonerate/
29
30
31 Supported Formats
32 =================
33
34 - Plain text alignment - 'exonerate-text' - parsing, indexing
35 - Vulgar line - 'exonerate-vulgar' - parsing, indexing
36 - Cigar line - 'exonerate-cigar' - parsing, indexing
37
38 On Exonerate, these output formats are not exclusive to one another. For
39 example, you may have both plain text and vulgar output in the same file.
40 ExonerateIO can only handle one of these at a time, however. If you have a file
41 containing both plain text and vulgar lines, for example, you have to pick
42 either 'exonerate-text' or 'exonerate-vulgar' to parse it.
43
44 Due to the cigar format specification, many features of the alignments such as
45 introns or frameshifts may be collapsed into a single feature (in this case,
46 they are labelled 'D' for 'deletion'). The parser does not attempt to guess
47 whether the D label it encounters is a real deletion or a collapsed feature.
48 As such, parsing or indexing using 'exonerate-cigar' may yield different results
49 compared to 'exonerate-text' or 'exonerate-vulgar'.
50
51
52 exonerate-text
53 ==============
54
55 The plain text output / C4 alignment is the output triggered by the
56 '--showalignemnt' flag. Compared to the two other output formats, this format
57 contains the most information, having the complete query and hit sequences of
58 the alignment.
59
60 Here are some examples of the C4 output alignment that ExonerateIO can handle
61 (coordinates not written in scale):
62
63 1. simple ungapped alignments
64
65 1 : ATGGGCAATATCCTTCGGAAAGGTCAGCAAAT : 56
66 ||||||||||||||||||||||||||||||||
67 1319275 : ATGGGCAATATCCTTCGGAAAGGTCAGCAAAT : 1319220
68
69 2. alignments with frameshifts:
70
71 129 : -TGCCGTTACCAT----GACGAAAGTATTAAT : 160
72 -CysArgTyrHis----AspGluSerIleAsn
73 #||||||||||||####|||||||||||||||
74 #CysArgTyrHis####AspGluSerIleAsn
75 1234593 : GTGCCGTTACCATCGGTGACGAAAGTATTAAT : 1234630
76
77 3. alignments with introns and split codons:
78
79 382 : {A} {CC}AAA : 358
80 AAA{T} >>>> Target Intron 3 >>>> {hr}LysATGAGCGATGAAAATA
81 || { }++ 55423 bp ++{ } ! ||| ||||||||||
82 AAC{L}gt.........................ag{eu}AspTTGAATGATGAAAATA
83 42322 : {C} {TG}GAT : 97769
84
85 4. alignments with NER blocks
86
87 111 : CAGAAAA--< 31 >--CTGCCCAGAAT--< 10 >--AACGAGCGTTCCG- : 184
88 | |||||--< NER 1 >--| ||||| | |--< NER 2 >--||| | ||||||-
89 297911 : CTGAAAA--< 29 >--CCGCCCAAAGT--< 13 >--AACTGGAGTTCCG- : 297993
90
91 ExonerateIO utilizes the HSPFragment model quite extensively to deal with non-
92 ungapped alignments. For any single HSPFragment, if ExonerateIO sees an intron,
93 a NER block, or a frameshift, it will break the fragment into two HSPFragment
94 objects and adjust each of their start and end coordinate appropriately.
95
96 You may notice that Exonerate always uses the three letter amino acid codes to
97 display protein sequences. If the protein itself is part of the query sequence,
98 such as in the protein2dna model, ExonerateIO will transform the protein
99 sequence into using one letter codes. This is because the SeqRecord objects that
100 store the sequences are designed for single-letter sequences only. If Exonerate
101 also outputs the underlying nucleotide sequence, it will be saved into an
102 `aln_annotation` entry as a list of triplets.
103
104 If the protein sequence is not part of the actual alignment, such as in the
105 est2genome or genome2genome models, ExonerateIO will keep the three letter codes
106 and store them as `aln_annotation` entries. In these cases, the hit and
107 query sequences may be used directly as SeqRecord objects as they are one-letter
108 nucleotide codes. The three-letter protein sequences are then stored as entries
109 in the `aln_annotation` dictionary.
110
111
112 For 'exonerate-text', ExonerateIO provides the following object attributes:
113
114 +-----------------+-------------------------+----------------------------------+
115 | Object | Attribute | Value |
116 +=================+=========================+==================================+
117 | QueryResult | description | query sequence description |
118 | +-------------------------+----------------------------------+
119 | | id | query sequence ID |
120 | +-------------------------+----------------------------------+
121 | | model | alignment model |
122 | +-------------------------+----------------------------------+
123 | | program | 'exonerate' |
124 +-----------------+-------------------------+----------------------------------+
125 | Hit | description | hit sequence description |
126 | +-------------------------+----------------------------------+
127 | | id | hit sequence ID |
128 +-----------------+-------------------------+----------------------------------+
129 | HSP | hit_split_codons | list of split codon coordinates |
130 | | | in the hit sequence |
131 | +-------------------------+----------------------------------+
132 | | score | alignment score |
133 | +-------------------------+----------------------------------+
134 | | query_split_codons | list of split codon coordinates |
135 | | | in the query sequence |
136 +-----------------+-------------------------+----------------------------------+
137 | HSPFragment | aln_annotation | alignment homology string, hit |
138 | | | sequence annotation, and/or |
139 | | | query sequence annotation |
140 | +-------------------------+----------------------------------+
141 | | hit | hit sequence |
142 | +-------------------------+----------------------------------+
143 | | hit_end | hit sequence end coordinate |
144 | +-------------------------+----------------------------------+
145 | | hit_frame | hit sequence reading frame |
146 | +-------------------------+----------------------------------+
147 | | hit_start | hit sequence start coordinate |
148 | +-------------------------+----------------------------------+
149 | | hit_strand | hit sequence strand |
150 | +-------------------------+----------------------------------+
151 | | query | query sequence |
152 | +-------------------------+----------------------------------+
153 | | query_end | query sequence end coordinate |
154 | +-------------------------+----------------------------------+
155 | | query_frame | query sequence reading frame |
156 | +-------------------------+----------------------------------+
157 | | query_start | query sequence start coordinate |
158 | +-------------------------+----------------------------------+
159 | | query_strand | query sequence strand |
160 +-----------------+-------------------------+----------------------------------+
161
162 Note that you can also use the default HSP or HSPFragment properties. For
163 example, to check the intron coordinates of your result you can use the
164 `query_inter_ranges` or `hit_inter_ranges` properties:
165
166 >>> from Bio import SearchIO
167 >>> fname = 'Exonerate/exn_22_m_genome2genome.exn'
168 >>> all_qresult = list(SearchIO.parse(fname, 'exonerate-text'))
169 >>> hsp = all_qresult[-1][-1][-1] # last qresult, last hit, last hsp
170 >>> hsp
171 HSP(...)
172 >>> hsp.query_inter_ranges
173 [(388, 449), (284, 319), (198, 198), (114, 161)]
174 >>> hsp.hit_inter_ranges
175 [(487387, 641682), (386207, 487327), (208677, 386123), (71917, 208639)]
176
177 Here you can see that for both query and hit introns, the coordinates
178 in each tuple is always (start, end) where start <= end. But when you compare
179 each tuple to the next, the coordinates decrease. This is an indication that
180 both the query and hit sequences lie on the minus strand. Exonerate outputs
181 minus strand results in a decreasing manner; the start coordinate is always
182 bigger than the end coordinate. ExonerateIO preserves the fragment ordering as a
183 whole, but uses its own standard to store an individual fragment's start and end
184 coordinates.
185
186 You may also notice that the third tuple in `query_inter_ranges` is (198, 198),
187 two exact same numbers. This means that the query sequence does not have any
188 gaps at that position. The gap is only present in the hit sequence, where we see
189 that the third tuple contains (208677, 386123), a gap of about 177k bases.
190
191 Another example is to use the `hit_frame_all` and `query_frame_all` to see if
192 there are any frameshifts in your alignment:
193
194 >>> from Bio import SearchIO
195 >>> fname = 'Exonerate/exn_22_m_coding2coding_fshifts.exn'
196 >>> qresult = SearchIO.parse(fname, 'exonerate-text').next()
197 >>> hsp = qresult[0][0] # first hit, first hsp
198 >>> hsp
199 HSP(...)
200 >>> hsp.query_frame_all
201 [1, 2, 2, 2]
202 >>> hsp.hit_frame_all
203 [1, 1, 3, 1]
204
205 Here you can see that the alignment as a whole has three frameshifts. The first
206 one occurs in the query sequence, after the first fragment (1 -> 2 shift), the
207 second one occurs in the hit sequence, after the second fragment (1 -> 3 shift),
208 and the last one also occurs in the hit sequence, before the last fragment (3 ->
209 1 shift).
210
211 There are other default HSP properties that you can use to ease your workflow.
212 Please refer to the HSP object documentation for more details.
213
214
215 exonerate-vulgar
216 ================
217
218 The vulgar format provides a compact way of representing alignments created by
219 Exonerate. In general, it contains the same information as the plain text output
220 except for the 'model' information and the actual sequences themselves. You can
221 expect that the coordinates obtained from using 'exonerate-text' and
222 'exonerate-vulgar' to be the same. Both formats also creates HSPFragment using
223 the same triggers: introns, NER blocks, and/or frameshifts.
224
225
226 exonerate-cigar
227 ===============
228
229 The cigar format provides an even more compact representation of Exonerate
230 alignments. However, this comes with a cost of losing information. In the cigar
231 format, for example, introns are treated as simple deletions. This makes it
232 impossible for the parser to distinguish between simple deletions or intron
233 regions. As such, 'exonerate-cigar' may produce different sets of coordinates
234 and fragments compared to 'exonerate-vulgar' or 'exonerate-text'.
235
236 """
237
238
239
240
241
242
243 from exonerate_text import *
244 from exonerate_vulgar import *
245 from exonerate_cigar import *
246
247
248
249 if __name__ == "__main__":
250 from Bio._utils import run_doctest
251 run_doctest()
252