Package Bio :: Package SearchIO :: Package ExonerateIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SearchIO.ExonerateIO

  1  # Copyright 2012 by Wibowo Arindrarto.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Bio.SearchIO support for Exonerate output formats. 
  7   
  8  This module adds support for handling Exonerate outputs. Exonerate is a generic 
  9  tool for pairwise sequence comparison that allows you to align sequences using 
 10  several different models. 
 11   
 12  Bio.SearchIO.ExonerateIO was tested on the following Exonerate versions and 
 13  models: 
 14   
 15      - version: 2.2 
 16      - models: 
 17        - affine:local                - cdna2genome 
 18        - coding2coding               - est2genome 
 19        - genome2genome               - ner 
 20        - protein2dna                 - protein2genome 
 21        - ungapped                    - ungapped:translated 
 22   
 23  Although model testing were not exhaustive, ExonerateIO should be able to cope 
 24  with all Exonerate models. Please file a bug report if you stumble upon an 
 25  unparseable file. 
 26   
 27  More information on Exonerate is available on its home page at 
 28  www.ebi.ac.uk/~guy/exonerate/ 
 29   
 30   
 31  Supported Formats 
 32  ================= 
 33   
 34    - Plain text alignment - 'exonerate-text'   - parsing, indexing 
 35    - Vulgar line          - 'exonerate-vulgar' - parsing, indexing 
 36    - Cigar line           - 'exonerate-cigar'  - parsing, indexing 
 37   
 38  On Exonerate, these output formats are not exclusive to one another. For 
 39  example, you may have both plain text and vulgar output in the same file. 
 40  ExonerateIO can only handle one of these at a time, however. If you have a file 
 41  containing both plain text and vulgar lines, for example, you have to pick 
 42  either 'exonerate-text' or 'exonerate-vulgar' to parse it. 
 43   
 44  Due to the cigar format specification, many features of the alignments such as 
 45  introns or frameshifts may be collapsed into a single feature (in this case, 
 46  they are labelled 'D' for 'deletion'). The parser does not attempt to guess 
 47  whether the D label it encounters is a real deletion or a collapsed feature. 
 48  As such, parsing or indexing using 'exonerate-cigar' may yield different results 
 49  compared to 'exonerate-text' or 'exonerate-vulgar'. 
 50   
 51   
 52  exonerate-text 
 53  ============== 
 54   
 55  The plain text output / C4 alignment is the output triggered by the 
 56  '--showalignemnt' flag. Compared to the two other output formats, this format 
 57  contains the most information, having the complete query and hit sequences of 
 58  the alignment. 
 59   
 60  Here are some examples of the C4 output alignment that ExonerateIO can handle 
 61  (coordinates not written in scale):: 
 62   
 63      1. simple ungapped alignments 
 64   
 65             1 : ATGGGCAATATCCTTCGGAAAGGTCAGCAAAT :      56 
 66                 |||||||||||||||||||||||||||||||| 
 67       1319275 : ATGGGCAATATCCTTCGGAAAGGTCAGCAAAT : 1319220 
 68   
 69      2. alignments with frameshifts: 
 70   
 71           129 : -TGCCGTTACCAT----GACGAAAGTATTAAT : 160 
 72                 -CysArgTyrHis----AspGluSerIleAsn 
 73                 #||||||||||||####||||||||||||||| 
 74                 #CysArgTyrHis####AspGluSerIleAsn 
 75       1234593 : GTGCCGTTACCATCGGTGACGAAAGTATTAAT : 1234630 
 76   
 77      3. alignments with introns and split codons: 
 78   
 79          382 :    {A}                             {CC}AAA                 :    358 
 80                AAA{T}  >>>> Target Intron 3 >>>>  {hr}LysATGAGCGATGAAAATA 
 81                || { }++         55423 bp        ++{  } !  |||  |||||||||| 
 82                AAC{L}gt.........................ag{eu}AspTTGAATGATGAAAATA 
 83        42322 :    {C}                             {TG}GAT                 :  97769 
 84   
 85      4. alignments with NER blocks 
 86   
 87          111 : CAGAAAA--<   31  >--CTGCCCAGAAT--<   10  >--AACGAGCGTTCCG- :    184 
 88                | |||||--< NER 1 >--| ||||| | |--< NER 2 >--|||  | ||||||- 
 89       297911 : CTGAAAA--<   29  >--CCGCCCAAAGT--<   13  >--AACTGGAGTTCCG- : 297993 
 90   
 91  ExonerateIO utilizes the HSPFragment model quite extensively to deal with non- 
 92  ungapped alignments. For any single HSPFragment, if ExonerateIO sees an intron, 
 93  a NER block, or a frameshift, it will break the fragment into two HSPFragment 
 94  objects and adjust each of their start and end coordinate appropriately. 
 95   
 96  You may notice that Exonerate always uses the three letter amino acid codes to 
 97  display protein sequences. If the protein itself is part of the query sequence, 
 98  such as in the protein2dna model, ExonerateIO will transform the protein 
 99  sequence into using one letter codes. This is because the SeqRecord objects that 
100  store the sequences are designed for single-letter sequences only. If Exonerate 
101  also outputs the underlying nucleotide sequence, it will be saved into an 
102  ``aln_annotation`` entry as a list of triplets. 
103   
104  If the protein sequence is not part of the actual alignment, such as in the 
105  est2genome or genome2genome models, ExonerateIO will keep the three letter codes 
106  and store them as ``aln_annotation`` entries. In these cases, the hit and 
107  query sequences may be used directly as SeqRecord objects as they are one-letter 
108  nucleotide codes. The three-letter protein sequences are then stored as entries 
109  in the ``aln_annotation`` dictionary. 
110   
111   
112  For 'exonerate-text', ExonerateIO provides the following object attributes: 
113   
114  +-----------------+-------------------------+----------------------------------+ 
115  | Object          | Attribute               | Value                            | 
116  +=================+=========================+==================================+ 
117  | QueryResult     | description             | query sequence description       | 
118  |                 +-------------------------+----------------------------------+ 
119  |                 | id                      | query sequence ID                | 
120  |                 +-------------------------+----------------------------------+ 
121  |                 | model                   | alignment model                  | 
122  |                 +-------------------------+----------------------------------+ 
123  |                 | program                 | 'exonerate'                      | 
124  +-----------------+-------------------------+----------------------------------+ 
125  | Hit             | description             | hit sequence description         | 
126  |                 +-------------------------+----------------------------------+ 
127  |                 | id                      | hit sequence ID                  | 
128  +-----------------+-------------------------+----------------------------------+ 
129  | HSP             | hit_split_codons        | list of split codon coordinates  | 
130  |                 |                         | in the hit sequence              | 
131  |                 +-------------------------+----------------------------------+ 
132  |                 | score                   | alignment score                  | 
133  |                 +-------------------------+----------------------------------+ 
134  |                 | query_split_codons      | list of split codon coordinates  | 
135  |                 |                         | in the query sequence            | 
136  +-----------------+-------------------------+----------------------------------+ 
137  | HSPFragment     | aln_annotation          | alignment similarity string, hit | 
138  |                 |                         | sequence annotation, and/or      | 
139  |                 |                         | query sequence annotation        | 
140  |                 +-------------------------+----------------------------------+ 
141  |                 | hit                     | hit sequence                     | 
142  |                 +-------------------------+----------------------------------+ 
143  |                 | hit_end                 | hit sequence end coordinate      | 
144  |                 +-------------------------+----------------------------------+ 
145  |                 | hit_frame               | hit sequence reading frame       | 
146  |                 +-------------------------+----------------------------------+ 
147  |                 | hit_start               | hit sequence start coordinate    | 
148  |                 +-------------------------+----------------------------------+ 
149  |                 | hit_strand              | hit sequence strand              | 
150  |                 +-------------------------+----------------------------------+ 
151  |                 | query                   | query sequence                   | 
152  |                 +-------------------------+----------------------------------+ 
153  |                 | query_end               | query sequence end coordinate    | 
154  |                 +-------------------------+----------------------------------+ 
155  |                 | query_frame             | query sequence reading frame     | 
156  |                 +-------------------------+----------------------------------+ 
157  |                 | query_start             | query sequence start coordinate  | 
158  |                 +-------------------------+----------------------------------+ 
159  |                 | query_strand            | query sequence strand            | 
160  +-----------------+-------------------------+----------------------------------+ 
161   
162  Note that you can also use the default HSP or HSPFragment properties. For 
163  example, to check the intron coordinates of your result you can use the 
164  ``query_inter_ranges`` or ``hit_inter_ranges`` properties: 
165   
166      >>> from Bio import SearchIO 
167      >>> fname = 'Exonerate/exn_22_m_genome2genome.exn' 
168      >>> all_qresult = list(SearchIO.parse(fname, 'exonerate-text')) 
169      >>> hsp = all_qresult[-1][-1][-1]   # last qresult, last hit, last hsp 
170      >>> hsp 
171      HSP(...) 
172      >>> hsp.query_inter_ranges 
173      [(388, 449), (284, 319), (198, 198), (114, 161)] 
174      >>> hsp.hit_inter_ranges 
175      [(487387, 641682), (386207, 487327), (208677, 386123), (71917, 208639)] 
176   
177  Here you can see that for both query and hit introns, the coordinates 
178  in each tuple is always (start, end) where start <= end. But when you compare 
179  each tuple to the next, the coordinates decrease. This is an indication that 
180  both the query and hit sequences lie on the minus strand. Exonerate outputs 
181  minus strand results in a decreasing manner; the start coordinate is always 
182  bigger than the end coordinate. ExonerateIO preserves the fragment ordering as a 
183  whole, but uses its own standard to store an individual fragment's start and end 
184  coordinates. 
185   
186  You may also notice that the third tuple in ``query_inter_ranges`` is (198, 198), 
187  two exact same numbers. This means that the query sequence does not have any 
188  gaps at that position. The gap is only present in the hit sequence, where we see 
189  that the third tuple contains (208677, 386123), a gap of about 177k bases. 
190   
191  Another example is to use the ``hit_frame_all`` and ``query_frame_all`` to see if 
192  there are any frameshifts in your alignment: 
193   
194      >>> from Bio import SearchIO 
195      >>> fname = 'Exonerate/exn_22_m_coding2coding_fshifts.exn' 
196      >>> qresult = next(SearchIO.parse(fname, 'exonerate-text')) 
197      >>> hsp = qresult[0][0]      # first hit, first hsp 
198      >>> hsp 
199      HSP(...) 
200      >>> hsp.query_frame_all 
201      [1, 2, 2, 2] 
202      >>> hsp.hit_frame_all 
203      [1, 1, 3, 1] 
204   
205  Here you can see that the alignment as a whole has three frameshifts. The first 
206  one occurs in the query sequence, after the first fragment (1 -> 2 shift), the 
207  second one occurs in the hit sequence, after the second fragment (1 -> 3 shift), 
208  and the last one also occurs in the hit sequence, before the last fragment (3 -> 
209  1 shift). 
210   
211  There are other default HSP properties that you can use to ease your workflow. 
212  Please refer to the HSP object documentation for more details. 
213   
214   
215  exonerate-vulgar 
216  ================ 
217   
218  The vulgar format provides a compact way of representing alignments created by 
219  Exonerate. In general, it contains the same information as the plain text output 
220  except for the 'model' information and the actual sequences themselves. You can 
221  expect that the coordinates obtained from using 'exonerate-text' and 
222  'exonerate-vulgar' to be the same. Both formats also creates HSPFragment using 
223  the same triggers: introns, NER blocks, and/or frameshifts. 
224   
225   
226  exonerate-cigar 
227  =============== 
228   
229  The cigar format provides an even more compact representation of Exonerate 
230  alignments. However, this comes with a cost of losing information. In the cigar 
231  format, for example, introns are treated as simple deletions. This makes it 
232  impossible for the parser to distinguish between simple deletions or intron 
233  regions. As such, 'exonerate-cigar' may produce different sets of coordinates 
234  and fragments compared to 'exonerate-vulgar' or 'exonerate-text'. 
235   
236  """ 
237   
238  # Known issues & gotchas: 
239  # - The cigar parser does not use the extended cigar string; only supports MID 
240  # - Cigar and vulgar parsing results will most likely be different, due to the 
241  #   different type of data stored by both formats 
242   
243  from .exonerate_text import ExonerateTextParser, ExonerateTextIndexer 
244  from .exonerate_vulgar import ExonerateVulgarParser, ExonerateVulgarIndexer 
245  from .exonerate_cigar import ExonerateCigarParser, ExonerateCigarIndexer 
246   
247   
248  # if not used as a module, run the doctest 
249  if __name__ == "__main__": 
250      from Bio._utils import run_doctest 
251      run_doctest() 
252