Package Bio :: Package AlignIO :: Module StockholmIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.AlignIO.StockholmIO

  1  # Copyright 2006-2013 by Peter Cock.  All rights reserved. 
  2  # 
  3  # This code is part of the Biopython distribution and governed by its 
  4  # license.  Please see the LICENSE file that should have been included 
  5  # as part of this package. 
  6  """Bio.AlignIO support for "stockholm" format (used in the PFAM database). 
  7   
  8  You are expected to use this module via the Bio.AlignIO functions (or the 
  9  Bio.SeqIO functions if you want to work directly with the gapped sequences). 
 10   
 11  For example, consider a Stockholm alignment file containing the following:: 
 12   
 13      # STOCKHOLM 1.0 
 14      #=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>.. 
 15      AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU 
 16      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..-- 
 17      AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU 
 18      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>---- 
 19   
 20      #=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>............... 
 21      AP001509.1         CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 22      #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>--------------- 
 23      AE007476.1         UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 24      #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 25      // 
 26   
 27  This is a single multiple sequence alignment, so you would probably load this 
 28  using the Bio.AlignIO.read() function: 
 29   
 30      >>> from Bio import AlignIO 
 31      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm") 
 32      >>> print(align) 
 33      SingleLetterAlphabet() alignment with 2 rows and 104 columns 
 34      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 35      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 36      >>> for record in align: 
 37      ...     print("%s %i" % (record.id, len(record))) 
 38      AP001509.1 104 
 39      AE007476.1 104 
 40   
 41  This example file is clearly using RNA, so you might want the alignment object 
 42  (and the SeqRecord objects it holds) to reflect this, rather than simple using 
 43  the default single letter alphabet as shown above.  You can do this with an 
 44  optional argument to the Bio.AlignIO.read() function: 
 45   
 46      >>> from Bio import AlignIO 
 47      >>> from Bio.Alphabet import generic_rna 
 48      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm", 
 49      ...                      alphabet=generic_rna) 
 50      >>> print(align) 
 51      RNAAlphabet() alignment with 2 rows and 104 columns 
 52      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 53      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 54   
 55  In addition to the sequences themselves, this example alignment also includes 
 56  some GR lines for the secondary structure of the sequences.  These are 
 57  strings, with one character for each letter in the associated sequence: 
 58   
 59      >>> for record in align: 
 60      ...     print(record.id) 
 61      ...     print(record.seq) 
 62      ...     print(record.letter_annotations['secondary_structure']) 
 63      AP001509.1 
 64      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 65      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 66      AE007476.1 
 67      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 68      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 69   
 70  Any general annotation for each row is recorded in the SeqRecord's annotations 
 71  dictionary.  You can output this alignment in many different file formats 
 72  using Bio.AlignIO.write(), or the MultipleSeqAlignment object's format method: 
 73   
 74      >>> print(align.format("fasta")) 
 75      >AP001509.1 
 76      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-A 
 77      GGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 78      >AE007476.1 
 79      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAA 
 80      GGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 81      <BLANKLINE> 
 82   
 83  Most output formats won't be able to hold the annotation possible in a 
 84  Stockholm file: 
 85   
 86      >>> print(align.format("stockholm")) 
 87      # STOCKHOLM 1.0 
 88      #=GF SQ 2 
 89      AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 90      #=GS AP001509.1 AC AP001509.1 
 91      #=GS AP001509.1 DE AP001509.1 
 92      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 93      AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 94      #=GS AE007476.1 AC AE007476.1 
 95      #=GS AE007476.1 DE AE007476.1 
 96      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 97      // 
 98      <BLANKLINE> 
 99   
100  Note that when writing Stockholm files, AlignIO does not break long sequences 
101  up and interleave them (as in the input file shown above).  The standard 
102  allows this simpler layout, and it is more likely to be understood by other 
103  tools. 
104   
105  Finally, as an aside, it can sometimes be useful to use Bio.SeqIO.parse() to 
106  iterate over the alignment rows as SeqRecord objects - rather than working 
107  with Alignnment objects. Again, if you want to you can specify this is RNA: 
108   
109      >>> from Bio import SeqIO 
110      >>> from Bio.Alphabet import generic_rna 
111      >>> for record in SeqIO.parse("Stockholm/simple.sth", "stockholm", 
112      ...                           alphabet=generic_rna): 
113      ...     print(record.id) 
114      ...     print(record.seq) 
115      ...     print(record.letter_annotations['secondary_structure']) 
116      AP001509.1 
117      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
118      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
119      AE007476.1 
120      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
121      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
122   
123  Remember that if you slice a SeqRecord, the per-letter-annotions like the 
124  secondary structure string here, are also sliced: 
125   
126      >>> sub_record = record[10:20] 
127      >>> print(sub_record.seq) 
128      AUCGUUUUAC 
129      >>> print(sub_record.letter_annotations['secondary_structure']) 
130      -------<<< 
131  """ 
132  from __future__ import print_function 
133   
134  __docformat__ = "epytext en"  # not just plaintext 
135  from Bio.Seq import Seq 
136  from Bio.SeqRecord import SeqRecord 
137  from Bio.Align import MultipleSeqAlignment 
138  from .Interfaces import AlignmentIterator, SequentialAlignmentWriter 
139   
140   
141 -class StockholmWriter(SequentialAlignmentWriter):
142 """Stockholm/PFAM alignment writer.""" 143 144 #These dictionaries should be kept in sync with those 145 #defined in the StockholmIterator class. 146 pfam_gr_mapping = {"secondary_structure": "SS", 147 "surface_accessibility": "SA", 148 "transmembrane": "TM", 149 "posterior_probability": "PP", 150 "ligand_binding": "LI", 151 "active_site": "AS", 152 "intron": "IN"} 153 #Following dictionary deliberately does not cover AC, DE or DR 154 pfam_gs_mapping = {"organism": "OS", 155 "organism_classification": "OC", 156 "look": "LO"} 157
158 - def write_alignment(self, alignment):
159 """Use this to write (another) single alignment to an open file. 160 161 Note that sequences and their annotation are recorded 162 together (rather than having a block of annotation followed 163 by a block of aligned sequences). 164 """ 165 count = len(alignment) 166 167 self._length_of_sequences = alignment.get_alignment_length() 168 self._ids_written = [] 169 170 #NOTE - For now, the alignment object does not hold any per column 171 #or per alignment annotation - only per sequence. 172 173 if count == 0: 174 raise ValueError("Must have at least one sequence") 175 if self._length_of_sequences == 0: 176 raise ValueError("Non-empty sequences are required") 177 178 self.handle.write("# STOCKHOLM 1.0\n") 179 self.handle.write("#=GF SQ %i\n" % count) 180 for record in alignment: 181 self._write_record(record) 182 self.handle.write("//\n")
183
184 - def _write_record(self, record):
185 """Write a single SeqRecord to the file""" 186 if self._length_of_sequences != len(record.seq): 187 raise ValueError("Sequences must all be the same length") 188 189 #For the case for stockholm to stockholm, try and use record.name 190 seq_name = record.id 191 if record.name is not None: 192 if "accession" in record.annotations: 193 if record.id == record.annotations["accession"]: 194 seq_name = record.name 195 196 #In the Stockholm file format, spaces are not allowed in the id 197 seq_name = seq_name.replace(" ", "_") 198 199 if "start" in record.annotations \ 200 and "end" in record.annotations: 201 suffix = "/%s-%s" % (str(record.annotations["start"]), 202 str(record.annotations["end"])) 203 if seq_name[-len(suffix):] != suffix: 204 seq_name = "%s/%s-%s" % (seq_name, 205 str(record.annotations["start"]), 206 str(record.annotations["end"])) 207 208 if seq_name in self._ids_written: 209 raise ValueError("Duplicate record identifier: %s" % seq_name) 210 self._ids_written.append(seq_name) 211 self.handle.write("%s %s\n" % (seq_name, str(record.seq))) 212 213 #The recommended placement for GS lines (per sequence annotation) 214 #is above the alignment (as a header block) or just below the 215 #corresponding sequence. 216 # 217 #The recommended placement for GR lines (per sequence per column 218 #annotation such as secondary structure) is just below the 219 #corresponding sequence. 220 # 221 #We put both just below the corresponding sequence as this allows 222 #us to write the file using a single pass through the records. 223 224 #AC = Accession 225 if "accession" in record.annotations: 226 self.handle.write("#=GS %s AC %s\n" 227 % (seq_name, self.clean(record.annotations["accession"]))) 228 elif record.id: 229 self.handle.write("#=GS %s AC %s\n" 230 % (seq_name, self.clean(record.id))) 231 232 #DE = description 233 if record.description: 234 self.handle.write("#=GS %s DE %s\n" 235 % (seq_name, self.clean(record.description))) 236 237 #DE = database links 238 for xref in record.dbxrefs: 239 self.handle.write("#=GS %s DR %s\n" 240 % (seq_name, self.clean(xref))) 241 242 #GS = other per sequence annotation 243 for key, value in record.annotations.items(): 244 if key in self.pfam_gs_mapping: 245 data = self.clean(str(value)) 246 if data: 247 self.handle.write("#=GS %s %s %s\n" 248 % (seq_name, 249 self.clean(self.pfam_gs_mapping[key]), 250 data)) 251 else: 252 #It doesn't follow the PFAM standards, but should we record 253 #this data anyway? 254 pass 255 256 #GR = per row per column sequence annotation 257 for key, value in record.letter_annotations.items(): 258 if key in self.pfam_gr_mapping and len(str(value)) == len(record.seq): 259 data = self.clean(str(value)) 260 if data: 261 self.handle.write("#=GR %s %s %s\n" 262 % (seq_name, 263 self.clean(self.pfam_gr_mapping[key]), 264 data)) 265 else: 266 #It doesn't follow the PFAM standards, but should we record 267 #this data anyway? 268 pass
269 270
271 -class StockholmIterator(AlignmentIterator):
272 """Loads a Stockholm file from PFAM into MultipleSeqAlignment objects. 273 274 The file may contain multiple concatenated alignments, which are loaded 275 and returned incrementally. 276 277 This parser will detect if the Stockholm file follows the PFAM 278 conventions for sequence specific meta-data (lines starting #=GS 279 and #=GR) and populates the SeqRecord fields accordingly. 280 281 Any annotation which does not follow the PFAM conventions is currently 282 ignored. 283 284 If an accession is provided for an entry in the meta data, IT WILL NOT 285 be used as the record.id (it will be recorded in the record's 286 annotations). This is because some files have (sub) sequences from 287 different parts of the same accession (differentiated by different 288 start-end positions). 289 290 Wrap-around alignments are not supported - each sequences must be on 291 a single line. However, interlaced sequences should work. 292 293 For more information on the file format, please see: 294 http://www.bioperl.org/wiki/Stockholm_multiple_alignment_format 295 http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html 296 297 For consistency with BioPerl and EMBOSS we call this the "stockholm" 298 format. 299 """ 300 301 #These dictionaries should be kept in sync with those 302 #defined in the PfamStockholmWriter class. 303 pfam_gr_mapping = {"SS": "secondary_structure", 304 "SA": "surface_accessibility", 305 "TM": "transmembrane", 306 "PP": "posterior_probability", 307 "LI": "ligand_binding", 308 "AS": "active_site", 309 "IN": "intron"} 310 #Following dictionary deliberately does not cover AC, DE or DR 311 pfam_gs_mapping = {"OS": "organism", 312 "OC": "organism_classification", 313 "LO": "look"} 314
315 - def __next__(self):
316 try: 317 line = self._header 318 del self._header 319 except AttributeError: 320 line = self.handle.readline() 321 if not line: 322 #Empty file - just give up. 323 raise StopIteration 324 if not line.strip() == '# STOCKHOLM 1.0': 325 raise ValueError("Did not find STOCKHOLM header") 326 327 # Note: If this file follows the PFAM conventions, there should be 328 # a line containing the number of sequences, e.g. "#=GF SQ 67" 329 # We do not check for this - perhaps we should, and verify that 330 # if present it agrees with our parsing. 331 332 seqs = {} 333 ids = [] 334 gs = {} 335 gr = {} 336 gf = {} 337 passed_end_alignment = False 338 while True: 339 line = self.handle.readline() 340 if not line: 341 break # end of file 342 line = line.strip() # remove trailing \n 343 if line == '# STOCKHOLM 1.0': 344 self._header = line 345 break 346 elif line == "//": 347 #The "//" line indicates the end of the alignment. 348 #There may still be more meta-data 349 passed_end_alignment = True 350 elif line == "": 351 #blank line, ignore 352 pass 353 elif line[0] != "#": 354 #Sequence 355 #Format: "<seqname> <sequence>" 356 assert not passed_end_alignment 357 parts = [x.strip() for x in line.split(" ", 1)] 358 if len(parts) != 2: 359 #This might be someone attempting to store a zero length sequence? 360 raise ValueError("Could not split line into identifier " 361 + "and sequence:\n" + line) 362 id, seq = parts 363 if id not in ids: 364 ids.append(id) 365 seqs.setdefault(id, '') 366 seqs[id] += seq.replace(".", "-") 367 elif len(line) >= 5: 368 #Comment line or meta-data 369 if line[:5] == "#=GF ": 370 #Generic per-File annotation, free text 371 #Format: #=GF <feature> <free text> 372 feature, text = line[5:].strip().split(None, 1) 373 #Each feature key could be used more than once, 374 #so store the entries as a list of strings. 375 if feature not in gf: 376 gf[feature] = [text] 377 else: 378 gf[feature].append(text) 379 elif line[:5] == '#=GC ': 380 #Generic per-Column annotation, exactly 1 char per column 381 #Format: "#=GC <feature> <exactly 1 char per column>" 382 pass 383 elif line[:5] == '#=GS ': 384 #Generic per-Sequence annotation, free text 385 #Format: "#=GS <seqname> <feature> <free text>" 386 id, feature, text = line[5:].strip().split(None, 2) 387 #if id not in ids: 388 # ids.append(id) 389 if id not in gs: 390 gs[id] = {} 391 if feature not in gs[id]: 392 gs[id][feature] = [text] 393 else: 394 gs[id][feature].append(text) 395 elif line[:5] == "#=GR ": 396 #Generic per-Sequence AND per-Column markup 397 #Format: "#=GR <seqname> <feature> <exactly 1 char per column>" 398 id, feature, text = line[5:].strip().split(None, 2) 399 #if id not in ids: 400 # ids.append(id) 401 if id not in gr: 402 gr[id] = {} 403 if feature not in gr[id]: 404 gr[id][feature] = "" 405 gr[id][feature] += text.strip() # append to any previous entry 406 #TODO - Should we check the length matches the alignment length? 407 # For iterlaced sequences the GR data can be split over 408 # multiple lines 409 #Next line... 410 411 assert len(seqs) <= len(ids) 412 #assert len(gs) <= len(ids) 413 #assert len(gr) <= len(ids) 414 415 self.ids = ids 416 self.sequences = seqs 417 self.seq_annotation = gs 418 self.seq_col_annotation = gr 419 420 if ids and seqs: 421 422 if self.records_per_alignment is not None \ 423 and self.records_per_alignment != len(ids): 424 raise ValueError("Found %i records in this alignment, told to expect %i" 425 % (len(ids), self.records_per_alignment)) 426 427 alignment_length = len(list(seqs.values())[0]) 428 records = [] # Alignment obj will put them all in a list anyway 429 for id in ids: 430 seq = seqs[id] 431 if alignment_length != len(seq): 432 raise ValueError("Sequences have different lengths, or repeated identifier") 433 name, start, end = self._identifier_split(id) 434 record = SeqRecord(Seq(seq, self.alphabet), 435 id=id, name=name, description=id, 436 annotations={"accession": name}) 437 #Accession will be overridden by _populate_meta_data if an explicit 438 #accession is provided: 439 record.annotations["accession"] = name 440 441 if start is not None: 442 record.annotations["start"] = start 443 if end is not None: 444 record.annotations["end"] = end 445 446 self._populate_meta_data(id, record) 447 records.append(record) 448 alignment = MultipleSeqAlignment(records, self.alphabet) 449 450 #TODO - Introduce an annotated alignment class? 451 #For now, store the annotation a new private property: 452 alignment._annotations = gr 453 454 return alignment 455 else: 456 raise StopIteration
457
458 - def _identifier_split(self, identifier):
459 """Returns (name, start, end) string tuple from an identier.""" 460 if '/' in identifier: 461 name, start_end = identifier.rsplit("/", 1) 462 if start_end.count("-") == 1: 463 try: 464 start, end = start_end.split("-") 465 return name, int(start), int(end) 466 except ValueError: 467 # Non-integers after final '/' - fall through 468 pass 469 return identifier, None, None
470
471 - def _get_meta_data(self, identifier, meta_dict):
472 """Takes an itentifier and returns dict of all meta-data matching it. 473 474 For example, given "Q9PN73_CAMJE/149-220" will return all matches to 475 this or "Q9PN73_CAMJE" which the identifier without its /start-end 476 suffix. 477 478 In the example below, the suffix is required to match the AC, but must 479 be removed to match the OS and OC meta-data:: 480 481 # STOCKHOLM 1.0 482 #=GS Q9PN73_CAMJE/149-220 AC Q9PN73 483 ... 484 Q9PN73_CAMJE/149-220 NKA... 485 ... 486 #=GS Q9PN73_CAMJE OS Campylobacter jejuni 487 #=GS Q9PN73_CAMJE OC Bacteria 488 489 This function will return an empty dictionary if no data is found.""" 490 name, start, end = self._identifier_split(identifier) 491 if name == identifier: 492 identifier_keys = [identifier] 493 else: 494 identifier_keys = [identifier, name] 495 answer = {} 496 for identifier_key in identifier_keys: 497 try: 498 for feature_key in meta_dict[identifier_key]: 499 answer[feature_key] = meta_dict[identifier_key][feature_key] 500 except KeyError: 501 pass 502 return answer
503
504 - def _populate_meta_data(self, identifier, record):
505 """Adds meta-date to a SecRecord's annotations dictionary. 506 507 This function applies the PFAM conventions.""" 508 509 seq_data = self._get_meta_data(identifier, self.seq_annotation) 510 for feature in seq_data: 511 #Note this dictionary contains lists! 512 if feature == "AC": # ACcession number 513 assert len(seq_data[feature]) == 1 514 record.annotations["accession"] = seq_data[feature][0] 515 elif feature == "DE": # DEscription 516 record.description = "\n".join(seq_data[feature]) 517 elif feature == "DR": # Database Reference 518 #Should we try and parse the strings? 519 record.dbxrefs = seq_data[feature] 520 elif feature in self.pfam_gs_mapping: 521 record.annotations[self.pfam_gs_mapping[feature]] = ", ".join(seq_data[feature]) 522 else: 523 #Ignore it? 524 record.annotations["GS:" + feature] = ", ".join(seq_data[feature]) 525 526 #Now record the per-letter-annotations 527 seq_col_data = self._get_meta_data(identifier, self.seq_col_annotation) 528 for feature in seq_col_data: 529 #Note this dictionary contains strings! 530 if feature in self.pfam_gr_mapping: 531 record.letter_annotations[self.pfam_gr_mapping[feature]] = seq_col_data[feature] 532 else: 533 #Ignore it? 534 record.letter_annotations["GR:" + feature] = seq_col_data[feature]
535 536 537 if __name__ == "__main__": 538 from Bio._utils import run_doctest 539 run_doctest() 540