Package Bio :: Package AlignIO :: Module StockholmIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.AlignIO.StockholmIO

  1  # Copyright 2006-2015 by Peter Cock.  All rights reserved. 
  2  # 
  3  # This code is part of the Biopython distribution and governed by its 
  4  # license.  Please see the LICENSE file that should have been included 
  5  # as part of this package. 
  6  """Bio.AlignIO support for "stockholm" format (used in the PFAM database). 
  7   
  8  You are expected to use this module via the Bio.AlignIO functions (or the 
  9  Bio.SeqIO functions if you want to work directly with the gapped sequences). 
 10   
 11  For example, consider a Stockholm alignment file containing the following:: 
 12   
 13      # STOCKHOLM 1.0 
 14      #=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>.. 
 15      AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU 
 16      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..-- 
 17      AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU 
 18      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>---- 
 19   
 20      #=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>............... 
 21      AP001509.1         CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 22      #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>--------------- 
 23      AE007476.1         UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 24      #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 25      // 
 26   
 27  This is a single multiple sequence alignment, so you would probably load this 
 28  using the Bio.AlignIO.read() function: 
 29   
 30      >>> from Bio import AlignIO 
 31      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm") 
 32      >>> print(align) 
 33      SingleLetterAlphabet() alignment with 2 rows and 104 columns 
 34      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 35      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 36      >>> for record in align: 
 37      ...     print("%s %i" % (record.id, len(record))) 
 38      AP001509.1 104 
 39      AE007476.1 104 
 40   
 41  This example file is clearly using RNA, so you might want the alignment object 
 42  (and the SeqRecord objects it holds) to reflect this, rather than simple using 
 43  the default single letter alphabet as shown above.  You can do this with an 
 44  optional argument to the Bio.AlignIO.read() function: 
 45   
 46      >>> from Bio import AlignIO 
 47      >>> from Bio.Alphabet import generic_rna 
 48      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm", 
 49      ...                      alphabet=generic_rna) 
 50      >>> print(align) 
 51      RNAAlphabet() alignment with 2 rows and 104 columns 
 52      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 53      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 54   
 55  In addition to the sequences themselves, this example alignment also includes 
 56  some GR lines for the secondary structure of the sequences.  These are 
 57  strings, with one character for each letter in the associated sequence: 
 58   
 59      >>> for record in align: 
 60      ...     print(record.id) 
 61      ...     print(record.seq) 
 62      ...     print(record.letter_annotations['secondary_structure']) 
 63      AP001509.1 
 64      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 65      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 66      AE007476.1 
 67      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 68      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 69   
 70  Any general annotation for each row is recorded in the SeqRecord's annotations 
 71  dictionary.  You can output this alignment in many different file formats 
 72  using Bio.AlignIO.write(), or the MultipleSeqAlignment object's format method: 
 73   
 74      >>> print(align.format("fasta")) 
 75      >AP001509.1 
 76      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-A 
 77      GGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 78      >AE007476.1 
 79      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAA 
 80      GGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 81      <BLANKLINE> 
 82   
 83  Most output formats won't be able to hold the annotation possible in a 
 84  Stockholm file: 
 85   
 86      >>> print(align.format("stockholm")) 
 87      # STOCKHOLM 1.0 
 88      #=GF SQ 2 
 89      AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 90      #=GS AP001509.1 AC AP001509.1 
 91      #=GS AP001509.1 DE AP001509.1 
 92      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 93      AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 94      #=GS AE007476.1 AC AE007476.1 
 95      #=GS AE007476.1 DE AE007476.1 
 96      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 97      // 
 98      <BLANKLINE> 
 99   
100  Note that when writing Stockholm files, AlignIO does not break long sequences 
101  up and interleave them (as in the input file shown above).  The standard 
102  allows this simpler layout, and it is more likely to be understood by other 
103  tools. 
104   
105  Finally, as an aside, it can sometimes be useful to use Bio.SeqIO.parse() to 
106  iterate over the alignment rows as SeqRecord objects - rather than working 
107  with Alignnment objects. Again, if you want to you can specify this is RNA: 
108   
109      >>> from Bio import SeqIO 
110      >>> from Bio.Alphabet import generic_rna 
111      >>> for record in SeqIO.parse("Stockholm/simple.sth", "stockholm", 
112      ...                           alphabet=generic_rna): 
113      ...     print(record.id) 
114      ...     print(record.seq) 
115      ...     print(record.letter_annotations['secondary_structure']) 
116      AP001509.1 
117      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
118      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
119      AE007476.1 
120      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
121      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
122   
123  Remember that if you slice a SeqRecord, the per-letter-annotations like the 
124  secondary structure string here, are also sliced: 
125   
126      >>> sub_record = record[10:20] 
127      >>> print(sub_record.seq) 
128      AUCGUUUUAC 
129      >>> print(sub_record.letter_annotations['secondary_structure']) 
130      -------<<< 
131  """ 
132  from __future__ import print_function 
133   
134  __docformat__ = "restructuredtext en"  # not just plaintext 
135  from Bio.Seq import Seq 
136  from Bio.SeqRecord import SeqRecord 
137  from Bio.Align import MultipleSeqAlignment 
138  from .Interfaces import AlignmentIterator, SequentialAlignmentWriter 
139   
140   
141 -class StockholmWriter(SequentialAlignmentWriter):
142 """Stockholm/PFAM alignment writer.""" 143 144 # These dictionaries should be kept in sync with those 145 # defined in the StockholmIterator class. 146 pfam_gr_mapping = {"secondary_structure": "SS", 147 "surface_accessibility": "SA", 148 "transmembrane": "TM", 149 "posterior_probability": "PP", 150 "ligand_binding": "LI", 151 "active_site": "AS", 152 "intron": "IN"} 153 # Following dictionary deliberately does not cover AC, DE or DR 154 pfam_gs_mapping = {"organism": "OS", 155 "organism_classification": "OC", 156 "look": "LO"} 157
158 - def write_alignment(self, alignment):
159 """Use this to write (another) single alignment to an open file. 160 161 Note that sequences and their annotation are recorded 162 together (rather than having a block of annotation followed 163 by a block of aligned sequences). 164 """ 165 count = len(alignment) 166 167 self._length_of_sequences = alignment.get_alignment_length() 168 self._ids_written = [] 169 170 # NOTE - For now, the alignment object does not hold any per column 171 # or per alignment annotation - only per sequence. 172 173 if count == 0: 174 raise ValueError("Must have at least one sequence") 175 if self._length_of_sequences == 0: 176 raise ValueError("Non-empty sequences are required") 177 178 self.handle.write("# STOCKHOLM 1.0\n") 179 self.handle.write("#=GF SQ %i\n" % count) 180 for record in alignment: 181 self._write_record(record) 182 self.handle.write("//\n")
183
184 - def _write_record(self, record):
185 """Write a single SeqRecord to the file""" 186 if self._length_of_sequences != len(record.seq): 187 raise ValueError("Sequences must all be the same length") 188 189 # For the case for stockholm to stockholm, try and use record.name 190 seq_name = record.id 191 if record.name is not None: 192 if "accession" in record.annotations: 193 if record.id == record.annotations["accession"]: 194 seq_name = record.name 195 196 # In the Stockholm file format, spaces are not allowed in the id 197 seq_name = seq_name.replace(" ", "_") 198 199 if "start" in record.annotations \ 200 and "end" in record.annotations: 201 suffix = "/%s-%s" % (str(record.annotations["start"]), 202 str(record.annotations["end"])) 203 if seq_name[-len(suffix):] != suffix: 204 seq_name = "%s/%s-%s" % (seq_name, 205 str(record.annotations["start"]), 206 str(record.annotations["end"])) 207 208 if seq_name in self._ids_written: 209 raise ValueError("Duplicate record identifier: %s" % seq_name) 210 self._ids_written.append(seq_name) 211 self.handle.write("%s %s\n" % (seq_name, str(record.seq))) 212 213 # The recommended placement for GS lines (per sequence annotation) 214 # is above the alignment (as a header block) or just below the 215 # corresponding sequence. 216 # 217 # The recommended placement for GR lines (per sequence per column 218 # annotation such as secondary structure) is just below the 219 # corresponding sequence. 220 # 221 # We put both just below the corresponding sequence as this allows 222 # us to write the file using a single pass through the records. 223 224 # AC = Accession 225 if "accession" in record.annotations: 226 self.handle.write("#=GS %s AC %s\n" 227 % (seq_name, self.clean(record.annotations["accession"]))) 228 elif record.id: 229 self.handle.write("#=GS %s AC %s\n" 230 % (seq_name, self.clean(record.id))) 231 232 # DE = description 233 if record.description: 234 self.handle.write("#=GS %s DE %s\n" 235 % (seq_name, self.clean(record.description))) 236 237 # DE = database links 238 for xref in record.dbxrefs: 239 self.handle.write("#=GS %s DR %s\n" 240 % (seq_name, self.clean(xref))) 241 242 # GS = other per sequence annotation 243 for key, value in record.annotations.items(): 244 if key in self.pfam_gs_mapping: 245 data = self.clean(str(value)) 246 if data: 247 self.handle.write("#=GS %s %s %s\n" 248 % (seq_name, 249 self.clean(self.pfam_gs_mapping[key]), 250 data)) 251 else: 252 # It doesn't follow the PFAM standards, but should we record 253 # this data anyway? 254 pass 255 256 # GR = per row per column sequence annotation 257 for key, value in record.letter_annotations.items(): 258 if key in self.pfam_gr_mapping and len(str(value)) == len(record.seq): 259 data = self.clean(str(value)) 260 if data: 261 self.handle.write("#=GR %s %s %s\n" 262 % (seq_name, 263 self.clean(self.pfam_gr_mapping[key]), 264 data)) 265 else: 266 # It doesn't follow the PFAM standards, but should we record 267 # this data anyway? 268 pass
269 270
271 -class StockholmIterator(AlignmentIterator):
272 """Loads a Stockholm file from PFAM into MultipleSeqAlignment objects. 273 274 The file may contain multiple concatenated alignments, which are loaded 275 and returned incrementally. 276 277 This parser will detect if the Stockholm file follows the PFAM 278 conventions for sequence specific meta-data (lines starting #=GS 279 and #=GR) and populates the SeqRecord fields accordingly. 280 281 Any annotation which does not follow the PFAM conventions is currently 282 ignored. 283 284 If an accession is provided for an entry in the meta data, IT WILL NOT 285 be used as the record.id (it will be recorded in the record's 286 annotations). This is because some files have (sub) sequences from 287 different parts of the same accession (differentiated by different 288 start-end positions). 289 290 Wrap-around alignments are not supported - each sequences must be on 291 a single line. However, interlaced sequences should work. 292 293 For more information on the file format, please see: 294 http://www.bioperl.org/wiki/Stockholm_multiple_alignment_format 295 http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html 296 297 For consistency with BioPerl and EMBOSS we call this the "stockholm" 298 format. 299 """ 300 301 # These dictionaries should be kept in sync with those 302 # defined in the PfamStockholmWriter class. 303 pfam_gr_mapping = {"SS": "secondary_structure", 304 "SA": "surface_accessibility", 305 "TM": "transmembrane", 306 "PP": "posterior_probability", 307 "LI": "ligand_binding", 308 "AS": "active_site", 309 "IN": "intron"} 310 # Following dictionary deliberately does not cover AC, DE or DR 311 pfam_gs_mapping = {"OS": "organism", 312 "OC": "organism_classification", 313 "LO": "look"} 314 315 _header = None # for caching lines between __next__ calls 316
317 - def __next__(self):
318 handle = self.handle 319 320 if self._header is None: 321 line = handle.readline() 322 else: 323 # Header we saved from when we were parsing 324 # the previous alignment. 325 line = self._header 326 self._header = None 327 328 if not line: 329 # Empty file - just give up. 330 raise StopIteration 331 if not line.strip() == '# STOCKHOLM 1.0': 332 raise ValueError("Did not find STOCKHOLM header") 333 334 # Note: If this file follows the PFAM conventions, there should be 335 # a line containing the number of sequences, e.g. "#=GF SQ 67" 336 # We do not check for this - perhaps we should, and verify that 337 # if present it agrees with our parsing. 338 339 seqs = {} 340 ids = [] 341 gs = {} 342 gr = {} 343 gf = {} 344 passed_end_alignment = False 345 while True: 346 line = handle.readline() 347 if not line: 348 break # end of file 349 line = line.strip() # remove trailing \n 350 if line == '# STOCKHOLM 1.0': 351 self._header = line 352 break 353 elif line == "//": 354 # The "//" line indicates the end of the alignment. 355 # There may still be more meta-data 356 passed_end_alignment = True 357 elif line == "": 358 # blank line, ignore 359 pass 360 elif line[0] != "#": 361 # Sequence 362 # Format: "<seqname> <sequence>" 363 assert not passed_end_alignment 364 parts = [x.strip() for x in line.split(" ", 1)] 365 if len(parts) != 2: 366 # This might be someone attempting to store a zero length sequence? 367 raise ValueError("Could not split line into identifier " 368 + "and sequence:\n" + line) 369 id, seq = parts 370 if id not in ids: 371 ids.append(id) 372 seqs.setdefault(id, '') 373 seqs[id] += seq.replace(".", "-") 374 elif len(line) >= 5: 375 # Comment line or meta-data 376 if line[:5] == "#=GF ": 377 # Generic per-File annotation, free text 378 # Format: #=GF <feature> <free text> 379 feature, text = line[5:].strip().split(None, 1) 380 # Each feature key could be used more than once, 381 # so store the entries as a list of strings. 382 if feature not in gf: 383 gf[feature] = [text] 384 else: 385 gf[feature].append(text) 386 elif line[:5] == '#=GC ': 387 # Generic per-Column annotation, exactly 1 char per column 388 # Format: "#=GC <feature> <exactly 1 char per column>" 389 pass 390 elif line[:5] == '#=GS ': 391 # Generic per-Sequence annotation, free text 392 # Format: "#=GS <seqname> <feature> <free text>" 393 id, feature, text = line[5:].strip().split(None, 2) 394 # if id not in ids: 395 # ids.append(id) 396 if id not in gs: 397 gs[id] = {} 398 if feature not in gs[id]: 399 gs[id][feature] = [text] 400 else: 401 gs[id][feature].append(text) 402 elif line[:5] == "#=GR ": 403 # Generic per-Sequence AND per-Column markup 404 # Format: "#=GR <seqname> <feature> <exactly 1 char per column>" 405 id, feature, text = line[5:].strip().split(None, 2) 406 # if id not in ids: 407 # ids.append(id) 408 if id not in gr: 409 gr[id] = {} 410 if feature not in gr[id]: 411 gr[id][feature] = "" 412 gr[id][feature] += text.strip() # append to any previous entry 413 # TODO - Should we check the length matches the alignment length? 414 # For iterlaced sequences the GR data can be split over 415 # multiple lines 416 # Next line... 417 418 assert len(seqs) <= len(ids) 419 # assert len(gs) <= len(ids) 420 # assert len(gr) <= len(ids) 421 422 self.ids = ids 423 self.sequences = seqs 424 self.seq_annotation = gs 425 self.seq_col_annotation = gr 426 427 if ids and seqs: 428 429 if self.records_per_alignment is not None \ 430 and self.records_per_alignment != len(ids): 431 raise ValueError("Found %i records in this alignment, told to expect %i" 432 % (len(ids), self.records_per_alignment)) 433 434 alignment_length = len(list(seqs.values())[0]) 435 records = [] # Alignment obj will put them all in a list anyway 436 for id in ids: 437 seq = seqs[id] 438 if alignment_length != len(seq): 439 raise ValueError("Sequences have different lengths, or repeated identifier") 440 name, start, end = self._identifier_split(id) 441 record = SeqRecord(Seq(seq, self.alphabet), 442 id=id, name=name, description=id, 443 annotations={"accession": name}) 444 # Accession will be overridden by _populate_meta_data if an explicit 445 # accession is provided: 446 record.annotations["accession"] = name 447 448 if start is not None: 449 record.annotations["start"] = start 450 if end is not None: 451 record.annotations["end"] = end 452 453 self._populate_meta_data(id, record) 454 records.append(record) 455 alignment = MultipleSeqAlignment(records, self.alphabet) 456 457 # TODO - Introduce an annotated alignment class? 458 # For now, store the annotation a new private property: 459 alignment._annotations = gr 460 461 return alignment 462 else: 463 raise StopIteration
464
465 - def _identifier_split(self, identifier):
466 """Returns (name, start, end) string tuple from an identier.""" 467 if '/' in identifier: 468 name, start_end = identifier.rsplit("/", 1) 469 if start_end.count("-") == 1: 470 try: 471 start, end = start_end.split("-") 472 return name, int(start), int(end) 473 except ValueError: 474 # Non-integers after final '/' - fall through 475 pass 476 return identifier, None, None
477
478 - def _get_meta_data(self, identifier, meta_dict):
479 """Takes an itentifier and returns dict of all meta-data matching it. 480 481 For example, given "Q9PN73_CAMJE/149-220" will return all matches to 482 this or "Q9PN73_CAMJE" which the identifier without its /start-end 483 suffix. 484 485 In the example below, the suffix is required to match the AC, but must 486 be removed to match the OS and OC meta-data:: 487 488 # STOCKHOLM 1.0 489 #=GS Q9PN73_CAMJE/149-220 AC Q9PN73 490 ... 491 Q9PN73_CAMJE/149-220 NKA... 492 ... 493 #=GS Q9PN73_CAMJE OS Campylobacter jejuni 494 #=GS Q9PN73_CAMJE OC Bacteria 495 496 This function will return an empty dictionary if no data is found.""" 497 name, start, end = self._identifier_split(identifier) 498 if name == identifier: 499 identifier_keys = [identifier] 500 else: 501 identifier_keys = [identifier, name] 502 answer = {} 503 for identifier_key in identifier_keys: 504 try: 505 for feature_key in meta_dict[identifier_key]: 506 answer[feature_key] = meta_dict[identifier_key][feature_key] 507 except KeyError: 508 pass 509 return answer
510
511 - def _populate_meta_data(self, identifier, record):
512 """Adds meta-date to a SecRecord's annotations dictionary. 513 514 This function applies the PFAM conventions.""" 515 516 seq_data = self._get_meta_data(identifier, self.seq_annotation) 517 for feature in seq_data: 518 # Note this dictionary contains lists! 519 if feature == "AC": # ACcession number 520 assert len(seq_data[feature]) == 1 521 record.annotations["accession"] = seq_data[feature][0] 522 elif feature == "DE": # DEscription 523 record.description = "\n".join(seq_data[feature]) 524 elif feature == "DR": # Database Reference 525 # Should we try and parse the strings? 526 record.dbxrefs = seq_data[feature] 527 elif feature in self.pfam_gs_mapping: 528 record.annotations[self.pfam_gs_mapping[feature]] = ", ".join(seq_data[feature]) 529 else: 530 # Ignore it? 531 record.annotations["GS:" + feature] = ", ".join(seq_data[feature]) 532 533 # Now record the per-letter-annotations 534 seq_col_data = self._get_meta_data(identifier, self.seq_col_annotation) 535 for feature in seq_col_data: 536 # Note this dictionary contains strings! 537 if feature in self.pfam_gr_mapping: 538 record.letter_annotations[self.pfam_gr_mapping[feature]] = seq_col_data[feature] 539 else: 540 # Ignore it? 541 record.letter_annotations["GR:" + feature] = seq_col_data[feature]
542 543 544 if __name__ == "__main__": 545 from Bio._utils import run_doctest 546 run_doctest() 547