Package Bio :: Package AlignIO :: Module StockholmIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.AlignIO.StockholmIO

  1  # Copyright 2006-2016 by Peter Cock.  All rights reserved. 
  2  # Revisions copyright 2015 by Ben Woodcroft.  All rights reserved. 
  3  # 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """Bio.AlignIO support for "stockholm" format (used in the PFAM database). 
  9   
 10  You are expected to use this module via the Bio.AlignIO functions (or the 
 11  Bio.SeqIO functions if you want to work directly with the gapped sequences). 
 12   
 13  For example, consider a Stockholm alignment file containing the following:: 
 14   
 15      # STOCKHOLM 1.0 
 16      #=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>.. 
 17      AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU 
 18      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..-- 
 19      AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU 
 20      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>---- 
 21   
 22      #=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>............... 
 23      AP001509.1         CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 24      #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>--------------- 
 25      AE007476.1         UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 26      #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 27      // 
 28   
 29  This is a single multiple sequence alignment, so you would probably load this 
 30  using the Bio.AlignIO.read() function: 
 31   
 32      >>> from Bio import AlignIO 
 33      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm") 
 34      >>> print(align) 
 35      SingleLetterAlphabet() alignment with 2 rows and 104 columns 
 36      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 37      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 38      >>> for record in align: 
 39      ...     print("%s %i" % (record.id, len(record))) 
 40      AP001509.1 104 
 41      AE007476.1 104 
 42   
 43  This example file is clearly using RNA, so you might want the alignment object 
 44  (and the SeqRecord objects it holds) to reflect this, rather than simple using 
 45  the default single letter alphabet as shown above.  You can do this with an 
 46  optional argument to the Bio.AlignIO.read() function: 
 47   
 48      >>> from Bio import AlignIO 
 49      >>> from Bio.Alphabet import generic_rna 
 50      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm", 
 51      ...                      alphabet=generic_rna) 
 52      >>> print(align) 
 53      RNAAlphabet() alignment with 2 rows and 104 columns 
 54      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 55      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 56   
 57  In addition to the sequences themselves, this example alignment also includes 
 58  some GR lines for the secondary structure of the sequences.  These are 
 59  strings, with one character for each letter in the associated sequence: 
 60   
 61      >>> for record in align: 
 62      ...     print(record.id) 
 63      ...     print(record.seq) 
 64      ...     print(record.letter_annotations['secondary_structure']) 
 65      AP001509.1 
 66      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 67      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 68      AE007476.1 
 69      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 70      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 71   
 72  Any general annotation for each row is recorded in the SeqRecord's annotations 
 73  dictionary.  You can output this alignment in many different file formats 
 74  using Bio.AlignIO.write(), or the MultipleSeqAlignment object's format method: 
 75   
 76      >>> print(align.format("fasta")) 
 77      >AP001509.1 
 78      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-A 
 79      GGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 80      >AE007476.1 
 81      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAA 
 82      GGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 83      <BLANKLINE> 
 84   
 85  Most output formats won't be able to hold the annotation possible in a 
 86  Stockholm file: 
 87   
 88      >>> print(align.format("stockholm")) 
 89      # STOCKHOLM 1.0 
 90      #=GF SQ 2 
 91      AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 92      #=GS AP001509.1 AC AP001509.1 
 93      #=GS AP001509.1 DE AP001509.1 
 94      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 95      AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 96      #=GS AE007476.1 AC AE007476.1 
 97      #=GS AE007476.1 DE AE007476.1 
 98      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 99      // 
100      <BLANKLINE> 
101   
102  Note that when writing Stockholm files, AlignIO does not break long sequences 
103  up and interleave them (as in the input file shown above).  The standard 
104  allows this simpler layout, and it is more likely to be understood by other 
105  tools. 
106   
107  Finally, as an aside, it can sometimes be useful to use Bio.SeqIO.parse() to 
108  iterate over the alignment rows as SeqRecord objects - rather than working 
109  with Alignnment objects. Again, if you want to you can specify this is RNA: 
110   
111      >>> from Bio import SeqIO 
112      >>> from Bio.Alphabet import generic_rna 
113      >>> for record in SeqIO.parse("Stockholm/simple.sth", "stockholm", 
114      ...                           alphabet=generic_rna): 
115      ...     print(record.id) 
116      ...     print(record.seq) 
117      ...     print(record.letter_annotations['secondary_structure']) 
118      AP001509.1 
119      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
120      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
121      AE007476.1 
122      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
123      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
124   
125  Remember that if you slice a SeqRecord, the per-letter-annotations like the 
126  secondary structure string here, are also sliced: 
127   
128      >>> sub_record = record[10:20] 
129      >>> print(sub_record.seq) 
130      AUCGUUUUAC 
131      >>> print(sub_record.letter_annotations['secondary_structure']) 
132      -------<<< 
133   
134  """ 
135  from __future__ import print_function 
136   
137  from collections import OrderedDict 
138   
139  from Bio.Seq import Seq 
140  from Bio.SeqRecord import SeqRecord 
141  from Bio.Align import MultipleSeqAlignment 
142  from .Interfaces import AlignmentIterator, SequentialAlignmentWriter 
143   
144   
145 -class StockholmWriter(SequentialAlignmentWriter):
146 """Stockholm/PFAM alignment writer.""" 147 148 # These dictionaries should be kept in sync with those 149 # defined in the StockholmIterator class. 150 pfam_gr_mapping = {"secondary_structure": "SS", 151 "surface_accessibility": "SA", 152 "transmembrane": "TM", 153 "posterior_probability": "PP", 154 "ligand_binding": "LI", 155 "active_site": "AS", 156 "intron": "IN"} 157 # Following dictionary deliberately does not cover AC, DE or DR 158 pfam_gs_mapping = {"organism": "OS", 159 "organism_classification": "OC", 160 "look": "LO"} 161
162 - def write_alignment(self, alignment):
163 """Use this to write (another) single alignment to an open file. 164 165 Note that sequences and their annotation are recorded 166 together (rather than having a block of annotation followed 167 by a block of aligned sequences). 168 """ 169 count = len(alignment) 170 171 self._length_of_sequences = alignment.get_alignment_length() 172 self._ids_written = [] 173 174 # NOTE - For now, the alignment object does not hold any per column 175 # or per alignment annotation - only per sequence. 176 177 if count == 0: 178 raise ValueError("Must have at least one sequence") 179 if self._length_of_sequences == 0: 180 raise ValueError("Non-empty sequences are required") 181 182 self.handle.write("# STOCKHOLM 1.0\n") 183 self.handle.write("#=GF SQ %i\n" % count) 184 for record in alignment: 185 self._write_record(record) 186 self.handle.write("//\n")
187
188 - def _write_record(self, record):
189 """Write a single SeqRecord to the file.""" 190 if self._length_of_sequences != len(record.seq): 191 raise ValueError("Sequences must all be the same length") 192 193 # For the case for stockholm to stockholm, try and use record.name 194 seq_name = record.id 195 if record.name is not None: 196 if "accession" in record.annotations: 197 if record.id == record.annotations["accession"]: 198 seq_name = record.name 199 200 # In the Stockholm file format, spaces are not allowed in the id 201 seq_name = seq_name.replace(" ", "_") 202 203 if "start" in record.annotations \ 204 and "end" in record.annotations: 205 suffix = "/%s-%s" % (str(record.annotations["start"]), 206 str(record.annotations["end"])) 207 if seq_name[-len(suffix):] != suffix: 208 seq_name = "%s/%s-%s" % ( 209 seq_name, 210 str(record.annotations["start"]), 211 str(record.annotations["end"])) 212 213 if seq_name in self._ids_written: 214 raise ValueError("Duplicate record identifier: %s" % seq_name) 215 self._ids_written.append(seq_name) 216 self.handle.write("%s %s\n" % (seq_name, str(record.seq))) 217 218 # The recommended placement for GS lines (per sequence annotation) 219 # is above the alignment (as a header block) or just below the 220 # corresponding sequence. 221 # 222 # The recommended placement for GR lines (per sequence per column 223 # annotation such as secondary structure) is just below the 224 # corresponding sequence. 225 # 226 # We put both just below the corresponding sequence as this allows 227 # us to write the file using a single pass through the records. 228 229 # AC = Accession 230 if "accession" in record.annotations: 231 self.handle.write("#=GS %s AC %s\n" % ( 232 seq_name, self.clean(record.annotations["accession"]))) 233 elif record.id: 234 self.handle.write("#=GS %s AC %s\n" % ( 235 seq_name, self.clean(record.id))) 236 237 # DE = description 238 if record.description: 239 self.handle.write("#=GS %s DE %s\n" % ( 240 seq_name, self.clean(record.description))) 241 242 # DE = database links 243 for xref in record.dbxrefs: 244 self.handle.write("#=GS %s DR %s\n" % ( 245 seq_name, self.clean(xref))) 246 247 # GS = other per sequence annotation 248 for key, value in record.annotations.items(): 249 if key in self.pfam_gs_mapping: 250 data = self.clean(str(value)) 251 if data: 252 self.handle.write("#=GS %s %s %s\n" 253 % (seq_name, 254 self.clean(self.pfam_gs_mapping[key]), 255 data)) 256 else: 257 # It doesn't follow the PFAM standards, but should we record 258 # this data anyway? 259 pass 260 261 # GR = per row per column sequence annotation 262 for key, value in record.letter_annotations.items(): 263 if key in self.pfam_gr_mapping and len(str(value)) == len(record.seq): 264 data = self.clean(str(value)) 265 if data: 266 self.handle.write("#=GR %s %s %s\n" 267 % (seq_name, 268 self.clean(self.pfam_gr_mapping[key]), 269 data)) 270 else: 271 # It doesn't follow the PFAM standards, but should we record 272 # this data anyway? 273 pass
274 275
276 -class StockholmIterator(AlignmentIterator):
277 """Loads a Stockholm file from PFAM into MultipleSeqAlignment objects. 278 279 The file may contain multiple concatenated alignments, which are loaded 280 and returned incrementally. 281 282 This parser will detect if the Stockholm file follows the PFAM 283 conventions for sequence specific meta-data (lines starting #=GS 284 and #=GR) and populates the SeqRecord fields accordingly. 285 286 Any annotation which does not follow the PFAM conventions is currently 287 ignored. 288 289 If an accession is provided for an entry in the meta data, IT WILL NOT 290 be used as the record.id (it will be recorded in the record's 291 annotations). This is because some files have (sub) sequences from 292 different parts of the same accession (differentiated by different 293 start-end positions). 294 295 Wrap-around alignments are not supported - each sequences must be on 296 a single line. However, interlaced sequences should work. 297 298 For more information on the file format, please see: 299 http://sonnhammer.sbc.su.se/Stockholm.html 300 https://en.wikipedia.org/wiki/Stockholm_format 301 http://bioperl.org/formats/alignment_formats/Stockholm_multiple_alignment_format.html 302 303 For consistency with BioPerl and EMBOSS we call this the "stockholm" 304 format. 305 """ 306 307 # These dictionaries should be kept in sync with those 308 # defined in the PfamStockholmWriter class. 309 pfam_gr_mapping = {"SS": "secondary_structure", 310 "SA": "surface_accessibility", 311 "TM": "transmembrane", 312 "PP": "posterior_probability", 313 "LI": "ligand_binding", 314 "AS": "active_site", 315 "IN": "intron"} 316 # Following dictionary deliberately does not cover AC, DE or DR 317 pfam_gs_mapping = {"OS": "organism", 318 "OC": "organism_classification", 319 "LO": "look"} 320 321 _header = None # for caching lines between __next__ calls 322
323 - def __next__(self):
324 """Parse the next alignment from the handle.""" 325 handle = self.handle 326 327 if self._header is None: 328 line = handle.readline() 329 else: 330 # Header we saved from when we were parsing 331 # the previous alignment. 332 line = self._header 333 self._header = None 334 335 if not line: 336 # Empty file - just give up. 337 raise StopIteration 338 if line.strip() != '# STOCKHOLM 1.0': 339 raise ValueError("Did not find STOCKHOLM header") 340 341 # Note: If this file follows the PFAM conventions, there should be 342 # a line containing the number of sequences, e.g. "#=GF SQ 67" 343 # We do not check for this - perhaps we should, and verify that 344 # if present it agrees with our parsing. 345 346 seqs = {} 347 ids = OrderedDict() # Really only need an OrderedSet, but python lacks this 348 gs = {} 349 gr = {} 350 gf = {} 351 passed_end_alignment = False 352 while True: 353 line = handle.readline() 354 if not line: 355 break # end of file 356 line = line.strip() # remove trailing \n 357 if line == '# STOCKHOLM 1.0': 358 self._header = line 359 break 360 elif line == "//": 361 # The "//" line indicates the end of the alignment. 362 # There may still be more meta-data 363 passed_end_alignment = True 364 elif line == "": 365 # blank line, ignore 366 pass 367 elif line[0] != "#": 368 # Sequence 369 # Format: "<seqname> <sequence>" 370 assert not passed_end_alignment 371 parts = [x.strip() for x in line.split(" ", 1)] 372 if len(parts) != 2: 373 # This might be someone attempting to store a zero length sequence? 374 raise ValueError( 375 "Could not split line into identifier " 376 "and sequence:\n" + line) 377 seq_id, seq = parts 378 if seq_id not in ids: 379 ids[seq_id] = True 380 seqs.setdefault(seq_id, '') 381 seqs[seq_id] += seq.replace(".", "-") 382 elif len(line) >= 5: 383 # Comment line or meta-data 384 if line[:5] == "#=GF ": 385 # Generic per-File annotation, free text 386 # Format: #=GF <feature> <free text> 387 feature, text = line[5:].strip().split(None, 1) 388 # Each feature key could be used more than once, 389 # so store the entries as a list of strings. 390 if feature not in gf: 391 gf[feature] = [text] 392 else: 393 gf[feature].append(text) 394 elif line[:5] == '#=GC ': 395 # Generic per-Column annotation, exactly 1 char per column 396 # Format: "#=GC <feature> <exactly 1 char per column>" 397 pass 398 elif line[:5] == '#=GS ': 399 # Generic per-Sequence annotation, free text 400 # Format: "#=GS <seqname> <feature> <free text>" 401 seq_id, feature, text = line[5:].strip().split(None, 2) 402 # if seq_id not in ids: 403 # ids.append(seq_id) 404 if seq_id not in gs: 405 gs[seq_id] = {} 406 if feature not in gs[seq_id]: 407 gs[seq_id][feature] = [text] 408 else: 409 gs[seq_id][feature].append(text) 410 elif line[:5] == "#=GR ": 411 # Generic per-Sequence AND per-Column markup 412 # Format: "#=GR <seqname> <feature> <exactly 1 char per column>" 413 seq_id, feature, text = line[5:].strip().split(None, 2) 414 # if seq_id not in ids: 415 # ids.append(seq_id) 416 if seq_id not in gr: 417 gr[seq_id] = {} 418 if feature not in gr[seq_id]: 419 gr[seq_id][feature] = "" 420 gr[seq_id][feature] += text.strip() # append to any previous entry 421 # TODO - Should we check the length matches the alignment length? 422 # For iterlaced sequences the GR data can be split over 423 # multiple lines 424 # Next line... 425 426 assert len(seqs) <= len(ids) 427 # assert len(gs) <= len(ids) 428 # assert len(gr) <= len(ids) 429 430 self.ids = ids.keys() 431 self.sequences = seqs 432 self.seq_annotation = gs 433 self.seq_col_annotation = gr 434 435 if ids and seqs: 436 437 if self.records_per_alignment is not None \ 438 and self.records_per_alignment != len(ids): 439 raise ValueError("Found %i records in this alignment, told to expect %i" 440 % (len(ids), self.records_per_alignment)) 441 442 alignment_length = len(list(seqs.values())[0]) 443 records = [] # Alignment obj will put them all in a list anyway 444 for seq_id in ids: 445 seq = seqs[seq_id] 446 if alignment_length != len(seq): 447 raise ValueError("Sequences have different lengths, or repeated identifier") 448 name, start, end = self._identifier_split(seq_id) 449 record = SeqRecord(Seq(seq, self.alphabet), 450 id=seq_id, name=name, description=seq_id, 451 annotations={"accession": name}) 452 # Accession will be overridden by _populate_meta_data if an explicit 453 # accession is provided: 454 record.annotations["accession"] = name 455 456 if start is not None: 457 record.annotations["start"] = start 458 if end is not None: 459 record.annotations["end"] = end 460 461 self._populate_meta_data(seq_id, record) 462 records.append(record) 463 alignment = MultipleSeqAlignment(records, self.alphabet) 464 465 # TODO - Introduce an annotated alignment class? 466 # For now, store the annotation a new private property: 467 alignment._annotations = gr 468 469 return alignment 470 else: 471 raise StopIteration
472
473 - def _identifier_split(self, identifier):
474 """Return (name, start, end) string tuple from an identier.""" 475 if '/' in identifier: 476 name, start_end = identifier.rsplit("/", 1) 477 if start_end.count("-") == 1: 478 try: 479 start, end = start_end.split("-") 480 return name, int(start), int(end) 481 except ValueError: 482 # Non-integers after final '/' - fall through 483 pass 484 return identifier, None, None
485
486 - def _get_meta_data(self, identifier, meta_dict):
487 """Take an itentifier and returns dict of all meta-data matching it. 488 489 For example, given "Q9PN73_CAMJE/149-220" will return all matches to 490 this or "Q9PN73_CAMJE" which the identifier without its /start-end 491 suffix. 492 493 In the example below, the suffix is required to match the AC, but must 494 be removed to match the OS and OC meta-data:: 495 496 # STOCKHOLM 1.0 497 #=GS Q9PN73_CAMJE/149-220 AC Q9PN73 498 ... 499 Q9PN73_CAMJE/149-220 NKA... 500 ... 501 #=GS Q9PN73_CAMJE OS Campylobacter jejuni 502 #=GS Q9PN73_CAMJE OC Bacteria 503 504 This function will return an empty dictionary if no data is found. 505 """ 506 name, start, end = self._identifier_split(identifier) 507 if name == identifier: 508 identifier_keys = [identifier] 509 else: 510 identifier_keys = [identifier, name] 511 answer = {} 512 for identifier_key in identifier_keys: 513 try: 514 for feature_key in meta_dict[identifier_key]: 515 answer[feature_key] = meta_dict[identifier_key][feature_key] 516 except KeyError: 517 pass 518 return answer
519
520 - def _populate_meta_data(self, identifier, record):
521 """Add meta-date to a SecRecord's annotations dictionary. 522 523 This function applies the PFAM conventions. 524 """ 525 seq_data = self._get_meta_data(identifier, self.seq_annotation) 526 for feature in seq_data: 527 # Note this dictionary contains lists! 528 if feature == "AC": # ACcession number 529 assert len(seq_data[feature]) == 1 530 record.annotations["accession"] = seq_data[feature][0] 531 elif feature == "DE": # DEscription 532 record.description = "\n".join(seq_data[feature]) 533 elif feature == "DR": # Database Reference 534 # Should we try and parse the strings? 535 record.dbxrefs = seq_data[feature] 536 elif feature in self.pfam_gs_mapping: 537 record.annotations[self.pfam_gs_mapping[feature]] = ", ".join(seq_data[feature]) 538 else: 539 # Ignore it? 540 record.annotations["GS:" + feature] = ", ".join(seq_data[feature]) 541 542 # Now record the per-letter-annotations 543 seq_col_data = self._get_meta_data(identifier, self.seq_col_annotation) 544 for feature in seq_col_data: 545 # Note this dictionary contains strings! 546 if feature in self.pfam_gr_mapping: 547 record.letter_annotations[self.pfam_gr_mapping[feature]] = seq_col_data[feature] 548 else: 549 # Ignore it? 550 record.letter_annotations["GR:" + feature] = seq_col_data[feature]
551 552 553 if __name__ == "__main__": 554 from Bio._utils import run_doctest 555 run_doctest() 556