Package Bio :: Package AlignIO :: Module StockholmIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.AlignIO.StockholmIO

  1  # Copyright 2006-2016 by Peter Cock.  All rights reserved. 
  2  # Revisions copyright 2015 by Ben Woodcroft.  All rights reserved. 
  3  # 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """Bio.AlignIO support for "stockholm" format (used in the PFAM database). 
  9   
 10  You are expected to use this module via the Bio.AlignIO functions (or the 
 11  Bio.SeqIO functions if you want to work directly with the gapped sequences). 
 12   
 13  For example, consider a Stockholm alignment file containing the following:: 
 14   
 15      # STOCKHOLM 1.0 
 16      #=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>.. 
 17      AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU 
 18      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..-- 
 19      AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU 
 20      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>---- 
 21   
 22      #=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>............... 
 23      AP001509.1         CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 24      #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>--------------- 
 25      AE007476.1         UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 26      #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 27      // 
 28   
 29  This is a single multiple sequence alignment, so you would probably load this 
 30  using the Bio.AlignIO.read() function: 
 31   
 32      >>> from Bio import AlignIO 
 33      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm") 
 34      >>> print(align) 
 35      SingleLetterAlphabet() alignment with 2 rows and 104 columns 
 36      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 37      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 38      >>> for record in align: 
 39      ...     print("%s %i" % (record.id, len(record))) 
 40      AP001509.1 104 
 41      AE007476.1 104 
 42   
 43  This example file is clearly using RNA, so you might want the alignment object 
 44  (and the SeqRecord objects it holds) to reflect this, rather than simple using 
 45  the default single letter alphabet as shown above.  You can do this with an 
 46  optional argument to the Bio.AlignIO.read() function: 
 47   
 48      >>> from Bio import AlignIO 
 49      >>> from Bio.Alphabet import generic_rna 
 50      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm", 
 51      ...                      alphabet=generic_rna) 
 52      >>> print(align) 
 53      RNAAlphabet() alignment with 2 rows and 104 columns 
 54      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 55      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 56   
 57  In addition to the sequences themselves, this example alignment also includes 
 58  some GR lines for the secondary structure of the sequences.  These are 
 59  strings, with one character for each letter in the associated sequence: 
 60   
 61      >>> for record in align: 
 62      ...     print(record.id) 
 63      ...     print(record.seq) 
 64      ...     print(record.letter_annotations['secondary_structure']) 
 65      AP001509.1 
 66      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 67      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 68      AE007476.1 
 69      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 70      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 71   
 72  Any general annotation for each row is recorded in the SeqRecord's annotations 
 73  dictionary.  You can output this alignment in many different file formats 
 74  using Bio.AlignIO.write(), or the MultipleSeqAlignment object's format method: 
 75   
 76      >>> print(align.format("fasta")) 
 77      >AP001509.1 
 78      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-A 
 79      GGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 80      >AE007476.1 
 81      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAA 
 82      GGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 83      <BLANKLINE> 
 84   
 85  Most output formats won't be able to hold the annotation possible in a 
 86  Stockholm file: 
 87   
 88      >>> print(align.format("stockholm")) 
 89      # STOCKHOLM 1.0 
 90      #=GF SQ 2 
 91      AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 92      #=GS AP001509.1 AC AP001509.1 
 93      #=GS AP001509.1 DE AP001509.1 
 94      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 95      AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 96      #=GS AE007476.1 AC AE007476.1 
 97      #=GS AE007476.1 DE AE007476.1 
 98      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 99      // 
100      <BLANKLINE> 
101   
102  Note that when writing Stockholm files, AlignIO does not break long sequences 
103  up and interleave them (as in the input file shown above).  The standard 
104  allows this simpler layout, and it is more likely to be understood by other 
105  tools. 
106   
107  Finally, as an aside, it can sometimes be useful to use Bio.SeqIO.parse() to 
108  iterate over the alignment rows as SeqRecord objects - rather than working 
109  with Alignnment objects. Again, if you want to you can specify this is RNA: 
110   
111      >>> from Bio import SeqIO 
112      >>> from Bio.Alphabet import generic_rna 
113      >>> for record in SeqIO.parse("Stockholm/simple.sth", "stockholm", 
114      ...                           alphabet=generic_rna): 
115      ...     print(record.id) 
116      ...     print(record.seq) 
117      ...     print(record.letter_annotations['secondary_structure']) 
118      AP001509.1 
119      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
120      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
121      AE007476.1 
122      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
123      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
124   
125  Remember that if you slice a SeqRecord, the per-letter-annotations like the 
126  secondary structure string here, are also sliced: 
127   
128      >>> sub_record = record[10:20] 
129      >>> print(sub_record.seq) 
130      AUCGUUUUAC 
131      >>> print(sub_record.letter_annotations['secondary_structure']) 
132      -------<<< 
133  """ 
134  from __future__ import print_function 
135   
136  from Bio.Seq import Seq 
137  from Bio.SeqRecord import SeqRecord 
138  from Bio.Align import MultipleSeqAlignment 
139  from .Interfaces import AlignmentIterator, SequentialAlignmentWriter 
140  from Bio._py3k import OrderedDict 
141   
142   
143 -class StockholmWriter(SequentialAlignmentWriter):
144 """Stockholm/PFAM alignment writer.""" 145 146 # These dictionaries should be kept in sync with those 147 # defined in the StockholmIterator class. 148 pfam_gr_mapping = {"secondary_structure": "SS", 149 "surface_accessibility": "SA", 150 "transmembrane": "TM", 151 "posterior_probability": "PP", 152 "ligand_binding": "LI", 153 "active_site": "AS", 154 "intron": "IN"} 155 # Following dictionary deliberately does not cover AC, DE or DR 156 pfam_gs_mapping = {"organism": "OS", 157 "organism_classification": "OC", 158 "look": "LO"} 159
160 - def write_alignment(self, alignment):
161 """Use this to write (another) single alignment to an open file. 162 163 Note that sequences and their annotation are recorded 164 together (rather than having a block of annotation followed 165 by a block of aligned sequences). 166 """ 167 count = len(alignment) 168 169 self._length_of_sequences = alignment.get_alignment_length() 170 self._ids_written = [] 171 172 # NOTE - For now, the alignment object does not hold any per column 173 # or per alignment annotation - only per sequence. 174 175 if count == 0: 176 raise ValueError("Must have at least one sequence") 177 if self._length_of_sequences == 0: 178 raise ValueError("Non-empty sequences are required") 179 180 self.handle.write("# STOCKHOLM 1.0\n") 181 self.handle.write("#=GF SQ %i\n" % count) 182 for record in alignment: 183 self._write_record(record) 184 self.handle.write("//\n")
185
186 - def _write_record(self, record):
187 """Write a single SeqRecord to the file""" 188 if self._length_of_sequences != len(record.seq): 189 raise ValueError("Sequences must all be the same length") 190 191 # For the case for stockholm to stockholm, try and use record.name 192 seq_name = record.id 193 if record.name is not None: 194 if "accession" in record.annotations: 195 if record.id == record.annotations["accession"]: 196 seq_name = record.name 197 198 # In the Stockholm file format, spaces are not allowed in the id 199 seq_name = seq_name.replace(" ", "_") 200 201 if "start" in record.annotations \ 202 and "end" in record.annotations: 203 suffix = "/%s-%s" % (str(record.annotations["start"]), 204 str(record.annotations["end"])) 205 if seq_name[-len(suffix):] != suffix: 206 seq_name = "%s/%s-%s" % (seq_name, 207 str(record.annotations["start"]), 208 str(record.annotations["end"])) 209 210 if seq_name in self._ids_written: 211 raise ValueError("Duplicate record identifier: %s" % seq_name) 212 self._ids_written.append(seq_name) 213 self.handle.write("%s %s\n" % (seq_name, str(record.seq))) 214 215 # The recommended placement for GS lines (per sequence annotation) 216 # is above the alignment (as a header block) or just below the 217 # corresponding sequence. 218 # 219 # The recommended placement for GR lines (per sequence per column 220 # annotation such as secondary structure) is just below the 221 # corresponding sequence. 222 # 223 # We put both just below the corresponding sequence as this allows 224 # us to write the file using a single pass through the records. 225 226 # AC = Accession 227 if "accession" in record.annotations: 228 self.handle.write("#=GS %s AC %s\n" 229 % (seq_name, self.clean(record.annotations["accession"]))) 230 elif record.id: 231 self.handle.write("#=GS %s AC %s\n" 232 % (seq_name, self.clean(record.id))) 233 234 # DE = description 235 if record.description: 236 self.handle.write("#=GS %s DE %s\n" 237 % (seq_name, self.clean(record.description))) 238 239 # DE = database links 240 for xref in record.dbxrefs: 241 self.handle.write("#=GS %s DR %s\n" 242 % (seq_name, self.clean(xref))) 243 244 # GS = other per sequence annotation 245 for key, value in record.annotations.items(): 246 if key in self.pfam_gs_mapping: 247 data = self.clean(str(value)) 248 if data: 249 self.handle.write("#=GS %s %s %s\n" 250 % (seq_name, 251 self.clean(self.pfam_gs_mapping[key]), 252 data)) 253 else: 254 # It doesn't follow the PFAM standards, but should we record 255 # this data anyway? 256 pass 257 258 # GR = per row per column sequence annotation 259 for key, value in record.letter_annotations.items(): 260 if key in self.pfam_gr_mapping and len(str(value)) == len(record.seq): 261 data = self.clean(str(value)) 262 if data: 263 self.handle.write("#=GR %s %s %s\n" 264 % (seq_name, 265 self.clean(self.pfam_gr_mapping[key]), 266 data)) 267 else: 268 # It doesn't follow the PFAM standards, but should we record 269 # this data anyway? 270 pass
271 272
273 -class StockholmIterator(AlignmentIterator):
274 """Loads a Stockholm file from PFAM into MultipleSeqAlignment objects. 275 276 The file may contain multiple concatenated alignments, which are loaded 277 and returned incrementally. 278 279 This parser will detect if the Stockholm file follows the PFAM 280 conventions for sequence specific meta-data (lines starting #=GS 281 and #=GR) and populates the SeqRecord fields accordingly. 282 283 Any annotation which does not follow the PFAM conventions is currently 284 ignored. 285 286 If an accession is provided for an entry in the meta data, IT WILL NOT 287 be used as the record.id (it will be recorded in the record's 288 annotations). This is because some files have (sub) sequences from 289 different parts of the same accession (differentiated by different 290 start-end positions). 291 292 Wrap-around alignments are not supported - each sequences must be on 293 a single line. However, interlaced sequences should work. 294 295 For more information on the file format, please see: 296 http://www.bioperl.org/wiki/Stockholm_multiple_alignment_format 297 http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html 298 299 For consistency with BioPerl and EMBOSS we call this the "stockholm" 300 format. 301 """ 302 303 # These dictionaries should be kept in sync with those 304 # defined in the PfamStockholmWriter class. 305 pfam_gr_mapping = {"SS": "secondary_structure", 306 "SA": "surface_accessibility", 307 "TM": "transmembrane", 308 "PP": "posterior_probability", 309 "LI": "ligand_binding", 310 "AS": "active_site", 311 "IN": "intron"} 312 # Following dictionary deliberately does not cover AC, DE or DR 313 pfam_gs_mapping = {"OS": "organism", 314 "OC": "organism_classification", 315 "LO": "look"} 316 317 _header = None # for caching lines between __next__ calls 318
319 - def __next__(self):
320 handle = self.handle 321 322 if self._header is None: 323 line = handle.readline() 324 else: 325 # Header we saved from when we were parsing 326 # the previous alignment. 327 line = self._header 328 self._header = None 329 330 if not line: 331 # Empty file - just give up. 332 raise StopIteration 333 if not line.strip() == '# STOCKHOLM 1.0': 334 raise ValueError("Did not find STOCKHOLM header") 335 336 # Note: If this file follows the PFAM conventions, there should be 337 # a line containing the number of sequences, e.g. "#=GF SQ 67" 338 # We do not check for this - perhaps we should, and verify that 339 # if present it agrees with our parsing. 340 341 seqs = {} 342 ids = OrderedDict() # Really only need an OrderedSet, but python lacks this 343 gs = {} 344 gr = {} 345 gf = {} 346 passed_end_alignment = False 347 while True: 348 line = handle.readline() 349 if not line: 350 break # end of file 351 line = line.strip() # remove trailing \n 352 if line == '# STOCKHOLM 1.0': 353 self._header = line 354 break 355 elif line == "//": 356 # The "//" line indicates the end of the alignment. 357 # There may still be more meta-data 358 passed_end_alignment = True 359 elif line == "": 360 # blank line, ignore 361 pass 362 elif line[0] != "#": 363 # Sequence 364 # Format: "<seqname> <sequence>" 365 assert not passed_end_alignment 366 parts = [x.strip() for x in line.split(" ", 1)] 367 if len(parts) != 2: 368 # This might be someone attempting to store a zero length sequence? 369 raise ValueError("Could not split line into identifier " 370 "and sequence:\n" + line) 371 id, seq = parts 372 if id not in ids: 373 ids[id] = True 374 seqs.setdefault(id, '') 375 seqs[id] += seq.replace(".", "-") 376 elif len(line) >= 5: 377 # Comment line or meta-data 378 if line[:5] == "#=GF ": 379 # Generic per-File annotation, free text 380 # Format: #=GF <feature> <free text> 381 feature, text = line[5:].strip().split(None, 1) 382 # Each feature key could be used more than once, 383 # so store the entries as a list of strings. 384 if feature not in gf: 385 gf[feature] = [text] 386 else: 387 gf[feature].append(text) 388 elif line[:5] == '#=GC ': 389 # Generic per-Column annotation, exactly 1 char per column 390 # Format: "#=GC <feature> <exactly 1 char per column>" 391 pass 392 elif line[:5] == '#=GS ': 393 # Generic per-Sequence annotation, free text 394 # Format: "#=GS <seqname> <feature> <free text>" 395 id, feature, text = line[5:].strip().split(None, 2) 396 # if id not in ids: 397 # ids.append(id) 398 if id not in gs: 399 gs[id] = {} 400 if feature not in gs[id]: 401 gs[id][feature] = [text] 402 else: 403 gs[id][feature].append(text) 404 elif line[:5] == "#=GR ": 405 # Generic per-Sequence AND per-Column markup 406 # Format: "#=GR <seqname> <feature> <exactly 1 char per column>" 407 id, feature, text = line[5:].strip().split(None, 2) 408 # if id not in ids: 409 # ids.append(id) 410 if id not in gr: 411 gr[id] = {} 412 if feature not in gr[id]: 413 gr[id][feature] = "" 414 gr[id][feature] += text.strip() # append to any previous entry 415 # TODO - Should we check the length matches the alignment length? 416 # For iterlaced sequences the GR data can be split over 417 # multiple lines 418 # Next line... 419 420 assert len(seqs) <= len(ids) 421 # assert len(gs) <= len(ids) 422 # assert len(gr) <= len(ids) 423 424 self.ids = ids.keys() 425 self.sequences = seqs 426 self.seq_annotation = gs 427 self.seq_col_annotation = gr 428 429 if ids and seqs: 430 431 if self.records_per_alignment is not None \ 432 and self.records_per_alignment != len(ids): 433 raise ValueError("Found %i records in this alignment, told to expect %i" 434 % (len(ids), self.records_per_alignment)) 435 436 alignment_length = len(list(seqs.values())[0]) 437 records = [] # Alignment obj will put them all in a list anyway 438 for id in ids: 439 seq = seqs[id] 440 if alignment_length != len(seq): 441 raise ValueError("Sequences have different lengths, or repeated identifier") 442 name, start, end = self._identifier_split(id) 443 record = SeqRecord(Seq(seq, self.alphabet), 444 id=id, name=name, description=id, 445 annotations={"accession": name}) 446 # Accession will be overridden by _populate_meta_data if an explicit 447 # accession is provided: 448 record.annotations["accession"] = name 449 450 if start is not None: 451 record.annotations["start"] = start 452 if end is not None: 453 record.annotations["end"] = end 454 455 self._populate_meta_data(id, record) 456 records.append(record) 457 alignment = MultipleSeqAlignment(records, self.alphabet) 458 459 # TODO - Introduce an annotated alignment class? 460 # For now, store the annotation a new private property: 461 alignment._annotations = gr 462 463 return alignment 464 else: 465 raise StopIteration
466
467 - def _identifier_split(self, identifier):
468 """Returns (name, start, end) string tuple from an identier.""" 469 if '/' in identifier: 470 name, start_end = identifier.rsplit("/", 1) 471 if start_end.count("-") == 1: 472 try: 473 start, end = start_end.split("-") 474 return name, int(start), int(end) 475 except ValueError: 476 # Non-integers after final '/' - fall through 477 pass 478 return identifier, None, None
479
480 - def _get_meta_data(self, identifier, meta_dict):
481 """Takes an itentifier and returns dict of all meta-data matching it. 482 483 For example, given "Q9PN73_CAMJE/149-220" will return all matches to 484 this or "Q9PN73_CAMJE" which the identifier without its /start-end 485 suffix. 486 487 In the example below, the suffix is required to match the AC, but must 488 be removed to match the OS and OC meta-data:: 489 490 # STOCKHOLM 1.0 491 #=GS Q9PN73_CAMJE/149-220 AC Q9PN73 492 ... 493 Q9PN73_CAMJE/149-220 NKA... 494 ... 495 #=GS Q9PN73_CAMJE OS Campylobacter jejuni 496 #=GS Q9PN73_CAMJE OC Bacteria 497 498 This function will return an empty dictionary if no data is found.""" 499 name, start, end = self._identifier_split(identifier) 500 if name == identifier: 501 identifier_keys = [identifier] 502 else: 503 identifier_keys = [identifier, name] 504 answer = {} 505 for identifier_key in identifier_keys: 506 try: 507 for feature_key in meta_dict[identifier_key]: 508 answer[feature_key] = meta_dict[identifier_key][feature_key] 509 except KeyError: 510 pass 511 return answer
512
513 - def _populate_meta_data(self, identifier, record):
514 """Adds meta-date to a SecRecord's annotations dictionary. 515 516 This function applies the PFAM conventions.""" 517 518 seq_data = self._get_meta_data(identifier, self.seq_annotation) 519 for feature in seq_data: 520 # Note this dictionary contains lists! 521 if feature == "AC": # ACcession number 522 assert len(seq_data[feature]) == 1 523 record.annotations["accession"] = seq_data[feature][0] 524 elif feature == "DE": # DEscription 525 record.description = "\n".join(seq_data[feature]) 526 elif feature == "DR": # Database Reference 527 # Should we try and parse the strings? 528 record.dbxrefs = seq_data[feature] 529 elif feature in self.pfam_gs_mapping: 530 record.annotations[self.pfam_gs_mapping[feature]] = ", ".join(seq_data[feature]) 531 else: 532 # Ignore it? 533 record.annotations["GS:" + feature] = ", ".join(seq_data[feature]) 534 535 # Now record the per-letter-annotations 536 seq_col_data = self._get_meta_data(identifier, self.seq_col_annotation) 537 for feature in seq_col_data: 538 # Note this dictionary contains strings! 539 if feature in self.pfam_gr_mapping: 540 record.letter_annotations[self.pfam_gr_mapping[feature]] = seq_col_data[feature] 541 else: 542 # Ignore it? 543 record.letter_annotations["GR:" + feature] = seq_col_data[feature]
544 545 546 if __name__ == "__main__": 547 from Bio._utils import run_doctest 548 run_doctest() 549