Package Bio :: Package AlignIO :: Module StockholmIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.AlignIO.StockholmIO

  1  # Copyright 2006-2016 by Peter Cock.  All rights reserved. 
  2  # Revisions copyright 2015 by Ben Woodcroft.  All rights reserved. 
  3  # 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """Bio.AlignIO support for "stockholm" format (used in the PFAM database). 
  9   
 10  You are expected to use this module via the Bio.AlignIO functions (or the 
 11  Bio.SeqIO functions if you want to work directly with the gapped sequences). 
 12   
 13  For example, consider a Stockholm alignment file containing the following:: 
 14   
 15      # STOCKHOLM 1.0 
 16      #=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>.. 
 17      AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU 
 18      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..-- 
 19      AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU 
 20      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>---- 
 21   
 22      #=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>............... 
 23      AP001509.1         CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 24      #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>--------------- 
 25      AE007476.1         UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 26      #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 27      // 
 28   
 29  This is a single multiple sequence alignment, so you would probably load this 
 30  using the Bio.AlignIO.read() function: 
 31   
 32      >>> from Bio import AlignIO 
 33      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm") 
 34      >>> print(align) 
 35      SingleLetterAlphabet() alignment with 2 rows and 104 columns 
 36      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 37      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 38      >>> for record in align: 
 39      ...     print("%s %i" % (record.id, len(record))) 
 40      AP001509.1 104 
 41      AE007476.1 104 
 42   
 43  This example file is clearly using RNA, so you might want the alignment object 
 44  (and the SeqRecord objects it holds) to reflect this, rather than simple using 
 45  the default single letter alphabet as shown above.  You can do this with an 
 46  optional argument to the Bio.AlignIO.read() function: 
 47   
 48      >>> from Bio import AlignIO 
 49      >>> from Bio.Alphabet import generic_rna 
 50      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm", 
 51      ...                      alphabet=generic_rna) 
 52      >>> print(align) 
 53      RNAAlphabet() alignment with 2 rows and 104 columns 
 54      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 55      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 56   
 57  In addition to the sequences themselves, this example alignment also includes 
 58  some GR lines for the secondary structure of the sequences.  These are 
 59  strings, with one character for each letter in the associated sequence: 
 60   
 61      >>> for record in align: 
 62      ...     print(record.id) 
 63      ...     print(record.seq) 
 64      ...     print(record.letter_annotations['secondary_structure']) 
 65      AP001509.1 
 66      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 67      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 68      AE007476.1 
 69      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 70      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 71   
 72  Any general annotation for each row is recorded in the SeqRecord's annotations 
 73  dictionary.  You can output this alignment in many different file formats 
 74  using Bio.AlignIO.write(), or the MultipleSeqAlignment object's format method: 
 75   
 76      >>> print(align.format("fasta")) 
 77      >AP001509.1 
 78      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-A 
 79      GGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 80      >AE007476.1 
 81      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAA 
 82      GGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 83      <BLANKLINE> 
 84   
 85  Most output formats won't be able to hold the annotation possible in a 
 86  Stockholm file: 
 87   
 88      >>> print(align.format("stockholm")) 
 89      # STOCKHOLM 1.0 
 90      #=GF SQ 2 
 91      AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 92      #=GS AP001509.1 AC AP001509.1 
 93      #=GS AP001509.1 DE AP001509.1 
 94      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 95      AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 96      #=GS AE007476.1 AC AE007476.1 
 97      #=GS AE007476.1 DE AE007476.1 
 98      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 99      // 
100      <BLANKLINE> 
101   
102  Note that when writing Stockholm files, AlignIO does not break long sequences 
103  up and interleave them (as in the input file shown above).  The standard 
104  allows this simpler layout, and it is more likely to be understood by other 
105  tools. 
106   
107  Finally, as an aside, it can sometimes be useful to use Bio.SeqIO.parse() to 
108  iterate over the alignment rows as SeqRecord objects - rather than working 
109  with Alignnment objects. Again, if you want to you can specify this is RNA: 
110   
111      >>> from Bio import SeqIO 
112      >>> from Bio.Alphabet import generic_rna 
113      >>> for record in SeqIO.parse("Stockholm/simple.sth", "stockholm", 
114      ...                           alphabet=generic_rna): 
115      ...     print(record.id) 
116      ...     print(record.seq) 
117      ...     print(record.letter_annotations['secondary_structure']) 
118      AP001509.1 
119      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
120      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
121      AE007476.1 
122      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
123      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
124   
125  Remember that if you slice a SeqRecord, the per-letter-annotations like the 
126  secondary structure string here, are also sliced: 
127   
128      >>> sub_record = record[10:20] 
129      >>> print(sub_record.seq) 
130      AUCGUUUUAC 
131      >>> print(sub_record.letter_annotations['secondary_structure']) 
132      -------<<< 
133  """ 
134  from __future__ import print_function 
135   
136  from collections import OrderedDict 
137   
138  from Bio.Seq import Seq 
139  from Bio.SeqRecord import SeqRecord 
140  from Bio.Align import MultipleSeqAlignment 
141  from .Interfaces import AlignmentIterator, SequentialAlignmentWriter 
142   
143   
144 -class StockholmWriter(SequentialAlignmentWriter):
145 """Stockholm/PFAM alignment writer.""" 146 147 # These dictionaries should be kept in sync with those 148 # defined in the StockholmIterator class. 149 pfam_gr_mapping = {"secondary_structure": "SS", 150 "surface_accessibility": "SA", 151 "transmembrane": "TM", 152 "posterior_probability": "PP", 153 "ligand_binding": "LI", 154 "active_site": "AS", 155 "intron": "IN"} 156 # Following dictionary deliberately does not cover AC, DE or DR 157 pfam_gs_mapping = {"organism": "OS", 158 "organism_classification": "OC", 159 "look": "LO"} 160
161 - def write_alignment(self, alignment):
162 """Use this to write (another) single alignment to an open file. 163 164 Note that sequences and their annotation are recorded 165 together (rather than having a block of annotation followed 166 by a block of aligned sequences). 167 """ 168 count = len(alignment) 169 170 self._length_of_sequences = alignment.get_alignment_length() 171 self._ids_written = [] 172 173 # NOTE - For now, the alignment object does not hold any per column 174 # or per alignment annotation - only per sequence. 175 176 if count == 0: 177 raise ValueError("Must have at least one sequence") 178 if self._length_of_sequences == 0: 179 raise ValueError("Non-empty sequences are required") 180 181 self.handle.write("# STOCKHOLM 1.0\n") 182 self.handle.write("#=GF SQ %i\n" % count) 183 for record in alignment: 184 self._write_record(record) 185 self.handle.write("//\n")
186
187 - def _write_record(self, record):
188 """Write a single SeqRecord to the file""" 189 if self._length_of_sequences != len(record.seq): 190 raise ValueError("Sequences must all be the same length") 191 192 # For the case for stockholm to stockholm, try and use record.name 193 seq_name = record.id 194 if record.name is not None: 195 if "accession" in record.annotations: 196 if record.id == record.annotations["accession"]: 197 seq_name = record.name 198 199 # In the Stockholm file format, spaces are not allowed in the id 200 seq_name = seq_name.replace(" ", "_") 201 202 if "start" in record.annotations \ 203 and "end" in record.annotations: 204 suffix = "/%s-%s" % (str(record.annotations["start"]), 205 str(record.annotations["end"])) 206 if seq_name[-len(suffix):] != suffix: 207 seq_name = "%s/%s-%s" % ( 208 seq_name, 209 str(record.annotations["start"]), 210 str(record.annotations["end"])) 211 212 if seq_name in self._ids_written: 213 raise ValueError("Duplicate record identifier: %s" % seq_name) 214 self._ids_written.append(seq_name) 215 self.handle.write("%s %s\n" % (seq_name, str(record.seq))) 216 217 # The recommended placement for GS lines (per sequence annotation) 218 # is above the alignment (as a header block) or just below the 219 # corresponding sequence. 220 # 221 # The recommended placement for GR lines (per sequence per column 222 # annotation such as secondary structure) is just below the 223 # corresponding sequence. 224 # 225 # We put both just below the corresponding sequence as this allows 226 # us to write the file using a single pass through the records. 227 228 # AC = Accession 229 if "accession" in record.annotations: 230 self.handle.write("#=GS %s AC %s\n" % ( 231 seq_name, self.clean(record.annotations["accession"]))) 232 elif record.id: 233 self.handle.write("#=GS %s AC %s\n" % ( 234 seq_name, self.clean(record.id))) 235 236 # DE = description 237 if record.description: 238 self.handle.write("#=GS %s DE %s\n" % ( 239 seq_name, self.clean(record.description))) 240 241 # DE = database links 242 for xref in record.dbxrefs: 243 self.handle.write("#=GS %s DR %s\n" % ( 244 seq_name, self.clean(xref))) 245 246 # GS = other per sequence annotation 247 for key, value in record.annotations.items(): 248 if key in self.pfam_gs_mapping: 249 data = self.clean(str(value)) 250 if data: 251 self.handle.write("#=GS %s %s %s\n" 252 % (seq_name, 253 self.clean(self.pfam_gs_mapping[key]), 254 data)) 255 else: 256 # It doesn't follow the PFAM standards, but should we record 257 # this data anyway? 258 pass 259 260 # GR = per row per column sequence annotation 261 for key, value in record.letter_annotations.items(): 262 if key in self.pfam_gr_mapping and len(str(value)) == len(record.seq): 263 data = self.clean(str(value)) 264 if data: 265 self.handle.write("#=GR %s %s %s\n" 266 % (seq_name, 267 self.clean(self.pfam_gr_mapping[key]), 268 data)) 269 else: 270 # It doesn't follow the PFAM standards, but should we record 271 # this data anyway? 272 pass
273 274
275 -class StockholmIterator(AlignmentIterator):
276 """Loads a Stockholm file from PFAM into MultipleSeqAlignment objects. 277 278 The file may contain multiple concatenated alignments, which are loaded 279 and returned incrementally. 280 281 This parser will detect if the Stockholm file follows the PFAM 282 conventions for sequence specific meta-data (lines starting #=GS 283 and #=GR) and populates the SeqRecord fields accordingly. 284 285 Any annotation which does not follow the PFAM conventions is currently 286 ignored. 287 288 If an accession is provided for an entry in the meta data, IT WILL NOT 289 be used as the record.id (it will be recorded in the record's 290 annotations). This is because some files have (sub) sequences from 291 different parts of the same accession (differentiated by different 292 start-end positions). 293 294 Wrap-around alignments are not supported - each sequences must be on 295 a single line. However, interlaced sequences should work. 296 297 For more information on the file format, please see: 298 http://sonnhammer.sbc.su.se/Stockholm.html 299 https://en.wikipedia.org/wiki/Stockholm_format 300 http://bioperl.org/formats/alignment_formats/Stockholm_multiple_alignment_format.html 301 302 For consistency with BioPerl and EMBOSS we call this the "stockholm" 303 format. 304 """ 305 306 # These dictionaries should be kept in sync with those 307 # defined in the PfamStockholmWriter class. 308 pfam_gr_mapping = {"SS": "secondary_structure", 309 "SA": "surface_accessibility", 310 "TM": "transmembrane", 311 "PP": "posterior_probability", 312 "LI": "ligand_binding", 313 "AS": "active_site", 314 "IN": "intron"} 315 # Following dictionary deliberately does not cover AC, DE or DR 316 pfam_gs_mapping = {"OS": "organism", 317 "OC": "organism_classification", 318 "LO": "look"} 319 320 _header = None # for caching lines between __next__ calls 321
322 - def __next__(self):
323 handle = self.handle 324 325 if self._header is None: 326 line = handle.readline() 327 else: 328 # Header we saved from when we were parsing 329 # the previous alignment. 330 line = self._header 331 self._header = None 332 333 if not line: 334 # Empty file - just give up. 335 raise StopIteration 336 if line.strip() != '# STOCKHOLM 1.0': 337 raise ValueError("Did not find STOCKHOLM header") 338 339 # Note: If this file follows the PFAM conventions, there should be 340 # a line containing the number of sequences, e.g. "#=GF SQ 67" 341 # We do not check for this - perhaps we should, and verify that 342 # if present it agrees with our parsing. 343 344 seqs = {} 345 ids = OrderedDict() # Really only need an OrderedSet, but python lacks this 346 gs = {} 347 gr = {} 348 gf = {} 349 passed_end_alignment = False 350 while True: 351 line = handle.readline() 352 if not line: 353 break # end of file 354 line = line.strip() # remove trailing \n 355 if line == '# STOCKHOLM 1.0': 356 self._header = line 357 break 358 elif line == "//": 359 # The "//" line indicates the end of the alignment. 360 # There may still be more meta-data 361 passed_end_alignment = True 362 elif line == "": 363 # blank line, ignore 364 pass 365 elif line[0] != "#": 366 # Sequence 367 # Format: "<seqname> <sequence>" 368 assert not passed_end_alignment 369 parts = [x.strip() for x in line.split(" ", 1)] 370 if len(parts) != 2: 371 # This might be someone attempting to store a zero length sequence? 372 raise ValueError( 373 "Could not split line into identifier " 374 "and sequence:\n" + line) 375 seq_id, seq = parts 376 if seq_id not in ids: 377 ids[seq_id] = True 378 seqs.setdefault(seq_id, '') 379 seqs[seq_id] += seq.replace(".", "-") 380 elif len(line) >= 5: 381 # Comment line or meta-data 382 if line[:5] == "#=GF ": 383 # Generic per-File annotation, free text 384 # Format: #=GF <feature> <free text> 385 feature, text = line[5:].strip().split(None, 1) 386 # Each feature key could be used more than once, 387 # so store the entries as a list of strings. 388 if feature not in gf: 389 gf[feature] = [text] 390 else: 391 gf[feature].append(text) 392 elif line[:5] == '#=GC ': 393 # Generic per-Column annotation, exactly 1 char per column 394 # Format: "#=GC <feature> <exactly 1 char per column>" 395 pass 396 elif line[:5] == '#=GS ': 397 # Generic per-Sequence annotation, free text 398 # Format: "#=GS <seqname> <feature> <free text>" 399 seq_id, feature, text = line[5:].strip().split(None, 2) 400 # if seq_id not in ids: 401 # ids.append(seq_id) 402 if seq_id not in gs: 403 gs[seq_id] = {} 404 if feature not in gs[seq_id]: 405 gs[seq_id][feature] = [text] 406 else: 407 gs[seq_id][feature].append(text) 408 elif line[:5] == "#=GR ": 409 # Generic per-Sequence AND per-Column markup 410 # Format: "#=GR <seqname> <feature> <exactly 1 char per column>" 411 seq_id, feature, text = line[5:].strip().split(None, 2) 412 # if seq_id not in ids: 413 # ids.append(seq_id) 414 if seq_id not in gr: 415 gr[seq_id] = {} 416 if feature not in gr[seq_id]: 417 gr[seq_id][feature] = "" 418 gr[seq_id][feature] += text.strip() # append to any previous entry 419 # TODO - Should we check the length matches the alignment length? 420 # For iterlaced sequences the GR data can be split over 421 # multiple lines 422 # Next line... 423 424 assert len(seqs) <= len(ids) 425 # assert len(gs) <= len(ids) 426 # assert len(gr) <= len(ids) 427 428 self.ids = ids.keys() 429 self.sequences = seqs 430 self.seq_annotation = gs 431 self.seq_col_annotation = gr 432 433 if ids and seqs: 434 435 if self.records_per_alignment is not None \ 436 and self.records_per_alignment != len(ids): 437 raise ValueError("Found %i records in this alignment, told to expect %i" 438 % (len(ids), self.records_per_alignment)) 439 440 alignment_length = len(list(seqs.values())[0]) 441 records = [] # Alignment obj will put them all in a list anyway 442 for seq_id in ids: 443 seq = seqs[seq_id] 444 if alignment_length != len(seq): 445 raise ValueError("Sequences have different lengths, or repeated identifier") 446 name, start, end = self._identifier_split(seq_id) 447 record = SeqRecord(Seq(seq, self.alphabet), 448 id=seq_id, name=name, description=seq_id, 449 annotations={"accession": name}) 450 # Accession will be overridden by _populate_meta_data if an explicit 451 # accession is provided: 452 record.annotations["accession"] = name 453 454 if start is not None: 455 record.annotations["start"] = start 456 if end is not None: 457 record.annotations["end"] = end 458 459 self._populate_meta_data(seq_id, record) 460 records.append(record) 461 alignment = MultipleSeqAlignment(records, self.alphabet) 462 463 # TODO - Introduce an annotated alignment class? 464 # For now, store the annotation a new private property: 465 alignment._annotations = gr 466 467 return alignment 468 else: 469 raise StopIteration
470
471 - def _identifier_split(self, identifier):
472 """Returns (name, start, end) string tuple from an identier.""" 473 if '/' in identifier: 474 name, start_end = identifier.rsplit("/", 1) 475 if start_end.count("-") == 1: 476 try: 477 start, end = start_end.split("-") 478 return name, int(start), int(end) 479 except ValueError: 480 # Non-integers after final '/' - fall through 481 pass 482 return identifier, None, None
483
484 - def _get_meta_data(self, identifier, meta_dict):
485 """Takes an itentifier and returns dict of all meta-data matching it. 486 487 For example, given "Q9PN73_CAMJE/149-220" will return all matches to 488 this or "Q9PN73_CAMJE" which the identifier without its /start-end 489 suffix. 490 491 In the example below, the suffix is required to match the AC, but must 492 be removed to match the OS and OC meta-data:: 493 494 # STOCKHOLM 1.0 495 #=GS Q9PN73_CAMJE/149-220 AC Q9PN73 496 ... 497 Q9PN73_CAMJE/149-220 NKA... 498 ... 499 #=GS Q9PN73_CAMJE OS Campylobacter jejuni 500 #=GS Q9PN73_CAMJE OC Bacteria 501 502 This function will return an empty dictionary if no data is found.""" 503 name, start, end = self._identifier_split(identifier) 504 if name == identifier: 505 identifier_keys = [identifier] 506 else: 507 identifier_keys = [identifier, name] 508 answer = {} 509 for identifier_key in identifier_keys: 510 try: 511 for feature_key in meta_dict[identifier_key]: 512 answer[feature_key] = meta_dict[identifier_key][feature_key] 513 except KeyError: 514 pass 515 return answer
516
517 - def _populate_meta_data(self, identifier, record):
518 """Adds meta-date to a SecRecord's annotations dictionary. 519 520 This function applies the PFAM conventions.""" 521 seq_data = self._get_meta_data(identifier, self.seq_annotation) 522 for feature in seq_data: 523 # Note this dictionary contains lists! 524 if feature == "AC": # ACcession number 525 assert len(seq_data[feature]) == 1 526 record.annotations["accession"] = seq_data[feature][0] 527 elif feature == "DE": # DEscription 528 record.description = "\n".join(seq_data[feature]) 529 elif feature == "DR": # Database Reference 530 # Should we try and parse the strings? 531 record.dbxrefs = seq_data[feature] 532 elif feature in self.pfam_gs_mapping: 533 record.annotations[self.pfam_gs_mapping[feature]] = ", ".join(seq_data[feature]) 534 else: 535 # Ignore it? 536 record.annotations["GS:" + feature] = ", ".join(seq_data[feature]) 537 538 # Now record the per-letter-annotations 539 seq_col_data = self._get_meta_data(identifier, self.seq_col_annotation) 540 for feature in seq_col_data: 541 # Note this dictionary contains strings! 542 if feature in self.pfam_gr_mapping: 543 record.letter_annotations[self.pfam_gr_mapping[feature]] = seq_col_data[feature] 544 else: 545 # Ignore it? 546 record.letter_annotations["GR:" + feature] = seq_col_data[feature]
547 548 549 if __name__ == "__main__": 550 from Bio._utils import run_doctest 551 run_doctest() 552