Package Bio :: Package AlignIO :: Module StockholmIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.AlignIO.StockholmIO

  1  # Copyright 2006-2016 by Peter Cock.  All rights reserved. 
  2  # Revisions copyright 2015 by Ben Woodcroft.  All rights reserved. 
  3  # 
  4  # This file is part of the Biopython distribution and governed by your 
  5  # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". 
  6  # Please see the LICENSE file that should have been included as part of this 
  7  # package. 
  8   
  9  """Bio.AlignIO support for "stockholm" format (used in the PFAM database). 
 10   
 11  You are expected to use this module via the Bio.AlignIO functions (or the 
 12  Bio.SeqIO functions if you want to work directly with the gapped sequences). 
 13   
 14  For example, consider a Stockholm alignment file containing the following:: 
 15   
 16      # STOCKHOLM 1.0 
 17      #=GC SS_cons       .................<<<<<<<<...<<<<<<<........>>>>>>>.. 
 18      AP001509.1         UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGU 
 19      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..-- 
 20      AE007476.1         AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGU 
 21      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>---- 
 22   
 23      #=GC SS_cons       ......<<<<<<<.......>>>>>>>..>>>>>>>>............... 
 24      AP001509.1         CUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 25      #=GR AP001509.1 SS -------<<<<<--------->>>>>--->>>>>>>>--------------- 
 26      AE007476.1         UUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 27      #=GR AE007476.1 SS ------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 28      // 
 29   
 30  This is a single multiple sequence alignment, so you would probably load this 
 31  using the Bio.AlignIO.read() function: 
 32   
 33      >>> from Bio import AlignIO 
 34      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm") 
 35      >>> print(align) 
 36      SingleLetterAlphabet() alignment with 2 rows and 104 columns 
 37      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 38      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 39      >>> for record in align: 
 40      ...     print("%s %i" % (record.id, len(record))) 
 41      AP001509.1 104 
 42      AE007476.1 104 
 43   
 44  This example file is clearly using RNA, so you might want the alignment object 
 45  (and the SeqRecord objects it holds) to reflect this, rather than simple using 
 46  the default single letter alphabet as shown above.  You can do this with an 
 47  optional argument to the Bio.AlignIO.read() function: 
 48   
 49      >>> from Bio import AlignIO 
 50      >>> from Bio.Alphabet import generic_rna 
 51      >>> align = AlignIO.read("Stockholm/simple.sth", "stockholm", 
 52      ...                      alphabet=generic_rna) 
 53      >>> print(align) 
 54      RNAAlphabet() alignment with 2 rows and 104 columns 
 55      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-G...UGU AP001509.1 
 56      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-C...GAU AE007476.1 
 57   
 58  In addition to the sequences themselves, this example alignment also includes 
 59  some GR lines for the secondary structure of the sequences.  These are 
 60  strings, with one character for each letter in the associated sequence: 
 61   
 62      >>> for record in align: 
 63      ...     print(record.id) 
 64      ...     print(record.seq) 
 65      ...     print(record.letter_annotations['secondary_structure']) 
 66      AP001509.1 
 67      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 68      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
 69      AE007476.1 
 70      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 71      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
 72   
 73  Any general annotation for each row is recorded in the SeqRecord's annotations 
 74  dictionary.  Any per-column annotation for the entire alignment in in the 
 75  alignment's column annotations dictionary, such as the secondary structure 
 76  consensus in this example: 
 77   
 78      >>> sorted(align.column_annotations.keys()) 
 79      ['secondary_structure'] 
 80      >>> align.column_annotations["secondary_structure"] 
 81      '.................<<<<<<<<...<<<<<<<........>>>>>>>........<<<<<<<.......>>>>>>>..>>>>>>>>...............' 
 82   
 83  You can output this alignment in many different file formats 
 84  using Bio.AlignIO.write(), or the MultipleSeqAlignment object's format method: 
 85   
 86      >>> print(align.format("fasta")) 
 87      >AP001509.1 
 88      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-A 
 89      GGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
 90      >AE007476.1 
 91      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAA 
 92      GGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
 93      <BLANKLINE> 
 94   
 95  Most output formats won't be able to hold the annotation possible in a 
 96  Stockholm file: 
 97   
 98      >>> print(align.format("stockholm")) 
 99      # STOCKHOLM 1.0 
100      #=GF SQ 2 
101      AP001509.1 UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
102      #=GS AP001509.1 AC AP001509.1 
103      #=GS AP001509.1 DE AP001509.1 
104      #=GR AP001509.1 SS -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
105      AE007476.1 AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
106      #=GS AE007476.1 AC AE007476.1 
107      #=GS AE007476.1 DE AE007476.1 
108      #=GR AE007476.1 SS -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
109      #=GC SS_cons .................<<<<<<<<...<<<<<<<........>>>>>>>........<<<<<<<.......>>>>>>>..>>>>>>>>............... 
110      // 
111      <BLANKLINE> 
112   
113  Note that when writing Stockholm files, AlignIO does not break long sequences 
114  up and interleave them (as in the input file shown above).  The standard 
115  allows this simpler layout, and it is more likely to be understood by other 
116  tools. 
117   
118  Finally, as an aside, it can sometimes be useful to use Bio.SeqIO.parse() to 
119  iterate over the alignment rows as SeqRecord objects - rather than working 
120  with Alignnment objects. Again, if you want to you can specify this is RNA: 
121   
122      >>> from Bio import SeqIO 
123      >>> from Bio.Alphabet import generic_rna 
124      >>> for record in SeqIO.parse("Stockholm/simple.sth", "stockholm", 
125      ...                           alphabet=generic_rna): 
126      ...     print(record.id) 
127      ...     print(record.seq) 
128      ...     print(record.letter_annotations['secondary_structure']) 
129      AP001509.1 
130      UUAAUCGAGCUCAACACUCUUCGUAUAUCCUC-UCAAUAUGG-GAUGAGGGUCUCUAC-AGGUA-CCGUAAA-UACCUAGCUACGAAAAGAAUGCAGUUAAUGU 
131      -----------------<<<<<<<<---..<<-<<-------->>->>..---------<<<<<--------->>>>>--->>>>>>>>--------------- 
132      AE007476.1 
133      AAAAUUGAAUAUCGUUUUACUUGUUUAU-GUCGUGAAU-UGG-CACGA-CGUUUCUACAAGGUG-CCGG-AA-CACCUAACAAUAAGUAAGUCAGCAGUGAGAU 
134      -----------------<<<<<<<<-----<<.<<-------->>.>>----------.<<<<<--------->>>>>.-->>>>>>>>--------------- 
135   
136  Remember that if you slice a SeqRecord, the per-letter-annotations like the 
137  secondary structure string here, are also sliced: 
138   
139      >>> sub_record = record[10:20] 
140      >>> print(sub_record.seq) 
141      AUCGUUUUAC 
142      >>> print(sub_record.letter_annotations['secondary_structure']) 
143      -------<<< 
144   
145  Likewise with the alignment object, as long as you are not dropping any rows, 
146  slicing specific columns of an alignment will slice any per-column-annotations: 
147   
148      >>> align.column_annotations["secondary_structure"] 
149      '.................<<<<<<<<...<<<<<<<........>>>>>>>........<<<<<<<.......>>>>>>>..>>>>>>>>...............' 
150      >>> part_align = align[:,10:20] 
151      >>> part_align.column_annotations["secondary_structure"] 
152      '.......<<<' 
153   
154  You can also see this in the Stockholm output of this partial-alignment: 
155   
156      >>> print(part_align.format("stockholm")) 
157      # STOCKHOLM 1.0 
158      #=GF SQ 2 
159      AP001509.1 UCAACACUCU 
160      #=GS AP001509.1 AC AP001509.1 
161      #=GS AP001509.1 DE AP001509.1 
162      #=GR AP001509.1 SS -------<<< 
163      AE007476.1 AUCGUUUUAC 
164      #=GS AE007476.1 AC AE007476.1 
165      #=GS AE007476.1 DE AE007476.1 
166      #=GR AE007476.1 SS -------<<< 
167      #=GC SS_cons .......<<< 
168      // 
169      <BLANKLINE> 
170   
171  """ 
172  from __future__ import print_function 
173   
174  from collections import OrderedDict 
175   
176  from Bio.Seq import Seq 
177  from Bio.SeqRecord import SeqRecord 
178  from Bio.Align import MultipleSeqAlignment 
179  from .Interfaces import AlignmentIterator, SequentialAlignmentWriter 
180   
181   
182 -class StockholmWriter(SequentialAlignmentWriter):
183 """Stockholm/PFAM alignment writer.""" 184 185 # These dictionaries should be kept in sync with those 186 # defined in the StockholmIterator class. 187 pfam_gr_mapping = {"secondary_structure": "SS", 188 "surface_accessibility": "SA", 189 "transmembrane": "TM", 190 "posterior_probability": "PP", 191 "ligand_binding": "LI", 192 "active_site": "AS", 193 "intron": "IN"} 194 # These GC mappings are in addition to *_cons in GR mapping: 195 pfam_gc_mapping = {"reference_annotation": "RF", 196 "model_mask": "MM"} 197 # Following dictionary deliberately does not cover AC, DE or DR 198 pfam_gs_mapping = {"organism": "OS", 199 "organism_classification": "OC", 200 "look": "LO"} 201
202 - def write_alignment(self, alignment):
203 """Use this to write (another) single alignment to an open file. 204 205 Note that sequences and their annotation are recorded 206 together (rather than having a block of annotation followed 207 by a block of aligned sequences). 208 """ 209 count = len(alignment) 210 211 self._length_of_sequences = alignment.get_alignment_length() 212 self._ids_written = [] 213 214 if count == 0: 215 raise ValueError("Must have at least one sequence") 216 if self._length_of_sequences == 0: 217 raise ValueError("Non-empty sequences are required") 218 219 self.handle.write("# STOCKHOLM 1.0\n") 220 self.handle.write("#=GF SQ %i\n" % count) 221 for record in alignment: 222 self._write_record(record) 223 # This shouldn't be None... but just in case, 224 if alignment.column_annotations: 225 for k, v in sorted(alignment.column_annotations.items()): 226 if k in self.pfam_gc_mapping: 227 self.handle.write("#=GC %s %s\n" % (self.pfam_gc_mapping[k], v)) 228 elif k in self.pfam_gr_mapping: 229 self.handle.write("#=GC %s %s\n" % (self.pfam_gr_mapping[k] + "_cons", v)) 230 else: 231 # It doesn't follow the PFAM standards, but should we record 232 # this data anyway? 233 pass 234 self.handle.write("//\n")
235
236 - def _write_record(self, record):
237 """Write a single SeqRecord to the file (PRIVATE).""" 238 if self._length_of_sequences != len(record.seq): 239 raise ValueError("Sequences must all be the same length") 240 241 # For the case for stockholm to stockholm, try and use record.name 242 seq_name = record.id 243 if record.name is not None: 244 if "accession" in record.annotations: 245 if record.id == record.annotations["accession"]: 246 seq_name = record.name 247 248 # In the Stockholm file format, spaces are not allowed in the id 249 seq_name = seq_name.replace(" ", "_") 250 251 if "start" in record.annotations \ 252 and "end" in record.annotations: 253 suffix = "/%s-%s" % (str(record.annotations["start"]), 254 str(record.annotations["end"])) 255 if seq_name[-len(suffix):] != suffix: 256 seq_name = "%s/%s-%s" % ( 257 seq_name, 258 str(record.annotations["start"]), 259 str(record.annotations["end"])) 260 261 if seq_name in self._ids_written: 262 raise ValueError("Duplicate record identifier: %s" % seq_name) 263 self._ids_written.append(seq_name) 264 self.handle.write("%s %s\n" % (seq_name, str(record.seq))) 265 266 # The recommended placement for GS lines (per sequence annotation) 267 # is above the alignment (as a header block) or just below the 268 # corresponding sequence. 269 # 270 # The recommended placement for GR lines (per sequence per column 271 # annotation such as secondary structure) is just below the 272 # corresponding sequence. 273 # 274 # We put both just below the corresponding sequence as this allows 275 # us to write the file using a single pass through the records. 276 277 # AC = Accession 278 if "accession" in record.annotations: 279 self.handle.write("#=GS %s AC %s\n" % ( 280 seq_name, self.clean(record.annotations["accession"]))) 281 elif record.id: 282 self.handle.write("#=GS %s AC %s\n" % ( 283 seq_name, self.clean(record.id))) 284 285 # DE = description 286 if record.description: 287 self.handle.write("#=GS %s DE %s\n" % ( 288 seq_name, self.clean(record.description))) 289 290 # DE = database links 291 for xref in record.dbxrefs: 292 self.handle.write("#=GS %s DR %s\n" % ( 293 seq_name, self.clean(xref))) 294 295 # GS = other per sequence annotation 296 for key, value in record.annotations.items(): 297 if key in self.pfam_gs_mapping: 298 data = self.clean(str(value)) 299 if data: 300 self.handle.write("#=GS %s %s %s\n" 301 % (seq_name, 302 self.clean(self.pfam_gs_mapping[key]), 303 data)) 304 else: 305 # It doesn't follow the PFAM standards, but should we record 306 # this data anyway? 307 pass 308 309 # GR = per row per column sequence annotation 310 for key, value in record.letter_annotations.items(): 311 if key in self.pfam_gr_mapping and len(str(value)) == len(record.seq): 312 data = self.clean(str(value)) 313 if data: 314 self.handle.write("#=GR %s %s %s\n" 315 % (seq_name, 316 self.clean(self.pfam_gr_mapping[key]), 317 data)) 318 else: 319 # It doesn't follow the PFAM standards, but should we record 320 # this data anyway? 321 pass
322 323
324 -class StockholmIterator(AlignmentIterator):
325 """Loads a Stockholm file from PFAM into MultipleSeqAlignment objects. 326 327 The file may contain multiple concatenated alignments, which are loaded 328 and returned incrementally. 329 330 This parser will detect if the Stockholm file follows the PFAM 331 conventions for sequence specific meta-data (lines starting #=GS 332 and #=GR) and populates the SeqRecord fields accordingly. 333 334 Any annotation which does not follow the PFAM conventions is currently 335 ignored. 336 337 If an accession is provided for an entry in the meta data, IT WILL NOT 338 be used as the record.id (it will be recorded in the record's 339 annotations). This is because some files have (sub) sequences from 340 different parts of the same accession (differentiated by different 341 start-end positions). 342 343 Wrap-around alignments are not supported - each sequences must be on 344 a single line. However, interlaced sequences should work. 345 346 For more information on the file format, please see: 347 http://sonnhammer.sbc.su.se/Stockholm.html 348 https://en.wikipedia.org/wiki/Stockholm_format 349 http://bioperl.org/formats/alignment_formats/Stockholm_multiple_alignment_format.html 350 351 For consistency with BioPerl and EMBOSS we call this the "stockholm" 352 format. 353 """ 354 355 # These dictionaries should be kept in sync with those 356 # defined in the PfamStockholmWriter class. 357 pfam_gr_mapping = {"SS": "secondary_structure", 358 "SA": "surface_accessibility", 359 "TM": "transmembrane", 360 "PP": "posterior_probability", 361 "LI": "ligand_binding", 362 "AS": "active_site", 363 "IN": "intron"} 364 # These GC mappings are in addition to *_cons in GR mapping: 365 pfam_gc_mapping = {"RF": "reference_annotation", 366 "MM": "model_mask"} 367 # Following dictionary deliberately does not cover AC, DE or DR 368 pfam_gs_mapping = {"OS": "organism", 369 "OC": "organism_classification", 370 "LO": "look"} 371 372 _header = None # for caching lines between __next__ calls 373
374 - def __next__(self):
375 """Parse the next alignment from the handle.""" 376 handle = self.handle 377 378 if self._header is None: 379 line = handle.readline() 380 else: 381 # Header we saved from when we were parsing 382 # the previous alignment. 383 line = self._header 384 self._header = None 385 386 if not line: 387 # Empty file - just give up. 388 raise StopIteration 389 if line.strip() != '# STOCKHOLM 1.0': 390 raise ValueError("Did not find STOCKHOLM header") 391 392 # Note: If this file follows the PFAM conventions, there should be 393 # a line containing the number of sequences, e.g. "#=GF SQ 67" 394 # We do not check for this - perhaps we should, and verify that 395 # if present it agrees with our parsing. 396 397 seqs = {} 398 ids = OrderedDict() # Really only need an OrderedSet, but python lacks this 399 gs = {} 400 gr = {} 401 gf = {} 402 gc = {} 403 passed_end_alignment = False 404 while True: 405 line = handle.readline() 406 if not line: 407 break # end of file 408 line = line.strip() # remove trailing \n 409 if line == '# STOCKHOLM 1.0': 410 self._header = line 411 break 412 elif line == "//": 413 # The "//" line indicates the end of the alignment. 414 # There may still be more meta-data 415 passed_end_alignment = True 416 elif line == "": 417 # blank line, ignore 418 pass 419 elif line[0] != "#": 420 # Sequence 421 # Format: "<seqname> <sequence>" 422 assert not passed_end_alignment 423 parts = [x.strip() for x in line.split(" ", 1)] 424 if len(parts) != 2: 425 # This might be someone attempting to store a zero length sequence? 426 raise ValueError( 427 "Could not split line into identifier " 428 "and sequence:\n" + line) 429 seq_id, seq = parts 430 if seq_id not in ids: 431 ids[seq_id] = True 432 seqs.setdefault(seq_id, '') 433 seqs[seq_id] += seq.replace(".", "-") 434 elif len(line) >= 5: 435 # Comment line or meta-data 436 if line[:5] == "#=GF ": 437 # Generic per-File annotation, free text 438 # Format: #=GF <feature> <free text> 439 feature, text = line[5:].strip().split(None, 1) 440 # Each feature key could be used more than once, 441 # so store the entries as a list of strings. 442 if feature not in gf: 443 gf[feature] = [text] 444 else: 445 gf[feature].append(text) 446 elif line[:5] == '#=GC ': 447 # Generic per-Column annotation, exactly 1 char per column 448 # Format: "#=GC <feature> <exactly 1 char per column>" 449 feature, text = line[5:].strip().split(None, 2) 450 if feature not in gc: 451 gc[feature] = "" 452 gc[feature] += text.strip() # append to any previous entry 453 # Might be interleaved blocks, so can't check length yet 454 elif line[:5] == '#=GS ': 455 # Generic per-Sequence annotation, free text 456 # Format: "#=GS <seqname> <feature> <free text>" 457 seq_id, feature, text = line[5:].strip().split(None, 2) 458 # if seq_id not in ids: 459 # ids.append(seq_id) 460 if seq_id not in gs: 461 gs[seq_id] = {} 462 if feature not in gs[seq_id]: 463 gs[seq_id][feature] = [text] 464 else: 465 gs[seq_id][feature].append(text) 466 elif line[:5] == "#=GR ": 467 # Generic per-Sequence AND per-Column markup 468 # Format: "#=GR <seqname> <feature> <exactly 1 char per column>" 469 seq_id, feature, text = line[5:].strip().split(None, 2) 470 # if seq_id not in ids: 471 # ids.append(seq_id) 472 if seq_id not in gr: 473 gr[seq_id] = {} 474 if feature not in gr[seq_id]: 475 gr[seq_id][feature] = "" 476 gr[seq_id][feature] += text.strip() # append to any previous entry 477 # Might be interleaved blocks, so can't check length yet 478 # Next line... 479 480 assert len(seqs) <= len(ids) 481 # assert len(gs) <= len(ids) 482 # assert len(gr) <= len(ids) 483 484 self.ids = ids.keys() 485 self.sequences = seqs 486 self.seq_annotation = gs 487 self.seq_col_annotation = gr 488 489 if ids and seqs: 490 491 if self.records_per_alignment is not None \ 492 and self.records_per_alignment != len(ids): 493 raise ValueError("Found %i records in this alignment, told to expect %i" 494 % (len(ids), self.records_per_alignment)) 495 496 alignment_length = len(list(seqs.values())[0]) 497 records = [] # Alignment obj will put them all in a list anyway 498 for seq_id in ids: 499 seq = seqs[seq_id] 500 if alignment_length != len(seq): 501 raise ValueError("Sequences have different lengths, or repeated identifier") 502 name, start, end = self._identifier_split(seq_id) 503 record = SeqRecord(Seq(seq, self.alphabet), 504 id=seq_id, name=name, description=seq_id, 505 annotations={"accession": name}) 506 # Accession will be overridden by _populate_meta_data if an explicit 507 # accession is provided: 508 record.annotations["accession"] = name 509 510 if start is not None: 511 record.annotations["start"] = start 512 if end is not None: 513 record.annotations["end"] = end 514 515 self._populate_meta_data(seq_id, record) 516 records.append(record) 517 for k, v in gc.items(): 518 if len(v) != alignment_length: 519 raise ValueError("%s length %i, expected %i" 520 % (k, len(v), alignment_length)) 521 alignment = MultipleSeqAlignment(records, self.alphabet) 522 523 for k, v in sorted(gc.items()): 524 if k in self.pfam_gc_mapping: 525 alignment.column_annotations[self.pfam_gc_mapping[k]] = v 526 elif k.endswith("_cons") and k[:-5] in self.pfam_gr_mapping: 527 alignment.column_annotations[self.pfam_gr_mapping[k[:-5]]] = v 528 else: 529 # Ignore it? 530 alignment.column_annotations["GC:" + k] = v 531 532 # TODO - Introduce an annotated alignment class? 533 # For now, store the annotation a new private property: 534 alignment._annotations = gr 535 536 return alignment 537 else: 538 raise StopIteration
539
540 - def _identifier_split(self, identifier):
541 """Return (name, start, end) string tuple from an identier (PRIVATE).""" 542 if '/' in identifier: 543 name, start_end = identifier.rsplit("/", 1) 544 if start_end.count("-") == 1: 545 try: 546 start, end = start_end.split("-") 547 return name, int(start), int(end) 548 except ValueError: 549 # Non-integers after final '/' - fall through 550 pass 551 return identifier, None, None
552
553 - def _get_meta_data(self, identifier, meta_dict):
554 """Take an itentifier and returns dict of all meta-data matching it (PRIVATE). 555 556 For example, given "Q9PN73_CAMJE/149-220" will return all matches to 557 this or "Q9PN73_CAMJE" which the identifier without its /start-end 558 suffix. 559 560 In the example below, the suffix is required to match the AC, but must 561 be removed to match the OS and OC meta-data:: 562 563 # STOCKHOLM 1.0 564 #=GS Q9PN73_CAMJE/149-220 AC Q9PN73 565 ... 566 Q9PN73_CAMJE/149-220 NKA... 567 ... 568 #=GS Q9PN73_CAMJE OS Campylobacter jejuni 569 #=GS Q9PN73_CAMJE OC Bacteria 570 571 This function will return an empty dictionary if no data is found. 572 """ 573 name, start, end = self._identifier_split(identifier) 574 if name == identifier: 575 identifier_keys = [identifier] 576 else: 577 identifier_keys = [identifier, name] 578 answer = {} 579 for identifier_key in identifier_keys: 580 try: 581 for feature_key in meta_dict[identifier_key]: 582 answer[feature_key] = meta_dict[identifier_key][feature_key] 583 except KeyError: 584 pass 585 return answer
586
587 - def _populate_meta_data(self, identifier, record):
588 """Add meta-date to a SecRecord's annotations dictionary (PRIVATE). 589 590 This function applies the PFAM conventions. 591 """ 592 seq_data = self._get_meta_data(identifier, self.seq_annotation) 593 for feature in seq_data: 594 # Note this dictionary contains lists! 595 if feature == "AC": # ACcession number 596 assert len(seq_data[feature]) == 1 597 record.annotations["accession"] = seq_data[feature][0] 598 elif feature == "DE": # DEscription 599 record.description = "\n".join(seq_data[feature]) 600 elif feature == "DR": # Database Reference 601 # Should we try and parse the strings? 602 record.dbxrefs = seq_data[feature] 603 elif feature in self.pfam_gs_mapping: 604 record.annotations[self.pfam_gs_mapping[feature]] = ", ".join(seq_data[feature]) 605 else: 606 # Ignore it? 607 record.annotations["GS:" + feature] = ", ".join(seq_data[feature]) 608 609 # Now record the per-letter-annotations 610 seq_col_data = self._get_meta_data(identifier, self.seq_col_annotation) 611 for feature in seq_col_data: 612 # Note this dictionary contains strings! 613 if feature in self.pfam_gr_mapping: 614 record.letter_annotations[self.pfam_gr_mapping[feature]] = seq_col_data[feature] 615 else: 616 # Ignore it? 617 record.letter_annotations["GR:" + feature] = seq_col_data[feature]
618 619 620 if __name__ == "__main__": 621 from Bio._utils import run_doctest 622 run_doctest() 623