Package Bio :: Package SeqIO :: Module PirIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.PirIO

  1  # Copyright 2008-2015 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5  # 
  6  # This module is for reading and writing PIR or NBRF format files as 
  7  # SeqRecord objects.  The code is based on Bio.SeqIO.FastaIO 
  8   
  9  """Bio.SeqIO support for the "pir" (aka PIR or NBRF) file format. 
 10   
 11  You are expected to use this module via the Bio.SeqIO functions, or if 
 12  the file contains a sequence alignment, optionally via Bio.AlignIO instead. 
 13   
 14  This format was introduced for the Protein Information Resource (PIR), a 
 15  project of the National Biomedical Research Foundation (NBRF).  The PIR 
 16  database itself is now part of UniProt. 
 17   
 18  The file format is described online at: 
 19  http://www.ebi.ac.uk/help/pir_frame.html 
 20  http://www.cmbi.kun.nl/bioinf/tools/crab_pir.html (currently down) 
 21   
 22  An example file in this format would be:: 
 23   
 24    >P1;CRAB_ANAPL 
 25    ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN). 
 26      MDITIHNPLI RRPLFSWLAP SRIFDQIFGE HLQESELLPA SPSLSPFLMR 
 27      SPIFRMPSWL ETGLSEMRLE KDKFSVNLDV KHFSPEELKV KVLGDMVEIH 
 28      GKHEERQDEH GFIAREFNRK YRIPADVDPL TITSSLSLDG VLTVSAPRKQ 
 29      SDVPERSIPI TREEKPAIAG AQRK* 
 30   
 31    >P1;CRAB_BOVIN 
 32    ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN). 
 33      MDIAIHHPWI RRPFFPFHSP SRLFDQFFGE HLLESDLFPA STSLSPFYLR 
 34      PPSFLRAPSW IDTGLSEMRL EKDRFSVNLD VKHFSPEELK VKVLGDVIEV 
 35      HGKHEERQDE HGFISREFHR KYRIPADVDP LAITSSLSSD GVLTVNGPRK 
 36      QASGPERTIP ITREEKPAVT AAPKK* 
 37   
 38  Or, an example of a multiple sequence alignment:: 
 39   
 40    >P1;S27231 
 41    rhodopsin - northern leopard frog 
 42    MNGTEGPNFY IPMSNKTGVV RSPFDYPQYY LAEPWKYSVL AAYMFLLILL GLPINFMTLY 
 43    VTIQHKKLRT PLNYILLNLG VCNHFMVLCG FTITMYTSLH GYFVFGQTGC YFEGFFATLG 
 44    GEIALWSLVV LAIERYIVVC KPMSNFRFGE NHAMMGVAFT WIMALACAVP PLFGWSRYIP 
 45    EGMQCSCGVD YYTLKPEVNN ESFVIYMFVV HFLIPLIIIS FCYGRLVCTV KEAAAQQQES 
 46    ATTQKAEKEV TRMVIIMVIF FLICWVPYAY VAFYIFTHQG SEFGPIFMTV PAFFAKSSAI 
 47    YNPVIYIMLN KQFRNCMITT LCCGKNPFGD DDASSAATSK TEATSVSTSQ VSPA* 
 48   
 49    >P1;I51200 
 50    rhodopsin - African clawed frog 
 51    MNGTEGPNFY VPMSNKTGVV RSPFDYPQYY LAEPWQYSAL AAYMFLLILL GLPINFMTLF 
 52    VTIQHKKLRT PLNYILLNLV FANHFMVLCG FTVTMYTSMH GYFIFGPTGC YIEGFFATLG 
 53    GEVALWSLVV LAVERYIVVC KPMANFRFGE NHAIMGVAFT WIMALSCAAP PLFGWSRYIP 
 54    EGMQCSCGVD YYTLKPEVNN ESFVIYMFIV HFTIPLIVIF FCYGRLLCTV KEAAAQQQES 
 55    LTTQKAEKEV TRMVVIMVVF FLICWVPYAY VAFYIFTHQG SNFGPVFMTV PAFFAKSSAI 
 56    YNPVIYIVLN KQFRNCLITT LCCGKNPFGD EDGSSAATSK TEASSVSSSQ VSPA* 
 57   
 58    >P1;JN0120 
 59    rhodopsin - Japanese lamprey 
 60    MNGTEGDNFY VPFSNKTGLA RSPYEYPQYY LAEPWKYSAL AAYMFFLILV GFPVNFLTLF 
 61    VTVQHKKLRT PLNYILLNLA MANLFMVLFG FTVTMYTSMN GYFVFGPTMC SIEGFFATLG 
 62    GEVALWSLVV LAIERYIVIC KPMGNFRFGN THAIMGVAFT WIMALACAAP PLVGWSRYIP 
 63    EGMQCSCGPD YYTLNPNFNN ESYVVYMFVV HFLVPFVIIF FCYGRLLCTV KEAAAAQQES 
 64    ASTQKAEKEV TRMVVLMVIG FLVCWVPYAS VAFYIFTHQG SDFGATFMTL PAFFAKSSAL 
 65    YNPVIYILMN KQFRNCMITT LCCGKNPLGD DE-SGASTSKT EVSSVSTSPV SPA* 
 66   
 67   
 68  As with the FASTA format, each record starts with a line beginning with ">" 
 69  character.  There is then a two letter sequence type (P1, F1, DL, DC, RL, 
 70  RC, or XX), a semi colon, and the identification code.  The second like is 
 71  free text description.  The remaining lines contain the sequence itself, 
 72  terminating in an asterisk.  Space separated blocks of ten letters as shown 
 73  above are typical. 
 74   
 75  Sequence codes and their meanings: 
 76   
 77    - P1 - Protein (complete) 
 78    - F1 - Protein (fragment) 
 79    - D1 - DNA (e.g. EMBOSS seqret output) 
 80    - DL - DNA (linear) 
 81    - DC - DNA (circular) 
 82    - RL - RNA (linear) 
 83    - RC - RNA (circular) 
 84    - N3 - tRNA 
 85    - N1 - Other functional RNA 
 86    - XX - Unknown 
 87  """ 
 88   
 89  from __future__ import print_function 
 90   
 91  from Bio.Alphabet import single_letter_alphabet, generic_protein, \ 
 92      generic_dna, generic_rna 
 93  from Bio.Seq import Seq 
 94  from Bio.SeqRecord import SeqRecord 
 95   
 96   
 97  _pir_alphabets = {"P1": generic_protein, 
 98                    "F1": generic_protein, 
 99                    "D1": generic_dna, 
100                    "DL": generic_dna, 
101                    "DC": generic_dna, 
102                    "RL": generic_rna, 
103                    "RC": generic_rna, 
104                    "N3": generic_rna, 
105                    "XX": single_letter_alphabet, 
106                    } 
107   
108   
109 -def PirIterator(handle):
110 """Generator function to iterate over Fasta records (as SeqRecord objects). 111 112 handle - input file 113 alphabet - optional alphabet 114 title2ids - A function that, when given the title of the FASTA 115 file (without the beginning >), will return the id, name and 116 description (in that order) for the record as a tuple of strings. 117 118 If this is not given, then the entire title line will be used 119 as the description, and the first word as the id and name. 120 121 Note that use of title2ids matches that of Bio.Fasta.SequenceParser 122 but the defaults are slightly different. 123 124 Example: 125 126 >>> with open("NBRF/DMB_prot.pir") as handle: 127 ... for record in PirIterator(handle): 128 ... print("%s length %i" % (record.id, len(record))) 129 HLA:HLA00489 length 263 130 HLA:HLA00490 length 94 131 HLA:HLA00491 length 94 132 HLA:HLA00492 length 80 133 HLA:HLA00493 length 175 134 HLA:HLA01083 length 188 135 136 """ 137 # Skip any text before the first record (e.g. blank lines, comments) 138 while True: 139 line = handle.readline() 140 if line == "": 141 return # Premature end of file, or just empty? 142 if line[0] == ">": 143 break 144 145 while True: 146 if line[0] != ">": 147 raise ValueError( 148 "Records in PIR files should start with '>' character") 149 pir_type = line[1:3] 150 if pir_type not in _pir_alphabets or line[3] != ";": 151 raise ValueError( 152 "Records should start with '>XX;' " 153 "where XX is a valid sequence type") 154 identifier = line[4:].strip() 155 description = handle.readline().strip() 156 157 lines = [] 158 line = handle.readline() 159 while True: 160 if not line: 161 break 162 if line[0] == ">": 163 break 164 # Remove trailing whitespace, and any internal spaces 165 lines.append(line.rstrip().replace(" ", "")) 166 line = handle.readline() 167 seq = "".join(lines) 168 if seq[-1] != "*": 169 # Note the * terminator is present on nucleotide sequences too, 170 # it is not a stop codon! 171 raise ValueError( 172 "Sequences in PIR files should include a * terminator!") 173 174 # Return the record and then continue... 175 record = SeqRecord(Seq(seq[:-1], _pir_alphabets[pir_type]), 176 id=identifier, name=identifier, 177 description=description) 178 record.annotations["PIR-type"] = pir_type 179 yield record 180 181 if not line: 182 return # StopIteration 183 assert False, "Should not reach this line"
184 185 if __name__ == "__main__": 186 from Bio._utils import run_doctest 187 run_doctest(verbose=0) 188