Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

   1  # Copyright 2006-2010 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # Nice link: 
   7  # http://www.ebi.ac.uk/help/formats_frame.html 
   8   
   9  r"""Sequence input/output as SeqRecord objects. 
  10   
  11  Bio.SeqIO is also documented at SeqIO_ and by 
  12  a whole chapter in our tutorial: 
  13   
  14    - `HTML Tutorial`_ 
  15    - `PDF Tutorial`_ 
  16   
  17  .. _SeqIO: http://biopython.org/wiki/SeqIO 
  18  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
  19  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
  20   
  21  Input 
  22  ----- 
  23  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
  24  (or in recent versions of Biopython alternatively a filename as a string), 
  25  and format string.  This returns an iterator giving SeqRecord objects: 
  26   
  27  >>> from Bio import SeqIO 
  28  >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
  29  ...     print("%s %i" % (record.id, len(record))) 
  30  gi|1348912|gb|G26680|G26680 633 
  31  gi|1348917|gb|G26685|G26685 413 
  32  gi|1592936|gb|G29385|G29385 471 
  33   
  34  Note that the parse() function will invoke the relevant parser for the 
  35  format with its default settings.  You may want more control, in which case 
  36  you need to create a format specific sequence iterator directly. 
  37   
  38  Input - Single Records 
  39  ---------------------- 
  40  If you expect your file to contain one-and-only-one record, then we provide 
  41  the following 'helper' function which will return a single SeqRecord, or 
  42  raise an exception if there are no records or more than one record: 
  43   
  44  >>> from Bio import SeqIO 
  45  >>> record = SeqIO.read("Fasta/f001", "fasta") 
  46  >>> print("%s %i" % (record.id, len(record))) 
  47  gi|3318709|pdb|1A91| 79 
  48   
  49  This style is useful when you expect a single record only (and would 
  50  consider multiple records an error).  For example, when dealing with GenBank 
  51  files for bacterial genomes or chromosomes, there is normally only a single 
  52  record.  Alternatively, use this with a handle when downloading a single 
  53  record from the internet. 
  54   
  55  However, if you just want the first record from a file containing multiple 
  56  record, use the next() function on the iterator (or under Python 2, the 
  57  iterator's next() method): 
  58   
  59  >>> from Bio import SeqIO 
  60  >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
  61  >>> print("%s %i" % (record.id, len(record))) 
  62  gi|1348912|gb|G26680|G26680 633 
  63   
  64  The above code will work as long as the file contains at least one record. 
  65  Note that if there is more than one record, the remaining records will be 
  66  silently ignored. 
  67   
  68   
  69  Input - Multiple Records 
  70  ------------------------ 
  71  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
  72  using a sequence iterator can save you a lot of memory (RAM).  There is 
  73  less benefit for interlaced file formats (e.g. most multiple alignment file 
  74  formats).  However, an iterator only lets you access the records one by one. 
  75   
  76  If you want random access to the records by number, turn this into a list: 
  77   
  78  >>> from Bio import SeqIO 
  79  >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
  80  >>> len(records) 
  81  3 
  82  >>> print(records[1].id) 
  83  gi|1348917|gb|G26685|G26685 
  84   
  85  If you want random access to the records by a key such as the record id, 
  86  turn the iterator into a dictionary: 
  87   
  88  >>> from Bio import SeqIO 
  89  >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
  90  >>> len(record_dict) 
  91  3 
  92  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
  93  413 
  94   
  95  However, using list() or the to_dict() function will load all the records 
  96  into memory at once, and therefore is not possible on very large files. 
  97  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
  98  providing dictionary like access to any record. For example, 
  99   
 100  >>> from Bio import SeqIO 
 101  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 102  >>> len(record_dict) 
 103  3 
 104  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 105  413 
 106  >>> record_dict.close() 
 107   
 108  Many but not all of the supported input file formats can be indexed like 
 109  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
 110  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
 111   
 112  In most cases you can also use SeqIO.index to get the record from the file 
 113  as a raw string (not a SeqRecord). This can be useful for example to extract 
 114  a sub-set of records from a file where SeqIO cannot output the file format 
 115  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
 116  keep the output 100% identical to the input). For example, 
 117   
 118  >>> from Bio import SeqIO 
 119  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 120  >>> len(record_dict) 
 121  3 
 122  >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
 123  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 124  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
 125  ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
 126  GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
 127  GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
 128  TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
 129  AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 130  <BLANKLINE> 
 131  >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
 132  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 133  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
 134  TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
 135  TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
 136  ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
 137  AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
 138  TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
 139  CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 140  <BLANKLINE> 
 141  >>> record_dict.close() 
 142   
 143  Here the original file and what Biopython would output differ in the line 
 144  wrapping. Also note that under Python 3, the get_raw method will return a 
 145  bytes string, hence the use of decode to turn it into a (unicode) string. 
 146  This is unnecessary on Python 2. 
 147   
 148  Also note that the get_raw method will preserve the newline endings. This 
 149  example FASTQ file uses Unix style endings (b"\n" only), 
 150   
 151  >>> from Bio import SeqIO 
 152  >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
 153  >>> len(fastq_dict) 
 154  3 
 155  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 156  >>> raw.count(b"\n") 
 157  4 
 158  >>> raw.count(b"\r\n") 
 159  0 
 160  >>> b"\r" in raw 
 161  False 
 162  >>> len(raw) 
 163  78 
 164  >>> fastq_dict.close() 
 165   
 166  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
 167   
 168  >>> from Bio import SeqIO 
 169  >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
 170  >>> len(fastq_dict) 
 171  3 
 172  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 173  >>> raw.count(b"\n") 
 174  4 
 175  >>> raw.count(b"\r\n") 
 176  4 
 177  >>> b"\r\n" in raw 
 178  True 
 179  >>> len(raw) 
 180  82 
 181  >>> fastq_dict.close() 
 182   
 183  Because this uses two bytes for each new line, the file is longer than 
 184  the Unix equivalent with only one byte. 
 185   
 186   
 187  Input - Alignments 
 188  ------------------ 
 189  You can read in alignment files as alignment objects using Bio.AlignIO. 
 190  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
 191  you a SeqRecord for each row of each alignment: 
 192   
 193  >>> from Bio import SeqIO 
 194  >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
 195  ...     print("%s %i" % (record.id, len(record))) 
 196  gi|167877390|gb|EDS40773.1| 447 
 197  gi|167234445|ref|NP_001107837. 447 
 198  gi|74100009|gb|AAZ99217.1| 447 
 199  gi|13990994|dbj|BAA33523.2| 447 
 200  gi|56122354|gb|AAV74328.1| 447 
 201   
 202   
 203  Output 
 204  ------ 
 205  Use the function Bio.SeqIO.write(...), which takes a complete set of 
 206  SeqRecord objects (either as a list, or an iterator), an output file handle 
 207  (or in recent versions of Biopython an output filename as a string) and of 
 208  course the file format:: 
 209   
 210    from Bio import SeqIO 
 211    records = ... 
 212    SeqIO.write(records, "example.faa", "fasta") 
 213   
 214  Or, using a handle:: 
 215   
 216      from Bio import SeqIO 
 217      records = ... 
 218      with open("example.faa", "w") as handle: 
 219        SeqIO.write(records, handle, "fasta") 
 220   
 221  You are expected to call this function once (with all your records) and if 
 222  using a handle, make sure you close it to flush the data to the hard disk. 
 223   
 224   
 225  Output - Advanced 
 226  ----------------- 
 227  The effect of calling write() multiple times on a single file will vary 
 228  depending on the file format, and is best avoided unless you have a strong 
 229  reason to do so. 
 230   
 231  If you give a filename, then each time you call write() the existing file 
 232  will be overwritten. For sequential files formats (e.g. fasta, genbank) each 
 233  "record block" holds a single sequence.  For these files it would probably 
 234  be safe to call write() multiple times by re-using the same handle. 
 235   
 236   
 237  However, trying this for certain alignment formats (e.g. phylip, clustal, 
 238  stockholm) would have the effect of concatenating several multiple sequence 
 239  alignments together.  Such files are created by the PHYLIP suite of programs 
 240  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
 241   
 242   
 243  Conversion 
 244  ---------- 
 245  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
 246  file format conversions. Additionally, it may use file format specific 
 247  optimisations so this should be the fastest way too. 
 248   
 249  In general however, you can combine the Bio.SeqIO.parse(...) function with 
 250  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
 251  generator expressions or generator functions provides a memory efficient way 
 252  to perform filtering or other extra operations as part of the process. 
 253   
 254   
 255  File Formats 
 256  ------------ 
 257  When specifying the file format, use lowercase strings.  The same format 
 258  names are also used in Bio.AlignIO and include the following: 
 259   
 260      - abif    - Applied Biosystem's sequencing trace format 
 261      - ace     - Reads the contig sequences from an ACE assembly file. 
 262      - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
 263      - fasta   - The generic sequence file format where each record starts with 
 264        an identifer line starting with a ">" character, followed by 
 265        lines of sequence. 
 266      - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
 267        sequence quality values (with an ASCII offset of 33). 
 268      - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
 269      - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
 270        encodes Solexa quality scores (not PHRED quality scores) with an 
 271        ASCII offset of 64. 
 272      - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
 273        which encodes PHRED quality scores with an ASCII offset of 64 
 274        (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
 275        will produce FASTQ files using the standard Sanger encoding. 
 276      - genbank - The GenBank or GenPept flat file format. 
 277      - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
 278      - ig      - The IntelliGenetics file format, apparently the same as the 
 279        MASE alignment format. 
 280      - imgt    - An EMBL like format from IMGT where the feature tables are more 
 281        indented to allow for longer feature types. 
 282      - pdb-seqres -  Reads a Protein Data Bank (PDB) file to determine the 
 283        complete protein sequence as it appears in the header (no dependencies). 
 284      - pdb-atom - Uses Bio.PDB to determine the (partial) protein sequence as 
 285        it appears in the structure based on the atom coordinate section of the 
 286        file (requires NumPy for Bio.PDB). 
 287      - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
 288      - pir     - A "FASTA like" format introduced by the National Biomedical 
 289        Research Foundation (NBRF) for the Protein Information Resource 
 290        (PIR) database, now part of UniProt. 
 291      - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
 292      - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
 293      - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
 294      - swiss   - Plain text Swiss-Prot aka UniProt format. 
 295      - tab     - Simple two column tab separated sequence files, where each 
 296        line holds a record's identifier and sequence. For example, 
 297        this is used as by Aligent's eArray software when saving 
 298        microarray probes in a minimal tab delimited text file. 
 299      - qual    - A "FASTA like" format holding PHRED quality values from 
 300        sequencing DNA, but no actual sequences (usually provided 
 301        in separate FASTA files). 
 302      - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
 303        text format which we call "swiss") 
 304   
 305  Note that while Bio.SeqIO can read all the above file formats, it cannot 
 306  write to all of them. 
 307   
 308  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
 309  "phylip" and "stockholm", which gives you access to the individual sequences 
 310  making up each alignment as SeqRecords. 
 311  """ 
 312   
 313  from __future__ import print_function 
 314  from Bio._py3k import basestring 
 315   
 316  __docformat__ = "restructuredtext en"  # not just plaintext 
 317   
 318  # TODO 
 319  # - define policy on reading aligned sequences with gaps in 
 320  #   (e.g. - and . characters) including how the alphabet interacts 
 321  # 
 322  # - How best to handle unique/non unique record.id when writing. 
 323  #   For most file formats reading such files is fine; The stockholm 
 324  #   parser would fail. 
 325  # 
 326  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
 327  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
 328   
 329  """ 
 330  FAO BioPython Developers 
 331  ------------------------ 
 332  The way I envision this SeqIO system working as that for any sequence file 
 333  format we have an iterator that returns SeqRecord objects. 
 334   
 335  This also applies to interlaced file formats (like clustal - although that 
 336  is now handled via Bio.AlignIO instead) where the file cannot be read record 
 337  by record.  You should still return an iterator, even if the implementation 
 338  could just as easily return a list. 
 339   
 340  These file format specific sequence iterators may be implemented as: 
 341      - Classes which take a handle for __init__ and provide the __iter__ method 
 342      - Functions that take a handle, and return an iterator object 
 343      - Generator functions that take a handle, and yield SeqRecord objects 
 344   
 345  It is then trivial to turn this iterator into a list of SeqRecord objects, 
 346  an in memory dictionary, or a multiple sequence alignment object. 
 347   
 348  For building the dictionary by default the id property of each SeqRecord is 
 349  used as the key.  You should always populate the id property, and it should 
 350  be unique in most cases. For some file formats the accession number is a good 
 351  choice.  If the file itself contains ambiguous identifiers, don't try and 
 352  dis-ambiguate them - return them as is. 
 353   
 354  When adding a new file format, please use the same lower case format name 
 355  as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
 356   
 357  See also http://biopython.org/wiki/SeqIO_dev 
 358   
 359  --Peter 
 360  """ 
 361   
 362   
 363  from Bio.File import as_handle 
 364  from Bio.SeqRecord import SeqRecord 
 365  from Bio.Align import MultipleSeqAlignment 
 366  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
 367   
 368  from . import AbiIO 
 369  from . import AceIO 
 370  from . import FastaIO 
 371  from . import IgIO  # IntelliGenetics or MASE format 
 372  from . import InsdcIO  # EMBL and GenBank 
 373  from . import PdbIO 
 374  from . import PhdIO 
 375  from . import PirIO 
 376  from . import SeqXmlIO 
 377  from . import SffIO 
 378  from . import SwissIO 
 379  from . import TabIO 
 380  from . import QualityIO  # FastQ and qual files 
 381  from . import UniprotIO 
 382   
 383   
 384  # Convention for format names is "mainname-subtype" in lower case. 
 385  # Please use the same names as BioPerl or EMBOSS where possible. 
 386  # 
 387  # Note that this simple system copes with defining 
 388  # multiple possible iterators for a given format/extension 
 389  # with the -subtype suffix 
 390  # 
 391  # Most alignment file formats will be handled via Bio.AlignIO 
 392   
 393  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
 394                       "gb": InsdcIO.GenBankIterator, 
 395                       "genbank": InsdcIO.GenBankIterator, 
 396                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
 397                       "embl": InsdcIO.EmblIterator, 
 398                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
 399                       "imgt": InsdcIO.ImgtIterator, 
 400                       "ig": IgIO.IgIterator, 
 401                       "swiss": SwissIO.SwissIterator, 
 402                       "pdb-atom": PdbIO.PdbAtomIterator, 
 403                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
 404                       "phd": PhdIO.PhdIterator, 
 405                       "ace": AceIO.AceIterator, 
 406                       "tab": TabIO.TabIterator, 
 407                       "pir": PirIO.PirIterator, 
 408                       "fastq": QualityIO.FastqPhredIterator, 
 409                       "fastq-sanger": QualityIO.FastqPhredIterator, 
 410                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
 411                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
 412                       "qual": QualityIO.QualPhredIterator, 
 413                       "sff": SffIO.SffIterator, 
 414                       # Not sure about this in the long run: 
 415                       "sff-trim": SffIO._SffTrimIterator, 
 416                       "uniprot-xml": UniprotIO.UniprotIterator, 
 417                       "seqxml": SeqXmlIO.SeqXmlIterator, 
 418                       "abi": AbiIO.AbiIterator, 
 419                       "abi-trim": AbiIO._AbiTrimIterator, 
 420                       } 
 421   
 422  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
 423                     "gb": InsdcIO.GenBankWriter, 
 424                     "genbank": InsdcIO.GenBankWriter, 
 425                     "embl": InsdcIO.EmblWriter, 
 426                     "imgt": InsdcIO.ImgtWriter, 
 427                     "tab": TabIO.TabWriter, 
 428                     "fastq": QualityIO.FastqPhredWriter, 
 429                     "fastq-sanger": QualityIO.FastqPhredWriter, 
 430                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
 431                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
 432                     "phd": PhdIO.PhdWriter, 
 433                     "qual": QualityIO.QualPhredWriter, 
 434                     "sff": SffIO.SffWriter, 
 435                     "seqxml": SeqXmlIO.SeqXmlWriter, 
 436                     } 
 437   
 438  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
 439   
 440   
441 -def write(sequences, handle, format):
442 """Write complete set of sequences to a file. 443 444 - sequences - A list (or iterator) of SeqRecord objects, or (if using 445 Biopython 1.54 or later) a single SeqRecord. 446 - handle - File handle object to write to, or filename as string 447 (note older versions of Biopython only took a handle). 448 - format - lower case string describing the file format to write. 449 450 You should close the handle after calling this function. 451 452 Returns the number of records written (as an integer). 453 """ 454 from Bio import AlignIO 455 456 # Try and give helpful error messages: 457 if not isinstance(format, basestring): 458 raise TypeError("Need a string for the file format (lower case)") 459 if not format: 460 raise ValueError("Format required (lower case string)") 461 if format != format.lower(): 462 raise ValueError("Format string '%s' should be lower case" % format) 463 464 if isinstance(handle, SeqRecord): 465 raise TypeError("Check arguments, handle should NOT be a SeqRecord") 466 if isinstance(handle, list): 467 # e.g. list of SeqRecord objects 468 raise TypeError("Check arguments, handle should NOT be a list") 469 470 if isinstance(sequences, SeqRecord): 471 # This raised an exception in order version of Biopython 472 sequences = [sequences] 473 474 if format in _BinaryFormats: 475 mode = 'wb' 476 else: 477 mode = 'w' 478 479 with as_handle(handle, mode) as fp: 480 # Map the file format to a writer class 481 if format in _FormatToWriter: 482 writer_class = _FormatToWriter[format] 483 count = writer_class(fp).write_file(sequences) 484 elif format in AlignIO._FormatToWriter: 485 # Try and turn all the records into a single alignment, 486 # and write that using Bio.AlignIO 487 alignment = MultipleSeqAlignment(sequences) 488 alignment_count = AlignIO.write([alignment], fp, format) 489 assert alignment_count == 1, \ 490 "Internal error - the underlying writer " \ 491 " should have returned 1, not %s" % repr(alignment_count) 492 count = len(alignment) 493 del alignment_count, alignment 494 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 495 raise ValueError("Reading format '%s' is supported, but not writing" 496 % format) 497 else: 498 raise ValueError("Unknown format '%s'" % format) 499 500 assert isinstance(count, int), "Internal error - the underlying %s " \ 501 "writer should have returned the record count, not %s" \ 502 % (format, repr(count)) 503 504 return count
505 506
507 -def parse(handle, format, alphabet=None):
508 r"""Turns a sequence file into an iterator returning SeqRecords. 509 510 - handle - handle to the file, or the filename as a string 511 (note older versions of Biopython only took a handle). 512 - format - lower case string describing the file format. 513 - alphabet - optional Alphabet object, useful when the sequence type 514 cannot be automatically inferred from the file itself 515 (e.g. format="fasta" or "tab") 516 517 Typical usage, opening a file to read in, and looping over the record(s): 518 519 >>> from Bio import SeqIO 520 >>> filename = "Fasta/sweetpea.nu" 521 >>> for record in SeqIO.parse(filename, "fasta"): 522 ... print("ID %s" % record.id) 523 ... print("Sequence length %i" % len(record)) 524 ... print("Sequence alphabet %s" % record.seq.alphabet) 525 ID gi|3176602|gb|U78617.1|LOU78617 526 Sequence length 309 527 Sequence alphabet SingleLetterAlphabet() 528 529 For file formats like FASTA where the alphabet cannot be determined, it 530 may be useful to specify the alphabet explicitly: 531 532 >>> from Bio import SeqIO 533 >>> from Bio.Alphabet import generic_dna 534 >>> filename = "Fasta/sweetpea.nu" 535 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 536 ... print("ID %s" % record.id) 537 ... print("Sequence length %i" % len(record)) 538 ... print("Sequence alphabet %s" % record.seq.alphabet) 539 ID gi|3176602|gb|U78617.1|LOU78617 540 Sequence length 309 541 Sequence alphabet DNAAlphabet() 542 543 If you have a string 'data' containing the file contents, you must 544 first turn this into a handle in order to parse it: 545 546 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 547 >>> from Bio import SeqIO 548 >>> try: 549 ... from StringIO import StringIO # Python 2 550 ... except ImportError: 551 ... from io import StringIO # Python 3 552 ... 553 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 554 ... print("%s %s" % (record.id, record.seq)) 555 Alpha ACCGGATGTA 556 Beta AGGCTCGGTTA 557 558 Use the Bio.SeqIO.read(...) function when you expect a single record 559 only. 560 """ 561 # NOTE - The above docstring has some raw \n characters needed 562 # for the StringIO example, hence the whole docstring is in raw 563 # string mode (see the leading r before the opening quote). 564 from Bio import AlignIO 565 566 # Hack for SFF, will need to make this more general in future 567 if format in _BinaryFormats: 568 mode = 'rb' 569 else: 570 mode = 'rU' 571 572 # Try and give helpful error messages: 573 if not isinstance(format, basestring): 574 raise TypeError("Need a string for the file format (lower case)") 575 if not format: 576 raise ValueError("Format required (lower case string)") 577 if format != format.lower(): 578 raise ValueError("Format string '%s' should be lower case" % format) 579 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 580 isinstance(alphabet, AlphabetEncoder)): 581 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 582 583 with as_handle(handle, mode) as fp: 584 # Map the file format to a sequence iterator: 585 if format in _FormatToIterator: 586 iterator_generator = _FormatToIterator[format] 587 if alphabet is None: 588 i = iterator_generator(fp) 589 else: 590 try: 591 i = iterator_generator(fp, alphabet=alphabet) 592 except TypeError: 593 i = _force_alphabet(iterator_generator(fp), alphabet) 594 elif format in AlignIO._FormatToIterator: 595 # Use Bio.AlignIO to read in the alignments 596 i = (r for alignment in AlignIO.parse(fp, format, 597 alphabet=alphabet) 598 for r in alignment) 599 else: 600 raise ValueError("Unknown format '%s'" % format) 601 # This imposes some overhead... wait until we drop Python 2.4 to fix it 602 for r in i: 603 yield r
604 605
606 -def _force_alphabet(record_iterator, alphabet):
607 """Iterate over records, over-riding the alphabet (PRIVATE).""" 608 # Assume the alphabet argument has been pre-validated 609 given_base_class = _get_base_alphabet(alphabet).__class__ 610 for record in record_iterator: 611 if isinstance(_get_base_alphabet(record.seq.alphabet), 612 given_base_class): 613 record.seq.alphabet = alphabet 614 yield record 615 else: 616 raise ValueError("Specified alphabet %s clashes with " 617 "that determined from the file, %s" 618 % (repr(alphabet), repr(record.seq.alphabet)))
619 620
621 -def read(handle, format, alphabet=None):
622 """Turns a sequence file into a single SeqRecord. 623 624 - handle - handle to the file, or the filename as a string 625 (note older versions of Biopython only took a handle). 626 - format - string describing the file format. 627 - alphabet - optional Alphabet object, useful when the sequence type 628 cannot be automatically inferred from the file itself 629 (e.g. format="fasta" or "tab") 630 631 This function is for use parsing sequence files containing 632 exactly one record. For example, reading a GenBank file: 633 634 >>> from Bio import SeqIO 635 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 636 >>> print("ID %s" % record.id) 637 ID AC007323.5 638 >>> print("Sequence length %i" % len(record)) 639 Sequence length 86436 640 >>> print("Sequence alphabet %s" % record.seq.alphabet) 641 Sequence alphabet IUPACAmbiguousDNA() 642 643 If the handle contains no records, or more than one record, 644 an exception is raised. For example: 645 646 >>> from Bio import SeqIO 647 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 648 Traceback (most recent call last): 649 ... 650 ValueError: More than one record found in handle 651 652 If however you want the first record from a file containing 653 multiple records this function would raise an exception (as 654 shown in the example above). Instead use: 655 656 >>> from Bio import SeqIO 657 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 658 >>> print("First record's ID %s" % record.id) 659 First record's ID X55053.1 660 661 Use the Bio.SeqIO.parse(handle, format) function if you want 662 to read multiple records from the handle. 663 """ 664 iterator = parse(handle, format, alphabet) 665 try: 666 first = next(iterator) 667 except StopIteration: 668 first = None 669 if first is None: 670 raise ValueError("No records found in handle") 671 try: 672 second = next(iterator) 673 except StopIteration: 674 second = None 675 if second is not None: 676 raise ValueError("More than one record found in handle") 677 return first
678 679
680 -def to_dict(sequences, key_function=None):
681 """Turns a sequence iterator or list into a dictionary. 682 683 - sequences - An iterator that returns SeqRecord objects, 684 or simply a list of SeqRecord objects. 685 - key_function - Optional callback function which when given a 686 SeqRecord should return a unique key for the dictionary. 687 688 e.g. key_function = lambda rec : rec.name 689 or, key_function = lambda rec : rec.description.split()[0] 690 691 If key_function is omitted then record.id is used, on the assumption 692 that the records objects returned are SeqRecords with a unique id. 693 694 If there are duplicate keys, an error is raised. 695 696 Example usage, defaulting to using the record.id as key: 697 698 >>> from Bio import SeqIO 699 >>> filename = "GenBank/cor6_6.gb" 700 >>> format = "genbank" 701 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 702 >>> print(sorted(id_dict)) 703 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 704 >>> print(id_dict["L31939.1"].description) 705 Brassica rapa (clone bif72) kin mRNA, complete cds. 706 707 A more complex example, using the key_function argument in order to 708 use a sequence checksum as the dictionary key: 709 710 >>> from Bio import SeqIO 711 >>> from Bio.SeqUtils.CheckSum import seguid 712 >>> filename = "GenBank/cor6_6.gb" 713 >>> format = "genbank" 714 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 715 ... key_function = lambda rec : seguid(rec.seq)) 716 >>> for key, record in sorted(seguid_dict.items()): 717 ... print("%s %s" % (key, record.id)) 718 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 719 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 720 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 721 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 722 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 723 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 724 725 This approach is not suitable for very large sets of sequences, as all 726 the SeqRecord objects are held in memory. Instead, consider using the 727 Bio.SeqIO.index() function (if it supports your particular file format). 728 """ 729 if key_function is None: 730 key_function = lambda rec: rec.id 731 732 d = dict() 733 for record in sequences: 734 key = key_function(record) 735 if key in d: 736 raise ValueError("Duplicate key '%s'" % key) 737 d[key] = record 738 return d
739 740
741 -def index(filename, format, alphabet=None, key_function=None):
742 """Indexes a sequence file and returns a dictionary like object. 743 744 - filename - string giving name of file to be indexed 745 - format - lower case string describing the file format 746 - alphabet - optional Alphabet object, useful when the sequence type 747 cannot be automatically inferred from the file itself 748 (e.g. format="fasta" or "tab") 749 - key_function - Optional callback function which when given a 750 SeqRecord identifier string should return a unique 751 key for the dictionary. 752 753 This indexing function will return a dictionary like object, giving the 754 SeqRecord objects as values: 755 756 >>> from Bio import SeqIO 757 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 758 >>> len(records) 759 3 760 >>> sorted(records) 761 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 762 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 763 >EAS54_6_R1_2_1_540_792 764 TTGGCAGGCCAAGGCCGATGGATCA 765 <BLANKLINE> 766 >>> "EAS54_6_R1_2_1_540_792" in records 767 True 768 >>> print(records.get("Missing", None)) 769 None 770 >>> records.close() 771 772 If the file is BGZF compressed, this is detected automatically. Ordinary 773 GZIP files are not supported: 774 775 >>> from Bio import SeqIO 776 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 777 >>> len(records) 778 3 779 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 780 TTGGCAGGCCAAGGCCGATGGATCA 781 >>> records.close() 782 783 Note that this pseudo dictionary will not support all the methods of a 784 true Python dictionary, for example values() is not defined since this 785 would require loading all of the records into memory at once. 786 787 When you call the index function, it will scan through the file, noting 788 the location of each record. When you access a particular record via the 789 dictionary methods, the code will jump to the appropriate part of the 790 file and then parse that section into a SeqRecord. 791 792 Note that not all the input formats supported by Bio.SeqIO can be used 793 with this index function. It is designed to work only with sequential 794 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 795 interlaced file format (e.g. alignment formats such as "clustal"). 796 797 For small files, it may be more efficient to use an in memory Python 798 dictionary, e.g. 799 800 >>> from Bio import SeqIO 801 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq")) 802 >>> len(records) 803 3 804 >>> sorted(records) 805 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 806 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 807 >EAS54_6_R1_2_1_540_792 808 TTGGCAGGCCAAGGCCGATGGATCA 809 <BLANKLINE> 810 811 As with the to_dict() function, by default the id string of each record 812 is used as the key. You can specify a callback function to transform 813 this (the record identifier string) into your preferred key. For example: 814 815 >>> from Bio import SeqIO 816 >>> def make_tuple(identifier): 817 ... parts = identifier.split("_") 818 ... return int(parts[-2]), int(parts[-1]) 819 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 820 ... key_function=make_tuple) 821 >>> len(records) 822 3 823 >>> sorted(records) 824 [(413, 324), (443, 348), (540, 792)] 825 >>> print(records[(540, 792)].format("fasta")) 826 >EAS54_6_R1_2_1_540_792 827 TTGGCAGGCCAAGGCCGATGGATCA 828 <BLANKLINE> 829 >>> (540, 792) in records 830 True 831 >>> "EAS54_6_R1_2_1_540_792" in records 832 False 833 >>> print(records.get("Missing", None)) 834 None 835 >>> records.close() 836 837 Another common use case would be indexing an NCBI style FASTA file, 838 where you might want to extract the GI number from the FASTA identifier 839 to use as the dictionary key. 840 841 Notice that unlike the to_dict() function, here the key_function does 842 not get given the full SeqRecord to use to generate the key. Doing so 843 would impose a severe performance penalty as it would require the file 844 to be completely parsed while building the index. Right now this is 845 usually avoided. 846 847 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 848 """ 849 # Try and give helpful error messages: 850 if not isinstance(filename, basestring): 851 raise TypeError("Need a filename (not a handle)") 852 if not isinstance(format, basestring): 853 raise TypeError("Need a string for the file format (lower case)") 854 if not format: 855 raise ValueError("Format required (lower case string)") 856 if format != format.lower(): 857 raise ValueError("Format string '%s' should be lower case" % format) 858 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 859 isinstance(alphabet, AlphabetEncoder)): 860 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 861 862 # Map the file format to a sequence iterator: 863 from ._index import _FormatToRandomAccess # Lazy import 864 from Bio.File import _IndexedSeqFileDict 865 try: 866 proxy_class = _FormatToRandomAccess[format] 867 except KeyError: 868 raise ValueError("Unsupported format %r" % format) 869 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 870 % (filename, format, alphabet, key_function) 871 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 872 key_function, repr, "SeqRecord")
873 874
875 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 876 key_function=None):
877 """Index several sequence files and return a dictionary like object. 878 879 The index is stored in an SQLite database rather than in memory (as in the 880 Bio.SeqIO.index(...) function). 881 882 - index_filename - Where to store the SQLite index 883 - filenames - list of strings specifying file(s) to be indexed, or when 884 indexing a single file this can be given as a string. 885 (optional if reloading an existing index, but must match) 886 - format - lower case string describing the file format 887 (optional if reloading an existing index, but must match) 888 - alphabet - optional Alphabet object, useful when the sequence type 889 cannot be automatically inferred from the file itself 890 (e.g. format="fasta" or "tab") 891 - key_function - Optional callback function which when given a 892 SeqRecord identifier string should return a unique 893 key for the dictionary. 894 895 This indexing function will return a dictionary like object, giving the 896 SeqRecord objects as values: 897 898 >>> from Bio.Alphabet import generic_protein 899 >>> from Bio import SeqIO 900 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 901 >>> def get_gi(name): 902 ... parts = name.split("|") 903 ... i = parts.index("gi") 904 ... assert i != -1 905 ... return parts[i+1] 906 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 907 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 908 >>> len(records) 909 95 910 >>> records["7525076"].description 911 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 912 >>> records["45478717"].description 913 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 914 >>> records.close() 915 916 In this example the two files contain 85 and 10 records respectively. 917 918 BGZF compressed files are supported, and detected automatically. Ordinary 919 GZIP compressed files are not supported. 920 921 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module 922 glob which is useful for building lists of files. 923 """ 924 # Try and give helpful error messages: 925 if not isinstance(index_filename, basestring): 926 raise TypeError("Need a string for the index filename") 927 if isinstance(filenames, basestring): 928 # Make the API a little more friendly, and more similar 929 # to Bio.SeqIO.index(...) for indexing just one file. 930 filenames = [filenames] 931 if filenames is not None and not isinstance(filenames, list): 932 raise TypeError( 933 "Need a list of filenames (as strings), or one filename") 934 if format is not None and not isinstance(format, basestring): 935 raise TypeError("Need a string for the file format (lower case)") 936 if format and format != format.lower(): 937 raise ValueError("Format string '%s' should be lower case" % format) 938 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 939 isinstance(alphabet, AlphabetEncoder)): 940 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 941 942 # Map the file format to a sequence iterator: 943 from ._index import _FormatToRandomAccess # Lazy import 944 from Bio.File import _SQLiteManySeqFilesDict 945 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 946 % (index_filename, filenames, format, alphabet, key_function) 947 948 def proxy_factory(format, filename=None): 949 """Given a filename returns proxy object, else boolean if format OK.""" 950 if filename: 951 return _FormatToRandomAccess[format](filename, format, alphabet) 952 else: 953 return format in _FormatToRandomAccess
954 955 return _SQLiteManySeqFilesDict(index_filename, filenames, 956 proxy_factory, format, 957 key_function, repr) 958 959
960 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
961 """Convert between two sequence file formats, return number of records. 962 963 - in_file - an input handle or filename 964 - in_format - input file format, lower case string 965 - out_file - an output handle or filename 966 - out_format - output file format, lower case string 967 - alphabet - optional alphabet to assume 968 969 **NOTE** - If you provide an output filename, it will be opened which will 970 overwrite any existing file without warning. This may happen if even 971 the conversion is aborted (e.g. an invalid out_format name is given). 972 973 For example, going from a filename to a handle: 974 975 >>> from Bio import SeqIO 976 >>> try: 977 ... from StringIO import StringIO # Python 2 978 ... except ImportError: 979 ... from io import StringIO # Python 3 980 ... 981 >>> handle = StringIO("") 982 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 983 3 984 >>> print(handle.getvalue()) 985 >EAS54_6_R1_2_1_413_324 986 CCCTTCTTGTCTTCAGCGTTTCTCC 987 >EAS54_6_R1_2_1_540_792 988 TTGGCAGGCCAAGGCCGATGGATCA 989 >EAS54_6_R1_2_1_443_348 990 GTTGCTTCTGGCGTGGGTGGGGGGG 991 <BLANKLINE> 992 """ 993 # Hack for SFF, will need to make this more general in future 994 if in_format in _BinaryFormats: 995 in_mode = 'rb' 996 else: 997 in_mode = 'rU' 998 999 # Don't open the output file until we've checked the input is OK? 1000 if out_format in ["sff", "sff_trim"]: 1001 out_mode = 'wb' 1002 else: 1003 out_mode = 'w' 1004 1005 # This will check the arguments and issue error messages, 1006 # after we have opened the file which is a shame. 1007 from ._convert import _handle_convert # Lazy import 1008 with as_handle(in_file, in_mode) as in_handle: 1009 with as_handle(out_file, out_mode) as out_handle: 1010 count = _handle_convert(in_handle, in_format, 1011 out_handle, out_format, 1012 alphabet) 1013 return count
1014 1015 1016 # This helpful trick for testing no longer works with the 1017 # local imports :( 1018 # 1019 # if __name__ == "__main__": 1020 # from Bio._utils import run_doctest 1021 # run_doctest() 1022