Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

   1  # Copyright 2006-2015 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # Nice link: 
   7  # http://www.ebi.ac.uk/help/formats_frame.html 
   8   
   9  r"""Sequence input/output as SeqRecord objects. 
  10   
  11  Bio.SeqIO is also documented at SeqIO_ and by 
  12  a whole chapter in our tutorial: 
  13   
  14    - `HTML Tutorial`_ 
  15    - `PDF Tutorial`_ 
  16   
  17  .. _SeqIO: http://biopython.org/wiki/SeqIO 
  18  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
  19  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
  20   
  21  Input 
  22  ----- 
  23  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
  24  (or in recent versions of Biopython alternatively a filename as a string), 
  25  and format string.  This returns an iterator giving SeqRecord objects: 
  26   
  27  >>> from Bio import SeqIO 
  28  >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
  29  ...     print("%s %i" % (record.id, len(record))) 
  30  gi|1348912|gb|G26680|G26680 633 
  31  gi|1348917|gb|G26685|G26685 413 
  32  gi|1592936|gb|G29385|G29385 471 
  33   
  34  Note that the parse() function will invoke the relevant parser for the 
  35  format with its default settings.  You may want more control, in which case 
  36  you need to create a format specific sequence iterator directly. 
  37   
  38  Input - Single Records 
  39  ---------------------- 
  40  If you expect your file to contain one-and-only-one record, then we provide 
  41  the following 'helper' function which will return a single SeqRecord, or 
  42  raise an exception if there are no records or more than one record: 
  43   
  44  >>> from Bio import SeqIO 
  45  >>> record = SeqIO.read("Fasta/f001", "fasta") 
  46  >>> print("%s %i" % (record.id, len(record))) 
  47  gi|3318709|pdb|1A91| 79 
  48   
  49  This style is useful when you expect a single record only (and would 
  50  consider multiple records an error).  For example, when dealing with GenBank 
  51  files for bacterial genomes or chromosomes, there is normally only a single 
  52  record.  Alternatively, use this with a handle when downloading a single 
  53  record from the internet. 
  54   
  55  However, if you just want the first record from a file containing multiple 
  56  record, use the next() function on the iterator (or under Python 2, the 
  57  iterator's next() method): 
  58   
  59  >>> from Bio import SeqIO 
  60  >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
  61  >>> print("%s %i" % (record.id, len(record))) 
  62  gi|1348912|gb|G26680|G26680 633 
  63   
  64  The above code will work as long as the file contains at least one record. 
  65  Note that if there is more than one record, the remaining records will be 
  66  silently ignored. 
  67   
  68   
  69  Input - Multiple Records 
  70  ------------------------ 
  71  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
  72  using a sequence iterator can save you a lot of memory (RAM).  There is 
  73  less benefit for interlaced file formats (e.g. most multiple alignment file 
  74  formats).  However, an iterator only lets you access the records one by one. 
  75   
  76  If you want random access to the records by number, turn this into a list: 
  77   
  78  >>> from Bio import SeqIO 
  79  >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
  80  >>> len(records) 
  81  3 
  82  >>> print(records[1].id) 
  83  gi|1348917|gb|G26685|G26685 
  84   
  85  If you want random access to the records by a key such as the record id, 
  86  turn the iterator into a dictionary: 
  87   
  88  >>> from Bio import SeqIO 
  89  >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
  90  >>> len(record_dict) 
  91  3 
  92  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
  93  413 
  94   
  95  However, using list() or the to_dict() function will load all the records 
  96  into memory at once, and therefore is not possible on very large files. 
  97  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
  98  providing dictionary like access to any record. For example, 
  99   
 100  >>> from Bio import SeqIO 
 101  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 102  >>> len(record_dict) 
 103  3 
 104  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 105  413 
 106  >>> record_dict.close() 
 107   
 108  Many but not all of the supported input file formats can be indexed like 
 109  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
 110  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
 111   
 112  In most cases you can also use SeqIO.index to get the record from the file 
 113  as a raw string (not a SeqRecord). This can be useful for example to extract 
 114  a sub-set of records from a file where SeqIO cannot output the file format 
 115  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
 116  keep the output 100% identical to the input). For example, 
 117   
 118  >>> from Bio import SeqIO 
 119  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 120  >>> len(record_dict) 
 121  3 
 122  >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
 123  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 124  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
 125  ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
 126  GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
 127  GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
 128  TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
 129  AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 130  <BLANKLINE> 
 131  >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
 132  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 133  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
 134  TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
 135  TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
 136  ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
 137  AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
 138  TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
 139  CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 140  <BLANKLINE> 
 141  >>> record_dict.close() 
 142   
 143  Here the original file and what Biopython would output differ in the line 
 144  wrapping. Also note that under Python 3, the get_raw method will return a 
 145  bytes string, hence the use of decode to turn it into a (unicode) string. 
 146  This is unnecessary on Python 2. 
 147   
 148  Also note that the get_raw method will preserve the newline endings. This 
 149  example FASTQ file uses Unix style endings (b"\n" only), 
 150   
 151  >>> from Bio import SeqIO 
 152  >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
 153  >>> len(fastq_dict) 
 154  3 
 155  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 156  >>> raw.count(b"\n") 
 157  4 
 158  >>> raw.count(b"\r\n") 
 159  0 
 160  >>> b"\r" in raw 
 161  False 
 162  >>> len(raw) 
 163  78 
 164  >>> fastq_dict.close() 
 165   
 166  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
 167   
 168  >>> from Bio import SeqIO 
 169  >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
 170  >>> len(fastq_dict) 
 171  3 
 172  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 173  >>> raw.count(b"\n") 
 174  4 
 175  >>> raw.count(b"\r\n") 
 176  4 
 177  >>> b"\r\n" in raw 
 178  True 
 179  >>> len(raw) 
 180  82 
 181  >>> fastq_dict.close() 
 182   
 183  Because this uses two bytes for each new line, the file is longer than 
 184  the Unix equivalent with only one byte. 
 185   
 186   
 187  Input - Alignments 
 188  ------------------ 
 189  You can read in alignment files as alignment objects using Bio.AlignIO. 
 190  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
 191  you a SeqRecord for each row of each alignment: 
 192   
 193  >>> from Bio import SeqIO 
 194  >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
 195  ...     print("%s %i" % (record.id, len(record))) 
 196  gi|167877390|gb|EDS40773.1| 447 
 197  gi|167234445|ref|NP_001107837. 447 
 198  gi|74100009|gb|AAZ99217.1| 447 
 199  gi|13990994|dbj|BAA33523.2| 447 
 200  gi|56122354|gb|AAV74328.1| 447 
 201   
 202   
 203  Output 
 204  ------ 
 205  Use the function Bio.SeqIO.write(...), which takes a complete set of 
 206  SeqRecord objects (either as a list, or an iterator), an output file handle 
 207  (or in recent versions of Biopython an output filename as a string) and of 
 208  course the file format:: 
 209   
 210    from Bio import SeqIO 
 211    records = ... 
 212    SeqIO.write(records, "example.faa", "fasta") 
 213   
 214  Or, using a handle:: 
 215   
 216      from Bio import SeqIO 
 217      records = ... 
 218      with open("example.faa", "w") as handle: 
 219        SeqIO.write(records, handle, "fasta") 
 220   
 221  You are expected to call this function once (with all your records) and if 
 222  using a handle, make sure you close it to flush the data to the hard disk. 
 223   
 224   
 225  Output - Advanced 
 226  ----------------- 
 227  The effect of calling write() multiple times on a single file will vary 
 228  depending on the file format, and is best avoided unless you have a strong 
 229  reason to do so. 
 230   
 231  If you give a filename, then each time you call write() the existing file 
 232  will be overwritten. For sequential files formats (e.g. fasta, genbank) each 
 233  "record block" holds a single sequence.  For these files it would probably 
 234  be safe to call write() multiple times by re-using the same handle. 
 235   
 236   
 237  However, trying this for certain alignment formats (e.g. phylip, clustal, 
 238  stockholm) would have the effect of concatenating several multiple sequence 
 239  alignments together.  Such files are created by the PHYLIP suite of programs 
 240  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
 241   
 242   
 243  Conversion 
 244  ---------- 
 245  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
 246  file format conversions. Additionally, it may use file format specific 
 247  optimisations so this should be the fastest way too. 
 248   
 249  In general however, you can combine the Bio.SeqIO.parse(...) function with 
 250  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
 251  generator expressions or generator functions provides a memory efficient way 
 252  to perform filtering or other extra operations as part of the process. 
 253   
 254   
 255  File Formats 
 256  ------------ 
 257  When specifying the file format, use lowercase strings.  The same format 
 258  names are also used in Bio.AlignIO and include the following: 
 259   
 260      - abif    - Applied Biosystem's sequencing trace format 
 261      - ace     - Reads the contig sequences from an ACE assembly file. 
 262      - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
 263      - fasta   - The generic sequence file format where each record starts with 
 264        an identifer line starting with a ">" character, followed by 
 265        lines of sequence. 
 266      - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
 267        sequence quality values (with an ASCII offset of 33). 
 268      - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
 269      - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
 270        encodes Solexa quality scores (not PHRED quality scores) with an 
 271        ASCII offset of 64. 
 272      - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
 273        which encodes PHRED quality scores with an ASCII offset of 64 
 274        (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
 275        will produce FASTQ files using the standard Sanger encoding. 
 276      - genbank - The GenBank or GenPept flat file format. 
 277      - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
 278      - ig      - The IntelliGenetics file format, apparently the same as the 
 279        MASE alignment format. 
 280      - imgt    - An EMBL like format from IMGT where the feature tables are more 
 281        indented to allow for longer feature types. 
 282      - pdb-seqres -  Reads a Protein Data Bank (PDB) file to determine the 
 283        complete protein sequence as it appears in the header (no dependencies). 
 284      - pdb-atom - Uses Bio.PDB to determine the (partial) protein sequence as 
 285        it appears in the structure based on the atom coordinate section of the 
 286        file (requires NumPy for Bio.PDB). 
 287      - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
 288      - pir     - A "FASTA like" format introduced by the National Biomedical 
 289        Research Foundation (NBRF) for the Protein Information Resource 
 290        (PIR) database, now part of UniProt. 
 291      - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
 292      - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
 293      - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
 294      - swiss   - Plain text Swiss-Prot aka UniProt format. 
 295      - tab     - Simple two column tab separated sequence files, where each 
 296        line holds a record's identifier and sequence. For example, 
 297        this is used as by Aligent's eArray software when saving 
 298        microarray probes in a minimal tab delimited text file. 
 299      - qual    - A "FASTA like" format holding PHRED quality values from 
 300        sequencing DNA, but no actual sequences (usually provided 
 301        in separate FASTA files). 
 302      - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
 303        text format which we call "swiss") 
 304   
 305  Note that while Bio.SeqIO can read all the above file formats, it cannot 
 306  write to all of them. 
 307   
 308  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
 309  "phylip" and "stockholm", which gives you access to the individual sequences 
 310  making up each alignment as SeqRecords. 
 311  """ 
 312   
 313  from __future__ import print_function 
 314  from Bio._py3k import basestring 
 315   
 316  # TODO 
 317  # - define policy on reading aligned sequences with gaps in 
 318  #   (e.g. - and . characters) including how the alphabet interacts 
 319  # 
 320  # - How best to handle unique/non unique record.id when writing. 
 321  #   For most file formats reading such files is fine; The stockholm 
 322  #   parser would fail. 
 323  # 
 324  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
 325  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
 326  # 
 327  # FAO BioPython Developers 
 328  # ------------------------ 
 329  # The way I envision this SeqIO system working as that for any sequence file 
 330  # format we have an iterator that returns SeqRecord objects. 
 331  # 
 332  # This also applies to interlaced file formats (like clustal - although that 
 333  # is now handled via Bio.AlignIO instead) where the file cannot be read record 
 334  # by record.  You should still return an iterator, even if the implementation 
 335  # could just as easily return a list. 
 336  # 
 337  # These file format specific sequence iterators may be implemented as: 
 338  #    - Classes which take a handle for __init__ and provide the __iter__ method 
 339  #    - Functions that take a handle, and return an iterator object 
 340  #    - Generator functions that take a handle, and yield SeqRecord objects 
 341  # 
 342  # It is then trivial to turn this iterator into a list of SeqRecord objects, 
 343  # an in memory dictionary, or a multiple sequence alignment object. 
 344  # 
 345  # For building the dictionary by default the id property of each SeqRecord is 
 346  # used as the key.  You should always populate the id property, and it should 
 347  # be unique in most cases. For some file formats the accession number is a good 
 348  # choice.  If the file itself contains ambiguous identifiers, don't try and 
 349  # dis-ambiguate them - return them as is. 
 350  # 
 351  # When adding a new file format, please use the same lower case format name 
 352  # as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
 353  # 
 354  # See also http://biopython.org/wiki/SeqIO_dev 
 355  # 
 356  # --Peter 
 357   
 358  from Bio.File import as_handle 
 359  from Bio.SeqRecord import SeqRecord 
 360  from Bio.Align import MultipleSeqAlignment 
 361  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
 362   
 363  from . import AbiIO 
 364  from . import AceIO 
 365  from . import FastaIO 
 366  from . import IgIO  # IntelliGenetics or MASE format 
 367  from . import InsdcIO  # EMBL and GenBank 
 368  from . import PdbIO 
 369  from . import PhdIO 
 370  from . import PirIO 
 371  from . import SeqXmlIO 
 372  from . import SffIO 
 373  from . import SwissIO 
 374  from . import TabIO 
 375  from . import QualityIO  # FastQ and qual files 
 376  from . import UniprotIO 
 377   
 378   
 379  # Convention for format names is "mainname-subtype" in lower case. 
 380  # Please use the same names as BioPerl or EMBOSS where possible. 
 381  # 
 382  # Note that this simple system copes with defining 
 383  # multiple possible iterators for a given format/extension 
 384  # with the -subtype suffix 
 385  # 
 386  # Most alignment file formats will be handled via Bio.AlignIO 
 387   
 388  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
 389                       "gb": InsdcIO.GenBankIterator, 
 390                       "genbank": InsdcIO.GenBankIterator, 
 391                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
 392                       "embl": InsdcIO.EmblIterator, 
 393                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
 394                       "imgt": InsdcIO.ImgtIterator, 
 395                       "ig": IgIO.IgIterator, 
 396                       "swiss": SwissIO.SwissIterator, 
 397                       "pdb-atom": PdbIO.PdbAtomIterator, 
 398                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
 399                       "phd": PhdIO.PhdIterator, 
 400                       "ace": AceIO.AceIterator, 
 401                       "tab": TabIO.TabIterator, 
 402                       "pir": PirIO.PirIterator, 
 403                       "fastq": QualityIO.FastqPhredIterator, 
 404                       "fastq-sanger": QualityIO.FastqPhredIterator, 
 405                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
 406                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
 407                       "qual": QualityIO.QualPhredIterator, 
 408                       "sff": SffIO.SffIterator, 
 409                       # Not sure about this in the long run: 
 410                       "sff-trim": SffIO._SffTrimIterator, 
 411                       "uniprot-xml": UniprotIO.UniprotIterator, 
 412                       "seqxml": SeqXmlIO.SeqXmlIterator, 
 413                       "abi": AbiIO.AbiIterator, 
 414                       "abi-trim": AbiIO._AbiTrimIterator, 
 415                       } 
 416   
 417  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
 418                     "gb": InsdcIO.GenBankWriter, 
 419                     "genbank": InsdcIO.GenBankWriter, 
 420                     "embl": InsdcIO.EmblWriter, 
 421                     "imgt": InsdcIO.ImgtWriter, 
 422                     "tab": TabIO.TabWriter, 
 423                     "fastq": QualityIO.FastqPhredWriter, 
 424                     "fastq-sanger": QualityIO.FastqPhredWriter, 
 425                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
 426                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
 427                     "phd": PhdIO.PhdWriter, 
 428                     "qual": QualityIO.QualPhredWriter, 
 429                     "sff": SffIO.SffWriter, 
 430                     "seqxml": SeqXmlIO.SeqXmlWriter, 
 431                     } 
 432   
 433  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
 434   
 435   
436 -def write(sequences, handle, format):
437 """Write complete set of sequences to a file. 438 439 - sequences - A list (or iterator) of SeqRecord objects, or (if using 440 Biopython 1.54 or later) a single SeqRecord. 441 - handle - File handle object to write to, or filename as string 442 (note older versions of Biopython only took a handle). 443 - format - lower case string describing the file format to write. 444 445 You should close the handle after calling this function. 446 447 Returns the number of records written (as an integer). 448 """ 449 from Bio import AlignIO 450 451 # Try and give helpful error messages: 452 if not isinstance(format, basestring): 453 raise TypeError("Need a string for the file format (lower case)") 454 if not format: 455 raise ValueError("Format required (lower case string)") 456 if format != format.lower(): 457 raise ValueError("Format string '%s' should be lower case" % format) 458 459 if isinstance(handle, SeqRecord): 460 raise TypeError("Check arguments, handle should NOT be a SeqRecord") 461 if isinstance(handle, list): 462 # e.g. list of SeqRecord objects 463 raise TypeError("Check arguments, handle should NOT be a list") 464 465 if isinstance(sequences, SeqRecord): 466 # This raised an exception in order version of Biopython 467 sequences = [sequences] 468 469 if format in _BinaryFormats: 470 mode = 'wb' 471 else: 472 mode = 'w' 473 474 with as_handle(handle, mode) as fp: 475 # Map the file format to a writer class 476 if format in _FormatToWriter: 477 writer_class = _FormatToWriter[format] 478 count = writer_class(fp).write_file(sequences) 479 elif format in AlignIO._FormatToWriter: 480 # Try and turn all the records into a single alignment, 481 # and write that using Bio.AlignIO 482 alignment = MultipleSeqAlignment(sequences) 483 alignment_count = AlignIO.write([alignment], fp, format) 484 assert alignment_count == 1, \ 485 "Internal error - the underlying writer " \ 486 " should have returned 1, not %r" % alignment_count 487 count = len(alignment) 488 del alignment_count, alignment 489 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 490 raise ValueError("Reading format '%s' is supported, but not writing" 491 % format) 492 else: 493 raise ValueError("Unknown format '%s'" % format) 494 495 assert isinstance(count, int), "Internal error - the underlying %s " \ 496 "writer should have returned the record count, not %r" \ 497 % (format, count) 498 499 return count
500 501
502 -def parse(handle, format, alphabet=None):
503 r"""Turns a sequence file into an iterator returning SeqRecords. 504 505 - handle - handle to the file, or the filename as a string 506 (note older versions of Biopython only took a handle). 507 - format - lower case string describing the file format. 508 - alphabet - optional Alphabet object, useful when the sequence type 509 cannot be automatically inferred from the file itself 510 (e.g. format="fasta" or "tab") 511 512 Typical usage, opening a file to read in, and looping over the record(s): 513 514 >>> from Bio import SeqIO 515 >>> filename = "Fasta/sweetpea.nu" 516 >>> for record in SeqIO.parse(filename, "fasta"): 517 ... print("ID %s" % record.id) 518 ... print("Sequence length %i" % len(record)) 519 ... print("Sequence alphabet %s" % record.seq.alphabet) 520 ID gi|3176602|gb|U78617.1|LOU78617 521 Sequence length 309 522 Sequence alphabet SingleLetterAlphabet() 523 524 For file formats like FASTA where the alphabet cannot be determined, it 525 may be useful to specify the alphabet explicitly: 526 527 >>> from Bio import SeqIO 528 >>> from Bio.Alphabet import generic_dna 529 >>> filename = "Fasta/sweetpea.nu" 530 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 531 ... print("ID %s" % record.id) 532 ... print("Sequence length %i" % len(record)) 533 ... print("Sequence alphabet %s" % record.seq.alphabet) 534 ID gi|3176602|gb|U78617.1|LOU78617 535 Sequence length 309 536 Sequence alphabet DNAAlphabet() 537 538 If you have a string 'data' containing the file contents, you must 539 first turn this into a handle in order to parse it: 540 541 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 542 >>> from Bio import SeqIO 543 >>> try: 544 ... from StringIO import StringIO # Python 2 545 ... except ImportError: 546 ... from io import StringIO # Python 3 547 ... 548 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 549 ... print("%s %s" % (record.id, record.seq)) 550 Alpha ACCGGATGTA 551 Beta AGGCTCGGTTA 552 553 Use the Bio.SeqIO.read(...) function when you expect a single record 554 only. 555 """ 556 # NOTE - The above docstring has some raw \n characters needed 557 # for the StringIO example, hence the whole docstring is in raw 558 # string mode (see the leading r before the opening quote). 559 from Bio import AlignIO 560 561 # Hack for SFF, will need to make this more general in future 562 if format in _BinaryFormats: 563 mode = 'rb' 564 else: 565 mode = 'rU' 566 567 # Try and give helpful error messages: 568 if not isinstance(format, basestring): 569 raise TypeError("Need a string for the file format (lower case)") 570 if not format: 571 raise ValueError("Format required (lower case string)") 572 if format != format.lower(): 573 raise ValueError("Format string '%s' should be lower case" % format) 574 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 575 isinstance(alphabet, AlphabetEncoder)): 576 raise ValueError("Invalid alphabet, %r" % alphabet) 577 578 with as_handle(handle, mode) as fp: 579 # Map the file format to a sequence iterator: 580 if format in _FormatToIterator: 581 iterator_generator = _FormatToIterator[format] 582 if alphabet is None: 583 i = iterator_generator(fp) 584 else: 585 try: 586 i = iterator_generator(fp, alphabet=alphabet) 587 except TypeError: 588 i = _force_alphabet(iterator_generator(fp), alphabet) 589 elif format in AlignIO._FormatToIterator: 590 # Use Bio.AlignIO to read in the alignments 591 i = (r for alignment in AlignIO.parse(fp, format, 592 alphabet=alphabet) 593 for r in alignment) 594 else: 595 raise ValueError("Unknown format '%s'" % format) 596 # This imposes some overhead... wait until we drop Python 2.4 to fix it 597 for r in i: 598 yield r
599 600
601 -def _force_alphabet(record_iterator, alphabet):
602 """Iterate over records, over-riding the alphabet (PRIVATE).""" 603 # Assume the alphabet argument has been pre-validated 604 given_base_class = _get_base_alphabet(alphabet).__class__ 605 for record in record_iterator: 606 if isinstance(_get_base_alphabet(record.seq.alphabet), 607 given_base_class): 608 record.seq.alphabet = alphabet 609 yield record 610 else: 611 raise ValueError("Specified alphabet %r clashes with " 612 "that determined from the file, %r" 613 % (alphabet, record.seq.alphabet))
614 615
616 -def read(handle, format, alphabet=None):
617 """Turns a sequence file into a single SeqRecord. 618 619 - handle - handle to the file, or the filename as a string 620 (note older versions of Biopython only took a handle). 621 - format - string describing the file format. 622 - alphabet - optional Alphabet object, useful when the sequence type 623 cannot be automatically inferred from the file itself 624 (e.g. format="fasta" or "tab") 625 626 This function is for use parsing sequence files containing 627 exactly one record. For example, reading a GenBank file: 628 629 >>> from Bio import SeqIO 630 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 631 >>> print("ID %s" % record.id) 632 ID AC007323.5 633 >>> print("Sequence length %i" % len(record)) 634 Sequence length 86436 635 >>> print("Sequence alphabet %s" % record.seq.alphabet) 636 Sequence alphabet IUPACAmbiguousDNA() 637 638 If the handle contains no records, or more than one record, 639 an exception is raised. For example: 640 641 >>> from Bio import SeqIO 642 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 643 Traceback (most recent call last): 644 ... 645 ValueError: More than one record found in handle 646 647 If however you want the first record from a file containing 648 multiple records this function would raise an exception (as 649 shown in the example above). Instead use: 650 651 >>> from Bio import SeqIO 652 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 653 >>> print("First record's ID %s" % record.id) 654 First record's ID X55053.1 655 656 Use the Bio.SeqIO.parse(handle, format) function if you want 657 to read multiple records from the handle. 658 """ 659 iterator = parse(handle, format, alphabet) 660 try: 661 first = next(iterator) 662 except StopIteration: 663 first = None 664 if first is None: 665 raise ValueError("No records found in handle") 666 try: 667 second = next(iterator) 668 except StopIteration: 669 second = None 670 if second is not None: 671 raise ValueError("More than one record found in handle") 672 return first
673 674
675 -def to_dict(sequences, key_function=None):
676 """Turns a sequence iterator or list into a dictionary. 677 678 - sequences - An iterator that returns SeqRecord objects, 679 or simply a list of SeqRecord objects. 680 - key_function - Optional callback function which when given a 681 SeqRecord should return a unique key for the dictionary. 682 683 e.g. key_function = lambda rec : rec.name 684 or, key_function = lambda rec : rec.description.split()[0] 685 686 If key_function is omitted then record.id is used, on the assumption 687 that the records objects returned are SeqRecords with a unique id. 688 689 If there are duplicate keys, an error is raised. 690 691 Example usage, defaulting to using the record.id as key: 692 693 >>> from Bio import SeqIO 694 >>> filename = "GenBank/cor6_6.gb" 695 >>> format = "genbank" 696 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 697 >>> print(sorted(id_dict)) 698 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 699 >>> print(id_dict["L31939.1"].description) 700 Brassica rapa (clone bif72) kin mRNA, complete cds. 701 702 A more complex example, using the key_function argument in order to 703 use a sequence checksum as the dictionary key: 704 705 >>> from Bio import SeqIO 706 >>> from Bio.SeqUtils.CheckSum import seguid 707 >>> filename = "GenBank/cor6_6.gb" 708 >>> format = "genbank" 709 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 710 ... key_function = lambda rec : seguid(rec.seq)) 711 >>> for key, record in sorted(seguid_dict.items()): 712 ... print("%s %s" % (key, record.id)) 713 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 714 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 715 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 716 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 717 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 718 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 719 720 This approach is not suitable for very large sets of sequences, as all 721 the SeqRecord objects are held in memory. Instead, consider using the 722 Bio.SeqIO.index() function (if it supports your particular file format). 723 """ 724 if key_function is None: 725 key_function = lambda rec: rec.id 726 727 d = dict() 728 for record in sequences: 729 key = key_function(record) 730 if key in d: 731 raise ValueError("Duplicate key '%s'" % key) 732 d[key] = record 733 return d
734 735
736 -def index(filename, format, alphabet=None, key_function=None):
737 """Indexes a sequence file and returns a dictionary like object. 738 739 - filename - string giving name of file to be indexed 740 - format - lower case string describing the file format 741 - alphabet - optional Alphabet object, useful when the sequence type 742 cannot be automatically inferred from the file itself 743 (e.g. format="fasta" or "tab") 744 - key_function - Optional callback function which when given a 745 SeqRecord identifier string should return a unique 746 key for the dictionary. 747 748 This indexing function will return a dictionary like object, giving the 749 SeqRecord objects as values: 750 751 >>> from Bio import SeqIO 752 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 753 >>> len(records) 754 3 755 >>> sorted(records) 756 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 757 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 758 >EAS54_6_R1_2_1_540_792 759 TTGGCAGGCCAAGGCCGATGGATCA 760 <BLANKLINE> 761 >>> "EAS54_6_R1_2_1_540_792" in records 762 True 763 >>> print(records.get("Missing", None)) 764 None 765 >>> records.close() 766 767 If the file is BGZF compressed, this is detected automatically. Ordinary 768 GZIP files are not supported: 769 770 >>> from Bio import SeqIO 771 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 772 >>> len(records) 773 3 774 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 775 TTGGCAGGCCAAGGCCGATGGATCA 776 >>> records.close() 777 778 Note that this pseudo dictionary will not support all the methods of a 779 true Python dictionary, for example values() is not defined since this 780 would require loading all of the records into memory at once. 781 782 When you call the index function, it will scan through the file, noting 783 the location of each record. When you access a particular record via the 784 dictionary methods, the code will jump to the appropriate part of the 785 file and then parse that section into a SeqRecord. 786 787 Note that not all the input formats supported by Bio.SeqIO can be used 788 with this index function. It is designed to work only with sequential 789 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 790 interlaced file format (e.g. alignment formats such as "clustal"). 791 792 For small files, it may be more efficient to use an in memory Python 793 dictionary, e.g. 794 795 >>> from Bio import SeqIO 796 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq")) 797 >>> len(records) 798 3 799 >>> sorted(records) 800 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 801 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 802 >EAS54_6_R1_2_1_540_792 803 TTGGCAGGCCAAGGCCGATGGATCA 804 <BLANKLINE> 805 806 As with the to_dict() function, by default the id string of each record 807 is used as the key. You can specify a callback function to transform 808 this (the record identifier string) into your preferred key. For example: 809 810 >>> from Bio import SeqIO 811 >>> def make_tuple(identifier): 812 ... parts = identifier.split("_") 813 ... return int(parts[-2]), int(parts[-1]) 814 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 815 ... key_function=make_tuple) 816 >>> len(records) 817 3 818 >>> sorted(records) 819 [(413, 324), (443, 348), (540, 792)] 820 >>> print(records[(540, 792)].format("fasta")) 821 >EAS54_6_R1_2_1_540_792 822 TTGGCAGGCCAAGGCCGATGGATCA 823 <BLANKLINE> 824 >>> (540, 792) in records 825 True 826 >>> "EAS54_6_R1_2_1_540_792" in records 827 False 828 >>> print(records.get("Missing", None)) 829 None 830 >>> records.close() 831 832 Another common use case would be indexing an NCBI style FASTA file, 833 where you might want to extract the GI number from the FASTA identifier 834 to use as the dictionary key. 835 836 Notice that unlike the to_dict() function, here the key_function does 837 not get given the full SeqRecord to use to generate the key. Doing so 838 would impose a severe performance penalty as it would require the file 839 to be completely parsed while building the index. Right now this is 840 usually avoided. 841 842 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 843 """ 844 # Try and give helpful error messages: 845 if not isinstance(filename, basestring): 846 raise TypeError("Need a filename (not a handle)") 847 if not isinstance(format, basestring): 848 raise TypeError("Need a string for the file format (lower case)") 849 if not format: 850 raise ValueError("Format required (lower case string)") 851 if format != format.lower(): 852 raise ValueError("Format string '%s' should be lower case" % format) 853 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 854 isinstance(alphabet, AlphabetEncoder)): 855 raise ValueError("Invalid alphabet, %r" % alphabet) 856 857 # Map the file format to a sequence iterator: 858 from ._index import _FormatToRandomAccess # Lazy import 859 from Bio.File import _IndexedSeqFileDict 860 try: 861 proxy_class = _FormatToRandomAccess[format] 862 except KeyError: 863 raise ValueError("Unsupported format %r" % format) 864 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 865 % (filename, format, alphabet, key_function) 866 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 867 key_function, repr, "SeqRecord")
868 869
870 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 871 key_function=None):
872 """Index several sequence files and return a dictionary like object. 873 874 The index is stored in an SQLite database rather than in memory (as in the 875 Bio.SeqIO.index(...) function). 876 877 - index_filename - Where to store the SQLite index 878 - filenames - list of strings specifying file(s) to be indexed, or when 879 indexing a single file this can be given as a string. 880 (optional if reloading an existing index, but must match) 881 - format - lower case string describing the file format 882 (optional if reloading an existing index, but must match) 883 - alphabet - optional Alphabet object, useful when the sequence type 884 cannot be automatically inferred from the file itself 885 (e.g. format="fasta" or "tab") 886 - key_function - Optional callback function which when given a 887 SeqRecord identifier string should return a unique 888 key for the dictionary. 889 890 This indexing function will return a dictionary like object, giving the 891 SeqRecord objects as values: 892 893 >>> from Bio.Alphabet import generic_protein 894 >>> from Bio import SeqIO 895 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 896 >>> def get_gi(name): 897 ... parts = name.split("|") 898 ... i = parts.index("gi") 899 ... assert i != -1 900 ... return parts[i+1] 901 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 902 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 903 >>> len(records) 904 95 905 >>> records["7525076"].description 906 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 907 >>> records["45478717"].description 908 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 909 >>> records.close() 910 911 In this example the two files contain 85 and 10 records respectively. 912 913 BGZF compressed files are supported, and detected automatically. Ordinary 914 GZIP compressed files are not supported. 915 916 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module 917 glob which is useful for building lists of files. 918 """ 919 # Try and give helpful error messages: 920 if not isinstance(index_filename, basestring): 921 raise TypeError("Need a string for the index filename") 922 if isinstance(filenames, basestring): 923 # Make the API a little more friendly, and more similar 924 # to Bio.SeqIO.index(...) for indexing just one file. 925 filenames = [filenames] 926 if filenames is not None and not isinstance(filenames, list): 927 raise TypeError( 928 "Need a list of filenames (as strings), or one filename") 929 if format is not None and not isinstance(format, basestring): 930 raise TypeError("Need a string for the file format (lower case)") 931 if format and format != format.lower(): 932 raise ValueError("Format string '%s' should be lower case" % format) 933 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 934 isinstance(alphabet, AlphabetEncoder)): 935 raise ValueError("Invalid alphabet, %r" % alphabet) 936 937 # Map the file format to a sequence iterator: 938 from ._index import _FormatToRandomAccess # Lazy import 939 from Bio.File import _SQLiteManySeqFilesDict 940 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 941 % (index_filename, filenames, format, alphabet, key_function) 942 943 def proxy_factory(format, filename=None): 944 """Given a filename returns proxy object, else boolean if format OK.""" 945 if filename: 946 return _FormatToRandomAccess[format](filename, format, alphabet) 947 else: 948 return format in _FormatToRandomAccess
949 950 return _SQLiteManySeqFilesDict(index_filename, filenames, 951 proxy_factory, format, 952 key_function, repr) 953 954
955 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
956 """Convert between two sequence file formats, return number of records. 957 958 - in_file - an input handle or filename 959 - in_format - input file format, lower case string 960 - out_file - an output handle or filename 961 - out_format - output file format, lower case string 962 - alphabet - optional alphabet to assume 963 964 **NOTE** - If you provide an output filename, it will be opened which will 965 overwrite any existing file without warning. This may happen if even 966 the conversion is aborted (e.g. an invalid out_format name is given). 967 968 For example, going from a filename to a handle: 969 970 >>> from Bio import SeqIO 971 >>> try: 972 ... from StringIO import StringIO # Python 2 973 ... except ImportError: 974 ... from io import StringIO # Python 3 975 ... 976 >>> handle = StringIO("") 977 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 978 3 979 >>> print(handle.getvalue()) 980 >EAS54_6_R1_2_1_413_324 981 CCCTTCTTGTCTTCAGCGTTTCTCC 982 >EAS54_6_R1_2_1_540_792 983 TTGGCAGGCCAAGGCCGATGGATCA 984 >EAS54_6_R1_2_1_443_348 985 GTTGCTTCTGGCGTGGGTGGGGGGG 986 <BLANKLINE> 987 """ 988 # Hack for SFF, will need to make this more general in future 989 if in_format in _BinaryFormats: 990 in_mode = 'rb' 991 else: 992 in_mode = 'rU' 993 994 # Don't open the output file until we've checked the input is OK? 995 if out_format in ["sff", "sff_trim"]: 996 out_mode = 'wb' 997 else: 998 out_mode = 'w' 999 1000 # This will check the arguments and issue error messages, 1001 # after we have opened the file which is a shame. 1002 from ._convert import _handle_convert # Lazy import 1003 with as_handle(in_file, in_mode) as in_handle: 1004 with as_handle(out_file, out_mode) as out_handle: 1005 count = _handle_convert(in_handle, in_format, 1006 out_handle, out_format, 1007 alphabet) 1008 return count
1009 1010 1011 # This helpful trick for testing no longer works with the 1012 # local imports :( 1013 # 1014 # if __name__ == "__main__": 1015 # from Bio._utils import run_doctest 1016 # run_doctest() 1017