Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

   1  # Copyright 2006-2016 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # Nice link: 
   7  # http://www.ebi.ac.uk/help/formats_frame.html 
   8   
   9  r"""Sequence input/output as SeqRecord objects. 
  10   
  11  Bio.SeqIO is also documented at SeqIO_ and by 
  12  a whole chapter in our tutorial: 
  13   
  14    - `HTML Tutorial`_ 
  15    - `PDF Tutorial`_ 
  16   
  17  .. _SeqIO: http://biopython.org/wiki/SeqIO 
  18  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
  19  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
  20   
  21  Input 
  22  ----- 
  23  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
  24  (or in recent versions of Biopython alternatively a filename as a string), 
  25  and format string.  This returns an iterator giving SeqRecord objects: 
  26   
  27  >>> from Bio import SeqIO 
  28  >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
  29  ...     print("%s %i" % (record.id, len(record))) 
  30  gi|1348912|gb|G26680|G26680 633 
  31  gi|1348917|gb|G26685|G26685 413 
  32  gi|1592936|gb|G29385|G29385 471 
  33   
  34  Note that the parse() function will invoke the relevant parser for the 
  35  format with its default settings.  You may want more control, in which case 
  36  you need to create a format specific sequence iterator directly. 
  37   
  38  Input - Single Records 
  39  ---------------------- 
  40  If you expect your file to contain one-and-only-one record, then we provide 
  41  the following 'helper' function which will return a single SeqRecord, or 
  42  raise an exception if there are no records or more than one record: 
  43   
  44  >>> from Bio import SeqIO 
  45  >>> record = SeqIO.read("Fasta/f001", "fasta") 
  46  >>> print("%s %i" % (record.id, len(record))) 
  47  gi|3318709|pdb|1A91| 79 
  48   
  49  This style is useful when you expect a single record only (and would 
  50  consider multiple records an error).  For example, when dealing with GenBank 
  51  files for bacterial genomes or chromosomes, there is normally only a single 
  52  record.  Alternatively, use this with a handle when downloading a single 
  53  record from the internet. 
  54   
  55  However, if you just want the first record from a file containing multiple 
  56  record, use the next() function on the iterator (or under Python 2, the 
  57  iterator's next() method): 
  58   
  59  >>> from Bio import SeqIO 
  60  >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
  61  >>> print("%s %i" % (record.id, len(record))) 
  62  gi|1348912|gb|G26680|G26680 633 
  63   
  64  The above code will work as long as the file contains at least one record. 
  65  Note that if there is more than one record, the remaining records will be 
  66  silently ignored. 
  67   
  68   
  69  Input - Multiple Records 
  70  ------------------------ 
  71  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
  72  using a sequence iterator can save you a lot of memory (RAM).  There is 
  73  less benefit for interlaced file formats (e.g. most multiple alignment file 
  74  formats).  However, an iterator only lets you access the records one by one. 
  75   
  76  If you want random access to the records by number, turn this into a list: 
  77   
  78  >>> from Bio import SeqIO 
  79  >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
  80  >>> len(records) 
  81  3 
  82  >>> print(records[1].id) 
  83  gi|1348917|gb|G26685|G26685 
  84   
  85  If you want random access to the records by a key such as the record id, 
  86  turn the iterator into a dictionary: 
  87   
  88  >>> from Bio import SeqIO 
  89  >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
  90  >>> len(record_dict) 
  91  3 
  92  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
  93  413 
  94   
  95  However, using list() or the to_dict() function will load all the records 
  96  into memory at once, and therefore is not possible on very large files. 
  97  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
  98  providing dictionary like access to any record. For example, 
  99   
 100  >>> from Bio import SeqIO 
 101  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 102  >>> len(record_dict) 
 103  3 
 104  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 105  413 
 106  >>> record_dict.close() 
 107   
 108  Many but not all of the supported input file formats can be indexed like 
 109  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
 110  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
 111   
 112  In most cases you can also use SeqIO.index to get the record from the file 
 113  as a raw string (not a SeqRecord). This can be useful for example to extract 
 114  a sub-set of records from a file where SeqIO cannot output the file format 
 115  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
 116  keep the output 100% identical to the input). For example, 
 117   
 118  >>> from Bio import SeqIO 
 119  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 120  >>> len(record_dict) 
 121  3 
 122  >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
 123  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 124  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
 125  ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
 126  GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
 127  GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
 128  TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
 129  AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 130  <BLANKLINE> 
 131  >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
 132  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 133  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
 134  TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
 135  TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
 136  ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
 137  AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
 138  TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
 139  CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 140  <BLANKLINE> 
 141  >>> record_dict.close() 
 142   
 143  Here the original file and what Biopython would output differ in the line 
 144  wrapping. Also note that under Python 3, the get_raw method will return a 
 145  bytes string, hence the use of decode to turn it into a (unicode) string. 
 146  This is unnecessary on Python 2. 
 147   
 148  Also note that the get_raw method will preserve the newline endings. This 
 149  example FASTQ file uses Unix style endings (b"\n" only), 
 150   
 151  >>> from Bio import SeqIO 
 152  >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
 153  >>> len(fastq_dict) 
 154  3 
 155  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 156  >>> raw.count(b"\n") 
 157  4 
 158  >>> raw.count(b"\r\n") 
 159  0 
 160  >>> b"\r" in raw 
 161  False 
 162  >>> len(raw) 
 163  78 
 164  >>> fastq_dict.close() 
 165   
 166  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
 167   
 168  >>> from Bio import SeqIO 
 169  >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
 170  >>> len(fastq_dict) 
 171  3 
 172  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 173  >>> raw.count(b"\n") 
 174  4 
 175  >>> raw.count(b"\r\n") 
 176  4 
 177  >>> b"\r\n" in raw 
 178  True 
 179  >>> len(raw) 
 180  82 
 181  >>> fastq_dict.close() 
 182   
 183  Because this uses two bytes for each new line, the file is longer than 
 184  the Unix equivalent with only one byte. 
 185   
 186   
 187  Input - Alignments 
 188  ------------------ 
 189  You can read in alignment files as alignment objects using Bio.AlignIO. 
 190  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
 191  you a SeqRecord for each row of each alignment: 
 192   
 193  >>> from Bio import SeqIO 
 194  >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
 195  ...     print("%s %i" % (record.id, len(record))) 
 196  gi|167877390|gb|EDS40773.1| 447 
 197  gi|167234445|ref|NP_001107837. 447 
 198  gi|74100009|gb|AAZ99217.1| 447 
 199  gi|13990994|dbj|BAA33523.2| 447 
 200  gi|56122354|gb|AAV74328.1| 447 
 201   
 202   
 203  Output 
 204  ------ 
 205  Use the function Bio.SeqIO.write(...), which takes a complete set of 
 206  SeqRecord objects (either as a list, or an iterator), an output file handle 
 207  (or in recent versions of Biopython an output filename as a string) and of 
 208  course the file format:: 
 209   
 210    from Bio import SeqIO 
 211    records = ... 
 212    SeqIO.write(records, "example.faa", "fasta") 
 213   
 214  Or, using a handle:: 
 215   
 216      from Bio import SeqIO 
 217      records = ... 
 218      with open("example.faa", "w") as handle: 
 219        SeqIO.write(records, handle, "fasta") 
 220   
 221  You are expected to call this function once (with all your records) and if 
 222  using a handle, make sure you close it to flush the data to the hard disk. 
 223   
 224   
 225  Output - Advanced 
 226  ----------------- 
 227  The effect of calling write() multiple times on a single file will vary 
 228  depending on the file format, and is best avoided unless you have a strong 
 229  reason to do so. 
 230   
 231  If you give a filename, then each time you call write() the existing file 
 232  will be overwritten. For sequential files formats (e.g. fasta, genbank) each 
 233  "record block" holds a single sequence.  For these files it would probably 
 234  be safe to call write() multiple times by re-using the same handle. 
 235   
 236  However, trying this for certain alignment formats (e.g. phylip, clustal, 
 237  stockholm) would have the effect of concatenating several multiple sequence 
 238  alignments together.  Such files are created by the PHYLIP suite of programs 
 239  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
 240   
 241  Worse, many fileformats have an explicit header and/or footer structure 
 242  (e.g. any XMl format, and most binary file formats like SFF). Here making 
 243  multiple calls to write() will result in an invalid file. 
 244   
 245   
 246  Conversion 
 247  ---------- 
 248  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
 249  file format conversions. Additionally, it may use file format specific 
 250  optimisations so this should be the fastest way too. 
 251   
 252  In general however, you can combine the Bio.SeqIO.parse(...) function with 
 253  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
 254  generator expressions or generator functions provides a memory efficient way 
 255  to perform filtering or other extra operations as part of the process. 
 256   
 257   
 258  File Formats 
 259  ------------ 
 260  When specifying the file format, use lowercase strings.  The same format 
 261  names are also used in Bio.AlignIO and include the following: 
 262   
 263      - abif    - Applied Biosystem's sequencing trace format 
 264      - ace     - Reads the contig sequences from an ACE assembly file. 
 265      - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
 266      - fasta   - The generic sequence file format where each record starts with 
 267        an identifer line starting with a ">" character, followed by 
 268        lines of sequence. 
 269      - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
 270        sequence quality values (with an ASCII offset of 33). 
 271      - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
 272      - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
 273        encodes Solexa quality scores (not PHRED quality scores) with an 
 274        ASCII offset of 64. 
 275      - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
 276        which encodes PHRED quality scores with an ASCII offset of 64 
 277        (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
 278        will produce FASTQ files using the standard Sanger encoding. 
 279      - genbank - The GenBank or GenPept flat file format. 
 280      - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
 281      - ig      - The IntelliGenetics file format, apparently the same as the 
 282        MASE alignment format. 
 283      - imgt    - An EMBL like format from IMGT where the feature tables are more 
 284        indented to allow for longer feature types. 
 285      - pdb-seqres -  Reads a Protein Data Bank (PDB) file to determine the 
 286        complete protein sequence as it appears in the header (no dependencies). 
 287      - pdb-atom - Uses Bio.PDB to determine the (partial) protein sequence as 
 288        it appears in the structure based on the atom coordinate section of the 
 289        file (requires NumPy for Bio.PDB). 
 290      - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
 291      - pir     - A "FASTA like" format introduced by the National Biomedical 
 292        Research Foundation (NBRF) for the Protein Information Resource 
 293        (PIR) database, now part of UniProt. 
 294      - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
 295      - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
 296      - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
 297      - swiss   - Plain text Swiss-Prot aka UniProt format. 
 298      - tab     - Simple two column tab separated sequence files, where each 
 299        line holds a record's identifier and sequence. For example, 
 300        this is used as by Aligent's eArray software when saving 
 301        microarray probes in a minimal tab delimited text file. 
 302      - qual    - A "FASTA like" format holding PHRED quality values from 
 303        sequencing DNA, but no actual sequences (usually provided 
 304        in separate FASTA files). 
 305      - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
 306        text format which we call "swiss") 
 307   
 308  Note that while Bio.SeqIO can read all the above file formats, it cannot 
 309  write to all of them. 
 310   
 311  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
 312  "phylip" and "stockholm", which gives you access to the individual sequences 
 313  making up each alignment as SeqRecords. 
 314  """ 
 315   
 316  from __future__ import print_function 
 317  from Bio._py3k import basestring 
 318   
 319  # TODO 
 320  # - define policy on reading aligned sequences with gaps in 
 321  #   (e.g. - and . characters) including how the alphabet interacts 
 322  # 
 323  # - How best to handle unique/non unique record.id when writing. 
 324  #   For most file formats reading such files is fine; The stockholm 
 325  #   parser would fail. 
 326  # 
 327  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
 328  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
 329  # 
 330  # FAO BioPython Developers 
 331  # ------------------------ 
 332  # The way I envision this SeqIO system working as that for any sequence file 
 333  # format we have an iterator that returns SeqRecord objects. 
 334  # 
 335  # This also applies to interlaced file formats (like clustal - although that 
 336  # is now handled via Bio.AlignIO instead) where the file cannot be read record 
 337  # by record.  You should still return an iterator, even if the implementation 
 338  # could just as easily return a list. 
 339  # 
 340  # These file format specific sequence iterators may be implemented as: 
 341  #    - Classes which take a handle for __init__ and provide the __iter__ method 
 342  #    - Functions that take a handle, and return an iterator object 
 343  #    - Generator functions that take a handle, and yield SeqRecord objects 
 344  # 
 345  # It is then trivial to turn this iterator into a list of SeqRecord objects, 
 346  # an in memory dictionary, or a multiple sequence alignment object. 
 347  # 
 348  # For building the dictionary by default the id property of each SeqRecord is 
 349  # used as the key.  You should always populate the id property, and it should 
 350  # be unique in most cases. For some file formats the accession number is a good 
 351  # choice.  If the file itself contains ambiguous identifiers, don't try and 
 352  # dis-ambiguate them - return them as is. 
 353  # 
 354  # When adding a new file format, please use the same lower case format name 
 355  # as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
 356  # 
 357  # See also http://biopython.org/wiki/SeqIO_dev 
 358  # 
 359  # --Peter 
 360   
 361  from Bio.File import as_handle 
 362  from Bio.SeqRecord import SeqRecord 
 363  from Bio.Align import MultipleSeqAlignment 
 364  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
 365   
 366  from . import AbiIO 
 367  from . import AceIO 
 368  from . import FastaIO 
 369  from . import IgIO  # IntelliGenetics or MASE format 
 370  from . import InsdcIO  # EMBL and GenBank 
 371  from . import PdbIO 
 372  from . import PhdIO 
 373  from . import PirIO 
 374  from . import SeqXmlIO 
 375  from . import SffIO 
 376  from . import SwissIO 
 377  from . import TabIO 
 378  from . import QualityIO  # FastQ and qual files 
 379  from . import UniprotIO 
 380   
 381   
 382  # Convention for format names is "mainname-subtype" in lower case. 
 383  # Please use the same names as BioPerl or EMBOSS where possible. 
 384  # 
 385  # Note that this simple system copes with defining 
 386  # multiple possible iterators for a given format/extension 
 387  # with the -subtype suffix 
 388  # 
 389  # Most alignment file formats will be handled via Bio.AlignIO 
 390   
 391  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
 392                       "gb": InsdcIO.GenBankIterator, 
 393                       "genbank": InsdcIO.GenBankIterator, 
 394                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
 395                       "embl": InsdcIO.EmblIterator, 
 396                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
 397                       "imgt": InsdcIO.ImgtIterator, 
 398                       "ig": IgIO.IgIterator, 
 399                       "swiss": SwissIO.SwissIterator, 
 400                       "pdb-atom": PdbIO.PdbAtomIterator, 
 401                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
 402                       "phd": PhdIO.PhdIterator, 
 403                       "ace": AceIO.AceIterator, 
 404                       "tab": TabIO.TabIterator, 
 405                       "pir": PirIO.PirIterator, 
 406                       "fastq": QualityIO.FastqPhredIterator, 
 407                       "fastq-sanger": QualityIO.FastqPhredIterator, 
 408                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
 409                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
 410                       "qual": QualityIO.QualPhredIterator, 
 411                       "sff": SffIO.SffIterator, 
 412                       # Not sure about this in the long run: 
 413                       "sff-trim": SffIO._SffTrimIterator, 
 414                       "uniprot-xml": UniprotIO.UniprotIterator, 
 415                       "seqxml": SeqXmlIO.SeqXmlIterator, 
 416                       "abi": AbiIO.AbiIterator, 
 417                       "abi-trim": AbiIO._AbiTrimIterator, 
 418                       } 
 419   
 420  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
 421                     "gb": InsdcIO.GenBankWriter, 
 422                     "genbank": InsdcIO.GenBankWriter, 
 423                     "embl": InsdcIO.EmblWriter, 
 424                     "imgt": InsdcIO.ImgtWriter, 
 425                     "tab": TabIO.TabWriter, 
 426                     "fastq": QualityIO.FastqPhredWriter, 
 427                     "fastq-sanger": QualityIO.FastqPhredWriter, 
 428                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
 429                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
 430                     "phd": PhdIO.PhdWriter, 
 431                     "qual": QualityIO.QualPhredWriter, 
 432                     "sff": SffIO.SffWriter, 
 433                     "seqxml": SeqXmlIO.SeqXmlWriter, 
 434                     } 
 435   
 436  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
 437   
 438   
439 -def write(sequences, handle, format):
440 """Write complete set of sequences to a file. 441 442 - sequences - A list (or iterator) of SeqRecord objects, or (if using 443 Biopython 1.54 or later) a single SeqRecord. 444 - handle - File handle object to write to, or filename as string 445 (note older versions of Biopython only took a handle). 446 - format - lower case string describing the file format to write. 447 448 You should close the handle after calling this function. 449 450 Returns the number of records written (as an integer). 451 """ 452 from Bio import AlignIO 453 454 # Try and give helpful error messages: 455 if not isinstance(format, basestring): 456 raise TypeError("Need a string for the file format (lower case)") 457 if not format: 458 raise ValueError("Format required (lower case string)") 459 if format != format.lower(): 460 raise ValueError("Format string '%s' should be lower case" % format) 461 462 if isinstance(handle, SeqRecord): 463 raise TypeError("Check arguments, handle should NOT be a SeqRecord") 464 if isinstance(handle, list): 465 # e.g. list of SeqRecord objects 466 raise TypeError("Check arguments, handle should NOT be a list") 467 468 if isinstance(sequences, SeqRecord): 469 # This raised an exception in older versions of Biopython 470 sequences = [sequences] 471 472 if format in _BinaryFormats: 473 mode = 'wb' 474 else: 475 mode = 'w' 476 477 with as_handle(handle, mode) as fp: 478 # Map the file format to a writer class 479 if format in _FormatToWriter: 480 writer_class = _FormatToWriter[format] 481 count = writer_class(fp).write_file(sequences) 482 elif format in AlignIO._FormatToWriter: 483 # Try and turn all the records into a single alignment, 484 # and write that using Bio.AlignIO 485 alignment = MultipleSeqAlignment(sequences) 486 alignment_count = AlignIO.write([alignment], fp, format) 487 assert alignment_count == 1, \ 488 "Internal error - the underlying writer " \ 489 " should have returned 1, not %r" % alignment_count 490 count = len(alignment) 491 del alignment_count, alignment 492 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 493 raise ValueError("Reading format '%s' is supported, but not writing" 494 % format) 495 else: 496 raise ValueError("Unknown format '%s'" % format) 497 498 assert isinstance(count, int), "Internal error - the underlying %s " \ 499 "writer should have returned the record count, not %r" \ 500 % (format, count) 501 502 return count
503 504
505 -def parse(handle, format, alphabet=None):
506 r"""Turns a sequence file into an iterator returning SeqRecords. 507 508 - handle - handle to the file, or the filename as a string 509 (note older versions of Biopython only took a handle). 510 - format - lower case string describing the file format. 511 - alphabet - optional Alphabet object, useful when the sequence type 512 cannot be automatically inferred from the file itself 513 (e.g. format="fasta" or "tab") 514 515 Typical usage, opening a file to read in, and looping over the record(s): 516 517 >>> from Bio import SeqIO 518 >>> filename = "Fasta/sweetpea.nu" 519 >>> for record in SeqIO.parse(filename, "fasta"): 520 ... print("ID %s" % record.id) 521 ... print("Sequence length %i" % len(record)) 522 ... print("Sequence alphabet %s" % record.seq.alphabet) 523 ID gi|3176602|gb|U78617.1|LOU78617 524 Sequence length 309 525 Sequence alphabet SingleLetterAlphabet() 526 527 For file formats like FASTA where the alphabet cannot be determined, it 528 may be useful to specify the alphabet explicitly: 529 530 >>> from Bio import SeqIO 531 >>> from Bio.Alphabet import generic_dna 532 >>> filename = "Fasta/sweetpea.nu" 533 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 534 ... print("ID %s" % record.id) 535 ... print("Sequence length %i" % len(record)) 536 ... print("Sequence alphabet %s" % record.seq.alphabet) 537 ID gi|3176602|gb|U78617.1|LOU78617 538 Sequence length 309 539 Sequence alphabet DNAAlphabet() 540 541 If you have a string 'data' containing the file contents, you must 542 first turn this into a handle in order to parse it: 543 544 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 545 >>> from Bio import SeqIO 546 >>> try: 547 ... from StringIO import StringIO # Python 2 548 ... except ImportError: 549 ... from io import StringIO # Python 3 550 ... 551 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 552 ... print("%s %s" % (record.id, record.seq)) 553 Alpha ACCGGATGTA 554 Beta AGGCTCGGTTA 555 556 Use the Bio.SeqIO.read(...) function when you expect a single record 557 only. 558 """ 559 # NOTE - The above docstring has some raw \n characters needed 560 # for the StringIO example, hence the whole docstring is in raw 561 # string mode (see the leading r before the opening quote). 562 from Bio import AlignIO 563 564 # Hack for SFF, will need to make this more general in future 565 if format in _BinaryFormats: 566 mode = 'rb' 567 else: 568 mode = 'rU' 569 570 # Try and give helpful error messages: 571 if not isinstance(format, basestring): 572 raise TypeError("Need a string for the file format (lower case)") 573 if not format: 574 raise ValueError("Format required (lower case string)") 575 if format != format.lower(): 576 raise ValueError("Format string '%s' should be lower case" % format) 577 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 578 isinstance(alphabet, AlphabetEncoder)): 579 raise ValueError("Invalid alphabet, %r" % alphabet) 580 581 with as_handle(handle, mode) as fp: 582 # Map the file format to a sequence iterator: 583 if format in _FormatToIterator: 584 iterator_generator = _FormatToIterator[format] 585 if alphabet is None: 586 i = iterator_generator(fp) 587 else: 588 try: 589 i = iterator_generator(fp, alphabet=alphabet) 590 except TypeError: 591 i = _force_alphabet(iterator_generator(fp), alphabet) 592 elif format in AlignIO._FormatToIterator: 593 # Use Bio.AlignIO to read in the alignments 594 i = (r for alignment in AlignIO.parse(fp, format, 595 alphabet=alphabet) 596 for r in alignment) 597 else: 598 raise ValueError("Unknown format '%s'" % format) 599 # This imposes some overhead... wait until we drop Python 2.4 to fix it 600 for r in i: 601 yield r
602 603
604 -def _force_alphabet(record_iterator, alphabet):
605 """Iterate over records, over-riding the alphabet (PRIVATE).""" 606 # Assume the alphabet argument has been pre-validated 607 given_base_class = _get_base_alphabet(alphabet).__class__ 608 for record in record_iterator: 609 if isinstance(_get_base_alphabet(record.seq.alphabet), 610 given_base_class): 611 record.seq.alphabet = alphabet 612 yield record 613 else: 614 raise ValueError("Specified alphabet %r clashes with " 615 "that determined from the file, %r" 616 % (alphabet, record.seq.alphabet))
617 618
619 -def read(handle, format, alphabet=None):
620 """Turns a sequence file into a single SeqRecord. 621 622 - handle - handle to the file, or the filename as a string 623 (note older versions of Biopython only took a handle). 624 - format - string describing the file format. 625 - alphabet - optional Alphabet object, useful when the sequence type 626 cannot be automatically inferred from the file itself 627 (e.g. format="fasta" or "tab") 628 629 This function is for use parsing sequence files containing 630 exactly one record. For example, reading a GenBank file: 631 632 >>> from Bio import SeqIO 633 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 634 >>> print("ID %s" % record.id) 635 ID AC007323.5 636 >>> print("Sequence length %i" % len(record)) 637 Sequence length 86436 638 >>> print("Sequence alphabet %s" % record.seq.alphabet) 639 Sequence alphabet IUPACAmbiguousDNA() 640 641 If the handle contains no records, or more than one record, 642 an exception is raised. For example: 643 644 >>> from Bio import SeqIO 645 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 646 Traceback (most recent call last): 647 ... 648 ValueError: More than one record found in handle 649 650 If however you want the first record from a file containing 651 multiple records this function would raise an exception (as 652 shown in the example above). Instead use: 653 654 >>> from Bio import SeqIO 655 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 656 >>> print("First record's ID %s" % record.id) 657 First record's ID X55053.1 658 659 Use the Bio.SeqIO.parse(handle, format) function if you want 660 to read multiple records from the handle. 661 """ 662 iterator = parse(handle, format, alphabet) 663 try: 664 first = next(iterator) 665 except StopIteration: 666 first = None 667 if first is None: 668 raise ValueError("No records found in handle") 669 try: 670 second = next(iterator) 671 except StopIteration: 672 second = None 673 if second is not None: 674 raise ValueError("More than one record found in handle") 675 return first
676 677
678 -def to_dict(sequences, key_function=None):
679 """Turns a sequence iterator or list into a dictionary. 680 681 - sequences - An iterator that returns SeqRecord objects, 682 or simply a list of SeqRecord objects. 683 - key_function - Optional callback function which when given a 684 SeqRecord should return a unique key for the dictionary. 685 686 e.g. key_function = lambda rec : rec.name 687 or, key_function = lambda rec : rec.description.split()[0] 688 689 If key_function is omitted then record.id is used, on the assumption 690 that the records objects returned are SeqRecords with a unique id. 691 692 If there are duplicate keys, an error is raised. 693 694 Example usage, defaulting to using the record.id as key: 695 696 >>> from Bio import SeqIO 697 >>> filename = "GenBank/cor6_6.gb" 698 >>> format = "genbank" 699 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 700 >>> print(sorted(id_dict)) 701 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 702 >>> print(id_dict["L31939.1"].description) 703 Brassica rapa (clone bif72) kin mRNA, complete cds. 704 705 A more complex example, using the key_function argument in order to 706 use a sequence checksum as the dictionary key: 707 708 >>> from Bio import SeqIO 709 >>> from Bio.SeqUtils.CheckSum import seguid 710 >>> filename = "GenBank/cor6_6.gb" 711 >>> format = "genbank" 712 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 713 ... key_function = lambda rec : seguid(rec.seq)) 714 >>> for key, record in sorted(seguid_dict.items()): 715 ... print("%s %s" % (key, record.id)) 716 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 717 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 718 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 719 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 720 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 721 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 722 723 This approach is not suitable for very large sets of sequences, as all 724 the SeqRecord objects are held in memory. Instead, consider using the 725 Bio.SeqIO.index() function (if it supports your particular file format). 726 """ 727 if key_function is None: 728 key_function = lambda rec: rec.id 729 730 d = dict() 731 for record in sequences: 732 key = key_function(record) 733 if key in d: 734 raise ValueError("Duplicate key '%s'" % key) 735 d[key] = record 736 return d
737 738
739 -def index(filename, format, alphabet=None, key_function=None):
740 """Indexes a sequence file and returns a dictionary like object. 741 742 - filename - string giving name of file to be indexed 743 - format - lower case string describing the file format 744 - alphabet - optional Alphabet object, useful when the sequence type 745 cannot be automatically inferred from the file itself 746 (e.g. format="fasta" or "tab") 747 - key_function - Optional callback function which when given a 748 SeqRecord identifier string should return a unique 749 key for the dictionary. 750 751 This indexing function will return a dictionary like object, giving the 752 SeqRecord objects as values: 753 754 >>> from Bio import SeqIO 755 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 756 >>> len(records) 757 3 758 >>> sorted(records) 759 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 760 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 761 >EAS54_6_R1_2_1_540_792 762 TTGGCAGGCCAAGGCCGATGGATCA 763 <BLANKLINE> 764 >>> "EAS54_6_R1_2_1_540_792" in records 765 True 766 >>> print(records.get("Missing", None)) 767 None 768 >>> records.close() 769 770 If the file is BGZF compressed, this is detected automatically. Ordinary 771 GZIP files are not supported: 772 773 >>> from Bio import SeqIO 774 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 775 >>> len(records) 776 3 777 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 778 TTGGCAGGCCAAGGCCGATGGATCA 779 >>> records.close() 780 781 Note that this pseudo dictionary will not support all the methods of a 782 true Python dictionary, for example values() is not defined since this 783 would require loading all of the records into memory at once. 784 785 When you call the index function, it will scan through the file, noting 786 the location of each record. When you access a particular record via the 787 dictionary methods, the code will jump to the appropriate part of the 788 file and then parse that section into a SeqRecord. 789 790 Note that not all the input formats supported by Bio.SeqIO can be used 791 with this index function. It is designed to work only with sequential 792 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 793 interlaced file format (e.g. alignment formats such as "clustal"). 794 795 For small files, it may be more efficient to use an in memory Python 796 dictionary, e.g. 797 798 >>> from Bio import SeqIO 799 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq")) 800 >>> len(records) 801 3 802 >>> sorted(records) 803 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 804 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 805 >EAS54_6_R1_2_1_540_792 806 TTGGCAGGCCAAGGCCGATGGATCA 807 <BLANKLINE> 808 809 As with the to_dict() function, by default the id string of each record 810 is used as the key. You can specify a callback function to transform 811 this (the record identifier string) into your preferred key. For example: 812 813 >>> from Bio import SeqIO 814 >>> def make_tuple(identifier): 815 ... parts = identifier.split("_") 816 ... return int(parts[-2]), int(parts[-1]) 817 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 818 ... key_function=make_tuple) 819 >>> len(records) 820 3 821 >>> sorted(records) 822 [(413, 324), (443, 348), (540, 792)] 823 >>> print(records[(540, 792)].format("fasta")) 824 >EAS54_6_R1_2_1_540_792 825 TTGGCAGGCCAAGGCCGATGGATCA 826 <BLANKLINE> 827 >>> (540, 792) in records 828 True 829 >>> "EAS54_6_R1_2_1_540_792" in records 830 False 831 >>> print(records.get("Missing", None)) 832 None 833 >>> records.close() 834 835 Another common use case would be indexing an NCBI style FASTA file, 836 where you might want to extract the GI number from the FASTA identifier 837 to use as the dictionary key. 838 839 Notice that unlike the to_dict() function, here the key_function does 840 not get given the full SeqRecord to use to generate the key. Doing so 841 would impose a severe performance penalty as it would require the file 842 to be completely parsed while building the index. Right now this is 843 usually avoided. 844 845 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 846 """ 847 # Try and give helpful error messages: 848 if not isinstance(filename, basestring): 849 raise TypeError("Need a filename (not a handle)") 850 if not isinstance(format, basestring): 851 raise TypeError("Need a string for the file format (lower case)") 852 if not format: 853 raise ValueError("Format required (lower case string)") 854 if format != format.lower(): 855 raise ValueError("Format string '%s' should be lower case" % format) 856 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 857 isinstance(alphabet, AlphabetEncoder)): 858 raise ValueError("Invalid alphabet, %r" % alphabet) 859 860 # Map the file format to a sequence iterator: 861 from ._index import _FormatToRandomAccess # Lazy import 862 from Bio.File import _IndexedSeqFileDict 863 try: 864 proxy_class = _FormatToRandomAccess[format] 865 except KeyError: 866 raise ValueError("Unsupported format %r" % format) 867 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 868 % (filename, format, alphabet, key_function) 869 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 870 key_function, repr, "SeqRecord")
871 872
873 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 874 key_function=None):
875 """Index several sequence files and return a dictionary like object. 876 877 The index is stored in an SQLite database rather than in memory (as in the 878 Bio.SeqIO.index(...) function). 879 880 - index_filename - Where to store the SQLite index 881 - filenames - list of strings specifying file(s) to be indexed, or when 882 indexing a single file this can be given as a string. 883 (optional if reloading an existing index, but must match) 884 - format - lower case string describing the file format 885 (optional if reloading an existing index, but must match) 886 - alphabet - optional Alphabet object, useful when the sequence type 887 cannot be automatically inferred from the file itself 888 (e.g. format="fasta" or "tab") 889 - key_function - Optional callback function which when given a 890 SeqRecord identifier string should return a unique 891 key for the dictionary. 892 893 This indexing function will return a dictionary like object, giving the 894 SeqRecord objects as values: 895 896 >>> from Bio.Alphabet import generic_protein 897 >>> from Bio import SeqIO 898 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 899 >>> def get_gi(name): 900 ... parts = name.split("|") 901 ... i = parts.index("gi") 902 ... assert i != -1 903 ... return parts[i+1] 904 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 905 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 906 >>> len(records) 907 95 908 >>> records["7525076"].description 909 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 910 >>> records["45478717"].description 911 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 912 >>> records.close() 913 914 In this example the two files contain 85 and 10 records respectively. 915 916 BGZF compressed files are supported, and detected automatically. Ordinary 917 GZIP compressed files are not supported. 918 919 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module 920 glob which is useful for building lists of files. 921 """ 922 # Try and give helpful error messages: 923 if not isinstance(index_filename, basestring): 924 raise TypeError("Need a string for the index filename") 925 if isinstance(filenames, basestring): 926 # Make the API a little more friendly, and more similar 927 # to Bio.SeqIO.index(...) for indexing just one file. 928 filenames = [filenames] 929 if filenames is not None and not isinstance(filenames, list): 930 raise TypeError( 931 "Need a list of filenames (as strings), or one filename") 932 if format is not None and not isinstance(format, basestring): 933 raise TypeError("Need a string for the file format (lower case)") 934 if format and format != format.lower(): 935 raise ValueError("Format string '%s' should be lower case" % format) 936 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 937 isinstance(alphabet, AlphabetEncoder)): 938 raise ValueError("Invalid alphabet, %r" % alphabet) 939 940 # Map the file format to a sequence iterator: 941 from ._index import _FormatToRandomAccess # Lazy import 942 from Bio.File import _SQLiteManySeqFilesDict 943 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 944 % (index_filename, filenames, format, alphabet, key_function) 945 946 def proxy_factory(format, filename=None): 947 """Given a filename returns proxy object, else boolean if format OK.""" 948 if filename: 949 return _FormatToRandomAccess[format](filename, format, alphabet) 950 else: 951 return format in _FormatToRandomAccess
952 953 return _SQLiteManySeqFilesDict(index_filename, filenames, 954 proxy_factory, format, 955 key_function, repr) 956 957
958 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
959 """Convert between two sequence file formats, return number of records. 960 961 - in_file - an input handle or filename 962 - in_format - input file format, lower case string 963 - out_file - an output handle or filename 964 - out_format - output file format, lower case string 965 - alphabet - optional alphabet to assume 966 967 **NOTE** - If you provide an output filename, it will be opened which will 968 overwrite any existing file without warning. This may happen if even 969 the conversion is aborted (e.g. an invalid out_format name is given). 970 971 For example, going from a filename to a handle: 972 973 >>> from Bio import SeqIO 974 >>> try: 975 ... from StringIO import StringIO # Python 2 976 ... except ImportError: 977 ... from io import StringIO # Python 3 978 ... 979 >>> handle = StringIO("") 980 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 981 3 982 >>> print(handle.getvalue()) 983 >EAS54_6_R1_2_1_413_324 984 CCCTTCTTGTCTTCAGCGTTTCTCC 985 >EAS54_6_R1_2_1_540_792 986 TTGGCAGGCCAAGGCCGATGGATCA 987 >EAS54_6_R1_2_1_443_348 988 GTTGCTTCTGGCGTGGGTGGGGGGG 989 <BLANKLINE> 990 """ 991 # Hack for SFF, will need to make this more general in future 992 if in_format in _BinaryFormats: 993 in_mode = 'rb' 994 else: 995 in_mode = 'rU' 996 997 # Don't open the output file until we've checked the input is OK? 998 if out_format in ["sff", "sff_trim"]: 999 out_mode = 'wb' 1000 else: 1001 out_mode = 'w' 1002 1003 # This will check the arguments and issue error messages, 1004 # after we have opened the file which is a shame. 1005 from ._convert import _handle_convert # Lazy import 1006 with as_handle(in_file, in_mode) as in_handle: 1007 with as_handle(out_file, out_mode) as out_handle: 1008 count = _handle_convert(in_handle, in_format, 1009 out_handle, out_format, 1010 alphabet) 1011 return count
1012 1013 1014 # This helpful trick for testing no longer works with the 1015 # local imports :( 1016 # 1017 # if __name__ == "__main__": 1018 # from Bio._utils import run_doctest 1019 # run_doctest() 1020