Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

   1  # Copyright 2006-2017 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # Nice link: 
   7  # http://www.ebi.ac.uk/help/formats_frame.html 
   8   
   9  r"""Sequence input/output as SeqRecord objects. 
  10   
  11  Bio.SeqIO is also documented at SeqIO_ and by a whole chapter in our tutorial: 
  12   
  13    - `HTML Tutorial`_ 
  14    - `PDF Tutorial`_ 
  15   
  16  .. _SeqIO: http://biopython.org/wiki/SeqIO 
  17  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
  18  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
  19   
  20  Input 
  21  ----- 
  22  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
  23  (or in recent versions of Biopython alternatively a filename as a string), 
  24  and format string.  This returns an iterator giving SeqRecord objects: 
  25   
  26  >>> from Bio import SeqIO 
  27  >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
  28  ...     print("%s %i" % (record.id, len(record))) 
  29  gi|1348912|gb|G26680|G26680 633 
  30  gi|1348917|gb|G26685|G26685 413 
  31  gi|1592936|gb|G29385|G29385 471 
  32   
  33  Note that the parse() function will invoke the relevant parser for the 
  34  format with its default settings.  You may want more control, in which case 
  35  you need to create a format specific sequence iterator directly. 
  36   
  37  Some of these parsers are wrappers around low-level parsers which build up 
  38  SeqRecord objects for the consistent SeqIO interface. In cases where the 
  39  run-time is critical, such as large FASTA or FASTQ files, calling these 
  40  underlying parsers will be much faster - in this case these generator 
  41  functions which return tuples of strings: 
  42   
  43  >>> from Bio.SeqIO.FastaIO import SimpleFastaParser 
  44  >>> from Bio.SeqIO.QualityIO import FastqGeneralIterator 
  45   
  46   
  47  Input - Single Records 
  48  ---------------------- 
  49  If you expect your file to contain one-and-only-one record, then we provide 
  50  the following 'helper' function which will return a single SeqRecord, or 
  51  raise an exception if there are no records or more than one record: 
  52   
  53  >>> from Bio import SeqIO 
  54  >>> record = SeqIO.read("Fasta/f001", "fasta") 
  55  >>> print("%s %i" % (record.id, len(record))) 
  56  gi|3318709|pdb|1A91| 79 
  57   
  58  This style is useful when you expect a single record only (and would 
  59  consider multiple records an error).  For example, when dealing with GenBank 
  60  files for bacterial genomes or chromosomes, there is normally only a single 
  61  record.  Alternatively, use this with a handle when downloading a single 
  62  record from the internet. 
  63   
  64  However, if you just want the first record from a file containing multiple 
  65  record, use the next() function on the iterator (or under Python 2, the 
  66  iterator's next() method): 
  67   
  68  >>> from Bio import SeqIO 
  69  >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
  70  >>> print("%s %i" % (record.id, len(record))) 
  71  gi|1348912|gb|G26680|G26680 633 
  72   
  73  The above code will work as long as the file contains at least one record. 
  74  Note that if there is more than one record, the remaining records will be 
  75  silently ignored. 
  76   
  77   
  78  Input - Multiple Records 
  79  ------------------------ 
  80  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
  81  using a sequence iterator can save you a lot of memory (RAM).  There is 
  82  less benefit for interlaced file formats (e.g. most multiple alignment file 
  83  formats).  However, an iterator only lets you access the records one by one. 
  84   
  85  If you want random access to the records by number, turn this into a list: 
  86   
  87  >>> from Bio import SeqIO 
  88  >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
  89  >>> len(records) 
  90  3 
  91  >>> print(records[1].id) 
  92  gi|1348917|gb|G26685|G26685 
  93   
  94  If you want random access to the records by a key such as the record id, 
  95  turn the iterator into a dictionary: 
  96   
  97  >>> from Bio import SeqIO 
  98  >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
  99  >>> len(record_dict) 
 100  3 
 101  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 102  413 
 103   
 104  However, using list() or the to_dict() function will load all the records 
 105  into memory at once, and therefore is not possible on very large files. 
 106  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
 107  providing dictionary like access to any record. For example, 
 108   
 109  >>> from Bio import SeqIO 
 110  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 111  >>> len(record_dict) 
 112  3 
 113  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 114  413 
 115  >>> record_dict.close() 
 116   
 117  Many but not all of the supported input file formats can be indexed like 
 118  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
 119  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
 120   
 121  In most cases you can also use SeqIO.index to get the record from the file 
 122  as a raw string (not a SeqRecord). This can be useful for example to extract 
 123  a sub-set of records from a file where SeqIO cannot output the file format 
 124  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
 125  keep the output 100% identical to the input). For example, 
 126   
 127  >>> from Bio import SeqIO 
 128  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 129  >>> len(record_dict) 
 130  3 
 131  >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
 132  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 133  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
 134  ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
 135  GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
 136  GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
 137  TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
 138  AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 139  <BLANKLINE> 
 140  >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
 141  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 142  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
 143  TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
 144  TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
 145  ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
 146  AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
 147  TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
 148  CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 149  <BLANKLINE> 
 150  >>> record_dict.close() 
 151   
 152  Here the original file and what Biopython would output differ in the line 
 153  wrapping. Also note that under Python 3, the get_raw method will return a 
 154  bytes string, hence the use of decode to turn it into a (unicode) string. 
 155  This is unnecessary on Python 2. 
 156   
 157  Also note that the get_raw method will preserve the newline endings. This 
 158  example FASTQ file uses Unix style endings (b"\n" only), 
 159   
 160  >>> from Bio import SeqIO 
 161  >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
 162  >>> len(fastq_dict) 
 163  3 
 164  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 165  >>> raw.count(b"\n") 
 166  4 
 167  >>> raw.count(b"\r\n") 
 168  0 
 169  >>> b"\r" in raw 
 170  False 
 171  >>> len(raw) 
 172  78 
 173  >>> fastq_dict.close() 
 174   
 175  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
 176   
 177  >>> from Bio import SeqIO 
 178  >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
 179  >>> len(fastq_dict) 
 180  3 
 181  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 182  >>> raw.count(b"\n") 
 183  4 
 184  >>> raw.count(b"\r\n") 
 185  4 
 186  >>> b"\r\n" in raw 
 187  True 
 188  >>> len(raw) 
 189  82 
 190  >>> fastq_dict.close() 
 191   
 192  Because this uses two bytes for each new line, the file is longer than 
 193  the Unix equivalent with only one byte. 
 194   
 195   
 196  Input - Alignments 
 197  ------------------ 
 198  You can read in alignment files as alignment objects using Bio.AlignIO. 
 199  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
 200  you a SeqRecord for each row of each alignment: 
 201   
 202  >>> from Bio import SeqIO 
 203  >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
 204  ...     print("%s %i" % (record.id, len(record))) 
 205  gi|167877390|gb|EDS40773.1| 447 
 206  gi|167234445|ref|NP_001107837. 447 
 207  gi|74100009|gb|AAZ99217.1| 447 
 208  gi|13990994|dbj|BAA33523.2| 447 
 209  gi|56122354|gb|AAV74328.1| 447 
 210   
 211   
 212  Output 
 213  ------ 
 214  Use the function Bio.SeqIO.write(...), which takes a complete set of 
 215  SeqRecord objects (either as a list, or an iterator), an output file handle 
 216  (or in recent versions of Biopython an output filename as a string) and of 
 217  course the file format:: 
 218   
 219    from Bio import SeqIO 
 220    records = ... 
 221    SeqIO.write(records, "example.faa", "fasta") 
 222   
 223  Or, using a handle:: 
 224   
 225      from Bio import SeqIO 
 226      records = ... 
 227      with open("example.faa", "w") as handle: 
 228        SeqIO.write(records, handle, "fasta") 
 229   
 230  You are expected to call this function once (with all your records) and if 
 231  using a handle, make sure you close it to flush the data to the hard disk. 
 232   
 233   
 234  Output - Advanced 
 235  ----------------- 
 236  The effect of calling write() multiple times on a single file will vary 
 237  depending on the file format, and is best avoided unless you have a strong 
 238  reason to do so. 
 239   
 240  If you give a filename, then each time you call write() the existing file 
 241  will be overwritten. For sequential files formats (e.g. fasta, genbank) each 
 242  "record block" holds a single sequence.  For these files it would probably 
 243  be safe to call write() multiple times by re-using the same handle. 
 244   
 245  However, trying this for certain alignment formats (e.g. phylip, clustal, 
 246  stockholm) would have the effect of concatenating several multiple sequence 
 247  alignments together.  Such files are created by the PHYLIP suite of programs 
 248  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
 249   
 250  Worse, many fileformats have an explicit header and/or footer structure 
 251  (e.g. any XMl format, and most binary file formats like SFF). Here making 
 252  multiple calls to write() will result in an invalid file. 
 253   
 254   
 255  Conversion 
 256  ---------- 
 257  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
 258  file format conversions. Additionally, it may use file format specific 
 259  optimisations so this should be the fastest way too. 
 260   
 261  In general however, you can combine the Bio.SeqIO.parse(...) function with 
 262  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
 263  generator expressions or generator functions provides a memory efficient way 
 264  to perform filtering or other extra operations as part of the process. 
 265   
 266   
 267  File Formats 
 268  ------------ 
 269  When specifying the file format, use lowercase strings.  The same format 
 270  names are also used in Bio.AlignIO and include the following: 
 271   
 272      - abif    - Applied Biosystem's sequencing trace format 
 273      - ace     - Reads the contig sequences from an ACE assembly file. 
 274      - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
 275      - fasta   - The generic sequence file format where each record starts with 
 276        an identifer line starting with a ">" character, followed by 
 277        lines of sequence. 
 278      - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
 279        sequence quality values (with an ASCII offset of 33). 
 280      - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
 281      - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
 282        encodes Solexa quality scores (not PHRED quality scores) with an 
 283        ASCII offset of 64. 
 284      - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
 285        which encodes PHRED quality scores with an ASCII offset of 64 
 286        (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
 287        will produce FASTQ files using the standard Sanger encoding. 
 288      - genbank - The GenBank or GenPept flat file format. 
 289      - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
 290      - ig      - The IntelliGenetics file format, apparently the same as the 
 291        MASE alignment format. 
 292      - imgt    - An EMBL like format from IMGT where the feature tables are more 
 293        indented to allow for longer feature types. 
 294      - pdb-seqres -  Reads a Protein Data Bank (PDB) file to determine the 
 295        complete protein sequence as it appears in the header (no dependencies). 
 296      - pdb-atom - Uses Bio.PDB to determine the (partial) protein sequence as 
 297        it appears in the structure based on the atom coordinate section of the 
 298        file (requires NumPy for Bio.PDB). 
 299      - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
 300      - pir     - A "FASTA like" format introduced by the National Biomedical 
 301        Research Foundation (NBRF) for the Protein Information Resource 
 302        (PIR) database, now part of UniProt. 
 303      - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
 304      - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
 305      - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
 306      - swiss   - Plain text Swiss-Prot aka UniProt format. 
 307      - tab     - Simple two column tab separated sequence files, where each 
 308        line holds a record's identifier and sequence. For example, 
 309        this is used as by Aligent's eArray software when saving 
 310        microarray probes in a minimal tab delimited text file. 
 311      - qual    - A "FASTA like" format holding PHRED quality values from 
 312        sequencing DNA, but no actual sequences (usually provided 
 313        in separate FASTA files). 
 314      - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
 315        text format which we call "swiss") 
 316   
 317  Note that while Bio.SeqIO can read all the above file formats, it cannot 
 318  write to all of them. 
 319   
 320  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
 321  "phylip" and "stockholm", which gives you access to the individual sequences 
 322  making up each alignment as SeqRecords. 
 323  """ 
 324   
 325  from __future__ import print_function 
 326  from Bio._py3k import basestring 
 327   
 328  # TODO 
 329  # - define policy on reading aligned sequences with gaps in 
 330  #   (e.g. - and . characters) including how the alphabet interacts 
 331  # 
 332  # - How best to handle unique/non unique record.id when writing. 
 333  #   For most file formats reading such files is fine; The stockholm 
 334  #   parser would fail. 
 335  # 
 336  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
 337  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
 338  # 
 339  # FAO BioPython Developers 
 340  # ------------------------ 
 341  # The way I envision this SeqIO system working as that for any sequence file 
 342  # format we have an iterator that returns SeqRecord objects. 
 343  # 
 344  # This also applies to interlaced file formats (like clustal - although that 
 345  # is now handled via Bio.AlignIO instead) where the file cannot be read record 
 346  # by record.  You should still return an iterator, even if the implementation 
 347  # could just as easily return a list. 
 348  # 
 349  # These file format specific sequence iterators may be implemented as: 
 350  #    - Classes which take a handle for __init__ and provide the __iter__ method 
 351  #    - Functions that take a handle, and return an iterator object 
 352  #    - Generator functions that take a handle, and yield SeqRecord objects 
 353  # 
 354  # It is then trivial to turn this iterator into a list of SeqRecord objects, 
 355  # an in memory dictionary, or a multiple sequence alignment object. 
 356  # 
 357  # For building the dictionary by default the id property of each SeqRecord is 
 358  # used as the key.  You should always populate the id property, and it should 
 359  # be unique in most cases. For some file formats the accession number is a good 
 360  # choice.  If the file itself contains ambiguous identifiers, don't try and 
 361  # dis-ambiguate them - return them as is. 
 362  # 
 363  # When adding a new file format, please use the same lower case format name 
 364  # as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
 365  # 
 366  # See also http://biopython.org/wiki/SeqIO_dev 
 367  # 
 368  # --Peter 
 369   
 370  from Bio.File import as_handle 
 371  from Bio.SeqRecord import SeqRecord 
 372  from Bio.Align import MultipleSeqAlignment 
 373  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
 374   
 375  from . import AbiIO 
 376  from . import AceIO 
 377  from . import FastaIO 
 378  from . import IgIO  # IntelliGenetics or MASE format 
 379  from . import InsdcIO  # EMBL and GenBank 
 380  from . import PdbIO 
 381  from . import PhdIO 
 382  from . import PirIO 
 383  from . import SeqXmlIO 
 384  from . import SffIO 
 385  from . import SwissIO 
 386  from . import TabIO 
 387  from . import QualityIO  # FastQ and qual files 
 388  from . import UniprotIO 
 389   
 390   
 391  # Convention for format names is "mainname-subtype" in lower case. 
 392  # Please use the same names as BioPerl or EMBOSS where possible. 
 393  # 
 394  # Note that this simple system copes with defining 
 395  # multiple possible iterators for a given format/extension 
 396  # with the -subtype suffix 
 397  # 
 398  # Most alignment file formats will be handled via Bio.AlignIO 
 399   
 400  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
 401                       "gb": InsdcIO.GenBankIterator, 
 402                       "genbank": InsdcIO.GenBankIterator, 
 403                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
 404                       "embl": InsdcIO.EmblIterator, 
 405                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
 406                       "imgt": InsdcIO.ImgtIterator, 
 407                       "ig": IgIO.IgIterator, 
 408                       "swiss": SwissIO.SwissIterator, 
 409                       "pdb-atom": PdbIO.PdbAtomIterator, 
 410                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
 411                       "phd": PhdIO.PhdIterator, 
 412                       "ace": AceIO.AceIterator, 
 413                       "tab": TabIO.TabIterator, 
 414                       "pir": PirIO.PirIterator, 
 415                       "fastq": QualityIO.FastqPhredIterator, 
 416                       "fastq-sanger": QualityIO.FastqPhredIterator, 
 417                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
 418                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
 419                       "qual": QualityIO.QualPhredIterator, 
 420                       "sff": SffIO.SffIterator, 
 421                       # Not sure about this in the long run: 
 422                       "sff-trim": SffIO._SffTrimIterator, 
 423                       "uniprot-xml": UniprotIO.UniprotIterator, 
 424                       "seqxml": SeqXmlIO.SeqXmlIterator, 
 425                       "abi": AbiIO.AbiIterator, 
 426                       "abi-trim": AbiIO._AbiTrimIterator, 
 427                       } 
 428   
 429  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
 430                     "gb": InsdcIO.GenBankWriter, 
 431                     "genbank": InsdcIO.GenBankWriter, 
 432                     "embl": InsdcIO.EmblWriter, 
 433                     "imgt": InsdcIO.ImgtWriter, 
 434                     "tab": TabIO.TabWriter, 
 435                     "fastq": QualityIO.FastqPhredWriter, 
 436                     "fastq-sanger": QualityIO.FastqPhredWriter, 
 437                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
 438                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
 439                     "phd": PhdIO.PhdWriter, 
 440                     "qual": QualityIO.QualPhredWriter, 
 441                     "sff": SffIO.SffWriter, 
 442                     "seqxml": SeqXmlIO.SeqXmlWriter, 
 443                     } 
 444   
 445  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
 446   
 447   
448 -def write(sequences, handle, format):
449 """Write complete set of sequences to a file. 450 451 - sequences - A list (or iterator) of SeqRecord objects, or (if using 452 Biopython 1.54 or later) a single SeqRecord. 453 - handle - File handle object to write to, or filename as string 454 (note older versions of Biopython only took a handle). 455 - format - lower case string describing the file format to write. 456 457 You should close the handle after calling this function. 458 459 Returns the number of records written (as an integer). 460 """ 461 from Bio import AlignIO 462 463 # Try and give helpful error messages: 464 if not isinstance(format, basestring): 465 raise TypeError("Need a string for the file format (lower case)") 466 if not format: 467 raise ValueError("Format required (lower case string)") 468 if format != format.lower(): 469 raise ValueError("Format string '%s' should be lower case" % format) 470 471 if isinstance(handle, SeqRecord): 472 raise TypeError("Check arguments, handle should NOT be a SeqRecord") 473 if isinstance(handle, list): 474 # e.g. list of SeqRecord objects 475 raise TypeError("Check arguments, handle should NOT be a list") 476 477 if isinstance(sequences, SeqRecord): 478 # This raised an exception in older versions of Biopython 479 sequences = [sequences] 480 481 if format in _BinaryFormats: 482 mode = 'wb' 483 else: 484 mode = 'w' 485 486 with as_handle(handle, mode) as fp: 487 # Map the file format to a writer class 488 if format in _FormatToWriter: 489 writer_class = _FormatToWriter[format] 490 count = writer_class(fp).write_file(sequences) 491 elif format in AlignIO._FormatToWriter: 492 # Try and turn all the records into a single alignment, 493 # and write that using Bio.AlignIO 494 alignment = MultipleSeqAlignment(sequences) 495 alignment_count = AlignIO.write([alignment], fp, format) 496 assert alignment_count == 1, \ 497 "Internal error - the underlying writer " \ 498 " should have returned 1, not %r" % alignment_count 499 count = len(alignment) 500 del alignment_count, alignment 501 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 502 raise ValueError("Reading format '%s' is supported, but not writing" 503 % format) 504 else: 505 raise ValueError("Unknown format '%s'" % format) 506 507 assert isinstance(count, int), "Internal error - the underlying %s " \ 508 "writer should have returned the record count, not %r" \ 509 % (format, count) 510 511 return count
512 513
514 -def parse(handle, format, alphabet=None):
515 r"""Turns a sequence file into an iterator returning SeqRecords. 516 517 - handle - handle to the file, or the filename as a string 518 (note older versions of Biopython only took a handle). 519 - format - lower case string describing the file format. 520 - alphabet - optional Alphabet object, useful when the sequence type 521 cannot be automatically inferred from the file itself 522 (e.g. format="fasta" or "tab") 523 524 Typical usage, opening a file to read in, and looping over the record(s): 525 526 >>> from Bio import SeqIO 527 >>> filename = "Fasta/sweetpea.nu" 528 >>> for record in SeqIO.parse(filename, "fasta"): 529 ... print("ID %s" % record.id) 530 ... print("Sequence length %i" % len(record)) 531 ... print("Sequence alphabet %s" % record.seq.alphabet) 532 ID gi|3176602|gb|U78617.1|LOU78617 533 Sequence length 309 534 Sequence alphabet SingleLetterAlphabet() 535 536 For file formats like FASTA where the alphabet cannot be determined, it 537 may be useful to specify the alphabet explicitly: 538 539 >>> from Bio import SeqIO 540 >>> from Bio.Alphabet import generic_dna 541 >>> filename = "Fasta/sweetpea.nu" 542 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 543 ... print("ID %s" % record.id) 544 ... print("Sequence length %i" % len(record)) 545 ... print("Sequence alphabet %s" % record.seq.alphabet) 546 ID gi|3176602|gb|U78617.1|LOU78617 547 Sequence length 309 548 Sequence alphabet DNAAlphabet() 549 550 If you have a string 'data' containing the file contents, you must 551 first turn this into a handle in order to parse it: 552 553 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 554 >>> from Bio import SeqIO 555 >>> try: 556 ... from StringIO import StringIO # Python 2 557 ... except ImportError: 558 ... from io import StringIO # Python 3 559 ... 560 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 561 ... print("%s %s" % (record.id, record.seq)) 562 Alpha ACCGGATGTA 563 Beta AGGCTCGGTTA 564 565 Use the Bio.SeqIO.read(...) function when you expect a single record 566 only. 567 """ 568 # NOTE - The above docstring has some raw \n characters needed 569 # for the StringIO example, hence the whole docstring is in raw 570 # string mode (see the leading r before the opening quote). 571 from Bio import AlignIO 572 573 # Hack for SFF, will need to make this more general in future 574 if format in _BinaryFormats: 575 mode = 'rb' 576 else: 577 mode = 'rU' 578 579 # Try and give helpful error messages: 580 if not isinstance(format, basestring): 581 raise TypeError("Need a string for the file format (lower case)") 582 if not format: 583 raise ValueError("Format required (lower case string)") 584 if format != format.lower(): 585 raise ValueError("Format string '%s' should be lower case" % format) 586 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 587 isinstance(alphabet, AlphabetEncoder)): 588 raise ValueError("Invalid alphabet, %r" % alphabet) 589 590 with as_handle(handle, mode) as fp: 591 # Map the file format to a sequence iterator: 592 if format in _FormatToIterator: 593 iterator_generator = _FormatToIterator[format] 594 if alphabet is None: 595 i = iterator_generator(fp) 596 else: 597 try: 598 i = iterator_generator(fp, alphabet=alphabet) 599 except TypeError: 600 i = _force_alphabet(iterator_generator(fp), alphabet) 601 elif format in AlignIO._FormatToIterator: 602 # Use Bio.AlignIO to read in the alignments 603 i = (r for alignment in AlignIO.parse(fp, format, 604 alphabet=alphabet) 605 for r in alignment) 606 else: 607 raise ValueError("Unknown format '%s'" % format) 608 # This imposes some overhead... wait until we drop Python 2.4 to fix it 609 for r in i: 610 yield r
611 612
613 -def _force_alphabet(record_iterator, alphabet):
614 """Iterate over records, over-riding the alphabet (PRIVATE).""" 615 # Assume the alphabet argument has been pre-validated 616 given_base_class = _get_base_alphabet(alphabet).__class__ 617 for record in record_iterator: 618 if isinstance(_get_base_alphabet(record.seq.alphabet), 619 given_base_class): 620 record.seq.alphabet = alphabet 621 yield record 622 else: 623 raise ValueError("Specified alphabet %r clashes with " 624 "that determined from the file, %r" 625 % (alphabet, record.seq.alphabet))
626 627
628 -def read(handle, format, alphabet=None):
629 """Turns a sequence file into a single SeqRecord. 630 631 - handle - handle to the file, or the filename as a string 632 (note older versions of Biopython only took a handle). 633 - format - string describing the file format. 634 - alphabet - optional Alphabet object, useful when the sequence type 635 cannot be automatically inferred from the file itself 636 (e.g. format="fasta" or "tab") 637 638 This function is for use parsing sequence files containing 639 exactly one record. For example, reading a GenBank file: 640 641 >>> from Bio import SeqIO 642 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 643 >>> print("ID %s" % record.id) 644 ID AC007323.5 645 >>> print("Sequence length %i" % len(record)) 646 Sequence length 86436 647 >>> print("Sequence alphabet %s" % record.seq.alphabet) 648 Sequence alphabet IUPACAmbiguousDNA() 649 650 If the handle contains no records, or more than one record, 651 an exception is raised. For example: 652 653 >>> from Bio import SeqIO 654 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 655 Traceback (most recent call last): 656 ... 657 ValueError: More than one record found in handle 658 659 If however you want the first record from a file containing 660 multiple records this function would raise an exception (as 661 shown in the example above). Instead use: 662 663 >>> from Bio import SeqIO 664 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 665 >>> print("First record's ID %s" % record.id) 666 First record's ID X55053.1 667 668 Use the Bio.SeqIO.parse(handle, format) function if you want 669 to read multiple records from the handle. 670 """ 671 iterator = parse(handle, format, alphabet) 672 try: 673 first = next(iterator) 674 except StopIteration: 675 first = None 676 if first is None: 677 raise ValueError("No records found in handle") 678 try: 679 second = next(iterator) 680 except StopIteration: 681 second = None 682 if second is not None: 683 raise ValueError("More than one record found in handle") 684 return first
685 686
687 -def to_dict(sequences, key_function=None):
688 """Turns a sequence iterator or list into a dictionary. 689 690 - sequences - An iterator that returns SeqRecord objects, 691 or simply a list of SeqRecord objects. 692 - key_function - Optional callback function which when given a 693 SeqRecord should return a unique key for the dictionary. 694 695 e.g. key_function = lambda rec : rec.name 696 or, key_function = lambda rec : rec.description.split()[0] 697 698 If key_function is omitted then record.id is used, on the assumption 699 that the records objects returned are SeqRecords with a unique id. 700 701 If there are duplicate keys, an error is raised. 702 703 Example usage, defaulting to using the record.id as key: 704 705 >>> from Bio import SeqIO 706 >>> filename = "GenBank/cor6_6.gb" 707 >>> format = "genbank" 708 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 709 >>> print(sorted(id_dict)) 710 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 711 >>> print(id_dict["L31939.1"].description) 712 Brassica rapa (clone bif72) kin mRNA, complete cds 713 714 A more complex example, using the key_function argument in order to 715 use a sequence checksum as the dictionary key: 716 717 >>> from Bio import SeqIO 718 >>> from Bio.SeqUtils.CheckSum import seguid 719 >>> filename = "GenBank/cor6_6.gb" 720 >>> format = "genbank" 721 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 722 ... key_function = lambda rec : seguid(rec.seq)) 723 >>> for key, record in sorted(seguid_dict.items()): 724 ... print("%s %s" % (key, record.id)) 725 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 726 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 727 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 728 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 729 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 730 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 731 732 This approach is not suitable for very large sets of sequences, as all 733 the SeqRecord objects are held in memory. Instead, consider using the 734 Bio.SeqIO.index() function (if it supports your particular file format). 735 """ 736 if key_function is None: 737 key_function = lambda rec: rec.id 738 739 d = dict() 740 for record in sequences: 741 key = key_function(record) 742 if key in d: 743 raise ValueError("Duplicate key '%s'" % key) 744 d[key] = record 745 return d
746 747
748 -def index(filename, format, alphabet=None, key_function=None):
749 """Indexes a sequence file and returns a dictionary like object. 750 751 - filename - string giving name of file to be indexed 752 - format - lower case string describing the file format 753 - alphabet - optional Alphabet object, useful when the sequence type 754 cannot be automatically inferred from the file itself 755 (e.g. format="fasta" or "tab") 756 - key_function - Optional callback function which when given a 757 SeqRecord identifier string should return a unique 758 key for the dictionary. 759 760 This indexing function will return a dictionary like object, giving the 761 SeqRecord objects as values: 762 763 >>> from Bio import SeqIO 764 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 765 >>> len(records) 766 3 767 >>> sorted(records) 768 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 769 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 770 >EAS54_6_R1_2_1_540_792 771 TTGGCAGGCCAAGGCCGATGGATCA 772 <BLANKLINE> 773 >>> "EAS54_6_R1_2_1_540_792" in records 774 True 775 >>> print(records.get("Missing", None)) 776 None 777 >>> records.close() 778 779 If the file is BGZF compressed, this is detected automatically. Ordinary 780 GZIP files are not supported: 781 782 >>> from Bio import SeqIO 783 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 784 >>> len(records) 785 3 786 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 787 TTGGCAGGCCAAGGCCGATGGATCA 788 >>> records.close() 789 790 Note that this pseudo dictionary will not support all the methods of a 791 true Python dictionary, for example values() is not defined since this 792 would require loading all of the records into memory at once. 793 794 When you call the index function, it will scan through the file, noting 795 the location of each record. When you access a particular record via the 796 dictionary methods, the code will jump to the appropriate part of the 797 file and then parse that section into a SeqRecord. 798 799 Note that not all the input formats supported by Bio.SeqIO can be used 800 with this index function. It is designed to work only with sequential 801 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 802 interlaced file format (e.g. alignment formats such as "clustal"). 803 804 For small files, it may be more efficient to use an in memory Python 805 dictionary, e.g. 806 807 >>> from Bio import SeqIO 808 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq")) 809 >>> len(records) 810 3 811 >>> sorted(records) 812 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 813 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 814 >EAS54_6_R1_2_1_540_792 815 TTGGCAGGCCAAGGCCGATGGATCA 816 <BLANKLINE> 817 818 As with the to_dict() function, by default the id string of each record 819 is used as the key. You can specify a callback function to transform 820 this (the record identifier string) into your preferred key. For example: 821 822 >>> from Bio import SeqIO 823 >>> def make_tuple(identifier): 824 ... parts = identifier.split("_") 825 ... return int(parts[-2]), int(parts[-1]) 826 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 827 ... key_function=make_tuple) 828 >>> len(records) 829 3 830 >>> sorted(records) 831 [(413, 324), (443, 348), (540, 792)] 832 >>> print(records[(540, 792)].format("fasta")) 833 >EAS54_6_R1_2_1_540_792 834 TTGGCAGGCCAAGGCCGATGGATCA 835 <BLANKLINE> 836 >>> (540, 792) in records 837 True 838 >>> "EAS54_6_R1_2_1_540_792" in records 839 False 840 >>> print(records.get("Missing", None)) 841 None 842 >>> records.close() 843 844 Another common use case would be indexing an NCBI style FASTA file, 845 where you might want to extract the GI number from the FASTA identifier 846 to use as the dictionary key. 847 848 Notice that unlike the to_dict() function, here the key_function does 849 not get given the full SeqRecord to use to generate the key. Doing so 850 would impose a severe performance penalty as it would require the file 851 to be completely parsed while building the index. Right now this is 852 usually avoided. 853 854 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 855 """ 856 # Try and give helpful error messages: 857 if not isinstance(filename, basestring): 858 raise TypeError("Need a filename (not a handle)") 859 if not isinstance(format, basestring): 860 raise TypeError("Need a string for the file format (lower case)") 861 if not format: 862 raise ValueError("Format required (lower case string)") 863 if format != format.lower(): 864 raise ValueError("Format string '%s' should be lower case" % format) 865 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 866 isinstance(alphabet, AlphabetEncoder)): 867 raise ValueError("Invalid alphabet, %r" % alphabet) 868 869 # Map the file format to a sequence iterator: 870 from ._index import _FormatToRandomAccess # Lazy import 871 from Bio.File import _IndexedSeqFileDict 872 try: 873 proxy_class = _FormatToRandomAccess[format] 874 except KeyError: 875 raise ValueError("Unsupported format %r" % format) 876 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 877 % (filename, format, alphabet, key_function) 878 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 879 key_function, repr, "SeqRecord")
880 881
882 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 883 key_function=None):
884 """Index several sequence files and return a dictionary like object. 885 886 The index is stored in an SQLite database rather than in memory (as in the 887 Bio.SeqIO.index(...) function). 888 889 - index_filename - Where to store the SQLite index 890 - filenames - list of strings specifying file(s) to be indexed, or when 891 indexing a single file this can be given as a string. 892 (optional if reloading an existing index, but must match) 893 - format - lower case string describing the file format 894 (optional if reloading an existing index, but must match) 895 - alphabet - optional Alphabet object, useful when the sequence type 896 cannot be automatically inferred from the file itself 897 (e.g. format="fasta" or "tab") 898 - key_function - Optional callback function which when given a 899 SeqRecord identifier string should return a unique 900 key for the dictionary. 901 902 This indexing function will return a dictionary like object, giving the 903 SeqRecord objects as values: 904 905 >>> from Bio.Alphabet import generic_protein 906 >>> from Bio import SeqIO 907 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 908 >>> def get_gi(name): 909 ... parts = name.split("|") 910 ... i = parts.index("gi") 911 ... assert i != -1 912 ... return parts[i+1] 913 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 914 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 915 >>> len(records) 916 95 917 >>> records["7525076"].description 918 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 919 >>> records["45478717"].description 920 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 921 >>> records.close() 922 923 In this example the two files contain 85 and 10 records respectively. 924 925 BGZF compressed files are supported, and detected automatically. Ordinary 926 GZIP compressed files are not supported. 927 928 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module 929 glob which is useful for building lists of files. 930 """ 931 # Try and give helpful error messages: 932 if not isinstance(index_filename, basestring): 933 raise TypeError("Need a string for the index filename") 934 if isinstance(filenames, basestring): 935 # Make the API a little more friendly, and more similar 936 # to Bio.SeqIO.index(...) for indexing just one file. 937 filenames = [filenames] 938 if filenames is not None and not isinstance(filenames, list): 939 raise TypeError( 940 "Need a list of filenames (as strings), or one filename") 941 if format is not None and not isinstance(format, basestring): 942 raise TypeError("Need a string for the file format (lower case)") 943 if format and format != format.lower(): 944 raise ValueError("Format string '%s' should be lower case" % format) 945 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 946 isinstance(alphabet, AlphabetEncoder)): 947 raise ValueError("Invalid alphabet, %r" % alphabet) 948 949 # Map the file format to a sequence iterator: 950 from ._index import _FormatToRandomAccess # Lazy import 951 from Bio.File import _SQLiteManySeqFilesDict 952 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 953 % (index_filename, filenames, format, alphabet, key_function) 954 955 def proxy_factory(format, filename=None): 956 """Given a filename returns proxy object, else boolean if format OK.""" 957 if filename: 958 return _FormatToRandomAccess[format](filename, format, alphabet) 959 else: 960 return format in _FormatToRandomAccess
961 962 return _SQLiteManySeqFilesDict(index_filename, filenames, 963 proxy_factory, format, 964 key_function, repr) 965 966
967 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
968 """Convert between two sequence file formats, return number of records. 969 970 - in_file - an input handle or filename 971 - in_format - input file format, lower case string 972 - out_file - an output handle or filename 973 - out_format - output file format, lower case string 974 - alphabet - optional alphabet to assume 975 976 **NOTE** - If you provide an output filename, it will be opened which will 977 overwrite any existing file without warning. This may happen if even 978 the conversion is aborted (e.g. an invalid out_format name is given). 979 980 For example, going from a filename to a handle: 981 982 >>> from Bio import SeqIO 983 >>> try: 984 ... from StringIO import StringIO # Python 2 985 ... except ImportError: 986 ... from io import StringIO # Python 3 987 ... 988 >>> handle = StringIO("") 989 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 990 3 991 >>> print(handle.getvalue()) 992 >EAS54_6_R1_2_1_413_324 993 CCCTTCTTGTCTTCAGCGTTTCTCC 994 >EAS54_6_R1_2_1_540_792 995 TTGGCAGGCCAAGGCCGATGGATCA 996 >EAS54_6_R1_2_1_443_348 997 GTTGCTTCTGGCGTGGGTGGGGGGG 998 <BLANKLINE> 999 """ 1000 # Hack for SFF, will need to make this more general in future 1001 if in_format in _BinaryFormats: 1002 in_mode = 'rb' 1003 else: 1004 in_mode = 'rU' 1005 1006 # Don't open the output file until we've checked the input is OK? 1007 if out_format in ["sff", "sff_trim"]: 1008 out_mode = 'wb' 1009 else: 1010 out_mode = 'w' 1011 1012 # This will check the arguments and issue error messages, 1013 # after we have opened the file which is a shame. 1014 from ._convert import _handle_convert # Lazy import 1015 with as_handle(in_file, in_mode) as in_handle: 1016 with as_handle(out_file, out_mode) as out_handle: 1017 count = _handle_convert(in_handle, in_format, 1018 out_handle, out_format, 1019 alphabet) 1020 return count
1021 1022 1023 # This helpful trick for testing no longer works with the 1024 # local imports :( 1025 # 1026 # if __name__ == "__main__": 1027 # from Bio._utils import run_doctest 1028 # run_doctest() 1029