Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

   1  # Copyright 2006-2017 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # Nice link: 
   7  # http://www.ebi.ac.uk/help/formats_frame.html 
   8   
   9  r"""Sequence input/output as SeqRecord objects. 
  10   
  11  Bio.SeqIO is also documented at SeqIO_ and by a whole chapter in our tutorial: 
  12   
  13    - `HTML Tutorial`_ 
  14    - `PDF Tutorial`_ 
  15   
  16  .. _SeqIO: http://biopython.org/wiki/SeqIO 
  17  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
  18  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
  19   
  20  Input 
  21  ----- 
  22  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
  23  (or in recent versions of Biopython alternatively a filename as a string), 
  24  and format string.  This returns an iterator giving SeqRecord objects: 
  25   
  26  >>> from Bio import SeqIO 
  27  >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
  28  ...     print("%s %i" % (record.id, len(record))) 
  29  gi|1348912|gb|G26680|G26680 633 
  30  gi|1348917|gb|G26685|G26685 413 
  31  gi|1592936|gb|G29385|G29385 471 
  32   
  33  Note that the parse() function will invoke the relevant parser for the 
  34  format with its default settings.  You may want more control, in which case 
  35  you need to create a format specific sequence iterator directly. 
  36   
  37  Some of these parsers are wrappers around low-level parsers which build up 
  38  SeqRecord objects for the consistent SeqIO interface. In cases where the 
  39  run-time is critical, such as large FASTA or FASTQ files, calling these 
  40  underlying parsers will be much faster - in this case these generator 
  41  functions which return tuples of strings: 
  42   
  43  >>> from Bio.SeqIO.FastaIO import SimpleFastaParser 
  44  >>> from Bio.SeqIO.QualityIO import FastqGeneralIterator 
  45   
  46   
  47  Input - Single Records 
  48  ---------------------- 
  49  If you expect your file to contain one-and-only-one record, then we provide 
  50  the following 'helper' function which will return a single SeqRecord, or 
  51  raise an exception if there are no records or more than one record: 
  52   
  53  >>> from Bio import SeqIO 
  54  >>> record = SeqIO.read("Fasta/f001", "fasta") 
  55  >>> print("%s %i" % (record.id, len(record))) 
  56  gi|3318709|pdb|1A91| 79 
  57   
  58  This style is useful when you expect a single record only (and would 
  59  consider multiple records an error).  For example, when dealing with GenBank 
  60  files for bacterial genomes or chromosomes, there is normally only a single 
  61  record.  Alternatively, use this with a handle when downloading a single 
  62  record from the internet. 
  63   
  64  However, if you just want the first record from a file containing multiple 
  65  record, use the next() function on the iterator (or under Python 2, the 
  66  iterator's next() method): 
  67   
  68  >>> from Bio import SeqIO 
  69  >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
  70  >>> print("%s %i" % (record.id, len(record))) 
  71  gi|1348912|gb|G26680|G26680 633 
  72   
  73  The above code will work as long as the file contains at least one record. 
  74  Note that if there is more than one record, the remaining records will be 
  75  silently ignored. 
  76   
  77   
  78  Input - Multiple Records 
  79  ------------------------ 
  80  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
  81  using a sequence iterator can save you a lot of memory (RAM).  There is 
  82  less benefit for interlaced file formats (e.g. most multiple alignment file 
  83  formats).  However, an iterator only lets you access the records one by one. 
  84   
  85  If you want random access to the records by number, turn this into a list: 
  86   
  87  >>> from Bio import SeqIO 
  88  >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
  89  >>> len(records) 
  90  3 
  91  >>> print(records[1].id) 
  92  gi|1348917|gb|G26685|G26685 
  93   
  94  If you want random access to the records by a key such as the record id, 
  95  turn the iterator into a dictionary: 
  96   
  97  >>> from Bio import SeqIO 
  98  >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
  99  >>> len(record_dict) 
 100  3 
 101  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 102  413 
 103   
 104  However, using list() or the to_dict() function will load all the records 
 105  into memory at once, and therefore is not possible on very large files. 
 106  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
 107  providing dictionary like access to any record. For example, 
 108   
 109  >>> from Bio import SeqIO 
 110  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 111  >>> len(record_dict) 
 112  3 
 113  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 114  413 
 115  >>> record_dict.close() 
 116   
 117  Many but not all of the supported input file formats can be indexed like 
 118  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
 119  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
 120   
 121  In most cases you can also use SeqIO.index to get the record from the file 
 122  as a raw string (not a SeqRecord). This can be useful for example to extract 
 123  a sub-set of records from a file where SeqIO cannot output the file format 
 124  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
 125  keep the output 100% identical to the input). For example, 
 126   
 127  >>> from Bio import SeqIO 
 128  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 129  >>> len(record_dict) 
 130  3 
 131  >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
 132  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 133  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
 134  ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
 135  GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
 136  GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
 137  TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
 138  AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 139  <BLANKLINE> 
 140  >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
 141  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 142  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
 143  TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
 144  TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
 145  ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
 146  AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
 147  TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
 148  CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 149  <BLANKLINE> 
 150  >>> record_dict.close() 
 151   
 152  Here the original file and what Biopython would output differ in the line 
 153  wrapping. Also note that under Python 3, the get_raw method will return a 
 154  bytes string, hence the use of decode to turn it into a (unicode) string. 
 155  This is unnecessary on Python 2. 
 156   
 157  Also note that the get_raw method will preserve the newline endings. This 
 158  example FASTQ file uses Unix style endings (b"\n" only), 
 159   
 160  >>> from Bio import SeqIO 
 161  >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
 162  >>> len(fastq_dict) 
 163  3 
 164  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 165  >>> raw.count(b"\n") 
 166  4 
 167  >>> raw.count(b"\r\n") 
 168  0 
 169  >>> b"\r" in raw 
 170  False 
 171  >>> len(raw) 
 172  78 
 173  >>> fastq_dict.close() 
 174   
 175  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
 176   
 177  >>> from Bio import SeqIO 
 178  >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
 179  >>> len(fastq_dict) 
 180  3 
 181  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 182  >>> raw.count(b"\n") 
 183  4 
 184  >>> raw.count(b"\r\n") 
 185  4 
 186  >>> b"\r\n" in raw 
 187  True 
 188  >>> len(raw) 
 189  82 
 190  >>> fastq_dict.close() 
 191   
 192  Because this uses two bytes for each new line, the file is longer than 
 193  the Unix equivalent with only one byte. 
 194   
 195   
 196  Input - Alignments 
 197  ------------------ 
 198  You can read in alignment files as alignment objects using Bio.AlignIO. 
 199  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
 200  you a SeqRecord for each row of each alignment: 
 201   
 202  >>> from Bio import SeqIO 
 203  >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
 204  ...     print("%s %i" % (record.id, len(record))) 
 205  gi|167877390|gb|EDS40773.1| 447 
 206  gi|167234445|ref|NP_001107837. 447 
 207  gi|74100009|gb|AAZ99217.1| 447 
 208  gi|13990994|dbj|BAA33523.2| 447 
 209  gi|56122354|gb|AAV74328.1| 447 
 210   
 211   
 212  Output 
 213  ------ 
 214  Use the function Bio.SeqIO.write(...), which takes a complete set of 
 215  SeqRecord objects (either as a list, or an iterator), an output file handle 
 216  (or in recent versions of Biopython an output filename as a string) and of 
 217  course the file format:: 
 218   
 219    from Bio import SeqIO 
 220    records = ... 
 221    SeqIO.write(records, "example.faa", "fasta") 
 222   
 223  Or, using a handle:: 
 224   
 225      from Bio import SeqIO 
 226      records = ... 
 227      with open("example.faa", "w") as handle: 
 228        SeqIO.write(records, handle, "fasta") 
 229   
 230  You are expected to call this function once (with all your records) and if 
 231  using a handle, make sure you close it to flush the data to the hard disk. 
 232   
 233   
 234  Output - Advanced 
 235  ----------------- 
 236  The effect of calling write() multiple times on a single file will vary 
 237  depending on the file format, and is best avoided unless you have a strong 
 238  reason to do so. 
 239   
 240  If you give a filename, then each time you call write() the existing file 
 241  will be overwritten. For sequential files formats (e.g. fasta, genbank) each 
 242  "record block" holds a single sequence.  For these files it would probably 
 243  be safe to call write() multiple times by re-using the same handle. 
 244   
 245  However, trying this for certain alignment formats (e.g. phylip, clustal, 
 246  stockholm) would have the effect of concatenating several multiple sequence 
 247  alignments together.  Such files are created by the PHYLIP suite of programs 
 248  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
 249   
 250  Worse, many fileformats have an explicit header and/or footer structure 
 251  (e.g. any XMl format, and most binary file formats like SFF). Here making 
 252  multiple calls to write() will result in an invalid file. 
 253   
 254   
 255  Conversion 
 256  ---------- 
 257  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
 258  file format conversions. Additionally, it may use file format specific 
 259  optimisations so this should be the fastest way too. 
 260   
 261  In general however, you can combine the Bio.SeqIO.parse(...) function with 
 262  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
 263  generator expressions or generator functions provides a memory efficient way 
 264  to perform filtering or other extra operations as part of the process. 
 265   
 266   
 267  File Formats 
 268  ------------ 
 269  When specifying the file format, use lowercase strings.  The same format 
 270  names are also used in Bio.AlignIO and include the following: 
 271   
 272      - abif    - Applied Biosystem's sequencing trace format 
 273      - ace     - Reads the contig sequences from an ACE assembly file. 
 274      - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
 275      - fasta   - The generic sequence file format where each record starts with 
 276        an identifer line starting with a ">" character, followed by 
 277        lines of sequence. 
 278      - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
 279        sequence quality values (with an ASCII offset of 33). 
 280      - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
 281      - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
 282        encodes Solexa quality scores (not PHRED quality scores) with an 
 283        ASCII offset of 64. 
 284      - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
 285        which encodes PHRED quality scores with an ASCII offset of 64 
 286        (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
 287        will produce FASTQ files using the standard Sanger encoding. 
 288      - genbank - The GenBank or GenPept flat file format. 
 289      - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
 290      - ig      - The IntelliGenetics file format, apparently the same as the 
 291        MASE alignment format. 
 292      - imgt    - An EMBL like format from IMGT where the feature tables are more 
 293        indented to allow for longer feature types. 
 294      - pdb-seqres -  Reads a Protein Data Bank (PDB) file to determine the 
 295        complete protein sequence as it appears in the header (no dependencies). 
 296      - pdb-atom - Uses Bio.PDB to determine the (partial) protein sequence as 
 297        it appears in the structure based on the atom coordinate section of the 
 298        file (requires NumPy for Bio.PDB). 
 299      - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
 300      - pir     - A "FASTA like" format introduced by the National Biomedical 
 301        Research Foundation (NBRF) for the Protein Information Resource 
 302        (PIR) database, now part of UniProt. 
 303      - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
 304      - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
 305      - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
 306      - swiss   - Plain text Swiss-Prot aka UniProt format. 
 307      - tab     - Simple two column tab separated sequence files, where each 
 308        line holds a record's identifier and sequence. For example, 
 309        this is used as by Aligent's eArray software when saving 
 310        microarray probes in a minimal tab delimited text file. 
 311      - qual    - A "FASTA like" format holding PHRED quality values from 
 312        sequencing DNA, but no actual sequences (usually provided 
 313        in separate FASTA files). 
 314      - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
 315        text format which we call "swiss") 
 316   
 317  Note that while Bio.SeqIO can read all the above file formats, it cannot 
 318  write to all of them. 
 319   
 320  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
 321  "phylip" and "stockholm", which gives you access to the individual sequences 
 322  making up each alignment as SeqRecords. 
 323  """ 
 324   
 325  from __future__ import print_function 
 326  from Bio._py3k import basestring 
 327   
 328  # TODO 
 329  # - define policy on reading aligned sequences with gaps in 
 330  #   (e.g. - and . characters) including how the alphabet interacts 
 331  # 
 332  # - How best to handle unique/non unique record.id when writing. 
 333  #   For most file formats reading such files is fine; The stockholm 
 334  #   parser would fail. 
 335  # 
 336  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
 337  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
 338  # 
 339  # FAO BioPython Developers 
 340  # ------------------------ 
 341  # The way I envision this SeqIO system working as that for any sequence file 
 342  # format we have an iterator that returns SeqRecord objects. 
 343  # 
 344  # This also applies to interlaced file formats (like clustal - although that 
 345  # is now handled via Bio.AlignIO instead) where the file cannot be read record 
 346  # by record.  You should still return an iterator, even if the implementation 
 347  # could just as easily return a list. 
 348  # 
 349  # These file format specific sequence iterators may be implemented as: 
 350  #    - Classes which take a handle for __init__ and provide the __iter__ method 
 351  #    - Functions that take a handle, and return an iterator object 
 352  #    - Generator functions that take a handle, and yield SeqRecord objects 
 353  # 
 354  # It is then trivial to turn this iterator into a list of SeqRecord objects, 
 355  # an in memory dictionary, or a multiple sequence alignment object. 
 356  # 
 357  # For building the dictionary by default the id property of each SeqRecord is 
 358  # used as the key.  You should always populate the id property, and it should 
 359  # be unique in most cases. For some file formats the accession number is a good 
 360  # choice.  If the file itself contains ambiguous identifiers, don't try and 
 361  # dis-ambiguate them - return them as is. 
 362  # 
 363  # When adding a new file format, please use the same lower case format name 
 364  # as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
 365  # 
 366  # See also http://biopython.org/wiki/SeqIO_dev 
 367  # 
 368  # --Peter 
 369   
 370  from Bio.File import as_handle 
 371  from Bio.SeqRecord import SeqRecord 
 372  from Bio.Align import MultipleSeqAlignment 
 373  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
 374   
 375  from . import AbiIO 
 376  from . import AceIO 
 377  from . import FastaIO 
 378  from . import IgIO  # IntelliGenetics or MASE format 
 379  from . import InsdcIO  # EMBL and GenBank 
 380  from . import PdbIO 
 381  from . import PhdIO 
 382  from . import PirIO 
 383  from . import SeqXmlIO 
 384  from . import SffIO 
 385  from . import SwissIO 
 386  from . import TabIO 
 387  from . import QualityIO  # FastQ and qual files 
 388  from . import UniprotIO 
 389   
 390   
 391  # Convention for format names is "mainname-subtype" in lower case. 
 392  # Please use the same names as BioPerl or EMBOSS where possible. 
 393  # 
 394  # Note that this simple system copes with defining 
 395  # multiple possible iterators for a given format/extension 
 396  # with the -subtype suffix 
 397  # 
 398  # Most alignment file formats will be handled via Bio.AlignIO 
 399   
 400  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
 401                       "gb": InsdcIO.GenBankIterator, 
 402                       "genbank": InsdcIO.GenBankIterator, 
 403                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
 404                       "embl": InsdcIO.EmblIterator, 
 405                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
 406                       "imgt": InsdcIO.ImgtIterator, 
 407                       "ig": IgIO.IgIterator, 
 408                       "swiss": SwissIO.SwissIterator, 
 409                       "pdb-atom": PdbIO.PdbAtomIterator, 
 410                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
 411                       "phd": PhdIO.PhdIterator, 
 412                       "ace": AceIO.AceIterator, 
 413                       "tab": TabIO.TabIterator, 
 414                       "pir": PirIO.PirIterator, 
 415                       "fastq": QualityIO.FastqPhredIterator, 
 416                       "fastq-sanger": QualityIO.FastqPhredIterator, 
 417                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
 418                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
 419                       "qual": QualityIO.QualPhredIterator, 
 420                       "sff": SffIO.SffIterator, 
 421                       # Not sure about this in the long run: 
 422                       "sff-trim": SffIO._SffTrimIterator, 
 423                       "uniprot-xml": UniprotIO.UniprotIterator, 
 424                       "seqxml": SeqXmlIO.SeqXmlIterator, 
 425                       "abi": AbiIO.AbiIterator, 
 426                       "abi-trim": AbiIO._AbiTrimIterator, 
 427                       } 
 428   
 429  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
 430                     "gb": InsdcIO.GenBankWriter, 
 431                     "genbank": InsdcIO.GenBankWriter, 
 432                     "embl": InsdcIO.EmblWriter, 
 433                     "imgt": InsdcIO.ImgtWriter, 
 434                     "tab": TabIO.TabWriter, 
 435                     "fastq": QualityIO.FastqPhredWriter, 
 436                     "fastq-sanger": QualityIO.FastqPhredWriter, 
 437                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
 438                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
 439                     "phd": PhdIO.PhdWriter, 
 440                     "qual": QualityIO.QualPhredWriter, 
 441                     "sff": SffIO.SffWriter, 
 442                     "seqxml": SeqXmlIO.SeqXmlWriter, 
 443                     } 
 444   
 445  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
 446   
 447   
448 -def write(sequences, handle, format):
449 """Write complete set of sequences to a file. 450 451 Arguments: 452 - sequences - A list (or iterator) of SeqRecord objects, or (if using 453 Biopython 1.54 or later) a single SeqRecord. 454 - handle - File handle object to write to, or filename as string 455 (note older versions of Biopython only took a handle). 456 - format - lower case string describing the file format to write. 457 458 You should close the handle after calling this function. 459 460 Returns the number of records written (as an integer). 461 """ 462 from Bio import AlignIO 463 464 # Try and give helpful error messages: 465 if not isinstance(format, basestring): 466 raise TypeError("Need a string for the file format (lower case)") 467 if not format: 468 raise ValueError("Format required (lower case string)") 469 if format != format.lower(): 470 raise ValueError("Format string '%s' should be lower case" % format) 471 472 if isinstance(handle, SeqRecord): 473 raise TypeError("Check arguments, handle should NOT be a SeqRecord") 474 if isinstance(handle, list): 475 # e.g. list of SeqRecord objects 476 raise TypeError("Check arguments, handle should NOT be a list") 477 478 if isinstance(sequences, SeqRecord): 479 # This raised an exception in older versions of Biopython 480 sequences = [sequences] 481 482 if format in _BinaryFormats: 483 mode = 'wb' 484 else: 485 mode = 'w' 486 487 with as_handle(handle, mode) as fp: 488 # Map the file format to a writer class 489 if format in _FormatToWriter: 490 writer_class = _FormatToWriter[format] 491 count = writer_class(fp).write_file(sequences) 492 elif format in AlignIO._FormatToWriter: 493 # Try and turn all the records into a single alignment, 494 # and write that using Bio.AlignIO 495 alignment = MultipleSeqAlignment(sequences) 496 alignment_count = AlignIO.write([alignment], fp, format) 497 assert alignment_count == 1, \ 498 "Internal error - the underlying writer " \ 499 " should have returned 1, not %r" % alignment_count 500 count = len(alignment) 501 del alignment_count, alignment 502 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 503 raise ValueError("Reading format '%s' is supported, but not writing" 504 % format) 505 else: 506 raise ValueError("Unknown format '%s'" % format) 507 508 assert isinstance(count, int), "Internal error - the underlying %s " \ 509 "writer should have returned the record count, not %r" \ 510 % (format, count) 511 512 return count
513 514
515 -def parse(handle, format, alphabet=None):
516 r"""Turns a sequence file into an iterator returning SeqRecords. 517 518 Arguments: 519 - handle - handle to the file, or the filename as a string 520 (note older versions of Biopython only took a handle). 521 - format - lower case string describing the file format. 522 - alphabet - optional Alphabet object, useful when the sequence type 523 cannot be automatically inferred from the file itself 524 (e.g. format="fasta" or "tab") 525 526 Typical usage, opening a file to read in, and looping over the record(s): 527 528 >>> from Bio import SeqIO 529 >>> filename = "Fasta/sweetpea.nu" 530 >>> for record in SeqIO.parse(filename, "fasta"): 531 ... print("ID %s" % record.id) 532 ... print("Sequence length %i" % len(record)) 533 ... print("Sequence alphabet %s" % record.seq.alphabet) 534 ID gi|3176602|gb|U78617.1|LOU78617 535 Sequence length 309 536 Sequence alphabet SingleLetterAlphabet() 537 538 For file formats like FASTA where the alphabet cannot be determined, it 539 may be useful to specify the alphabet explicitly: 540 541 >>> from Bio import SeqIO 542 >>> from Bio.Alphabet import generic_dna 543 >>> filename = "Fasta/sweetpea.nu" 544 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 545 ... print("ID %s" % record.id) 546 ... print("Sequence length %i" % len(record)) 547 ... print("Sequence alphabet %s" % record.seq.alphabet) 548 ID gi|3176602|gb|U78617.1|LOU78617 549 Sequence length 309 550 Sequence alphabet DNAAlphabet() 551 552 If you have a string 'data' containing the file contents, you must 553 first turn this into a handle in order to parse it: 554 555 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 556 >>> from Bio import SeqIO 557 >>> try: 558 ... from StringIO import StringIO # Python 2 559 ... except ImportError: 560 ... from io import StringIO # Python 3 561 ... 562 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 563 ... print("%s %s" % (record.id, record.seq)) 564 Alpha ACCGGATGTA 565 Beta AGGCTCGGTTA 566 567 Use the Bio.SeqIO.read(...) function when you expect a single record 568 only. 569 """ 570 # NOTE - The above docstring has some raw \n characters needed 571 # for the StringIO example, hence the whole docstring is in raw 572 # string mode (see the leading r before the opening quote). 573 from Bio import AlignIO 574 575 # Hack for SFF, will need to make this more general in future 576 if format in _BinaryFormats: 577 mode = 'rb' 578 else: 579 mode = 'rU' 580 581 # Try and give helpful error messages: 582 if not isinstance(format, basestring): 583 raise TypeError("Need a string for the file format (lower case)") 584 if not format: 585 raise ValueError("Format required (lower case string)") 586 if format != format.lower(): 587 raise ValueError("Format string '%s' should be lower case" % format) 588 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 589 isinstance(alphabet, AlphabetEncoder)): 590 raise ValueError("Invalid alphabet, %r" % alphabet) 591 592 with as_handle(handle, mode) as fp: 593 # Map the file format to a sequence iterator: 594 if format in _FormatToIterator: 595 iterator_generator = _FormatToIterator[format] 596 if alphabet is None: 597 i = iterator_generator(fp) 598 else: 599 try: 600 i = iterator_generator(fp, alphabet=alphabet) 601 except TypeError: 602 i = _force_alphabet(iterator_generator(fp), alphabet) 603 elif format in AlignIO._FormatToIterator: 604 # Use Bio.AlignIO to read in the alignments 605 i = (r for alignment in AlignIO.parse(fp, format, 606 alphabet=alphabet) 607 for r in alignment) 608 else: 609 raise ValueError("Unknown format '%s'" % format) 610 # This imposes some overhead... wait until we drop Python 2.4 to fix it 611 for r in i: 612 yield r
613 614
615 -def _force_alphabet(record_iterator, alphabet):
616 """Iterate over records, over-riding the alphabet (PRIVATE).""" 617 # Assume the alphabet argument has been pre-validated 618 given_base_class = _get_base_alphabet(alphabet).__class__ 619 for record in record_iterator: 620 if isinstance(_get_base_alphabet(record.seq.alphabet), 621 given_base_class): 622 record.seq.alphabet = alphabet 623 yield record 624 else: 625 raise ValueError("Specified alphabet %r clashes with " 626 "that determined from the file, %r" 627 % (alphabet, record.seq.alphabet))
628 629
630 -def read(handle, format, alphabet=None):
631 """Turns a sequence file into a single SeqRecord. 632 633 Arguments: 634 - handle - handle to the file, or the filename as a string 635 (note older versions of Biopython only took a handle). 636 - format - string describing the file format. 637 - alphabet - optional Alphabet object, useful when the sequence type 638 cannot be automatically inferred from the file itself 639 (e.g. format="fasta" or "tab") 640 641 This function is for use parsing sequence files containing 642 exactly one record. For example, reading a GenBank file: 643 644 >>> from Bio import SeqIO 645 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 646 >>> print("ID %s" % record.id) 647 ID AC007323.5 648 >>> print("Sequence length %i" % len(record)) 649 Sequence length 86436 650 >>> print("Sequence alphabet %s" % record.seq.alphabet) 651 Sequence alphabet IUPACAmbiguousDNA() 652 653 If the handle contains no records, or more than one record, 654 an exception is raised. For example: 655 656 >>> from Bio import SeqIO 657 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 658 Traceback (most recent call last): 659 ... 660 ValueError: More than one record found in handle 661 662 If however you want the first record from a file containing 663 multiple records this function would raise an exception (as 664 shown in the example above). Instead use: 665 666 >>> from Bio import SeqIO 667 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 668 >>> print("First record's ID %s" % record.id) 669 First record's ID X55053.1 670 671 Use the Bio.SeqIO.parse(handle, format) function if you want 672 to read multiple records from the handle. 673 """ 674 iterator = parse(handle, format, alphabet) 675 try: 676 first = next(iterator) 677 except StopIteration: 678 first = None 679 if first is None: 680 raise ValueError("No records found in handle") 681 try: 682 second = next(iterator) 683 except StopIteration: 684 second = None 685 if second is not None: 686 raise ValueError("More than one record found in handle") 687 return first
688 689
690 -def to_dict(sequences, key_function=None):
691 """Turns a sequence iterator or list into a dictionary. 692 693 Arguments: 694 - sequences - An iterator that returns SeqRecord objects, 695 or simply a list of SeqRecord objects. 696 - key_function - Optional callback function which when given a 697 SeqRecord should return a unique key for the dictionary. 698 699 e.g. key_function = lambda rec : rec.name 700 or, key_function = lambda rec : rec.description.split()[0] 701 702 If key_function is omitted then record.id is used, on the assumption 703 that the records objects returned are SeqRecords with a unique id. 704 705 If there are duplicate keys, an error is raised. 706 707 Example usage, defaulting to using the record.id as key: 708 709 >>> from Bio import SeqIO 710 >>> filename = "GenBank/cor6_6.gb" 711 >>> format = "genbank" 712 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 713 >>> print(sorted(id_dict)) 714 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 715 >>> print(id_dict["L31939.1"].description) 716 Brassica rapa (clone bif72) kin mRNA, complete cds 717 718 A more complex example, using the key_function argument in order to 719 use a sequence checksum as the dictionary key: 720 721 >>> from Bio import SeqIO 722 >>> from Bio.SeqUtils.CheckSum import seguid 723 >>> filename = "GenBank/cor6_6.gb" 724 >>> format = "genbank" 725 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 726 ... key_function = lambda rec : seguid(rec.seq)) 727 >>> for key, record in sorted(seguid_dict.items()): 728 ... print("%s %s" % (key, record.id)) 729 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 730 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 731 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 732 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 733 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 734 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 735 736 This approach is not suitable for very large sets of sequences, as all 737 the SeqRecord objects are held in memory. Instead, consider using the 738 Bio.SeqIO.index() function (if it supports your particular file format). 739 """ 740 if key_function is None: 741 key_function = lambda rec: rec.id 742 743 d = dict() 744 for record in sequences: 745 key = key_function(record) 746 if key in d: 747 raise ValueError("Duplicate key '%s'" % key) 748 d[key] = record 749 return d
750 751
752 -def index(filename, format, alphabet=None, key_function=None):
753 """Indexes a sequence file and returns a dictionary like object. 754 755 Arguments: 756 - filename - string giving name of file to be indexed 757 - format - lower case string describing the file format 758 - alphabet - optional Alphabet object, useful when the sequence type 759 cannot be automatically inferred from the file itself 760 (e.g. format="fasta" or "tab") 761 - key_function - Optional callback function which when given a 762 SeqRecord identifier string should return a unique key for the 763 dictionary. 764 765 This indexing function will return a dictionary like object, giving the 766 SeqRecord objects as values: 767 768 >>> from Bio import SeqIO 769 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 770 >>> len(records) 771 3 772 >>> sorted(records) 773 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 774 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 775 >EAS54_6_R1_2_1_540_792 776 TTGGCAGGCCAAGGCCGATGGATCA 777 <BLANKLINE> 778 >>> "EAS54_6_R1_2_1_540_792" in records 779 True 780 >>> print(records.get("Missing", None)) 781 None 782 >>> records.close() 783 784 If the file is BGZF compressed, this is detected automatically. Ordinary 785 GZIP files are not supported: 786 787 >>> from Bio import SeqIO 788 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 789 >>> len(records) 790 3 791 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 792 TTGGCAGGCCAAGGCCGATGGATCA 793 >>> records.close() 794 795 Note that this pseudo dictionary will not support all the methods of a 796 true Python dictionary, for example values() is not defined since this 797 would require loading all of the records into memory at once. 798 799 When you call the index function, it will scan through the file, noting 800 the location of each record. When you access a particular record via the 801 dictionary methods, the code will jump to the appropriate part of the 802 file and then parse that section into a SeqRecord. 803 804 Note that not all the input formats supported by Bio.SeqIO can be used 805 with this index function. It is designed to work only with sequential 806 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 807 interlaced file format (e.g. alignment formats such as "clustal"). 808 809 For small files, it may be more efficient to use an in memory Python 810 dictionary, e.g. 811 812 >>> from Bio import SeqIO 813 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq")) 814 >>> len(records) 815 3 816 >>> sorted(records) 817 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 818 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 819 >EAS54_6_R1_2_1_540_792 820 TTGGCAGGCCAAGGCCGATGGATCA 821 <BLANKLINE> 822 823 As with the to_dict() function, by default the id string of each record 824 is used as the key. You can specify a callback function to transform 825 this (the record identifier string) into your preferred key. For example: 826 827 >>> from Bio import SeqIO 828 >>> def make_tuple(identifier): 829 ... parts = identifier.split("_") 830 ... return int(parts[-2]), int(parts[-1]) 831 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 832 ... key_function=make_tuple) 833 >>> len(records) 834 3 835 >>> sorted(records) 836 [(413, 324), (443, 348), (540, 792)] 837 >>> print(records[(540, 792)].format("fasta")) 838 >EAS54_6_R1_2_1_540_792 839 TTGGCAGGCCAAGGCCGATGGATCA 840 <BLANKLINE> 841 >>> (540, 792) in records 842 True 843 >>> "EAS54_6_R1_2_1_540_792" in records 844 False 845 >>> print(records.get("Missing", None)) 846 None 847 >>> records.close() 848 849 Another common use case would be indexing an NCBI style FASTA file, 850 where you might want to extract the GI number from the FASTA identifier 851 to use as the dictionary key. 852 853 Notice that unlike the to_dict() function, here the key_function does 854 not get given the full SeqRecord to use to generate the key. Doing so 855 would impose a severe performance penalty as it would require the file 856 to be completely parsed while building the index. Right now this is 857 usually avoided. 858 859 See Also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 860 861 """ 862 # Try and give helpful error messages: 863 if not isinstance(filename, basestring): 864 raise TypeError("Need a filename (not a handle)") 865 if not isinstance(format, basestring): 866 raise TypeError("Need a string for the file format (lower case)") 867 if not format: 868 raise ValueError("Format required (lower case string)") 869 if format != format.lower(): 870 raise ValueError("Format string '%s' should be lower case" % format) 871 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 872 isinstance(alphabet, AlphabetEncoder)): 873 raise ValueError("Invalid alphabet, %r" % alphabet) 874 875 # Map the file format to a sequence iterator: 876 from ._index import _FormatToRandomAccess # Lazy import 877 from Bio.File import _IndexedSeqFileDict 878 try: 879 proxy_class = _FormatToRandomAccess[format] 880 except KeyError: 881 raise ValueError("Unsupported format %r" % format) 882 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 883 % (filename, format, alphabet, key_function) 884 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 885 key_function, repr, "SeqRecord")
886 887
888 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 889 key_function=None):
890 """Index several sequence files and return a dictionary like object. 891 892 The index is stored in an SQLite database rather than in memory (as in the 893 Bio.SeqIO.index(...) function). 894 895 Arguments: 896 - index_filename - Where to store the SQLite index 897 - filenames - list of strings specifying file(s) to be indexed, or when 898 indexing a single file this can be given as a string. 899 (optional if reloading an existing index, but must match) 900 - format - lower case string describing the file format 901 (optional if reloading an existing index, but must match) 902 - alphabet - optional Alphabet object, useful when the sequence type 903 cannot be automatically inferred from the file itself 904 (e.g. format="fasta" or "tab") 905 - key_function - Optional callback function which when given a 906 SeqRecord identifier string should return a unique 907 key for the dictionary. 908 909 This indexing function will return a dictionary like object, giving the 910 SeqRecord objects as values: 911 912 >>> from Bio.Alphabet import generic_protein 913 >>> from Bio import SeqIO 914 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 915 >>> def get_gi(name): 916 ... parts = name.split("|") 917 ... i = parts.index("gi") 918 ... assert i != -1 919 ... return parts[i+1] 920 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 921 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 922 >>> len(records) 923 95 924 >>> records["7525076"].description 925 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 926 >>> records["45478717"].description 927 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 928 >>> records.close() 929 930 In this example the two files contain 85 and 10 records respectively. 931 932 BGZF compressed files are supported, and detected automatically. Ordinary 933 GZIP compressed files are not supported. 934 935 See Also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module 936 glob which is useful for building lists of files. 937 938 """ 939 # Try and give helpful error messages: 940 if not isinstance(index_filename, basestring): 941 raise TypeError("Need a string for the index filename") 942 if isinstance(filenames, basestring): 943 # Make the API a little more friendly, and more similar 944 # to Bio.SeqIO.index(...) for indexing just one file. 945 filenames = [filenames] 946 if filenames is not None and not isinstance(filenames, list): 947 raise TypeError( 948 "Need a list of filenames (as strings), or one filename") 949 if format is not None and not isinstance(format, basestring): 950 raise TypeError("Need a string for the file format (lower case)") 951 if format and format != format.lower(): 952 raise ValueError("Format string '%s' should be lower case" % format) 953 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 954 isinstance(alphabet, AlphabetEncoder)): 955 raise ValueError("Invalid alphabet, %r" % alphabet) 956 957 # Map the file format to a sequence iterator: 958 from ._index import _FormatToRandomAccess # Lazy import 959 from Bio.File import _SQLiteManySeqFilesDict 960 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 961 % (index_filename, filenames, format, alphabet, key_function) 962 963 def proxy_factory(format, filename=None): 964 """Given a filename returns proxy object, else boolean if format OK.""" 965 if filename: 966 return _FormatToRandomAccess[format](filename, format, alphabet) 967 else: 968 return format in _FormatToRandomAccess
969 970 return _SQLiteManySeqFilesDict(index_filename, filenames, 971 proxy_factory, format, 972 key_function, repr) 973 974
975 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
976 """Convert between two sequence file formats, return number of records. 977 978 Arguments: 979 - in_file - an input handle or filename 980 - in_format - input file format, lower case string 981 - out_file - an output handle or filename 982 - out_format - output file format, lower case string 983 - alphabet - optional alphabet to assume 984 985 **NOTE** - If you provide an output filename, it will be opened which will 986 overwrite any existing file without warning. This may happen if even 987 the conversion is aborted (e.g. an invalid out_format name is given). 988 989 For example, going from a filename to a handle: 990 991 >>> from Bio import SeqIO 992 >>> try: 993 ... from StringIO import StringIO # Python 2 994 ... except ImportError: 995 ... from io import StringIO # Python 3 996 ... 997 >>> handle = StringIO("") 998 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 999 3 1000 >>> print(handle.getvalue()) 1001 >EAS54_6_R1_2_1_413_324 1002 CCCTTCTTGTCTTCAGCGTTTCTCC 1003 >EAS54_6_R1_2_1_540_792 1004 TTGGCAGGCCAAGGCCGATGGATCA 1005 >EAS54_6_R1_2_1_443_348 1006 GTTGCTTCTGGCGTGGGTGGGGGGG 1007 <BLANKLINE> 1008 """ 1009 # Hack for SFF, will need to make this more general in future 1010 if in_format in _BinaryFormats: 1011 in_mode = 'rb' 1012 else: 1013 in_mode = 'rU' 1014 1015 # Don't open the output file until we've checked the input is OK? 1016 if out_format in ["sff", "sff_trim"]: 1017 out_mode = 'wb' 1018 else: 1019 out_mode = 'w' 1020 1021 # This will check the arguments and issue error messages, 1022 # after we have opened the file which is a shame. 1023 from ._convert import _handle_convert # Lazy import 1024 with as_handle(in_file, in_mode) as in_handle: 1025 with as_handle(out_file, out_mode) as out_handle: 1026 count = _handle_convert(in_handle, in_format, 1027 out_handle, out_format, 1028 alphabet) 1029 return count
1030 1031 1032 # This helpful trick for testing no longer works with the 1033 # local imports :( 1034 # 1035 # if __name__ == "__main__": 1036 # from Bio._utils import run_doctest 1037 # run_doctest() 1038