Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

   1  # Copyright 2006-2017 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # Nice link: 
   7  # http://www.ebi.ac.uk/help/formats_frame.html 
   8   
   9  r"""Sequence input/output as SeqRecord objects. 
  10   
  11  Bio.SeqIO is also documented at SeqIO_ and by a whole chapter in our tutorial: 
  12   
  13    - `HTML Tutorial`_ 
  14    - `PDF Tutorial`_ 
  15   
  16  .. _SeqIO: http://biopython.org/wiki/SeqIO 
  17  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
  18  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
  19   
  20  Input 
  21  ----- 
  22  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
  23  (or in recent versions of Biopython alternatively a filename as a string), 
  24  and format string.  This returns an iterator giving SeqRecord objects: 
  25   
  26  >>> from Bio import SeqIO 
  27  >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
  28  ...     print("%s %i" % (record.id, len(record))) 
  29  gi|1348912|gb|G26680|G26680 633 
  30  gi|1348917|gb|G26685|G26685 413 
  31  gi|1592936|gb|G29385|G29385 471 
  32   
  33  Note that the parse() function will invoke the relevant parser for the 
  34  format with its default settings.  You may want more control, in which case 
  35  you need to create a format specific sequence iterator directly. 
  36   
  37  Some of these parsers are wrappers around low-level parsers which build up 
  38  SeqRecord objects for the consistent SeqIO interface. In cases where the 
  39  run-time is critical, such as large FASTA or FASTQ files, calling these 
  40  underlying parsers will be much faster - in this case these generator 
  41  functions which return tuples of strings: 
  42   
  43  >>> from Bio.SeqIO.FastaIO import SimpleFastaParser 
  44  >>> from Bio.SeqIO.QualityIO import FastqGeneralIterator 
  45   
  46   
  47  Input - Single Records 
  48  ---------------------- 
  49  If you expect your file to contain one-and-only-one record, then we provide 
  50  the following 'helper' function which will return a single SeqRecord, or 
  51  raise an exception if there are no records or more than one record: 
  52   
  53  >>> from Bio import SeqIO 
  54  >>> record = SeqIO.read("Fasta/f001", "fasta") 
  55  >>> print("%s %i" % (record.id, len(record))) 
  56  gi|3318709|pdb|1A91| 79 
  57   
  58  This style is useful when you expect a single record only (and would 
  59  consider multiple records an error).  For example, when dealing with GenBank 
  60  files for bacterial genomes or chromosomes, there is normally only a single 
  61  record.  Alternatively, use this with a handle when downloading a single 
  62  record from the internet. 
  63   
  64  However, if you just want the first record from a file containing multiple 
  65  record, use the next() function on the iterator (or under Python 2, the 
  66  iterator's next() method): 
  67   
  68  >>> from Bio import SeqIO 
  69  >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
  70  >>> print("%s %i" % (record.id, len(record))) 
  71  gi|1348912|gb|G26680|G26680 633 
  72   
  73  The above code will work as long as the file contains at least one record. 
  74  Note that if there is more than one record, the remaining records will be 
  75  silently ignored. 
  76   
  77   
  78  Input - Multiple Records 
  79  ------------------------ 
  80  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
  81  using a sequence iterator can save you a lot of memory (RAM).  There is 
  82  less benefit for interlaced file formats (e.g. most multiple alignment file 
  83  formats).  However, an iterator only lets you access the records one by one. 
  84   
  85  If you want random access to the records by number, turn this into a list: 
  86   
  87  >>> from Bio import SeqIO 
  88  >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
  89  >>> len(records) 
  90  3 
  91  >>> print(records[1].id) 
  92  gi|1348917|gb|G26685|G26685 
  93   
  94  If you want random access to the records by a key such as the record id, 
  95  turn the iterator into a dictionary: 
  96   
  97  >>> from Bio import SeqIO 
  98  >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
  99  >>> len(record_dict) 
 100  3 
 101  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 102  413 
 103   
 104  However, using list() or the to_dict() function will load all the records 
 105  into memory at once, and therefore is not possible on very large files. 
 106  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
 107  providing dictionary like access to any record. For example, 
 108   
 109  >>> from Bio import SeqIO 
 110  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 111  >>> len(record_dict) 
 112  3 
 113  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 114  413 
 115  >>> record_dict.close() 
 116   
 117  Many but not all of the supported input file formats can be indexed like 
 118  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
 119  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
 120   
 121  In most cases you can also use SeqIO.index to get the record from the file 
 122  as a raw string (not a SeqRecord). This can be useful for example to extract 
 123  a sub-set of records from a file where SeqIO cannot output the file format 
 124  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
 125  keep the output 100% identical to the input). For example, 
 126   
 127  >>> from Bio import SeqIO 
 128  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 129  >>> len(record_dict) 
 130  3 
 131  >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
 132  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 133  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
 134  ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
 135  GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
 136  GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
 137  TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
 138  AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 139  <BLANKLINE> 
 140  >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
 141  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 142  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
 143  TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
 144  TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
 145  ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
 146  AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
 147  TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
 148  CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 149  <BLANKLINE> 
 150  >>> record_dict.close() 
 151   
 152  Here the original file and what Biopython would output differ in the line 
 153  wrapping. Also note that under Python 3, the get_raw method will return a 
 154  bytes string, hence the use of decode to turn it into a (unicode) string. 
 155  This is unnecessary on Python 2. 
 156   
 157  Also note that the get_raw method will preserve the newline endings. This 
 158  example FASTQ file uses Unix style endings (b"\n" only), 
 159   
 160  >>> from Bio import SeqIO 
 161  >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
 162  >>> len(fastq_dict) 
 163  3 
 164  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 165  >>> raw.count(b"\n") 
 166  4 
 167  >>> raw.count(b"\r\n") 
 168  0 
 169  >>> b"\r" in raw 
 170  False 
 171  >>> len(raw) 
 172  78 
 173  >>> fastq_dict.close() 
 174   
 175  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
 176   
 177  >>> from Bio import SeqIO 
 178  >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
 179  >>> len(fastq_dict) 
 180  3 
 181  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 182  >>> raw.count(b"\n") 
 183  4 
 184  >>> raw.count(b"\r\n") 
 185  4 
 186  >>> b"\r\n" in raw 
 187  True 
 188  >>> len(raw) 
 189  82 
 190  >>> fastq_dict.close() 
 191   
 192  Because this uses two bytes for each new line, the file is longer than 
 193  the Unix equivalent with only one byte. 
 194   
 195   
 196  Input - Alignments 
 197  ------------------ 
 198  You can read in alignment files as alignment objects using Bio.AlignIO. 
 199  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
 200  you a SeqRecord for each row of each alignment: 
 201   
 202  >>> from Bio import SeqIO 
 203  >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
 204  ...     print("%s %i" % (record.id, len(record))) 
 205  gi|167877390|gb|EDS40773.1| 447 
 206  gi|167234445|ref|NP_001107837. 447 
 207  gi|74100009|gb|AAZ99217.1| 447 
 208  gi|13990994|dbj|BAA33523.2| 447 
 209  gi|56122354|gb|AAV74328.1| 447 
 210   
 211   
 212  Output 
 213  ------ 
 214  Use the function Bio.SeqIO.write(...), which takes a complete set of 
 215  SeqRecord objects (either as a list, or an iterator), an output file handle 
 216  (or in recent versions of Biopython an output filename as a string) and of 
 217  course the file format:: 
 218   
 219    from Bio import SeqIO 
 220    records = ... 
 221    SeqIO.write(records, "example.faa", "fasta") 
 222   
 223  Or, using a handle:: 
 224   
 225      from Bio import SeqIO 
 226      records = ... 
 227      with open("example.faa", "w") as handle: 
 228        SeqIO.write(records, handle, "fasta") 
 229   
 230  You are expected to call this function once (with all your records) and if 
 231  using a handle, make sure you close it to flush the data to the hard disk. 
 232   
 233   
 234  Output - Advanced 
 235  ----------------- 
 236  The effect of calling write() multiple times on a single file will vary 
 237  depending on the file format, and is best avoided unless you have a strong 
 238  reason to do so. 
 239   
 240  If you give a filename, then each time you call write() the existing file 
 241  will be overwritten. For sequential files formats (e.g. fasta, genbank) each 
 242  "record block" holds a single sequence.  For these files it would probably 
 243  be safe to call write() multiple times by re-using the same handle. 
 244   
 245  However, trying this for certain alignment formats (e.g. phylip, clustal, 
 246  stockholm) would have the effect of concatenating several multiple sequence 
 247  alignments together.  Such files are created by the PHYLIP suite of programs 
 248  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
 249   
 250  Worse, many fileformats have an explicit header and/or footer structure 
 251  (e.g. any XMl format, and most binary file formats like SFF). Here making 
 252  multiple calls to write() will result in an invalid file. 
 253   
 254   
 255  Conversion 
 256  ---------- 
 257  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
 258  file format conversions. Additionally, it may use file format specific 
 259  optimisations so this should be the fastest way too. 
 260   
 261  In general however, you can combine the Bio.SeqIO.parse(...) function with 
 262  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
 263  generator expressions or generator functions provides a memory efficient way 
 264  to perform filtering or other extra operations as part of the process. 
 265   
 266   
 267  File Formats 
 268  ------------ 
 269  When specifying the file format, use lowercase strings.  The same format 
 270  names are also used in Bio.AlignIO and include the following: 
 271   
 272      - abi     - Applied Biosystem's sequencing trace format 
 273      - abi-trim - Same as "abi" but with quality trimming with Mott's algorithm 
 274      - ace     - Reads the contig sequences from an ACE assembly file. 
 275      - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
 276      - fasta   - The generic sequence file format where each record starts with 
 277        an identifer line starting with a ">" character, followed by 
 278        lines of sequence. 
 279      - fasta-2line - Stricter interpretation of the FASTA format using exactly 
 280        two lines per record (no line wrapping). 
 281      - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
 282        sequence quality values (with an ASCII offset of 33). 
 283      - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
 284      - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
 285        encodes Solexa quality scores (not PHRED quality scores) with an 
 286        ASCII offset of 64. 
 287      - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
 288        which encodes PHRED quality scores with an ASCII offset of 64 
 289        (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
 290        will produce FASTQ files using the standard Sanger encoding. 
 291      - genbank - The GenBank or GenPept flat file format. 
 292      - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
 293      - ig      - The IntelliGenetics file format, apparently the same as the 
 294        MASE alignment format. 
 295      - imgt    - An EMBL like format from IMGT where the feature tables are more 
 296        indented to allow for longer feature types. 
 297      - pdb-seqres -  Reads a Protein Data Bank (PDB) file to determine the 
 298        complete protein sequence as it appears in the header (no dependencies). 
 299      - pdb-atom - Uses Bio.PDB to determine the (partial) protein sequence as 
 300        it appears in the structure based on the atom coordinate section of the 
 301        file (requires NumPy for Bio.PDB). 
 302      - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
 303      - pir     - A "FASTA like" format introduced by the National Biomedical 
 304        Research Foundation (NBRF) for the Protein Information Resource 
 305        (PIR) database, now part of UniProt. 
 306      - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
 307      - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
 308      - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
 309      - swiss   - Plain text Swiss-Prot aka UniProt format. 
 310      - tab     - Simple two column tab separated sequence files, where each 
 311        line holds a record's identifier and sequence. For example, 
 312        this is used as by Aligent's eArray software when saving 
 313        microarray probes in a minimal tab delimited text file. 
 314      - qual    - A "FASTA like" format holding PHRED quality values from 
 315        sequencing DNA, but no actual sequences (usually provided 
 316        in separate FASTA files). 
 317      - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
 318        text format which we call "swiss") 
 319   
 320  Note that while Bio.SeqIO can read all the above file formats, it cannot 
 321  write to all of them. 
 322   
 323  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
 324  "phylip" and "stockholm", which gives you access to the individual sequences 
 325  making up each alignment as SeqRecords. 
 326  """ 
 327   
 328  from __future__ import print_function 
 329  from Bio._py3k import basestring 
 330   
 331  # TODO 
 332  # - define policy on reading aligned sequences with gaps in 
 333  #   (e.g. - and . characters) including how the alphabet interacts 
 334  # 
 335  # - How best to handle unique/non unique record.id when writing. 
 336  #   For most file formats reading such files is fine; The stockholm 
 337  #   parser would fail. 
 338  # 
 339  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
 340  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
 341  # 
 342  # FAO BioPython Developers 
 343  # ------------------------ 
 344  # The way I envision this SeqIO system working as that for any sequence file 
 345  # format we have an iterator that returns SeqRecord objects. 
 346  # 
 347  # This also applies to interlaced file formats (like clustal - although that 
 348  # is now handled via Bio.AlignIO instead) where the file cannot be read record 
 349  # by record.  You should still return an iterator, even if the implementation 
 350  # could just as easily return a list. 
 351  # 
 352  # These file format specific sequence iterators may be implemented as: 
 353  #    - Classes which take a handle for __init__ and provide the __iter__ method 
 354  #    - Functions that take a handle, and return an iterator object 
 355  #    - Generator functions that take a handle, and yield SeqRecord objects 
 356  # 
 357  # It is then trivial to turn this iterator into a list of SeqRecord objects, 
 358  # an in memory dictionary, or a multiple sequence alignment object. 
 359  # 
 360  # For building the dictionary by default the id property of each SeqRecord is 
 361  # used as the key.  You should always populate the id property, and it should 
 362  # be unique in most cases. For some file formats the accession number is a good 
 363  # choice.  If the file itself contains ambiguous identifiers, don't try and 
 364  # dis-ambiguate them - return them as is. 
 365  # 
 366  # When adding a new file format, please use the same lower case format name 
 367  # as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
 368  # 
 369  # See also http://biopython.org/wiki/SeqIO_dev 
 370  # 
 371  # --Peter 
 372   
 373  from Bio.File import as_handle 
 374  from Bio.SeqRecord import SeqRecord 
 375  from Bio.Align import MultipleSeqAlignment 
 376  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
 377   
 378  from . import AbiIO 
 379  from . import AceIO 
 380  from . import FastaIO 
 381  from . import IgIO  # IntelliGenetics or MASE format 
 382  from . import InsdcIO  # EMBL and GenBank 
 383  from . import PdbIO 
 384  from . import PhdIO 
 385  from . import PirIO 
 386  from . import SeqXmlIO 
 387  from . import SffIO 
 388  from . import SwissIO 
 389  from . import TabIO 
 390  from . import QualityIO  # FastQ and qual files 
 391  from . import UniprotIO 
 392   
 393   
 394  # Convention for format names is "mainname-subtype" in lower case. 
 395  # Please use the same names as BioPerl or EMBOSS where possible. 
 396  # 
 397  # Note that this simple system copes with defining 
 398  # multiple possible iterators for a given format/extension 
 399  # with the -subtype suffix 
 400  # 
 401  # Most alignment file formats will be handled via Bio.AlignIO 
 402   
 403  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
 404                       "fasta-2line": FastaIO.FastaTwoLineIterator, 
 405                       "gb": InsdcIO.GenBankIterator, 
 406                       "genbank": InsdcIO.GenBankIterator, 
 407                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
 408                       "embl": InsdcIO.EmblIterator, 
 409                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
 410                       "imgt": InsdcIO.ImgtIterator, 
 411                       "ig": IgIO.IgIterator, 
 412                       "swiss": SwissIO.SwissIterator, 
 413                       "pdb-atom": PdbIO.PdbAtomIterator, 
 414                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
 415                       "phd": PhdIO.PhdIterator, 
 416                       "ace": AceIO.AceIterator, 
 417                       "tab": TabIO.TabIterator, 
 418                       "pir": PirIO.PirIterator, 
 419                       "fastq": QualityIO.FastqPhredIterator, 
 420                       "fastq-sanger": QualityIO.FastqPhredIterator, 
 421                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
 422                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
 423                       "qual": QualityIO.QualPhredIterator, 
 424                       "sff": SffIO.SffIterator, 
 425                       # Not sure about this in the long run: 
 426                       "sff-trim": SffIO._SffTrimIterator, 
 427                       "uniprot-xml": UniprotIO.UniprotIterator, 
 428                       "seqxml": SeqXmlIO.SeqXmlIterator, 
 429                       "abi": AbiIO.AbiIterator, 
 430                       "abi-trim": AbiIO._AbiTrimIterator, 
 431                       } 
 432   
 433  _FormatToString = { 
 434      "fasta": FastaIO.as_fasta, 
 435      "fasta-2line": FastaIO.as_fasta_2line, 
 436      "tab": TabIO.as_tab, 
 437      "fastq": QualityIO.as_fastq, 
 438      "fastq-sanger": QualityIO.as_fastq, 
 439      "fastq-solexa": QualityIO.as_fastq_solexa, 
 440      "fastq-illumina": QualityIO.as_fastq_illumina, 
 441      "qual": QualityIO.as_qual, 
 442  } 
 443   
 444  # This could exclude file formats covered by _FormatToString? 
 445  # Right now used in the unit tests as proxy for all supported outputs... 
 446  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
 447                     "fasta-2line": FastaIO.FastaTwoLineWriter, 
 448                     "gb": InsdcIO.GenBankWriter, 
 449                     "genbank": InsdcIO.GenBankWriter, 
 450                     "embl": InsdcIO.EmblWriter, 
 451                     "imgt": InsdcIO.ImgtWriter, 
 452                     "tab": TabIO.TabWriter, 
 453                     "fastq": QualityIO.FastqPhredWriter, 
 454                     "fastq-sanger": QualityIO.FastqPhredWriter, 
 455                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
 456                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
 457                     "phd": PhdIO.PhdWriter, 
 458                     "qual": QualityIO.QualPhredWriter, 
 459                     "sff": SffIO.SffWriter, 
 460                     "seqxml": SeqXmlIO.SeqXmlWriter, 
 461                     "pir": PirIO.PirWriter, 
 462                     } 
 463   
 464  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim", "seqxml"] 
 465   
 466   
467 -def write(sequences, handle, format):
468 """Write complete set of sequences to a file. 469 470 Arguments: 471 - sequences - A list (or iterator) of SeqRecord objects, or (if using 472 Biopython 1.54 or later) a single SeqRecord. 473 - handle - File handle object to write to, or filename as string 474 (note older versions of Biopython only took a handle). 475 - format - lower case string describing the file format to write. 476 477 Note if providing a file handle, your code should close the handle 478 after calling this function (to ensure the data gets flushed to disk). 479 480 Returns the number of records written (as an integer). 481 """ 482 from Bio import AlignIO 483 484 # Try and give helpful error messages: 485 if not isinstance(format, basestring): 486 raise TypeError("Need a string for the file format (lower case)") 487 if not format: 488 raise ValueError("Format required (lower case string)") 489 if format != format.lower(): 490 raise ValueError("Format string '%s' should be lower case" % format) 491 492 if isinstance(handle, SeqRecord): 493 raise TypeError("Check arguments, handle should NOT be a SeqRecord") 494 if isinstance(handle, list): 495 # e.g. list of SeqRecord objects 496 raise TypeError("Check arguments, handle should NOT be a list") 497 498 if isinstance(sequences, SeqRecord): 499 # This raised an exception in older versions of Biopython 500 sequences = [sequences] 501 502 if format in _BinaryFormats: 503 mode = 'wb' 504 else: 505 mode = 'w' 506 507 with as_handle(handle, mode) as fp: 508 # Map the file format to a writer function/class 509 if format in _FormatToString: 510 format_function = _FormatToString[format] 511 count = 0 512 for record in sequences: 513 fp.write(format_function(record)) 514 count += 1 515 elif format in _FormatToWriter: 516 writer_class = _FormatToWriter[format] 517 count = writer_class(fp).write_file(sequences) 518 elif format in AlignIO._FormatToWriter: 519 # Try and turn all the records into a single alignment, 520 # and write that using Bio.AlignIO 521 alignment = MultipleSeqAlignment(sequences) 522 alignment_count = AlignIO.write([alignment], fp, format) 523 assert alignment_count == 1, \ 524 "Internal error - the underlying writer " \ 525 " should have returned 1, not %r" % alignment_count 526 count = len(alignment) 527 del alignment_count, alignment 528 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 529 raise ValueError("Reading format '%s' is supported, but not writing" 530 % format) 531 else: 532 raise ValueError("Unknown format '%s'" % format) 533 534 assert isinstance(count, int), "Internal error - the underlying %s " \ 535 "writer should have returned the record count, not %r" \ 536 % (format, count) 537 538 return count
539 540
541 -def parse(handle, format, alphabet=None):
542 r"""Turn a sequence file into an iterator returning SeqRecords. 543 544 Arguments: 545 - handle - handle to the file, or the filename as a string 546 (note older versions of Biopython only took a handle). 547 - format - lower case string describing the file format. 548 - alphabet - optional Alphabet object, useful when the sequence type 549 cannot be automatically inferred from the file itself 550 (e.g. format="fasta" or "tab") 551 552 Typical usage, opening a file to read in, and looping over the record(s): 553 554 >>> from Bio import SeqIO 555 >>> filename = "Fasta/sweetpea.nu" 556 >>> for record in SeqIO.parse(filename, "fasta"): 557 ... print("ID %s" % record.id) 558 ... print("Sequence length %i" % len(record)) 559 ... print("Sequence alphabet %s" % record.seq.alphabet) 560 ID gi|3176602|gb|U78617.1|LOU78617 561 Sequence length 309 562 Sequence alphabet SingleLetterAlphabet() 563 564 For file formats like FASTA where the alphabet cannot be determined, it 565 may be useful to specify the alphabet explicitly: 566 567 >>> from Bio import SeqIO 568 >>> from Bio.Alphabet import generic_dna 569 >>> filename = "Fasta/sweetpea.nu" 570 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 571 ... print("ID %s" % record.id) 572 ... print("Sequence length %i" % len(record)) 573 ... print("Sequence alphabet %s" % record.seq.alphabet) 574 ID gi|3176602|gb|U78617.1|LOU78617 575 Sequence length 309 576 Sequence alphabet DNAAlphabet() 577 578 If you have a string 'data' containing the file contents, you must 579 first turn this into a handle in order to parse it: 580 581 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 582 >>> from Bio import SeqIO 583 >>> try: 584 ... from StringIO import StringIO # Python 2 585 ... except ImportError: 586 ... from io import StringIO # Python 3 587 ... 588 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 589 ... print("%s %s" % (record.id, record.seq)) 590 Alpha ACCGGATGTA 591 Beta AGGCTCGGTTA 592 593 Use the Bio.SeqIO.read(...) function when you expect a single record 594 only. 595 """ 596 # NOTE - The above docstring has some raw \n characters needed 597 # for the StringIO example, hence the whole docstring is in raw 598 # string mode (see the leading r before the opening quote). 599 from Bio import AlignIO 600 601 # Hack for SFF, will need to make this more general in future 602 if format in _BinaryFormats: 603 mode = 'rb' 604 else: 605 mode = 'rU' 606 607 # Try and give helpful error messages: 608 if not isinstance(format, basestring): 609 raise TypeError("Need a string for the file format (lower case)") 610 if not format: 611 raise ValueError("Format required (lower case string)") 612 if format != format.lower(): 613 raise ValueError("Format string '%s' should be lower case" % format) 614 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 615 isinstance(alphabet, AlphabetEncoder)): 616 raise ValueError("Invalid alphabet, %r" % alphabet) 617 618 with as_handle(handle, mode) as fp: 619 # Map the file format to a sequence iterator: 620 if format in _FormatToIterator: 621 iterator_generator = _FormatToIterator[format] 622 if alphabet is None: 623 i = iterator_generator(fp) 624 else: 625 try: 626 i = iterator_generator(fp, alphabet=alphabet) 627 except TypeError: 628 i = _force_alphabet(iterator_generator(fp), alphabet) 629 elif format in AlignIO._FormatToIterator: 630 # Use Bio.AlignIO to read in the alignments 631 i = (r for alignment in AlignIO.parse(fp, format, 632 alphabet=alphabet) 633 for r in alignment) 634 else: 635 raise ValueError("Unknown format '%s'" % format) 636 # This imposes some overhead... wait until we drop Python 2.4 to fix it 637 for r in i: 638 yield r
639 640
641 -def _force_alphabet(record_iterator, alphabet):
642 """Iterate over records, over-riding the alphabet (PRIVATE).""" 643 # Assume the alphabet argument has been pre-validated 644 given_base_class = _get_base_alphabet(alphabet).__class__ 645 for record in record_iterator: 646 if isinstance(_get_base_alphabet(record.seq.alphabet), 647 given_base_class): 648 record.seq.alphabet = alphabet 649 yield record 650 else: 651 raise ValueError("Specified alphabet %r clashes with " 652 "that determined from the file, %r" 653 % (alphabet, record.seq.alphabet))
654 655
656 -def read(handle, format, alphabet=None):
657 """Turn a sequence file into a single SeqRecord. 658 659 Arguments: 660 - handle - handle to the file, or the filename as a string 661 (note older versions of Biopython only took a handle). 662 - format - string describing the file format. 663 - alphabet - optional Alphabet object, useful when the sequence type 664 cannot be automatically inferred from the file itself 665 (e.g. format="fasta" or "tab") 666 667 This function is for use parsing sequence files containing 668 exactly one record. For example, reading a GenBank file: 669 670 >>> from Bio import SeqIO 671 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 672 >>> print("ID %s" % record.id) 673 ID AC007323.5 674 >>> print("Sequence length %i" % len(record)) 675 Sequence length 86436 676 >>> print("Sequence alphabet %s" % record.seq.alphabet) 677 Sequence alphabet IUPACAmbiguousDNA() 678 679 If the handle contains no records, or more than one record, 680 an exception is raised. For example: 681 682 >>> from Bio import SeqIO 683 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 684 Traceback (most recent call last): 685 ... 686 ValueError: More than one record found in handle 687 688 If however you want the first record from a file containing 689 multiple records this function would raise an exception (as 690 shown in the example above). Instead use: 691 692 >>> from Bio import SeqIO 693 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 694 >>> print("First record's ID %s" % record.id) 695 First record's ID X55053.1 696 697 Use the Bio.SeqIO.parse(handle, format) function if you want 698 to read multiple records from the handle. 699 """ 700 iterator = parse(handle, format, alphabet) 701 try: 702 first = next(iterator) 703 except StopIteration: 704 first = None 705 if first is None: 706 raise ValueError("No records found in handle") 707 try: 708 second = next(iterator) 709 except StopIteration: 710 second = None 711 if second is not None: 712 raise ValueError("More than one record found in handle") 713 return first
714 715
716 -def to_dict(sequences, key_function=None):
717 """Turn a sequence iterator or list into a dictionary. 718 719 Arguments: 720 - sequences - An iterator that returns SeqRecord objects, 721 or simply a list of SeqRecord objects. 722 - key_function - Optional callback function which when given a 723 SeqRecord should return a unique key for the dictionary. 724 725 e.g. key_function = lambda rec : rec.name 726 or, key_function = lambda rec : rec.description.split()[0] 727 728 If key_function is omitted then record.id is used, on the assumption 729 that the records objects returned are SeqRecords with a unique id. 730 731 If there are duplicate keys, an error is raised. 732 733 Example usage, defaulting to using the record.id as key: 734 735 >>> from Bio import SeqIO 736 >>> filename = "GenBank/cor6_6.gb" 737 >>> format = "genbank" 738 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 739 >>> print(sorted(id_dict)) 740 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 741 >>> print(id_dict["L31939.1"].description) 742 Brassica rapa (clone bif72) kin mRNA, complete cds 743 744 A more complex example, using the key_function argument in order to 745 use a sequence checksum as the dictionary key: 746 747 >>> from Bio import SeqIO 748 >>> from Bio.SeqUtils.CheckSum import seguid 749 >>> filename = "GenBank/cor6_6.gb" 750 >>> format = "genbank" 751 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 752 ... key_function = lambda rec : seguid(rec.seq)) 753 >>> for key, record in sorted(seguid_dict.items()): 754 ... print("%s %s" % (key, record.id)) 755 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 756 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 757 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 758 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 759 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 760 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 761 762 This approach is not suitable for very large sets of sequences, as all 763 the SeqRecord objects are held in memory. Instead, consider using the 764 Bio.SeqIO.index() function (if it supports your particular file format). 765 """ 766 if key_function is None: 767 key_function = lambda rec: rec.id 768 769 d = dict() 770 for record in sequences: 771 key = key_function(record) 772 if key in d: 773 raise ValueError("Duplicate key '%s'" % key) 774 d[key] = record 775 return d
776 777
778 -def index(filename, format, alphabet=None, key_function=None):
779 """Indexes a sequence file and returns a dictionary like object. 780 781 Arguments: 782 - filename - string giving name of file to be indexed 783 - format - lower case string describing the file format 784 - alphabet - optional Alphabet object, useful when the sequence type 785 cannot be automatically inferred from the file itself 786 (e.g. format="fasta" or "tab") 787 - key_function - Optional callback function which when given a 788 SeqRecord identifier string should return a unique key for the 789 dictionary. 790 791 This indexing function will return a dictionary like object, giving the 792 SeqRecord objects as values: 793 794 >>> from Bio import SeqIO 795 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 796 >>> len(records) 797 3 798 >>> sorted(records) 799 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 800 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 801 >EAS54_6_R1_2_1_540_792 802 TTGGCAGGCCAAGGCCGATGGATCA 803 <BLANKLINE> 804 >>> "EAS54_6_R1_2_1_540_792" in records 805 True 806 >>> print(records.get("Missing", None)) 807 None 808 >>> records.close() 809 810 If the file is BGZF compressed, this is detected automatically. Ordinary 811 GZIP files are not supported: 812 813 >>> from Bio import SeqIO 814 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 815 >>> len(records) 816 3 817 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 818 TTGGCAGGCCAAGGCCGATGGATCA 819 >>> records.close() 820 821 Note that this pseudo dictionary will not support all the methods of a 822 true Python dictionary, for example values() is not defined since this 823 would require loading all of the records into memory at once. 824 825 When you call the index function, it will scan through the file, noting 826 the location of each record. When you access a particular record via the 827 dictionary methods, the code will jump to the appropriate part of the 828 file and then parse that section into a SeqRecord. 829 830 Note that not all the input formats supported by Bio.SeqIO can be used 831 with this index function. It is designed to work only with sequential 832 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 833 interlaced file format (e.g. alignment formats such as "clustal"). 834 835 For small files, it may be more efficient to use an in memory Python 836 dictionary, e.g. 837 838 >>> from Bio import SeqIO 839 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq")) 840 >>> len(records) 841 3 842 >>> sorted(records) 843 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 844 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 845 >EAS54_6_R1_2_1_540_792 846 TTGGCAGGCCAAGGCCGATGGATCA 847 <BLANKLINE> 848 849 As with the to_dict() function, by default the id string of each record 850 is used as the key. You can specify a callback function to transform 851 this (the record identifier string) into your preferred key. For example: 852 853 >>> from Bio import SeqIO 854 >>> def make_tuple(identifier): 855 ... parts = identifier.split("_") 856 ... return int(parts[-2]), int(parts[-1]) 857 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 858 ... key_function=make_tuple) 859 >>> len(records) 860 3 861 >>> sorted(records) 862 [(413, 324), (443, 348), (540, 792)] 863 >>> print(records[(540, 792)].format("fasta")) 864 >EAS54_6_R1_2_1_540_792 865 TTGGCAGGCCAAGGCCGATGGATCA 866 <BLANKLINE> 867 >>> (540, 792) in records 868 True 869 >>> "EAS54_6_R1_2_1_540_792" in records 870 False 871 >>> print(records.get("Missing", None)) 872 None 873 >>> records.close() 874 875 Another common use case would be indexing an NCBI style FASTA file, 876 where you might want to extract the GI number from the FASTA identifier 877 to use as the dictionary key. 878 879 Notice that unlike the to_dict() function, here the key_function does 880 not get given the full SeqRecord to use to generate the key. Doing so 881 would impose a severe performance penalty as it would require the file 882 to be completely parsed while building the index. Right now this is 883 usually avoided. 884 885 See Also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 886 887 """ 888 # Try and give helpful error messages: 889 if not isinstance(filename, basestring): 890 raise TypeError("Need a filename (not a handle)") 891 if not isinstance(format, basestring): 892 raise TypeError("Need a string for the file format (lower case)") 893 if not format: 894 raise ValueError("Format required (lower case string)") 895 if format != format.lower(): 896 raise ValueError("Format string '%s' should be lower case" % format) 897 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 898 isinstance(alphabet, AlphabetEncoder)): 899 raise ValueError("Invalid alphabet, %r" % alphabet) 900 901 # Map the file format to a sequence iterator: 902 from ._index import _FormatToRandomAccess # Lazy import 903 from Bio.File import _IndexedSeqFileDict 904 try: 905 proxy_class = _FormatToRandomAccess[format] 906 except KeyError: 907 raise ValueError("Unsupported format %r" % format) 908 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 909 % (filename, format, alphabet, key_function) 910 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 911 key_function, repr, "SeqRecord")
912 913
914 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 915 key_function=None):
916 """Index several sequence files and return a dictionary like object. 917 918 The index is stored in an SQLite database rather than in memory (as in the 919 Bio.SeqIO.index(...) function). 920 921 Arguments: 922 - index_filename - Where to store the SQLite index 923 - filenames - list of strings specifying file(s) to be indexed, or when 924 indexing a single file this can be given as a string. 925 (optional if reloading an existing index, but must match) 926 - format - lower case string describing the file format 927 (optional if reloading an existing index, but must match) 928 - alphabet - optional Alphabet object, useful when the sequence type 929 cannot be automatically inferred from the file itself 930 (e.g. format="fasta" or "tab") 931 - key_function - Optional callback function which when given a 932 SeqRecord identifier string should return a unique 933 key for the dictionary. 934 935 This indexing function will return a dictionary like object, giving the 936 SeqRecord objects as values: 937 938 >>> from Bio.Alphabet import generic_protein 939 >>> from Bio import SeqIO 940 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 941 >>> def get_gi(name): 942 ... parts = name.split("|") 943 ... i = parts.index("gi") 944 ... assert i != -1 945 ... return parts[i+1] 946 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 947 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 948 >>> len(records) 949 95 950 >>> records["7525076"].description 951 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 952 >>> records["45478717"].description 953 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 954 >>> records.close() 955 956 In this example the two files contain 85 and 10 records respectively. 957 958 BGZF compressed files are supported, and detected automatically. Ordinary 959 GZIP compressed files are not supported. 960 961 See Also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module 962 glob which is useful for building lists of files. 963 964 """ 965 # Try and give helpful error messages: 966 if not isinstance(index_filename, basestring): 967 raise TypeError("Need a string for the index filename") 968 if isinstance(filenames, basestring): 969 # Make the API a little more friendly, and more similar 970 # to Bio.SeqIO.index(...) for indexing just one file. 971 filenames = [filenames] 972 if filenames is not None and not isinstance(filenames, list): 973 raise TypeError( 974 "Need a list of filenames (as strings), or one filename") 975 if format is not None and not isinstance(format, basestring): 976 raise TypeError("Need a string for the file format (lower case)") 977 if format and format != format.lower(): 978 raise ValueError("Format string '%s' should be lower case" % format) 979 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 980 isinstance(alphabet, AlphabetEncoder)): 981 raise ValueError("Invalid alphabet, %r" % alphabet) 982 983 # Map the file format to a sequence iterator: 984 from ._index import _FormatToRandomAccess # Lazy import 985 from Bio.File import _SQLiteManySeqFilesDict 986 repr = ("SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" 987 % (index_filename, filenames, format, alphabet, key_function)) 988 989 def proxy_factory(format, filename=None): 990 """Given a filename returns proxy object, else boolean if format OK.""" 991 if filename: 992 return _FormatToRandomAccess[format](filename, format, alphabet) 993 else: 994 return format in _FormatToRandomAccess
995 996 return _SQLiteManySeqFilesDict(index_filename, filenames, 997 proxy_factory, format, 998 key_function, repr) 999 1000
1001 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
1002 """Convert between two sequence file formats, return number of records. 1003 1004 Arguments: 1005 - in_file - an input handle or filename 1006 - in_format - input file format, lower case string 1007 - out_file - an output handle or filename 1008 - out_format - output file format, lower case string 1009 - alphabet - optional alphabet to assume 1010 1011 **NOTE** - If you provide an output filename, it will be opened which will 1012 overwrite any existing file without warning. This may happen if even 1013 the conversion is aborted (e.g. an invalid out_format name is given). 1014 1015 For example, going from a filename to a handle: 1016 1017 >>> from Bio import SeqIO 1018 >>> try: 1019 ... from StringIO import StringIO # Python 2 1020 ... except ImportError: 1021 ... from io import StringIO # Python 3 1022 ... 1023 >>> handle = StringIO("") 1024 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 1025 3 1026 >>> print(handle.getvalue()) 1027 >EAS54_6_R1_2_1_413_324 1028 CCCTTCTTGTCTTCAGCGTTTCTCC 1029 >EAS54_6_R1_2_1_540_792 1030 TTGGCAGGCCAAGGCCGATGGATCA 1031 >EAS54_6_R1_2_1_443_348 1032 GTTGCTTCTGGCGTGGGTGGGGGGG 1033 <BLANKLINE> 1034 """ 1035 if in_format in _BinaryFormats: 1036 in_mode = 'rb' 1037 else: 1038 in_mode = 'rU' 1039 1040 if out_format in _BinaryFormats: 1041 out_mode = 'wb' 1042 else: 1043 out_mode = 'w' 1044 1045 # This will check the arguments and issue error messages, 1046 # after we have opened the file which is a shame. 1047 from ._convert import _handle_convert # Lazy import 1048 with as_handle(in_file, in_mode) as in_handle: 1049 with as_handle(out_file, out_mode) as out_handle: 1050 count = _handle_convert(in_handle, in_format, 1051 out_handle, out_format, 1052 alphabet) 1053 return count
1054 1055 1056 # This helpful trick for testing no longer works with the 1057 # local imports :( 1058 # 1059 # if __name__ == "__main__": 1060 # from Bio._utils import run_doctest 1061 # run_doctest() 1062