Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

   1  # Copyright 2006-2010 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # Nice link: 
   7  # http://www.ebi.ac.uk/help/formats_frame.html 
   8   
   9  r"""Sequence input/output as SeqRecord objects. 
  10   
  11  Bio.SeqIO is also documented at SeqIO_ and by 
  12  a whole chapter in our tutorial: 
  13   
  14    - `HTML Tutorial`_ 
  15    - `PDF Tutorial`_ 
  16   
  17  .. _SeqIO: http://biopython.org/wiki/SeqIO 
  18  .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html 
  19  .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
  20   
  21  Input 
  22  ----- 
  23  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
  24  (or in recent versions of Biopython alternatively a filename as a string), 
  25  and format string.  This returns an iterator giving SeqRecord objects: 
  26   
  27  >>> from Bio import SeqIO 
  28  >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
  29  ...     print("%s %i" % (record.id, len(record))) 
  30  gi|1348912|gb|G26680|G26680 633 
  31  gi|1348917|gb|G26685|G26685 413 
  32  gi|1592936|gb|G29385|G29385 471 
  33   
  34  Note that the parse() function will invoke the relevant parser for the 
  35  format with its default settings.  You may want more control, in which case 
  36  you need to create a format specific sequence iterator directly. 
  37   
  38  Input - Single Records 
  39  ---------------------- 
  40  If you expect your file to contain one-and-only-one record, then we provide 
  41  the following 'helper' function which will return a single SeqRecord, or 
  42  raise an exception if there are no records or more than one record: 
  43   
  44  >>> from Bio import SeqIO 
  45  >>> record = SeqIO.read("Fasta/f001", "fasta") 
  46  >>> print("%s %i" % (record.id, len(record))) 
  47  gi|3318709|pdb|1A91| 79 
  48   
  49  This style is useful when you expect a single record only (and would 
  50  consider multiple records an error).  For example, when dealing with GenBank 
  51  files for bacterial genomes or chromosomes, there is normally only a single 
  52  record.  Alternatively, use this with a handle when downloading a single 
  53  record from the internet. 
  54   
  55  However, if you just want the first record from a file containing multiple 
  56  record, use the next() function on the iterator (or under Python 2, the 
  57  iterator's next() method): 
  58   
  59  >>> from Bio import SeqIO 
  60  >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
  61  >>> print("%s %i" % (record.id, len(record))) 
  62  gi|1348912|gb|G26680|G26680 633 
  63   
  64  The above code will work as long as the file contains at least one record. 
  65  Note that if there is more than one record, the remaining records will be 
  66  silently ignored. 
  67   
  68   
  69  Input - Multiple Records 
  70  ------------------------ 
  71  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
  72  using a sequence iterator can save you a lot of memory (RAM).  There is 
  73  less benefit for interlaced file formats (e.g. most multiple alignment file 
  74  formats).  However, an iterator only lets you access the records one by one. 
  75   
  76  If you want random access to the records by number, turn this into a list: 
  77   
  78  >>> from Bio import SeqIO 
  79  >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
  80  >>> len(records) 
  81  3 
  82  >>> print(records[1].id) 
  83  gi|1348917|gb|G26685|G26685 
  84   
  85  If you want random access to the records by a key such as the record id, 
  86  turn the iterator into a dictionary: 
  87   
  88  >>> from Bio import SeqIO 
  89  >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
  90  >>> len(record_dict) 
  91  3 
  92  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
  93  413 
  94   
  95  However, using list() or the to_dict() function will load all the records 
  96  into memory at once, and therefore is not possible on very large files. 
  97  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
  98  providing dictionary like access to any record. For example, 
  99   
 100  >>> from Bio import SeqIO 
 101  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 102  >>> len(record_dict) 
 103  3 
 104  >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 105  413 
 106  >>> record_dict.close() 
 107   
 108  Many but not all of the supported input file formats can be indexed like 
 109  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
 110  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
 111   
 112  In most cases you can also use SeqIO.index to get the record from the file 
 113  as a raw string (not a SeqRecord). This can be useful for example to extract 
 114  a sub-set of records from a file where SeqIO cannot output the file format 
 115  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
 116  keep the output 100% identical to the input). For example, 
 117   
 118  >>> from Bio import SeqIO 
 119  >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 120  >>> len(record_dict) 
 121  3 
 122  >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
 123  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 124  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
 125  ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
 126  GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
 127  GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
 128  TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
 129  AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 130  <BLANKLINE> 
 131  >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
 132  >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 133  CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
 134  TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
 135  TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
 136  ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
 137  AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
 138  TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
 139  CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 140  <BLANKLINE> 
 141  >>> record_dict.close() 
 142   
 143  Here the original file and what Biopython would output differ in the line 
 144  wrapping. Also note that under Python 3, the get_raw method will return a 
 145  bytes string, hence the use of decode to turn it into a (unicode) string. 
 146  This is uncessary on Python 2. 
 147   
 148  Also note that the get_raw method will preserve the newline endings. This 
 149  example FASTQ file uses Unix style endings (b"\n" only), 
 150   
 151  >>> from Bio import SeqIO 
 152  >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
 153  >>> len(fastq_dict) 
 154  3 
 155  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 156  >>> raw.count(b"\n") 
 157  4 
 158  >>> raw.count(b"\r\n") 
 159  0 
 160  >>> b"\r" in raw 
 161  False 
 162  >>> len(raw) 
 163  78 
 164  >>> fastq_dict.close() 
 165   
 166  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
 167   
 168  >>> from Bio import SeqIO 
 169  >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
 170  >>> len(fastq_dict) 
 171  3 
 172  >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 173  >>> raw.count(b"\n") 
 174  4 
 175  >>> raw.count(b"\r\n") 
 176  4 
 177  >>> b"\r\n" in raw 
 178  True 
 179  >>> len(raw) 
 180  82 
 181  >>> fastq_dict.close() 
 182   
 183  Because this uses two bytes for each new line, the file is longer than 
 184  the Unix equivalent with only one byte. 
 185   
 186   
 187  Input - Alignments 
 188  ------------------ 
 189  You can read in alignment files as alignment objects using Bio.AlignIO. 
 190  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
 191  you a SeqRecord for each row of each alignment: 
 192   
 193  >>> from Bio import SeqIO 
 194  >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
 195  ...     print("%s %i" % (record.id, len(record))) 
 196  gi|167877390|gb|EDS40773.1| 447 
 197  gi|167234445|ref|NP_001107837. 447 
 198  gi|74100009|gb|AAZ99217.1| 447 
 199  gi|13990994|dbj|BAA33523.2| 447 
 200  gi|56122354|gb|AAV74328.1| 447 
 201   
 202   
 203  Output 
 204  ------ 
 205  Use the function Bio.SeqIO.write(...), which takes a complete set of 
 206  SeqRecord objects (either as a list, or an iterator), an output file handle 
 207  (or in recent versions of Biopython an output filename as a string) and of 
 208  course the file format:: 
 209   
 210    from Bio import SeqIO 
 211    records = ... 
 212    SeqIO.write(records, "example.faa", "fasta") 
 213   
 214  Or, using a handle:: 
 215   
 216      from Bio import SeqIO 
 217      records = ... 
 218      with open("example.faa", "w") as handle: 
 219        SeqIO.write(records, handle, "fasta") 
 220   
 221  You are expected to call this function once (with all your records) and if 
 222  using a handle, make sure you close it to flush the data to the hard disk. 
 223   
 224   
 225  Output - Advanced 
 226  ----------------- 
 227  The effect of calling write() multiple times on a single file will vary 
 228  depending on the file format, and is best avoided unless you have a strong 
 229  reason to do so. 
 230   
 231  If you give a filename, then each time you call write() the existing file 
 232  will be overwriten. For sequential files formats (e.g. fasta, genbank) each 
 233  "record block" holds a single sequence.  For these files it would probably 
 234  be safe to call write() multiple times by re-using the same handle. 
 235   
 236   
 237  However, trying this for certain alignment formats (e.g. phylip, clustal, 
 238  stockholm) would have the effect of concatenating several multiple sequence 
 239  alignments together.  Such files are created by the PHYLIP suite of programs 
 240  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
 241   
 242   
 243  Conversion 
 244  ---------- 
 245  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
 246  file format conversions. Additionally, it may use file format specific 
 247  optimisations so this should be the fastest way too. 
 248   
 249  In general however, you can combine the Bio.SeqIO.parse(...) function with 
 250  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
 251  generator expressions or generator functions provides a memory efficient way 
 252  to perform filtering or other extra operations as part of the process. 
 253   
 254   
 255  File Formats 
 256  ------------ 
 257  When specifying the file format, use lowercase strings.  The same format 
 258  names are also used in Bio.AlignIO and include the following: 
 259   
 260      - abif    - Applied Biosystem's sequencing trace format 
 261      - ace     - Reads the contig sequences from an ACE assembly file. 
 262      - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
 263      - fasta   - The generic sequence file format where each record starts with 
 264        an identifer line starting with a ">" character, followed by 
 265        lines of sequence. 
 266      - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
 267        sequence quality values (with an ASCII offset of 33). 
 268      - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
 269      - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
 270        encodes Solexa quality scores (not PHRED quality scores) with an 
 271        ASCII offset of 64. 
 272      - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
 273        which encodes PHRED quality scores with an ASCII offset of 64 
 274        (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
 275        will produce FASTQ files using the standard Sanger encoding. 
 276      - genbank - The GenBank or GenPept flat file format. 
 277      - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
 278      - ig      - The IntelliGenetics file format, apparently the same as the 
 279        MASE alignment format. 
 280      - imgt    - An EMBL like format from IMGT where the feature tables are more 
 281        indented to allow for longer feature types. 
 282      - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
 283      - pir     - A "FASTA like" format introduced by the National Biomedical 
 284        Research Foundation (NBRF) for the Protein Information Resource 
 285        (PIR) database, now part of UniProt. 
 286      - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
 287      - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
 288      - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
 289      - swiss   - Plain text Swiss-Prot aka UniProt format. 
 290      - tab     - Simple two column tab separated sequence files, where each 
 291        line holds a record's identifier and sequence. For example, 
 292        this is used as by Aligent's eArray software when saving 
 293        microarray probes in a minimal tab delimited text file. 
 294      - qual    - A "FASTA like" format holding PHRED quality values from 
 295        sequencing DNA, but no actual sequences (usually provided 
 296        in separate FASTA files). 
 297      - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
 298        text format which we call "swiss") 
 299   
 300  Note that while Bio.SeqIO can read all the above file formats, it cannot 
 301  write to all of them. 
 302   
 303  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
 304  "phlip" and "stockholm", which gives you access to the individual sequences 
 305  making up each alignment as SeqRecords. 
 306  """ 
 307   
 308  from __future__ import print_function 
 309  from Bio._py3k import basestring 
 310   
 311  __docformat__ = "restructuredtext en"  # not just plaintext 
 312   
 313  # TODO 
 314  # - define policy on reading aligned sequences with gaps in 
 315  #   (e.g. - and . characters) including how the alphabet interacts 
 316  # 
 317  # - How best to handle unique/non unique record.id when writing. 
 318  #   For most file formats reading such files is fine; The stockholm 
 319  #   parser would fail. 
 320  # 
 321  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
 322  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
 323   
 324  """ 
 325  FAO BioPython Developers 
 326  ------------------------ 
 327  The way I envision this SeqIO system working as that for any sequence file 
 328  format we have an iterator that returns SeqRecord objects. 
 329   
 330  This also applies to interlaced fileformats (like clustal - although that 
 331  is now handled via Bio.AlignIO instead) where the file cannot be read record 
 332  by record.  You should still return an iterator, even if the implementation 
 333  could just as easily return a list. 
 334   
 335  These file format specific sequence iterators may be implemented as: 
 336      - Classes which take a handle for __init__ and provide the __iter__ method 
 337      - Functions that take a handle, and return an iterator object 
 338      - Generator functions that take a handle, and yield SeqRecord objects 
 339   
 340  It is then trivial to turn this iterator into a list of SeqRecord objects, 
 341  an in memory dictionary, or a multiple sequence alignment object. 
 342   
 343  For building the dictionary by default the id propery of each SeqRecord is 
 344  used as the key.  You should always populate the id property, and it should 
 345  be unique in most cases. For some file formats the accession number is a good 
 346  choice.  If the file itself contains ambiguous identifiers, don't try and 
 347  dis-ambiguate them - return them as is. 
 348   
 349  When adding a new file format, please use the same lower case format name 
 350  as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
 351   
 352  See also http://biopython.org/wiki/SeqIO_dev 
 353   
 354  --Peter 
 355  """ 
 356   
 357   
 358  from Bio.File import as_handle 
 359  from Bio.SeqRecord import SeqRecord 
 360  from Bio.Align import MultipleSeqAlignment 
 361  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
 362   
 363  from . import AbiIO 
 364  from . import AceIO 
 365  from . import FastaIO 
 366  from . import IgIO  # IntelliGenetics or MASE format 
 367  from . import InsdcIO  # EMBL and GenBank 
 368  from . import PdbIO 
 369  from . import PhdIO 
 370  from . import PirIO 
 371  from . import SeqXmlIO 
 372  from . import SffIO 
 373  from . import SwissIO 
 374  from . import TabIO 
 375  from . import QualityIO  # FastQ and qual files 
 376  from . import UniprotIO 
 377   
 378   
 379  # Convention for format names is "mainname-subtype" in lower case. 
 380  # Please use the same names as BioPerl or EMBOSS where possible. 
 381  # 
 382  # Note that this simple system copes with defining 
 383  # multiple possible iterators for a given format/extension 
 384  # with the -subtype suffix 
 385  # 
 386  # Most alignment file formats will be handled via Bio.AlignIO 
 387   
 388  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
 389                       "gb": InsdcIO.GenBankIterator, 
 390                       "genbank": InsdcIO.GenBankIterator, 
 391                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
 392                       "embl": InsdcIO.EmblIterator, 
 393                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
 394                       "imgt": InsdcIO.ImgtIterator, 
 395                       "ig": IgIO.IgIterator, 
 396                       "swiss": SwissIO.SwissIterator, 
 397                       "pdb-atom": PdbIO.PdbAtomIterator, 
 398                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
 399                       "phd": PhdIO.PhdIterator, 
 400                       "ace": AceIO.AceIterator, 
 401                       "tab": TabIO.TabIterator, 
 402                       "pir": PirIO.PirIterator, 
 403                       "fastq": QualityIO.FastqPhredIterator, 
 404                       "fastq-sanger": QualityIO.FastqPhredIterator, 
 405                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
 406                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
 407                       "qual": QualityIO.QualPhredIterator, 
 408                       "sff": SffIO.SffIterator, 
 409                       # Not sure about this in the long run: 
 410                       "sff-trim": SffIO._SffTrimIterator, 
 411                       "uniprot-xml": UniprotIO.UniprotIterator, 
 412                       "seqxml": SeqXmlIO.SeqXmlIterator, 
 413                       "abi": AbiIO.AbiIterator, 
 414                       "abi-trim": AbiIO._AbiTrimIterator, 
 415                       } 
 416   
 417  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
 418                     "gb": InsdcIO.GenBankWriter, 
 419                     "genbank": InsdcIO.GenBankWriter, 
 420                     "embl": InsdcIO.EmblWriter, 
 421                     "imgt": InsdcIO.ImgtWriter, 
 422                     "tab": TabIO.TabWriter, 
 423                     "fastq": QualityIO.FastqPhredWriter, 
 424                     "fastq-sanger": QualityIO.FastqPhredWriter, 
 425                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
 426                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
 427                     "phd": PhdIO.PhdWriter, 
 428                     "qual": QualityIO.QualPhredWriter, 
 429                     "sff": SffIO.SffWriter, 
 430                     "seqxml": SeqXmlIO.SeqXmlWriter, 
 431                     } 
 432   
 433  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
 434   
 435   
436 -def write(sequences, handle, format):
437 """Write complete set of sequences to a file. 438 439 - sequences - A list (or iterator) of SeqRecord objects, or (if using 440 Biopython 1.54 or later) a single SeqRecord. 441 - handle - File handle object to write to, or filename as string 442 (note older versions of Biopython only took a handle). 443 - format - lower case string describing the file format to write. 444 445 You should close the handle after calling this function. 446 447 Returns the number of records written (as an integer). 448 """ 449 from Bio import AlignIO 450 451 # Try and give helpful error messages: 452 if not isinstance(format, basestring): 453 raise TypeError("Need a string for the file format (lower case)") 454 if not format: 455 raise ValueError("Format required (lower case string)") 456 if format != format.lower(): 457 raise ValueError("Format string '%s' should be lower case" % format) 458 459 if isinstance(sequences, SeqRecord): 460 # This raised an exception in order version of Biopython 461 sequences = [sequences] 462 463 if format in _BinaryFormats: 464 mode = 'wb' 465 else: 466 mode = 'w' 467 468 with as_handle(handle, mode) as fp: 469 # Map the file format to a writer class 470 if format in _FormatToWriter: 471 writer_class = _FormatToWriter[format] 472 count = writer_class(fp).write_file(sequences) 473 elif format in AlignIO._FormatToWriter: 474 # Try and turn all the records into a single alignment, 475 # and write that using Bio.AlignIO 476 alignment = MultipleSeqAlignment(sequences) 477 alignment_count = AlignIO.write([alignment], fp, format) 478 assert alignment_count == 1, \ 479 "Internal error - the underlying writer " \ 480 " should have returned 1, not %s" % repr(alignment_count) 481 count = len(alignment) 482 del alignment_count, alignment 483 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 484 raise ValueError("Reading format '%s' is supported, but not writing" 485 % format) 486 else: 487 raise ValueError("Unknown format '%s'" % format) 488 489 assert isinstance(count, int), "Internal error - the underlying %s " \ 490 "writer should have returned the record count, not %s" \ 491 % (format, repr(count)) 492 493 return count
494 495
496 -def parse(handle, format, alphabet=None):
497 r"""Turns a sequence file into an iterator returning SeqRecords. 498 499 - handle - handle to the file, or the filename as a string 500 (note older versions of Biopython only took a handle). 501 - format - lower case string describing the file format. 502 - alphabet - optional Alphabet object, useful when the sequence type 503 cannot be automatically inferred from the file itself 504 (e.g. format="fasta" or "tab") 505 506 Typical usage, opening a file to read in, and looping over the record(s): 507 508 >>> from Bio import SeqIO 509 >>> filename = "Fasta/sweetpea.nu" 510 >>> for record in SeqIO.parse(filename, "fasta"): 511 ... print("ID %s" % record.id) 512 ... print("Sequence length %i" % len(record)) 513 ... print("Sequence alphabet %s" % record.seq.alphabet) 514 ID gi|3176602|gb|U78617.1|LOU78617 515 Sequence length 309 516 Sequence alphabet SingleLetterAlphabet() 517 518 For file formats like FASTA where the alphabet cannot be determined, it 519 may be useful to specify the alphabet explicitly: 520 521 >>> from Bio import SeqIO 522 >>> from Bio.Alphabet import generic_dna 523 >>> filename = "Fasta/sweetpea.nu" 524 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 525 ... print("ID %s" % record.id) 526 ... print("Sequence length %i" % len(record)) 527 ... print("Sequence alphabet %s" % record.seq.alphabet) 528 ID gi|3176602|gb|U78617.1|LOU78617 529 Sequence length 309 530 Sequence alphabet DNAAlphabet() 531 532 If you have a string 'data' containing the file contents, you must 533 first turn this into a handle in order to parse it: 534 535 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 536 >>> from Bio import SeqIO 537 >>> try: 538 ... from StringIO import StringIO # Python 2 539 ... except ImportError: 540 ... from io import StringIO # Python 3 541 ... 542 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 543 ... print("%s %s" % (record.id, record.seq)) 544 Alpha ACCGGATGTA 545 Beta AGGCTCGGTTA 546 547 Use the Bio.SeqIO.read(...) function when you expect a single record 548 only. 549 """ 550 # NOTE - The above docstring has some raw \n characters needed 551 # for the StringIO example, hense the whole docstring is in raw 552 # string mode (see the leading r before the opening quote). 553 from Bio import AlignIO 554 555 # Hack for SFF, will need to make this more general in future 556 if format in _BinaryFormats: 557 mode = 'rb' 558 else: 559 mode = 'rU' 560 561 # Try and give helpful error messages: 562 if not isinstance(format, basestring): 563 raise TypeError("Need a string for the file format (lower case)") 564 if not format: 565 raise ValueError("Format required (lower case string)") 566 if format != format.lower(): 567 raise ValueError("Format string '%s' should be lower case" % format) 568 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 569 isinstance(alphabet, AlphabetEncoder)): 570 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 571 572 with as_handle(handle, mode) as fp: 573 # Map the file format to a sequence iterator: 574 if format in _FormatToIterator: 575 iterator_generator = _FormatToIterator[format] 576 if alphabet is None: 577 i = iterator_generator(fp) 578 else: 579 try: 580 i = iterator_generator(fp, alphabet=alphabet) 581 except TypeError: 582 i = _force_alphabet(iterator_generator(fp), alphabet) 583 elif format in AlignIO._FormatToIterator: 584 # Use Bio.AlignIO to read in the alignments 585 i = (r for alignment in AlignIO.parse(fp, format, 586 alphabet=alphabet) 587 for r in alignment) 588 else: 589 raise ValueError("Unknown format '%s'" % format) 590 # This imposes some overhead... wait until we drop Python 2.4 to fix it 591 for r in i: 592 yield r
593 594
595 -def _force_alphabet(record_iterator, alphabet):
596 """Iterate over records, over-riding the alphabet (PRIVATE).""" 597 # Assume the alphabet argument has been pre-validated 598 given_base_class = _get_base_alphabet(alphabet).__class__ 599 for record in record_iterator: 600 if isinstance(_get_base_alphabet(record.seq.alphabet), 601 given_base_class): 602 record.seq.alphabet = alphabet 603 yield record 604 else: 605 raise ValueError("Specified alphabet %s clashes with " 606 "that determined from the file, %s" 607 % (repr(alphabet), repr(record.seq.alphabet)))
608 609
610 -def read(handle, format, alphabet=None):
611 """Turns a sequence file into a single SeqRecord. 612 613 - handle - handle to the file, or the filename as a string 614 (note older versions of Biopython only took a handle). 615 - format - string describing the file format. 616 - alphabet - optional Alphabet object, useful when the sequence type 617 cannot be automatically inferred from the file itself 618 (e.g. format="fasta" or "tab") 619 620 This function is for use parsing sequence files containing 621 exactly one record. For example, reading a GenBank file: 622 623 >>> from Bio import SeqIO 624 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 625 >>> print("ID %s" % record.id) 626 ID AC007323.5 627 >>> print("Sequence length %i" % len(record)) 628 Sequence length 86436 629 >>> print("Sequence alphabet %s" % record.seq.alphabet) 630 Sequence alphabet IUPACAmbiguousDNA() 631 632 If the handle contains no records, or more than one record, 633 an exception is raised. For example: 634 635 >>> from Bio import SeqIO 636 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 637 Traceback (most recent call last): 638 ... 639 ValueError: More than one record found in handle 640 641 If however you want the first record from a file containing 642 multiple records this function would raise an exception (as 643 shown in the example above). Instead use: 644 645 >>> from Bio import SeqIO 646 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 647 >>> print("First record's ID %s" % record.id) 648 First record's ID X55053.1 649 650 Use the Bio.SeqIO.parse(handle, format) function if you want 651 to read multiple records from the handle. 652 """ 653 iterator = parse(handle, format, alphabet) 654 try: 655 first = next(iterator) 656 except StopIteration: 657 first = None 658 if first is None: 659 raise ValueError("No records found in handle") 660 try: 661 second = next(iterator) 662 except StopIteration: 663 second = None 664 if second is not None: 665 raise ValueError("More than one record found in handle") 666 return first
667 668
669 -def to_dict(sequences, key_function=None):
670 """Turns a sequence iterator or list into a dictionary. 671 672 - sequences - An iterator that returns SeqRecord objects, 673 or simply a list of SeqRecord objects. 674 - key_function - Optional callback function which when given a 675 SeqRecord should return a unique key for the dictionary. 676 677 e.g. key_function = lambda rec : rec.name 678 or, key_function = lambda rec : rec.description.split()[0] 679 680 If key_function is omitted then record.id is used, on the assumption 681 that the records objects returned are SeqRecords with a unique id. 682 683 If there are duplicate keys, an error is raised. 684 685 Example usage, defaulting to using the record.id as key: 686 687 >>> from Bio import SeqIO 688 >>> filename = "GenBank/cor6_6.gb" 689 >>> format = "genbank" 690 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 691 >>> print(sorted(id_dict)) 692 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 693 >>> print(id_dict["L31939.1"].description) 694 Brassica rapa (clone bif72) kin mRNA, complete cds. 695 696 A more complex example, using the key_function argument in order to 697 use a sequence checksum as the dictionary key: 698 699 >>> from Bio import SeqIO 700 >>> from Bio.SeqUtils.CheckSum import seguid 701 >>> filename = "GenBank/cor6_6.gb" 702 >>> format = "genbank" 703 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 704 ... key_function = lambda rec : seguid(rec.seq)) 705 >>> for key, record in sorted(seguid_dict.items()): 706 ... print("%s %s" % (key, record.id)) 707 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 708 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 709 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 710 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 711 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 712 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 713 714 This approach is not suitable for very large sets of sequences, as all 715 the SeqRecord objects are held in memory. Instead, consider using the 716 Bio.SeqIO.index() function (if it supports your particular file format). 717 """ 718 if key_function is None: 719 key_function = lambda rec: rec.id 720 721 d = dict() 722 for record in sequences: 723 key = key_function(record) 724 if key in d: 725 raise ValueError("Duplicate key '%s'" % key) 726 d[key] = record 727 return d
728 729
730 -def index(filename, format, alphabet=None, key_function=None):
731 """Indexes a sequence file and returns a dictionary like object. 732 733 - filename - string giving name of file to be indexed 734 - format - lower case string describing the file format 735 - alphabet - optional Alphabet object, useful when the sequence type 736 cannot be automatically inferred from the file itself 737 (e.g. format="fasta" or "tab") 738 - key_function - Optional callback function which when given a 739 SeqRecord identifier string should return a unique 740 key for the dictionary. 741 742 This indexing function will return a dictionary like object, giving the 743 SeqRecord objects as values: 744 745 >>> from Bio import SeqIO 746 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 747 >>> len(records) 748 3 749 >>> sorted(records) 750 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 751 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 752 >EAS54_6_R1_2_1_540_792 753 TTGGCAGGCCAAGGCCGATGGATCA 754 <BLANKLINE> 755 >>> "EAS54_6_R1_2_1_540_792" in records 756 True 757 >>> print(records.get("Missing", None)) 758 None 759 >>> records.close() 760 761 If the file is BGZF compressed, this is detected automatically. Ordinary 762 GZIP files are not supported: 763 764 >>> from Bio import SeqIO 765 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 766 >>> len(records) 767 3 768 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 769 TTGGCAGGCCAAGGCCGATGGATCA 770 >>> records.close() 771 772 Note that this pseudo dictionary will not support all the methods of a 773 true Python dictionary, for example values() is not defined since this 774 would require loading all of the records into memory at once. 775 776 When you call the index function, it will scan through the file, noting 777 the location of each record. When you access a particular record via the 778 dictionary methods, the code will jump to the appropriate part of the 779 file and then parse that section into a SeqRecord. 780 781 Note that not all the input formats supported by Bio.SeqIO can be used 782 with this index function. It is designed to work only with sequential 783 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 784 interlaced file format (e.g. alignment formats such as "clustal"). 785 786 For small files, it may be more efficient to use an in memory Python 787 dictionary, e.g. 788 789 >>> from Bio import SeqIO 790 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq")) 791 >>> len(records) 792 3 793 >>> sorted(records) 794 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 795 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 796 >EAS54_6_R1_2_1_540_792 797 TTGGCAGGCCAAGGCCGATGGATCA 798 <BLANKLINE> 799 800 As with the to_dict() function, by default the id string of each record 801 is used as the key. You can specify a callback function to transform 802 this (the record identifier string) into your preferred key. For example: 803 804 >>> from Bio import SeqIO 805 >>> def make_tuple(identifier): 806 ... parts = identifier.split("_") 807 ... return int(parts[-2]), int(parts[-1]) 808 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 809 ... key_function=make_tuple) 810 >>> len(records) 811 3 812 >>> sorted(records) 813 [(413, 324), (443, 348), (540, 792)] 814 >>> print(records[(540, 792)].format("fasta")) 815 >EAS54_6_R1_2_1_540_792 816 TTGGCAGGCCAAGGCCGATGGATCA 817 <BLANKLINE> 818 >>> (540, 792) in records 819 True 820 >>> "EAS54_6_R1_2_1_540_792" in records 821 False 822 >>> print(records.get("Missing", None)) 823 None 824 >>> records.close() 825 826 Another common use case would be indexing an NCBI style FASTA file, 827 where you might want to extract the GI number from the FASTA identifer 828 to use as the dictionary key. 829 830 Notice that unlike the to_dict() function, here the key_function does 831 not get given the full SeqRecord to use to generate the key. Doing so 832 would impose a severe performance penalty as it would require the file 833 to be completely parsed while building the index. Right now this is 834 usually avoided. 835 836 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 837 """ 838 # Try and give helpful error messages: 839 if not isinstance(filename, basestring): 840 raise TypeError("Need a filename (not a handle)") 841 if not isinstance(format, basestring): 842 raise TypeError("Need a string for the file format (lower case)") 843 if not format: 844 raise ValueError("Format required (lower case string)") 845 if format != format.lower(): 846 raise ValueError("Format string '%s' should be lower case" % format) 847 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 848 isinstance(alphabet, AlphabetEncoder)): 849 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 850 851 # Map the file format to a sequence iterator: 852 from ._index import _FormatToRandomAccess # Lazy import 853 from Bio.File import _IndexedSeqFileDict 854 try: 855 proxy_class = _FormatToRandomAccess[format] 856 except KeyError: 857 raise ValueError("Unsupported format %r" % format) 858 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 859 % (filename, format, alphabet, key_function) 860 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 861 key_function, repr, "SeqRecord")
862 863
864 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 865 key_function=None):
866 """Index several sequence files and return a dictionary like object. 867 868 The index is stored in an SQLite database rather than in memory (as in the 869 Bio.SeqIO.index(...) function). 870 871 - index_filename - Where to store the SQLite index 872 - filenames - list of strings specifying file(s) to be indexed, or when 873 indexing a single file this can be given as a string. 874 (optional if reloading an existing index, but must match) 875 - format - lower case string describing the file format 876 (optional if reloading an existing index, but must match) 877 - alphabet - optional Alphabet object, useful when the sequence type 878 cannot be automatically inferred from the file itself 879 (e.g. format="fasta" or "tab") 880 - key_function - Optional callback function which when given a 881 SeqRecord identifier string should return a unique 882 key for the dictionary. 883 884 This indexing function will return a dictionary like object, giving the 885 SeqRecord objects as values: 886 887 >>> from Bio.Alphabet import generic_protein 888 >>> from Bio import SeqIO 889 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 890 >>> def get_gi(name): 891 ... parts = name.split("|") 892 ... i = parts.index("gi") 893 ... assert i != -1 894 ... return parts[i+1] 895 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 896 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 897 >>> len(records) 898 95 899 >>> records["7525076"].description 900 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 901 >>> records["45478717"].description 902 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 903 >>> records.close() 904 905 In this example the two files contain 85 and 10 records respectively. 906 907 BGZF compressed files are supported, and detected automatically. Ordinary 908 GZIP compressed files are not supported. 909 910 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module 911 glob which is useful for building lists of files. 912 """ 913 # Try and give helpful error messages: 914 if not isinstance(index_filename, basestring): 915 raise TypeError("Need a string for the index filename") 916 if isinstance(filenames, basestring): 917 # Make the API a little more friendly, and more similar 918 # to Bio.SeqIO.index(...) for indexing just one file. 919 filenames = [filenames] 920 if filenames is not None and not isinstance(filenames, list): 921 raise TypeError( 922 "Need a list of filenames (as strings), or one filename") 923 if format is not None and not isinstance(format, basestring): 924 raise TypeError("Need a string for the file format (lower case)") 925 if format and format != format.lower(): 926 raise ValueError("Format string '%s' should be lower case" % format) 927 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 928 isinstance(alphabet, AlphabetEncoder)): 929 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 930 931 # Map the file format to a sequence iterator: 932 from ._index import _FormatToRandomAccess # Lazy import 933 from Bio.File import _SQLiteManySeqFilesDict 934 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 935 % (index_filename, filenames, format, alphabet, key_function) 936 937 def proxy_factory(format, filename=None): 938 """Given a filename returns proxy object, else boolean if format OK.""" 939 if filename: 940 return _FormatToRandomAccess[format](filename, format, alphabet) 941 else: 942 return format in _FormatToRandomAccess
943 944 return _SQLiteManySeqFilesDict(index_filename, filenames, 945 proxy_factory, format, 946 key_function, repr) 947 948
949 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
950 """Convert between two sequence file formats, return number of records. 951 952 - in_file - an input handle or filename 953 - in_format - input file format, lower case string 954 - out_file - an output handle or filename 955 - out_format - output file format, lower case string 956 - alphabet - optional alphabet to assume 957 958 **NOTE** - If you provide an output filename, it will be opened which will 959 overwrite any existing file without warning. This may happen if even 960 the conversion is aborted (e.g. an invalid out_format name is given). 961 962 For example, going from a filename to a handle: 963 964 >>> from Bio import SeqIO 965 >>> try: 966 ... from StringIO import StringIO # Python 2 967 ... except ImportError: 968 ... from io import StringIO # Python 3 969 ... 970 >>> handle = StringIO("") 971 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 972 3 973 >>> print(handle.getvalue()) 974 >EAS54_6_R1_2_1_413_324 975 CCCTTCTTGTCTTCAGCGTTTCTCC 976 >EAS54_6_R1_2_1_540_792 977 TTGGCAGGCCAAGGCCGATGGATCA 978 >EAS54_6_R1_2_1_443_348 979 GTTGCTTCTGGCGTGGGTGGGGGGG 980 <BLANKLINE> 981 """ 982 # Hack for SFF, will need to make this more general in future 983 if in_format in _BinaryFormats: 984 in_mode = 'rb' 985 else: 986 in_mode = 'rU' 987 988 # Don't open the output file until we've checked the input is OK? 989 if out_format in ["sff", "sff_trim"]: 990 out_mode = 'wb' 991 else: 992 out_mode = 'w' 993 994 # This will check the arguments and issue error messages, 995 # after we have opened the file which is a shame. 996 from ._convert import _handle_convert # Lazy import 997 with as_handle(in_file, in_mode) as in_handle: 998 with as_handle(out_file, out_mode) as out_handle: 999 count = _handle_convert(in_handle, in_format, 1000 out_handle, out_format, 1001 alphabet) 1002 return count
1003 1004 1005 # This helpful trick for testing no longer works with the 1006 # local imports :( 1007 # 1008 # if __name__ == "__main__": 1009 # from Bio._utils import run_doctest 1010 # run_doctest() 1011