Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

   1  # Copyright 2006-2010 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  #Nice link: 
   7  # http://www.ebi.ac.uk/help/formats_frame.html 
   8   
   9  r"""Sequence input/output as SeqRecord objects. 
  10   
  11  Bio.SeqIO is also documented at U{http://biopython.org/wiki/SeqIO} and by 
  12  a whole chapter in our tutorial: 
  13   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html} 
  14   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf} 
  15   
  16  Input 
  17  ===== 
  18  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
  19  (or in recent versions of Biopython alternatively a filename as a string), 
  20  and format string.  This returns an iterator giving SeqRecord objects: 
  21   
  22      >>> from Bio import SeqIO 
  23      >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
  24      ...     print("%s %i" % (record.id, len(record))) 
  25      gi|1348912|gb|G26680|G26680 633 
  26      gi|1348917|gb|G26685|G26685 413 
  27      gi|1592936|gb|G29385|G29385 471 
  28   
  29  Note that the parse() function will invoke the relevant parser for the 
  30  format with its default settings.  You may want more control, in which case 
  31  you need to create a format specific sequence iterator directly. 
  32   
  33  Input - Single Records 
  34  ====================== 
  35  If you expect your file to contain one-and-only-one record, then we provide 
  36  the following 'helper' function which will return a single SeqRecord, or 
  37  raise an exception if there are no records or more than one record: 
  38   
  39      >>> from Bio import SeqIO 
  40      >>> record = SeqIO.read("Fasta/f001", "fasta") 
  41      >>> print("%s %i" % (record.id, len(record))) 
  42      gi|3318709|pdb|1A91| 79 
  43   
  44  This style is useful when you expect a single record only (and would 
  45  consider multiple records an error).  For example, when dealing with GenBank 
  46  files for bacterial genomes or chromosomes, there is normally only a single 
  47  record.  Alternatively, use this with a handle when downloading a single 
  48  record from the internet. 
  49   
  50  However, if you just want the first record from a file containing multiple 
  51  record, use the next() function on the iterator (or under Python 2, the 
  52  iterator's next() method): 
  53   
  54      >>> from Bio import SeqIO 
  55      >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
  56      >>> print("%s %i" % (record.id, len(record))) 
  57      gi|1348912|gb|G26680|G26680 633 
  58   
  59  The above code will work as long as the file contains at least one record. 
  60  Note that if there is more than one record, the remaining records will be 
  61  silently ignored. 
  62   
  63   
  64  Input - Multiple Records 
  65  ======================== 
  66  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
  67  using a sequence iterator can save you a lot of memory (RAM).  There is 
  68  less benefit for interlaced file formats (e.g. most multiple alignment file 
  69  formats).  However, an iterator only lets you access the records one by one. 
  70   
  71  If you want random access to the records by number, turn this into a list: 
  72   
  73      >>> from Bio import SeqIO 
  74      >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
  75      >>> len(records) 
  76      3 
  77      >>> print(records[1].id) 
  78      gi|1348917|gb|G26685|G26685 
  79   
  80  If you want random access to the records by a key such as the record id, 
  81  turn the iterator into a dictionary: 
  82   
  83      >>> from Bio import SeqIO 
  84      >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
  85      >>> len(record_dict) 
  86      3 
  87      >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
  88      413 
  89   
  90  However, using list() or the to_dict() function will load all the records 
  91  into memory at once, and therefore is not possible on very large files. 
  92  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
  93  providing dictionary like access to any record. For example, 
  94   
  95      >>> from Bio import SeqIO 
  96      >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
  97      >>> len(record_dict) 
  98      3 
  99      >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 100      413 
 101      >>> record_dict.close() 
 102   
 103  Many but not all of the supported input file formats can be indexed like 
 104  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
 105  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
 106   
 107  In most cases you can also use SeqIO.index to get the record from the file 
 108  as a raw string (not a SeqRecord). This can be useful for example to extract 
 109  a sub-set of records from a file where SeqIO cannot output the file format 
 110  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
 111  keep the output 100% identical to the input). For example, 
 112   
 113      >>> from Bio import SeqIO 
 114      >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 115      >>> len(record_dict) 
 116      3 
 117      >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
 118      >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 119      CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
 120      ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
 121      GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
 122      GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
 123      TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
 124      AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 125      <BLANKLINE> 
 126      >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
 127      >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
 128      CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
 129      TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
 130      TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
 131      ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
 132      AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
 133      TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
 134      CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
 135      <BLANKLINE> 
 136      >>> record_dict.close() 
 137   
 138  Here the original file and what Biopython would output differ in the line 
 139  wrapping. Also note that under Python 3, the get_raw method will return a 
 140  bytes string, hence the use of decode to turn it into a (unicode) string. 
 141  This is uncessary on Python 2. 
 142   
 143  Also note that the get_raw method will preserve the newline endings. This 
 144  example FASTQ file uses Unix style endings (b"\n" only), 
 145   
 146      >>> from Bio import SeqIO 
 147      >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
 148      >>> len(fastq_dict) 
 149      3 
 150      >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 151      >>> raw.count(b"\n") 
 152      4 
 153      >>> raw.count(b"\r\n") 
 154      0 
 155      >>> b"\r" in raw 
 156      False 
 157      >>> len(raw) 
 158      78 
 159      >>> fastq_dict.close() 
 160   
 161  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
 162   
 163      >>> from Bio import SeqIO 
 164      >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
 165      >>> len(fastq_dict) 
 166      3 
 167      >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
 168      >>> raw.count(b"\n") 
 169      4 
 170      >>> raw.count(b"\r\n") 
 171      4 
 172      >>> b"\r\n" in raw 
 173      True 
 174      >>> len(raw) 
 175      82 
 176      >>> fastq_dict.close() 
 177   
 178  Because this uses two bytes for each new line, the file is longer than 
 179  the Unix equivalent with only one byte. 
 180   
 181   
 182  Input - Alignments 
 183  ================== 
 184  You can read in alignment files as alignment objects using Bio.AlignIO. 
 185  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
 186  you a SeqRecord for each row of each alignment: 
 187   
 188      >>> from Bio import SeqIO 
 189      >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
 190      ...     print("%s %i" % (record.id, len(record))) 
 191      gi|167877390|gb|EDS40773.1| 447 
 192      gi|167234445|ref|NP_001107837. 447 
 193      gi|74100009|gb|AAZ99217.1| 447 
 194      gi|13990994|dbj|BAA33523.2| 447 
 195      gi|56122354|gb|AAV74328.1| 447 
 196   
 197   
 198  Output 
 199  ====== 
 200  Use the function Bio.SeqIO.write(...), which takes a complete set of 
 201  SeqRecord objects (either as a list, or an iterator), an output file handle 
 202  (or in recent versions of Biopython an output filename as a string) and of 
 203  course the file format:: 
 204   
 205      from Bio import SeqIO 
 206      records = ... 
 207      SeqIO.write(records, "example.faa", "fasta") 
 208   
 209  Or, using a handle:: 
 210   
 211      from Bio import SeqIO 
 212      records = ... 
 213      with open("example.faa", "w") as handle: 
 214          SeqIO.write(records, handle, "fasta") 
 215   
 216  You are expected to call this function once (with all your records) and if 
 217  using a handle, make sure you close it to flush the data to the hard disk. 
 218   
 219   
 220  Output - Advanced 
 221  ================= 
 222  The effect of calling write() multiple times on a single file will vary 
 223  depending on the file format, and is best avoided unless you have a strong 
 224  reason to do so. 
 225   
 226  If you give a filename, then each time you call write() the existing file 
 227  will be overwriten. For sequential files formats (e.g. fasta, genbank) each 
 228  "record block" holds a single sequence.  For these files it would probably 
 229  be safe to call write() multiple times by re-using the same handle. 
 230   
 231   
 232  However, trying this for certain alignment formats (e.g. phylip, clustal, 
 233  stockholm) would have the effect of concatenating several multiple sequence 
 234  alignments together.  Such files are created by the PHYLIP suite of programs 
 235  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
 236   
 237   
 238  Conversion 
 239  ========== 
 240  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
 241  file format conversions. Additionally, it may use file format specific 
 242  optimisations so this should be the fastest way too. 
 243   
 244  In general however, you can combine the Bio.SeqIO.parse(...) function with 
 245  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
 246  generator expressions or generator functions provides a memory efficient way 
 247  to perform filtering or other extra operations as part of the process. 
 248   
 249   
 250  File Formats 
 251  ============ 
 252  When specifying the file format, use lowercase strings.  The same format 
 253  names are also used in Bio.AlignIO and include the following: 
 254   
 255   - abif    - Applied Biosystem's sequencing trace format 
 256   - ace     - Reads the contig sequences from an ACE assembly file. 
 257   - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
 258   - fasta   - The generic sequence file format where each record starts with 
 259               an identifer line starting with a ">" character, followed by 
 260               lines of sequence. 
 261   - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
 262               sequence quality values (with an ASCII offset of 33). 
 263   - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
 264   - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
 265               encodes Solexa quality scores (not PHRED quality scores) with an 
 266               ASCII offset of 64. 
 267   - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
 268               which encodes PHRED quality scores with an ASCII offset of 64 
 269               (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
 270               will produce FASTQ files using the standard Sanger encoding. 
 271   - genbank - The GenBank or GenPept flat file format. 
 272   - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
 273   - ig      - The IntelliGenetics file format, apparently the same as the 
 274               MASE alignment format. 
 275   - imgt    - An EMBL like format from IMGT where the feature tables are more 
 276               indented to allow for longer feature types. 
 277   - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
 278   - pir     - A "FASTA like" format introduced by the National Biomedical 
 279               Research Foundation (NBRF) for the Protein Information Resource 
 280               (PIR) database, now part of UniProt. 
 281   - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
 282   - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
 283   - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
 284   - swiss   - Plain text Swiss-Prot aka UniProt format. 
 285   - tab     - Simple two column tab separated sequence files, where each 
 286               line holds a record's identifier and sequence. For example, 
 287               this is used as by Aligent's eArray software when saving 
 288               microarray probes in a minimal tab delimited text file. 
 289   - qual    - A "FASTA like" format holding PHRED quality values from 
 290               sequencing DNA, but no actual sequences (usually provided 
 291               in separate FASTA files). 
 292   - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
 293               text format which we call "swiss") 
 294   
 295  Note that while Bio.SeqIO can read all the above file formats, it cannot 
 296  write to all of them. 
 297   
 298  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
 299  "phlip" and "stockholm", which gives you access to the individual sequences 
 300  making up each alignment as SeqRecords. 
 301  """ 
 302   
 303  from __future__ import print_function 
 304  from Bio._py3k import basestring 
 305   
 306  __docformat__ = "epytext en"  # not just plaintext 
 307   
 308  #TODO 
 309  # - define policy on reading aligned sequences with gaps in 
 310  #   (e.g. - and . characters) including how the alphabet interacts 
 311  # 
 312  # - How best to handle unique/non unique record.id when writing. 
 313  #   For most file formats reading such files is fine; The stockholm 
 314  #   parser would fail. 
 315  # 
 316  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
 317  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
 318   
 319  """ 
 320  FAO BioPython Developers 
 321  ======================== 
 322  The way I envision this SeqIO system working as that for any sequence file 
 323  format we have an iterator that returns SeqRecord objects. 
 324   
 325  This also applies to interlaced fileformats (like clustal - although that 
 326  is now handled via Bio.AlignIO instead) where the file cannot be read record 
 327  by record.  You should still return an iterator, even if the implementation 
 328  could just as easily return a list. 
 329   
 330  These file format specific sequence iterators may be implemented as: 
 331  * Classes which take a handle for __init__ and provide the __iter__ method 
 332  * Functions that take a handle, and return an iterator object 
 333  * Generator functions that take a handle, and yield SeqRecord objects 
 334   
 335  It is then trivial to turn this iterator into a list of SeqRecord objects, 
 336  an in memory dictionary, or a multiple sequence alignment object. 
 337   
 338  For building the dictionary by default the id propery of each SeqRecord is 
 339  used as the key.  You should always populate the id property, and it should 
 340  be unique in most cases. For some file formats the accession number is a good 
 341  choice.  If the file itself contains ambiguous identifiers, don't try and 
 342  dis-ambiguate them - return them as is. 
 343   
 344  When adding a new file format, please use the same lower case format name 
 345  as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
 346   
 347  See also http://biopython.org/wiki/SeqIO_dev 
 348   
 349  --Peter 
 350  """ 
 351   
 352   
 353  from Bio.File import as_handle 
 354  from Bio.SeqRecord import SeqRecord 
 355  from Bio.Align import MultipleSeqAlignment 
 356  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
 357   
 358  from . import AbiIO 
 359  from . import AceIO 
 360  from . import FastaIO 
 361  from . import IgIO  # IntelliGenetics or MASE format 
 362  from . import InsdcIO  # EMBL and GenBank 
 363  from . import PdbIO 
 364  from . import PhdIO 
 365  from . import PirIO 
 366  from . import SeqXmlIO 
 367  from . import SffIO 
 368  from . import SwissIO 
 369  from . import TabIO 
 370  from . import QualityIO  # FastQ and qual files 
 371  from . import UniprotIO 
 372   
 373   
 374  #Convention for format names is "mainname-subtype" in lower case. 
 375  #Please use the same names as BioPerl or EMBOSS where possible. 
 376  # 
 377  #Note that this simple system copes with defining 
 378  #multiple possible iterators for a given format/extension 
 379  #with the -subtype suffix 
 380  # 
 381  #Most alignment file formats will be handled via Bio.AlignIO 
 382   
 383  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
 384                       "gb": InsdcIO.GenBankIterator, 
 385                       "genbank": InsdcIO.GenBankIterator, 
 386                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
 387                       "embl": InsdcIO.EmblIterator, 
 388                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
 389                       "imgt": InsdcIO.ImgtIterator, 
 390                       "ig": IgIO.IgIterator, 
 391                       "swiss": SwissIO.SwissIterator, 
 392                       "pdb-atom": PdbIO.PdbAtomIterator, 
 393                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
 394                       "phd": PhdIO.PhdIterator, 
 395                       "ace": AceIO.AceIterator, 
 396                       "tab": TabIO.TabIterator, 
 397                       "pir": PirIO.PirIterator, 
 398                       "fastq": QualityIO.FastqPhredIterator, 
 399                       "fastq-sanger": QualityIO.FastqPhredIterator, 
 400                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
 401                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
 402                       "qual": QualityIO.QualPhredIterator, 
 403                       "sff": SffIO.SffIterator, 
 404                       #Not sure about this in the long run: 
 405                       "sff-trim": SffIO._SffTrimIterator, 
 406                       "uniprot-xml": UniprotIO.UniprotIterator, 
 407                       "seqxml": SeqXmlIO.SeqXmlIterator, 
 408                       "abi": AbiIO.AbiIterator, 
 409                       "abi-trim": AbiIO._AbiTrimIterator, 
 410                       } 
 411   
 412  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
 413                     "gb": InsdcIO.GenBankWriter, 
 414                     "genbank": InsdcIO.GenBankWriter, 
 415                     "embl": InsdcIO.EmblWriter, 
 416                     "imgt": InsdcIO.ImgtWriter, 
 417                     "tab": TabIO.TabWriter, 
 418                     "fastq": QualityIO.FastqPhredWriter, 
 419                     "fastq-sanger": QualityIO.FastqPhredWriter, 
 420                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
 421                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
 422                     "phd": PhdIO.PhdWriter, 
 423                     "qual": QualityIO.QualPhredWriter, 
 424                     "sff": SffIO.SffWriter, 
 425                     "seqxml": SeqXmlIO.SeqXmlWriter, 
 426                     } 
 427   
 428  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
 429   
 430   
431 -def write(sequences, handle, format):
432 """Write complete set of sequences to a file. 433 434 - sequences - A list (or iterator) of SeqRecord objects, or (if using 435 Biopython 1.54 or later) a single SeqRecord. 436 - handle - File handle object to write to, or filename as string 437 (note older versions of Biopython only took a handle). 438 - format - lower case string describing the file format to write. 439 440 You should close the handle after calling this function. 441 442 Returns the number of records written (as an integer). 443 """ 444 from Bio import AlignIO 445 446 #Try and give helpful error messages: 447 if not isinstance(format, basestring): 448 raise TypeError("Need a string for the file format (lower case)") 449 if not format: 450 raise ValueError("Format required (lower case string)") 451 if format != format.lower(): 452 raise ValueError("Format string '%s' should be lower case" % format) 453 454 if isinstance(sequences, SeqRecord): 455 #This raised an exception in order version of Biopython 456 sequences = [sequences] 457 458 if format in _BinaryFormats: 459 mode = 'wb' 460 else: 461 mode = 'w' 462 463 with as_handle(handle, mode) as fp: 464 #Map the file format to a writer class 465 if format in _FormatToWriter: 466 writer_class = _FormatToWriter[format] 467 count = writer_class(fp).write_file(sequences) 468 elif format in AlignIO._FormatToWriter: 469 #Try and turn all the records into a single alignment, 470 #and write that using Bio.AlignIO 471 alignment = MultipleSeqAlignment(sequences) 472 alignment_count = AlignIO.write([alignment], fp, format) 473 assert alignment_count == 1, \ 474 "Internal error - the underlying writer " \ 475 " should have returned 1, not %s" % repr(alignment_count) 476 count = len(alignment) 477 del alignment_count, alignment 478 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 479 raise ValueError("Reading format '%s' is supported, but not writing" 480 % format) 481 else: 482 raise ValueError("Unknown format '%s'" % format) 483 484 assert isinstance(count, int), "Internal error - the underlying %s " \ 485 "writer should have returned the record count, not %s" \ 486 % (format, repr(count)) 487 488 return count
489 490
491 -def parse(handle, format, alphabet=None):
492 r"""Turns a sequence file into an iterator returning SeqRecords. 493 494 - handle - handle to the file, or the filename as a string 495 (note older versions of Biopython only took a handle). 496 - format - lower case string describing the file format. 497 - alphabet - optional Alphabet object, useful when the sequence type 498 cannot be automatically inferred from the file itself 499 (e.g. format="fasta" or "tab") 500 501 Typical usage, opening a file to read in, and looping over the record(s): 502 503 >>> from Bio import SeqIO 504 >>> filename = "Fasta/sweetpea.nu" 505 >>> for record in SeqIO.parse(filename, "fasta"): 506 ... print("ID %s" % record.id) 507 ... print("Sequence length %i" % len(record)) 508 ... print("Sequence alphabet %s" % record.seq.alphabet) 509 ID gi|3176602|gb|U78617.1|LOU78617 510 Sequence length 309 511 Sequence alphabet SingleLetterAlphabet() 512 513 For file formats like FASTA where the alphabet cannot be determined, it 514 may be useful to specify the alphabet explicitly: 515 516 >>> from Bio import SeqIO 517 >>> from Bio.Alphabet import generic_dna 518 >>> filename = "Fasta/sweetpea.nu" 519 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 520 ... print("ID %s" % record.id) 521 ... print("Sequence length %i" % len(record)) 522 ... print("Sequence alphabet %s" % record.seq.alphabet) 523 ID gi|3176602|gb|U78617.1|LOU78617 524 Sequence length 309 525 Sequence alphabet DNAAlphabet() 526 527 If you have a string 'data' containing the file contents, you must 528 first turn this into a handle in order to parse it: 529 530 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 531 >>> from Bio import SeqIO 532 >>> try: 533 ... from StringIO import StringIO # Python 2 534 ... except ImportError: 535 ... from io import StringIO # Python 3 536 ... 537 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 538 ... print("%s %s" % (record.id, record.seq)) 539 Alpha ACCGGATGTA 540 Beta AGGCTCGGTTA 541 542 Use the Bio.SeqIO.read(...) function when you expect a single record 543 only. 544 """ 545 #NOTE - The above docstring has some raw \n characters needed 546 #for the StringIO example, hense the whole docstring is in raw 547 #string mode (see the leading r before the opening quote). 548 from Bio import AlignIO 549 550 #Hack for SFF, will need to make this more general in future 551 if format in _BinaryFormats: 552 mode = 'rb' 553 else: 554 mode = 'rU' 555 556 #Try and give helpful error messages: 557 if not isinstance(format, basestring): 558 raise TypeError("Need a string for the file format (lower case)") 559 if not format: 560 raise ValueError("Format required (lower case string)") 561 if format != format.lower(): 562 raise ValueError("Format string '%s' should be lower case" % format) 563 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 564 isinstance(alphabet, AlphabetEncoder)): 565 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 566 567 with as_handle(handle, mode) as fp: 568 #Map the file format to a sequence iterator: 569 if format in _FormatToIterator: 570 iterator_generator = _FormatToIterator[format] 571 if alphabet is None: 572 i = iterator_generator(fp) 573 else: 574 try: 575 i = iterator_generator(fp, alphabet=alphabet) 576 except TypeError: 577 i = _force_alphabet(iterator_generator(fp), alphabet) 578 elif format in AlignIO._FormatToIterator: 579 #Use Bio.AlignIO to read in the alignments 580 i = (r for alignment in AlignIO.parse(fp, format, 581 alphabet=alphabet) 582 for r in alignment) 583 else: 584 raise ValueError("Unknown format '%s'" % format) 585 #This imposes some overhead... wait until we drop Python 2.4 to fix it 586 for r in i: 587 yield r
588 589
590 -def _force_alphabet(record_iterator, alphabet):
591 """Iterate over records, over-riding the alphabet (PRIVATE).""" 592 #Assume the alphabet argument has been pre-validated 593 given_base_class = _get_base_alphabet(alphabet).__class__ 594 for record in record_iterator: 595 if isinstance(_get_base_alphabet(record.seq.alphabet), 596 given_base_class): 597 record.seq.alphabet = alphabet 598 yield record 599 else: 600 raise ValueError("Specified alphabet %s clashes with " 601 "that determined from the file, %s" 602 % (repr(alphabet), repr(record.seq.alphabet)))
603 604
605 -def read(handle, format, alphabet=None):
606 """Turns a sequence file into a single SeqRecord. 607 608 - handle - handle to the file, or the filename as a string 609 (note older versions of Biopython only took a handle). 610 - format - string describing the file format. 611 - alphabet - optional Alphabet object, useful when the sequence type 612 cannot be automatically inferred from the file itself 613 (e.g. format="fasta" or "tab") 614 615 This function is for use parsing sequence files containing 616 exactly one record. For example, reading a GenBank file: 617 618 >>> from Bio import SeqIO 619 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 620 >>> print("ID %s" % record.id) 621 ID AC007323.5 622 >>> print("Sequence length %i" % len(record)) 623 Sequence length 86436 624 >>> print("Sequence alphabet %s" % record.seq.alphabet) 625 Sequence alphabet IUPACAmbiguousDNA() 626 627 If the handle contains no records, or more than one record, 628 an exception is raised. For example: 629 630 >>> from Bio import SeqIO 631 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 632 Traceback (most recent call last): 633 ... 634 ValueError: More than one record found in handle 635 636 If however you want the first record from a file containing 637 multiple records this function would raise an exception (as 638 shown in the example above). Instead use: 639 640 >>> from Bio import SeqIO 641 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 642 >>> print("First record's ID %s" % record.id) 643 First record's ID X55053.1 644 645 Use the Bio.SeqIO.parse(handle, format) function if you want 646 to read multiple records from the handle. 647 """ 648 iterator = parse(handle, format, alphabet) 649 try: 650 first = next(iterator) 651 except StopIteration: 652 first = None 653 if first is None: 654 raise ValueError("No records found in handle") 655 try: 656 second = next(iterator) 657 except StopIteration: 658 second = None 659 if second is not None: 660 raise ValueError("More than one record found in handle") 661 return first
662 663
664 -def to_dict(sequences, key_function=None):
665 """Turns a sequence iterator or list into a dictionary. 666 667 - sequences - An iterator that returns SeqRecord objects, 668 or simply a list of SeqRecord objects. 669 - key_function - Optional callback function which when given a 670 SeqRecord should return a unique key for the dictionary. 671 672 e.g. key_function = lambda rec : rec.name 673 or, key_function = lambda rec : rec.description.split()[0] 674 675 If key_function is omitted then record.id is used, on the assumption 676 that the records objects returned are SeqRecords with a unique id. 677 678 If there are duplicate keys, an error is raised. 679 680 Example usage, defaulting to using the record.id as key: 681 682 >>> from Bio import SeqIO 683 >>> filename = "GenBank/cor6_6.gb" 684 >>> format = "genbank" 685 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 686 >>> print(sorted(id_dict)) 687 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 688 >>> print(id_dict["L31939.1"].description) 689 Brassica rapa (clone bif72) kin mRNA, complete cds. 690 691 A more complex example, using the key_function argument in order to 692 use a sequence checksum as the dictionary key: 693 694 >>> from Bio import SeqIO 695 >>> from Bio.SeqUtils.CheckSum import seguid 696 >>> filename = "GenBank/cor6_6.gb" 697 >>> format = "genbank" 698 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 699 ... key_function = lambda rec : seguid(rec.seq)) 700 >>> for key, record in sorted(seguid_dict.items()): 701 ... print("%s %s" % (key, record.id)) 702 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 703 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 704 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 705 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 706 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 707 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 708 709 This approach is not suitable for very large sets of sequences, as all 710 the SeqRecord objects are held in memory. Instead, consider using the 711 Bio.SeqIO.index() function (if it supports your particular file format). 712 """ 713 if key_function is None: 714 key_function = lambda rec: rec.id 715 716 d = dict() 717 for record in sequences: 718 key = key_function(record) 719 if key in d: 720 raise ValueError("Duplicate key '%s'" % key) 721 d[key] = record 722 return d
723 724
725 -def index(filename, format, alphabet=None, key_function=None):
726 """Indexes a sequence file and returns a dictionary like object. 727 728 - filename - string giving name of file to be indexed 729 - format - lower case string describing the file format 730 - alphabet - optional Alphabet object, useful when the sequence type 731 cannot be automatically inferred from the file itself 732 (e.g. format="fasta" or "tab") 733 - key_function - Optional callback function which when given a 734 SeqRecord identifier string should return a unique 735 key for the dictionary. 736 737 This indexing function will return a dictionary like object, giving the 738 SeqRecord objects as values: 739 740 >>> from Bio import SeqIO 741 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 742 >>> len(records) 743 3 744 >>> sorted(records) 745 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 746 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 747 >EAS54_6_R1_2_1_540_792 748 TTGGCAGGCCAAGGCCGATGGATCA 749 <BLANKLINE> 750 >>> "EAS54_6_R1_2_1_540_792" in records 751 True 752 >>> print(records.get("Missing", None)) 753 None 754 >>> records.close() 755 756 If the file is BGZF compressed, this is detected automatically. Ordinary 757 GZIP files are not supported: 758 759 >>> from Bio import SeqIO 760 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 761 >>> len(records) 762 3 763 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 764 TTGGCAGGCCAAGGCCGATGGATCA 765 >>> records.close() 766 767 Note that this pseudo dictionary will not support all the methods of a 768 true Python dictionary, for example values() is not defined since this 769 would require loading all of the records into memory at once. 770 771 When you call the index function, it will scan through the file, noting 772 the location of each record. When you access a particular record via the 773 dictionary methods, the code will jump to the appropriate part of the 774 file and then parse that section into a SeqRecord. 775 776 Note that not all the input formats supported by Bio.SeqIO can be used 777 with this index function. It is designed to work only with sequential 778 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 779 interlaced file format (e.g. alignment formats such as "clustal"). 780 781 For small files, it may be more efficient to use an in memory Python 782 dictionary, e.g. 783 784 >>> from Bio import SeqIO 785 >>> records = SeqIO.to_dict(SeqIO.parse("Quality/example.fastq", "fastq")) 786 >>> len(records) 787 3 788 >>> sorted(records) 789 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 790 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 791 >EAS54_6_R1_2_1_540_792 792 TTGGCAGGCCAAGGCCGATGGATCA 793 <BLANKLINE> 794 795 As with the to_dict() function, by default the id string of each record 796 is used as the key. You can specify a callback function to transform 797 this (the record identifier string) into your preferred key. For example: 798 799 >>> from Bio import SeqIO 800 >>> def make_tuple(identifier): 801 ... parts = identifier.split("_") 802 ... return int(parts[-2]), int(parts[-1]) 803 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 804 ... key_function=make_tuple) 805 >>> len(records) 806 3 807 >>> sorted(records) 808 [(413, 324), (443, 348), (540, 792)] 809 >>> print(records[(540, 792)].format("fasta")) 810 >EAS54_6_R1_2_1_540_792 811 TTGGCAGGCCAAGGCCGATGGATCA 812 <BLANKLINE> 813 >>> (540, 792) in records 814 True 815 >>> "EAS54_6_R1_2_1_540_792" in records 816 False 817 >>> print(records.get("Missing", None)) 818 None 819 >>> records.close() 820 821 Another common use case would be indexing an NCBI style FASTA file, 822 where you might want to extract the GI number from the FASTA identifer 823 to use as the dictionary key. 824 825 Notice that unlike the to_dict() function, here the key_function does 826 not get given the full SeqRecord to use to generate the key. Doing so 827 would impose a severe performance penalty as it would require the file 828 to be completely parsed while building the index. Right now this is 829 usually avoided. 830 831 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 832 """ 833 #Try and give helpful error messages: 834 if not isinstance(filename, basestring): 835 raise TypeError("Need a filename (not a handle)") 836 if not isinstance(format, basestring): 837 raise TypeError("Need a string for the file format (lower case)") 838 if not format: 839 raise ValueError("Format required (lower case string)") 840 if format != format.lower(): 841 raise ValueError("Format string '%s' should be lower case" % format) 842 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 843 isinstance(alphabet, AlphabetEncoder)): 844 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 845 846 #Map the file format to a sequence iterator: 847 from ._index import _FormatToRandomAccess # Lazy import 848 from Bio.File import _IndexedSeqFileDict 849 try: 850 proxy_class = _FormatToRandomAccess[format] 851 except KeyError: 852 raise ValueError("Unsupported format %r" % format) 853 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 854 % (filename, format, alphabet, key_function) 855 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 856 key_function, repr, "SeqRecord")
857 858
859 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 860 key_function=None):
861 """Index several sequence files and return a dictionary like object. 862 863 The index is stored in an SQLite database rather than in memory (as in the 864 Bio.SeqIO.index(...) function). 865 866 - index_filename - Where to store the SQLite index 867 - filenames - list of strings specifying file(s) to be indexed, or when 868 indexing a single file this can be given as a string. 869 (optional if reloading an existing index, but must match) 870 - format - lower case string describing the file format 871 (optional if reloading an existing index, but must match) 872 - alphabet - optional Alphabet object, useful when the sequence type 873 cannot be automatically inferred from the file itself 874 (e.g. format="fasta" or "tab") 875 - key_function - Optional callback function which when given a 876 SeqRecord identifier string should return a unique 877 key for the dictionary. 878 879 This indexing function will return a dictionary like object, giving the 880 SeqRecord objects as values: 881 882 >>> from Bio.Alphabet import generic_protein 883 >>> from Bio import SeqIO 884 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 885 >>> def get_gi(name): 886 ... parts = name.split("|") 887 ... i = parts.index("gi") 888 ... assert i != -1 889 ... return parts[i+1] 890 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 891 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 892 >>> len(records) 893 95 894 >>> records["7525076"].description 895 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 896 >>> records["45478717"].description 897 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 898 >>> records.close() 899 900 In this example the two files contain 85 and 10 records respectively. 901 902 BGZF compressed files are supported, and detected automatically. Ordinary 903 GZIP compressed files are not supported. 904 905 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict(), and the Python module 906 glob which is useful for building lists of files. 907 """ 908 #Try and give helpful error messages: 909 if not isinstance(index_filename, basestring): 910 raise TypeError("Need a string for the index filename") 911 if isinstance(filenames, basestring): 912 #Make the API a little more friendly, and more similar 913 #to Bio.SeqIO.index(...) for indexing just one file. 914 filenames = [filenames] 915 if filenames is not None and not isinstance(filenames, list): 916 raise TypeError( 917 "Need a list of filenames (as strings), or one filename") 918 if format is not None and not isinstance(format, basestring): 919 raise TypeError("Need a string for the file format (lower case)") 920 if format and format != format.lower(): 921 raise ValueError("Format string '%s' should be lower case" % format) 922 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 923 isinstance(alphabet, AlphabetEncoder)): 924 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 925 926 #Map the file format to a sequence iterator: 927 from ._index import _FormatToRandomAccess # Lazy import 928 from Bio.File import _SQLiteManySeqFilesDict 929 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 930 % (index_filename, filenames, format, alphabet, key_function) 931 932 def proxy_factory(format, filename=None): 933 """Given a filename returns proxy object, else boolean if format OK.""" 934 if filename: 935 return _FormatToRandomAccess[format](filename, format, alphabet) 936 else: 937 return format in _FormatToRandomAccess
938 939 return _SQLiteManySeqFilesDict(index_filename, filenames, 940 proxy_factory, format, 941 key_function, repr) 942 943
944 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
945 """Convert between two sequence file formats, return number of records. 946 947 - in_file - an input handle or filename 948 - in_format - input file format, lower case string 949 - out_file - an output handle or filename 950 - out_format - output file format, lower case string 951 - alphabet - optional alphabet to assume 952 953 NOTE - If you provide an output filename, it will be opened which will 954 overwrite any existing file without warning. This may happen if even 955 the conversion is aborted (e.g. an invalid out_format name is given). 956 957 For example, going from a filename to a handle: 958 959 >>> from Bio import SeqIO 960 >>> try: 961 ... from StringIO import StringIO # Python 2 962 ... except ImportError: 963 ... from io import StringIO # Python 3 964 ... 965 >>> handle = StringIO("") 966 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 967 3 968 >>> print(handle.getvalue()) 969 >EAS54_6_R1_2_1_413_324 970 CCCTTCTTGTCTTCAGCGTTTCTCC 971 >EAS54_6_R1_2_1_540_792 972 TTGGCAGGCCAAGGCCGATGGATCA 973 >EAS54_6_R1_2_1_443_348 974 GTTGCTTCTGGCGTGGGTGGGGGGG 975 <BLANKLINE> 976 """ 977 #Hack for SFF, will need to make this more general in future 978 if in_format in _BinaryFormats: 979 in_mode = 'rb' 980 else: 981 in_mode = 'rU' 982 983 #Don't open the output file until we've checked the input is OK? 984 if out_format in ["sff", "sff_trim"]: 985 out_mode = 'wb' 986 else: 987 out_mode = 'w' 988 989 #This will check the arguments and issue error messages, 990 #after we have opened the file which is a shame. 991 from ._convert import _handle_convert # Lazy import 992 with as_handle(in_file, in_mode) as in_handle: 993 with as_handle(out_file, out_mode) as out_handle: 994 count = _handle_convert(in_handle, in_format, 995 out_handle, out_format, 996 alphabet) 997 return count
998 999 1000 # This helpful trick for testing no longer works with the 1001 # local imports :( 1002 # 1003 #if __name__ == "__main__": 1004 # from Bio._utils import run_doctest 1005 # run_doctest() 1006