Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2010 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # This module is for reading and writing FASTQ and QUAL format files as 
   7  # SeqRecord objects, and is expected to be used via the Bio.SeqIO API. 
   8   
   9  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
  10   
  11  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  12  shown below. 
  13   
  14  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  15  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  16  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  17  are used containing the sequence and the quality information separately. 
  18   
  19  The PHRED software reads DNA sequencing trace files, calls bases, and 
  20  assigns a non-negative quality value to each called base using a logged 
  21  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  22   
  23      Pe = 1.0,         Q =  0 
  24      Pe = 0.1,         Q = 10 
  25      Pe = 0.01,        Q = 20 
  26      ... 
  27      Pe = 0.00000001,  Q = 80 
  28      Pe = 0.000000001, Q = 90 
  29   
  30  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  31  In the QUAL format these quality values are held as space separated text in 
  32  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  33  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  34  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  35  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  36  quality are then stored in pairs in a FASTA like format. 
  37   
  38  Unfortunately there is no official document describing the FASTQ file format, 
  39  and worse, several related but different variants exist. For more details, 
  40  please read this open access publication:: 
  41   
  42      The Sanger FASTQ file format for sequences with quality scores, and the 
  43      Solexa/Illumina FASTQ variants. 
  44      P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 
  45      M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 
  46      Nucleic Acids Research 2010 38(6):1767-1771 
  47      http://dx.doi.org/10.1093/nar/gkp1137 
  48   
  49  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  50  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  51  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  52  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  53  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  54  FASTA file using the -s or -seq argument instead. 
  55   
  56  The bad news is that Solexa/Illumina did things differently - they have their 
  57  own scoring system AND their own incompatible versions of the FASTQ format. 
  58  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  59  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  60  reasonable mapping can be achieved between them, and they are approximately 
  61  equal for higher quality reads). 
  62   
  63  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  64  own score mapping and an ASCII offset of 64. To make things worse, for the 
  65  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  66  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  67  with an ASCII offset of 64. 
  68   
  69  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  70  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  71   
  72  The good news is that as of CASAVA version 1.8, Illumina sequencers will 
  73  produce FASTQ files using the standard Sanger encoding. 
  74   
  75  You are expected to use this module via the Bio.SeqIO functions, with the 
  76  following format names: 
  77   
  78   - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  79   - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  80      offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). 
  81      These can potentially hold PHRED scores from 0 to 93. 
  82   - "fastq-sanger" is an alias for "fastq". 
  83   - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  84      files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  85      scores from -5 to 62. 
  86   - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using 
  87      PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 
  88      to 62. 
  89   
  90  We could potentially add support for "qual-solexa" meaning QUAL files which 
  91  contain Solexa scores, but thus far there isn't any reason to use such files. 
  92   
  93  For example, consider the following short FASTQ file:: 
  94   
  95      @EAS54_6_R1_2_1_413_324 
  96      CCCTTCTTGTCTTCAGCGTTTCTCC 
  97      + 
  98      ;;3;;;;;;;;;;;;7;;;;;;;88 
  99      @EAS54_6_R1_2_1_540_792 
 100      TTGGCAGGCCAAGGCCGATGGATCA 
 101      + 
 102      ;;;;;;;;;;;7;;;;;-;;;3;83 
 103      @EAS54_6_R1_2_1_443_348 
 104      GTTGCTTCTGGCGTGGGTGGGGGGG 
 105      + 
 106      ;;;;;;;;;;;9;7;;.7;393333 
 107   
 108  This contains three reads of length 25.  From the read length these were 
 109  probably originally from an early Solexa/Illumina sequencer but this file 
 110  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 111  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 112  "fastq" as the format name: 
 113   
 114      >>> from Bio import SeqIO 
 115      >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 
 116      ...     print record.id, record.seq 
 117      EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 118      EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 119      EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 120   
 121  The qualities are held as a list of integers in each record's annotation: 
 122   
 123      >>> print record 
 124      ID: EAS54_6_R1_2_1_443_348 
 125      Name: EAS54_6_R1_2_1_443_348 
 126      Description: EAS54_6_R1_2_1_443_348 
 127      Number of features: 0 
 128      Per letter annotation for: phred_quality 
 129      Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 130      >>> print record.letter_annotations["phred_quality"] 
 131      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 132   
 133  You can use the SeqRecord format method to show this in the QUAL format: 
 134   
 135      >>> print record.format("qual") 
 136      >EAS54_6_R1_2_1_443_348 
 137      26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 138      24 18 18 18 18 
 139      <BLANKLINE> 
 140   
 141  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 142   
 143      >>> print record.format("fastq") 
 144      @EAS54_6_R1_2_1_443_348 
 145      GTTGCTTCTGGCGTGGGTGGGGGGG 
 146      + 
 147      ;;;;;;;;;;;9;7;;.7;393333 
 148      <BLANKLINE> 
 149   
 150  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 151  of 64): 
 152   
 153      >>> print record.format("fastq-illumina") 
 154      @EAS54_6_R1_2_1_443_348 
 155      GTTGCTTCTGGCGTGGGTGGGGGGG 
 156      + 
 157      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 158      <BLANKLINE> 
 159   
 160  You can also get Biopython to convert the scores and show a Solexa style 
 161  FASTQ file: 
 162   
 163      >>> print record.format("fastq-solexa") 
 164      @EAS54_6_R1_2_1_443_348 
 165      GTTGCTTCTGGCGTGGGTGGGGGGG 
 166      + 
 167      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 168      <BLANKLINE> 
 169   
 170  Notice that this is actually the same output as above using "fastq-illumina" 
 171  as the format! The reason for this is all these scores are high enough that 
 172  the PHRED and Solexa scores are almost equal. The differences become apparent 
 173  for poor quality reads. See the functions solexa_quality_from_phred and 
 174  phred_quality_from_solexa for more details. 
 175   
 176  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 177  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 178   
 179      >>> sub_rec = record[5:15] 
 180      >>> print sub_rec 
 181      ID: EAS54_6_R1_2_1_443_348 
 182      Name: EAS54_6_R1_2_1_443_348 
 183      Description: EAS54_6_R1_2_1_443_348 
 184      Number of features: 0 
 185      Per letter annotation for: phred_quality 
 186      Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 187      >>> print sub_rec.letter_annotations["phred_quality"] 
 188      [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 189      >>> print sub_rec.format("fastq") 
 190      @EAS54_6_R1_2_1_443_348 
 191      TTCTGGCGTG 
 192      + 
 193      ;;;;;;9;7; 
 194      <BLANKLINE> 
 195   
 196  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 197   
 198      >>> from Bio import SeqIO 
 199      >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 
 200      >>> out_handle = open("Quality/temp.qual", "w") 
 201      >>> SeqIO.write(record_iterator, out_handle, "qual") 
 202      3 
 203      >>> out_handle.close() 
 204   
 205  You can of course read in a QUAL file, such as the one we just created: 
 206   
 207      >>> from Bio import SeqIO 
 208      >>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 
 209      ...     print record.id, record.seq 
 210      EAS54_6_R1_2_1_413_324 ????????????????????????? 
 211      EAS54_6_R1_2_1_540_792 ????????????????????????? 
 212      EAS54_6_R1_2_1_443_348 ????????????????????????? 
 213   
 214  Notice that QUAL files don't have a proper sequence present!  But the quality 
 215  information is there: 
 216   
 217      >>> print record 
 218      ID: EAS54_6_R1_2_1_443_348 
 219      Name: EAS54_6_R1_2_1_443_348 
 220      Description: EAS54_6_R1_2_1_443_348 
 221      Number of features: 0 
 222      Per letter annotation for: phred_quality 
 223      UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?') 
 224      >>> print record.letter_annotations["phred_quality"] 
 225      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 226   
 227  Just to keep things tidy, if you are following this example yourself, you can 
 228  delete this temporary file now: 
 229   
 230      >>> import os 
 231      >>> os.remove("Quality/temp.qual") 
 232   
 233  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 234  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 235  would have to read the two in separately and then combine the data.  However, 
 236  since this is such a common thing to want to do, there is a helper iterator 
 237  defined in this module that does this for you - PairedFastaQualIterator. 
 238   
 239  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 240  then a simple dictionary approach would work: 
 241   
 242      >>> from Bio import SeqIO 
 243      >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta")) 
 244      >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"): 
 245      ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 246   
 247  You can then access any record by its key, and get both the sequence and the 
 248  quality scores. 
 249   
 250      >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq") 
 251      @EAS54_6_R1_2_1_540_792 
 252      TTGGCAGGCCAAGGCCGATGGATCA 
 253      + 
 254      ;;;;;;;;;;;7;;;;;-;;;3;83 
 255      <BLANKLINE> 
 256   
 257  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 258  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 259  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 260  for the more recent variant), as this cannot be detected reliably 
 261  automatically. 
 262   
 263  To illustrate this problem, let's consider an artifical example: 
 264   
 265      >>> from Bio.Seq import Seq 
 266      >>> from Bio.Alphabet import generic_dna 
 267      >>> from Bio.SeqRecord import SeqRecord 
 268      >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 269      ... description="Made up!") 
 270      >>> print test.format("fasta") 
 271      >Test Made up! 
 272      NACGTACGTA 
 273      <BLANKLINE> 
 274      >>> print test.format("fastq") 
 275      Traceback (most recent call last): 
 276       ... 
 277      ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 278   
 279  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 280  or FASTQ format we need to provide some quality scores. These are held as a 
 281  list of integers (one for each base) in the letter_annotations dictionary: 
 282   
 283      >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40] 
 284      >>> print test.format("qual") 
 285      >Test Made up! 
 286      0 1 2 3 4 5 10 20 30 40 
 287      <BLANKLINE> 
 288      >>> print test.format("fastq") 
 289      @Test Made up! 
 290      NACGTACGTA 
 291      + 
 292      !"#$%&+5?I 
 293      <BLANKLINE> 
 294   
 295  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 296  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 297  the letter "I": 
 298   
 299      >>> ord('!') - 33 
 300      0 
 301      >>> ord('I') - 33 
 302      40 
 303      >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 304      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 305   
 306  Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED 
 307  scores with an offset of 64: 
 308   
 309      >>> print test.format("fastq-illumina") 
 310      @Test Made up! 
 311      NACGTACGTA 
 312      + 
 313      @ABCDEJT^h 
 314      <BLANKLINE> 
 315   
 316  And we can check this too - the first PHRED score was zero, and this mapped to 
 317  "@", while the final score was 40 and this mapped to "h": 
 318   
 319      >>> ord("@") - 64 
 320      0 
 321      >>> ord("h") - 64 
 322      40 
 323      >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 324      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 325   
 326  Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style 
 327  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 328  format to consider which encodes Solexa scores instead of PHRED scores. 
 329   
 330  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 331  scores (rounding to one decimal place): 
 332   
 333      >>> for q in [0,1,2,3,4,5,10,20,30,40]: 
 334      ...     print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q)) 
 335      PHRED 0 maps to Solexa -5.0 
 336      PHRED 1 maps to Solexa -5.0 
 337      PHRED 2 maps to Solexa -2.3 
 338      PHRED 3 maps to Solexa -0.0 
 339      PHRED 4 maps to Solexa 1.8 
 340      PHRED 5 maps to Solexa 3.3 
 341      PHRED 10 maps to Solexa 9.5 
 342      PHRED 20 maps to Solexa 20.0 
 343      PHRED 30 maps to Solexa 30.0 
 344      PHRED 40 maps to Solexa 40.0 
 345   
 346  Now here is the record using the old Solexa style FASTQ file: 
 347   
 348      >>> print test.format("fastq-solexa") 
 349      @Test Made up! 
 350      NACGTACGTA 
 351      + 
 352      ;;>@BCJT^h 
 353      <BLANKLINE> 
 354   
 355  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 356   
 357      >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 358      [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 359   
 360  This explains why the last few letters of this FASTQ output matched that using 
 361  the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores 
 362  are approximately equal. 
 363   
 364  """ 
 365  __docformat__ = "epytext en"  # Don't just use plain text in epydoc API pages! 
 366   
 367  from Bio.Alphabet import single_letter_alphabet 
 368  from Bio.Seq import Seq, UnknownSeq 
 369  from Bio.SeqRecord import SeqRecord 
 370  from Bio.SeqIO.Interfaces import SequentialSequenceWriter 
 371  from math import log 
 372  import warnings 
 373  from Bio import BiopythonWarning, BiopythonParserWarning 
 374   
 375   
 376  # define score offsets. See discussion for differences between Sanger and 
 377  # Solexa offsets. 
 378  SANGER_SCORE_OFFSET = 33 
 379  SOLEXA_SCORE_OFFSET = 64 
 380   
 381   
382 -def solexa_quality_from_phred(phred_quality):
383 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 384 385 PHRED and Solexa quality scores are both log transformations of a 386 probality of error (high score = low probability of error). This function 387 takes a PHRED score, transforms it back to a probability of error, and 388 then re-expresses it as a Solexa score. This assumes the error estimates 389 are equivalent. 390 391 How does this work exactly? Well the PHRED quality is minus ten times the 392 base ten logarithm of the probability of error:: 393 394 phred_quality = -10*log(error,10) 395 396 Therefore, turning this round:: 397 398 error = 10 ** (- phred_quality / 10) 399 400 Now, Solexa qualities use a different log transformation:: 401 402 solexa_quality = -10*log(error/(1-error),10) 403 404 After substitution and a little manipulation we get:: 405 406 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 407 408 However, real Solexa files use a minimum quality of -5. This does have a 409 good reason - a random base call would be correct 25% of the time, 410 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 411 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 412 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 413 414 Taken literally, this logarithic formula would map a PHRED quality of zero 415 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 416 score of zero means a probability of error of one (i.e. the base call is 417 definitely wrong), which is worse than random! In practice, a PHRED quality 418 of zero usually means a default value, or perhaps random - and therefore 419 mapping it to the minimum Solexa score of -5 is reasonable. 420 421 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 422 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 423 quality of zero to -5.0 as well. 424 425 Note this function will return a floating point number, it is up to you to 426 round this to the nearest integer if appropriate. e.g. 427 428 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2) 429 80.00 430 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2) 431 50.00 432 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2) 433 19.96 434 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2) 435 9.54 436 >>> print "%0.2f" % round(solexa_quality_from_phred(5),2) 437 3.35 438 >>> print "%0.2f" % round(solexa_quality_from_phred(4),2) 439 1.80 440 >>> print "%0.2f" % round(solexa_quality_from_phred(3),2) 441 -0.02 442 >>> print "%0.2f" % round(solexa_quality_from_phred(2),2) 443 -2.33 444 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2) 445 -5.00 446 >>> print "%0.2f" % round(solexa_quality_from_phred(0),2) 447 -5.00 448 449 Notice that for high quality reads PHRED and Solexa scores are numerically 450 equal. The differences are important for poor quality reads, where PHRED 451 has a minimum of zero but Solexa scores can be negative. 452 453 Finally, as a special case where None is used for a "missing value", None 454 is returned: 455 456 >>> print solexa_quality_from_phred(None) 457 None 458 """ 459 if phred_quality is None: 460 #Assume None is used as some kind of NULL or NA value; return None 461 #e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 462 return None 463 elif phred_quality > 0: 464 #Solexa uses a minimum value of -5, which after rounding matches a 465 #random nucleotide base call. 466 return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10)) 467 elif phred_quality == 0: 468 #Special case, map to -5 as discussed in the docstring 469 return -5.0 470 else: 471 raise ValueError("PHRED qualities must be positive (or zero), not %s" 472 % repr(phred_quality))
473 474
475 -def phred_quality_from_solexa(solexa_quality):
476 """Convert a Solexa quality (which can be negative) to a PHRED quality. 477 478 PHRED and Solexa quality scores are both log transformations of a 479 probality of error (high score = low probability of error). This function 480 takes a Solexa score, transforms it back to a probability of error, and 481 then re-expresses it as a PHRED score. This assumes the error estimates 482 are equivalent. 483 484 The underlying formulas are given in the documentation for the sister 485 function solexa_quality_from_phred, in this case the operation is:: 486 487 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 488 489 This will return a floating point number, it is up to you to round this to 490 the nearest integer if appropriate. e.g. 491 492 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2) 493 80.00 494 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2) 495 20.04 496 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2) 497 10.41 498 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2) 499 3.01 500 >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2) 501 1.19 502 503 Note that a solexa_quality less then -5 is not expected, will trigger a 504 warning, but will still be converted as per the logarithmic mapping 505 (giving a number between 0 and 1.19 back). 506 507 As a special case where None is used for a "missing value", None is 508 returned: 509 510 >>> print phred_quality_from_solexa(None) 511 None 512 """ 513 if solexa_quality is None: 514 #Assume None is used as some kind of NULL or NA value; return None 515 return None 516 if solexa_quality < -5: 517 warnings.warn("Solexa quality less than -5 passed, %s" 518 % repr(solexa_quality), BiopythonWarning) 519 return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
520 521
522 -def _get_phred_quality(record):
523 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 524 525 If there are no PHRED qualities, but there are Solexa qualities, those are 526 used instead after conversion. 527 """ 528 try: 529 return record.letter_annotations["phred_quality"] 530 except KeyError: 531 pass 532 try: 533 return [phred_quality_from_solexa(q) for 534 q in record.letter_annotations["solexa_quality"]] 535 except KeyError: 536 raise ValueError("No suitable quality scores found in " 537 "letter_annotations of SeqRecord (id=%s)." 538 % record.id)
539 540 #Only map 0 to 93, we need to give a warning on truncating at 93 541 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp + SANGER_SCORE_OFFSET))) 542 for qp in range(0, 93 + 1)) 543 #Only map -5 to 93, we need to give a warning on truncating at 93 544 _solexa_to_sanger_quality_str = dict( 545 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs))) + 546 SANGER_SCORE_OFFSET))) 547 for qs in range(-5, 93 + 1)) 548 549
550 -def _get_sanger_quality_str(record):
551 """Returns a Sanger FASTQ encoded quality string (PRIVATE). 552 553 >>> from Bio.Seq import Seq 554 >>> from Bio.SeqRecord import SeqRecord 555 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 556 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0]}) 557 >>> _get_sanger_quality_str(r) 558 'SI?5+!' 559 560 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 561 the PHRED qualities are integers, this function is able to use a very fast 562 pre-cached mapping. However, if they are floats which differ slightly, then 563 it has to do the appropriate rounding - which is slower: 564 565 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 566 ... letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]}) 567 >>> _get_sanger_quality_str(r2) 568 'SI?5+!' 569 570 If your scores include a None value, this raises an exception: 571 572 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 573 ... letter_annotations = {"phred_quality":[50,40,30,20,10,None]}) 574 >>> _get_sanger_quality_str(r3) 575 Traceback (most recent call last): 576 ... 577 TypeError: A quality value of None was found 578 579 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 580 scores are used in preference: 581 582 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 583 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0], 584 ... "solexa_quality":[-5,-4,0,None,0,40]}) 585 >>> _get_sanger_quality_str(r4) 586 'SI?5+!' 587 588 If there are no PHRED scores, but there are Solexa scores, these are used 589 instead (after the approriate conversion): 590 591 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 592 ... letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]}) 593 >>> _get_sanger_quality_str(r5) 594 'I?5+$"' 595 596 Again, integer Solexa scores can be looked up in a pre-cached mapping making 597 this very fast. You can still use approximate floating point scores: 598 599 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 600 ... letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]}) 601 >>> _get_sanger_quality_str(r6) 602 'I?5+$"' 603 604 Notice that due to the limited range of printable ASCII characters, a 605 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 606 file (using ASCII 126, the tilde). This function will issue a warning 607 in this situation. 608 """ 609 #TODO - This functions works and is fast, but it is also ugly 610 #and there is considerable repetition of code for the other 611 #two FASTQ variants. 612 try: 613 #These take priority (in case both Solexa and PHRED scores found) 614 qualities = record.letter_annotations["phred_quality"] 615 except KeyError: 616 #Fall back on solexa scores... 617 pass 618 else: 619 #Try and use the precomputed mapping: 620 try: 621 return "".join([_phred_to_sanger_quality_str[qp] 622 for qp in qualities]) 623 except KeyError: 624 #Could be a float, or a None in the list, or a high value. 625 pass 626 if None in qualities: 627 raise TypeError("A quality value of None was found") 628 if max(qualities) >= 93.5: 629 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 630 BiopythonWarning) 631 #This will apply the truncation at 93, giving max ASCII 126 632 return "".join([chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) 633 for qp in qualities]) 634 #Fall back on the Solexa scores... 635 try: 636 qualities = record.letter_annotations["solexa_quality"] 637 except KeyError: 638 raise ValueError("No suitable quality scores found in " 639 "letter_annotations of SeqRecord (id=%s)." 640 % record.id) 641 #Try and use the precomputed mapping: 642 try: 643 return "".join([_solexa_to_sanger_quality_str[qs] 644 for qs in qualities]) 645 except KeyError: 646 #Either no PHRED scores, or something odd like a float or None 647 pass 648 if None in qualities: 649 raise TypeError("A quality value of None was found") 650 #Must do this the slow way, first converting the PHRED scores into 651 #Solexa scores: 652 if max(qualities) >= 93.5: 653 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 654 BiopythonWarning) 655 #This will apply the truncation at 93, giving max ASCII 126 656 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET)) 657 for qs in qualities])
658 659 #Only map 0 to 62, we need to give a warning on truncating at 62 660 assert 62 + SOLEXA_SCORE_OFFSET == 126 661 _phred_to_illumina_quality_str = dict((qp, chr(qp + SOLEXA_SCORE_OFFSET)) 662 for qp in range(0, 62 + 1)) 663 #Only map -5 to 62, we need to give a warning on truncating at 62 664 _solexa_to_illumina_quality_str = dict( 665 (qs, chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 666 for qs in range(-5, 62 + 1)) 667 668
669 -def _get_illumina_quality_str(record):
670 """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). 671 672 Notice that due to the limited range of printable ASCII characters, a 673 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 674 file (using ASCII 126, the tilde). This function will issue a warning 675 in this situation. 676 """ 677 #TODO - This functions works and is fast, but it is also ugly 678 #and there is considerable repetition of code for the other 679 #two FASTQ variants. 680 try: 681 #These take priority (in case both Solexa and PHRED scores found) 682 qualities = record.letter_annotations["phred_quality"] 683 except KeyError: 684 #Fall back on solexa scores... 685 pass 686 else: 687 #Try and use the precomputed mapping: 688 try: 689 return "".join([_phred_to_illumina_quality_str[qp] 690 for qp in qualities]) 691 except KeyError: 692 #Could be a float, or a None in the list, or a high value. 693 pass 694 if None in qualities: 695 raise TypeError("A quality value of None was found") 696 if max(qualities) >= 62.5: 697 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 698 BiopythonWarning) 699 #This will apply the truncation at 62, giving max ASCII 126 700 return "".join([chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) 701 for qp in qualities]) 702 #Fall back on the Solexa scores... 703 try: 704 qualities = record.letter_annotations["solexa_quality"] 705 except KeyError: 706 raise ValueError("No suitable quality scores found in " 707 "letter_annotations of SeqRecord (id=%s)." 708 % record.id) 709 #Try and use the precomputed mapping: 710 try: 711 return "".join([_solexa_to_illumina_quality_str[qs] 712 for qs in qualities]) 713 except KeyError: 714 #Either no PHRED scores, or something odd like a float or None 715 pass 716 if None in qualities: 717 raise TypeError("A quality value of None was found") 718 #Must do this the slow way, first converting the PHRED scores into 719 #Solexa scores: 720 if max(qualities) >= 62.5: 721 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 722 BiopythonWarning) 723 #This will apply the truncation at 62, giving max ASCII 126 724 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 725 for qs in qualities])
726 727 #Only map 0 to 62, we need to give a warning on truncating at 62 728 assert 62 + SOLEXA_SCORE_OFFSET == 126 729 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs + SOLEXA_SCORE_OFFSET))) 730 for qs in range(-5, 62 + 1)) 731 #Only map -5 to 62, we need to give a warning on truncating at 62 732 _phred_to_solexa_quality_str = dict( 733 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp))) + 734 SOLEXA_SCORE_OFFSET))) 735 for qp in range(0, 62 + 1)) 736 737
738 -def _get_solexa_quality_str(record):
739 """Returns a Solexa FASTQ encoded quality string (PRIVATE). 740 741 Notice that due to the limited range of printable ASCII characters, a 742 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 743 file (using ASCII 126, the tilde). This function will issue a warning 744 in this situation. 745 """ 746 #TODO - This functions works and is fast, but it is also ugly 747 #and there is considerable repetition of code for the other 748 #two FASTQ variants. 749 try: 750 #These take priority (in case both Solexa and PHRED scores found) 751 qualities = record.letter_annotations["solexa_quality"] 752 except KeyError: 753 #Fall back on PHRED scores... 754 pass 755 else: 756 #Try and use the precomputed mapping: 757 try: 758 return "".join([_solexa_to_solexa_quality_str[qs] 759 for qs in qualities]) 760 except KeyError: 761 #Could be a float, or a None in the list, or a high value. 762 pass 763 if None in qualities: 764 raise TypeError("A quality value of None was found") 765 if max(qualities) >= 62.5: 766 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 767 BiopythonWarning) 768 #This will apply the truncation at 62, giving max ASCII 126 769 return "".join([chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) 770 for qs in qualities]) 771 #Fall back on the PHRED scores... 772 try: 773 qualities = record.letter_annotations["phred_quality"] 774 except KeyError: 775 raise ValueError("No suitable quality scores found in " 776 "letter_annotations of SeqRecord (id=%s)." 777 % record.id) 778 #Try and use the precomputed mapping: 779 try: 780 return "".join([_phred_to_solexa_quality_str[qp] 781 for qp in qualities]) 782 except KeyError: 783 #Either no PHRED scores, or something odd like a float or None 784 #or too big to be in the cache 785 pass 786 if None in qualities: 787 raise TypeError("A quality value of None was found") 788 #Must do this the slow way, first converting the PHRED scores into 789 #Solexa scores: 790 if max(qualities) >= 62.5: 791 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 792 BiopythonWarning) 793 return "".join([chr(min(126, 794 int(round(solexa_quality_from_phred(qp))) + 795 SOLEXA_SCORE_OFFSET)) 796 for qp in qualities])
797 798 799 #TODO - Default to nucleotide or even DNA?
800 -def FastqGeneralIterator(handle):
801 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 802 803 This code does not try to interpret the quality string numerically. It 804 just returns tuples of the title, sequence and quality as strings. For 805 the sequence and quality, any whitespace (such as new lines) is removed. 806 807 Our SeqRecord based FASTQ iterators call this function internally, and then 808 turn the strings into a SeqRecord objects, mapping the quality string into 809 a list of numerical scores. If you want to do a custom quality mapping, 810 then you might consider calling this function directly. 811 812 For parsing FASTQ files, the title string from the "@" line at the start 813 of each record can optionally be omitted on the "+" lines. If it is 814 repeated, it must be identical. 815 816 The sequence string and the quality string can optionally be split over 817 multiple lines, although several sources discourage this. In comparison, 818 for the FASTA file format line breaks between 60 and 80 characters are 819 the norm. 820 821 WARNING - Because the "@" character can appear in the quality string, 822 this can cause problems as this is also the marker for the start of 823 a new sequence. In fact, the "+" sign can also appear as well. Some 824 sources recommended having no line breaks in the quality to avoid this, 825 but even that is not enough, consider this example:: 826 827 @071113_EAS56_0053:1:1:998:236 828 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 829 +071113_EAS56_0053:1:1:998:236 830 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 831 @071113_EAS56_0053:1:1:182:712 832 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 833 + 834 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 835 @071113_EAS56_0053:1:1:153:10 836 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 837 + 838 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 839 @071113_EAS56_0053:1:3:990:501 840 TGGGAGGTTTTATGTGGA 841 AAGCAGCAATGTACAAGA 842 + 843 IIIIIII.IIIIII1@44 844 @-7.%<&+/$/%4(++(% 845 846 This is four PHRED encoded FASTQ entries originally from an NCBI source 847 (given the read length of 36, these are probably Solexa Illumna reads where 848 the quality has been mapped onto the PHRED values). 849 850 This example has been edited to illustrate some of the nasty things allowed 851 in the FASTQ format. Firstly, on the "+" lines most but not all of the 852 (redundant) identifiers are omitted. In real files it is likely that all or 853 none of these extra identifiers will be present. 854 855 Secondly, while the first three sequences have been shown without line 856 breaks, the last has been split over multiple lines. In real files any line 857 breaks are likely to be consistent. 858 859 Thirdly, some of the quality string lines start with an "@" character. For 860 the second record this is unavoidable. However for the fourth sequence this 861 only happens because its quality string is split over two lines. A naive 862 parser could wrongly treat any line starting with an "@" as the beginning of 863 a new sequence! This code copes with this possible ambiguity by keeping 864 track of the length of the sequence which gives the expected length of the 865 quality string. 866 867 Using this tricky example file as input, this short bit of code demonstrates 868 what this parsing function would return: 869 870 >>> handle = open("Quality/tricky.fastq", "rU") 871 >>> for (title, sequence, quality) in FastqGeneralIterator(handle): 872 ... print title 873 ... print sequence, quality 874 071113_EAS56_0053:1:1:998:236 875 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 876 071113_EAS56_0053:1:1:182:712 877 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 878 071113_EAS56_0053:1:1:153:10 879 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 880 071113_EAS56_0053:1:3:990:501 881 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 882 >>> handle.close() 883 884 Finally we note that some sources state that the quality string should 885 start with "!" (which using the PHRED mapping means the first letter always 886 has a quality score of zero). This rather restrictive rule is not widely 887 observed, so is therefore ignored here. One plus point about this "!" rule 888 is that (provided there are no line breaks in the quality sequence) it 889 would prevent the above problem with the "@" character. 890 """ 891 #We need to call handle.readline() at least four times per record, 892 #so we'll save a property look up each time: 893 handle_readline = handle.readline 894 895 #Skip any text before the first record (e.g. blank lines, comments?) 896 while True: 897 line = handle_readline() 898 if not line: 899 return # Premature end of file, or just empty? 900 if line[0] == "@": 901 break 902 if isinstance(line[0], int): 903 raise ValueError("Is this handle in binary mode not text mode?") 904 905 while line: 906 if line[0] != "@": 907 raise ValueError( 908 "Records in Fastq files should start with '@' character") 909 title_line = line[1:].rstrip() 910 #Will now be at least one line of quality data - in most FASTQ files 911 #just one line! We therefore use string concatenation (if needed) 912 #rather using than the "".join(...) trick just in case it is multiline: 913 seq_string = handle_readline().rstrip() 914 #There may now be more sequence lines, or the "+" quality marker line: 915 while True: 916 line = handle_readline() 917 if not line: 918 raise ValueError("End of file without quality information.") 919 if line[0] == "+": 920 #The title here is optional, but if present must match! 921 second_title = line[1:].rstrip() 922 if second_title and second_title != title_line: 923 raise ValueError("Sequence and quality captions differ.") 924 break 925 seq_string += line.rstrip() # removes trailing newlines 926 #This is going to slow things down a little, but assuming 927 #this isn't allowed we should try and catch it here: 928 if " " in seq_string or "\t" in seq_string: 929 raise ValueError("Whitespace is not allowed in the sequence.") 930 seq_len = len(seq_string) 931 932 #Will now be at least one line of quality data... 933 quality_string = handle_readline().rstrip() 934 #There may now be more quality data, or another sequence, or EOF 935 while True: 936 line = handle_readline() 937 if not line: 938 break # end of file 939 if line[0] == "@": 940 #This COULD be the start of a new sequence. However, it MAY just 941 #be a line of quality data which starts with a "@" character. We 942 #should be able to check this by looking at the sequence length 943 #and the amount of quality data found so far. 944 if len(quality_string) >= seq_len: 945 #We expect it to be equal if this is the start of a new record. 946 #If the quality data is longer, we'll raise an error below. 947 break 948 #Continue - its just some (more) quality data. 949 quality_string += line.rstrip() 950 951 if seq_len != len(quality_string): 952 raise ValueError("Lengths of sequence and quality values differs " 953 " for %s (%i and %i)." 954 % (title_line, seq_len, len(quality_string))) 955 956 #Return the record and then continue... 957 yield (title_line, seq_string, quality_string) 958 raise StopIteration
959 960
961 -def FastqPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
962 """Generator function to iterate over FASTQ records (as SeqRecord objects). 963 964 - handle - input file 965 - alphabet - optional alphabet 966 - title2ids - A function that, when given the title line from the FASTQ 967 file (without the beginning >), will return the id, name and 968 description (in that order) for the record as a tuple of 969 strings. If this is not given, then the entire title line 970 will be used as the description, and the first word as the 971 id and name. 972 973 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 974 975 For each sequence in a (Sanger style) FASTQ file there is a matching string 976 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 977 values with an offset of 33. 978 979 For example, consider a file containing three short reads:: 980 981 @EAS54_6_R1_2_1_413_324 982 CCCTTCTTGTCTTCAGCGTTTCTCC 983 + 984 ;;3;;;;;;;;;;;;7;;;;;;;88 985 @EAS54_6_R1_2_1_540_792 986 TTGGCAGGCCAAGGCCGATGGATCA 987 + 988 ;;;;;;;;;;;7;;;;;-;;;3;83 989 @EAS54_6_R1_2_1_443_348 990 GTTGCTTCTGGCGTGGGTGGGGGGG 991 + 992 ;;;;;;;;;;;9;7;;.7;393333 993 994 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 995 string encoding the PHRED qualities using a ASCI values with an offset of 996 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 997 998 Using this module directly you might run: 999 1000 >>> handle = open("Quality/example.fastq", "rU") 1001 >>> for record in FastqPhredIterator(handle): 1002 ... print record.id, record.seq 1003 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1004 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1005 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1006 >>> handle.close() 1007 1008 Typically however, you would call this via Bio.SeqIO instead with "fastq" 1009 (or "fastq-sanger") as the format: 1010 1011 >>> from Bio import SeqIO 1012 >>> handle = open("Quality/example.fastq", "rU") 1013 >>> for record in SeqIO.parse(handle, "fastq"): 1014 ... print record.id, record.seq 1015 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1016 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1017 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1018 >>> handle.close() 1019 1020 If you want to look at the qualities, they are record in each record's 1021 per-letter-annotation dictionary as a simple list of integers: 1022 1023 >>> print record.letter_annotations["phred_quality"] 1024 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1025 1026 """ 1027 assert SANGER_SCORE_OFFSET == ord("!") 1028 #Originally, I used a list expression for each record: 1029 # 1030 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1031 # 1032 #Precomputing is faster, perhaps partly by avoiding the subtractions. 1033 q_mapping = dict() 1034 for letter in range(0, 255): 1035 q_mapping[chr(letter)] = letter - SANGER_SCORE_OFFSET 1036 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1037 if title2ids: 1038 id, name, descr = title2ids(title_line) 1039 else: 1040 descr = title_line 1041 id = descr.split()[0] 1042 name = id 1043 record = SeqRecord(Seq(seq_string, alphabet), 1044 id=id, name=name, description=descr) 1045 qualities = [q_mapping[letter] for letter in quality_string] 1046 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1047 raise ValueError("Invalid character in quality string") 1048 #For speed, will now use a dirty trick to speed up assigning the 1049 #qualities. We do this to bypass the length check imposed by the 1050 #per-letter-annotations restricted dict (as this has already been 1051 #checked by FastqGeneralIterator). This is equivalent to: 1052 #record.letter_annotations["phred_quality"] = qualities 1053 dict.__setitem__(record._per_letter_annotations, 1054 "phred_quality", qualities) 1055 yield record
1056 1057
1058 -def FastqSolexaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1059 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1060 1061 The optional arguments are the same as those for the FastqPhredIterator. 1062 1063 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1064 encoding the Solexa integer qualities using ASCII values with an offset 1065 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1066 will NOT perform any automatic conversion when loading. 1067 1068 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1069 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1070 1071 For example, consider a file containing these five records:: 1072 1073 @SLXA-B3_649_FC8437_R1_1_1_610_79 1074 GATGTGCAATACCTTTGTAGAGGAA 1075 +SLXA-B3_649_FC8437_R1_1_1_610_79 1076 YYYYYYYYYYYYYYYYYYWYWYYSU 1077 @SLXA-B3_649_FC8437_R1_1_1_397_389 1078 GGTTTGAGAAAGAGAAATGAGATAA 1079 +SLXA-B3_649_FC8437_R1_1_1_397_389 1080 YYYYYYYYYWYYYYWWYYYWYWYWW 1081 @SLXA-B3_649_FC8437_R1_1_1_850_123 1082 GAGGGTGTTGATCATGATGATGGCG 1083 +SLXA-B3_649_FC8437_R1_1_1_850_123 1084 YYYYYYYYYYYYYWYYWYYSYYYSY 1085 @SLXA-B3_649_FC8437_R1_1_1_362_549 1086 GGAAACAAAGTTTTTCTCAACATAG 1087 +SLXA-B3_649_FC8437_R1_1_1_362_549 1088 YYYYYYYYYYYYYYYYYYWWWWYWY 1089 @SLXA-B3_649_FC8437_R1_1_1_183_714 1090 GTATTATTTAATGGCATACACTCAA 1091 +SLXA-B3_649_FC8437_R1_1_1_183_714 1092 YYYYYYYYYYWYYYYWYWWUWWWQQ 1093 1094 Using this module directly you might run: 1095 1096 >>> handle = open("Quality/solexa_example.fastq", "rU") 1097 >>> for record in FastqSolexaIterator(handle): 1098 ... print record.id, record.seq 1099 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1100 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1101 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1102 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1103 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1104 >>> handle.close() 1105 1106 Typically however, you would call this via Bio.SeqIO instead with 1107 "fastq-solexa" as the format: 1108 1109 >>> from Bio import SeqIO 1110 >>> handle = open("Quality/solexa_example.fastq", "rU") 1111 >>> for record in SeqIO.parse(handle, "fastq-solexa"): 1112 ... print record.id, record.seq 1113 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1114 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1115 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1116 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1117 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1118 >>> handle.close() 1119 1120 If you want to look at the qualities, they are recorded in each record's 1121 per-letter-annotation dictionary as a simple list of integers: 1122 1123 >>> print record.letter_annotations["solexa_quality"] 1124 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1125 1126 These scores aren't very good, but they are high enough that they map 1127 almost exactly onto PHRED scores: 1128 1129 >>> print "%0.2f" % phred_quality_from_solexa(25) 1130 25.01 1131 1132 Let's look at faked example read which is even worse, where there are 1133 more noticeable differences between the Solexa and PHRED scores:: 1134 1135 @slxa_0001_1_0001_01 1136 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1137 +slxa_0001_1_0001_01 1138 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1139 1140 Again, you would typically use Bio.SeqIO to read this file in (rather than 1141 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1142 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1143 as shown above. This example has only as one entry, so instead we can 1144 use the Bio.SeqIO.read() function: 1145 1146 >>> from Bio import SeqIO 1147 >>> handle = open("Quality/solexa_faked.fastq", "rU") 1148 >>> record = SeqIO.read(handle, "fastq-solexa") 1149 >>> handle.close() 1150 >>> print record.id, record.seq 1151 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1152 >>> print record.letter_annotations["solexa_quality"] 1153 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1154 1155 These quality scores are so low that when converted from the Solexa scheme 1156 into PHRED scores they look quite different: 1157 1158 >>> print "%0.2f" % phred_quality_from_solexa(-1) 1159 2.54 1160 >>> print "%0.2f" % phred_quality_from_solexa(-5) 1161 1.19 1162 1163 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1164 method to output the record(s): 1165 1166 >>> print record.format("fastq-solexa") 1167 @slxa_0001_1_0001_01 1168 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1169 + 1170 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1171 <BLANKLINE> 1172 1173 Note this output is slightly different from the input file as Biopython 1174 has left out the optional repetition of the sequence identifier on the "+" 1175 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1176 output format instead, and Biopython will do the conversion for you: 1177 1178 >>> print record.format("fastq") 1179 @slxa_0001_1_0001_01 1180 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1181 + 1182 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1183 <BLANKLINE> 1184 1185 >>> print record.format("qual") 1186 >slxa_0001_1_0001_01 1187 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1188 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1189 1 1 1190 <BLANKLINE> 1191 1192 As shown above, the poor quality Solexa reads have been mapped to the 1193 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1194 """ 1195 q_mapping = dict() 1196 for letter in range(0, 255): 1197 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1198 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1199 if title2ids: 1200 id, name, descr = title_line 1201 else: 1202 descr = title_line 1203 id = descr.split()[0] 1204 name = id 1205 record = SeqRecord(Seq(seq_string, alphabet), 1206 id=id, name=name, description=descr) 1207 qualities = [q_mapping[letter] for letter in quality_string] 1208 #DO NOT convert these into PHRED qualities automatically! 1209 if qualities and (min(qualities) < -5 or max(qualities) > 62): 1210 raise ValueError("Invalid character in quality string") 1211 #Dirty trick to speed up this line: 1212 #record.letter_annotations["solexa_quality"] = qualities 1213 dict.__setitem__(record._per_letter_annotations, 1214 "solexa_quality", qualities) 1215 yield record
1216 1217
1218 -def FastqIlluminaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1219 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). 1220 1221 The optional arguments are the same as those for the FastqPhredIterator. 1222 1223 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1224 encoding PHRED integer qualities using ASCII values with an offset of 64. 1225 1226 >>> from Bio import SeqIO 1227 >>> record = SeqIO.read(open("Quality/illumina_faked.fastq"), "fastq-illumina") 1228 >>> print record.id, record.seq 1229 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1230 >>> max(record.letter_annotations["phred_quality"]) 1231 40 1232 >>> min(record.letter_annotations["phred_quality"]) 1233 0 1234 1235 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1236 with an ASCII offset of 64. They are approximately equal but only for high 1237 quality reads. If you have an old Solexa/Illumina file with negative 1238 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1239 1240 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina") 1241 Traceback (most recent call last): 1242 ... 1243 ValueError: Invalid character in quality string 1244 1245 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1246 """ 1247 q_mapping = dict() 1248 for letter in range(0, 255): 1249 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1250 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1251 if title2ids: 1252 id, name, descr = title2ids(title_line) 1253 else: 1254 descr = title_line 1255 id = descr.split()[0] 1256 name = id 1257 record = SeqRecord(Seq(seq_string, alphabet), 1258 id=id, name=name, description=descr) 1259 qualities = [q_mapping[letter] for letter in quality_string] 1260 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1261 raise ValueError("Invalid character in quality string") 1262 #Dirty trick to speed up this line: 1263 #record.letter_annotations["phred_quality"] = qualities 1264 dict.__setitem__(record._per_letter_annotations, 1265 "phred_quality", qualities) 1266 yield record
1267 1268
1269 -def QualPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1270 """For QUAL files which include PHRED quality scores, but no sequence. 1271 1272 For example, consider this short QUAL file:: 1273 1274 >EAS54_6_R1_2_1_413_324 1275 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1276 26 26 26 23 23 1277 >EAS54_6_R1_2_1_540_792 1278 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1279 26 18 26 23 18 1280 >EAS54_6_R1_2_1_443_348 1281 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1282 24 18 18 18 18 1283 1284 Using this module directly you might run: 1285 1286 >>> handle = open("Quality/example.qual", "rU") 1287 >>> for record in QualPhredIterator(handle): 1288 ... print record.id, record.seq 1289 EAS54_6_R1_2_1_413_324 ????????????????????????? 1290 EAS54_6_R1_2_1_540_792 ????????????????????????? 1291 EAS54_6_R1_2_1_443_348 ????????????????????????? 1292 >>> handle.close() 1293 1294 Typically however, you would call this via Bio.SeqIO instead with "qual" 1295 as the format: 1296 1297 >>> from Bio import SeqIO 1298 >>> handle = open("Quality/example.qual", "rU") 1299 >>> for record in SeqIO.parse(handle, "qual"): 1300 ... print record.id, record.seq 1301 EAS54_6_R1_2_1_413_324 ????????????????????????? 1302 EAS54_6_R1_2_1_540_792 ????????????????????????? 1303 EAS54_6_R1_2_1_443_348 ????????????????????????? 1304 >>> handle.close() 1305 1306 Becase QUAL files don't contain the sequence string itself, the seq 1307 property is set to an UnknownSeq object. As no alphabet was given, this 1308 has defaulted to a generic single letter alphabet and the character "?" 1309 used. 1310 1311 By specifying a nucleotide alphabet, "N" is used instead: 1312 1313 >>> from Bio import SeqIO 1314 >>> from Bio.Alphabet import generic_dna 1315 >>> handle = open("Quality/example.qual", "rU") 1316 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1317 ... print record.id, record.seq 1318 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1319 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1320 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1321 >>> handle.close() 1322 1323 However, the quality scores themselves are available as a list of integers 1324 in each record's per-letter-annotation: 1325 1326 >>> print record.letter_annotations["phred_quality"] 1327 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1328 1329 You can still slice one of these SeqRecord objects with an UnknownSeq: 1330 1331 >>> sub_record = record[5:10] 1332 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"] 1333 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1334 1335 As of Biopython 1.59, this parser will accept files with negatives quality 1336 scores but will replace them with the lowest possible PHRED score of zero. 1337 This will trigger a warning, previously it raised a ValueError exception. 1338 """ 1339 #Skip any text before the first record (e.g. blank lines, comments) 1340 while True: 1341 line = handle.readline() 1342 if line == "": 1343 return # Premature end of file, or just empty? 1344 if line[0] == ">": 1345 break 1346 1347 while True: 1348 if line[0] != ">": 1349 raise ValueError( 1350 "Records in Fasta files should start with '>' character") 1351 if title2ids: 1352 id, name, descr = title2ids(line[1:].rstrip()) 1353 else: 1354 descr = line[1:].rstrip() 1355 id = descr.split()[0] 1356 name = id 1357 1358 qualities = [] 1359 line = handle.readline() 1360 while True: 1361 if not line: 1362 break 1363 if line[0] == ">": 1364 break 1365 qualities.extend([int(word) for word in line.split()]) 1366 line = handle.readline() 1367 1368 if qualities and min(qualities) < 0: 1369 warnings.warn(("Negative quality score %i found, " + 1370 "substituting PHRED zero instead.") 1371 % min(qualities), BiopythonParserWarning) 1372 qualities = [max(0, q) for q in qualities] 1373 1374 #Return the record and then continue... 1375 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1376 id=id, name=name, description=descr) 1377 #Dirty trick to speed up this line: 1378 #record.letter_annotations["phred_quality"] = qualities 1379 dict.__setitem__(record._per_letter_annotations, 1380 "phred_quality", qualities) 1381 yield record 1382 1383 if not line: 1384 return # StopIteration 1385 assert False, "Should not reach this line"
1386 1387
1388 -class FastqPhredWriter(SequentialSequenceWriter):
1389 """Class to write standard FASTQ format files (using PHRED quality scores). 1390 1391 Although you can use this class directly, you are strongly encouraged 1392 to use the Bio.SeqIO.write() function instead via the format name "fastq" 1393 or the alias "fastq-sanger". For example, this code reads in a standard 1394 Sanger style FASTQ file (using PHRED scores) and re-saves it as another 1395 Sanger style FASTQ file: 1396 1397 >>> from Bio import SeqIO 1398 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1399 >>> out_handle = open("Quality/temp.fastq", "w") 1400 >>> SeqIO.write(record_iterator, out_handle, "fastq") 1401 3 1402 >>> out_handle.close() 1403 1404 You might want to do this if the original file included extra line breaks, 1405 which while valid may not be supported by all tools. The output file from 1406 Biopython will have each sequence on a single line, and each quality 1407 string on a single line (which is considered desirable for maximum 1408 compatibility). 1409 1410 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1411 quality scores) is converted into a standard Sanger style FASTQ file using 1412 PHRED qualities: 1413 1414 >>> from Bio import SeqIO 1415 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1416 >>> out_handle = open("Quality/temp.fastq", "w") 1417 >>> SeqIO.write(record_iterator, out_handle, "fastq") 1418 5 1419 >>> out_handle.close() 1420 1421 This code is also called if you use the .format("fastq") method of a 1422 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1423 1424 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1425 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1426 warning is issued. 1427 1428 P.S. To avoid cluttering up your working directory, you can delete this 1429 temporary file now: 1430 1431 >>> import os 1432 >>> os.remove("Quality/temp.fastq") 1433 """ 1434 assert SANGER_SCORE_OFFSET == ord("!") 1435
1436 - def write_record(self, record):
1437 """Write a single FASTQ record to the file.""" 1438 assert self._header_written 1439 assert not self._footer_written 1440 self._record_written = True 1441 #TODO - Is an empty sequence allowed in FASTQ format? 1442 if record.seq is None: 1443 raise ValueError("No sequence for record %s" % record.id) 1444 seq_str = str(record.seq) 1445 qualities_str = _get_sanger_quality_str(record) 1446 if len(qualities_str) != len(seq_str): 1447 raise ValueError("Record %s has sequence length %i but %i quality scores" 1448 % (record.id, len(seq_str), len(qualities_str))) 1449 1450 #FASTQ files can include a description, just like FASTA files 1451 #(at least, this is what the NCBI Short Read Archive does) 1452 id = self.clean(record.id) 1453 description = self.clean(record.description) 1454 if description and description.split(None, 1)[0] == id: 1455 #The description includes the id at the start 1456 title = description 1457 elif description: 1458 title = "%s %s" % (id, description) 1459 else: 1460 title = id 1461 1462 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1463 1464
1465 -class QualPhredWriter(SequentialSequenceWriter):
1466 """Class to write QUAL format files (using PHRED quality scores). 1467 1468 Although you can use this class directly, you are strongly encouraged 1469 to use the Bio.SeqIO.write() function instead. For example, this code 1470 reads in a FASTQ file and saves the quality scores into a QUAL file: 1471 1472 >>> from Bio import SeqIO 1473 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1474 >>> out_handle = open("Quality/temp.qual", "w") 1475 >>> SeqIO.write(record_iterator, out_handle, "qual") 1476 3 1477 >>> out_handle.close() 1478 1479 This code is also called if you use the .format("qual") method of a 1480 SeqRecord. 1481 1482 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1483 1484 >>> import os 1485 >>> os.remove("Quality/temp.qual") 1486 """
1487 - def __init__(self, handle, wrap=60, record2title=None):
1488 """Create a QUAL writer. 1489 1490 Arguments: 1491 - handle - Handle to an output file, e.g. as returned 1492 by open(filename, "w") 1493 - wrap - Optional line length used to wrap sequence lines. 1494 Defaults to wrapping the sequence at 60 characters 1495 Use zero (or None) for no wrapping, giving a single 1496 long line for the sequence. 1497 - record2title - Optional function to return the text to be 1498 used for the title line of each record. By default 1499 a combination of the record.id and record.description 1500 is used. If the record.description starts with the 1501 record.id, then just the record.description is used. 1502 1503 The record2title argument is present for consistency with the 1504 Bio.SeqIO.FastaIO writer class. 1505 """ 1506 SequentialSequenceWriter.__init__(self, handle) 1507 #self.handle = handle 1508 self.wrap = None 1509 if wrap: 1510 if wrap < 1: 1511 raise ValueError 1512 self.wrap = wrap 1513 self.record2title = record2title
1514
1515 - def write_record(self, record):
1516 """Write a single QUAL record to the file.""" 1517 assert self._header_written 1518 assert not self._footer_written 1519 self._record_written = True 1520 1521 handle = self.handle 1522 wrap = self.wrap 1523 1524 if self.record2title: 1525 title = self.clean(self.record2title(record)) 1526 else: 1527 id = self.clean(record.id) 1528 description = self.clean(record.description) 1529 if description and description.split(None, 1)[0] == id: 1530 #The description includes the id at the start 1531 title = description 1532 elif description: 1533 title = "%s %s" % (id, description) 1534 else: 1535 title = id 1536 handle.write(">%s\n" % title) 1537 1538 qualities = _get_phred_quality(record) 1539 try: 1540 #This rounds to the nearest integer. 1541 #TODO - can we record a float in a qual file? 1542 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1543 except TypeError, e: 1544 if None in qualities: 1545 raise TypeError("A quality value of None was found") 1546 else: 1547 raise e 1548 1549 if wrap > 5: 1550 #Fast wrapping 1551 data = " ".join(qualities_strs) 1552 while True: 1553 if len(data) <= wrap: 1554 self.handle.write(data + "\n") 1555 break 1556 else: 1557 #By construction there must be spaces in the first X chars 1558 #(unless we have X digit or higher quality scores!) 1559 i = data.rfind(" ", 0, wrap) 1560 handle.write(data[:i] + "\n") 1561 data = data[i + 1:] 1562 elif wrap: 1563 #Safe wrapping 1564 while qualities_strs: 1565 line = qualities_strs.pop(0) 1566 while qualities_strs \ 1567 and len(line) + 1 + len(qualities_strs[0]) < wrap: 1568 line += " " + qualities_strs.pop(0) 1569 handle.write(line + "\n") 1570 else: 1571 #No wrapping 1572 data = " ".join(qualities_strs) 1573 handle.write(data + "\n")
1574 1575
1576 -class FastqSolexaWriter(SequentialSequenceWriter):
1577 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities). 1578 1579 This outputs FASTQ files like those from the early Solexa/Illumina 1580 pipeline, using Solexa scores and an ASCII offset of 64. These are 1581 NOT compatible with the standard Sanger style PHRED FASTQ files. 1582 1583 If your records contain a "solexa_quality" entry under letter_annotations, 1584 this is used, otherwise any "phred_quality" entry will be used after 1585 conversion using the solexa_quality_from_phred function. If neither style 1586 of quality scores are present, an exception is raised. 1587 1588 Although you can use this class directly, you are strongly encouraged 1589 to use the Bio.SeqIO.write() function instead. For example, this code 1590 reads in a FASTQ file and re-saves it as another FASTQ file: 1591 1592 >>> from Bio import SeqIO 1593 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1594 >>> out_handle = open("Quality/temp.fastq", "w") 1595 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1596 5 1597 >>> out_handle.close() 1598 1599 You might want to do this if the original file included extra line breaks, 1600 which (while valid) may not be supported by all tools. The output file 1601 from Biopython will have each sequence on a single line, and each quality 1602 string on a single line (which is considered desirable for maximum 1603 compatibility). 1604 1605 This code is also called if you use the .format("fastq-solexa") method of 1606 a SeqRecord. For example, 1607 1608 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1609 >>> print record.format("fastq-solexa") 1610 @Test PHRED qualities from 40 to 0 inclusive 1611 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1612 + 1613 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1614 <BLANKLINE> 1615 1616 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1617 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1618 a warning is issued. 1619 1620 P.S. Don't forget to delete the temp file if you don't need it anymore: 1621 1622 >>> import os 1623 >>> os.remove("Quality/temp.fastq") 1624 """
1625 - def write_record(self, record):
1626 """Write a single FASTQ record to the file.""" 1627 assert self._header_written 1628 assert not self._footer_written 1629 self._record_written = True 1630 1631 #TODO - Is an empty sequence allowed in FASTQ format? 1632 if record.seq is None: 1633 raise ValueError("No sequence for record %s" % record.id) 1634 seq_str = str(record.seq) 1635 qualities_str = _get_solexa_quality_str(record) 1636 if len(qualities_str) != len(seq_str): 1637 raise ValueError("Record %s has sequence length %i but %i quality scores" 1638 % (record.id, len(seq_str), len(qualities_str))) 1639 1640 #FASTQ files can include a description, just like FASTA files 1641 #(at least, this is what the NCBI Short Read Archive does) 1642 id = self.clean(record.id) 1643 description = self.clean(record.description) 1644 if description and description.split(None, 1)[0] == id: 1645 #The description includes the id at the start 1646 title = description 1647 elif description: 1648 title = "%s %s" % (id, description) 1649 else: 1650 title = id 1651 1652 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1653 1654
1655 -class FastqIlluminaWriter(SequentialSequenceWriter):
1656 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores). 1657 1658 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1659 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1660 compatible with the standard Sanger style PHRED FASTQ files which use an 1661 ASCII offset of 32. 1662 1663 Although you can use this class directly, you are strongly encouraged to 1664 use the Bio.SeqIO.write() function with format name "fastq-illumina" 1665 instead. This code is also called if you use the .format("fastq-illumina") 1666 method of a SeqRecord. For example, 1667 1668 >>> from Bio import SeqIO 1669 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1670 >>> print record.format("fastq-illumina") 1671 @Test PHRED qualities from 40 to 0 inclusive 1672 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1673 + 1674 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1675 <BLANKLINE> 1676 1677 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1678 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1679 warning is issued. 1680 """
1681 - def write_record(self, record):
1682 """Write a single FASTQ record to the file.""" 1683 assert self._header_written 1684 assert not self._footer_written 1685 self._record_written = True 1686 1687 #TODO - Is an empty sequence allowed in FASTQ format? 1688 if record.seq is None: 1689 raise ValueError("No sequence for record %s" % record.id) 1690 seq_str = str(record.seq) 1691 qualities_str = _get_illumina_quality_str(record) 1692 if len(qualities_str) != len(seq_str): 1693 raise ValueError("Record %s has sequence length %i but %i quality scores" 1694 % (record.id, len(seq_str), len(qualities_str))) 1695 1696 #FASTQ files can include a description, just like FASTA files 1697 #(at least, this is what the NCBI Short Read Archive does) 1698 id = self.clean(record.id) 1699 description = self.clean(record.description) 1700 if description and description.split(None, 1)[0] == id: 1701 #The description includes the id at the start 1702 title = description 1703 elif description: 1704 title = "%s %s" % (id, description) 1705 else: 1706 title = id 1707 1708 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1709 1710
1711 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=single_letter_alphabet, title2ids=None):
1712 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1713 1714 For example, consider this short QUAL file with PHRED quality scores:: 1715 1716 >EAS54_6_R1_2_1_413_324 1717 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1718 26 26 26 23 23 1719 >EAS54_6_R1_2_1_540_792 1720 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1721 26 18 26 23 18 1722 >EAS54_6_R1_2_1_443_348 1723 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1724 24 18 18 18 18 1725 1726 And a matching FASTA file:: 1727 1728 >EAS54_6_R1_2_1_413_324 1729 CCCTTCTTGTCTTCAGCGTTTCTCC 1730 >EAS54_6_R1_2_1_540_792 1731 TTGGCAGGCCAAGGCCGATGGATCA 1732 >EAS54_6_R1_2_1_443_348 1733 GTTGCTTCTGGCGTGGGTGGGGGGG 1734 1735 You can parse these separately using Bio.SeqIO with the "qual" and 1736 "fasta" formats, but then you'll get a group of SeqRecord objects with 1737 no sequence, and a matching group with the sequence but not the 1738 qualities. Because it only deals with one input file handle, Bio.SeqIO 1739 can't be used to read the two files together - but this function can! 1740 For example, 1741 1742 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1743 ... open("Quality/example.qual", "rU")) 1744 >>> for record in rec_iter: 1745 ... print record.id, record.seq 1746 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1747 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1748 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1749 1750 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1751 they are in each record's per-letter-annotation dictionary as a simple 1752 list of integers: 1753 1754 >>> print record.letter_annotations["phred_quality"] 1755 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1756 1757 If you have access to data as a FASTQ format file, using that directly 1758 would be simpler and more straight forward. Note that you can easily use 1759 this function to convert paired FASTA and QUAL files into FASTQ files: 1760 1761 >>> from Bio import SeqIO 1762 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1763 ... open("Quality/example.qual", "rU")) 1764 >>> out_handle = open("Quality/temp.fastq", "w") 1765 >>> SeqIO.write(rec_iter, out_handle, "fastq") 1766 3 1767 >>> out_handle.close() 1768 1769 And don't forget to clean up the temp file if you don't need it anymore: 1770 1771 >>> import os 1772 >>> os.remove("Quality/temp.fastq") 1773 """ 1774 from Bio.SeqIO.FastaIO import FastaIterator 1775 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, 1776 title2ids=title2ids) 1777 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, 1778 title2ids=title2ids) 1779 1780 #Using zip(...) would create a list loading everything into memory! 1781 #It would also not catch any extra records found in only one file. 1782 while True: 1783 try: 1784 f_rec = fasta_iter.next() 1785 except StopIteration: 1786 f_rec = None 1787 try: 1788 q_rec = qual_iter.next() 1789 except StopIteration: 1790 q_rec = None 1791 if f_rec is None and q_rec is None: 1792 #End of both files 1793 break 1794 if f_rec is None: 1795 raise ValueError("FASTA file has more entries than the QUAL file.") 1796 if q_rec is None: 1797 raise ValueError("QUAL file has more entries than the FASTA file.") 1798 if f_rec.id != q_rec.id: 1799 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." 1800 % (f_rec.id, q_rec.id)) 1801 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1802 raise ValueError("Sequence length and number of quality scores disagree for %s" 1803 % f_rec.id) 1804 #Merge the data.... 1805 f_rec.letter_annotations[ 1806 "phred_quality"] = q_rec.letter_annotations["phred_quality"] 1807 yield f_rec
1808 #Done 1809 1810 1811 if __name__ == "__main__": 1812 from Bio._utils import run_doctest 1813 run_doctest(verbose=0) 1814