Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2015 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # This module is for reading and writing FASTQ and QUAL format files as 
   7  # SeqRecord objects, and is expected to be used via the Bio.SeqIO API. 
   8   
   9  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
  10   
  11  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  12  shown below. 
  13   
  14  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  15  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  16  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  17  are used containing the sequence and the quality information separately. 
  18   
  19  The PHRED software reads DNA sequencing trace files, calls bases, and 
  20  assigns a non-negative quality value to each called base using a logged 
  21  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  22   
  23      Pe = 1.0,         Q =  0 
  24      Pe = 0.1,         Q = 10 
  25      Pe = 0.01,        Q = 20 
  26      ... 
  27      Pe = 0.00000001,  Q = 80 
  28      Pe = 0.000000001, Q = 90 
  29   
  30  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  31  In the QUAL format these quality values are held as space separated text in 
  32  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  33  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  34  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  35  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  36  quality are then stored in pairs in a FASTA like format. 
  37   
  38  Unfortunately there is no official document describing the FASTQ file format, 
  39  and worse, several related but different variants exist. For more details, 
  40  please read this open access publication:: 
  41   
  42      The Sanger FASTQ file format for sequences with quality scores, and the 
  43      Solexa/Illumina FASTQ variants. 
  44      P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 
  45      M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 
  46      Nucleic Acids Research 2010 38(6):1767-1771 
  47      http://dx.doi.org/10.1093/nar/gkp1137 
  48   
  49  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  50  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  51  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  52  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  53  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  54  FASTA file using the -s or -seq argument instead. 
  55   
  56  The bad news is that Solexa/Illumina did things differently - they have their 
  57  own scoring system AND their own incompatible versions of the FASTQ format. 
  58  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  59  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  60  reasonable mapping can be achieved between them, and they are approximately 
  61  equal for higher quality reads). 
  62   
  63  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  64  own score mapping and an ASCII offset of 64. To make things worse, for the 
  65  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  66  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  67  with an ASCII offset of 64. 
  68   
  69  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  70  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  71   
  72  The good news is that as of CASAVA version 1.8, Illumina sequencers will 
  73  produce FASTQ files using the standard Sanger encoding. 
  74   
  75  You are expected to use this module via the Bio.SeqIO functions, with the 
  76  following format names: 
  77   
  78      - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  79      - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  80        offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). 
  81        These can potentially hold PHRED scores from 0 to 93. 
  82      - "fastq-sanger" is an alias for "fastq". 
  83      - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  84        files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  85        scores from -5 to 62. 
  86      - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using 
  87        PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 
  88        to 62. 
  89   
  90  We could potentially add support for "qual-solexa" meaning QUAL files which 
  91  contain Solexa scores, but thus far there isn't any reason to use such files. 
  92   
  93  For example, consider the following short FASTQ file:: 
  94   
  95      @EAS54_6_R1_2_1_413_324 
  96      CCCTTCTTGTCTTCAGCGTTTCTCC 
  97      + 
  98      ;;3;;;;;;;;;;;;7;;;;;;;88 
  99      @EAS54_6_R1_2_1_540_792 
 100      TTGGCAGGCCAAGGCCGATGGATCA 
 101      + 
 102      ;;;;;;;;;;;7;;;;;-;;;3;83 
 103      @EAS54_6_R1_2_1_443_348 
 104      GTTGCTTCTGGCGTGGGTGGGGGGG 
 105      + 
 106      ;;;;;;;;;;;9;7;;.7;393333 
 107   
 108  This contains three reads of length 25.  From the read length these were 
 109  probably originally from an early Solexa/Illumina sequencer but this file 
 110  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 111  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 112  "fastq" as the format name: 
 113   
 114  >>> from Bio import SeqIO 
 115  >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 
 116  ...     print("%s %s" % (record.id, record.seq)) 
 117  EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 118  EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 119  EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 120   
 121  The qualities are held as a list of integers in each record's annotation: 
 122   
 123  >>> print(record) 
 124  ID: EAS54_6_R1_2_1_443_348 
 125  Name: EAS54_6_R1_2_1_443_348 
 126  Description: EAS54_6_R1_2_1_443_348 
 127  Number of features: 0 
 128  Per letter annotation for: phred_quality 
 129  Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 130  >>> print(record.letter_annotations["phred_quality"]) 
 131  [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 132   
 133  You can use the SeqRecord format method to show this in the QUAL format: 
 134   
 135  >>> print(record.format("qual")) 
 136  >EAS54_6_R1_2_1_443_348 
 137  26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 138  24 18 18 18 18 
 139  <BLANKLINE> 
 140   
 141  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 142   
 143  >>> print(record.format("fastq")) 
 144  @EAS54_6_R1_2_1_443_348 
 145  GTTGCTTCTGGCGTGGGTGGGGGGG 
 146  + 
 147  ;;;;;;;;;;;9;7;;.7;393333 
 148  <BLANKLINE> 
 149   
 150  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 151  of 64): 
 152   
 153  >>> print(record.format("fastq-illumina")) 
 154  @EAS54_6_R1_2_1_443_348 
 155  GTTGCTTCTGGCGTGGGTGGGGGGG 
 156  + 
 157  ZZZZZZZZZZZXZVZZMVZRXRRRR 
 158  <BLANKLINE> 
 159   
 160  You can also get Biopython to convert the scores and show a Solexa style 
 161  FASTQ file: 
 162   
 163  >>> print(record.format("fastq-solexa")) 
 164  @EAS54_6_R1_2_1_443_348 
 165  GTTGCTTCTGGCGTGGGTGGGGGGG 
 166  + 
 167  ZZZZZZZZZZZXZVZZMVZRXRRRR 
 168  <BLANKLINE> 
 169   
 170  Notice that this is actually the same output as above using "fastq-illumina" 
 171  as the format! The reason for this is all these scores are high enough that 
 172  the PHRED and Solexa scores are almost equal. The differences become apparent 
 173  for poor quality reads. See the functions solexa_quality_from_phred and 
 174  phred_quality_from_solexa for more details. 
 175   
 176  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 177  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 178   
 179  >>> sub_rec = record[5:15] 
 180  >>> print(sub_rec) 
 181  ID: EAS54_6_R1_2_1_443_348 
 182  Name: EAS54_6_R1_2_1_443_348 
 183  Description: EAS54_6_R1_2_1_443_348 
 184  Number of features: 0 
 185  Per letter annotation for: phred_quality 
 186  Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 187  >>> print(sub_rec.letter_annotations["phred_quality"]) 
 188  [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 189  >>> print(sub_rec.format("fastq")) 
 190  @EAS54_6_R1_2_1_443_348 
 191  TTCTGGCGTG 
 192  + 
 193  ;;;;;;9;7; 
 194  <BLANKLINE> 
 195   
 196  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 197   
 198  >>> from Bio import SeqIO 
 199  >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 
 200  >>> with open("Quality/temp.qual", "w") as out_handle: 
 201  ...     SeqIO.write(record_iterator, out_handle, "qual") 
 202  3 
 203   
 204  You can of course read in a QUAL file, such as the one we just created: 
 205   
 206  >>> from Bio import SeqIO 
 207  >>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 
 208  ...     print("%s %s" % (record.id, record.seq)) 
 209  EAS54_6_R1_2_1_413_324 ????????????????????????? 
 210  EAS54_6_R1_2_1_540_792 ????????????????????????? 
 211  EAS54_6_R1_2_1_443_348 ????????????????????????? 
 212   
 213  Notice that QUAL files don't have a proper sequence present!  But the quality 
 214  information is there: 
 215   
 216  >>> print(record) 
 217  ID: EAS54_6_R1_2_1_443_348 
 218  Name: EAS54_6_R1_2_1_443_348 
 219  Description: EAS54_6_R1_2_1_443_348 
 220  Number of features: 0 
 221  Per letter annotation for: phred_quality 
 222  UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?') 
 223  >>> print(record.letter_annotations["phred_quality"]) 
 224  [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 225   
 226  Just to keep things tidy, if you are following this example yourself, you can 
 227  delete this temporary file now: 
 228   
 229  >>> import os 
 230  >>> os.remove("Quality/temp.qual") 
 231   
 232  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 233  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 234  would have to read the two in separately and then combine the data.  However, 
 235  since this is such a common thing to want to do, there is a helper iterator 
 236  defined in this module that does this for you - PairedFastaQualIterator. 
 237   
 238  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 239  then a simple dictionary approach would work: 
 240   
 241  >>> from Bio import SeqIO 
 242  >>> reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta")) 
 243  >>> for rec in SeqIO.parse("Quality/example.qual", "qual"): 
 244  ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 245   
 246  You can then access any record by its key, and get both the sequence and the 
 247  quality scores. 
 248   
 249  >>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq")) 
 250  @EAS54_6_R1_2_1_540_792 
 251  TTGGCAGGCCAAGGCCGATGGATCA 
 252  + 
 253  ;;;;;;;;;;;7;;;;;-;;;3;83 
 254  <BLANKLINE> 
 255   
 256  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 257  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 258  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 259  for the more recent variant), as this cannot be detected reliably 
 260  automatically. 
 261   
 262  To illustrate this problem, let's consider an artificial example: 
 263   
 264  >>> from Bio.Seq import Seq 
 265  >>> from Bio.Alphabet import generic_dna 
 266  >>> from Bio.SeqRecord import SeqRecord 
 267  >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 268  ... description="Made up!") 
 269  >>> print(test.format("fasta")) 
 270  >Test Made up! 
 271  NACGTACGTA 
 272  <BLANKLINE> 
 273  >>> print(test.format("fastq")) 
 274  Traceback (most recent call last): 
 275   ... 
 276  ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 277   
 278  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 279  or FASTQ format we need to provide some quality scores. These are held as a 
 280  list of integers (one for each base) in the letter_annotations dictionary: 
 281   
 282  >>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 283  >>> print(test.format("qual")) 
 284  >Test Made up! 
 285  0 1 2 3 4 5 10 20 30 40 
 286  <BLANKLINE> 
 287  >>> print(test.format("fastq")) 
 288  @Test Made up! 
 289  NACGTACGTA 
 290  + 
 291  !"#$%&+5?I 
 292  <BLANKLINE> 
 293   
 294  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 295  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 296  the letter "I": 
 297   
 298  >>> ord('!') - 33 
 299  0 
 300  >>> ord('I') - 33 
 301  40 
 302  >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 303  [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 304   
 305  Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED 
 306  scores with an offset of 64: 
 307   
 308  >>> print(test.format("fastq-illumina")) 
 309  @Test Made up! 
 310  NACGTACGTA 
 311  + 
 312  @ABCDEJT^h 
 313  <BLANKLINE> 
 314   
 315  And we can check this too - the first PHRED score was zero, and this mapped to 
 316  "@", while the final score was 40 and this mapped to "h": 
 317   
 318  >>> ord("@") - 64 
 319  0 
 320  >>> ord("h") - 64 
 321  40 
 322  >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 323  [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 324   
 325  Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style 
 326  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 327  format to consider which encodes Solexa scores instead of PHRED scores. 
 328   
 329  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 330  scores (rounding to one decimal place): 
 331   
 332  >>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]: 
 333  ...     print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))) 
 334  PHRED 0 maps to Solexa -5.0 
 335  PHRED 1 maps to Solexa -5.0 
 336  PHRED 2 maps to Solexa -2.3 
 337  PHRED 3 maps to Solexa -0.0 
 338  PHRED 4 maps to Solexa 1.8 
 339  PHRED 5 maps to Solexa 3.3 
 340  PHRED 10 maps to Solexa 9.5 
 341  PHRED 20 maps to Solexa 20.0 
 342  PHRED 30 maps to Solexa 30.0 
 343  PHRED 40 maps to Solexa 40.0 
 344   
 345  Now here is the record using the old Solexa style FASTQ file: 
 346   
 347  >>> print(test.format("fastq-solexa")) 
 348  @Test Made up! 
 349  NACGTACGTA 
 350  + 
 351  ;;>@BCJT^h 
 352  <BLANKLINE> 
 353   
 354  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 355   
 356  >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 357  [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 358   
 359  This explains why the last few letters of this FASTQ output matched that using 
 360  the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores 
 361  are approximately equal. 
 362   
 363  """ 
 364  from __future__ import print_function 
 365   
 366  from Bio.Alphabet import single_letter_alphabet 
 367  from Bio.Seq import Seq, UnknownSeq 
 368  from Bio.SeqRecord import SeqRecord 
 369  from Bio.SeqIO.Interfaces import SequentialSequenceWriter 
 370  from math import log 
 371  import warnings 
 372  from Bio import BiopythonWarning, BiopythonParserWarning 
 373   
 374   
 375  # define score offsets. See discussion for differences between Sanger and 
 376  # Solexa offsets. 
 377  SANGER_SCORE_OFFSET = 33 
 378  SOLEXA_SCORE_OFFSET = 64 
 379   
 380   
381 -def solexa_quality_from_phred(phred_quality):
382 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 383 384 PHRED and Solexa quality scores are both log transformations of a 385 probality of error (high score = low probability of error). This function 386 takes a PHRED score, transforms it back to a probability of error, and 387 then re-expresses it as a Solexa score. This assumes the error estimates 388 are equivalent. 389 390 How does this work exactly? Well the PHRED quality is minus ten times the 391 base ten logarithm of the probability of error:: 392 393 phred_quality = -10*log(error,10) 394 395 Therefore, turning this round:: 396 397 error = 10 ** (- phred_quality / 10) 398 399 Now, Solexa qualities use a different log transformation:: 400 401 solexa_quality = -10*log(error/(1-error),10) 402 403 After substitution and a little manipulation we get:: 404 405 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 406 407 However, real Solexa files use a minimum quality of -5. This does have a 408 good reason - a random base call would be correct 25% of the time, 409 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 410 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 411 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 412 413 Taken literally, this logarithic formula would map a PHRED quality of zero 414 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 415 score of zero means a probability of error of one (i.e. the base call is 416 definitely wrong), which is worse than random! In practice, a PHRED quality 417 of zero usually means a default value, or perhaps random - and therefore 418 mapping it to the minimum Solexa score of -5 is reasonable. 419 420 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 421 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 422 quality of zero to -5.0 as well. 423 424 Note this function will return a floating point number, it is up to you to 425 round this to the nearest integer if appropriate. e.g. 426 427 >>> print("%0.2f" % round(solexa_quality_from_phred(80), 2)) 428 80.00 429 >>> print("%0.2f" % round(solexa_quality_from_phred(50), 2)) 430 50.00 431 >>> print("%0.2f" % round(solexa_quality_from_phred(20), 2)) 432 19.96 433 >>> print("%0.2f" % round(solexa_quality_from_phred(10), 2)) 434 9.54 435 >>> print("%0.2f" % round(solexa_quality_from_phred(5), 2)) 436 3.35 437 >>> print("%0.2f" % round(solexa_quality_from_phred(4), 2)) 438 1.80 439 >>> print("%0.2f" % round(solexa_quality_from_phred(3), 2)) 440 -0.02 441 >>> print("%0.2f" % round(solexa_quality_from_phred(2), 2)) 442 -2.33 443 >>> print("%0.2f" % round(solexa_quality_from_phred(1), 2)) 444 -5.00 445 >>> print("%0.2f" % round(solexa_quality_from_phred(0), 2)) 446 -5.00 447 448 Notice that for high quality reads PHRED and Solexa scores are numerically 449 equal. The differences are important for poor quality reads, where PHRED 450 has a minimum of zero but Solexa scores can be negative. 451 452 Finally, as a special case where None is used for a "missing value", None 453 is returned: 454 455 >>> print(solexa_quality_from_phred(None)) 456 None 457 """ 458 if phred_quality is None: 459 # Assume None is used as some kind of NULL or NA value; return None 460 # e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 461 return None 462 elif phred_quality > 0: 463 # Solexa uses a minimum value of -5, which after rounding matches a 464 # random nucleotide base call. 465 return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10)) 466 elif phred_quality == 0: 467 # Special case, map to -5 as discussed in the docstring 468 return -5.0 469 else: 470 raise ValueError("PHRED qualities must be positive (or zero), not %r" 471 % phred_quality)
472 473
474 -def phred_quality_from_solexa(solexa_quality):
475 """Convert a Solexa quality (which can be negative) to a PHRED quality. 476 477 PHRED and Solexa quality scores are both log transformations of a 478 probality of error (high score = low probability of error). This function 479 takes a Solexa score, transforms it back to a probability of error, and 480 then re-expresses it as a PHRED score. This assumes the error estimates 481 are equivalent. 482 483 The underlying formulas are given in the documentation for the sister 484 function solexa_quality_from_phred, in this case the operation is:: 485 486 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 487 488 This will return a floating point number, it is up to you to round this to 489 the nearest integer if appropriate. e.g. 490 491 >>> print("%0.2f" % round(phred_quality_from_solexa(80), 2)) 492 80.00 493 >>> print("%0.2f" % round(phred_quality_from_solexa(20), 2)) 494 20.04 495 >>> print("%0.2f" % round(phred_quality_from_solexa(10), 2)) 496 10.41 497 >>> print("%0.2f" % round(phred_quality_from_solexa(0), 2)) 498 3.01 499 >>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2)) 500 1.19 501 502 Note that a solexa_quality less then -5 is not expected, will trigger a 503 warning, but will still be converted as per the logarithmic mapping 504 (giving a number between 0 and 1.19 back). 505 506 As a special case where None is used for a "missing value", None is 507 returned: 508 509 >>> print(phred_quality_from_solexa(None)) 510 None 511 """ 512 if solexa_quality is None: 513 # Assume None is used as some kind of NULL or NA value; return None 514 return None 515 if solexa_quality < -5: 516 warnings.warn("Solexa quality less than -5 passed, %r" % solexa_quality, 517 BiopythonWarning) 518 return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
519 520
521 -def _get_phred_quality(record):
522 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 523 524 If there are no PHRED qualities, but there are Solexa qualities, those are 525 used instead after conversion. 526 """ 527 try: 528 return record.letter_annotations["phred_quality"] 529 except KeyError: 530 pass 531 try: 532 return [phred_quality_from_solexa(q) for 533 q in record.letter_annotations["solexa_quality"]] 534 except KeyError: 535 raise ValueError("No suitable quality scores found in " 536 "letter_annotations of SeqRecord (id=%s)." 537 % record.id)
538 539 # Only map 0 to 93, we need to give a warning on truncating at 93 540 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp + SANGER_SCORE_OFFSET))) 541 for qp in range(0, 93 + 1)) 542 # Only map -5 to 93, we need to give a warning on truncating at 93 543 _solexa_to_sanger_quality_str = dict( 544 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs))) + 545 SANGER_SCORE_OFFSET))) 546 for qs in range(-5, 93 + 1)) 547 548
549 -def _get_sanger_quality_str(record):
550 """Returns a Sanger FASTQ encoded quality string (PRIVATE). 551 552 >>> from Bio.Seq import Seq 553 >>> from Bio.SeqRecord import SeqRecord 554 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 555 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0]}) 556 >>> _get_sanger_quality_str(r) 557 'SI?5+!' 558 559 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 560 the PHRED qualities are integers, this function is able to use a very fast 561 pre-cached mapping. However, if they are floats which differ slightly, then 562 it has to do the appropriate rounding - which is slower: 563 564 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 565 ... letter_annotations = {"phred_quality":[50.0, 40.05, 29.99, 20, 9.55, 0.01]}) 566 >>> _get_sanger_quality_str(r2) 567 'SI?5+!' 568 569 If your scores include a None value, this raises an exception: 570 571 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 572 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, None]}) 573 >>> _get_sanger_quality_str(r3) 574 Traceback (most recent call last): 575 ... 576 TypeError: A quality value of None was found 577 578 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 579 scores are used in preference: 580 581 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 582 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0], 583 ... "solexa_quality":[-5, -4, 0, None, 0, 40]}) 584 >>> _get_sanger_quality_str(r4) 585 'SI?5+!' 586 587 If there are no PHRED scores, but there are Solexa scores, these are used 588 instead (after the appropriate conversion): 589 590 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 591 ... letter_annotations = {"solexa_quality":[40, 30, 20, 10, 0, -5]}) 592 >>> _get_sanger_quality_str(r5) 593 'I?5+$"' 594 595 Again, integer Solexa scores can be looked up in a pre-cached mapping making 596 this very fast. You can still use approximate floating point scores: 597 598 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 599 ... letter_annotations = {"solexa_quality":[40.1, 29.7, 20.01, 10, 0.0, -4.9]}) 600 >>> _get_sanger_quality_str(r6) 601 'I?5+$"' 602 603 Notice that due to the limited range of printable ASCII characters, a 604 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 605 file (using ASCII 126, the tilde). This function will issue a warning 606 in this situation. 607 """ 608 # TODO - This functions works and is fast, but it is also ugly 609 # and there is considerable repetition of code for the other 610 # two FASTQ variants. 611 try: 612 # These take priority (in case both Solexa and PHRED scores found) 613 qualities = record.letter_annotations["phred_quality"] 614 except KeyError: 615 # Fall back on solexa scores... 616 pass 617 else: 618 # Try and use the precomputed mapping: 619 try: 620 return "".join(_phred_to_sanger_quality_str[qp] 621 for qp in qualities) 622 except KeyError: 623 # Could be a float, or a None in the list, or a high value. 624 pass 625 if None in qualities: 626 raise TypeError("A quality value of None was found") 627 if max(qualities) >= 93.5: 628 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 629 BiopythonWarning) 630 # This will apply the truncation at 93, giving max ASCII 126 631 return "".join(chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) 632 for qp in qualities) 633 # Fall back on the Solexa scores... 634 try: 635 qualities = record.letter_annotations["solexa_quality"] 636 except KeyError: 637 raise ValueError("No suitable quality scores found in " 638 "letter_annotations of SeqRecord (id=%s)." 639 % record.id) 640 # Try and use the precomputed mapping: 641 try: 642 return "".join(_solexa_to_sanger_quality_str[qs] 643 for qs in qualities) 644 except KeyError: 645 # Either no PHRED scores, or something odd like a float or None 646 pass 647 if None in qualities: 648 raise TypeError("A quality value of None was found") 649 # Must do this the slow way, first converting the PHRED scores into 650 # Solexa scores: 651 if max(qualities) >= 93.5: 652 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 653 BiopythonWarning) 654 # This will apply the truncation at 93, giving max ASCII 126 655 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET)) 656 for qs in qualities)
657 658 # Only map 0 to 62, we need to give a warning on truncating at 62 659 assert 62 + SOLEXA_SCORE_OFFSET == 126 660 _phred_to_illumina_quality_str = dict((qp, chr(qp + SOLEXA_SCORE_OFFSET)) 661 for qp in range(0, 62 + 1)) 662 # Only map -5 to 62, we need to give a warning on truncating at 62 663 _solexa_to_illumina_quality_str = dict( 664 (qs, chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 665 for qs in range(-5, 62 + 1)) 666 667
668 -def _get_illumina_quality_str(record):
669 """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). 670 671 Notice that due to the limited range of printable ASCII characters, a 672 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 673 file (using ASCII 126, the tilde). This function will issue a warning 674 in this situation. 675 """ 676 # TODO - This functions works and is fast, but it is also ugly 677 # and there is considerable repetition of code for the other 678 # two FASTQ variants. 679 try: 680 # These take priority (in case both Solexa and PHRED scores found) 681 qualities = record.letter_annotations["phred_quality"] 682 except KeyError: 683 # Fall back on solexa scores... 684 pass 685 else: 686 # Try and use the precomputed mapping: 687 try: 688 return "".join(_phred_to_illumina_quality_str[qp] 689 for qp in qualities) 690 except KeyError: 691 # Could be a float, or a None in the list, or a high value. 692 pass 693 if None in qualities: 694 raise TypeError("A quality value of None was found") 695 if max(qualities) >= 62.5: 696 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 697 BiopythonWarning) 698 # This will apply the truncation at 62, giving max ASCII 126 699 return "".join(chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) 700 for qp in qualities) 701 # Fall back on the Solexa scores... 702 try: 703 qualities = record.letter_annotations["solexa_quality"] 704 except KeyError: 705 raise ValueError("No suitable quality scores found in " 706 "letter_annotations of SeqRecord (id=%s)." 707 % record.id) 708 # Try and use the precomputed mapping: 709 try: 710 return "".join(_solexa_to_illumina_quality_str[qs] 711 for qs in qualities) 712 except KeyError: 713 # Either no PHRED scores, or something odd like a float or None 714 pass 715 if None in qualities: 716 raise TypeError("A quality value of None was found") 717 # Must do this the slow way, first converting the PHRED scores into 718 # Solexa scores: 719 if max(qualities) >= 62.5: 720 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 721 BiopythonWarning) 722 # This will apply the truncation at 62, giving max ASCII 126 723 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 724 for qs in qualities)
725 726 # Only map 0 to 62, we need to give a warning on truncating at 62 727 assert 62 + SOLEXA_SCORE_OFFSET == 126 728 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs + SOLEXA_SCORE_OFFSET))) 729 for qs in range(-5, 62 + 1)) 730 # Only map -5 to 62, we need to give a warning on truncating at 62 731 _phred_to_solexa_quality_str = dict( 732 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp))) + 733 SOLEXA_SCORE_OFFSET))) 734 for qp in range(0, 62 + 1)) 735 736
737 -def _get_solexa_quality_str(record):
738 """Returns a Solexa FASTQ encoded quality string (PRIVATE). 739 740 Notice that due to the limited range of printable ASCII characters, a 741 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 742 file (using ASCII 126, the tilde). This function will issue a warning 743 in this situation. 744 """ 745 # TODO - This functions works and is fast, but it is also ugly 746 # and there is considerable repetition of code for the other 747 # two FASTQ variants. 748 try: 749 # These take priority (in case both Solexa and PHRED scores found) 750 qualities = record.letter_annotations["solexa_quality"] 751 except KeyError: 752 # Fall back on PHRED scores... 753 pass 754 else: 755 # Try and use the precomputed mapping: 756 try: 757 return "".join(_solexa_to_solexa_quality_str[qs] 758 for qs in qualities) 759 except KeyError: 760 # Could be a float, or a None in the list, or a high value. 761 pass 762 if None in qualities: 763 raise TypeError("A quality value of None was found") 764 if max(qualities) >= 62.5: 765 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 766 BiopythonWarning) 767 # This will apply the truncation at 62, giving max ASCII 126 768 return "".join(chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) 769 for qs in qualities) 770 # Fall back on the PHRED scores... 771 try: 772 qualities = record.letter_annotations["phred_quality"] 773 except KeyError: 774 raise ValueError("No suitable quality scores found in " 775 "letter_annotations of SeqRecord (id=%s)." 776 % record.id) 777 # Try and use the precomputed mapping: 778 try: 779 return "".join(_phred_to_solexa_quality_str[qp] 780 for qp in qualities) 781 except KeyError: 782 # Either no PHRED scores, or something odd like a float or None 783 # or too big to be in the cache 784 pass 785 if None in qualities: 786 raise TypeError("A quality value of None was found") 787 # Must do this the slow way, first converting the PHRED scores into 788 # Solexa scores: 789 if max(qualities) >= 62.5: 790 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 791 BiopythonWarning) 792 return "".join(chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) 793 for qp in qualities)
794 795 796 # TODO - Default to nucleotide or even DNA?
797 -def FastqGeneralIterator(handle):
798 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 799 800 This code does not try to interpret the quality string numerically. It 801 just returns tuples of the title, sequence and quality as strings. For 802 the sequence and quality, any whitespace (such as new lines) is removed. 803 804 Our SeqRecord based FASTQ iterators call this function internally, and then 805 turn the strings into a SeqRecord objects, mapping the quality string into 806 a list of numerical scores. If you want to do a custom quality mapping, 807 then you might consider calling this function directly. 808 809 For parsing FASTQ files, the title string from the "@" line at the start 810 of each record can optionally be omitted on the "+" lines. If it is 811 repeated, it must be identical. 812 813 The sequence string and the quality string can optionally be split over 814 multiple lines, although several sources discourage this. In comparison, 815 for the FASTA file format line breaks between 60 and 80 characters are 816 the norm. 817 818 **WARNING** - Because the "@" character can appear in the quality string, 819 this can cause problems as this is also the marker for the start of 820 a new sequence. In fact, the "+" sign can also appear as well. Some 821 sources recommended having no line breaks in the quality to avoid this, 822 but even that is not enough, consider this example:: 823 824 @071113_EAS56_0053:1:1:998:236 825 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 826 +071113_EAS56_0053:1:1:998:236 827 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 828 @071113_EAS56_0053:1:1:182:712 829 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 830 + 831 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 832 @071113_EAS56_0053:1:1:153:10 833 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 834 + 835 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 836 @071113_EAS56_0053:1:3:990:501 837 TGGGAGGTTTTATGTGGA 838 AAGCAGCAATGTACAAGA 839 + 840 IIIIIII.IIIIII1@44 841 @-7.%<&+/$/%4(++(% 842 843 This is four PHRED encoded FASTQ entries originally from an NCBI source 844 (given the read length of 36, these are probably Solexa Illumina reads where 845 the quality has been mapped onto the PHRED values). 846 847 This example has been edited to illustrate some of the nasty things allowed 848 in the FASTQ format. Firstly, on the "+" lines most but not all of the 849 (redundant) identifiers are omitted. In real files it is likely that all or 850 none of these extra identifiers will be present. 851 852 Secondly, while the first three sequences have been shown without line 853 breaks, the last has been split over multiple lines. In real files any line 854 breaks are likely to be consistent. 855 856 Thirdly, some of the quality string lines start with an "@" character. For 857 the second record this is unavoidable. However for the fourth sequence this 858 only happens because its quality string is split over two lines. A naive 859 parser could wrongly treat any line starting with an "@" as the beginning of 860 a new sequence! This code copes with this possible ambiguity by keeping 861 track of the length of the sequence which gives the expected length of the 862 quality string. 863 864 Using this tricky example file as input, this short bit of code demonstrates 865 what this parsing function would return: 866 867 >>> with open("Quality/tricky.fastq", "rU") as handle: 868 ... for (title, sequence, quality) in FastqGeneralIterator(handle): 869 ... print(title) 870 ... print("%s %s" % (sequence, quality)) 871 ... 872 071113_EAS56_0053:1:1:998:236 873 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 874 071113_EAS56_0053:1:1:182:712 875 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 876 071113_EAS56_0053:1:1:153:10 877 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 878 071113_EAS56_0053:1:3:990:501 879 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 880 881 Finally we note that some sources state that the quality string should 882 start with "!" (which using the PHRED mapping means the first letter always 883 has a quality score of zero). This rather restrictive rule is not widely 884 observed, so is therefore ignored here. One plus point about this "!" rule 885 is that (provided there are no line breaks in the quality sequence) it 886 would prevent the above problem with the "@" character. 887 """ 888 # We need to call handle.readline() at least four times per record, 889 # so we'll save a property look up each time: 890 handle_readline = handle.readline 891 892 # Skip any text before the first record (e.g. blank lines, comments?) 893 while True: 894 line = handle_readline() 895 if not line: 896 return # Premature end of file, or just empty? 897 if line[0] == "@": 898 break 899 if isinstance(line[0], int): 900 raise ValueError("Is this handle in binary mode not text mode?") 901 902 while line: 903 if line[0] != "@": 904 raise ValueError( 905 "Records in Fastq files should start with '@' character") 906 title_line = line[1:].rstrip() 907 # Will now be at least one line of quality data - in most FASTQ files 908 # just one line! We therefore use string concatenation (if needed) 909 # rather using than the "".join(...) trick just in case it is multiline: 910 seq_string = handle_readline().rstrip() 911 # There may now be more sequence lines, or the "+" quality marker line: 912 while True: 913 line = handle_readline() 914 if not line: 915 raise ValueError("End of file without quality information.") 916 if line[0] == "+": 917 # The title here is optional, but if present must match! 918 second_title = line[1:].rstrip() 919 if second_title and second_title != title_line: 920 raise ValueError("Sequence and quality captions differ.") 921 break 922 seq_string += line.rstrip() # removes trailing newlines 923 # This is going to slow things down a little, but assuming 924 # this isn't allowed we should try and catch it here: 925 if " " in seq_string or "\t" in seq_string: 926 raise ValueError("Whitespace is not allowed in the sequence.") 927 seq_len = len(seq_string) 928 929 # Will now be at least one line of quality data... 930 quality_string = handle_readline().rstrip() 931 # There may now be more quality data, or another sequence, or EOF 932 while True: 933 line = handle_readline() 934 if not line: 935 break # end of file 936 if line[0] == "@": 937 # This COULD be the start of a new sequence. However, it MAY just 938 # be a line of quality data which starts with a "@" character. We 939 # should be able to check this by looking at the sequence length 940 # and the amount of quality data found so far. 941 if len(quality_string) >= seq_len: 942 # We expect it to be equal if this is the start of a new record. 943 # If the quality data is longer, we'll raise an error below. 944 break 945 # Continue - its just some (more) quality data. 946 quality_string += line.rstrip() 947 948 if seq_len != len(quality_string): 949 raise ValueError("Lengths of sequence and quality values differs " 950 " for %s (%i and %i)." 951 % (title_line, seq_len, len(quality_string))) 952 953 # Return the record and then continue... 954 yield (title_line, seq_string, quality_string) 955 raise StopIteration
956 957
958 -def FastqPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
959 """Generator function to iterate over FASTQ records (as SeqRecord objects). 960 961 - handle - input file 962 - alphabet - optional alphabet 963 - title2ids - A function that, when given the title line from the FASTQ 964 file (without the beginning >), will return the id, name and 965 description (in that order) for the record as a tuple of 966 strings. If this is not given, then the entire title line 967 will be used as the description, and the first word as the 968 id and name. 969 970 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 971 972 For each sequence in a (Sanger style) FASTQ file there is a matching string 973 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 974 values with an offset of 33. 975 976 For example, consider a file containing three short reads:: 977 978 @EAS54_6_R1_2_1_413_324 979 CCCTTCTTGTCTTCAGCGTTTCTCC 980 + 981 ;;3;;;;;;;;;;;;7;;;;;;;88 982 @EAS54_6_R1_2_1_540_792 983 TTGGCAGGCCAAGGCCGATGGATCA 984 + 985 ;;;;;;;;;;;7;;;;;-;;;3;83 986 @EAS54_6_R1_2_1_443_348 987 GTTGCTTCTGGCGTGGGTGGGGGGG 988 + 989 ;;;;;;;;;;;9;7;;.7;393333 990 991 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 992 string encoding the PHRED qualities using a ASCII values with an offset of 993 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 994 995 Using this module directly you might run: 996 997 >>> with open("Quality/example.fastq", "rU") as handle: 998 ... for record in FastqPhredIterator(handle): 999 ... print("%s %s" % (record.id, record.seq)) 1000 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1001 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1002 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1003 1004 Typically however, you would call this via Bio.SeqIO instead with "fastq" 1005 (or "fastq-sanger") as the format: 1006 1007 >>> from Bio import SeqIO 1008 >>> with open("Quality/example.fastq", "rU") as handle: 1009 ... for record in SeqIO.parse(handle, "fastq"): 1010 ... print("%s %s" % (record.id, record.seq)) 1011 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1012 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1013 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1014 1015 If you want to look at the qualities, they are record in each record's 1016 per-letter-annotation dictionary as a simple list of integers: 1017 1018 >>> print(record.letter_annotations["phred_quality"]) 1019 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1020 1021 """ 1022 assert SANGER_SCORE_OFFSET == ord("!") 1023 # Originally, I used a list expression for each record: 1024 # 1025 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1026 # 1027 # Precomputing is faster, perhaps partly by avoiding the subtractions. 1028 q_mapping = dict() 1029 for letter in range(0, 255): 1030 q_mapping[chr(letter)] = letter - SANGER_SCORE_OFFSET 1031 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1032 if title2ids: 1033 id, name, descr = title2ids(title_line) 1034 else: 1035 descr = title_line 1036 id = descr.split()[0] 1037 name = id 1038 record = SeqRecord(Seq(seq_string, alphabet), 1039 id=id, name=name, description=descr) 1040 qualities = [q_mapping[letter] for letter in quality_string] 1041 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1042 raise ValueError("Invalid character in quality string") 1043 # For speed, will now use a dirty trick to speed up assigning the 1044 # qualities. We do this to bypass the length check imposed by the 1045 # per-letter-annotations restricted dict (as this has already been 1046 # checked by FastqGeneralIterator). This is equivalent to: 1047 # record.letter_annotations["phred_quality"] = qualities 1048 dict.__setitem__(record._per_letter_annotations, 1049 "phred_quality", qualities) 1050 yield record
1051 1052
1053 -def FastqSolexaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1054 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1055 1056 The optional arguments are the same as those for the FastqPhredIterator. 1057 1058 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1059 encoding the Solexa integer qualities using ASCII values with an offset 1060 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1061 will NOT perform any automatic conversion when loading. 1062 1063 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1064 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1065 1066 For example, consider a file containing these five records:: 1067 1068 @SLXA-B3_649_FC8437_R1_1_1_610_79 1069 GATGTGCAATACCTTTGTAGAGGAA 1070 +SLXA-B3_649_FC8437_R1_1_1_610_79 1071 YYYYYYYYYYYYYYYYYYWYWYYSU 1072 @SLXA-B3_649_FC8437_R1_1_1_397_389 1073 GGTTTGAGAAAGAGAAATGAGATAA 1074 +SLXA-B3_649_FC8437_R1_1_1_397_389 1075 YYYYYYYYYWYYYYWWYYYWYWYWW 1076 @SLXA-B3_649_FC8437_R1_1_1_850_123 1077 GAGGGTGTTGATCATGATGATGGCG 1078 +SLXA-B3_649_FC8437_R1_1_1_850_123 1079 YYYYYYYYYYYYYWYYWYYSYYYSY 1080 @SLXA-B3_649_FC8437_R1_1_1_362_549 1081 GGAAACAAAGTTTTTCTCAACATAG 1082 +SLXA-B3_649_FC8437_R1_1_1_362_549 1083 YYYYYYYYYYYYYYYYYYWWWWYWY 1084 @SLXA-B3_649_FC8437_R1_1_1_183_714 1085 GTATTATTTAATGGCATACACTCAA 1086 +SLXA-B3_649_FC8437_R1_1_1_183_714 1087 YYYYYYYYYYWYYYYWYWWUWWWQQ 1088 1089 Using this module directly you might run: 1090 1091 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1092 ... for record in FastqSolexaIterator(handle): 1093 ... print("%s %s" % (record.id, record.seq)) 1094 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1095 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1096 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1097 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1098 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1099 1100 Typically however, you would call this via Bio.SeqIO instead with 1101 "fastq-solexa" as the format: 1102 1103 >>> from Bio import SeqIO 1104 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1105 ... for record in SeqIO.parse(handle, "fastq-solexa"): 1106 ... print("%s %s" % (record.id, record.seq)) 1107 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1108 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1109 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1110 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1111 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1112 1113 If you want to look at the qualities, they are recorded in each record's 1114 per-letter-annotation dictionary as a simple list of integers: 1115 1116 >>> print(record.letter_annotations["solexa_quality"]) 1117 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1118 1119 These scores aren't very good, but they are high enough that they map 1120 almost exactly onto PHRED scores: 1121 1122 >>> print("%0.2f" % phred_quality_from_solexa(25)) 1123 25.01 1124 1125 Let's look at faked example read which is even worse, where there are 1126 more noticeable differences between the Solexa and PHRED scores:: 1127 1128 @slxa_0001_1_0001_01 1129 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1130 +slxa_0001_1_0001_01 1131 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1132 1133 Again, you would typically use Bio.SeqIO to read this file in (rather than 1134 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1135 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1136 as shown above. This example has only as one entry, so instead we can 1137 use the Bio.SeqIO.read() function: 1138 1139 >>> from Bio import SeqIO 1140 >>> with open("Quality/solexa_faked.fastq", "rU") as handle: 1141 ... record = SeqIO.read(handle, "fastq-solexa") 1142 >>> print("%s %s" % (record.id, record.seq)) 1143 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1144 >>> print(record.letter_annotations["solexa_quality"]) 1145 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1146 1147 These quality scores are so low that when converted from the Solexa scheme 1148 into PHRED scores they look quite different: 1149 1150 >>> print("%0.2f" % phred_quality_from_solexa(-1)) 1151 2.54 1152 >>> print("%0.2f" % phred_quality_from_solexa(-5)) 1153 1.19 1154 1155 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1156 method to output the record(s): 1157 1158 >>> print(record.format("fastq-solexa")) 1159 @slxa_0001_1_0001_01 1160 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1161 + 1162 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1163 <BLANKLINE> 1164 1165 Note this output is slightly different from the input file as Biopython 1166 has left out the optional repetition of the sequence identifier on the "+" 1167 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1168 output format instead, and Biopython will do the conversion for you: 1169 1170 >>> print(record.format("fastq")) 1171 @slxa_0001_1_0001_01 1172 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1173 + 1174 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1175 <BLANKLINE> 1176 1177 >>> print(record.format("qual")) 1178 >slxa_0001_1_0001_01 1179 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1180 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1181 1 1 1182 <BLANKLINE> 1183 1184 As shown above, the poor quality Solexa reads have been mapped to the 1185 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1186 """ 1187 q_mapping = dict() 1188 for letter in range(0, 255): 1189 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1190 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1191 if title2ids: 1192 id, name, descr = title_line 1193 else: 1194 descr = title_line 1195 id = descr.split()[0] 1196 name = id 1197 record = SeqRecord(Seq(seq_string, alphabet), 1198 id=id, name=name, description=descr) 1199 qualities = [q_mapping[letter] for letter in quality_string] 1200 # DO NOT convert these into PHRED qualities automatically! 1201 if qualities and (min(qualities) < -5 or max(qualities) > 62): 1202 raise ValueError("Invalid character in quality string") 1203 # Dirty trick to speed up this line: 1204 # record.letter_annotations["solexa_quality"] = qualities 1205 dict.__setitem__(record._per_letter_annotations, 1206 "solexa_quality", qualities) 1207 yield record
1208 1209
1210 -def FastqIlluminaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1211 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). 1212 1213 The optional arguments are the same as those for the FastqPhredIterator. 1214 1215 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1216 encoding PHRED integer qualities using ASCII values with an offset of 64. 1217 1218 >>> from Bio import SeqIO 1219 >>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina") 1220 >>> print("%s %s" % (record.id, record.seq)) 1221 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1222 >>> max(record.letter_annotations["phred_quality"]) 1223 40 1224 >>> min(record.letter_annotations["phred_quality"]) 1225 0 1226 1227 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1228 with an ASCII offset of 64. They are approximately equal but only for high 1229 quality reads. If you have an old Solexa/Illumina file with negative 1230 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1231 1232 >>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina") 1233 Traceback (most recent call last): 1234 ... 1235 ValueError: Invalid character in quality string 1236 1237 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1238 """ 1239 q_mapping = dict() 1240 for letter in range(0, 255): 1241 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1242 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1243 if title2ids: 1244 id, name, descr = title2ids(title_line) 1245 else: 1246 descr = title_line 1247 id = descr.split()[0] 1248 name = id 1249 record = SeqRecord(Seq(seq_string, alphabet), 1250 id=id, name=name, description=descr) 1251 qualities = [q_mapping[letter] for letter in quality_string] 1252 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1253 raise ValueError("Invalid character in quality string") 1254 # Dirty trick to speed up this line: 1255 # record.letter_annotations["phred_quality"] = qualities 1256 dict.__setitem__(record._per_letter_annotations, 1257 "phred_quality", qualities) 1258 yield record
1259 1260
1261 -def QualPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1262 """For QUAL files which include PHRED quality scores, but no sequence. 1263 1264 For example, consider this short QUAL file:: 1265 1266 >EAS54_6_R1_2_1_413_324 1267 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1268 26 26 26 23 23 1269 >EAS54_6_R1_2_1_540_792 1270 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1271 26 18 26 23 18 1272 >EAS54_6_R1_2_1_443_348 1273 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1274 24 18 18 18 18 1275 1276 Using this module directly you might run: 1277 1278 >>> with open("Quality/example.qual", "rU") as handle: 1279 ... for record in QualPhredIterator(handle): 1280 ... print("%s %s" % (record.id, record.seq)) 1281 EAS54_6_R1_2_1_413_324 ????????????????????????? 1282 EAS54_6_R1_2_1_540_792 ????????????????????????? 1283 EAS54_6_R1_2_1_443_348 ????????????????????????? 1284 1285 Typically however, you would call this via Bio.SeqIO instead with "qual" 1286 as the format: 1287 1288 >>> from Bio import SeqIO 1289 >>> with open("Quality/example.qual", "rU") as handle: 1290 ... for record in SeqIO.parse(handle, "qual"): 1291 ... print("%s %s" % (record.id, record.seq)) 1292 EAS54_6_R1_2_1_413_324 ????????????????????????? 1293 EAS54_6_R1_2_1_540_792 ????????????????????????? 1294 EAS54_6_R1_2_1_443_348 ????????????????????????? 1295 1296 Becase QUAL files don't contain the sequence string itself, the seq 1297 property is set to an UnknownSeq object. As no alphabet was given, this 1298 has defaulted to a generic single letter alphabet and the character "?" 1299 used. 1300 1301 By specifying a nucleotide alphabet, "N" is used instead: 1302 1303 >>> from Bio import SeqIO 1304 >>> from Bio.Alphabet import generic_dna 1305 >>> with open("Quality/example.qual", "rU") as handle: 1306 ... for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1307 ... print("%s %s" % (record.id, record.seq)) 1308 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1309 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1310 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1311 1312 However, the quality scores themselves are available as a list of integers 1313 in each record's per-letter-annotation: 1314 1315 >>> print(record.letter_annotations["phred_quality"]) 1316 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1317 1318 You can still slice one of these SeqRecord objects with an UnknownSeq: 1319 1320 >>> sub_record = record[5:10] 1321 >>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"])) 1322 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1323 1324 As of Biopython 1.59, this parser will accept files with negatives quality 1325 scores but will replace them with the lowest possible PHRED score of zero. 1326 This will trigger a warning, previously it raised a ValueError exception. 1327 """ 1328 # Skip any text before the first record (e.g. blank lines, comments) 1329 while True: 1330 line = handle.readline() 1331 if line == "": 1332 return # Premature end of file, or just empty? 1333 if line[0] == ">": 1334 break 1335 1336 while True: 1337 if line[0] != ">": 1338 raise ValueError( 1339 "Records in Fasta files should start with '>' character") 1340 if title2ids: 1341 id, name, descr = title2ids(line[1:].rstrip()) 1342 else: 1343 descr = line[1:].rstrip() 1344 id = descr.split()[0] 1345 name = id 1346 1347 qualities = [] 1348 line = handle.readline() 1349 while True: 1350 if not line: 1351 break 1352 if line[0] == ">": 1353 break 1354 qualities.extend(int(word) for word in line.split()) 1355 line = handle.readline() 1356 1357 if qualities and min(qualities) < 0: 1358 warnings.warn(("Negative quality score %i found, " + 1359 "substituting PHRED zero instead.") 1360 % min(qualities), BiopythonParserWarning) 1361 qualities = [max(0, q) for q in qualities] 1362 1363 # Return the record and then continue... 1364 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1365 id=id, name=name, description=descr) 1366 # Dirty trick to speed up this line: 1367 # record.letter_annotations["phred_quality"] = qualities 1368 dict.__setitem__(record._per_letter_annotations, 1369 "phred_quality", qualities) 1370 yield record 1371 1372 if not line: 1373 return # StopIteration 1374 assert False, "Should not reach this line"
1375 1376
1377 -class FastqPhredWriter(SequentialSequenceWriter):
1378 """Class to write standard FASTQ format files (using PHRED quality scores). 1379 1380 Although you can use this class directly, you are strongly encouraged 1381 to use the Bio.SeqIO.write() function instead via the format name "fastq" 1382 or the alias "fastq-sanger". For example, this code reads in a standard 1383 Sanger style FASTQ file (using PHRED scores) and re-saves it as another 1384 Sanger style FASTQ file: 1385 1386 >>> from Bio import SeqIO 1387 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1388 >>> with open("Quality/temp.fastq", "w") as out_handle: 1389 ... SeqIO.write(record_iterator, out_handle, "fastq") 1390 3 1391 1392 You might want to do this if the original file included extra line breaks, 1393 which while valid may not be supported by all tools. The output file from 1394 Biopython will have each sequence on a single line, and each quality 1395 string on a single line (which is considered desirable for maximum 1396 compatibility). 1397 1398 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1399 quality scores) is converted into a standard Sanger style FASTQ file using 1400 PHRED qualities: 1401 1402 >>> from Bio import SeqIO 1403 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1404 >>> with open("Quality/temp.fastq", "w") as out_handle: 1405 ... SeqIO.write(record_iterator, out_handle, "fastq") 1406 5 1407 1408 This code is also called if you use the .format("fastq") method of a 1409 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1410 1411 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1412 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1413 warning is issued. 1414 1415 P.S. To avoid cluttering up your working directory, you can delete this 1416 temporary file now: 1417 1418 >>> import os 1419 >>> os.remove("Quality/temp.fastq") 1420 """ 1421 assert SANGER_SCORE_OFFSET == ord("!") 1422
1423 - def write_record(self, record):
1424 """Write a single FASTQ record to the file.""" 1425 assert self._header_written 1426 assert not self._footer_written 1427 self._record_written = True 1428 # TODO - Is an empty sequence allowed in FASTQ format? 1429 if record.seq is None: 1430 raise ValueError("No sequence for record %s" % record.id) 1431 seq_str = str(record.seq) 1432 qualities_str = _get_sanger_quality_str(record) 1433 if len(qualities_str) != len(seq_str): 1434 raise ValueError("Record %s has sequence length %i but %i quality scores" 1435 % (record.id, len(seq_str), len(qualities_str))) 1436 1437 # FASTQ files can include a description, just like FASTA files 1438 # (at least, this is what the NCBI Short Read Archive does) 1439 id = self.clean(record.id) 1440 description = self.clean(record.description) 1441 if description and description.split(None, 1)[0] == id: 1442 # The description includes the id at the start 1443 title = description 1444 elif description: 1445 title = "%s %s" % (id, description) 1446 else: 1447 title = id 1448 1449 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1450 1451
1452 -class QualPhredWriter(SequentialSequenceWriter):
1453 """Class to write QUAL format files (using PHRED quality scores). 1454 1455 Although you can use this class directly, you are strongly encouraged 1456 to use the Bio.SeqIO.write() function instead. For example, this code 1457 reads in a FASTQ file and saves the quality scores into a QUAL file: 1458 1459 >>> from Bio import SeqIO 1460 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1461 >>> with open("Quality/temp.qual", "w") as out_handle: 1462 ... SeqIO.write(record_iterator, out_handle, "qual") 1463 3 1464 1465 This code is also called if you use the .format("qual") method of a 1466 SeqRecord. 1467 1468 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1469 1470 >>> import os 1471 >>> os.remove("Quality/temp.qual") 1472 """
1473 - def __init__(self, handle, wrap=60, record2title=None):
1474 """Create a QUAL writer. 1475 1476 Arguments: 1477 - handle - Handle to an output file, e.g. as returned 1478 by open(filename, "w") 1479 - wrap - Optional line length used to wrap sequence lines. 1480 Defaults to wrapping the sequence at 60 characters 1481 Use zero (or None) for no wrapping, giving a single 1482 long line for the sequence. 1483 - record2title - Optional function to return the text to be 1484 used for the title line of each record. By default 1485 a combination of the record.id and record.description 1486 is used. If the record.description starts with the 1487 record.id, then just the record.description is used. 1488 1489 The record2title argument is present for consistency with the 1490 Bio.SeqIO.FastaIO writer class. 1491 """ 1492 SequentialSequenceWriter.__init__(self, handle) 1493 # self.handle = handle 1494 self.wrap = None 1495 if wrap: 1496 if wrap < 1: 1497 raise ValueError 1498 self.wrap = wrap 1499 self.record2title = record2title
1500
1501 - def write_record(self, record):
1502 """Write a single QUAL record to the file.""" 1503 assert self._header_written 1504 assert not self._footer_written 1505 self._record_written = True 1506 1507 handle = self.handle 1508 wrap = self.wrap 1509 1510 if self.record2title: 1511 title = self.clean(self.record2title(record)) 1512 else: 1513 id = self.clean(record.id) 1514 description = self.clean(record.description) 1515 if description and description.split(None, 1)[0] == id: 1516 # The description includes the id at the start 1517 title = description 1518 elif description: 1519 title = "%s %s" % (id, description) 1520 else: 1521 title = id 1522 handle.write(">%s\n" % title) 1523 1524 qualities = _get_phred_quality(record) 1525 try: 1526 # This rounds to the nearest integer. 1527 # TODO - can we record a float in a qual file? 1528 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1529 except TypeError as e: 1530 if None in qualities: 1531 raise TypeError("A quality value of None was found") 1532 else: 1533 raise e 1534 1535 if wrap > 5: 1536 # Fast wrapping 1537 data = " ".join(qualities_strs) 1538 while True: 1539 if len(data) <= wrap: 1540 self.handle.write(data + "\n") 1541 break 1542 else: 1543 # By construction there must be spaces in the first X chars 1544 # (unless we have X digit or higher quality scores!) 1545 i = data.rfind(" ", 0, wrap) 1546 handle.write(data[:i] + "\n") 1547 data = data[i + 1:] 1548 elif wrap: 1549 # Safe wrapping 1550 while qualities_strs: 1551 line = qualities_strs.pop(0) 1552 while qualities_strs \ 1553 and len(line) + 1 + len(qualities_strs[0]) < wrap: 1554 line += " " + qualities_strs.pop(0) 1555 handle.write(line + "\n") 1556 else: 1557 # No wrapping 1558 data = " ".join(qualities_strs) 1559 handle.write(data + "\n")
1560 1561
1562 -class FastqSolexaWriter(SequentialSequenceWriter):
1563 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities). 1564 1565 This outputs FASTQ files like those from the early Solexa/Illumina 1566 pipeline, using Solexa scores and an ASCII offset of 64. These are 1567 NOT compatible with the standard Sanger style PHRED FASTQ files. 1568 1569 If your records contain a "solexa_quality" entry under letter_annotations, 1570 this is used, otherwise any "phred_quality" entry will be used after 1571 conversion using the solexa_quality_from_phred function. If neither style 1572 of quality scores are present, an exception is raised. 1573 1574 Although you can use this class directly, you are strongly encouraged 1575 to use the Bio.SeqIO.write() function instead. For example, this code 1576 reads in a FASTQ file and re-saves it as another FASTQ file: 1577 1578 >>> from Bio import SeqIO 1579 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1580 >>> with open("Quality/temp.fastq", "w") as out_handle: 1581 ... SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1582 5 1583 1584 You might want to do this if the original file included extra line breaks, 1585 which (while valid) may not be supported by all tools. The output file 1586 from Biopython will have each sequence on a single line, and each quality 1587 string on a single line (which is considered desirable for maximum 1588 compatibility). 1589 1590 This code is also called if you use the .format("fastq-solexa") method of 1591 a SeqRecord. For example, 1592 1593 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1594 >>> print(record.format("fastq-solexa")) 1595 @Test PHRED qualities from 40 to 0 inclusive 1596 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1597 + 1598 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1599 <BLANKLINE> 1600 1601 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1602 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1603 a warning is issued. 1604 1605 P.S. Don't forget to delete the temp file if you don't need it anymore: 1606 1607 >>> import os 1608 >>> os.remove("Quality/temp.fastq") 1609 """
1610 - def write_record(self, record):
1611 """Write a single FASTQ record to the file.""" 1612 assert self._header_written 1613 assert not self._footer_written 1614 self._record_written = True 1615 1616 # TODO - Is an empty sequence allowed in FASTQ format? 1617 if record.seq is None: 1618 raise ValueError("No sequence for record %s" % record.id) 1619 seq_str = str(record.seq) 1620 qualities_str = _get_solexa_quality_str(record) 1621 if len(qualities_str) != len(seq_str): 1622 raise ValueError("Record %s has sequence length %i but %i quality scores" 1623 % (record.id, len(seq_str), len(qualities_str))) 1624 1625 # FASTQ files can include a description, just like FASTA files 1626 # (at least, this is what the NCBI Short Read Archive does) 1627 id = self.clean(record.id) 1628 description = self.clean(record.description) 1629 if description and description.split(None, 1)[0] == id: 1630 # The description includes the id at the start 1631 title = description 1632 elif description: 1633 title = "%s %s" % (id, description) 1634 else: 1635 title = id 1636 1637 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1638 1639
1640 -class FastqIlluminaWriter(SequentialSequenceWriter):
1641 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores). 1642 1643 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1644 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1645 compatible with the standard Sanger style PHRED FASTQ files which use an 1646 ASCII offset of 32. 1647 1648 Although you can use this class directly, you are strongly encouraged to 1649 use the Bio.SeqIO.write() function with format name "fastq-illumina" 1650 instead. This code is also called if you use the .format("fastq-illumina") 1651 method of a SeqRecord. For example, 1652 1653 >>> from Bio import SeqIO 1654 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1655 >>> print(record.format("fastq-illumina")) 1656 @Test PHRED qualities from 40 to 0 inclusive 1657 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1658 + 1659 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1660 <BLANKLINE> 1661 1662 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1663 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1664 warning is issued. 1665 """
1666 - def write_record(self, record):
1667 """Write a single FASTQ record to the file.""" 1668 assert self._header_written 1669 assert not self._footer_written 1670 self._record_written = True 1671 1672 # TODO - Is an empty sequence allowed in FASTQ format? 1673 if record.seq is None: 1674 raise ValueError("No sequence for record %s" % record.id) 1675 seq_str = str(record.seq) 1676 qualities_str = _get_illumina_quality_str(record) 1677 if len(qualities_str) != len(seq_str): 1678 raise ValueError("Record %s has sequence length %i but %i quality scores" 1679 % (record.id, len(seq_str), len(qualities_str))) 1680 1681 # FASTQ files can include a description, just like FASTA files 1682 # (at least, this is what the NCBI Short Read Archive does) 1683 id = self.clean(record.id) 1684 description = self.clean(record.description) 1685 if description and description.split(None, 1)[0] == id: 1686 # The description includes the id at the start 1687 title = description 1688 elif description: 1689 title = "%s %s" % (id, description) 1690 else: 1691 title = id 1692 1693 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1694 1695
1696 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=single_letter_alphabet, title2ids=None):
1697 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1698 1699 For example, consider this short QUAL file with PHRED quality scores:: 1700 1701 >EAS54_6_R1_2_1_413_324 1702 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1703 26 26 26 23 23 1704 >EAS54_6_R1_2_1_540_792 1705 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1706 26 18 26 23 18 1707 >EAS54_6_R1_2_1_443_348 1708 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1709 24 18 18 18 18 1710 1711 And a matching FASTA file:: 1712 1713 >EAS54_6_R1_2_1_413_324 1714 CCCTTCTTGTCTTCAGCGTTTCTCC 1715 >EAS54_6_R1_2_1_540_792 1716 TTGGCAGGCCAAGGCCGATGGATCA 1717 >EAS54_6_R1_2_1_443_348 1718 GTTGCTTCTGGCGTGGGTGGGGGGG 1719 1720 You can parse these separately using Bio.SeqIO with the "qual" and 1721 "fasta" formats, but then you'll get a group of SeqRecord objects with 1722 no sequence, and a matching group with the sequence but not the 1723 qualities. Because it only deals with one input file handle, Bio.SeqIO 1724 can't be used to read the two files together - but this function can! 1725 For example, 1726 1727 >>> with open("Quality/example.fasta", "rU") as f: 1728 ... with open("Quality/example.qual", "rU") as q: 1729 ... for record in PairedFastaQualIterator(f, q): 1730 ... print("%s %s" % (record.id, record.seq)) 1731 ... 1732 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1733 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1734 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1735 1736 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1737 they are in each record's per-letter-annotation dictionary as a simple 1738 list of integers: 1739 1740 >>> print(record.letter_annotations["phred_quality"]) 1741 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1742 1743 If you have access to data as a FASTQ format file, using that directly 1744 would be simpler and more straight forward. Note that you can easily use 1745 this function to convert paired FASTA and QUAL files into FASTQ files: 1746 1747 >>> from Bio import SeqIO 1748 >>> with open("Quality/example.fasta", "rU") as f: 1749 ... with open("Quality/example.qual", "rU") as q: 1750 ... SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq") 1751 ... 1752 3 1753 1754 And don't forget to clean up the temp file if you don't need it anymore: 1755 1756 >>> import os 1757 >>> os.remove("Quality/temp.fastq") 1758 """ 1759 from Bio.SeqIO.FastaIO import FastaIterator 1760 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, 1761 title2ids=title2ids) 1762 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, 1763 title2ids=title2ids) 1764 1765 # Using (Python 3 style) zip wouldn't load everything into memory, 1766 # but also would not catch any extra records found in only one file. 1767 while True: 1768 try: 1769 f_rec = next(fasta_iter) 1770 except StopIteration: 1771 f_rec = None 1772 try: 1773 q_rec = next(qual_iter) 1774 except StopIteration: 1775 q_rec = None 1776 if f_rec is None and q_rec is None: 1777 # End of both files 1778 break 1779 if f_rec is None: 1780 raise ValueError("FASTA file has more entries than the QUAL file.") 1781 if q_rec is None: 1782 raise ValueError("QUAL file has more entries than the FASTA file.") 1783 if f_rec.id != q_rec.id: 1784 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." 1785 % (f_rec.id, q_rec.id)) 1786 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1787 raise ValueError("Sequence length and number of quality scores disagree for %s" 1788 % f_rec.id) 1789 # Merge the data.... 1790 f_rec.letter_annotations[ 1791 "phred_quality"] = q_rec.letter_annotations["phred_quality"] 1792 yield f_rec
1793 # Done 1794 1795 1796 if __name__ == "__main__": 1797 from Bio._utils import run_doctest 1798 run_doctest(verbose=0) 1799