Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2015 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # This module is for reading and writing FASTQ and QUAL format files as 
   7  # SeqRecord objects, and is expected to be used via the Bio.SeqIO API. 
   8   
   9  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
  10   
  11  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  12  shown below. 
  13   
  14  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  15  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  16  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  17  are used containing the sequence and the quality information separately. 
  18   
  19  The PHRED software reads DNA sequencing trace files, calls bases, and 
  20  assigns a non-negative quality value to each called base using a logged 
  21  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  22   
  23      Pe = 1.0,         Q =  0 
  24      Pe = 0.1,         Q = 10 
  25      Pe = 0.01,        Q = 20 
  26      ... 
  27      Pe = 0.00000001,  Q = 80 
  28      Pe = 0.000000001, Q = 90 
  29   
  30  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  31  In the QUAL format these quality values are held as space separated text in 
  32  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  33  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  34  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  35  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  36  quality are then stored in pairs in a FASTA like format. 
  37   
  38  Unfortunately there is no official document describing the FASTQ file format, 
  39  and worse, several related but different variants exist. For more details, 
  40  please read this open access publication:: 
  41   
  42      The Sanger FASTQ file format for sequences with quality scores, and the 
  43      Solexa/Illumina FASTQ variants. 
  44      P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 
  45      M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 
  46      Nucleic Acids Research 2010 38(6):1767-1771 
  47      http://dx.doi.org/10.1093/nar/gkp1137 
  48   
  49  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  50  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  51  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  52  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  53  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  54  FASTA file using the -s or -seq argument instead. 
  55   
  56  The bad news is that Solexa/Illumina did things differently - they have their 
  57  own scoring system AND their own incompatible versions of the FASTQ format. 
  58  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  59  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  60  reasonable mapping can be achieved between them, and they are approximately 
  61  equal for higher quality reads). 
  62   
  63  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  64  own score mapping and an ASCII offset of 64. To make things worse, for the 
  65  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  66  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  67  with an ASCII offset of 64. 
  68   
  69  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  70  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  71   
  72  The good news is that as of CASAVA version 1.8, Illumina sequencers will 
  73  produce FASTQ files using the standard Sanger encoding. 
  74   
  75  You are expected to use this module via the Bio.SeqIO functions, with the 
  76  following format names: 
  77   
  78      - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  79      - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  80        offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). 
  81        These can potentially hold PHRED scores from 0 to 93. 
  82      - "fastq-sanger" is an alias for "fastq". 
  83      - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  84        files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  85        scores from -5 to 62. 
  86      - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using 
  87        PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 
  88        to 62. 
  89   
  90  We could potentially add support for "qual-solexa" meaning QUAL files which 
  91  contain Solexa scores, but thus far there isn't any reason to use such files. 
  92   
  93  For example, consider the following short FASTQ file:: 
  94   
  95      @EAS54_6_R1_2_1_413_324 
  96      CCCTTCTTGTCTTCAGCGTTTCTCC 
  97      + 
  98      ;;3;;;;;;;;;;;;7;;;;;;;88 
  99      @EAS54_6_R1_2_1_540_792 
 100      TTGGCAGGCCAAGGCCGATGGATCA 
 101      + 
 102      ;;;;;;;;;;;7;;;;;-;;;3;83 
 103      @EAS54_6_R1_2_1_443_348 
 104      GTTGCTTCTGGCGTGGGTGGGGGGG 
 105      + 
 106      ;;;;;;;;;;;9;7;;.7;393333 
 107   
 108  This contains three reads of length 25.  From the read length these were 
 109  probably originally from an early Solexa/Illumina sequencer but this file 
 110  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 111  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 112  "fastq" as the format name: 
 113   
 114  >>> from Bio import SeqIO 
 115  >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 
 116  ...     print("%s %s" % (record.id, record.seq)) 
 117  EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 118  EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 119  EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 120   
 121  The qualities are held as a list of integers in each record's annotation: 
 122   
 123  >>> print(record) 
 124  ID: EAS54_6_R1_2_1_443_348 
 125  Name: EAS54_6_R1_2_1_443_348 
 126  Description: EAS54_6_R1_2_1_443_348 
 127  Number of features: 0 
 128  Per letter annotation for: phred_quality 
 129  Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 130  >>> print(record.letter_annotations["phred_quality"]) 
 131  [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 132   
 133  You can use the SeqRecord format method to show this in the QUAL format: 
 134   
 135  >>> print(record.format("qual")) 
 136  >EAS54_6_R1_2_1_443_348 
 137  26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 138  24 18 18 18 18 
 139  <BLANKLINE> 
 140   
 141  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 142   
 143  >>> print(record.format("fastq")) 
 144  @EAS54_6_R1_2_1_443_348 
 145  GTTGCTTCTGGCGTGGGTGGGGGGG 
 146  + 
 147  ;;;;;;;;;;;9;7;;.7;393333 
 148  <BLANKLINE> 
 149   
 150  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 151  of 64): 
 152   
 153  >>> print(record.format("fastq-illumina")) 
 154  @EAS54_6_R1_2_1_443_348 
 155  GTTGCTTCTGGCGTGGGTGGGGGGG 
 156  + 
 157  ZZZZZZZZZZZXZVZZMVZRXRRRR 
 158  <BLANKLINE> 
 159   
 160  You can also get Biopython to convert the scores and show a Solexa style 
 161  FASTQ file: 
 162   
 163  >>> print(record.format("fastq-solexa")) 
 164  @EAS54_6_R1_2_1_443_348 
 165  GTTGCTTCTGGCGTGGGTGGGGGGG 
 166  + 
 167  ZZZZZZZZZZZXZVZZMVZRXRRRR 
 168  <BLANKLINE> 
 169   
 170  Notice that this is actually the same output as above using "fastq-illumina" 
 171  as the format! The reason for this is all these scores are high enough that 
 172  the PHRED and Solexa scores are almost equal. The differences become apparent 
 173  for poor quality reads. See the functions solexa_quality_from_phred and 
 174  phred_quality_from_solexa for more details. 
 175   
 176  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 177  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 178   
 179  >>> sub_rec = record[5:15] 
 180  >>> print(sub_rec) 
 181  ID: EAS54_6_R1_2_1_443_348 
 182  Name: EAS54_6_R1_2_1_443_348 
 183  Description: EAS54_6_R1_2_1_443_348 
 184  Number of features: 0 
 185  Per letter annotation for: phred_quality 
 186  Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 187  >>> print(sub_rec.letter_annotations["phred_quality"]) 
 188  [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 189  >>> print(sub_rec.format("fastq")) 
 190  @EAS54_6_R1_2_1_443_348 
 191  TTCTGGCGTG 
 192  + 
 193  ;;;;;;9;7; 
 194  <BLANKLINE> 
 195   
 196  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 197   
 198  >>> from Bio import SeqIO 
 199  >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 
 200  >>> with open("Quality/temp.qual", "w") as out_handle: 
 201  ...     SeqIO.write(record_iterator, out_handle, "qual") 
 202  3 
 203   
 204  You can of course read in a QUAL file, such as the one we just created: 
 205   
 206  >>> from Bio import SeqIO 
 207  >>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 
 208  ...     print("%s %s" % (record.id, record.seq)) 
 209  EAS54_6_R1_2_1_413_324 ????????????????????????? 
 210  EAS54_6_R1_2_1_540_792 ????????????????????????? 
 211  EAS54_6_R1_2_1_443_348 ????????????????????????? 
 212   
 213  Notice that QUAL files don't have a proper sequence present!  But the quality 
 214  information is there: 
 215   
 216  >>> print(record) 
 217  ID: EAS54_6_R1_2_1_443_348 
 218  Name: EAS54_6_R1_2_1_443_348 
 219  Description: EAS54_6_R1_2_1_443_348 
 220  Number of features: 0 
 221  Per letter annotation for: phred_quality 
 222  UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?') 
 223  >>> print(record.letter_annotations["phred_quality"]) 
 224  [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 225   
 226  Just to keep things tidy, if you are following this example yourself, you can 
 227  delete this temporary file now: 
 228   
 229  >>> import os 
 230  >>> os.remove("Quality/temp.qual") 
 231   
 232  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 233  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 234  would have to read the two in separately and then combine the data.  However, 
 235  since this is such a common thing to want to do, there is a helper iterator 
 236  defined in this module that does this for you - PairedFastaQualIterator. 
 237   
 238  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 239  then a simple dictionary approach would work: 
 240   
 241  >>> from Bio import SeqIO 
 242  >>> reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta")) 
 243  >>> for rec in SeqIO.parse("Quality/example.qual", "qual"): 
 244  ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 245   
 246  You can then access any record by its key, and get both the sequence and the 
 247  quality scores. 
 248   
 249  >>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq")) 
 250  @EAS54_6_R1_2_1_540_792 
 251  TTGGCAGGCCAAGGCCGATGGATCA 
 252  + 
 253  ;;;;;;;;;;;7;;;;;-;;;3;83 
 254  <BLANKLINE> 
 255   
 256  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 257  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 258  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 259  for the more recent variant), as this cannot be detected reliably 
 260  automatically. 
 261   
 262  To illustrate this problem, let's consider an artificial example: 
 263   
 264  >>> from Bio.Seq import Seq 
 265  >>> from Bio.Alphabet import generic_dna 
 266  >>> from Bio.SeqRecord import SeqRecord 
 267  >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 268  ... description="Made up!") 
 269  >>> print(test.format("fasta")) 
 270  >Test Made up! 
 271  NACGTACGTA 
 272  <BLANKLINE> 
 273  >>> print(test.format("fastq")) 
 274  Traceback (most recent call last): 
 275   ... 
 276  ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 277   
 278  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 279  or FASTQ format we need to provide some quality scores. These are held as a 
 280  list of integers (one for each base) in the letter_annotations dictionary: 
 281   
 282  >>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 283  >>> print(test.format("qual")) 
 284  >Test Made up! 
 285  0 1 2 3 4 5 10 20 30 40 
 286  <BLANKLINE> 
 287  >>> print(test.format("fastq")) 
 288  @Test Made up! 
 289  NACGTACGTA 
 290  + 
 291  !"#$%&+5?I 
 292  <BLANKLINE> 
 293   
 294  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 295  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 296  the letter "I": 
 297   
 298  >>> ord('!') - 33 
 299  0 
 300  >>> ord('I') - 33 
 301  40 
 302  >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 303  [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 304   
 305  Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED 
 306  scores with an offset of 64: 
 307   
 308  >>> print(test.format("fastq-illumina")) 
 309  @Test Made up! 
 310  NACGTACGTA 
 311  + 
 312  @ABCDEJT^h 
 313  <BLANKLINE> 
 314   
 315  And we can check this too - the first PHRED score was zero, and this mapped to 
 316  "@", while the final score was 40 and this mapped to "h": 
 317   
 318  >>> ord("@") - 64 
 319  0 
 320  >>> ord("h") - 64 
 321  40 
 322  >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 323  [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 324   
 325  Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style 
 326  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 327  format to consider which encodes Solexa scores instead of PHRED scores. 
 328   
 329  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 330  scores (rounding to one decimal place): 
 331   
 332  >>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]: 
 333  ...     print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))) 
 334  PHRED 0 maps to Solexa -5.0 
 335  PHRED 1 maps to Solexa -5.0 
 336  PHRED 2 maps to Solexa -2.3 
 337  PHRED 3 maps to Solexa -0.0 
 338  PHRED 4 maps to Solexa 1.8 
 339  PHRED 5 maps to Solexa 3.3 
 340  PHRED 10 maps to Solexa 9.5 
 341  PHRED 20 maps to Solexa 20.0 
 342  PHRED 30 maps to Solexa 30.0 
 343  PHRED 40 maps to Solexa 40.0 
 344   
 345  Now here is the record using the old Solexa style FASTQ file: 
 346   
 347  >>> print(test.format("fastq-solexa")) 
 348  @Test Made up! 
 349  NACGTACGTA 
 350  + 
 351  ;;>@BCJT^h 
 352  <BLANKLINE> 
 353   
 354  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 355   
 356  >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 357  [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 358   
 359  This explains why the last few letters of this FASTQ output matched that using 
 360  the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores 
 361  are approximately equal. 
 362   
 363  """ 
 364  from __future__ import print_function 
 365   
 366  from Bio.Alphabet import single_letter_alphabet 
 367  from Bio.Seq import Seq, UnknownSeq 
 368  from Bio.SeqRecord import SeqRecord 
 369  from Bio.SeqIO.Interfaces import SequentialSequenceWriter 
 370  from math import log 
 371  import warnings 
 372  from Bio import BiopythonWarning, BiopythonParserWarning 
 373   
 374   
 375  # define score offsets. See discussion for differences between Sanger and 
 376  # Solexa offsets. 
 377  SANGER_SCORE_OFFSET = 33 
 378  SOLEXA_SCORE_OFFSET = 64 
 379   
 380   
381 -def solexa_quality_from_phred(phred_quality):
382 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 383 384 PHRED and Solexa quality scores are both log transformations of a 385 probality of error (high score = low probability of error). This function 386 takes a PHRED score, transforms it back to a probability of error, and 387 then re-expresses it as a Solexa score. This assumes the error estimates 388 are equivalent. 389 390 How does this work exactly? Well the PHRED quality is minus ten times the 391 base ten logarithm of the probability of error:: 392 393 phred_quality = -10*log(error,10) 394 395 Therefore, turning this round:: 396 397 error = 10 ** (- phred_quality / 10) 398 399 Now, Solexa qualities use a different log transformation:: 400 401 solexa_quality = -10*log(error/(1-error),10) 402 403 After substitution and a little manipulation we get:: 404 405 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 406 407 However, real Solexa files use a minimum quality of -5. This does have a 408 good reason - a random base call would be correct 25% of the time, 409 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 410 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 411 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 412 413 Taken literally, this logarithic formula would map a PHRED quality of zero 414 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 415 score of zero means a probability of error of one (i.e. the base call is 416 definitely wrong), which is worse than random! In practice, a PHRED quality 417 of zero usually means a default value, or perhaps random - and therefore 418 mapping it to the minimum Solexa score of -5 is reasonable. 419 420 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 421 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 422 quality of zero to -5.0 as well. 423 424 Note this function will return a floating point number, it is up to you to 425 round this to the nearest integer if appropriate. e.g. 426 427 >>> print("%0.2f" % round(solexa_quality_from_phred(80), 2)) 428 80.00 429 >>> print("%0.2f" % round(solexa_quality_from_phred(50), 2)) 430 50.00 431 >>> print("%0.2f" % round(solexa_quality_from_phred(20), 2)) 432 19.96 433 >>> print("%0.2f" % round(solexa_quality_from_phred(10), 2)) 434 9.54 435 >>> print("%0.2f" % round(solexa_quality_from_phred(5), 2)) 436 3.35 437 >>> print("%0.2f" % round(solexa_quality_from_phred(4), 2)) 438 1.80 439 >>> print("%0.2f" % round(solexa_quality_from_phred(3), 2)) 440 -0.02 441 >>> print("%0.2f" % round(solexa_quality_from_phred(2), 2)) 442 -2.33 443 >>> print("%0.2f" % round(solexa_quality_from_phred(1), 2)) 444 -5.00 445 >>> print("%0.2f" % round(solexa_quality_from_phred(0), 2)) 446 -5.00 447 448 Notice that for high quality reads PHRED and Solexa scores are numerically 449 equal. The differences are important for poor quality reads, where PHRED 450 has a minimum of zero but Solexa scores can be negative. 451 452 Finally, as a special case where None is used for a "missing value", None 453 is returned: 454 455 >>> print(solexa_quality_from_phred(None)) 456 None 457 """ 458 if phred_quality is None: 459 # Assume None is used as some kind of NULL or NA value; return None 460 # e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 461 return None 462 elif phred_quality > 0: 463 # Solexa uses a minimum value of -5, which after rounding matches a 464 # random nucleotide base call. 465 return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10)) 466 elif phred_quality == 0: 467 # Special case, map to -5 as discussed in the docstring 468 return -5.0 469 else: 470 raise ValueError("PHRED qualities must be positive (or zero), not %r" 471 % phred_quality)
472 473
474 -def phred_quality_from_solexa(solexa_quality):
475 """Convert a Solexa quality (which can be negative) to a PHRED quality. 476 477 PHRED and Solexa quality scores are both log transformations of a 478 probality of error (high score = low probability of error). This function 479 takes a Solexa score, transforms it back to a probability of error, and 480 then re-expresses it as a PHRED score. This assumes the error estimates 481 are equivalent. 482 483 The underlying formulas are given in the documentation for the sister 484 function solexa_quality_from_phred, in this case the operation is:: 485 486 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 487 488 This will return a floating point number, it is up to you to round this to 489 the nearest integer if appropriate. e.g. 490 491 >>> print("%0.2f" % round(phred_quality_from_solexa(80), 2)) 492 80.00 493 >>> print("%0.2f" % round(phred_quality_from_solexa(20), 2)) 494 20.04 495 >>> print("%0.2f" % round(phred_quality_from_solexa(10), 2)) 496 10.41 497 >>> print("%0.2f" % round(phred_quality_from_solexa(0), 2)) 498 3.01 499 >>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2)) 500 1.19 501 502 Note that a solexa_quality less then -5 is not expected, will trigger a 503 warning, but will still be converted as per the logarithmic mapping 504 (giving a number between 0 and 1.19 back). 505 506 As a special case where None is used for a "missing value", None is 507 returned: 508 509 >>> print(phred_quality_from_solexa(None)) 510 None 511 """ 512 if solexa_quality is None: 513 # Assume None is used as some kind of NULL or NA value; return None 514 return None 515 if solexa_quality < -5: 516 warnings.warn("Solexa quality less than -5 passed, %r" % solexa_quality, 517 BiopythonWarning) 518 return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
519 520
521 -def _get_phred_quality(record):
522 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 523 524 If there are no PHRED qualities, but there are Solexa qualities, those are 525 used instead after conversion. 526 """ 527 try: 528 return record.letter_annotations["phred_quality"] 529 except KeyError: 530 pass 531 try: 532 return [phred_quality_from_solexa(q) for 533 q in record.letter_annotations["solexa_quality"]] 534 except KeyError: 535 raise ValueError("No suitable quality scores found in " 536 "letter_annotations of SeqRecord (id=%s)." 537 % record.id)
538 539 540 # Only map 0 to 93, we need to give a warning on truncating at 93 541 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp + SANGER_SCORE_OFFSET))) 542 for qp in range(0, 93 + 1)) 543 # Only map -5 to 93, we need to give a warning on truncating at 93 544 _solexa_to_sanger_quality_str = dict( 545 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs))) + 546 SANGER_SCORE_OFFSET))) 547 for qs in range(-5, 93 + 1)) 548 549
550 -def _get_sanger_quality_str(record):
551 """Returns a Sanger FASTQ encoded quality string (PRIVATE). 552 553 >>> from Bio.Seq import Seq 554 >>> from Bio.SeqRecord import SeqRecord 555 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 556 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0]}) 557 >>> _get_sanger_quality_str(r) 558 'SI?5+!' 559 560 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 561 the PHRED qualities are integers, this function is able to use a very fast 562 pre-cached mapping. However, if they are floats which differ slightly, then 563 it has to do the appropriate rounding - which is slower: 564 565 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 566 ... letter_annotations = {"phred_quality":[50.0, 40.05, 29.99, 20, 9.55, 0.01]}) 567 >>> _get_sanger_quality_str(r2) 568 'SI?5+!' 569 570 If your scores include a None value, this raises an exception: 571 572 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 573 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, None]}) 574 >>> _get_sanger_quality_str(r3) 575 Traceback (most recent call last): 576 ... 577 TypeError: A quality value of None was found 578 579 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 580 scores are used in preference: 581 582 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 583 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0], 584 ... "solexa_quality":[-5, -4, 0, None, 0, 40]}) 585 >>> _get_sanger_quality_str(r4) 586 'SI?5+!' 587 588 If there are no PHRED scores, but there are Solexa scores, these are used 589 instead (after the appropriate conversion): 590 591 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 592 ... letter_annotations = {"solexa_quality":[40, 30, 20, 10, 0, -5]}) 593 >>> _get_sanger_quality_str(r5) 594 'I?5+$"' 595 596 Again, integer Solexa scores can be looked up in a pre-cached mapping making 597 this very fast. You can still use approximate floating point scores: 598 599 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 600 ... letter_annotations = {"solexa_quality":[40.1, 29.7, 20.01, 10, 0.0, -4.9]}) 601 >>> _get_sanger_quality_str(r6) 602 'I?5+$"' 603 604 Notice that due to the limited range of printable ASCII characters, a 605 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 606 file (using ASCII 126, the tilde). This function will issue a warning 607 in this situation. 608 """ 609 # TODO - This functions works and is fast, but it is also ugly 610 # and there is considerable repetition of code for the other 611 # two FASTQ variants. 612 try: 613 # These take priority (in case both Solexa and PHRED scores found) 614 qualities = record.letter_annotations["phred_quality"] 615 except KeyError: 616 # Fall back on solexa scores... 617 pass 618 else: 619 # Try and use the precomputed mapping: 620 try: 621 return "".join(_phred_to_sanger_quality_str[qp] 622 for qp in qualities) 623 except KeyError: 624 # Could be a float, or a None in the list, or a high value. 625 pass 626 if None in qualities: 627 raise TypeError("A quality value of None was found") 628 if max(qualities) >= 93.5: 629 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 630 BiopythonWarning) 631 # This will apply the truncation at 93, giving max ASCII 126 632 return "".join(chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) 633 for qp in qualities) 634 # Fall back on the Solexa scores... 635 try: 636 qualities = record.letter_annotations["solexa_quality"] 637 except KeyError: 638 raise ValueError("No suitable quality scores found in " 639 "letter_annotations of SeqRecord (id=%s)." 640 % record.id) 641 # Try and use the precomputed mapping: 642 try: 643 return "".join(_solexa_to_sanger_quality_str[qs] 644 for qs in qualities) 645 except KeyError: 646 # Either no PHRED scores, or something odd like a float or None 647 pass 648 if None in qualities: 649 raise TypeError("A quality value of None was found") 650 # Must do this the slow way, first converting the PHRED scores into 651 # Solexa scores: 652 if max(qualities) >= 93.5: 653 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 654 BiopythonWarning) 655 # This will apply the truncation at 93, giving max ASCII 126 656 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET)) 657 for qs in qualities)
658 659 660 # Only map 0 to 62, we need to give a warning on truncating at 62 661 assert 62 + SOLEXA_SCORE_OFFSET == 126 662 _phred_to_illumina_quality_str = dict((qp, chr(qp + SOLEXA_SCORE_OFFSET)) 663 for qp in range(0, 62 + 1)) 664 # Only map -5 to 62, we need to give a warning on truncating at 62 665 _solexa_to_illumina_quality_str = dict( 666 (qs, chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 667 for qs in range(-5, 62 + 1)) 668 669
670 -def _get_illumina_quality_str(record):
671 """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). 672 673 Notice that due to the limited range of printable ASCII characters, a 674 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 675 file (using ASCII 126, the tilde). This function will issue a warning 676 in this situation. 677 """ 678 # TODO - This functions works and is fast, but it is also ugly 679 # and there is considerable repetition of code for the other 680 # two FASTQ variants. 681 try: 682 # These take priority (in case both Solexa and PHRED scores found) 683 qualities = record.letter_annotations["phred_quality"] 684 except KeyError: 685 # Fall back on solexa scores... 686 pass 687 else: 688 # Try and use the precomputed mapping: 689 try: 690 return "".join(_phred_to_illumina_quality_str[qp] 691 for qp in qualities) 692 except KeyError: 693 # Could be a float, or a None in the list, or a high value. 694 pass 695 if None in qualities: 696 raise TypeError("A quality value of None was found") 697 if max(qualities) >= 62.5: 698 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 699 BiopythonWarning) 700 # This will apply the truncation at 62, giving max ASCII 126 701 return "".join(chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) 702 for qp in qualities) 703 # Fall back on the Solexa scores... 704 try: 705 qualities = record.letter_annotations["solexa_quality"] 706 except KeyError: 707 raise ValueError("No suitable quality scores found in " 708 "letter_annotations of SeqRecord (id=%s)." 709 % record.id) 710 # Try and use the precomputed mapping: 711 try: 712 return "".join(_solexa_to_illumina_quality_str[qs] 713 for qs in qualities) 714 except KeyError: 715 # Either no PHRED scores, or something odd like a float or None 716 pass 717 if None in qualities: 718 raise TypeError("A quality value of None was found") 719 # Must do this the slow way, first converting the PHRED scores into 720 # Solexa scores: 721 if max(qualities) >= 62.5: 722 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 723 BiopythonWarning) 724 # This will apply the truncation at 62, giving max ASCII 126 725 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 726 for qs in qualities)
727 728 729 # Only map 0 to 62, we need to give a warning on truncating at 62 730 assert 62 + SOLEXA_SCORE_OFFSET == 126 731 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs + SOLEXA_SCORE_OFFSET))) 732 for qs in range(-5, 62 + 1)) 733 # Only map -5 to 62, we need to give a warning on truncating at 62 734 _phred_to_solexa_quality_str = dict( 735 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp))) + 736 SOLEXA_SCORE_OFFSET))) 737 for qp in range(0, 62 + 1)) 738 739
740 -def _get_solexa_quality_str(record):
741 """Returns a Solexa FASTQ encoded quality string (PRIVATE). 742 743 Notice that due to the limited range of printable ASCII characters, a 744 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 745 file (using ASCII 126, the tilde). This function will issue a warning 746 in this situation. 747 """ 748 # TODO - This functions works and is fast, but it is also ugly 749 # and there is considerable repetition of code for the other 750 # two FASTQ variants. 751 try: 752 # These take priority (in case both Solexa and PHRED scores found) 753 qualities = record.letter_annotations["solexa_quality"] 754 except KeyError: 755 # Fall back on PHRED scores... 756 pass 757 else: 758 # Try and use the precomputed mapping: 759 try: 760 return "".join(_solexa_to_solexa_quality_str[qs] 761 for qs in qualities) 762 except KeyError: 763 # Could be a float, or a None in the list, or a high value. 764 pass 765 if None in qualities: 766 raise TypeError("A quality value of None was found") 767 if max(qualities) >= 62.5: 768 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 769 BiopythonWarning) 770 # This will apply the truncation at 62, giving max ASCII 126 771 return "".join(chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) 772 for qs in qualities) 773 # Fall back on the PHRED scores... 774 try: 775 qualities = record.letter_annotations["phred_quality"] 776 except KeyError: 777 raise ValueError("No suitable quality scores found in " 778 "letter_annotations of SeqRecord (id=%s)." 779 % record.id) 780 # Try and use the precomputed mapping: 781 try: 782 return "".join(_phred_to_solexa_quality_str[qp] 783 for qp in qualities) 784 except KeyError: 785 # Either no PHRED scores, or something odd like a float or None 786 # or too big to be in the cache 787 pass 788 if None in qualities: 789 raise TypeError("A quality value of None was found") 790 # Must do this the slow way, first converting the PHRED scores into 791 # Solexa scores: 792 if max(qualities) >= 62.5: 793 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 794 BiopythonWarning) 795 return "".join(chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) 796 for qp in qualities)
797 798 799 # TODO - Default to nucleotide or even DNA?
800 -def FastqGeneralIterator(handle):
801 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 802 803 This code does not try to interpret the quality string numerically. It 804 just returns tuples of the title, sequence and quality as strings. For 805 the sequence and quality, any whitespace (such as new lines) is removed. 806 807 Our SeqRecord based FASTQ iterators call this function internally, and then 808 turn the strings into a SeqRecord objects, mapping the quality string into 809 a list of numerical scores. If you want to do a custom quality mapping, 810 then you might consider calling this function directly. 811 812 For parsing FASTQ files, the title string from the "@" line at the start 813 of each record can optionally be omitted on the "+" lines. If it is 814 repeated, it must be identical. 815 816 The sequence string and the quality string can optionally be split over 817 multiple lines, although several sources discourage this. In comparison, 818 for the FASTA file format line breaks between 60 and 80 characters are 819 the norm. 820 821 **WARNING** - Because the "@" character can appear in the quality string, 822 this can cause problems as this is also the marker for the start of 823 a new sequence. In fact, the "+" sign can also appear as well. Some 824 sources recommended having no line breaks in the quality to avoid this, 825 but even that is not enough, consider this example:: 826 827 @071113_EAS56_0053:1:1:998:236 828 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 829 +071113_EAS56_0053:1:1:998:236 830 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 831 @071113_EAS56_0053:1:1:182:712 832 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 833 + 834 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 835 @071113_EAS56_0053:1:1:153:10 836 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 837 + 838 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 839 @071113_EAS56_0053:1:3:990:501 840 TGGGAGGTTTTATGTGGA 841 AAGCAGCAATGTACAAGA 842 + 843 IIIIIII.IIIIII1@44 844 @-7.%<&+/$/%4(++(% 845 846 This is four PHRED encoded FASTQ entries originally from an NCBI source 847 (given the read length of 36, these are probably Solexa Illumina reads where 848 the quality has been mapped onto the PHRED values). 849 850 This example has been edited to illustrate some of the nasty things allowed 851 in the FASTQ format. Firstly, on the "+" lines most but not all of the 852 (redundant) identifiers are omitted. In real files it is likely that all or 853 none of these extra identifiers will be present. 854 855 Secondly, while the first three sequences have been shown without line 856 breaks, the last has been split over multiple lines. In real files any line 857 breaks are likely to be consistent. 858 859 Thirdly, some of the quality string lines start with an "@" character. For 860 the second record this is unavoidable. However for the fourth sequence this 861 only happens because its quality string is split over two lines. A naive 862 parser could wrongly treat any line starting with an "@" as the beginning of 863 a new sequence! This code copes with this possible ambiguity by keeping 864 track of the length of the sequence which gives the expected length of the 865 quality string. 866 867 Using this tricky example file as input, this short bit of code demonstrates 868 what this parsing function would return: 869 870 >>> with open("Quality/tricky.fastq", "rU") as handle: 871 ... for (title, sequence, quality) in FastqGeneralIterator(handle): 872 ... print(title) 873 ... print("%s %s" % (sequence, quality)) 874 ... 875 071113_EAS56_0053:1:1:998:236 876 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 877 071113_EAS56_0053:1:1:182:712 878 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 879 071113_EAS56_0053:1:1:153:10 880 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 881 071113_EAS56_0053:1:3:990:501 882 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 883 884 Finally we note that some sources state that the quality string should 885 start with "!" (which using the PHRED mapping means the first letter always 886 has a quality score of zero). This rather restrictive rule is not widely 887 observed, so is therefore ignored here. One plus point about this "!" rule 888 is that (provided there are no line breaks in the quality sequence) it 889 would prevent the above problem with the "@" character. 890 """ 891 # We need to call handle.readline() at least four times per record, 892 # so we'll save a property look up each time: 893 handle_readline = handle.readline 894 895 # Skip any text before the first record (e.g. blank lines, comments?) 896 while True: 897 line = handle_readline() 898 if not line: 899 return # Premature end of file, or just empty? 900 if line[0] == "@": 901 break 902 if isinstance(line[0], int): 903 raise ValueError("Is this handle in binary mode not text mode?") 904 905 while line: 906 if line[0] != "@": 907 raise ValueError( 908 "Records in Fastq files should start with '@' character") 909 title_line = line[1:].rstrip() 910 # Will now be at least one line of quality data - in most FASTQ files 911 # just one line! We therefore use string concatenation (if needed) 912 # rather using than the "".join(...) trick just in case it is multiline: 913 seq_string = handle_readline().rstrip() 914 # There may now be more sequence lines, or the "+" quality marker line: 915 while True: 916 line = handle_readline() 917 if not line: 918 raise ValueError("End of file without quality information.") 919 if line[0] == "+": 920 # The title here is optional, but if present must match! 921 second_title = line[1:].rstrip() 922 if second_title and second_title != title_line: 923 raise ValueError("Sequence and quality captions differ.") 924 break 925 seq_string += line.rstrip() # removes trailing newlines 926 # This is going to slow things down a little, but assuming 927 # this isn't allowed we should try and catch it here: 928 if " " in seq_string or "\t" in seq_string: 929 raise ValueError("Whitespace is not allowed in the sequence.") 930 seq_len = len(seq_string) 931 932 # Will now be at least one line of quality data... 933 quality_string = handle_readline().rstrip() 934 # There may now be more quality data, or another sequence, or EOF 935 while True: 936 line = handle_readline() 937 if not line: 938 break # end of file 939 if line[0] == "@": 940 # This COULD be the start of a new sequence. However, it MAY just 941 # be a line of quality data which starts with a "@" character. We 942 # should be able to check this by looking at the sequence length 943 # and the amount of quality data found so far. 944 if len(quality_string) >= seq_len: 945 # We expect it to be equal if this is the start of a new record. 946 # If the quality data is longer, we'll raise an error below. 947 break 948 # Continue - its just some (more) quality data. 949 quality_string += line.rstrip() 950 951 if seq_len != len(quality_string): 952 raise ValueError("Lengths of sequence and quality values differs " 953 " for %s (%i and %i)." 954 % (title_line, seq_len, len(quality_string))) 955 956 # Return the record and then continue... 957 yield (title_line, seq_string, quality_string)
958 959
960 -def FastqPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
961 """Generator function to iterate over FASTQ records (as SeqRecord objects). 962 963 Arguments: 964 - handle - input file 965 - alphabet - optional alphabet 966 - title2ids - A function that, when given the title line from the FASTQ 967 file (without the beginning >), will return the id, name and 968 description (in that order) for the record as a tuple of strings. 969 If this is not given, then the entire title line will be used as 970 the description, and the first word as the id and name. 971 972 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 973 974 For each sequence in a (Sanger style) FASTQ file there is a matching string 975 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 976 values with an offset of 33. 977 978 For example, consider a file containing three short reads:: 979 980 @EAS54_6_R1_2_1_413_324 981 CCCTTCTTGTCTTCAGCGTTTCTCC 982 + 983 ;;3;;;;;;;;;;;;7;;;;;;;88 984 @EAS54_6_R1_2_1_540_792 985 TTGGCAGGCCAAGGCCGATGGATCA 986 + 987 ;;;;;;;;;;;7;;;;;-;;;3;83 988 @EAS54_6_R1_2_1_443_348 989 GTTGCTTCTGGCGTGGGTGGGGGGG 990 + 991 ;;;;;;;;;;;9;7;;.7;393333 992 993 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 994 string encoding the PHRED qualities using a ASCII values with an offset of 995 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 996 997 Using this module directly you might run: 998 999 >>> with open("Quality/example.fastq", "rU") as handle: 1000 ... for record in FastqPhredIterator(handle): 1001 ... print("%s %s" % (record.id, record.seq)) 1002 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1003 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1004 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1005 1006 Typically however, you would call this via Bio.SeqIO instead with "fastq" 1007 (or "fastq-sanger") as the format: 1008 1009 >>> from Bio import SeqIO 1010 >>> with open("Quality/example.fastq", "rU") as handle: 1011 ... for record in SeqIO.parse(handle, "fastq"): 1012 ... print("%s %s" % (record.id, record.seq)) 1013 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1014 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1015 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1016 1017 If you want to look at the qualities, they are record in each record's 1018 per-letter-annotation dictionary as a simple list of integers: 1019 1020 >>> print(record.letter_annotations["phred_quality"]) 1021 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1022 1023 """ 1024 assert SANGER_SCORE_OFFSET == ord("!") 1025 # Originally, I used a list expression for each record: 1026 # 1027 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1028 # 1029 # Precomputing is faster, perhaps partly by avoiding the subtractions. 1030 q_mapping = dict() 1031 for letter in range(0, 255): 1032 q_mapping[chr(letter)] = letter - SANGER_SCORE_OFFSET 1033 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1034 if title2ids: 1035 id, name, descr = title2ids(title_line) 1036 else: 1037 descr = title_line 1038 id = descr.split()[0] 1039 name = id 1040 record = SeqRecord(Seq(seq_string, alphabet), 1041 id=id, name=name, description=descr) 1042 qualities = [q_mapping[letter] for letter in quality_string] 1043 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1044 raise ValueError("Invalid character in quality string") 1045 # For speed, will now use a dirty trick to speed up assigning the 1046 # qualities. We do this to bypass the length check imposed by the 1047 # per-letter-annotations restricted dict (as this has already been 1048 # checked by FastqGeneralIterator). This is equivalent to: 1049 # record.letter_annotations["phred_quality"] = qualities 1050 dict.__setitem__(record._per_letter_annotations, 1051 "phred_quality", qualities) 1052 yield record
1053 1054
1055 -def FastqSolexaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1056 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1057 1058 The optional arguments are the same as those for the FastqPhredIterator. 1059 1060 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1061 encoding the Solexa integer qualities using ASCII values with an offset 1062 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1063 will NOT perform any automatic conversion when loading. 1064 1065 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1066 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1067 1068 For example, consider a file containing these five records:: 1069 1070 @SLXA-B3_649_FC8437_R1_1_1_610_79 1071 GATGTGCAATACCTTTGTAGAGGAA 1072 +SLXA-B3_649_FC8437_R1_1_1_610_79 1073 YYYYYYYYYYYYYYYYYYWYWYYSU 1074 @SLXA-B3_649_FC8437_R1_1_1_397_389 1075 GGTTTGAGAAAGAGAAATGAGATAA 1076 +SLXA-B3_649_FC8437_R1_1_1_397_389 1077 YYYYYYYYYWYYYYWWYYYWYWYWW 1078 @SLXA-B3_649_FC8437_R1_1_1_850_123 1079 GAGGGTGTTGATCATGATGATGGCG 1080 +SLXA-B3_649_FC8437_R1_1_1_850_123 1081 YYYYYYYYYYYYYWYYWYYSYYYSY 1082 @SLXA-B3_649_FC8437_R1_1_1_362_549 1083 GGAAACAAAGTTTTTCTCAACATAG 1084 +SLXA-B3_649_FC8437_R1_1_1_362_549 1085 YYYYYYYYYYYYYYYYYYWWWWYWY 1086 @SLXA-B3_649_FC8437_R1_1_1_183_714 1087 GTATTATTTAATGGCATACACTCAA 1088 +SLXA-B3_649_FC8437_R1_1_1_183_714 1089 YYYYYYYYYYWYYYYWYWWUWWWQQ 1090 1091 Using this module directly you might run: 1092 1093 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1094 ... for record in FastqSolexaIterator(handle): 1095 ... print("%s %s" % (record.id, record.seq)) 1096 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1097 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1098 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1099 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1100 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1101 1102 Typically however, you would call this via Bio.SeqIO instead with 1103 "fastq-solexa" as the format: 1104 1105 >>> from Bio import SeqIO 1106 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1107 ... for record in SeqIO.parse(handle, "fastq-solexa"): 1108 ... print("%s %s" % (record.id, record.seq)) 1109 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1110 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1111 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1112 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1113 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1114 1115 If you want to look at the qualities, they are recorded in each record's 1116 per-letter-annotation dictionary as a simple list of integers: 1117 1118 >>> print(record.letter_annotations["solexa_quality"]) 1119 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1120 1121 These scores aren't very good, but they are high enough that they map 1122 almost exactly onto PHRED scores: 1123 1124 >>> print("%0.2f" % phred_quality_from_solexa(25)) 1125 25.01 1126 1127 Let's look at faked example read which is even worse, where there are 1128 more noticeable differences between the Solexa and PHRED scores:: 1129 1130 @slxa_0001_1_0001_01 1131 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1132 +slxa_0001_1_0001_01 1133 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1134 1135 Again, you would typically use Bio.SeqIO to read this file in (rather than 1136 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1137 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1138 as shown above. This example has only as one entry, so instead we can 1139 use the Bio.SeqIO.read() function: 1140 1141 >>> from Bio import SeqIO 1142 >>> with open("Quality/solexa_faked.fastq", "rU") as handle: 1143 ... record = SeqIO.read(handle, "fastq-solexa") 1144 >>> print("%s %s" % (record.id, record.seq)) 1145 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1146 >>> print(record.letter_annotations["solexa_quality"]) 1147 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1148 1149 These quality scores are so low that when converted from the Solexa scheme 1150 into PHRED scores they look quite different: 1151 1152 >>> print("%0.2f" % phred_quality_from_solexa(-1)) 1153 2.54 1154 >>> print("%0.2f" % phred_quality_from_solexa(-5)) 1155 1.19 1156 1157 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1158 method to output the record(s): 1159 1160 >>> print(record.format("fastq-solexa")) 1161 @slxa_0001_1_0001_01 1162 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1163 + 1164 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1165 <BLANKLINE> 1166 1167 Note this output is slightly different from the input file as Biopython 1168 has left out the optional repetition of the sequence identifier on the "+" 1169 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1170 output format instead, and Biopython will do the conversion for you: 1171 1172 >>> print(record.format("fastq")) 1173 @slxa_0001_1_0001_01 1174 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1175 + 1176 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1177 <BLANKLINE> 1178 1179 >>> print(record.format("qual")) 1180 >slxa_0001_1_0001_01 1181 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1182 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1183 1 1 1184 <BLANKLINE> 1185 1186 As shown above, the poor quality Solexa reads have been mapped to the 1187 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1188 """ 1189 q_mapping = dict() 1190 for letter in range(0, 255): 1191 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1192 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1193 if title2ids: 1194 id, name, descr = title_line 1195 else: 1196 descr = title_line 1197 id = descr.split()[0] 1198 name = id 1199 record = SeqRecord(Seq(seq_string, alphabet), 1200 id=id, name=name, description=descr) 1201 qualities = [q_mapping[letter] for letter in quality_string] 1202 # DO NOT convert these into PHRED qualities automatically! 1203 if qualities and (min(qualities) < -5 or max(qualities) > 62): 1204 raise ValueError("Invalid character in quality string") 1205 # Dirty trick to speed up this line: 1206 # record.letter_annotations["solexa_quality"] = qualities 1207 dict.__setitem__(record._per_letter_annotations, 1208 "solexa_quality", qualities) 1209 yield record
1210 1211
1212 -def FastqIlluminaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1213 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). 1214 1215 The optional arguments are the same as those for the FastqPhredIterator. 1216 1217 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1218 encoding PHRED integer qualities using ASCII values with an offset of 64. 1219 1220 >>> from Bio import SeqIO 1221 >>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina") 1222 >>> print("%s %s" % (record.id, record.seq)) 1223 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1224 >>> max(record.letter_annotations["phred_quality"]) 1225 40 1226 >>> min(record.letter_annotations["phred_quality"]) 1227 0 1228 1229 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1230 with an ASCII offset of 64. They are approximately equal but only for high 1231 quality reads. If you have an old Solexa/Illumina file with negative 1232 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1233 1234 >>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina") 1235 Traceback (most recent call last): 1236 ... 1237 ValueError: Invalid character in quality string 1238 1239 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1240 """ 1241 q_mapping = dict() 1242 for letter in range(0, 255): 1243 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1244 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1245 if title2ids: 1246 id, name, descr = title2ids(title_line) 1247 else: 1248 descr = title_line 1249 id = descr.split()[0] 1250 name = id 1251 record = SeqRecord(Seq(seq_string, alphabet), 1252 id=id, name=name, description=descr) 1253 qualities = [q_mapping[letter] for letter in quality_string] 1254 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1255 raise ValueError("Invalid character in quality string") 1256 # Dirty trick to speed up this line: 1257 # record.letter_annotations["phred_quality"] = qualities 1258 dict.__setitem__(record._per_letter_annotations, 1259 "phred_quality", qualities) 1260 yield record
1261 1262
1263 -def QualPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1264 """For QUAL files which include PHRED quality scores, but no sequence. 1265 1266 For example, consider this short QUAL file:: 1267 1268 >EAS54_6_R1_2_1_413_324 1269 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1270 26 26 26 23 23 1271 >EAS54_6_R1_2_1_540_792 1272 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1273 26 18 26 23 18 1274 >EAS54_6_R1_2_1_443_348 1275 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1276 24 18 18 18 18 1277 1278 Using this module directly you might run: 1279 1280 >>> with open("Quality/example.qual", "rU") as handle: 1281 ... for record in QualPhredIterator(handle): 1282 ... print("%s %s" % (record.id, record.seq)) 1283 EAS54_6_R1_2_1_413_324 ????????????????????????? 1284 EAS54_6_R1_2_1_540_792 ????????????????????????? 1285 EAS54_6_R1_2_1_443_348 ????????????????????????? 1286 1287 Typically however, you would call this via Bio.SeqIO instead with "qual" 1288 as the format: 1289 1290 >>> from Bio import SeqIO 1291 >>> with open("Quality/example.qual", "rU") as handle: 1292 ... for record in SeqIO.parse(handle, "qual"): 1293 ... print("%s %s" % (record.id, record.seq)) 1294 EAS54_6_R1_2_1_413_324 ????????????????????????? 1295 EAS54_6_R1_2_1_540_792 ????????????????????????? 1296 EAS54_6_R1_2_1_443_348 ????????????????????????? 1297 1298 Becase QUAL files don't contain the sequence string itself, the seq 1299 property is set to an UnknownSeq object. As no alphabet was given, this 1300 has defaulted to a generic single letter alphabet and the character "?" 1301 used. 1302 1303 By specifying a nucleotide alphabet, "N" is used instead: 1304 1305 >>> from Bio import SeqIO 1306 >>> from Bio.Alphabet import generic_dna 1307 >>> with open("Quality/example.qual", "rU") as handle: 1308 ... for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1309 ... print("%s %s" % (record.id, record.seq)) 1310 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1311 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1312 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1313 1314 However, the quality scores themselves are available as a list of integers 1315 in each record's per-letter-annotation: 1316 1317 >>> print(record.letter_annotations["phred_quality"]) 1318 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1319 1320 You can still slice one of these SeqRecord objects with an UnknownSeq: 1321 1322 >>> sub_record = record[5:10] 1323 >>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"])) 1324 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1325 1326 As of Biopython 1.59, this parser will accept files with negatives quality 1327 scores but will replace them with the lowest possible PHRED score of zero. 1328 This will trigger a warning, previously it raised a ValueError exception. 1329 """ 1330 # Skip any text before the first record (e.g. blank lines, comments) 1331 while True: 1332 line = handle.readline() 1333 if line == "": 1334 return # Premature end of file, or just empty? 1335 if line[0] == ">": 1336 break 1337 1338 while True: 1339 if line[0] != ">": 1340 raise ValueError( 1341 "Records in Fasta files should start with '>' character") 1342 if title2ids: 1343 id, name, descr = title2ids(line[1:].rstrip()) 1344 else: 1345 descr = line[1:].rstrip() 1346 id = descr.split()[0] 1347 name = id 1348 1349 qualities = [] 1350 line = handle.readline() 1351 while True: 1352 if not line: 1353 break 1354 if line[0] == ">": 1355 break 1356 qualities.extend(int(word) for word in line.split()) 1357 line = handle.readline() 1358 1359 if qualities and min(qualities) < 0: 1360 warnings.warn(("Negative quality score %i found, " + 1361 "substituting PHRED zero instead.") 1362 % min(qualities), BiopythonParserWarning) 1363 qualities = [max(0, q) for q in qualities] 1364 1365 # Return the record and then continue... 1366 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1367 id=id, name=name, description=descr) 1368 # Dirty trick to speed up this line: 1369 # record.letter_annotations["phred_quality"] = qualities 1370 dict.__setitem__(record._per_letter_annotations, 1371 "phred_quality", qualities) 1372 yield record 1373 1374 if not line: 1375 return # StopIteration 1376 assert False, "Should not reach this line"
1377 1378
1379 -class FastqPhredWriter(SequentialSequenceWriter):
1380 """Class to write standard FASTQ format files (using PHRED quality scores). 1381 1382 Although you can use this class directly, you are strongly encouraged 1383 to use the Bio.SeqIO.write() function instead via the format name "fastq" 1384 or the alias "fastq-sanger". For example, this code reads in a standard 1385 Sanger style FASTQ file (using PHRED scores) and re-saves it as another 1386 Sanger style FASTQ file: 1387 1388 >>> from Bio import SeqIO 1389 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1390 >>> with open("Quality/temp.fastq", "w") as out_handle: 1391 ... SeqIO.write(record_iterator, out_handle, "fastq") 1392 3 1393 1394 You might want to do this if the original file included extra line breaks, 1395 which while valid may not be supported by all tools. The output file from 1396 Biopython will have each sequence on a single line, and each quality 1397 string on a single line (which is considered desirable for maximum 1398 compatibility). 1399 1400 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1401 quality scores) is converted into a standard Sanger style FASTQ file using 1402 PHRED qualities: 1403 1404 >>> from Bio import SeqIO 1405 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1406 >>> with open("Quality/temp.fastq", "w") as out_handle: 1407 ... SeqIO.write(record_iterator, out_handle, "fastq") 1408 5 1409 1410 This code is also called if you use the .format("fastq") method of a 1411 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1412 1413 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1414 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1415 warning is issued. 1416 1417 P.S. To avoid cluttering up your working directory, you can delete this 1418 temporary file now: 1419 1420 >>> import os 1421 >>> os.remove("Quality/temp.fastq") 1422 """ 1423 1424 assert SANGER_SCORE_OFFSET == ord("!") 1425
1426 - def write_record(self, record):
1427 """Write a single FASTQ record to the file.""" 1428 assert self._header_written 1429 assert not self._footer_written 1430 self._record_written = True 1431 # TODO - Is an empty sequence allowed in FASTQ format? 1432 if record.seq is None: 1433 raise ValueError("No sequence for record %s" % record.id) 1434 seq_str = str(record.seq) 1435 qualities_str = _get_sanger_quality_str(record) 1436 if len(qualities_str) != len(seq_str): 1437 raise ValueError("Record %s has sequence length %i but %i quality scores" 1438 % (record.id, len(seq_str), len(qualities_str))) 1439 1440 # FASTQ files can include a description, just like FASTA files 1441 # (at least, this is what the NCBI Short Read Archive does) 1442 id = self.clean(record.id) 1443 description = self.clean(record.description) 1444 if description and description.split(None, 1)[0] == id: 1445 # The description includes the id at the start 1446 title = description 1447 elif description: 1448 title = "%s %s" % (id, description) 1449 else: 1450 title = id 1451 1452 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1453 1454
1455 -class QualPhredWriter(SequentialSequenceWriter):
1456 """Class to write QUAL format files (using PHRED quality scores). 1457 1458 Although you can use this class directly, you are strongly encouraged 1459 to use the Bio.SeqIO.write() function instead. For example, this code 1460 reads in a FASTQ file and saves the quality scores into a QUAL file: 1461 1462 >>> from Bio import SeqIO 1463 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1464 >>> with open("Quality/temp.qual", "w") as out_handle: 1465 ... SeqIO.write(record_iterator, out_handle, "qual") 1466 3 1467 1468 This code is also called if you use the .format("qual") method of a 1469 SeqRecord. 1470 1471 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1472 1473 >>> import os 1474 >>> os.remove("Quality/temp.qual") 1475 """ 1476
1477 - def __init__(self, handle, wrap=60, record2title=None):
1478 """Create a QUAL writer. 1479 1480 Arguments: 1481 - handle - Handle to an output file, e.g. as returned 1482 by open(filename, "w") 1483 - wrap - Optional line length used to wrap sequence lines. 1484 Defaults to wrapping the sequence at 60 characters. Use 1485 zero (or None) for no wrapping, giving a single long line 1486 for the sequence. 1487 - record2title - Optional function to return the text to be 1488 used for the title line of each record. By default a 1489 combination of the record.id and record.description is 1490 used. If the record.description starts with the record.id, 1491 then just the record.description is used. 1492 1493 The record2title argument is present for consistency with the 1494 Bio.SeqIO.FastaIO writer class. 1495 """ 1496 SequentialSequenceWriter.__init__(self, handle) 1497 # self.handle = handle 1498 self.wrap = None 1499 if wrap: 1500 if wrap < 1: 1501 raise ValueError 1502 self.wrap = wrap 1503 self.record2title = record2title
1504
1505 - def write_record(self, record):
1506 """Write a single QUAL record to the file.""" 1507 assert self._header_written 1508 assert not self._footer_written 1509 self._record_written = True 1510 1511 handle = self.handle 1512 wrap = self.wrap 1513 1514 if self.record2title: 1515 title = self.clean(self.record2title(record)) 1516 else: 1517 id = self.clean(record.id) 1518 description = self.clean(record.description) 1519 if description and description.split(None, 1)[0] == id: 1520 # The description includes the id at the start 1521 title = description 1522 elif description: 1523 title = "%s %s" % (id, description) 1524 else: 1525 title = id 1526 handle.write(">%s\n" % title) 1527 1528 qualities = _get_phred_quality(record) 1529 try: 1530 # This rounds to the nearest integer. 1531 # TODO - can we record a float in a qual file? 1532 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1533 except TypeError as e: 1534 if None in qualities: 1535 raise TypeError("A quality value of None was found") 1536 else: 1537 raise e 1538 1539 if wrap > 5: 1540 # Fast wrapping 1541 data = " ".join(qualities_strs) 1542 while True: 1543 if len(data) <= wrap: 1544 self.handle.write(data + "\n") 1545 break 1546 else: 1547 # By construction there must be spaces in the first X chars 1548 # (unless we have X digit or higher quality scores!) 1549 i = data.rfind(" ", 0, wrap) 1550 handle.write(data[:i] + "\n") 1551 data = data[i + 1:] 1552 elif wrap: 1553 # Safe wrapping 1554 while qualities_strs: 1555 line = qualities_strs.pop(0) 1556 while qualities_strs \ 1557 and len(line) + 1 + len(qualities_strs[0]) < wrap: 1558 line += " " + qualities_strs.pop(0) 1559 handle.write(line + "\n") 1560 else: 1561 # No wrapping 1562 data = " ".join(qualities_strs) 1563 handle.write(data + "\n")
1564 1565
1566 -class FastqSolexaWriter(SequentialSequenceWriter):
1567 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities). 1568 1569 This outputs FASTQ files like those from the early Solexa/Illumina 1570 pipeline, using Solexa scores and an ASCII offset of 64. These are 1571 NOT compatible with the standard Sanger style PHRED FASTQ files. 1572 1573 If your records contain a "solexa_quality" entry under letter_annotations, 1574 this is used, otherwise any "phred_quality" entry will be used after 1575 conversion using the solexa_quality_from_phred function. If neither style 1576 of quality scores are present, an exception is raised. 1577 1578 Although you can use this class directly, you are strongly encouraged 1579 to use the Bio.SeqIO.write() function instead. For example, this code 1580 reads in a FASTQ file and re-saves it as another FASTQ file: 1581 1582 >>> from Bio import SeqIO 1583 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1584 >>> with open("Quality/temp.fastq", "w") as out_handle: 1585 ... SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1586 5 1587 1588 You might want to do this if the original file included extra line breaks, 1589 which (while valid) may not be supported by all tools. The output file 1590 from Biopython will have each sequence on a single line, and each quality 1591 string on a single line (which is considered desirable for maximum 1592 compatibility). 1593 1594 This code is also called if you use the .format("fastq-solexa") method of 1595 a SeqRecord. For example, 1596 1597 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1598 >>> print(record.format("fastq-solexa")) 1599 @Test PHRED qualities from 40 to 0 inclusive 1600 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1601 + 1602 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1603 <BLANKLINE> 1604 1605 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1606 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1607 a warning is issued. 1608 1609 P.S. Don't forget to delete the temp file if you don't need it anymore: 1610 1611 >>> import os 1612 >>> os.remove("Quality/temp.fastq") 1613 """ 1614
1615 - def write_record(self, record):
1616 """Write a single FASTQ record to the file.""" 1617 assert self._header_written 1618 assert not self._footer_written 1619 self._record_written = True 1620 1621 # TODO - Is an empty sequence allowed in FASTQ format? 1622 if record.seq is None: 1623 raise ValueError("No sequence for record %s" % record.id) 1624 seq_str = str(record.seq) 1625 qualities_str = _get_solexa_quality_str(record) 1626 if len(qualities_str) != len(seq_str): 1627 raise ValueError("Record %s has sequence length %i but %i quality scores" 1628 % (record.id, len(seq_str), len(qualities_str))) 1629 1630 # FASTQ files can include a description, just like FASTA files 1631 # (at least, this is what the NCBI Short Read Archive does) 1632 id = self.clean(record.id) 1633 description = self.clean(record.description) 1634 if description and description.split(None, 1)[0] == id: 1635 # The description includes the id at the start 1636 title = description 1637 elif description: 1638 title = "%s %s" % (id, description) 1639 else: 1640 title = id 1641 1642 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1643 1644
1645 -class FastqIlluminaWriter(SequentialSequenceWriter):
1646 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores). 1647 1648 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1649 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1650 compatible with the standard Sanger style PHRED FASTQ files which use an 1651 ASCII offset of 32. 1652 1653 Although you can use this class directly, you are strongly encouraged to 1654 use the Bio.SeqIO.write() function with format name "fastq-illumina" 1655 instead. This code is also called if you use the .format("fastq-illumina") 1656 method of a SeqRecord. For example, 1657 1658 >>> from Bio import SeqIO 1659 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1660 >>> print(record.format("fastq-illumina")) 1661 @Test PHRED qualities from 40 to 0 inclusive 1662 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1663 + 1664 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1665 <BLANKLINE> 1666 1667 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1668 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1669 warning is issued. 1670 """ 1671
1672 - def write_record(self, record):
1673 """Write a single FASTQ record to the file.""" 1674 assert self._header_written 1675 assert not self._footer_written 1676 self._record_written = True 1677 1678 # TODO - Is an empty sequence allowed in FASTQ format? 1679 if record.seq is None: 1680 raise ValueError("No sequence for record %s" % record.id) 1681 seq_str = str(record.seq) 1682 qualities_str = _get_illumina_quality_str(record) 1683 if len(qualities_str) != len(seq_str): 1684 raise ValueError("Record %s has sequence length %i but %i quality scores" 1685 % (record.id, len(seq_str), len(qualities_str))) 1686 1687 # FASTQ files can include a description, just like FASTA files 1688 # (at least, this is what the NCBI Short Read Archive does) 1689 id = self.clean(record.id) 1690 description = self.clean(record.description) 1691 if description and description.split(None, 1)[0] == id: 1692 # The description includes the id at the start 1693 title = description 1694 elif description: 1695 title = "%s %s" % (id, description) 1696 else: 1697 title = id 1698 1699 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1700 1701
1702 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=single_letter_alphabet, title2ids=None):
1703 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1704 1705 For example, consider this short QUAL file with PHRED quality scores:: 1706 1707 >EAS54_6_R1_2_1_413_324 1708 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1709 26 26 26 23 23 1710 >EAS54_6_R1_2_1_540_792 1711 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1712 26 18 26 23 18 1713 >EAS54_6_R1_2_1_443_348 1714 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1715 24 18 18 18 18 1716 1717 And a matching FASTA file:: 1718 1719 >EAS54_6_R1_2_1_413_324 1720 CCCTTCTTGTCTTCAGCGTTTCTCC 1721 >EAS54_6_R1_2_1_540_792 1722 TTGGCAGGCCAAGGCCGATGGATCA 1723 >EAS54_6_R1_2_1_443_348 1724 GTTGCTTCTGGCGTGGGTGGGGGGG 1725 1726 You can parse these separately using Bio.SeqIO with the "qual" and 1727 "fasta" formats, but then you'll get a group of SeqRecord objects with 1728 no sequence, and a matching group with the sequence but not the 1729 qualities. Because it only deals with one input file handle, Bio.SeqIO 1730 can't be used to read the two files together - but this function can! 1731 For example, 1732 1733 >>> with open("Quality/example.fasta", "rU") as f: 1734 ... with open("Quality/example.qual", "rU") as q: 1735 ... for record in PairedFastaQualIterator(f, q): 1736 ... print("%s %s" % (record.id, record.seq)) 1737 ... 1738 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1739 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1740 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1741 1742 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1743 they are in each record's per-letter-annotation dictionary as a simple 1744 list of integers: 1745 1746 >>> print(record.letter_annotations["phred_quality"]) 1747 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1748 1749 If you have access to data as a FASTQ format file, using that directly 1750 would be simpler and more straight forward. Note that you can easily use 1751 this function to convert paired FASTA and QUAL files into FASTQ files: 1752 1753 >>> from Bio import SeqIO 1754 >>> with open("Quality/example.fasta", "rU") as f: 1755 ... with open("Quality/example.qual", "rU") as q: 1756 ... SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq") 1757 ... 1758 3 1759 1760 And don't forget to clean up the temp file if you don't need it anymore: 1761 1762 >>> import os 1763 >>> os.remove("Quality/temp.fastq") 1764 """ 1765 from Bio.SeqIO.FastaIO import FastaIterator 1766 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, 1767 title2ids=title2ids) 1768 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, 1769 title2ids=title2ids) 1770 1771 # Using (Python 3 style) zip wouldn't load everything into memory, 1772 # but also would not catch any extra records found in only one file. 1773 while True: 1774 try: 1775 f_rec = next(fasta_iter) 1776 except StopIteration: 1777 f_rec = None 1778 try: 1779 q_rec = next(qual_iter) 1780 except StopIteration: 1781 q_rec = None 1782 if f_rec is None and q_rec is None: 1783 # End of both files 1784 break 1785 if f_rec is None: 1786 raise ValueError("FASTA file has more entries than the QUAL file.") 1787 if q_rec is None: 1788 raise ValueError("QUAL file has more entries than the FASTA file.") 1789 if f_rec.id != q_rec.id: 1790 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." 1791 % (f_rec.id, q_rec.id)) 1792 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1793 raise ValueError("Sequence length and number of quality scores disagree for %s" 1794 % f_rec.id) 1795 # Merge the data.... 1796 f_rec.letter_annotations[ 1797 "phred_quality"] = q_rec.letter_annotations["phred_quality"] 1798 yield f_rec
1799 # Done 1800 1801 1802 if __name__ == "__main__": 1803 from Bio._utils import run_doctest 1804 run_doctest(verbose=0) 1805