Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2015 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # This module is for reading and writing FASTQ and QUAL format files as 
   7  # SeqRecord objects, and is expected to be used via the Bio.SeqIO API. 
   8   
   9  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
  10   
  11  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  12  shown below. 
  13   
  14  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  15  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  16  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  17  are used containing the sequence and the quality information separately. 
  18   
  19  The PHRED software reads DNA sequencing trace files, calls bases, and 
  20  assigns a non-negative quality value to each called base using a logged 
  21  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  22   
  23      Pe = 1.0,         Q =  0 
  24      Pe = 0.1,         Q = 10 
  25      Pe = 0.01,        Q = 20 
  26      ... 
  27      Pe = 0.00000001,  Q = 80 
  28      Pe = 0.000000001, Q = 90 
  29   
  30  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  31  In the QUAL format these quality values are held as space separated text in 
  32  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  33  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  34  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  35  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  36  quality are then stored in pairs in a FASTA like format. 
  37   
  38  Unfortunately there is no official document describing the FASTQ file format, 
  39  and worse, several related but different variants exist. For more details, 
  40  please read this open access publication:: 
  41   
  42      The Sanger FASTQ file format for sequences with quality scores, and the 
  43      Solexa/Illumina FASTQ variants. 
  44      P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 
  45      M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 
  46      Nucleic Acids Research 2010 38(6):1767-1771 
  47      http://dx.doi.org/10.1093/nar/gkp1137 
  48   
  49  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  50  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  51  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  52  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  53  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  54  FASTA file using the -s or -seq argument instead. 
  55   
  56  The bad news is that Solexa/Illumina did things differently - they have their 
  57  own scoring system AND their own incompatible versions of the FASTQ format. 
  58  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  59  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  60  reasonable mapping can be achieved between them, and they are approximately 
  61  equal for higher quality reads). 
  62   
  63  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  64  own score mapping and an ASCII offset of 64. To make things worse, for the 
  65  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  66  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  67  with an ASCII offset of 64. 
  68   
  69  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  70  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  71   
  72  The good news is that as of CASAVA version 1.8, Illumina sequencers will 
  73  produce FASTQ files using the standard Sanger encoding. 
  74   
  75  You are expected to use this module via the Bio.SeqIO functions, with the 
  76  following format names: 
  77   
  78      - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  79      - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  80        offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). 
  81        These can potentially hold PHRED scores from 0 to 93. 
  82      - "fastq-sanger" is an alias for "fastq". 
  83      - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  84        files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  85        scores from -5 to 62. 
  86      - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using 
  87        PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 
  88        to 62. 
  89   
  90  We could potentially add support for "qual-solexa" meaning QUAL files which 
  91  contain Solexa scores, but thus far there isn't any reason to use such files. 
  92   
  93  For example, consider the following short FASTQ file:: 
  94   
  95      @EAS54_6_R1_2_1_413_324 
  96      CCCTTCTTGTCTTCAGCGTTTCTCC 
  97      + 
  98      ;;3;;;;;;;;;;;;7;;;;;;;88 
  99      @EAS54_6_R1_2_1_540_792 
 100      TTGGCAGGCCAAGGCCGATGGATCA 
 101      + 
 102      ;;;;;;;;;;;7;;;;;-;;;3;83 
 103      @EAS54_6_R1_2_1_443_348 
 104      GTTGCTTCTGGCGTGGGTGGGGGGG 
 105      + 
 106      ;;;;;;;;;;;9;7;;.7;393333 
 107   
 108  This contains three reads of length 25.  From the read length these were 
 109  probably originally from an early Solexa/Illumina sequencer but this file 
 110  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 111  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 112  "fastq" as the format name: 
 113   
 114  >>> from Bio import SeqIO 
 115  >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 
 116  ...     print("%s %s" % (record.id, record.seq)) 
 117  EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 118  EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 119  EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 120   
 121  The qualities are held as a list of integers in each record's annotation: 
 122   
 123  >>> print(record) 
 124  ID: EAS54_6_R1_2_1_443_348 
 125  Name: EAS54_6_R1_2_1_443_348 
 126  Description: EAS54_6_R1_2_1_443_348 
 127  Number of features: 0 
 128  Per letter annotation for: phred_quality 
 129  Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 130  >>> print(record.letter_annotations["phred_quality"]) 
 131  [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 132   
 133  You can use the SeqRecord format method to show this in the QUAL format: 
 134   
 135  >>> print(record.format("qual")) 
 136  >EAS54_6_R1_2_1_443_348 
 137  26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 138  24 18 18 18 18 
 139  <BLANKLINE> 
 140   
 141  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 142   
 143  >>> print(record.format("fastq")) 
 144  @EAS54_6_R1_2_1_443_348 
 145  GTTGCTTCTGGCGTGGGTGGGGGGG 
 146  + 
 147  ;;;;;;;;;;;9;7;;.7;393333 
 148  <BLANKLINE> 
 149   
 150  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 151  of 64): 
 152   
 153  >>> print(record.format("fastq-illumina")) 
 154  @EAS54_6_R1_2_1_443_348 
 155  GTTGCTTCTGGCGTGGGTGGGGGGG 
 156  + 
 157  ZZZZZZZZZZZXZVZZMVZRXRRRR 
 158  <BLANKLINE> 
 159   
 160  You can also get Biopython to convert the scores and show a Solexa style 
 161  FASTQ file: 
 162   
 163  >>> print(record.format("fastq-solexa")) 
 164  @EAS54_6_R1_2_1_443_348 
 165  GTTGCTTCTGGCGTGGGTGGGGGGG 
 166  + 
 167  ZZZZZZZZZZZXZVZZMVZRXRRRR 
 168  <BLANKLINE> 
 169   
 170  Notice that this is actually the same output as above using "fastq-illumina" 
 171  as the format! The reason for this is all these scores are high enough that 
 172  the PHRED and Solexa scores are almost equal. The differences become apparent 
 173  for poor quality reads. See the functions solexa_quality_from_phred and 
 174  phred_quality_from_solexa for more details. 
 175   
 176  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 177  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 178   
 179  >>> sub_rec = record[5:15] 
 180  >>> print(sub_rec) 
 181  ID: EAS54_6_R1_2_1_443_348 
 182  Name: EAS54_6_R1_2_1_443_348 
 183  Description: EAS54_6_R1_2_1_443_348 
 184  Number of features: 0 
 185  Per letter annotation for: phred_quality 
 186  Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 187  >>> print(sub_rec.letter_annotations["phred_quality"]) 
 188  [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 189  >>> print(sub_rec.format("fastq")) 
 190  @EAS54_6_R1_2_1_443_348 
 191  TTCTGGCGTG 
 192  + 
 193  ;;;;;;9;7; 
 194  <BLANKLINE> 
 195   
 196  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 197   
 198  >>> from Bio import SeqIO 
 199  >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 
 200  >>> with open("Quality/temp.qual", "w") as out_handle: 
 201  ...     SeqIO.write(record_iterator, out_handle, "qual") 
 202  3 
 203   
 204  You can of course read in a QUAL file, such as the one we just created: 
 205   
 206  >>> from Bio import SeqIO 
 207  >>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 
 208  ...     print("%s %s" % (record.id, record.seq)) 
 209  EAS54_6_R1_2_1_413_324 ????????????????????????? 
 210  EAS54_6_R1_2_1_540_792 ????????????????????????? 
 211  EAS54_6_R1_2_1_443_348 ????????????????????????? 
 212   
 213  Notice that QUAL files don't have a proper sequence present!  But the quality 
 214  information is there: 
 215   
 216  >>> print(record) 
 217  ID: EAS54_6_R1_2_1_443_348 
 218  Name: EAS54_6_R1_2_1_443_348 
 219  Description: EAS54_6_R1_2_1_443_348 
 220  Number of features: 0 
 221  Per letter annotation for: phred_quality 
 222  UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?') 
 223  >>> print(record.letter_annotations["phred_quality"]) 
 224  [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 225   
 226  Just to keep things tidy, if you are following this example yourself, you can 
 227  delete this temporary file now: 
 228   
 229  >>> import os 
 230  >>> os.remove("Quality/temp.qual") 
 231   
 232  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 233  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 234  would have to read the two in separately and then combine the data.  However, 
 235  since this is such a common thing to want to do, there is a helper iterator 
 236  defined in this module that does this for you - PairedFastaQualIterator. 
 237   
 238  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 239  then a simple dictionary approach would work: 
 240   
 241  >>> from Bio import SeqIO 
 242  >>> reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta")) 
 243  >>> for rec in SeqIO.parse("Quality/example.qual", "qual"): 
 244  ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 245   
 246  You can then access any record by its key, and get both the sequence and the 
 247  quality scores. 
 248   
 249  >>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq")) 
 250  @EAS54_6_R1_2_1_540_792 
 251  TTGGCAGGCCAAGGCCGATGGATCA 
 252  + 
 253  ;;;;;;;;;;;7;;;;;-;;;3;83 
 254  <BLANKLINE> 
 255   
 256  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 257  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 258  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 259  for the more recent variant), as this cannot be detected reliably 
 260  automatically. 
 261   
 262  To illustrate this problem, let's consider an artificial example: 
 263   
 264  >>> from Bio.Seq import Seq 
 265  >>> from Bio.Alphabet import generic_dna 
 266  >>> from Bio.SeqRecord import SeqRecord 
 267  >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 268  ... description="Made up!") 
 269  >>> print(test.format("fasta")) 
 270  >Test Made up! 
 271  NACGTACGTA 
 272  <BLANKLINE> 
 273  >>> print(test.format("fastq")) 
 274  Traceback (most recent call last): 
 275   ... 
 276  ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 277   
 278  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 279  or FASTQ format we need to provide some quality scores. These are held as a 
 280  list of integers (one for each base) in the letter_annotations dictionary: 
 281   
 282  >>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 283  >>> print(test.format("qual")) 
 284  >Test Made up! 
 285  0 1 2 3 4 5 10 20 30 40 
 286  <BLANKLINE> 
 287  >>> print(test.format("fastq")) 
 288  @Test Made up! 
 289  NACGTACGTA 
 290  + 
 291  !"#$%&+5?I 
 292  <BLANKLINE> 
 293   
 294  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 295  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 296  the letter "I": 
 297   
 298  >>> ord('!') - 33 
 299  0 
 300  >>> ord('I') - 33 
 301  40 
 302  >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 303  [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 304   
 305  Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED 
 306  scores with an offset of 64: 
 307   
 308  >>> print(test.format("fastq-illumina")) 
 309  @Test Made up! 
 310  NACGTACGTA 
 311  + 
 312  @ABCDEJT^h 
 313  <BLANKLINE> 
 314   
 315  And we can check this too - the first PHRED score was zero, and this mapped to 
 316  "@", while the final score was 40 and this mapped to "h": 
 317   
 318  >>> ord("@") - 64 
 319  0 
 320  >>> ord("h") - 64 
 321  40 
 322  >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 323  [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 324   
 325  Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style 
 326  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 327  format to consider which encodes Solexa scores instead of PHRED scores. 
 328   
 329  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 330  scores (rounding to one decimal place): 
 331   
 332  >>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]: 
 333  ...     print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))) 
 334  PHRED 0 maps to Solexa -5.0 
 335  PHRED 1 maps to Solexa -5.0 
 336  PHRED 2 maps to Solexa -2.3 
 337  PHRED 3 maps to Solexa -0.0 
 338  PHRED 4 maps to Solexa 1.8 
 339  PHRED 5 maps to Solexa 3.3 
 340  PHRED 10 maps to Solexa 9.5 
 341  PHRED 20 maps to Solexa 20.0 
 342  PHRED 30 maps to Solexa 30.0 
 343  PHRED 40 maps to Solexa 40.0 
 344   
 345  Now here is the record using the old Solexa style FASTQ file: 
 346   
 347  >>> print(test.format("fastq-solexa")) 
 348  @Test Made up! 
 349  NACGTACGTA 
 350  + 
 351  ;;>@BCJT^h 
 352  <BLANKLINE> 
 353   
 354  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 355   
 356  >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 357  [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 358   
 359  This explains why the last few letters of this FASTQ output matched that using 
 360  the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores 
 361  are approximately equal. 
 362   
 363  """ 
 364  from __future__ import print_function 
 365   
 366  from Bio.Alphabet import single_letter_alphabet 
 367  from Bio.Seq import Seq, UnknownSeq 
 368  from Bio.SeqRecord import SeqRecord 
 369  from Bio.SeqIO.Interfaces import SequentialSequenceWriter 
 370  from math import log 
 371  import warnings 
 372  from Bio import BiopythonWarning, BiopythonParserWarning 
 373   
 374   
 375  # define score offsets. See discussion for differences between Sanger and 
 376  # Solexa offsets. 
 377  SANGER_SCORE_OFFSET = 33 
 378  SOLEXA_SCORE_OFFSET = 64 
 379   
 380   
381 -def solexa_quality_from_phred(phred_quality):
382 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 383 384 PHRED and Solexa quality scores are both log transformations of a 385 probality of error (high score = low probability of error). This function 386 takes a PHRED score, transforms it back to a probability of error, and 387 then re-expresses it as a Solexa score. This assumes the error estimates 388 are equivalent. 389 390 How does this work exactly? Well the PHRED quality is minus ten times the 391 base ten logarithm of the probability of error:: 392 393 phred_quality = -10*log(error,10) 394 395 Therefore, turning this round:: 396 397 error = 10 ** (- phred_quality / 10) 398 399 Now, Solexa qualities use a different log transformation:: 400 401 solexa_quality = -10*log(error/(1-error),10) 402 403 After substitution and a little manipulation we get:: 404 405 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 406 407 However, real Solexa files use a minimum quality of -5. This does have a 408 good reason - a random base call would be correct 25% of the time, 409 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 410 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 411 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 412 413 Taken literally, this logarithic formula would map a PHRED quality of zero 414 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 415 score of zero means a probability of error of one (i.e. the base call is 416 definitely wrong), which is worse than random! In practice, a PHRED quality 417 of zero usually means a default value, or perhaps random - and therefore 418 mapping it to the minimum Solexa score of -5 is reasonable. 419 420 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 421 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 422 quality of zero to -5.0 as well. 423 424 Note this function will return a floating point number, it is up to you to 425 round this to the nearest integer if appropriate. e.g. 426 427 >>> print("%0.2f" % round(solexa_quality_from_phred(80), 2)) 428 80.00 429 >>> print("%0.2f" % round(solexa_quality_from_phred(50), 2)) 430 50.00 431 >>> print("%0.2f" % round(solexa_quality_from_phred(20), 2)) 432 19.96 433 >>> print("%0.2f" % round(solexa_quality_from_phred(10), 2)) 434 9.54 435 >>> print("%0.2f" % round(solexa_quality_from_phred(5), 2)) 436 3.35 437 >>> print("%0.2f" % round(solexa_quality_from_phred(4), 2)) 438 1.80 439 >>> print("%0.2f" % round(solexa_quality_from_phred(3), 2)) 440 -0.02 441 >>> print("%0.2f" % round(solexa_quality_from_phred(2), 2)) 442 -2.33 443 >>> print("%0.2f" % round(solexa_quality_from_phred(1), 2)) 444 -5.00 445 >>> print("%0.2f" % round(solexa_quality_from_phred(0), 2)) 446 -5.00 447 448 Notice that for high quality reads PHRED and Solexa scores are numerically 449 equal. The differences are important for poor quality reads, where PHRED 450 has a minimum of zero but Solexa scores can be negative. 451 452 Finally, as a special case where None is used for a "missing value", None 453 is returned: 454 455 >>> print(solexa_quality_from_phred(None)) 456 None 457 """ 458 if phred_quality is None: 459 # Assume None is used as some kind of NULL or NA value; return None 460 # e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 461 return None 462 elif phred_quality > 0: 463 # Solexa uses a minimum value of -5, which after rounding matches a 464 # random nucleotide base call. 465 return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10)) 466 elif phred_quality == 0: 467 # Special case, map to -5 as discussed in the docstring 468 return -5.0 469 else: 470 raise ValueError("PHRED qualities must be positive (or zero), not %r" 471 % phred_quality)
472 473
474 -def phred_quality_from_solexa(solexa_quality):
475 """Convert a Solexa quality (which can be negative) to a PHRED quality. 476 477 PHRED and Solexa quality scores are both log transformations of a 478 probality of error (high score = low probability of error). This function 479 takes a Solexa score, transforms it back to a probability of error, and 480 then re-expresses it as a PHRED score. This assumes the error estimates 481 are equivalent. 482 483 The underlying formulas are given in the documentation for the sister 484 function solexa_quality_from_phred, in this case the operation is:: 485 486 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 487 488 This will return a floating point number, it is up to you to round this to 489 the nearest integer if appropriate. e.g. 490 491 >>> print("%0.2f" % round(phred_quality_from_solexa(80), 2)) 492 80.00 493 >>> print("%0.2f" % round(phred_quality_from_solexa(20), 2)) 494 20.04 495 >>> print("%0.2f" % round(phred_quality_from_solexa(10), 2)) 496 10.41 497 >>> print("%0.2f" % round(phred_quality_from_solexa(0), 2)) 498 3.01 499 >>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2)) 500 1.19 501 502 Note that a solexa_quality less then -5 is not expected, will trigger a 503 warning, but will still be converted as per the logarithmic mapping 504 (giving a number between 0 and 1.19 back). 505 506 As a special case where None is used for a "missing value", None is 507 returned: 508 509 >>> print(phred_quality_from_solexa(None)) 510 None 511 """ 512 if solexa_quality is None: 513 # Assume None is used as some kind of NULL or NA value; return None 514 return None 515 if solexa_quality < -5: 516 warnings.warn("Solexa quality less than -5 passed, %r" % solexa_quality, 517 BiopythonWarning) 518 return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
519 520
521 -def _get_phred_quality(record):
522 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 523 524 If there are no PHRED qualities, but there are Solexa qualities, those are 525 used instead after conversion. 526 """ 527 try: 528 return record.letter_annotations["phred_quality"] 529 except KeyError: 530 pass 531 try: 532 return [phred_quality_from_solexa(q) for 533 q in record.letter_annotations["solexa_quality"]] 534 except KeyError: 535 raise ValueError("No suitable quality scores found in " 536 "letter_annotations of SeqRecord (id=%s)." 537 % record.id)
538 539 # Only map 0 to 93, we need to give a warning on truncating at 93 540 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp + SANGER_SCORE_OFFSET))) 541 for qp in range(0, 93 + 1)) 542 # Only map -5 to 93, we need to give a warning on truncating at 93 543 _solexa_to_sanger_quality_str = dict( 544 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs))) + 545 SANGER_SCORE_OFFSET))) 546 for qs in range(-5, 93 + 1)) 547 548
549 -def _get_sanger_quality_str(record):
550 """Returns a Sanger FASTQ encoded quality string (PRIVATE). 551 552 >>> from Bio.Seq import Seq 553 >>> from Bio.SeqRecord import SeqRecord 554 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 555 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0]}) 556 >>> _get_sanger_quality_str(r) 557 'SI?5+!' 558 559 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 560 the PHRED qualities are integers, this function is able to use a very fast 561 pre-cached mapping. However, if they are floats which differ slightly, then 562 it has to do the appropriate rounding - which is slower: 563 564 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 565 ... letter_annotations = {"phred_quality":[50.0, 40.05, 29.99, 20, 9.55, 0.01]}) 566 >>> _get_sanger_quality_str(r2) 567 'SI?5+!' 568 569 If your scores include a None value, this raises an exception: 570 571 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 572 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, None]}) 573 >>> _get_sanger_quality_str(r3) 574 Traceback (most recent call last): 575 ... 576 TypeError: A quality value of None was found 577 578 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 579 scores are used in preference: 580 581 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 582 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0], 583 ... "solexa_quality":[-5, -4, 0, None, 0, 40]}) 584 >>> _get_sanger_quality_str(r4) 585 'SI?5+!' 586 587 If there are no PHRED scores, but there are Solexa scores, these are used 588 instead (after the appropriate conversion): 589 590 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 591 ... letter_annotations = {"solexa_quality":[40, 30, 20, 10, 0, -5]}) 592 >>> _get_sanger_quality_str(r5) 593 'I?5+$"' 594 595 Again, integer Solexa scores can be looked up in a pre-cached mapping making 596 this very fast. You can still use approximate floating point scores: 597 598 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 599 ... letter_annotations = {"solexa_quality":[40.1, 29.7, 20.01, 10, 0.0, -4.9]}) 600 >>> _get_sanger_quality_str(r6) 601 'I?5+$"' 602 603 Notice that due to the limited range of printable ASCII characters, a 604 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 605 file (using ASCII 126, the tilde). This function will issue a warning 606 in this situation. 607 """ 608 # TODO - This functions works and is fast, but it is also ugly 609 # and there is considerable repetition of code for the other 610 # two FASTQ variants. 611 try: 612 # These take priority (in case both Solexa and PHRED scores found) 613 qualities = record.letter_annotations["phred_quality"] 614 except KeyError: 615 # Fall back on solexa scores... 616 pass 617 else: 618 # Try and use the precomputed mapping: 619 try: 620 return "".join(_phred_to_sanger_quality_str[qp] 621 for qp in qualities) 622 except KeyError: 623 # Could be a float, or a None in the list, or a high value. 624 pass 625 if None in qualities: 626 raise TypeError("A quality value of None was found") 627 if max(qualities) >= 93.5: 628 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 629 BiopythonWarning) 630 # This will apply the truncation at 93, giving max ASCII 126 631 return "".join(chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) 632 for qp in qualities) 633 # Fall back on the Solexa scores... 634 try: 635 qualities = record.letter_annotations["solexa_quality"] 636 except KeyError: 637 raise ValueError("No suitable quality scores found in " 638 "letter_annotations of SeqRecord (id=%s)." 639 % record.id) 640 # Try and use the precomputed mapping: 641 try: 642 return "".join(_solexa_to_sanger_quality_str[qs] 643 for qs in qualities) 644 except KeyError: 645 # Either no PHRED scores, or something odd like a float or None 646 pass 647 if None in qualities: 648 raise TypeError("A quality value of None was found") 649 # Must do this the slow way, first converting the PHRED scores into 650 # Solexa scores: 651 if max(qualities) >= 93.5: 652 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 653 BiopythonWarning) 654 # This will apply the truncation at 93, giving max ASCII 126 655 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET)) 656 for qs in qualities)
657 658 # Only map 0 to 62, we need to give a warning on truncating at 62 659 assert 62 + SOLEXA_SCORE_OFFSET == 126 660 _phred_to_illumina_quality_str = dict((qp, chr(qp + SOLEXA_SCORE_OFFSET)) 661 for qp in range(0, 62 + 1)) 662 # Only map -5 to 62, we need to give a warning on truncating at 62 663 _solexa_to_illumina_quality_str = dict( 664 (qs, chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 665 for qs in range(-5, 62 + 1)) 666 667
668 -def _get_illumina_quality_str(record):
669 """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). 670 671 Notice that due to the limited range of printable ASCII characters, a 672 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 673 file (using ASCII 126, the tilde). This function will issue a warning 674 in this situation. 675 """ 676 # TODO - This functions works and is fast, but it is also ugly 677 # and there is considerable repetition of code for the other 678 # two FASTQ variants. 679 try: 680 # These take priority (in case both Solexa and PHRED scores found) 681 qualities = record.letter_annotations["phred_quality"] 682 except KeyError: 683 # Fall back on solexa scores... 684 pass 685 else: 686 # Try and use the precomputed mapping: 687 try: 688 return "".join(_phred_to_illumina_quality_str[qp] 689 for qp in qualities) 690 except KeyError: 691 # Could be a float, or a None in the list, or a high value. 692 pass 693 if None in qualities: 694 raise TypeError("A quality value of None was found") 695 if max(qualities) >= 62.5: 696 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 697 BiopythonWarning) 698 # This will apply the truncation at 62, giving max ASCII 126 699 return "".join(chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) 700 for qp in qualities) 701 # Fall back on the Solexa scores... 702 try: 703 qualities = record.letter_annotations["solexa_quality"] 704 except KeyError: 705 raise ValueError("No suitable quality scores found in " 706 "letter_annotations of SeqRecord (id=%s)." 707 % record.id) 708 # Try and use the precomputed mapping: 709 try: 710 return "".join(_solexa_to_illumina_quality_str[qs] 711 for qs in qualities) 712 except KeyError: 713 # Either no PHRED scores, or something odd like a float or None 714 pass 715 if None in qualities: 716 raise TypeError("A quality value of None was found") 717 # Must do this the slow way, first converting the PHRED scores into 718 # Solexa scores: 719 if max(qualities) >= 62.5: 720 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 721 BiopythonWarning) 722 # This will apply the truncation at 62, giving max ASCII 126 723 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 724 for qs in qualities)
725 726 # Only map 0 to 62, we need to give a warning on truncating at 62 727 assert 62 + SOLEXA_SCORE_OFFSET == 126 728 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs + SOLEXA_SCORE_OFFSET))) 729 for qs in range(-5, 62 + 1)) 730 # Only map -5 to 62, we need to give a warning on truncating at 62 731 _phred_to_solexa_quality_str = dict( 732 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp))) + 733 SOLEXA_SCORE_OFFSET))) 734 for qp in range(0, 62 + 1)) 735 736
737 -def _get_solexa_quality_str(record):
738 """Returns a Solexa FASTQ encoded quality string (PRIVATE). 739 740 Notice that due to the limited range of printable ASCII characters, a 741 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 742 file (using ASCII 126, the tilde). This function will issue a warning 743 in this situation. 744 """ 745 # TODO - This functions works and is fast, but it is also ugly 746 # and there is considerable repetition of code for the other 747 # two FASTQ variants. 748 try: 749 # These take priority (in case both Solexa and PHRED scores found) 750 qualities = record.letter_annotations["solexa_quality"] 751 except KeyError: 752 # Fall back on PHRED scores... 753 pass 754 else: 755 # Try and use the precomputed mapping: 756 try: 757 return "".join(_solexa_to_solexa_quality_str[qs] 758 for qs in qualities) 759 except KeyError: 760 # Could be a float, or a None in the list, or a high value. 761 pass 762 if None in qualities: 763 raise TypeError("A quality value of None was found") 764 if max(qualities) >= 62.5: 765 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 766 BiopythonWarning) 767 # This will apply the truncation at 62, giving max ASCII 126 768 return "".join(chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) 769 for qs in qualities) 770 # Fall back on the PHRED scores... 771 try: 772 qualities = record.letter_annotations["phred_quality"] 773 except KeyError: 774 raise ValueError("No suitable quality scores found in " 775 "letter_annotations of SeqRecord (id=%s)." 776 % record.id) 777 # Try and use the precomputed mapping: 778 try: 779 return "".join(_phred_to_solexa_quality_str[qp] 780 for qp in qualities) 781 except KeyError: 782 # Either no PHRED scores, or something odd like a float or None 783 # or too big to be in the cache 784 pass 785 if None in qualities: 786 raise TypeError("A quality value of None was found") 787 # Must do this the slow way, first converting the PHRED scores into 788 # Solexa scores: 789 if max(qualities) >= 62.5: 790 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 791 BiopythonWarning) 792 return "".join(chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) 793 for qp in qualities)
794 795 796 # TODO - Default to nucleotide or even DNA?
797 -def FastqGeneralIterator(handle):
798 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 799 800 This code does not try to interpret the quality string numerically. It 801 just returns tuples of the title, sequence and quality as strings. For 802 the sequence and quality, any whitespace (such as new lines) is removed. 803 804 Our SeqRecord based FASTQ iterators call this function internally, and then 805 turn the strings into a SeqRecord objects, mapping the quality string into 806 a list of numerical scores. If you want to do a custom quality mapping, 807 then you might consider calling this function directly. 808 809 For parsing FASTQ files, the title string from the "@" line at the start 810 of each record can optionally be omitted on the "+" lines. If it is 811 repeated, it must be identical. 812 813 The sequence string and the quality string can optionally be split over 814 multiple lines, although several sources discourage this. In comparison, 815 for the FASTA file format line breaks between 60 and 80 characters are 816 the norm. 817 818 **WARNING** - Because the "@" character can appear in the quality string, 819 this can cause problems as this is also the marker for the start of 820 a new sequence. In fact, the "+" sign can also appear as well. Some 821 sources recommended having no line breaks in the quality to avoid this, 822 but even that is not enough, consider this example:: 823 824 @071113_EAS56_0053:1:1:998:236 825 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 826 +071113_EAS56_0053:1:1:998:236 827 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 828 @071113_EAS56_0053:1:1:182:712 829 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 830 + 831 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 832 @071113_EAS56_0053:1:1:153:10 833 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 834 + 835 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 836 @071113_EAS56_0053:1:3:990:501 837 TGGGAGGTTTTATGTGGA 838 AAGCAGCAATGTACAAGA 839 + 840 IIIIIII.IIIIII1@44 841 @-7.%<&+/$/%4(++(% 842 843 This is four PHRED encoded FASTQ entries originally from an NCBI source 844 (given the read length of 36, these are probably Solexa Illumina reads where 845 the quality has been mapped onto the PHRED values). 846 847 This example has been edited to illustrate some of the nasty things allowed 848 in the FASTQ format. Firstly, on the "+" lines most but not all of the 849 (redundant) identifiers are omitted. In real files it is likely that all or 850 none of these extra identifiers will be present. 851 852 Secondly, while the first three sequences have been shown without line 853 breaks, the last has been split over multiple lines. In real files any line 854 breaks are likely to be consistent. 855 856 Thirdly, some of the quality string lines start with an "@" character. For 857 the second record this is unavoidable. However for the fourth sequence this 858 only happens because its quality string is split over two lines. A naive 859 parser could wrongly treat any line starting with an "@" as the beginning of 860 a new sequence! This code copes with this possible ambiguity by keeping 861 track of the length of the sequence which gives the expected length of the 862 quality string. 863 864 Using this tricky example file as input, this short bit of code demonstrates 865 what this parsing function would return: 866 867 >>> with open("Quality/tricky.fastq", "rU") as handle: 868 ... for (title, sequence, quality) in FastqGeneralIterator(handle): 869 ... print(title) 870 ... print("%s %s" % (sequence, quality)) 871 ... 872 071113_EAS56_0053:1:1:998:236 873 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 874 071113_EAS56_0053:1:1:182:712 875 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 876 071113_EAS56_0053:1:1:153:10 877 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 878 071113_EAS56_0053:1:3:990:501 879 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 880 881 Finally we note that some sources state that the quality string should 882 start with "!" (which using the PHRED mapping means the first letter always 883 has a quality score of zero). This rather restrictive rule is not widely 884 observed, so is therefore ignored here. One plus point about this "!" rule 885 is that (provided there are no line breaks in the quality sequence) it 886 would prevent the above problem with the "@" character. 887 """ 888 # We need to call handle.readline() at least four times per record, 889 # so we'll save a property look up each time: 890 handle_readline = handle.readline 891 892 # Skip any text before the first record (e.g. blank lines, comments?) 893 while True: 894 line = handle_readline() 895 if not line: 896 return # Premature end of file, or just empty? 897 if line[0] == "@": 898 break 899 if isinstance(line[0], int): 900 raise ValueError("Is this handle in binary mode not text mode?") 901 902 while line: 903 if line[0] != "@": 904 raise ValueError( 905 "Records in Fastq files should start with '@' character") 906 title_line = line[1:].rstrip() 907 # Will now be at least one line of quality data - in most FASTQ files 908 # just one line! We therefore use string concatenation (if needed) 909 # rather using than the "".join(...) trick just in case it is multiline: 910 seq_string = handle_readline().rstrip() 911 # There may now be more sequence lines, or the "+" quality marker line: 912 while True: 913 line = handle_readline() 914 if not line: 915 raise ValueError("End of file without quality information.") 916 if line[0] == "+": 917 # The title here is optional, but if present must match! 918 second_title = line[1:].rstrip() 919 if second_title and second_title != title_line: 920 raise ValueError("Sequence and quality captions differ.") 921 break 922 seq_string += line.rstrip() # removes trailing newlines 923 # This is going to slow things down a little, but assuming 924 # this isn't allowed we should try and catch it here: 925 if " " in seq_string or "\t" in seq_string: 926 raise ValueError("Whitespace is not allowed in the sequence.") 927 seq_len = len(seq_string) 928 929 # Will now be at least one line of quality data... 930 quality_string = handle_readline().rstrip() 931 # There may now be more quality data, or another sequence, or EOF 932 while True: 933 line = handle_readline() 934 if not line: 935 break # end of file 936 if line[0] == "@": 937 # This COULD be the start of a new sequence. However, it MAY just 938 # be a line of quality data which starts with a "@" character. We 939 # should be able to check this by looking at the sequence length 940 # and the amount of quality data found so far. 941 if len(quality_string) >= seq_len: 942 # We expect it to be equal if this is the start of a new record. 943 # If the quality data is longer, we'll raise an error below. 944 break 945 # Continue - its just some (more) quality data. 946 quality_string += line.rstrip() 947 948 if seq_len != len(quality_string): 949 raise ValueError("Lengths of sequence and quality values differs " 950 " for %s (%i and %i)." 951 % (title_line, seq_len, len(quality_string))) 952 953 # Return the record and then continue... 954 yield (title_line, seq_string, quality_string)
955 956
957 -def FastqPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
958 """Generator function to iterate over FASTQ records (as SeqRecord objects). 959 960 - handle - input file 961 - alphabet - optional alphabet 962 - title2ids - A function that, when given the title line from the FASTQ 963 file (without the beginning >), will return the id, name and 964 description (in that order) for the record as a tuple of 965 strings. If this is not given, then the entire title line 966 will be used as the description, and the first word as the 967 id and name. 968 969 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 970 971 For each sequence in a (Sanger style) FASTQ file there is a matching string 972 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 973 values with an offset of 33. 974 975 For example, consider a file containing three short reads:: 976 977 @EAS54_6_R1_2_1_413_324 978 CCCTTCTTGTCTTCAGCGTTTCTCC 979 + 980 ;;3;;;;;;;;;;;;7;;;;;;;88 981 @EAS54_6_R1_2_1_540_792 982 TTGGCAGGCCAAGGCCGATGGATCA 983 + 984 ;;;;;;;;;;;7;;;;;-;;;3;83 985 @EAS54_6_R1_2_1_443_348 986 GTTGCTTCTGGCGTGGGTGGGGGGG 987 + 988 ;;;;;;;;;;;9;7;;.7;393333 989 990 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 991 string encoding the PHRED qualities using a ASCII values with an offset of 992 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 993 994 Using this module directly you might run: 995 996 >>> with open("Quality/example.fastq", "rU") as handle: 997 ... for record in FastqPhredIterator(handle): 998 ... print("%s %s" % (record.id, record.seq)) 999 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1000 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1001 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1002 1003 Typically however, you would call this via Bio.SeqIO instead with "fastq" 1004 (or "fastq-sanger") as the format: 1005 1006 >>> from Bio import SeqIO 1007 >>> with open("Quality/example.fastq", "rU") as handle: 1008 ... for record in SeqIO.parse(handle, "fastq"): 1009 ... print("%s %s" % (record.id, record.seq)) 1010 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1011 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1012 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1013 1014 If you want to look at the qualities, they are record in each record's 1015 per-letter-annotation dictionary as a simple list of integers: 1016 1017 >>> print(record.letter_annotations["phred_quality"]) 1018 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1019 1020 """ 1021 assert SANGER_SCORE_OFFSET == ord("!") 1022 # Originally, I used a list expression for each record: 1023 # 1024 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1025 # 1026 # Precomputing is faster, perhaps partly by avoiding the subtractions. 1027 q_mapping = dict() 1028 for letter in range(0, 255): 1029 q_mapping[chr(letter)] = letter - SANGER_SCORE_OFFSET 1030 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1031 if title2ids: 1032 id, name, descr = title2ids(title_line) 1033 else: 1034 descr = title_line 1035 id = descr.split()[0] 1036 name = id 1037 record = SeqRecord(Seq(seq_string, alphabet), 1038 id=id, name=name, description=descr) 1039 qualities = [q_mapping[letter] for letter in quality_string] 1040 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1041 raise ValueError("Invalid character in quality string") 1042 # For speed, will now use a dirty trick to speed up assigning the 1043 # qualities. We do this to bypass the length check imposed by the 1044 # per-letter-annotations restricted dict (as this has already been 1045 # checked by FastqGeneralIterator). This is equivalent to: 1046 # record.letter_annotations["phred_quality"] = qualities 1047 dict.__setitem__(record._per_letter_annotations, 1048 "phred_quality", qualities) 1049 yield record
1050 1051
1052 -def FastqSolexaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1053 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1054 1055 The optional arguments are the same as those for the FastqPhredIterator. 1056 1057 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1058 encoding the Solexa integer qualities using ASCII values with an offset 1059 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1060 will NOT perform any automatic conversion when loading. 1061 1062 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1063 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1064 1065 For example, consider a file containing these five records:: 1066 1067 @SLXA-B3_649_FC8437_R1_1_1_610_79 1068 GATGTGCAATACCTTTGTAGAGGAA 1069 +SLXA-B3_649_FC8437_R1_1_1_610_79 1070 YYYYYYYYYYYYYYYYYYWYWYYSU 1071 @SLXA-B3_649_FC8437_R1_1_1_397_389 1072 GGTTTGAGAAAGAGAAATGAGATAA 1073 +SLXA-B3_649_FC8437_R1_1_1_397_389 1074 YYYYYYYYYWYYYYWWYYYWYWYWW 1075 @SLXA-B3_649_FC8437_R1_1_1_850_123 1076 GAGGGTGTTGATCATGATGATGGCG 1077 +SLXA-B3_649_FC8437_R1_1_1_850_123 1078 YYYYYYYYYYYYYWYYWYYSYYYSY 1079 @SLXA-B3_649_FC8437_R1_1_1_362_549 1080 GGAAACAAAGTTTTTCTCAACATAG 1081 +SLXA-B3_649_FC8437_R1_1_1_362_549 1082 YYYYYYYYYYYYYYYYYYWWWWYWY 1083 @SLXA-B3_649_FC8437_R1_1_1_183_714 1084 GTATTATTTAATGGCATACACTCAA 1085 +SLXA-B3_649_FC8437_R1_1_1_183_714 1086 YYYYYYYYYYWYYYYWYWWUWWWQQ 1087 1088 Using this module directly you might run: 1089 1090 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1091 ... for record in FastqSolexaIterator(handle): 1092 ... print("%s %s" % (record.id, record.seq)) 1093 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1094 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1095 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1096 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1097 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1098 1099 Typically however, you would call this via Bio.SeqIO instead with 1100 "fastq-solexa" as the format: 1101 1102 >>> from Bio import SeqIO 1103 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1104 ... for record in SeqIO.parse(handle, "fastq-solexa"): 1105 ... print("%s %s" % (record.id, record.seq)) 1106 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1107 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1108 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1109 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1110 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1111 1112 If you want to look at the qualities, they are recorded in each record's 1113 per-letter-annotation dictionary as a simple list of integers: 1114 1115 >>> print(record.letter_annotations["solexa_quality"]) 1116 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1117 1118 These scores aren't very good, but they are high enough that they map 1119 almost exactly onto PHRED scores: 1120 1121 >>> print("%0.2f" % phred_quality_from_solexa(25)) 1122 25.01 1123 1124 Let's look at faked example read which is even worse, where there are 1125 more noticeable differences between the Solexa and PHRED scores:: 1126 1127 @slxa_0001_1_0001_01 1128 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1129 +slxa_0001_1_0001_01 1130 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1131 1132 Again, you would typically use Bio.SeqIO to read this file in (rather than 1133 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1134 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1135 as shown above. This example has only as one entry, so instead we can 1136 use the Bio.SeqIO.read() function: 1137 1138 >>> from Bio import SeqIO 1139 >>> with open("Quality/solexa_faked.fastq", "rU") as handle: 1140 ... record = SeqIO.read(handle, "fastq-solexa") 1141 >>> print("%s %s" % (record.id, record.seq)) 1142 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1143 >>> print(record.letter_annotations["solexa_quality"]) 1144 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1145 1146 These quality scores are so low that when converted from the Solexa scheme 1147 into PHRED scores they look quite different: 1148 1149 >>> print("%0.2f" % phred_quality_from_solexa(-1)) 1150 2.54 1151 >>> print("%0.2f" % phred_quality_from_solexa(-5)) 1152 1.19 1153 1154 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1155 method to output the record(s): 1156 1157 >>> print(record.format("fastq-solexa")) 1158 @slxa_0001_1_0001_01 1159 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1160 + 1161 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1162 <BLANKLINE> 1163 1164 Note this output is slightly different from the input file as Biopython 1165 has left out the optional repetition of the sequence identifier on the "+" 1166 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1167 output format instead, and Biopython will do the conversion for you: 1168 1169 >>> print(record.format("fastq")) 1170 @slxa_0001_1_0001_01 1171 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1172 + 1173 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1174 <BLANKLINE> 1175 1176 >>> print(record.format("qual")) 1177 >slxa_0001_1_0001_01 1178 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1179 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1180 1 1 1181 <BLANKLINE> 1182 1183 As shown above, the poor quality Solexa reads have been mapped to the 1184 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1185 """ 1186 q_mapping = dict() 1187 for letter in range(0, 255): 1188 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1189 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1190 if title2ids: 1191 id, name, descr = title_line 1192 else: 1193 descr = title_line 1194 id = descr.split()[0] 1195 name = id 1196 record = SeqRecord(Seq(seq_string, alphabet), 1197 id=id, name=name, description=descr) 1198 qualities = [q_mapping[letter] for letter in quality_string] 1199 # DO NOT convert these into PHRED qualities automatically! 1200 if qualities and (min(qualities) < -5 or max(qualities) > 62): 1201 raise ValueError("Invalid character in quality string") 1202 # Dirty trick to speed up this line: 1203 # record.letter_annotations["solexa_quality"] = qualities 1204 dict.__setitem__(record._per_letter_annotations, 1205 "solexa_quality", qualities) 1206 yield record
1207 1208
1209 -def FastqIlluminaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1210 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). 1211 1212 The optional arguments are the same as those for the FastqPhredIterator. 1213 1214 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1215 encoding PHRED integer qualities using ASCII values with an offset of 64. 1216 1217 >>> from Bio import SeqIO 1218 >>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina") 1219 >>> print("%s %s" % (record.id, record.seq)) 1220 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1221 >>> max(record.letter_annotations["phred_quality"]) 1222 40 1223 >>> min(record.letter_annotations["phred_quality"]) 1224 0 1225 1226 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1227 with an ASCII offset of 64. They are approximately equal but only for high 1228 quality reads. If you have an old Solexa/Illumina file with negative 1229 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1230 1231 >>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina") 1232 Traceback (most recent call last): 1233 ... 1234 ValueError: Invalid character in quality string 1235 1236 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1237 """ 1238 q_mapping = dict() 1239 for letter in range(0, 255): 1240 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1241 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1242 if title2ids: 1243 id, name, descr = title2ids(title_line) 1244 else: 1245 descr = title_line 1246 id = descr.split()[0] 1247 name = id 1248 record = SeqRecord(Seq(seq_string, alphabet), 1249 id=id, name=name, description=descr) 1250 qualities = [q_mapping[letter] for letter in quality_string] 1251 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1252 raise ValueError("Invalid character in quality string") 1253 # Dirty trick to speed up this line: 1254 # record.letter_annotations["phred_quality"] = qualities 1255 dict.__setitem__(record._per_letter_annotations, 1256 "phred_quality", qualities) 1257 yield record
1258 1259
1260 -def QualPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1261 """For QUAL files which include PHRED quality scores, but no sequence. 1262 1263 For example, consider this short QUAL file:: 1264 1265 >EAS54_6_R1_2_1_413_324 1266 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1267 26 26 26 23 23 1268 >EAS54_6_R1_2_1_540_792 1269 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1270 26 18 26 23 18 1271 >EAS54_6_R1_2_1_443_348 1272 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1273 24 18 18 18 18 1274 1275 Using this module directly you might run: 1276 1277 >>> with open("Quality/example.qual", "rU") as handle: 1278 ... for record in QualPhredIterator(handle): 1279 ... print("%s %s" % (record.id, record.seq)) 1280 EAS54_6_R1_2_1_413_324 ????????????????????????? 1281 EAS54_6_R1_2_1_540_792 ????????????????????????? 1282 EAS54_6_R1_2_1_443_348 ????????????????????????? 1283 1284 Typically however, you would call this via Bio.SeqIO instead with "qual" 1285 as the format: 1286 1287 >>> from Bio import SeqIO 1288 >>> with open("Quality/example.qual", "rU") as handle: 1289 ... for record in SeqIO.parse(handle, "qual"): 1290 ... print("%s %s" % (record.id, record.seq)) 1291 EAS54_6_R1_2_1_413_324 ????????????????????????? 1292 EAS54_6_R1_2_1_540_792 ????????????????????????? 1293 EAS54_6_R1_2_1_443_348 ????????????????????????? 1294 1295 Becase QUAL files don't contain the sequence string itself, the seq 1296 property is set to an UnknownSeq object. As no alphabet was given, this 1297 has defaulted to a generic single letter alphabet and the character "?" 1298 used. 1299 1300 By specifying a nucleotide alphabet, "N" is used instead: 1301 1302 >>> from Bio import SeqIO 1303 >>> from Bio.Alphabet import generic_dna 1304 >>> with open("Quality/example.qual", "rU") as handle: 1305 ... for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1306 ... print("%s %s" % (record.id, record.seq)) 1307 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1308 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1309 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1310 1311 However, the quality scores themselves are available as a list of integers 1312 in each record's per-letter-annotation: 1313 1314 >>> print(record.letter_annotations["phred_quality"]) 1315 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1316 1317 You can still slice one of these SeqRecord objects with an UnknownSeq: 1318 1319 >>> sub_record = record[5:10] 1320 >>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"])) 1321 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1322 1323 As of Biopython 1.59, this parser will accept files with negatives quality 1324 scores but will replace them with the lowest possible PHRED score of zero. 1325 This will trigger a warning, previously it raised a ValueError exception. 1326 """ 1327 # Skip any text before the first record (e.g. blank lines, comments) 1328 while True: 1329 line = handle.readline() 1330 if line == "": 1331 return # Premature end of file, or just empty? 1332 if line[0] == ">": 1333 break 1334 1335 while True: 1336 if line[0] != ">": 1337 raise ValueError( 1338 "Records in Fasta files should start with '>' character") 1339 if title2ids: 1340 id, name, descr = title2ids(line[1:].rstrip()) 1341 else: 1342 descr = line[1:].rstrip() 1343 id = descr.split()[0] 1344 name = id 1345 1346 qualities = [] 1347 line = handle.readline() 1348 while True: 1349 if not line: 1350 break 1351 if line[0] == ">": 1352 break 1353 qualities.extend(int(word) for word in line.split()) 1354 line = handle.readline() 1355 1356 if qualities and min(qualities) < 0: 1357 warnings.warn(("Negative quality score %i found, " + 1358 "substituting PHRED zero instead.") 1359 % min(qualities), BiopythonParserWarning) 1360 qualities = [max(0, q) for q in qualities] 1361 1362 # Return the record and then continue... 1363 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1364 id=id, name=name, description=descr) 1365 # Dirty trick to speed up this line: 1366 # record.letter_annotations["phred_quality"] = qualities 1367 dict.__setitem__(record._per_letter_annotations, 1368 "phred_quality", qualities) 1369 yield record 1370 1371 if not line: 1372 return # StopIteration 1373 assert False, "Should not reach this line"
1374 1375
1376 -class FastqPhredWriter(SequentialSequenceWriter):
1377 """Class to write standard FASTQ format files (using PHRED quality scores). 1378 1379 Although you can use this class directly, you are strongly encouraged 1380 to use the Bio.SeqIO.write() function instead via the format name "fastq" 1381 or the alias "fastq-sanger". For example, this code reads in a standard 1382 Sanger style FASTQ file (using PHRED scores) and re-saves it as another 1383 Sanger style FASTQ file: 1384 1385 >>> from Bio import SeqIO 1386 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1387 >>> with open("Quality/temp.fastq", "w") as out_handle: 1388 ... SeqIO.write(record_iterator, out_handle, "fastq") 1389 3 1390 1391 You might want to do this if the original file included extra line breaks, 1392 which while valid may not be supported by all tools. The output file from 1393 Biopython will have each sequence on a single line, and each quality 1394 string on a single line (which is considered desirable for maximum 1395 compatibility). 1396 1397 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1398 quality scores) is converted into a standard Sanger style FASTQ file using 1399 PHRED qualities: 1400 1401 >>> from Bio import SeqIO 1402 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1403 >>> with open("Quality/temp.fastq", "w") as out_handle: 1404 ... SeqIO.write(record_iterator, out_handle, "fastq") 1405 5 1406 1407 This code is also called if you use the .format("fastq") method of a 1408 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1409 1410 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1411 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1412 warning is issued. 1413 1414 P.S. To avoid cluttering up your working directory, you can delete this 1415 temporary file now: 1416 1417 >>> import os 1418 >>> os.remove("Quality/temp.fastq") 1419 """ 1420 assert SANGER_SCORE_OFFSET == ord("!") 1421
1422 - def write_record(self, record):
1423 """Write a single FASTQ record to the file.""" 1424 assert self._header_written 1425 assert not self._footer_written 1426 self._record_written = True 1427 # TODO - Is an empty sequence allowed in FASTQ format? 1428 if record.seq is None: 1429 raise ValueError("No sequence for record %s" % record.id) 1430 seq_str = str(record.seq) 1431 qualities_str = _get_sanger_quality_str(record) 1432 if len(qualities_str) != len(seq_str): 1433 raise ValueError("Record %s has sequence length %i but %i quality scores" 1434 % (record.id, len(seq_str), len(qualities_str))) 1435 1436 # FASTQ files can include a description, just like FASTA files 1437 # (at least, this is what the NCBI Short Read Archive does) 1438 id = self.clean(record.id) 1439 description = self.clean(record.description) 1440 if description and description.split(None, 1)[0] == id: 1441 # The description includes the id at the start 1442 title = description 1443 elif description: 1444 title = "%s %s" % (id, description) 1445 else: 1446 title = id 1447 1448 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1449 1450
1451 -class QualPhredWriter(SequentialSequenceWriter):
1452 """Class to write QUAL format files (using PHRED quality scores). 1453 1454 Although you can use this class directly, you are strongly encouraged 1455 to use the Bio.SeqIO.write() function instead. For example, this code 1456 reads in a FASTQ file and saves the quality scores into a QUAL file: 1457 1458 >>> from Bio import SeqIO 1459 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1460 >>> with open("Quality/temp.qual", "w") as out_handle: 1461 ... SeqIO.write(record_iterator, out_handle, "qual") 1462 3 1463 1464 This code is also called if you use the .format("qual") method of a 1465 SeqRecord. 1466 1467 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1468 1469 >>> import os 1470 >>> os.remove("Quality/temp.qual") 1471 """
1472 - def __init__(self, handle, wrap=60, record2title=None):
1473 """Create a QUAL writer. 1474 1475 Arguments: 1476 - handle - Handle to an output file, e.g. as returned 1477 by open(filename, "w") 1478 - wrap - Optional line length used to wrap sequence lines. 1479 Defaults to wrapping the sequence at 60 characters 1480 Use zero (or None) for no wrapping, giving a single 1481 long line for the sequence. 1482 - record2title - Optional function to return the text to be 1483 used for the title line of each record. By default 1484 a combination of the record.id and record.description 1485 is used. If the record.description starts with the 1486 record.id, then just the record.description is used. 1487 1488 The record2title argument is present for consistency with the 1489 Bio.SeqIO.FastaIO writer class. 1490 """ 1491 SequentialSequenceWriter.__init__(self, handle) 1492 # self.handle = handle 1493 self.wrap = None 1494 if wrap: 1495 if wrap < 1: 1496 raise ValueError 1497 self.wrap = wrap 1498 self.record2title = record2title
1499
1500 - def write_record(self, record):
1501 """Write a single QUAL record to the file.""" 1502 assert self._header_written 1503 assert not self._footer_written 1504 self._record_written = True 1505 1506 handle = self.handle 1507 wrap = self.wrap 1508 1509 if self.record2title: 1510 title = self.clean(self.record2title(record)) 1511 else: 1512 id = self.clean(record.id) 1513 description = self.clean(record.description) 1514 if description and description.split(None, 1)[0] == id: 1515 # The description includes the id at the start 1516 title = description 1517 elif description: 1518 title = "%s %s" % (id, description) 1519 else: 1520 title = id 1521 handle.write(">%s\n" % title) 1522 1523 qualities = _get_phred_quality(record) 1524 try: 1525 # This rounds to the nearest integer. 1526 # TODO - can we record a float in a qual file? 1527 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1528 except TypeError as e: 1529 if None in qualities: 1530 raise TypeError("A quality value of None was found") 1531 else: 1532 raise e 1533 1534 if wrap > 5: 1535 # Fast wrapping 1536 data = " ".join(qualities_strs) 1537 while True: 1538 if len(data) <= wrap: 1539 self.handle.write(data + "\n") 1540 break 1541 else: 1542 # By construction there must be spaces in the first X chars 1543 # (unless we have X digit or higher quality scores!) 1544 i = data.rfind(" ", 0, wrap) 1545 handle.write(data[:i] + "\n") 1546 data = data[i + 1:] 1547 elif wrap: 1548 # Safe wrapping 1549 while qualities_strs: 1550 line = qualities_strs.pop(0) 1551 while qualities_strs \ 1552 and len(line) + 1 + len(qualities_strs[0]) < wrap: 1553 line += " " + qualities_strs.pop(0) 1554 handle.write(line + "\n") 1555 else: 1556 # No wrapping 1557 data = " ".join(qualities_strs) 1558 handle.write(data + "\n")
1559 1560
1561 -class FastqSolexaWriter(SequentialSequenceWriter):
1562 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities). 1563 1564 This outputs FASTQ files like those from the early Solexa/Illumina 1565 pipeline, using Solexa scores and an ASCII offset of 64. These are 1566 NOT compatible with the standard Sanger style PHRED FASTQ files. 1567 1568 If your records contain a "solexa_quality" entry under letter_annotations, 1569 this is used, otherwise any "phred_quality" entry will be used after 1570 conversion using the solexa_quality_from_phred function. If neither style 1571 of quality scores are present, an exception is raised. 1572 1573 Although you can use this class directly, you are strongly encouraged 1574 to use the Bio.SeqIO.write() function instead. For example, this code 1575 reads in a FASTQ file and re-saves it as another FASTQ file: 1576 1577 >>> from Bio import SeqIO 1578 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1579 >>> with open("Quality/temp.fastq", "w") as out_handle: 1580 ... SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1581 5 1582 1583 You might want to do this if the original file included extra line breaks, 1584 which (while valid) may not be supported by all tools. The output file 1585 from Biopython will have each sequence on a single line, and each quality 1586 string on a single line (which is considered desirable for maximum 1587 compatibility). 1588 1589 This code is also called if you use the .format("fastq-solexa") method of 1590 a SeqRecord. For example, 1591 1592 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1593 >>> print(record.format("fastq-solexa")) 1594 @Test PHRED qualities from 40 to 0 inclusive 1595 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1596 + 1597 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1598 <BLANKLINE> 1599 1600 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1601 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1602 a warning is issued. 1603 1604 P.S. Don't forget to delete the temp file if you don't need it anymore: 1605 1606 >>> import os 1607 >>> os.remove("Quality/temp.fastq") 1608 """
1609 - def write_record(self, record):
1610 """Write a single FASTQ record to the file.""" 1611 assert self._header_written 1612 assert not self._footer_written 1613 self._record_written = True 1614 1615 # TODO - Is an empty sequence allowed in FASTQ format? 1616 if record.seq is None: 1617 raise ValueError("No sequence for record %s" % record.id) 1618 seq_str = str(record.seq) 1619 qualities_str = _get_solexa_quality_str(record) 1620 if len(qualities_str) != len(seq_str): 1621 raise ValueError("Record %s has sequence length %i but %i quality scores" 1622 % (record.id, len(seq_str), len(qualities_str))) 1623 1624 # FASTQ files can include a description, just like FASTA files 1625 # (at least, this is what the NCBI Short Read Archive does) 1626 id = self.clean(record.id) 1627 description = self.clean(record.description) 1628 if description and description.split(None, 1)[0] == id: 1629 # The description includes the id at the start 1630 title = description 1631 elif description: 1632 title = "%s %s" % (id, description) 1633 else: 1634 title = id 1635 1636 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1637 1638
1639 -class FastqIlluminaWriter(SequentialSequenceWriter):
1640 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores). 1641 1642 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1643 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1644 compatible with the standard Sanger style PHRED FASTQ files which use an 1645 ASCII offset of 32. 1646 1647 Although you can use this class directly, you are strongly encouraged to 1648 use the Bio.SeqIO.write() function with format name "fastq-illumina" 1649 instead. This code is also called if you use the .format("fastq-illumina") 1650 method of a SeqRecord. For example, 1651 1652 >>> from Bio import SeqIO 1653 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1654 >>> print(record.format("fastq-illumina")) 1655 @Test PHRED qualities from 40 to 0 inclusive 1656 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1657 + 1658 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1659 <BLANKLINE> 1660 1661 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1662 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1663 warning is issued. 1664 """
1665 - def write_record(self, record):
1666 """Write a single FASTQ record to the file.""" 1667 assert self._header_written 1668 assert not self._footer_written 1669 self._record_written = True 1670 1671 # TODO - Is an empty sequence allowed in FASTQ format? 1672 if record.seq is None: 1673 raise ValueError("No sequence for record %s" % record.id) 1674 seq_str = str(record.seq) 1675 qualities_str = _get_illumina_quality_str(record) 1676 if len(qualities_str) != len(seq_str): 1677 raise ValueError("Record %s has sequence length %i but %i quality scores" 1678 % (record.id, len(seq_str), len(qualities_str))) 1679 1680 # FASTQ files can include a description, just like FASTA files 1681 # (at least, this is what the NCBI Short Read Archive does) 1682 id = self.clean(record.id) 1683 description = self.clean(record.description) 1684 if description and description.split(None, 1)[0] == id: 1685 # The description includes the id at the start 1686 title = description 1687 elif description: 1688 title = "%s %s" % (id, description) 1689 else: 1690 title = id 1691 1692 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1693 1694
1695 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=single_letter_alphabet, title2ids=None):
1696 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1697 1698 For example, consider this short QUAL file with PHRED quality scores:: 1699 1700 >EAS54_6_R1_2_1_413_324 1701 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1702 26 26 26 23 23 1703 >EAS54_6_R1_2_1_540_792 1704 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1705 26 18 26 23 18 1706 >EAS54_6_R1_2_1_443_348 1707 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1708 24 18 18 18 18 1709 1710 And a matching FASTA file:: 1711 1712 >EAS54_6_R1_2_1_413_324 1713 CCCTTCTTGTCTTCAGCGTTTCTCC 1714 >EAS54_6_R1_2_1_540_792 1715 TTGGCAGGCCAAGGCCGATGGATCA 1716 >EAS54_6_R1_2_1_443_348 1717 GTTGCTTCTGGCGTGGGTGGGGGGG 1718 1719 You can parse these separately using Bio.SeqIO with the "qual" and 1720 "fasta" formats, but then you'll get a group of SeqRecord objects with 1721 no sequence, and a matching group with the sequence but not the 1722 qualities. Because it only deals with one input file handle, Bio.SeqIO 1723 can't be used to read the two files together - but this function can! 1724 For example, 1725 1726 >>> with open("Quality/example.fasta", "rU") as f: 1727 ... with open("Quality/example.qual", "rU") as q: 1728 ... for record in PairedFastaQualIterator(f, q): 1729 ... print("%s %s" % (record.id, record.seq)) 1730 ... 1731 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1732 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1733 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1734 1735 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1736 they are in each record's per-letter-annotation dictionary as a simple 1737 list of integers: 1738 1739 >>> print(record.letter_annotations["phred_quality"]) 1740 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1741 1742 If you have access to data as a FASTQ format file, using that directly 1743 would be simpler and more straight forward. Note that you can easily use 1744 this function to convert paired FASTA and QUAL files into FASTQ files: 1745 1746 >>> from Bio import SeqIO 1747 >>> with open("Quality/example.fasta", "rU") as f: 1748 ... with open("Quality/example.qual", "rU") as q: 1749 ... SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq") 1750 ... 1751 3 1752 1753 And don't forget to clean up the temp file if you don't need it anymore: 1754 1755 >>> import os 1756 >>> os.remove("Quality/temp.fastq") 1757 """ 1758 from Bio.SeqIO.FastaIO import FastaIterator 1759 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, 1760 title2ids=title2ids) 1761 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, 1762 title2ids=title2ids) 1763 1764 # Using (Python 3 style) zip wouldn't load everything into memory, 1765 # but also would not catch any extra records found in only one file. 1766 while True: 1767 try: 1768 f_rec = next(fasta_iter) 1769 except StopIteration: 1770 f_rec = None 1771 try: 1772 q_rec = next(qual_iter) 1773 except StopIteration: 1774 q_rec = None 1775 if f_rec is None and q_rec is None: 1776 # End of both files 1777 break 1778 if f_rec is None: 1779 raise ValueError("FASTA file has more entries than the QUAL file.") 1780 if q_rec is None: 1781 raise ValueError("QUAL file has more entries than the FASTA file.") 1782 if f_rec.id != q_rec.id: 1783 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." 1784 % (f_rec.id, q_rec.id)) 1785 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1786 raise ValueError("Sequence length and number of quality scores disagree for %s" 1787 % f_rec.id) 1788 # Merge the data.... 1789 f_rec.letter_annotations[ 1790 "phred_quality"] = q_rec.letter_annotations["phred_quality"] 1791 yield f_rec
1792 # Done 1793 1794 1795 if __name__ == "__main__": 1796 from Bio._utils import run_doctest 1797 run_doctest(verbose=0) 1798