Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2017 by Peter Cock.  All rights reserved. 
   2  # 
   3  # This file is part of the Biopython distribution and governed by your 
   4  # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". 
   5  # Please see the LICENSE file that should have been included as part of this 
   6  # package. 
   7  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
   8   
   9  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  10  shown below. 
  11   
  12  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  13  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  14  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  15  are used containing the sequence and the quality information separately. 
  16   
  17  The PHRED software reads DNA sequencing trace files, calls bases, and 
  18  assigns a non-negative quality value to each called base using a logged 
  19  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  20   
  21      Pe = 1.0,         Q =  0 
  22      Pe = 0.1,         Q = 10 
  23      Pe = 0.01,        Q = 20 
  24      ... 
  25      Pe = 0.00000001,  Q = 80 
  26      Pe = 0.000000001, Q = 90 
  27   
  28  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  29  In the QUAL format these quality values are held as space separated text in 
  30  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  31  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  32  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  33  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  34  quality are then stored in pairs in a FASTA like format. 
  35   
  36  Unfortunately there is no official document describing the FASTQ file format, 
  37  and worse, several related but different variants exist. For more details, 
  38  please read this open access publication:: 
  39   
  40      The Sanger FASTQ file format for sequences with quality scores, and the 
  41      Solexa/Illumina FASTQ variants. 
  42      P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 
  43      M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 
  44      Nucleic Acids Research 2010 38(6):1767-1771 
  45      https://doi.org/10.1093/nar/gkp1137 
  46   
  47  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  48  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  49  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  50  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  51  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  52  FASTA file using the -s or -seq argument instead. 
  53   
  54  The bad news is that Solexa/Illumina did things differently - they have their 
  55  own scoring system AND their own incompatible versions of the FASTQ format. 
  56  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  57  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  58  reasonable mapping can be achieved between them, and they are approximately 
  59  equal for higher quality reads). 
  60   
  61  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  62  own score mapping and an ASCII offset of 64. To make things worse, for the 
  63  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  64  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  65  with an ASCII offset of 64. 
  66   
  67  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  68  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  69   
  70  The good news is that as of CASAVA version 1.8, Illumina sequencers will 
  71  produce FASTQ files using the standard Sanger encoding. 
  72   
  73  You are expected to use this module via the Bio.SeqIO functions, with the 
  74  following format names: 
  75   
  76      - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  77      - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  78        offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). 
  79        These can potentially hold PHRED scores from 0 to 93. 
  80      - "fastq-sanger" is an alias for "fastq". 
  81      - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  82        files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  83        scores from -5 to 62. 
  84      - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using 
  85        PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 
  86        to 62. 
  87   
  88  We could potentially add support for "qual-solexa" meaning QUAL files which 
  89  contain Solexa scores, but thus far there isn't any reason to use such files. 
  90   
  91  For example, consider the following short FASTQ file:: 
  92   
  93      @EAS54_6_R1_2_1_413_324 
  94      CCCTTCTTGTCTTCAGCGTTTCTCC 
  95      + 
  96      ;;3;;;;;;;;;;;;7;;;;;;;88 
  97      @EAS54_6_R1_2_1_540_792 
  98      TTGGCAGGCCAAGGCCGATGGATCA 
  99      + 
 100      ;;;;;;;;;;;7;;;;;-;;;3;83 
 101      @EAS54_6_R1_2_1_443_348 
 102      GTTGCTTCTGGCGTGGGTGGGGGGG 
 103      + 
 104      ;;;;;;;;;;;9;7;;.7;393333 
 105   
 106  This contains three reads of length 25.  From the read length these were 
 107  probably originally from an early Solexa/Illumina sequencer but this file 
 108  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 109  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 110  "fastq" as the format name: 
 111   
 112  >>> from Bio import SeqIO 
 113  >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 
 114  ...     print("%s %s" % (record.id, record.seq)) 
 115  EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 116  EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 117  EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 118   
 119  The qualities are held as a list of integers in each record's annotation: 
 120   
 121  >>> print(record) 
 122  ID: EAS54_6_R1_2_1_443_348 
 123  Name: EAS54_6_R1_2_1_443_348 
 124  Description: EAS54_6_R1_2_1_443_348 
 125  Number of features: 0 
 126  Per letter annotation for: phred_quality 
 127  Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 128  >>> print(record.letter_annotations["phred_quality"]) 
 129  [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 130   
 131  You can use the SeqRecord format method to show this in the QUAL format: 
 132   
 133  >>> print(record.format("qual")) 
 134  >EAS54_6_R1_2_1_443_348 
 135  26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 136  24 18 18 18 18 
 137  <BLANKLINE> 
 138   
 139  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 140   
 141  >>> print(record.format("fastq")) 
 142  @EAS54_6_R1_2_1_443_348 
 143  GTTGCTTCTGGCGTGGGTGGGGGGG 
 144  + 
 145  ;;;;;;;;;;;9;7;;.7;393333 
 146  <BLANKLINE> 
 147   
 148  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 149  of 64): 
 150   
 151  >>> print(record.format("fastq-illumina")) 
 152  @EAS54_6_R1_2_1_443_348 
 153  GTTGCTTCTGGCGTGGGTGGGGGGG 
 154  + 
 155  ZZZZZZZZZZZXZVZZMVZRXRRRR 
 156  <BLANKLINE> 
 157   
 158  You can also get Biopython to convert the scores and show a Solexa style 
 159  FASTQ file: 
 160   
 161  >>> print(record.format("fastq-solexa")) 
 162  @EAS54_6_R1_2_1_443_348 
 163  GTTGCTTCTGGCGTGGGTGGGGGGG 
 164  + 
 165  ZZZZZZZZZZZXZVZZMVZRXRRRR 
 166  <BLANKLINE> 
 167   
 168  Notice that this is actually the same output as above using "fastq-illumina" 
 169  as the format! The reason for this is all these scores are high enough that 
 170  the PHRED and Solexa scores are almost equal. The differences become apparent 
 171  for poor quality reads. See the functions solexa_quality_from_phred and 
 172  phred_quality_from_solexa for more details. 
 173   
 174  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 175  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 176   
 177  >>> sub_rec = record[5:15] 
 178  >>> print(sub_rec) 
 179  ID: EAS54_6_R1_2_1_443_348 
 180  Name: EAS54_6_R1_2_1_443_348 
 181  Description: EAS54_6_R1_2_1_443_348 
 182  Number of features: 0 
 183  Per letter annotation for: phred_quality 
 184  Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 185  >>> print(sub_rec.letter_annotations["phred_quality"]) 
 186  [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 187  >>> print(sub_rec.format("fastq")) 
 188  @EAS54_6_R1_2_1_443_348 
 189  TTCTGGCGTG 
 190  + 
 191  ;;;;;;9;7; 
 192  <BLANKLINE> 
 193   
 194  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 195   
 196  >>> from Bio import SeqIO 
 197  >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 
 198  >>> with open("Quality/temp.qual", "w") as out_handle: 
 199  ...     SeqIO.write(record_iterator, out_handle, "qual") 
 200  3 
 201   
 202  You can of course read in a QUAL file, such as the one we just created: 
 203   
 204  >>> from Bio import SeqIO 
 205  >>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 
 206  ...     print("%s %s" % (record.id, record.seq)) 
 207  EAS54_6_R1_2_1_413_324 ????????????????????????? 
 208  EAS54_6_R1_2_1_540_792 ????????????????????????? 
 209  EAS54_6_R1_2_1_443_348 ????????????????????????? 
 210   
 211  Notice that QUAL files don't have a proper sequence present!  But the quality 
 212  information is there: 
 213   
 214  >>> print(record) 
 215  ID: EAS54_6_R1_2_1_443_348 
 216  Name: EAS54_6_R1_2_1_443_348 
 217  Description: EAS54_6_R1_2_1_443_348 
 218  Number of features: 0 
 219  Per letter annotation for: phred_quality 
 220  UnknownSeq(25, alphabet=SingleLetterAlphabet(), character='?') 
 221  >>> print(record.letter_annotations["phred_quality"]) 
 222  [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 223   
 224  Just to keep things tidy, if you are following this example yourself, you can 
 225  delete this temporary file now: 
 226   
 227  >>> import os 
 228  >>> os.remove("Quality/temp.qual") 
 229   
 230  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 231  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 232  would have to read the two in separately and then combine the data.  However, 
 233  since this is such a common thing to want to do, there is a helper iterator 
 234  defined in this module that does this for you - PairedFastaQualIterator. 
 235   
 236  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 237  then a simple dictionary approach would work: 
 238   
 239  >>> from Bio import SeqIO 
 240  >>> reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta")) 
 241  >>> for rec in SeqIO.parse("Quality/example.qual", "qual"): 
 242  ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 243   
 244  You can then access any record by its key, and get both the sequence and the 
 245  quality scores. 
 246   
 247  >>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq")) 
 248  @EAS54_6_R1_2_1_540_792 
 249  TTGGCAGGCCAAGGCCGATGGATCA 
 250  + 
 251  ;;;;;;;;;;;7;;;;;-;;;3;83 
 252  <BLANKLINE> 
 253   
 254  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 255  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 256  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 257  for the more recent variant), as this cannot be detected reliably 
 258  automatically. 
 259   
 260  To illustrate this problem, let's consider an artificial example: 
 261   
 262  >>> from Bio.Seq import Seq 
 263  >>> from Bio.Alphabet import generic_dna 
 264  >>> from Bio.SeqRecord import SeqRecord 
 265  >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 266  ... description="Made up!") 
 267  >>> print(test.format("fasta")) 
 268  >Test Made up! 
 269  NACGTACGTA 
 270  <BLANKLINE> 
 271  >>> print(test.format("fastq")) 
 272  Traceback (most recent call last): 
 273   ... 
 274  ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 275   
 276  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 277  or FASTQ format we need to provide some quality scores. These are held as a 
 278  list of integers (one for each base) in the letter_annotations dictionary: 
 279   
 280  >>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 281  >>> print(test.format("qual")) 
 282  >Test Made up! 
 283  0 1 2 3 4 5 10 20 30 40 
 284  <BLANKLINE> 
 285  >>> print(test.format("fastq")) 
 286  @Test Made up! 
 287  NACGTACGTA 
 288  + 
 289  !"#$%&+5?I 
 290  <BLANKLINE> 
 291   
 292  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 293  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 294  the letter "I": 
 295   
 296  >>> ord('!') - 33 
 297  0 
 298  >>> ord('I') - 33 
 299  40 
 300  >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 301  [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 302   
 303  Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED 
 304  scores with an offset of 64: 
 305   
 306  >>> print(test.format("fastq-illumina")) 
 307  @Test Made up! 
 308  NACGTACGTA 
 309  + 
 310  @ABCDEJT^h 
 311  <BLANKLINE> 
 312   
 313  And we can check this too - the first PHRED score was zero, and this mapped to 
 314  "@", while the final score was 40 and this mapped to "h": 
 315   
 316  >>> ord("@") - 64 
 317  0 
 318  >>> ord("h") - 64 
 319  40 
 320  >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 321  [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 322   
 323  Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style 
 324  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 325  format to consider which encodes Solexa scores instead of PHRED scores. 
 326   
 327  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 328  scores (rounding to one decimal place): 
 329   
 330  >>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]: 
 331  ...     print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))) 
 332  PHRED 0 maps to Solexa -5.0 
 333  PHRED 1 maps to Solexa -5.0 
 334  PHRED 2 maps to Solexa -2.3 
 335  PHRED 3 maps to Solexa -0.0 
 336  PHRED 4 maps to Solexa 1.8 
 337  PHRED 5 maps to Solexa 3.3 
 338  PHRED 10 maps to Solexa 9.5 
 339  PHRED 20 maps to Solexa 20.0 
 340  PHRED 30 maps to Solexa 30.0 
 341  PHRED 40 maps to Solexa 40.0 
 342   
 343  Now here is the record using the old Solexa style FASTQ file: 
 344   
 345  >>> print(test.format("fastq-solexa")) 
 346  @Test Made up! 
 347  NACGTACGTA 
 348  + 
 349  ;;>@BCJT^h 
 350  <BLANKLINE> 
 351   
 352  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 353   
 354  >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 355  [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 356   
 357  This explains why the last few letters of this FASTQ output matched that using 
 358  the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores 
 359  are approximately equal. 
 360   
 361  """ 
 362  from __future__ import print_function 
 363   
 364  from Bio.Alphabet import single_letter_alphabet 
 365  from Bio.Seq import Seq, UnknownSeq 
 366  from Bio.SeqRecord import SeqRecord 
 367  from Bio.SeqIO.Interfaces import SequentialSequenceWriter 
 368  from Bio.SeqIO.Interfaces import _clean, _get_seq_string 
 369   
 370  from math import log 
 371  import warnings 
 372  from Bio import BiopythonWarning, BiopythonParserWarning 
 373   
 374   
 375  # define score offsets. See discussion for differences between Sanger and 
 376  # Solexa offsets. 
 377  SANGER_SCORE_OFFSET = 33 
 378  SOLEXA_SCORE_OFFSET = 64 
 379   
 380   
381 -def solexa_quality_from_phred(phred_quality):
382 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 383 384 PHRED and Solexa quality scores are both log transformations of a 385 probality of error (high score = low probability of error). This function 386 takes a PHRED score, transforms it back to a probability of error, and 387 then re-expresses it as a Solexa score. This assumes the error estimates 388 are equivalent. 389 390 How does this work exactly? Well the PHRED quality is minus ten times the 391 base ten logarithm of the probability of error:: 392 393 phred_quality = -10*log(error,10) 394 395 Therefore, turning this round:: 396 397 error = 10 ** (- phred_quality / 10) 398 399 Now, Solexa qualities use a different log transformation:: 400 401 solexa_quality = -10*log(error/(1-error),10) 402 403 After substitution and a little manipulation we get:: 404 405 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 406 407 However, real Solexa files use a minimum quality of -5. This does have a 408 good reason - a random base call would be correct 25% of the time, 409 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 410 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 411 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 412 413 Taken literally, this logarithic formula would map a PHRED quality of zero 414 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 415 score of zero means a probability of error of one (i.e. the base call is 416 definitely wrong), which is worse than random! In practice, a PHRED quality 417 of zero usually means a default value, or perhaps random - and therefore 418 mapping it to the minimum Solexa score of -5 is reasonable. 419 420 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 421 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 422 quality of zero to -5.0 as well. 423 424 Note this function will return a floating point number, it is up to you to 425 round this to the nearest integer if appropriate. e.g. 426 427 >>> print("%0.2f" % round(solexa_quality_from_phred(80), 2)) 428 80.00 429 >>> print("%0.2f" % round(solexa_quality_from_phred(50), 2)) 430 50.00 431 >>> print("%0.2f" % round(solexa_quality_from_phred(20), 2)) 432 19.96 433 >>> print("%0.2f" % round(solexa_quality_from_phred(10), 2)) 434 9.54 435 >>> print("%0.2f" % round(solexa_quality_from_phred(5), 2)) 436 3.35 437 >>> print("%0.2f" % round(solexa_quality_from_phred(4), 2)) 438 1.80 439 >>> print("%0.2f" % round(solexa_quality_from_phred(3), 2)) 440 -0.02 441 >>> print("%0.2f" % round(solexa_quality_from_phred(2), 2)) 442 -2.33 443 >>> print("%0.2f" % round(solexa_quality_from_phred(1), 2)) 444 -5.00 445 >>> print("%0.2f" % round(solexa_quality_from_phred(0), 2)) 446 -5.00 447 448 Notice that for high quality reads PHRED and Solexa scores are numerically 449 equal. The differences are important for poor quality reads, where PHRED 450 has a minimum of zero but Solexa scores can be negative. 451 452 Finally, as a special case where None is used for a "missing value", None 453 is returned: 454 455 >>> print(solexa_quality_from_phred(None)) 456 None 457 """ 458 if phred_quality is None: 459 # Assume None is used as some kind of NULL or NA value; return None 460 # e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 461 return None 462 elif phred_quality > 0: 463 # Solexa uses a minimum value of -5, which after rounding matches a 464 # random nucleotide base call. 465 return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10)) 466 elif phred_quality == 0: 467 # Special case, map to -5 as discussed in the docstring 468 return -5.0 469 else: 470 raise ValueError("PHRED qualities must be positive (or zero), not %r" 471 % phred_quality)
472 473
474 -def phred_quality_from_solexa(solexa_quality):
475 """Convert a Solexa quality (which can be negative) to a PHRED quality. 476 477 PHRED and Solexa quality scores are both log transformations of a 478 probality of error (high score = low probability of error). This function 479 takes a Solexa score, transforms it back to a probability of error, and 480 then re-expresses it as a PHRED score. This assumes the error estimates 481 are equivalent. 482 483 The underlying formulas are given in the documentation for the sister 484 function solexa_quality_from_phred, in this case the operation is:: 485 486 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 487 488 This will return a floating point number, it is up to you to round this to 489 the nearest integer if appropriate. e.g. 490 491 >>> print("%0.2f" % round(phred_quality_from_solexa(80), 2)) 492 80.00 493 >>> print("%0.2f" % round(phred_quality_from_solexa(20), 2)) 494 20.04 495 >>> print("%0.2f" % round(phred_quality_from_solexa(10), 2)) 496 10.41 497 >>> print("%0.2f" % round(phred_quality_from_solexa(0), 2)) 498 3.01 499 >>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2)) 500 1.19 501 502 Note that a solexa_quality less then -5 is not expected, will trigger a 503 warning, but will still be converted as per the logarithmic mapping 504 (giving a number between 0 and 1.19 back). 505 506 As a special case where None is used for a "missing value", None is 507 returned: 508 509 >>> print(phred_quality_from_solexa(None)) 510 None 511 """ 512 if solexa_quality is None: 513 # Assume None is used as some kind of NULL or NA value; return None 514 return None 515 if solexa_quality < -5: 516 warnings.warn("Solexa quality less than -5 passed, %r" % solexa_quality, 517 BiopythonWarning) 518 return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
519 520
521 -def _get_phred_quality(record):
522 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 523 524 If there are no PHRED qualities, but there are Solexa qualities, those are 525 used instead after conversion. 526 """ 527 try: 528 return record.letter_annotations["phred_quality"] 529 except KeyError: 530 pass 531 try: 532 return [phred_quality_from_solexa(q) for 533 q in record.letter_annotations["solexa_quality"]] 534 except KeyError: 535 raise ValueError("No suitable quality scores found in " 536 "letter_annotations of SeqRecord (id=%s)." 537 % record.id)
538 539 540 # Only map 0 to 93, we need to give a warning on truncating at 93 541 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp + SANGER_SCORE_OFFSET))) 542 for qp in range(0, 93 + 1)) 543 # Only map -5 to 93, we need to give a warning on truncating at 93 544 _solexa_to_sanger_quality_str = dict( 545 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs))) + 546 SANGER_SCORE_OFFSET))) 547 for qs in range(-5, 93 + 1)) 548 549
550 -def _get_sanger_quality_str(record):
551 """Return a Sanger FASTQ encoded quality string (PRIVATE). 552 553 >>> from Bio.Seq import Seq 554 >>> from Bio.SeqRecord import SeqRecord 555 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 556 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0]}) 557 >>> _get_sanger_quality_str(r) 558 'SI?5+!' 559 560 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 561 the PHRED qualities are integers, this function is able to use a very fast 562 pre-cached mapping. However, if they are floats which differ slightly, then 563 it has to do the appropriate rounding - which is slower: 564 565 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 566 ... letter_annotations = {"phred_quality":[50.0, 40.05, 29.99, 20, 9.55, 0.01]}) 567 >>> _get_sanger_quality_str(r2) 568 'SI?5+!' 569 570 If your scores include a None value, this raises an exception: 571 572 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 573 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, None]}) 574 >>> _get_sanger_quality_str(r3) 575 Traceback (most recent call last): 576 ... 577 TypeError: A quality value of None was found 578 579 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 580 scores are used in preference: 581 582 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 583 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0], 584 ... "solexa_quality":[-5, -4, 0, None, 0, 40]}) 585 >>> _get_sanger_quality_str(r4) 586 'SI?5+!' 587 588 If there are no PHRED scores, but there are Solexa scores, these are used 589 instead (after the appropriate conversion): 590 591 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 592 ... letter_annotations = {"solexa_quality":[40, 30, 20, 10, 0, -5]}) 593 >>> _get_sanger_quality_str(r5) 594 'I?5+$"' 595 596 Again, integer Solexa scores can be looked up in a pre-cached mapping making 597 this very fast. You can still use approximate floating point scores: 598 599 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 600 ... letter_annotations = {"solexa_quality":[40.1, 29.7, 20.01, 10, 0.0, -4.9]}) 601 >>> _get_sanger_quality_str(r6) 602 'I?5+$"' 603 604 Notice that due to the limited range of printable ASCII characters, a 605 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 606 file (using ASCII 126, the tilde). This function will issue a warning 607 in this situation. 608 """ 609 # TODO - This functions works and is fast, but it is also ugly 610 # and there is considerable repetition of code for the other 611 # two FASTQ variants. 612 try: 613 # These take priority (in case both Solexa and PHRED scores found) 614 qualities = record.letter_annotations["phred_quality"] 615 except KeyError: 616 # Fall back on solexa scores... 617 pass 618 else: 619 # Try and use the precomputed mapping: 620 try: 621 return "".join(_phred_to_sanger_quality_str[qp] 622 for qp in qualities) 623 except KeyError: 624 # Could be a float, or a None in the list, or a high value. 625 pass 626 if None in qualities: 627 raise TypeError("A quality value of None was found") 628 if max(qualities) >= 93.5: 629 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 630 BiopythonWarning) 631 # This will apply the truncation at 93, giving max ASCII 126 632 return "".join(chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) 633 for qp in qualities) 634 # Fall back on the Solexa scores... 635 try: 636 qualities = record.letter_annotations["solexa_quality"] 637 except KeyError: 638 raise ValueError("No suitable quality scores found in " 639 "letter_annotations of SeqRecord (id=%s)." 640 % record.id) 641 # Try and use the precomputed mapping: 642 try: 643 return "".join(_solexa_to_sanger_quality_str[qs] 644 for qs in qualities) 645 except KeyError: 646 # Either no PHRED scores, or something odd like a float or None 647 pass 648 if None in qualities: 649 raise TypeError("A quality value of None was found") 650 # Must do this the slow way, first converting the PHRED scores into 651 # Solexa scores: 652 if max(qualities) >= 93.5: 653 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 654 BiopythonWarning) 655 # This will apply the truncation at 93, giving max ASCII 126 656 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET)) 657 for qs in qualities)
658 659 660 # Only map 0 to 62, we need to give a warning on truncating at 62 661 assert 62 + SOLEXA_SCORE_OFFSET == 126 662 _phred_to_illumina_quality_str = dict((qp, chr(qp + SOLEXA_SCORE_OFFSET)) 663 for qp in range(0, 62 + 1)) 664 # Only map -5 to 62, we need to give a warning on truncating at 62 665 _solexa_to_illumina_quality_str = dict( 666 (qs, chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 667 for qs in range(-5, 62 + 1)) 668 669
670 -def _get_illumina_quality_str(record):
671 """Return an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). 672 673 Notice that due to the limited range of printable ASCII characters, a 674 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 675 file (using ASCII 126, the tilde). This function will issue a warning 676 in this situation. 677 """ 678 # TODO - This functions works and is fast, but it is also ugly 679 # and there is considerable repetition of code for the other 680 # two FASTQ variants. 681 try: 682 # These take priority (in case both Solexa and PHRED scores found) 683 qualities = record.letter_annotations["phred_quality"] 684 except KeyError: 685 # Fall back on solexa scores... 686 pass 687 else: 688 # Try and use the precomputed mapping: 689 try: 690 return "".join(_phred_to_illumina_quality_str[qp] 691 for qp in qualities) 692 except KeyError: 693 # Could be a float, or a None in the list, or a high value. 694 pass 695 if None in qualities: 696 raise TypeError("A quality value of None was found") 697 if max(qualities) >= 62.5: 698 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 699 BiopythonWarning) 700 # This will apply the truncation at 62, giving max ASCII 126 701 return "".join(chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) 702 for qp in qualities) 703 # Fall back on the Solexa scores... 704 try: 705 qualities = record.letter_annotations["solexa_quality"] 706 except KeyError: 707 raise ValueError("No suitable quality scores found in " 708 "letter_annotations of SeqRecord (id=%s)." 709 % record.id) 710 # Try and use the precomputed mapping: 711 try: 712 return "".join(_solexa_to_illumina_quality_str[qs] 713 for qs in qualities) 714 except KeyError: 715 # Either no PHRED scores, or something odd like a float or None 716 pass 717 if None in qualities: 718 raise TypeError("A quality value of None was found") 719 # Must do this the slow way, first converting the PHRED scores into 720 # Solexa scores: 721 if max(qualities) >= 62.5: 722 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 723 BiopythonWarning) 724 # This will apply the truncation at 62, giving max ASCII 126 725 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 726 for qs in qualities)
727 728 729 # Only map 0 to 62, we need to give a warning on truncating at 62 730 assert 62 + SOLEXA_SCORE_OFFSET == 126 731 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs + SOLEXA_SCORE_OFFSET))) 732 for qs in range(-5, 62 + 1)) 733 # Only map -5 to 62, we need to give a warning on truncating at 62 734 _phred_to_solexa_quality_str = dict( 735 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp))) + 736 SOLEXA_SCORE_OFFSET))) 737 for qp in range(0, 62 + 1)) 738 739
740 -def _get_solexa_quality_str(record):
741 """Return a Solexa FASTQ encoded quality string (PRIVATE). 742 743 Notice that due to the limited range of printable ASCII characters, a 744 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 745 file (using ASCII 126, the tilde). This function will issue a warning 746 in this situation. 747 """ 748 # TODO - This functions works and is fast, but it is also ugly 749 # and there is considerable repetition of code for the other 750 # two FASTQ variants. 751 try: 752 # These take priority (in case both Solexa and PHRED scores found) 753 qualities = record.letter_annotations["solexa_quality"] 754 except KeyError: 755 # Fall back on PHRED scores... 756 pass 757 else: 758 # Try and use the precomputed mapping: 759 try: 760 return "".join(_solexa_to_solexa_quality_str[qs] 761 for qs in qualities) 762 except KeyError: 763 # Could be a float, or a None in the list, or a high value. 764 pass 765 if None in qualities: 766 raise TypeError("A quality value of None was found") 767 if max(qualities) >= 62.5: 768 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 769 BiopythonWarning) 770 # This will apply the truncation at 62, giving max ASCII 126 771 return "".join(chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) 772 for qs in qualities) 773 # Fall back on the PHRED scores... 774 try: 775 qualities = record.letter_annotations["phred_quality"] 776 except KeyError: 777 raise ValueError("No suitable quality scores found in " 778 "letter_annotations of SeqRecord (id=%s)." 779 % record.id) 780 # Try and use the precomputed mapping: 781 try: 782 return "".join(_phred_to_solexa_quality_str[qp] 783 for qp in qualities) 784 except KeyError: 785 # Either no PHRED scores, or something odd like a float or None 786 # or too big to be in the cache 787 pass 788 if None in qualities: 789 raise TypeError("A quality value of None was found") 790 # Must do this the slow way, first converting the PHRED scores into 791 # Solexa scores: 792 if max(qualities) >= 62.5: 793 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 794 BiopythonWarning) 795 return "".join(chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) 796 for qp in qualities)
797 798 799 # TODO - Default to nucleotide or even DNA?
800 -def FastqGeneralIterator(handle):
801 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 802 803 This code does not try to interpret the quality string numerically. It 804 just returns tuples of the title, sequence and quality as strings. For 805 the sequence and quality, any whitespace (such as new lines) is removed. 806 807 Our SeqRecord based FASTQ iterators call this function internally, and then 808 turn the strings into a SeqRecord objects, mapping the quality string into 809 a list of numerical scores. If you want to do a custom quality mapping, 810 then you might consider calling this function directly. 811 812 For parsing FASTQ files, the title string from the "@" line at the start 813 of each record can optionally be omitted on the "+" lines. If it is 814 repeated, it must be identical. 815 816 The sequence string and the quality string can optionally be split over 817 multiple lines, although several sources discourage this. In comparison, 818 for the FASTA file format line breaks between 60 and 80 characters are 819 the norm. 820 821 **WARNING** - Because the "@" character can appear in the quality string, 822 this can cause problems as this is also the marker for the start of 823 a new sequence. In fact, the "+" sign can also appear as well. Some 824 sources recommended having no line breaks in the quality to avoid this, 825 but even that is not enough, consider this example:: 826 827 @071113_EAS56_0053:1:1:998:236 828 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 829 +071113_EAS56_0053:1:1:998:236 830 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 831 @071113_EAS56_0053:1:1:182:712 832 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 833 + 834 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 835 @071113_EAS56_0053:1:1:153:10 836 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 837 + 838 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 839 @071113_EAS56_0053:1:3:990:501 840 TGGGAGGTTTTATGTGGA 841 AAGCAGCAATGTACAAGA 842 + 843 IIIIIII.IIIIII1@44 844 @-7.%<&+/$/%4(++(% 845 846 This is four PHRED encoded FASTQ entries originally from an NCBI source 847 (given the read length of 36, these are probably Solexa Illumina reads where 848 the quality has been mapped onto the PHRED values). 849 850 This example has been edited to illustrate some of the nasty things allowed 851 in the FASTQ format. Firstly, on the "+" lines most but not all of the 852 (redundant) identifiers are omitted. In real files it is likely that all or 853 none of these extra identifiers will be present. 854 855 Secondly, while the first three sequences have been shown without line 856 breaks, the last has been split over multiple lines. In real files any line 857 breaks are likely to be consistent. 858 859 Thirdly, some of the quality string lines start with an "@" character. For 860 the second record this is unavoidable. However for the fourth sequence this 861 only happens because its quality string is split over two lines. A naive 862 parser could wrongly treat any line starting with an "@" as the beginning of 863 a new sequence! This code copes with this possible ambiguity by keeping 864 track of the length of the sequence which gives the expected length of the 865 quality string. 866 867 Using this tricky example file as input, this short bit of code demonstrates 868 what this parsing function would return: 869 870 >>> with open("Quality/tricky.fastq", "rU") as handle: 871 ... for (title, sequence, quality) in FastqGeneralIterator(handle): 872 ... print(title) 873 ... print("%s %s" % (sequence, quality)) 874 ... 875 071113_EAS56_0053:1:1:998:236 876 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 877 071113_EAS56_0053:1:1:182:712 878 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 879 071113_EAS56_0053:1:1:153:10 880 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 881 071113_EAS56_0053:1:3:990:501 882 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 883 884 Finally we note that some sources state that the quality string should 885 start with "!" (which using the PHRED mapping means the first letter always 886 has a quality score of zero). This rather restrictive rule is not widely 887 observed, so is therefore ignored here. One plus point about this "!" rule 888 is that (provided there are no line breaks in the quality sequence) it 889 would prevent the above problem with the "@" character. 890 """ 891 # We need to call handle.readline() at least four times per record, 892 # so we'll save a property look up each time: 893 handle_readline = handle.readline 894 895 line = handle_readline() 896 if not line: 897 return # Premature end of file, or just empty? 898 if isinstance(line[0], int): 899 raise ValueError("Is this handle in binary mode not text mode?") 900 901 while line: 902 if line[0] != "@": 903 raise ValueError( 904 "Records in Fastq files should start with '@' character") 905 title_line = line[1:].rstrip() 906 # Will now be at least one line of quality data - in most FASTQ files 907 # just one line! We therefore use string concatenation (if needed) 908 # rather using than the "".join(...) trick just in case it is multiline: 909 seq_string = handle_readline().rstrip() 910 # There may now be more sequence lines, or the "+" quality marker line: 911 while True: 912 line = handle_readline() 913 if not line: 914 raise ValueError("End of file without quality information.") 915 if line[0] == "+": 916 # The title here is optional, but if present must match! 917 second_title = line[1:].rstrip() 918 if second_title and second_title != title_line: 919 raise ValueError("Sequence and quality captions differ.") 920 break 921 seq_string += line.rstrip() # removes trailing newlines 922 # This is going to slow things down a little, but assuming 923 # this isn't allowed we should try and catch it here: 924 if " " in seq_string or "\t" in seq_string: 925 raise ValueError("Whitespace is not allowed in the sequence.") 926 seq_len = len(seq_string) 927 928 # Will now be at least one line of quality data... 929 quality_string = handle_readline().rstrip() 930 # There may now be more quality data, or another sequence, or EOF 931 while True: 932 line = handle_readline() 933 if not line: 934 break # end of file 935 if line[0] == "@": 936 # This COULD be the start of a new sequence. However, it MAY just 937 # be a line of quality data which starts with a "@" character. We 938 # should be able to check this by looking at the sequence length 939 # and the amount of quality data found so far. 940 if len(quality_string) >= seq_len: 941 # We expect it to be equal if this is the start of a new record. 942 # If the quality data is longer, we'll raise an error below. 943 break 944 # Continue - its just some (more) quality data. 945 quality_string += line.rstrip() 946 947 if seq_len != len(quality_string): 948 raise ValueError("Lengths of sequence and quality values differs " 949 " for %s (%i and %i)." 950 % (title_line, seq_len, len(quality_string))) 951 952 # Return the record and then continue... 953 yield (title_line, seq_string, quality_string)
954 955
956 -def FastqPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
957 """Iterate over FASTQ records as SeqRecord objects. 958 959 Arguments: 960 - handle - input file 961 - alphabet - optional alphabet 962 - title2ids - A function that, when given the title line from the FASTQ 963 file (without the beginning >), will return the id, name and 964 description (in that order) for the record as a tuple of strings. 965 If this is not given, then the entire title line will be used as 966 the description, and the first word as the id and name. 967 968 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 969 970 For each sequence in a (Sanger style) FASTQ file there is a matching string 971 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 972 values with an offset of 33. 973 974 For example, consider a file containing three short reads:: 975 976 @EAS54_6_R1_2_1_413_324 977 CCCTTCTTGTCTTCAGCGTTTCTCC 978 + 979 ;;3;;;;;;;;;;;;7;;;;;;;88 980 @EAS54_6_R1_2_1_540_792 981 TTGGCAGGCCAAGGCCGATGGATCA 982 + 983 ;;;;;;;;;;;7;;;;;-;;;3;83 984 @EAS54_6_R1_2_1_443_348 985 GTTGCTTCTGGCGTGGGTGGGGGGG 986 + 987 ;;;;;;;;;;;9;7;;.7;393333 988 989 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 990 string encoding the PHRED qualities using a ASCII values with an offset of 991 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 992 993 Using this module directly you might run: 994 995 >>> with open("Quality/example.fastq", "rU") as handle: 996 ... for record in FastqPhredIterator(handle): 997 ... print("%s %s" % (record.id, record.seq)) 998 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 999 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1000 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1001 1002 Typically however, you would call this via Bio.SeqIO instead with "fastq" 1003 (or "fastq-sanger") as the format: 1004 1005 >>> from Bio import SeqIO 1006 >>> with open("Quality/example.fastq", "rU") as handle: 1007 ... for record in SeqIO.parse(handle, "fastq"): 1008 ... print("%s %s" % (record.id, record.seq)) 1009 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1010 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1011 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1012 1013 If you want to look at the qualities, they are record in each record's 1014 per-letter-annotation dictionary as a simple list of integers: 1015 1016 >>> print(record.letter_annotations["phred_quality"]) 1017 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1018 1019 """ 1020 assert SANGER_SCORE_OFFSET == ord("!") 1021 # Originally, I used a list expression for each record: 1022 # 1023 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1024 # 1025 # Precomputing is faster, perhaps partly by avoiding the subtractions. 1026 q_mapping = dict() 1027 for letter in range(0, 255): 1028 q_mapping[chr(letter)] = letter - SANGER_SCORE_OFFSET 1029 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1030 if title2ids: 1031 id, name, descr = title2ids(title_line) 1032 else: 1033 descr = title_line 1034 id = descr.split()[0] 1035 name = id 1036 record = SeqRecord(Seq(seq_string, alphabet), 1037 id=id, name=name, description=descr) 1038 qualities = [q_mapping[letter] for letter in quality_string] 1039 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1040 raise ValueError("Invalid character in quality string") 1041 # For speed, will now use a dirty trick to speed up assigning the 1042 # qualities. We do this to bypass the length check imposed by the 1043 # per-letter-annotations restricted dict (as this has already been 1044 # checked by FastqGeneralIterator). This is equivalent to: 1045 # record.letter_annotations["phred_quality"] = qualities 1046 dict.__setitem__(record._per_letter_annotations, 1047 "phred_quality", qualities) 1048 yield record
1049 1050
1051 -def FastqSolexaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1052 r"""Parse old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1053 1054 The optional arguments are the same as those for the FastqPhredIterator. 1055 1056 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1057 encoding the Solexa integer qualities using ASCII values with an offset 1058 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1059 will NOT perform any automatic conversion when loading. 1060 1061 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1062 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1063 1064 For example, consider a file containing these five records:: 1065 1066 @SLXA-B3_649_FC8437_R1_1_1_610_79 1067 GATGTGCAATACCTTTGTAGAGGAA 1068 +SLXA-B3_649_FC8437_R1_1_1_610_79 1069 YYYYYYYYYYYYYYYYYYWYWYYSU 1070 @SLXA-B3_649_FC8437_R1_1_1_397_389 1071 GGTTTGAGAAAGAGAAATGAGATAA 1072 +SLXA-B3_649_FC8437_R1_1_1_397_389 1073 YYYYYYYYYWYYYYWWYYYWYWYWW 1074 @SLXA-B3_649_FC8437_R1_1_1_850_123 1075 GAGGGTGTTGATCATGATGATGGCG 1076 +SLXA-B3_649_FC8437_R1_1_1_850_123 1077 YYYYYYYYYYYYYWYYWYYSYYYSY 1078 @SLXA-B3_649_FC8437_R1_1_1_362_549 1079 GGAAACAAAGTTTTTCTCAACATAG 1080 +SLXA-B3_649_FC8437_R1_1_1_362_549 1081 YYYYYYYYYYYYYYYYYYWWWWYWY 1082 @SLXA-B3_649_FC8437_R1_1_1_183_714 1083 GTATTATTTAATGGCATACACTCAA 1084 +SLXA-B3_649_FC8437_R1_1_1_183_714 1085 YYYYYYYYYYWYYYYWYWWUWWWQQ 1086 1087 Using this module directly you might run: 1088 1089 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1090 ... for record in FastqSolexaIterator(handle): 1091 ... print("%s %s" % (record.id, record.seq)) 1092 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1093 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1094 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1095 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1096 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1097 1098 Typically however, you would call this via Bio.SeqIO instead with 1099 "fastq-solexa" as the format: 1100 1101 >>> from Bio import SeqIO 1102 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1103 ... for record in SeqIO.parse(handle, "fastq-solexa"): 1104 ... print("%s %s" % (record.id, record.seq)) 1105 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1106 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1107 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1108 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1109 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1110 1111 If you want to look at the qualities, they are recorded in each record's 1112 per-letter-annotation dictionary as a simple list of integers: 1113 1114 >>> print(record.letter_annotations["solexa_quality"]) 1115 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1116 1117 These scores aren't very good, but they are high enough that they map 1118 almost exactly onto PHRED scores: 1119 1120 >>> print("%0.2f" % phred_quality_from_solexa(25)) 1121 25.01 1122 1123 Let's look at faked example read which is even worse, where there are 1124 more noticeable differences between the Solexa and PHRED scores:: 1125 1126 @slxa_0001_1_0001_01 1127 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1128 +slxa_0001_1_0001_01 1129 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1130 1131 Again, you would typically use Bio.SeqIO to read this file in (rather than 1132 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1133 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1134 as shown above. This example has only as one entry, so instead we can 1135 use the Bio.SeqIO.read() function: 1136 1137 >>> from Bio import SeqIO 1138 >>> with open("Quality/solexa_faked.fastq", "rU") as handle: 1139 ... record = SeqIO.read(handle, "fastq-solexa") 1140 >>> print("%s %s" % (record.id, record.seq)) 1141 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1142 >>> print(record.letter_annotations["solexa_quality"]) 1143 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1144 1145 These quality scores are so low that when converted from the Solexa scheme 1146 into PHRED scores they look quite different: 1147 1148 >>> print("%0.2f" % phred_quality_from_solexa(-1)) 1149 2.54 1150 >>> print("%0.2f" % phred_quality_from_solexa(-5)) 1151 1.19 1152 1153 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1154 method to output the record(s): 1155 1156 >>> print(record.format("fastq-solexa")) 1157 @slxa_0001_1_0001_01 1158 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1159 + 1160 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1161 <BLANKLINE> 1162 1163 Note this output is slightly different from the input file as Biopython 1164 has left out the optional repetition of the sequence identifier on the "+" 1165 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1166 output format instead, and Biopython will do the conversion for you: 1167 1168 >>> print(record.format("fastq")) 1169 @slxa_0001_1_0001_01 1170 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1171 + 1172 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1173 <BLANKLINE> 1174 1175 >>> print(record.format("qual")) 1176 >slxa_0001_1_0001_01 1177 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1178 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1179 1 1 1180 <BLANKLINE> 1181 1182 As shown above, the poor quality Solexa reads have been mapped to the 1183 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1184 """ 1185 q_mapping = dict() 1186 for letter in range(0, 255): 1187 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1188 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1189 if title2ids: 1190 id, name, descr = title_line 1191 else: 1192 descr = title_line 1193 id = descr.split()[0] 1194 name = id 1195 record = SeqRecord(Seq(seq_string, alphabet), 1196 id=id, name=name, description=descr) 1197 qualities = [q_mapping[letter] for letter in quality_string] 1198 # DO NOT convert these into PHRED qualities automatically! 1199 if qualities and (min(qualities) < -5 or max(qualities) > 62): 1200 raise ValueError("Invalid character in quality string") 1201 # Dirty trick to speed up this line: 1202 # record.letter_annotations["solexa_quality"] = qualities 1203 dict.__setitem__(record._per_letter_annotations, 1204 "solexa_quality", qualities) 1205 yield record
1206 1207
1208 -def FastqIlluminaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1209 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). 1210 1211 The optional arguments are the same as those for the FastqPhredIterator. 1212 1213 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1214 encoding PHRED integer qualities using ASCII values with an offset of 64. 1215 1216 >>> from Bio import SeqIO 1217 >>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina") 1218 >>> print("%s %s" % (record.id, record.seq)) 1219 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1220 >>> max(record.letter_annotations["phred_quality"]) 1221 40 1222 >>> min(record.letter_annotations["phred_quality"]) 1223 0 1224 1225 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1226 with an ASCII offset of 64. They are approximately equal but only for high 1227 quality reads. If you have an old Solexa/Illumina file with negative 1228 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1229 1230 >>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina") 1231 Traceback (most recent call last): 1232 ... 1233 ValueError: Invalid character in quality string 1234 1235 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1236 """ 1237 q_mapping = dict() 1238 for letter in range(0, 255): 1239 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1240 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1241 if title2ids: 1242 id, name, descr = title2ids(title_line) 1243 else: 1244 descr = title_line 1245 id = descr.split()[0] 1246 name = id 1247 record = SeqRecord(Seq(seq_string, alphabet), 1248 id=id, name=name, description=descr) 1249 qualities = [q_mapping[letter] for letter in quality_string] 1250 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1251 raise ValueError("Invalid character in quality string") 1252 # Dirty trick to speed up this line: 1253 # record.letter_annotations["phred_quality"] = qualities 1254 dict.__setitem__(record._per_letter_annotations, 1255 "phred_quality", qualities) 1256 yield record
1257 1258
1259 -def QualPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1260 """For QUAL files which include PHRED quality scores, but no sequence. 1261 1262 For example, consider this short QUAL file:: 1263 1264 >EAS54_6_R1_2_1_413_324 1265 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1266 26 26 26 23 23 1267 >EAS54_6_R1_2_1_540_792 1268 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1269 26 18 26 23 18 1270 >EAS54_6_R1_2_1_443_348 1271 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1272 24 18 18 18 18 1273 1274 Using this module directly you might run: 1275 1276 >>> with open("Quality/example.qual", "rU") as handle: 1277 ... for record in QualPhredIterator(handle): 1278 ... print("%s %s" % (record.id, record.seq)) 1279 EAS54_6_R1_2_1_413_324 ????????????????????????? 1280 EAS54_6_R1_2_1_540_792 ????????????????????????? 1281 EAS54_6_R1_2_1_443_348 ????????????????????????? 1282 1283 Typically however, you would call this via Bio.SeqIO instead with "qual" 1284 as the format: 1285 1286 >>> from Bio import SeqIO 1287 >>> with open("Quality/example.qual", "rU") as handle: 1288 ... for record in SeqIO.parse(handle, "qual"): 1289 ... print("%s %s" % (record.id, record.seq)) 1290 EAS54_6_R1_2_1_413_324 ????????????????????????? 1291 EAS54_6_R1_2_1_540_792 ????????????????????????? 1292 EAS54_6_R1_2_1_443_348 ????????????????????????? 1293 1294 Becase QUAL files don't contain the sequence string itself, the seq 1295 property is set to an UnknownSeq object. As no alphabet was given, this 1296 has defaulted to a generic single letter alphabet and the character "?" 1297 used. 1298 1299 By specifying a nucleotide alphabet, "N" is used instead: 1300 1301 >>> from Bio import SeqIO 1302 >>> from Bio.Alphabet import generic_dna 1303 >>> with open("Quality/example.qual", "rU") as handle: 1304 ... for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1305 ... print("%s %s" % (record.id, record.seq)) 1306 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1307 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1308 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1309 1310 However, the quality scores themselves are available as a list of integers 1311 in each record's per-letter-annotation: 1312 1313 >>> print(record.letter_annotations["phred_quality"]) 1314 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1315 1316 You can still slice one of these SeqRecord objects with an UnknownSeq: 1317 1318 >>> sub_record = record[5:10] 1319 >>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"])) 1320 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1321 1322 As of Biopython 1.59, this parser will accept files with negatives quality 1323 scores but will replace them with the lowest possible PHRED score of zero. 1324 This will trigger a warning, previously it raised a ValueError exception. 1325 """ 1326 # Skip any text before the first record (e.g. blank lines, comments) 1327 while True: 1328 line = handle.readline() 1329 if line == "": 1330 return # Premature end of file, or just empty? 1331 if line[0] == ">": 1332 break 1333 1334 while True: 1335 if line[0] != ">": 1336 raise ValueError( 1337 "Records in Fasta files should start with '>' character") 1338 if title2ids: 1339 id, name, descr = title2ids(line[1:].rstrip()) 1340 else: 1341 descr = line[1:].rstrip() 1342 id = descr.split()[0] 1343 name = id 1344 1345 qualities = [] 1346 line = handle.readline() 1347 while True: 1348 if not line: 1349 break 1350 if line[0] == ">": 1351 break 1352 qualities.extend(int(word) for word in line.split()) 1353 line = handle.readline() 1354 1355 if qualities and min(qualities) < 0: 1356 warnings.warn(("Negative quality score %i found, " + 1357 "substituting PHRED zero instead.") 1358 % min(qualities), BiopythonParserWarning) 1359 qualities = [max(0, q) for q in qualities] 1360 1361 # Return the record and then continue... 1362 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1363 id=id, name=name, description=descr) 1364 # Dirty trick to speed up this line: 1365 # record.letter_annotations["phred_quality"] = qualities 1366 dict.__setitem__(record._per_letter_annotations, 1367 "phred_quality", qualities) 1368 yield record 1369 1370 if not line: 1371 return # StopIteration 1372 assert False, "Should not reach this line"
1373 1374
1375 -class FastqPhredWriter(SequentialSequenceWriter):
1376 """Class to write standard FASTQ format files (using PHRED quality scores) (OBSOLETE). 1377 1378 Although you can use this class directly, you are strongly encouraged 1379 to use the ``as_fastq`` function, or top level ``Bio.SeqIO.write()`` 1380 function instead via the format name "fastq" or the alias "fastq-sanger". 1381 1382 For example, this code reads in a standard Sanger style FASTQ file 1383 (using PHRED scores) and re-saves it as another Sanger style FASTQ file: 1384 1385 >>> from Bio import SeqIO 1386 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1387 >>> with open("Quality/temp.fastq", "w") as out_handle: 1388 ... SeqIO.write(record_iterator, out_handle, "fastq") 1389 3 1390 1391 You might want to do this if the original file included extra line breaks, 1392 which while valid may not be supported by all tools. The output file from 1393 Biopython will have each sequence on a single line, and each quality 1394 string on a single line (which is considered desirable for maximum 1395 compatibility). 1396 1397 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1398 quality scores) is converted into a standard Sanger style FASTQ file using 1399 PHRED qualities: 1400 1401 >>> from Bio import SeqIO 1402 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1403 >>> with open("Quality/temp.fastq", "w") as out_handle: 1404 ... SeqIO.write(record_iterator, out_handle, "fastq") 1405 5 1406 1407 This code is also called if you use the .format("fastq") method of a 1408 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1409 1410 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1411 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1412 warning is issued. 1413 1414 P.S. To avoid cluttering up your working directory, you can delete this 1415 temporary file now: 1416 1417 >>> import os 1418 >>> os.remove("Quality/temp.fastq") 1419 """ 1420 1421 assert SANGER_SCORE_OFFSET == ord("!") 1422
1423 - def write_record(self, record):
1424 """Write a single FASTQ record to the file.""" 1425 assert self._header_written 1426 assert not self._footer_written 1427 self._record_written = True 1428 # TODO - Is an empty sequence allowed in FASTQ format? 1429 if record.seq is None: 1430 raise ValueError("No sequence for record %s" % record.id) 1431 seq_str = str(record.seq) 1432 qualities_str = _get_sanger_quality_str(record) 1433 if len(qualities_str) != len(seq_str): 1434 raise ValueError("Record %s has sequence length %i but %i quality scores" 1435 % (record.id, len(seq_str), len(qualities_str))) 1436 1437 # FASTQ files can include a description, just like FASTA files 1438 # (at least, this is what the NCBI Short Read Archive does) 1439 id = self.clean(record.id) 1440 description = self.clean(record.description) 1441 if description and description.split(None, 1)[0] == id: 1442 # The description includes the id at the start 1443 title = description 1444 elif description: 1445 title = "%s %s" % (id, description) 1446 else: 1447 title = id 1448 1449 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1450 1451
1452 -def as_fastq(record):
1453 """Turn a SeqRecord into a Sanger FASTQ formated string. 1454 1455 This is used internally by the SeqRecord's .format("fastq") 1456 method and by the SeqIO.write(..., ..., "fastq") function, 1457 and under the format alias "fastq-sanger" as well. 1458 """ 1459 seq_str = _get_seq_string(record) 1460 qualities_str = _get_sanger_quality_str(record) 1461 if len(qualities_str) != len(seq_str): 1462 raise ValueError("Record %s has sequence length %i but %i quality scores" 1463 % (record.id, len(seq_str), len(qualities_str))) 1464 id = _clean(record.id) 1465 description = _clean(record.description) 1466 if description and description.split(None, 1)[0] == id: 1467 title = description 1468 elif description: 1469 title = "%s %s" % (id, description) 1470 else: 1471 title = id 1472 return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str)
1473 1474
1475 -class QualPhredWriter(SequentialSequenceWriter):
1476 """Class to write QUAL format files (using PHRED quality scores) (OBSOLETE). 1477 1478 Although you can use this class directly, you are strongly encouraged 1479 to use the ``as_qual`` function, or top level ``Bio.SeqIO.write()`` 1480 function instead. 1481 1482 For example, this code reads in a FASTQ file and saves the quality scores 1483 into a QUAL file: 1484 1485 >>> from Bio import SeqIO 1486 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1487 >>> with open("Quality/temp.qual", "w") as out_handle: 1488 ... SeqIO.write(record_iterator, out_handle, "qual") 1489 3 1490 1491 This code is also called if you use the .format("qual") method of a 1492 SeqRecord. 1493 1494 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1495 1496 >>> import os 1497 >>> os.remove("Quality/temp.qual") 1498 """ 1499
1500 - def __init__(self, handle, wrap=60, record2title=None):
1501 """Create a QUAL writer. 1502 1503 Arguments: 1504 - handle - Handle to an output file, e.g. as returned 1505 by open(filename, "w") 1506 - wrap - Optional line length used to wrap sequence lines. 1507 Defaults to wrapping the sequence at 60 characters. Use 1508 zero (or None) for no wrapping, giving a single long line 1509 for the sequence. 1510 - record2title - Optional function to return the text to be 1511 used for the title line of each record. By default a 1512 combination of the record.id and record.description is 1513 used. If the record.description starts with the record.id, 1514 then just the record.description is used. 1515 1516 The record2title argument is present for consistency with the 1517 Bio.SeqIO.FastaIO writer class. 1518 """ 1519 SequentialSequenceWriter.__init__(self, handle) 1520 # self.handle = handle 1521 self.wrap = None 1522 if wrap: 1523 if wrap < 1: 1524 raise ValueError 1525 self.wrap = wrap 1526 self.record2title = record2title
1527
1528 - def write_record(self, record):
1529 """Write a single QUAL record to the file.""" 1530 assert self._header_written 1531 assert not self._footer_written 1532 self._record_written = True 1533 1534 handle = self.handle 1535 wrap = self.wrap 1536 1537 if self.record2title: 1538 title = self.clean(self.record2title(record)) 1539 else: 1540 id = self.clean(record.id) 1541 description = self.clean(record.description) 1542 if description and description.split(None, 1)[0] == id: 1543 # The description includes the id at the start 1544 title = description 1545 elif description: 1546 title = "%s %s" % (id, description) 1547 else: 1548 title = id 1549 handle.write(">%s\n" % title) 1550 1551 qualities = _get_phred_quality(record) 1552 try: 1553 # This rounds to the nearest integer. 1554 # TODO - can we record a float in a qual file? 1555 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1556 except TypeError as e: 1557 if None in qualities: 1558 raise TypeError("A quality value of None was found") 1559 else: 1560 raise e 1561 1562 if wrap > 5: 1563 # Fast wrapping 1564 data = " ".join(qualities_strs) 1565 while True: 1566 if len(data) <= wrap: 1567 self.handle.write(data + "\n") 1568 break 1569 else: 1570 # By construction there must be spaces in the first X chars 1571 # (unless we have X digit or higher quality scores!) 1572 i = data.rfind(" ", 0, wrap) 1573 handle.write(data[:i] + "\n") 1574 data = data[i + 1:] 1575 elif wrap: 1576 # Safe wrapping 1577 while qualities_strs: 1578 line = qualities_strs.pop(0) 1579 while qualities_strs \ 1580 and len(line) + 1 + len(qualities_strs[0]) < wrap: 1581 line += " " + qualities_strs.pop(0) 1582 handle.write(line + "\n") 1583 else: 1584 # No wrapping 1585 data = " ".join(qualities_strs) 1586 handle.write(data + "\n")
1587 1588
1589 -def as_qual(record):
1590 """Turn a SeqRecord into a QUAL formated string. 1591 1592 This is used internally by the SeqRecord's .format("qual") 1593 method and by the SeqIO.write(..., ..., "qual") function. 1594 """ 1595 id = _clean(record.id) 1596 description = _clean(record.description) 1597 if description and description.split(None, 1)[0] == id: 1598 title = description 1599 elif description: 1600 title = "%s %s" % (id, description) 1601 else: 1602 title = id 1603 lines = [">%s\n" % title] 1604 1605 qualities = _get_phred_quality(record) 1606 try: 1607 # This rounds to the nearest integer. 1608 # TODO - can we record a float in a qual file? 1609 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1610 except TypeError as e: 1611 if None in qualities: 1612 raise TypeError("A quality value of None was found") 1613 else: 1614 raise e 1615 1616 # Safe wrapping 1617 while qualities_strs: 1618 line = qualities_strs.pop(0) 1619 while qualities_strs \ 1620 and len(line) + 1 + len(qualities_strs[0]) < 60: 1621 line += " " + qualities_strs.pop(0) 1622 lines.append(line + "\n") 1623 return "".join(lines)
1624 1625
1626 -class FastqSolexaWriter(SequentialSequenceWriter):
1627 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities) (OBSOLETE). 1628 1629 This outputs FASTQ files like those from the early Solexa/Illumina 1630 pipeline, using Solexa scores and an ASCII offset of 64. These are 1631 NOT compatible with the standard Sanger style PHRED FASTQ files. 1632 1633 If your records contain a "solexa_quality" entry under letter_annotations, 1634 this is used, otherwise any "phred_quality" entry will be used after 1635 conversion using the solexa_quality_from_phred function. If neither style 1636 of quality scores are present, an exception is raised. 1637 1638 Although you can use this class directly, you are strongly encouraged 1639 to use the ``as_fastq_solexa`` function, or top-level ``Bio.SeqIO.write()`` 1640 function instead. For example, this code reads in a FASTQ file and re-saves 1641 it as another FASTQ file: 1642 1643 >>> from Bio import SeqIO 1644 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1645 >>> with open("Quality/temp.fastq", "w") as out_handle: 1646 ... SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1647 5 1648 1649 You might want to do this if the original file included extra line breaks, 1650 which (while valid) may not be supported by all tools. The output file 1651 from Biopython will have each sequence on a single line, and each quality 1652 string on a single line (which is considered desirable for maximum 1653 compatibility). 1654 1655 This code is also called if you use the .format("fastq-solexa") method of 1656 a SeqRecord. For example, 1657 1658 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1659 >>> print(record.format("fastq-solexa")) 1660 @Test PHRED qualities from 40 to 0 inclusive 1661 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1662 + 1663 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1664 <BLANKLINE> 1665 1666 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1667 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1668 a warning is issued. 1669 1670 P.S. Don't forget to delete the temp file if you don't need it anymore: 1671 1672 >>> import os 1673 >>> os.remove("Quality/temp.fastq") 1674 """ 1675
1676 - def write_record(self, record):
1677 """Write a single FASTQ record to the file.""" 1678 assert self._header_written 1679 assert not self._footer_written 1680 self._record_written = True 1681 1682 # TODO - Is an empty sequence allowed in FASTQ format? 1683 if record.seq is None: 1684 raise ValueError("No sequence for record %s" % record.id) 1685 seq_str = str(record.seq) 1686 qualities_str = _get_solexa_quality_str(record) 1687 if len(qualities_str) != len(seq_str): 1688 raise ValueError("Record %s has sequence length %i but %i quality scores" 1689 % (record.id, len(seq_str), len(qualities_str))) 1690 1691 # FASTQ files can include a description, just like FASTA files 1692 # (at least, this is what the NCBI Short Read Archive does) 1693 id = self.clean(record.id) 1694 description = self.clean(record.description) 1695 if description and description.split(None, 1)[0] == id: 1696 # The description includes the id at the start 1697 title = description 1698 elif description: 1699 title = "%s %s" % (id, description) 1700 else: 1701 title = id 1702 1703 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1704 1705
1706 -def as_fastq_solexa(record):
1707 """Turn a SeqRecord into a Solexa FASTQ formated string. 1708 1709 This is used internally by the SeqRecord's .format("fastq-solexa") 1710 method and by the SeqIO.write(..., ..., "fastq-solexa") function. 1711 """ 1712 seq_str = _get_seq_string(record) 1713 qualities_str = _get_solexa_quality_str(record) 1714 if len(qualities_str) != len(seq_str): 1715 raise ValueError("Record %s has sequence length %i but %i quality scores" 1716 % (record.id, len(seq_str), len(qualities_str))) 1717 id = _clean(record.id) 1718 description = _clean(record.description) 1719 if description and description.split(None, 1)[0] == id: 1720 # The description includes the id at the start 1721 title = description 1722 elif description: 1723 title = "%s %s" % (id, description) 1724 else: 1725 title = id 1726 return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str)
1727 1728
1729 -class FastqIlluminaWriter(SequentialSequenceWriter):
1730 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores) (OBSOLETE). 1731 1732 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1733 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1734 compatible with the standard Sanger style PHRED FASTQ files which use an 1735 ASCII offset of 32. 1736 1737 Although you can use this class directly, you are strongly encouraged to 1738 use the ``as_fastq_illumina`` or top-level ``Bio.SeqIO.write()`` function 1739 with format name "fastq-illumina" instead. This code is also called if you 1740 use the .format("fastq-illumina") method of a SeqRecord. For example, 1741 1742 >>> from Bio import SeqIO 1743 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1744 >>> print(record.format("fastq-illumina")) 1745 @Test PHRED qualities from 40 to 0 inclusive 1746 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1747 + 1748 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1749 <BLANKLINE> 1750 1751 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1752 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1753 warning is issued. 1754 """ 1755
1756 - def write_record(self, record):
1757 """Write a single FASTQ record to the file.""" 1758 assert self._header_written 1759 assert not self._footer_written 1760 self._record_written = True 1761 1762 # TODO - Is an empty sequence allowed in FASTQ format? 1763 if record.seq is None: 1764 raise ValueError("No sequence for record %s" % record.id) 1765 seq_str = str(record.seq) 1766 qualities_str = _get_illumina_quality_str(record) 1767 if len(qualities_str) != len(seq_str): 1768 raise ValueError("Record %s has sequence length %i but %i quality scores" 1769 % (record.id, len(seq_str), len(qualities_str))) 1770 1771 # FASTQ files can include a description, just like FASTA files 1772 # (at least, this is what the NCBI Short Read Archive does) 1773 id = self.clean(record.id) 1774 description = self.clean(record.description) 1775 if description and description.split(None, 1)[0] == id: 1776 # The description includes the id at the start 1777 title = description 1778 elif description: 1779 title = "%s %s" % (id, description) 1780 else: 1781 title = id 1782 1783 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1784 1785
1786 -def as_fastq_illumina(record):
1787 """Turn a SeqRecord into an Illumina FASTQ formated string. 1788 1789 This is used internally by the SeqRecord's .format("fastq-illumina") 1790 method and by the SeqIO.write(..., ..., "fastq-illumina") function. 1791 """ 1792 seq_str = _get_seq_string(record) 1793 qualities_str = _get_illumina_quality_str(record) 1794 if len(qualities_str) != len(seq_str): 1795 raise ValueError("Record %s has sequence length %i but %i quality scores" 1796 % (record.id, len(seq_str), len(qualities_str))) 1797 id = _clean(record.id) 1798 description = _clean(record.description) 1799 if description and description.split(None, 1)[0] == id: 1800 title = description 1801 elif description: 1802 title = "%s %s" % (id, description) 1803 else: 1804 title = id 1805 return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str)
1806 1807
1808 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=single_letter_alphabet, title2ids=None):
1809 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1810 1811 For example, consider this short QUAL file with PHRED quality scores:: 1812 1813 >EAS54_6_R1_2_1_413_324 1814 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1815 26 26 26 23 23 1816 >EAS54_6_R1_2_1_540_792 1817 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1818 26 18 26 23 18 1819 >EAS54_6_R1_2_1_443_348 1820 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1821 24 18 18 18 18 1822 1823 And a matching FASTA file:: 1824 1825 >EAS54_6_R1_2_1_413_324 1826 CCCTTCTTGTCTTCAGCGTTTCTCC 1827 >EAS54_6_R1_2_1_540_792 1828 TTGGCAGGCCAAGGCCGATGGATCA 1829 >EAS54_6_R1_2_1_443_348 1830 GTTGCTTCTGGCGTGGGTGGGGGGG 1831 1832 You can parse these separately using Bio.SeqIO with the "qual" and 1833 "fasta" formats, but then you'll get a group of SeqRecord objects with 1834 no sequence, and a matching group with the sequence but not the 1835 qualities. Because it only deals with one input file handle, Bio.SeqIO 1836 can't be used to read the two files together - but this function can! 1837 For example, 1838 1839 >>> with open("Quality/example.fasta", "rU") as f: 1840 ... with open("Quality/example.qual", "rU") as q: 1841 ... for record in PairedFastaQualIterator(f, q): 1842 ... print("%s %s" % (record.id, record.seq)) 1843 ... 1844 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1845 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1846 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1847 1848 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1849 they are in each record's per-letter-annotation dictionary as a simple 1850 list of integers: 1851 1852 >>> print(record.letter_annotations["phred_quality"]) 1853 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1854 1855 If you have access to data as a FASTQ format file, using that directly 1856 would be simpler and more straight forward. Note that you can easily use 1857 this function to convert paired FASTA and QUAL files into FASTQ files: 1858 1859 >>> from Bio import SeqIO 1860 >>> with open("Quality/example.fasta", "rU") as f: 1861 ... with open("Quality/example.qual", "rU") as q: 1862 ... SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq") 1863 ... 1864 3 1865 1866 And don't forget to clean up the temp file if you don't need it anymore: 1867 1868 >>> import os 1869 >>> os.remove("Quality/temp.fastq") 1870 """ 1871 from Bio.SeqIO.FastaIO import FastaIterator 1872 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, 1873 title2ids=title2ids) 1874 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, 1875 title2ids=title2ids) 1876 1877 # Using (Python 3 style) zip wouldn't load everything into memory, 1878 # but also would not catch any extra records found in only one file. 1879 while True: 1880 try: 1881 f_rec = next(fasta_iter) 1882 except StopIteration: 1883 f_rec = None 1884 try: 1885 q_rec = next(qual_iter) 1886 except StopIteration: 1887 q_rec = None 1888 if f_rec is None and q_rec is None: 1889 # End of both files 1890 break 1891 if f_rec is None: 1892 raise ValueError("FASTA file has more entries than the QUAL file.") 1893 if q_rec is None: 1894 raise ValueError("QUAL file has more entries than the FASTA file.") 1895 if f_rec.id != q_rec.id: 1896 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." 1897 % (f_rec.id, q_rec.id)) 1898 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1899 raise ValueError("Sequence length and number of quality scores disagree for %s" 1900 % f_rec.id) 1901 # Merge the data.... 1902 f_rec.letter_annotations[ 1903 "phred_quality"] = q_rec.letter_annotations["phred_quality"] 1904 yield f_rec
1905 # Done 1906 1907 1908 if __name__ == "__main__": 1909 from Bio._utils import run_doctest 1910 run_doctest(verbose=0) 1911