1
2
3
4
5
6
7
8
9 """Bio.SeqIO support for the FASTQ and QUAL file formats.
10
11 Note that you are expected to use this code via the Bio.SeqIO interface, as
12 shown below.
13
14 The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
15 to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
16 90). Rather than using a single FASTQ file, often paired FASTA and QUAL files
17 are used containing the sequence and the quality information separately.
18
19 The PHRED software reads DNA sequencing trace files, calls bases, and
20 assigns a non-negative quality value to each called base using a logged
21 transformation of the error probability, Q = -10 log10( Pe ), for example::
22
23 Pe = 1.0, Q = 0
24 Pe = 0.1, Q = 10
25 Pe = 0.01, Q = 20
26 ...
27 Pe = 0.00000001, Q = 80
28 Pe = 0.000000001, Q = 90
29
30 In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.
31 In the QUAL format these quality values are held as space separated text in
32 a FASTA like file format. In the FASTQ format, each quality values is encoded
33 with a single ASCI character using chr(Q+33), meaning zero maps to the
34 character "!" and for example 80 maps to "q". For the Sanger FASTQ standard
35 the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
36 quality are then stored in pairs in a FASTA like format.
37
38 Unfortunately there is no official document describing the FASTQ file format,
39 and worse, several related but different variants exist. For more details,
40 please read this open access publication::
41
42 The Sanger FASTQ file format for sequences with quality scores, and the
43 Solexa/Illumina FASTQ variants.
44 P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby),
45 M.L.Heuer (BioJava) and P.M. Rice (EMBOSS).
46 Nucleic Acids Research 2010 38(6):1767-1771
47 http://dx.doi.org/10.1093/nar/gkp1137
48
49 The good news is that Roche 454 sequencers can output files in the QUAL format,
50 and sensibly they use PHREP style scores like Sanger. Converting a pair of
51 FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL
52 files from a Roche 454 SFF binary file, use the Roche off instrument command
53 line tool "sffinfo" with the -q or -qual argument. You can extract a matching
54 FASTA file using the -s or -seq argument instead.
55
56 The bad news is that Solexa/Illumina did things differently - they have their
57 own scoring system AND their own incompatible versions of the FASTQ format.
58 Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
59 be negative. PHRED scores and Solexa scores are NOT interchangeable (but a
60 reasonable mapping can be achieved between them, and they are approximately
61 equal for higher quality reads).
62
63 Confusingly early Solexa pipelines produced a FASTQ like file but using their
64 own score mapping and an ASCII offset of 64. To make things worse, for the
65 Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the
66 FASTQ file format, this time using PHRED scores (which is more consistent) but
67 with an ASCII offset of 64.
68
69 i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ
70 file format: The original Sanger PHRED standard, and two from Solexa/Illumina.
71
72 The good news is that as of CASAVA version 1.8, Illumina sequencers will
73 produce FASTQ files using the standard Sanger encoding.
74
75 You are expected to use this module via the Bio.SeqIO functions, with the
76 following format names:
77
78 - "qual" means simple quality files using PHRED scores (e.g. from Roche 454)
79 - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII
80 offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+).
81 These can potentially hold PHRED scores from 0 to 93.
82 - "fastq-sanger" is an alias for "fastq".
83 - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ
84 files, using Solexa scores with an ASCII offset 64. These can hold Solexa
85 scores from -5 to 62.
86 - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using
87 PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0
88 to 62.
89
90 We could potentially add support for "qual-solexa" meaning QUAL files which
91 contain Solexa scores, but thus far there isn't any reason to use such files.
92
93 For example, consider the following short FASTQ file::
94
95 @EAS54_6_R1_2_1_413_324
96 CCCTTCTTGTCTTCAGCGTTTCTCC
97 +
98 ;;3;;;;;;;;;;;;7;;;;;;;88
99 @EAS54_6_R1_2_1_540_792
100 TTGGCAGGCCAAGGCCGATGGATCA
101 +
102 ;;;;;;;;;;;7;;;;;-;;;3;83
103 @EAS54_6_R1_2_1_443_348
104 GTTGCTTCTGGCGTGGGTGGGGGGG
105 +
106 ;;;;;;;;;;;9;7;;.7;393333
107
108 This contains three reads of length 25. From the read length these were
109 probably originally from an early Solexa/Illumina sequencer but this file
110 follows the Sanger FASTQ convention (PHRED style qualities with an ASCII
111 offet of 33). This means we can parse this file using Bio.SeqIO using
112 "fastq" as the format name:
113
114 >>> from Bio import SeqIO
115 >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"):
116 ... print record.id, record.seq
117 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
118 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
119 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
120
121 The qualities are held as a list of integers in each record's annotation:
122
123 >>> print record
124 ID: EAS54_6_R1_2_1_443_348
125 Name: EAS54_6_R1_2_1_443_348
126 Description: EAS54_6_R1_2_1_443_348
127 Number of features: 0
128 Per letter annotation for: phred_quality
129 Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
130 >>> print record.letter_annotations["phred_quality"]
131 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
132
133 You can use the SeqRecord format method to show this in the QUAL format:
134
135 >>> print record.format("qual")
136 >EAS54_6_R1_2_1_443_348
137 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
138 24 18 18 18 18
139 <BLANKLINE>
140
141 Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"):
142
143 >>> print record.format("fastq")
144 @EAS54_6_R1_2_1_443_348
145 GTTGCTTCTGGCGTGGGTGGGGGGG
146 +
147 ;;;;;;;;;;;9;7;;.7;393333
148 <BLANKLINE>
149
150 Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset
151 of 64):
152
153 >>> print record.format("fastq-illumina")
154 @EAS54_6_R1_2_1_443_348
155 GTTGCTTCTGGCGTGGGTGGGGGGG
156 +
157 ZZZZZZZZZZZXZVZZMVZRXRRRR
158 <BLANKLINE>
159
160 You can also get Biopython to convert the scores and show a Solexa style
161 FASTQ file:
162
163 >>> print record.format("fastq-solexa")
164 @EAS54_6_R1_2_1_443_348
165 GTTGCTTCTGGCGTGGGTGGGGGGG
166 +
167 ZZZZZZZZZZZXZVZZMVZRXRRRR
168 <BLANKLINE>
169
170 Notice that this is actually the same output as above using "fastq-illumina"
171 as the format! The reason for this is all these scores are high enough that
172 the PHRED and Solexa scores are almost equal. The differences become apparent
173 for poor quality reads. See the functions solexa_quality_from_phred and
174 phred_quality_from_solexa for more details.
175
176 If you wanted to trim your sequences (perhaps to remove low quality regions,
177 or to remove a primer sequence), try slicing the SeqRecord objects. e.g.
178
179 >>> sub_rec = record[5:15]
180 >>> print sub_rec
181 ID: EAS54_6_R1_2_1_443_348
182 Name: EAS54_6_R1_2_1_443_348
183 Description: EAS54_6_R1_2_1_443_348
184 Number of features: 0
185 Per letter annotation for: phred_quality
186 Seq('TTCTGGCGTG', SingleLetterAlphabet())
187 >>> print sub_rec.letter_annotations["phred_quality"]
188 [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
189 >>> print sub_rec.format("fastq")
190 @EAS54_6_R1_2_1_443_348
191 TTCTGGCGTG
192 +
193 ;;;;;;9;7;
194 <BLANKLINE>
195
196 If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
197
198 >>> from Bio import SeqIO
199 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
200 >>> out_handle = open("Quality/temp.qual", "w")
201 >>> SeqIO.write(record_iterator, out_handle, "qual")
202 3
203 >>> out_handle.close()
204
205 You can of course read in a QUAL file, such as the one we just created:
206
207 >>> from Bio import SeqIO
208 >>> for record in SeqIO.parse("Quality/temp.qual", "qual"):
209 ... print record.id, record.seq
210 EAS54_6_R1_2_1_413_324 ?????????????????????????
211 EAS54_6_R1_2_1_540_792 ?????????????????????????
212 EAS54_6_R1_2_1_443_348 ?????????????????????????
213
214 Notice that QUAL files don't have a proper sequence present! But the quality
215 information is there:
216
217 >>> print record
218 ID: EAS54_6_R1_2_1_443_348
219 Name: EAS54_6_R1_2_1_443_348
220 Description: EAS54_6_R1_2_1_443_348
221 Number of features: 0
222 Per letter annotation for: phred_quality
223 UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
224 >>> print record.letter_annotations["phred_quality"]
225 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
226
227 Just to keep things tidy, if you are following this example yourself, you can
228 delete this temporary file now:
229
230 >>> import os
231 >>> os.remove("Quality/temp.qual")
232
233 Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
234 files. Because the Bio.SeqIO system is designed for reading single files, you
235 would have to read the two in separately and then combine the data. However,
236 since this is such a common thing to want to do, there is a helper iterator
237 defined in this module that does this for you - PairedFastaQualIterator.
238
239 Alternatively, if you have enough RAM to hold all the records in memory at once,
240 then a simple dictionary approach would work:
241
242 >>> from Bio import SeqIO
243 >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta"))
244 >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"):
245 ... reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
246
247 You can then access any record by its key, and get both the sequence and the
248 quality scores.
249
250 >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq")
251 @EAS54_6_R1_2_1_540_792
252 TTGGCAGGCCAAGGCCGATGGATCA
253 +
254 ;;;;;;;;;;;7;;;;;-;;;3;83
255 <BLANKLINE>
256
257 It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
258 using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values,
259 "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina"
260 for the more recent variant), as this cannot be detected reliably
261 automatically.
262
263 To illustrate this problem, let's consider an artifical example:
264
265 >>> from Bio.Seq import Seq
266 >>> from Bio.Alphabet import generic_dna
267 >>> from Bio.SeqRecord import SeqRecord
268 >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test",
269 ... description="Made up!")
270 >>> print test.format("fasta")
271 >Test Made up!
272 NACGTACGTA
273 <BLANKLINE>
274 >>> print test.format("fastq")
275 Traceback (most recent call last):
276 ...
277 ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).
278
279 We created a sample SeqRecord, and can show it in FASTA format - but for QUAL
280 or FASTQ format we need to provide some quality scores. These are held as a
281 list of integers (one for each base) in the letter_annotations dictionary:
282
283 >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40]
284 >>> print test.format("qual")
285 >Test Made up!
286 0 1 2 3 4 5 10 20 30 40
287 <BLANKLINE>
288 >>> print test.format("fastq")
289 @Test Made up!
290 NACGTACGTA
291 +
292 !"#$%&+5?I
293 <BLANKLINE>
294
295 We can check this FASTQ encoding - the first PHRED quality was zero, and this
296 mapped to a exclamation mark, while the final score was 40 and this mapped to
297 the letter "I":
298
299 >>> ord('!') - 33
300 0
301 >>> ord('I') - 33
302 40
303 >>> [ord(letter)-33 for letter in '!"#$%&+5?I']
304 [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
305
306 Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED
307 scores with an offset of 64:
308
309 >>> print test.format("fastq-illumina")
310 @Test Made up!
311 NACGTACGTA
312 +
313 @ABCDEJT^h
314 <BLANKLINE>
315
316 And we can check this too - the first PHRED score was zero, and this mapped to
317 "@", while the final score was 40 and this mapped to "h":
318
319 >>> ord("@") - 64
320 0
321 >>> ord("h") - 64
322 40
323 >>> [ord(letter)-64 for letter in "@ABCDEJT^h"]
324 [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
325
326 Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style
327 FASTQ files look for the same data! Then we have the older Solexa/Illumina
328 format to consider which encodes Solexa scores instead of PHRED scores.
329
330 First let's see what Biopython says if we convert the PHRED scores into Solexa
331 scores (rounding to one decimal place):
332
333 >>> for q in [0,1,2,3,4,5,10,20,30,40]:
334 ... print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))
335 PHRED 0 maps to Solexa -5.0
336 PHRED 1 maps to Solexa -5.0
337 PHRED 2 maps to Solexa -2.3
338 PHRED 3 maps to Solexa -0.0
339 PHRED 4 maps to Solexa 1.8
340 PHRED 5 maps to Solexa 3.3
341 PHRED 10 maps to Solexa 9.5
342 PHRED 20 maps to Solexa 20.0
343 PHRED 30 maps to Solexa 30.0
344 PHRED 40 maps to Solexa 40.0
345
346 Now here is the record using the old Solexa style FASTQ file:
347
348 >>> print test.format("fastq-solexa")
349 @Test Made up!
350 NACGTACGTA
351 +
352 ;;>@BCJT^h
353 <BLANKLINE>
354
355 Again, this is using an ASCII offset of 64, so we can check the Solexa scores:
356
357 >>> [ord(letter)-64 for letter in ";;>@BCJT^h"]
358 [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]
359
360 This explains why the last few letters of this FASTQ output matched that using
361 the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores
362 are approximately equal.
363
364 """
365 __docformat__ = "epytext en"
366
367 from Bio.Alphabet import single_letter_alphabet
368 from Bio.Seq import Seq, UnknownSeq
369 from Bio.SeqRecord import SeqRecord
370 from Bio.SeqIO.Interfaces import SequentialSequenceWriter
371 from math import log
372 import warnings
373 from Bio import BiopythonWarning, BiopythonParserWarning
374
375
376
377
378 SANGER_SCORE_OFFSET = 33
379 SOLEXA_SCORE_OFFSET = 64
380
381
383 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
384
385 PHRED and Solexa quality scores are both log transformations of a
386 probality of error (high score = low probability of error). This function
387 takes a PHRED score, transforms it back to a probability of error, and
388 then re-expresses it as a Solexa score. This assumes the error estimates
389 are equivalent.
390
391 How does this work exactly? Well the PHRED quality is minus ten times the
392 base ten logarithm of the probability of error::
393
394 phred_quality = -10*log(error,10)
395
396 Therefore, turning this round::
397
398 error = 10 ** (- phred_quality / 10)
399
400 Now, Solexa qualities use a different log transformation::
401
402 solexa_quality = -10*log(error/(1-error),10)
403
404 After substitution and a little manipulation we get::
405
406 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)
407
408 However, real Solexa files use a minimum quality of -5. This does have a
409 good reason - a random base call would be correct 25% of the time,
410 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
411 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
412 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.
413
414 Taken literally, this logarithic formula would map a PHRED quality of zero
415 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
416 score of zero means a probability of error of one (i.e. the base call is
417 definitely wrong), which is worse than random! In practice, a PHRED quality
418 of zero usually means a default value, or perhaps random - and therefore
419 mapping it to the minimum Solexa score of -5 is reasonable.
420
421 In conclusion, we follow EMBOSS, and take this logarithmic formula but also
422 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
423 quality of zero to -5.0 as well.
424
425 Note this function will return a floating point number, it is up to you to
426 round this to the nearest integer if appropriate. e.g.
427
428 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
429 80.00
430 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
431 50.00
432 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
433 19.96
434 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
435 9.54
436 >>> print "%0.2f" % round(solexa_quality_from_phred(5),2)
437 3.35
438 >>> print "%0.2f" % round(solexa_quality_from_phred(4),2)
439 1.80
440 >>> print "%0.2f" % round(solexa_quality_from_phred(3),2)
441 -0.02
442 >>> print "%0.2f" % round(solexa_quality_from_phred(2),2)
443 -2.33
444 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
445 -5.00
446 >>> print "%0.2f" % round(solexa_quality_from_phred(0),2)
447 -5.00
448
449 Notice that for high quality reads PHRED and Solexa scores are numerically
450 equal. The differences are important for poor quality reads, where PHRED
451 has a minimum of zero but Solexa scores can be negative.
452
453 Finally, as a special case where None is used for a "missing value", None
454 is returned:
455
456 >>> print solexa_quality_from_phred(None)
457 None
458 """
459 if phred_quality is None:
460
461
462 return None
463 elif phred_quality > 0:
464
465
466 return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10))
467 elif phred_quality == 0:
468
469 return -5.0
470 else:
471 raise ValueError("PHRED qualities must be positive (or zero), not %s"
472 % repr(phred_quality))
473
474
476 """Convert a Solexa quality (which can be negative) to a PHRED quality.
477
478 PHRED and Solexa quality scores are both log transformations of a
479 probality of error (high score = low probability of error). This function
480 takes a Solexa score, transforms it back to a probability of error, and
481 then re-expresses it as a PHRED score. This assumes the error estimates
482 are equivalent.
483
484 The underlying formulas are given in the documentation for the sister
485 function solexa_quality_from_phred, in this case the operation is::
486
487 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)
488
489 This will return a floating point number, it is up to you to round this to
490 the nearest integer if appropriate. e.g.
491
492 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
493 80.00
494 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
495 20.04
496 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
497 10.41
498 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
499 3.01
500 >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2)
501 1.19
502
503 Note that a solexa_quality less then -5 is not expected, will trigger a
504 warning, but will still be converted as per the logarithmic mapping
505 (giving a number between 0 and 1.19 back).
506
507 As a special case where None is used for a "missing value", None is
508 returned:
509
510 >>> print phred_quality_from_solexa(None)
511 None
512 """
513 if solexa_quality is None:
514
515 return None
516 if solexa_quality < -5:
517 warnings.warn("Solexa quality less than -5 passed, %s"
518 % repr(solexa_quality), BiopythonWarning)
519 return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
520
521
523 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
524
525 If there are no PHRED qualities, but there are Solexa qualities, those are
526 used instead after conversion.
527 """
528 try:
529 return record.letter_annotations["phred_quality"]
530 except KeyError:
531 pass
532 try:
533 return [phred_quality_from_solexa(q) for
534 q in record.letter_annotations["solexa_quality"]]
535 except KeyError:
536 raise ValueError("No suitable quality scores found in "
537 "letter_annotations of SeqRecord (id=%s)."
538 % record.id)
539
540
541 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp + SANGER_SCORE_OFFSET)))
542 for qp in range(0, 93 + 1))
543
544 _solexa_to_sanger_quality_str = dict(
545 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs))) +
546 SANGER_SCORE_OFFSET)))
547 for qs in range(-5, 93 + 1))
548
549
551 """Returns a Sanger FASTQ encoded quality string (PRIVATE).
552
553 >>> from Bio.Seq import Seq
554 >>> from Bio.SeqRecord import SeqRecord
555 >>> r = SeqRecord(Seq("ACGTAN"), id="Test",
556 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0]})
557 >>> _get_sanger_quality_str(r)
558 'SI?5+!'
559
560 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO),
561 the PHRED qualities are integers, this function is able to use a very fast
562 pre-cached mapping. However, if they are floats which differ slightly, then
563 it has to do the appropriate rounding - which is slower:
564
565 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2",
566 ... letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]})
567 >>> _get_sanger_quality_str(r2)
568 'SI?5+!'
569
570 If your scores include a None value, this raises an exception:
571
572 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3",
573 ... letter_annotations = {"phred_quality":[50,40,30,20,10,None]})
574 >>> _get_sanger_quality_str(r3)
575 Traceback (most recent call last):
576 ...
577 TypeError: A quality value of None was found
578
579 If (strangely) your record has both PHRED and Solexa scores, then the PHRED
580 scores are used in preference:
581
582 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4",
583 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0],
584 ... "solexa_quality":[-5,-4,0,None,0,40]})
585 >>> _get_sanger_quality_str(r4)
586 'SI?5+!'
587
588 If there are no PHRED scores, but there are Solexa scores, these are used
589 instead (after the approriate conversion):
590
591 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5",
592 ... letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]})
593 >>> _get_sanger_quality_str(r5)
594 'I?5+$"'
595
596 Again, integer Solexa scores can be looked up in a pre-cached mapping making
597 this very fast. You can still use approximate floating point scores:
598
599 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6",
600 ... letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]})
601 >>> _get_sanger_quality_str(r6)
602 'I?5+$"'
603
604 Notice that due to the limited range of printable ASCII characters, a
605 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ
606 file (using ASCII 126, the tilde). This function will issue a warning
607 in this situation.
608 """
609
610
611
612 try:
613
614 qualities = record.letter_annotations["phred_quality"]
615 except KeyError:
616
617 pass
618 else:
619
620 try:
621 return "".join([_phred_to_sanger_quality_str[qp]
622 for qp in qualities])
623 except KeyError:
624
625 pass
626 if None in qualities:
627 raise TypeError("A quality value of None was found")
628 if max(qualities) >= 93.5:
629 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ",
630 BiopythonWarning)
631
632 return "".join([chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET))
633 for qp in qualities])
634
635 try:
636 qualities = record.letter_annotations["solexa_quality"]
637 except KeyError:
638 raise ValueError("No suitable quality scores found in "
639 "letter_annotations of SeqRecord (id=%s)."
640 % record.id)
641
642 try:
643 return "".join([_solexa_to_sanger_quality_str[qs]
644 for qs in qualities])
645 except KeyError:
646
647 pass
648 if None in qualities:
649 raise TypeError("A quality value of None was found")
650
651
652 if max(qualities) >= 93.5:
653 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ",
654 BiopythonWarning)
655
656 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET))
657 for qs in qualities])
658
659
660 assert 62 + SOLEXA_SCORE_OFFSET == 126
661 _phred_to_illumina_quality_str = dict((qp, chr(qp + SOLEXA_SCORE_OFFSET))
662 for qp in range(0, 62 + 1))
663
664 _solexa_to_illumina_quality_str = dict(
665 (qs, chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET))
666 for qs in range(-5, 62 + 1))
667
668
670 """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE).
671
672 Notice that due to the limited range of printable ASCII characters, a
673 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ
674 file (using ASCII 126, the tilde). This function will issue a warning
675 in this situation.
676 """
677
678
679
680 try:
681
682 qualities = record.letter_annotations["phred_quality"]
683 except KeyError:
684
685 pass
686 else:
687
688 try:
689 return "".join([_phred_to_illumina_quality_str[qp]
690 for qp in qualities])
691 except KeyError:
692
693 pass
694 if None in qualities:
695 raise TypeError("A quality value of None was found")
696 if max(qualities) >= 62.5:
697 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ",
698 BiopythonWarning)
699
700 return "".join([chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET))
701 for qp in qualities])
702
703 try:
704 qualities = record.letter_annotations["solexa_quality"]
705 except KeyError:
706 raise ValueError("No suitable quality scores found in "
707 "letter_annotations of SeqRecord (id=%s)."
708 % record.id)
709
710 try:
711 return "".join([_solexa_to_illumina_quality_str[qs]
712 for qs in qualities])
713 except KeyError:
714
715 pass
716 if None in qualities:
717 raise TypeError("A quality value of None was found")
718
719
720 if max(qualities) >= 62.5:
721 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ",
722 BiopythonWarning)
723
724 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET))
725 for qs in qualities])
726
727
728 assert 62 + SOLEXA_SCORE_OFFSET == 126
729 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs + SOLEXA_SCORE_OFFSET)))
730 for qs in range(-5, 62 + 1))
731
732 _phred_to_solexa_quality_str = dict(
733 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp))) +
734 SOLEXA_SCORE_OFFSET)))
735 for qp in range(0, 62 + 1))
736
737
739 """Returns a Solexa FASTQ encoded quality string (PRIVATE).
740
741 Notice that due to the limited range of printable ASCII characters, a
742 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ
743 file (using ASCII 126, the tilde). This function will issue a warning
744 in this situation.
745 """
746
747
748
749 try:
750
751 qualities = record.letter_annotations["solexa_quality"]
752 except KeyError:
753
754 pass
755 else:
756
757 try:
758 return "".join([_solexa_to_solexa_quality_str[qs]
759 for qs in qualities])
760 except KeyError:
761
762 pass
763 if None in qualities:
764 raise TypeError("A quality value of None was found")
765 if max(qualities) >= 62.5:
766 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ",
767 BiopythonWarning)
768
769 return "".join([chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET))
770 for qs in qualities])
771
772 try:
773 qualities = record.letter_annotations["phred_quality"]
774 except KeyError:
775 raise ValueError("No suitable quality scores found in "
776 "letter_annotations of SeqRecord (id=%s)."
777 % record.id)
778
779 try:
780 return "".join([_phred_to_solexa_quality_str[qp]
781 for qp in qualities])
782 except KeyError:
783
784
785 pass
786 if None in qualities:
787 raise TypeError("A quality value of None was found")
788
789
790 if max(qualities) >= 62.5:
791 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ",
792 BiopythonWarning)
793 return "".join([chr(min(126,
794 int(round(solexa_quality_from_phred(qp))) +
795 SOLEXA_SCORE_OFFSET))
796 for qp in qualities])
797
798
799
801 """Iterate over Fastq records as string tuples (not as SeqRecord objects).
802
803 This code does not try to interpret the quality string numerically. It
804 just returns tuples of the title, sequence and quality as strings. For
805 the sequence and quality, any whitespace (such as new lines) is removed.
806
807 Our SeqRecord based FASTQ iterators call this function internally, and then
808 turn the strings into a SeqRecord objects, mapping the quality string into
809 a list of numerical scores. If you want to do a custom quality mapping,
810 then you might consider calling this function directly.
811
812 For parsing FASTQ files, the title string from the "@" line at the start
813 of each record can optionally be omitted on the "+" lines. If it is
814 repeated, it must be identical.
815
816 The sequence string and the quality string can optionally be split over
817 multiple lines, although several sources discourage this. In comparison,
818 for the FASTA file format line breaks between 60 and 80 characters are
819 the norm.
820
821 WARNING - Because the "@" character can appear in the quality string,
822 this can cause problems as this is also the marker for the start of
823 a new sequence. In fact, the "+" sign can also appear as well. Some
824 sources recommended having no line breaks in the quality to avoid this,
825 but even that is not enough, consider this example::
826
827 @071113_EAS56_0053:1:1:998:236
828 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
829 +071113_EAS56_0053:1:1:998:236
830 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
831 @071113_EAS56_0053:1:1:182:712
832 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
833 +
834 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
835 @071113_EAS56_0053:1:1:153:10
836 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
837 +
838 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
839 @071113_EAS56_0053:1:3:990:501
840 TGGGAGGTTTTATGTGGA
841 AAGCAGCAATGTACAAGA
842 +
843 IIIIIII.IIIIII1@44
844 @-7.%<&+/$/%4(++(%
845
846 This is four PHRED encoded FASTQ entries originally from an NCBI source
847 (given the read length of 36, these are probably Solexa Illumna reads where
848 the quality has been mapped onto the PHRED values).
849
850 This example has been edited to illustrate some of the nasty things allowed
851 in the FASTQ format. Firstly, on the "+" lines most but not all of the
852 (redundant) identifiers are omitted. In real files it is likely that all or
853 none of these extra identifiers will be present.
854
855 Secondly, while the first three sequences have been shown without line
856 breaks, the last has been split over multiple lines. In real files any line
857 breaks are likely to be consistent.
858
859 Thirdly, some of the quality string lines start with an "@" character. For
860 the second record this is unavoidable. However for the fourth sequence this
861 only happens because its quality string is split over two lines. A naive
862 parser could wrongly treat any line starting with an "@" as the beginning of
863 a new sequence! This code copes with this possible ambiguity by keeping
864 track of the length of the sequence which gives the expected length of the
865 quality string.
866
867 Using this tricky example file as input, this short bit of code demonstrates
868 what this parsing function would return:
869
870 >>> handle = open("Quality/tricky.fastq", "rU")
871 >>> for (title, sequence, quality) in FastqGeneralIterator(handle):
872 ... print title
873 ... print sequence, quality
874 071113_EAS56_0053:1:1:998:236
875 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
876 071113_EAS56_0053:1:1:182:712
877 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
878 071113_EAS56_0053:1:1:153:10
879 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
880 071113_EAS56_0053:1:3:990:501
881 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
882 >>> handle.close()
883
884 Finally we note that some sources state that the quality string should
885 start with "!" (which using the PHRED mapping means the first letter always
886 has a quality score of zero). This rather restrictive rule is not widely
887 observed, so is therefore ignored here. One plus point about this "!" rule
888 is that (provided there are no line breaks in the quality sequence) it
889 would prevent the above problem with the "@" character.
890 """
891
892
893 handle_readline = handle.readline
894
895
896 while True:
897 line = handle_readline()
898 if not line:
899 return
900 if line[0] == "@":
901 break
902 if isinstance(line[0], int):
903 raise ValueError("Is this handle in binary mode not text mode?")
904
905 while line:
906 if line[0] != "@":
907 raise ValueError(
908 "Records in Fastq files should start with '@' character")
909 title_line = line[1:].rstrip()
910
911
912
913 seq_string = handle_readline().rstrip()
914
915 while True:
916 line = handle_readline()
917 if not line:
918 raise ValueError("End of file without quality information.")
919 if line[0] == "+":
920
921 second_title = line[1:].rstrip()
922 if second_title and second_title != title_line:
923 raise ValueError("Sequence and quality captions differ.")
924 break
925 seq_string += line.rstrip()
926
927
928 if " " in seq_string or "\t" in seq_string:
929 raise ValueError("Whitespace is not allowed in the sequence.")
930 seq_len = len(seq_string)
931
932
933 quality_string = handle_readline().rstrip()
934
935 while True:
936 line = handle_readline()
937 if not line:
938 break
939 if line[0] == "@":
940
941
942
943
944 if len(quality_string) >= seq_len:
945
946
947 break
948
949 quality_string += line.rstrip()
950
951 if seq_len != len(quality_string):
952 raise ValueError("Lengths of sequence and quality values differs "
953 " for %s (%i and %i)."
954 % (title_line, seq_len, len(quality_string)))
955
956
957 yield (title_line, seq_string, quality_string)
958 raise StopIteration
959
960
962 """Generator function to iterate over FASTQ records (as SeqRecord objects).
963
964 - handle - input file
965 - alphabet - optional alphabet
966 - title2ids - A function that, when given the title line from the FASTQ
967 file (without the beginning >), will return the id, name and
968 description (in that order) for the record as a tuple of
969 strings. If this is not given, then the entire title line
970 will be used as the description, and the first word as the
971 id and name.
972
973 Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
974
975 For each sequence in a (Sanger style) FASTQ file there is a matching string
976 encoding the PHRED qualities (integers between 0 and about 90) using ASCII
977 values with an offset of 33.
978
979 For example, consider a file containing three short reads::
980
981 @EAS54_6_R1_2_1_413_324
982 CCCTTCTTGTCTTCAGCGTTTCTCC
983 +
984 ;;3;;;;;;;;;;;;7;;;;;;;88
985 @EAS54_6_R1_2_1_540_792
986 TTGGCAGGCCAAGGCCGATGGATCA
987 +
988 ;;;;;;;;;;;7;;;;;-;;;3;83
989 @EAS54_6_R1_2_1_443_348
990 GTTGCTTCTGGCGTGGGTGGGGGGG
991 +
992 ;;;;;;;;;;;9;7;;.7;393333
993
994 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
995 string encoding the PHRED qualities using a ASCI values with an offset of
996 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
997
998 Using this module directly you might run:
999
1000 >>> handle = open("Quality/example.fastq", "rU")
1001 >>> for record in FastqPhredIterator(handle):
1002 ... print record.id, record.seq
1003 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1004 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1005 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1006 >>> handle.close()
1007
1008 Typically however, you would call this via Bio.SeqIO instead with "fastq"
1009 (or "fastq-sanger") as the format:
1010
1011 >>> from Bio import SeqIO
1012 >>> handle = open("Quality/example.fastq", "rU")
1013 >>> for record in SeqIO.parse(handle, "fastq"):
1014 ... print record.id, record.seq
1015 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1016 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1017 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1018 >>> handle.close()
1019
1020 If you want to look at the qualities, they are record in each record's
1021 per-letter-annotation dictionary as a simple list of integers:
1022
1023 >>> print record.letter_annotations["phred_quality"]
1024 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1025
1026 """
1027 assert SANGER_SCORE_OFFSET == ord("!")
1028
1029
1030
1031
1032
1033 q_mapping = dict()
1034 for letter in range(0, 255):
1035 q_mapping[chr(letter)] = letter - SANGER_SCORE_OFFSET
1036 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1037 if title2ids:
1038 id, name, descr = title2ids(title_line)
1039 else:
1040 descr = title_line
1041 id = descr.split()[0]
1042 name = id
1043 record = SeqRecord(Seq(seq_string, alphabet),
1044 id=id, name=name, description=descr)
1045 qualities = [q_mapping[letter] for letter in quality_string]
1046 if qualities and (min(qualities) < 0 or max(qualities) > 93):
1047 raise ValueError("Invalid character in quality string")
1048
1049
1050
1051
1052
1053 dict.__setitem__(record._per_letter_annotations,
1054 "phred_quality", qualities)
1055 yield record
1056
1057
1059 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).
1060
1061 The optional arguments are the same as those for the FastqPhredIterator.
1062
1063 For each sequence in Solexa/Illumina FASTQ files there is a matching string
1064 encoding the Solexa integer qualities using ASCII values with an offset
1065 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython
1066 will NOT perform any automatic conversion when loading.
1067
1068 NOTE - This file format is used by the OLD versions of the Solexa/Illumina
1069 pipeline. See also the FastqIlluminaIterator function for the NEW version.
1070
1071 For example, consider a file containing these five records::
1072
1073 @SLXA-B3_649_FC8437_R1_1_1_610_79
1074 GATGTGCAATACCTTTGTAGAGGAA
1075 +SLXA-B3_649_FC8437_R1_1_1_610_79
1076 YYYYYYYYYYYYYYYYYYWYWYYSU
1077 @SLXA-B3_649_FC8437_R1_1_1_397_389
1078 GGTTTGAGAAAGAGAAATGAGATAA
1079 +SLXA-B3_649_FC8437_R1_1_1_397_389
1080 YYYYYYYYYWYYYYWWYYYWYWYWW
1081 @SLXA-B3_649_FC8437_R1_1_1_850_123
1082 GAGGGTGTTGATCATGATGATGGCG
1083 +SLXA-B3_649_FC8437_R1_1_1_850_123
1084 YYYYYYYYYYYYYWYYWYYSYYYSY
1085 @SLXA-B3_649_FC8437_R1_1_1_362_549
1086 GGAAACAAAGTTTTTCTCAACATAG
1087 +SLXA-B3_649_FC8437_R1_1_1_362_549
1088 YYYYYYYYYYYYYYYYYYWWWWYWY
1089 @SLXA-B3_649_FC8437_R1_1_1_183_714
1090 GTATTATTTAATGGCATACACTCAA
1091 +SLXA-B3_649_FC8437_R1_1_1_183_714
1092 YYYYYYYYYYWYYYYWYWWUWWWQQ
1093
1094 Using this module directly you might run:
1095
1096 >>> handle = open("Quality/solexa_example.fastq", "rU")
1097 >>> for record in FastqSolexaIterator(handle):
1098 ... print record.id, record.seq
1099 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1100 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1101 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1102 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1103 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1104 >>> handle.close()
1105
1106 Typically however, you would call this via Bio.SeqIO instead with
1107 "fastq-solexa" as the format:
1108
1109 >>> from Bio import SeqIO
1110 >>> handle = open("Quality/solexa_example.fastq", "rU")
1111 >>> for record in SeqIO.parse(handle, "fastq-solexa"):
1112 ... print record.id, record.seq
1113 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1114 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1115 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1116 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1117 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1118 >>> handle.close()
1119
1120 If you want to look at the qualities, they are recorded in each record's
1121 per-letter-annotation dictionary as a simple list of integers:
1122
1123 >>> print record.letter_annotations["solexa_quality"]
1124 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
1125
1126 These scores aren't very good, but they are high enough that they map
1127 almost exactly onto PHRED scores:
1128
1129 >>> print "%0.2f" % phred_quality_from_solexa(25)
1130 25.01
1131
1132 Let's look at faked example read which is even worse, where there are
1133 more noticeable differences between the Solexa and PHRED scores::
1134
1135 @slxa_0001_1_0001_01
1136 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1137 +slxa_0001_1_0001_01
1138 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1139
1140 Again, you would typically use Bio.SeqIO to read this file in (rather than
1141 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will
1142 contain thousands of reads, so you would normally use Bio.SeqIO.parse()
1143 as shown above. This example has only as one entry, so instead we can
1144 use the Bio.SeqIO.read() function:
1145
1146 >>> from Bio import SeqIO
1147 >>> handle = open("Quality/solexa_faked.fastq", "rU")
1148 >>> record = SeqIO.read(handle, "fastq-solexa")
1149 >>> handle.close()
1150 >>> print record.id, record.seq
1151 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1152 >>> print record.letter_annotations["solexa_quality"]
1153 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
1154
1155 These quality scores are so low that when converted from the Solexa scheme
1156 into PHRED scores they look quite different:
1157
1158 >>> print "%0.2f" % phred_quality_from_solexa(-1)
1159 2.54
1160 >>> print "%0.2f" % phred_quality_from_solexa(-5)
1161 1.19
1162
1163 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
1164 method to output the record(s):
1165
1166 >>> print record.format("fastq-solexa")
1167 @slxa_0001_1_0001_01
1168 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1169 +
1170 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1171 <BLANKLINE>
1172
1173 Note this output is slightly different from the input file as Biopython
1174 has left out the optional repetition of the sequence identifier on the "+"
1175 line. If you want the to use PHRED scores, use "fastq" or "qual" as the
1176 output format instead, and Biopython will do the conversion for you:
1177
1178 >>> print record.format("fastq")
1179 @slxa_0001_1_0001_01
1180 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1181 +
1182 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
1183 <BLANKLINE>
1184
1185 >>> print record.format("qual")
1186 >slxa_0001_1_0001_01
1187 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
1188 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1189 1 1
1190 <BLANKLINE>
1191
1192 As shown above, the poor quality Solexa reads have been mapped to the
1193 equivalent PHRED score (e.g. -5 to 1 as shown earlier).
1194 """
1195 q_mapping = dict()
1196 for letter in range(0, 255):
1197 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET
1198 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1199 if title2ids:
1200 id, name, descr = title_line
1201 else:
1202 descr = title_line
1203 id = descr.split()[0]
1204 name = id
1205 record = SeqRecord(Seq(seq_string, alphabet),
1206 id=id, name=name, description=descr)
1207 qualities = [q_mapping[letter] for letter in quality_string]
1208
1209 if qualities and (min(qualities) < -5 or max(qualities) > 62):
1210 raise ValueError("Invalid character in quality string")
1211
1212
1213 dict.__setitem__(record._per_letter_annotations,
1214 "solexa_quality", qualities)
1215 yield record
1216
1217
1219 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).
1220
1221 The optional arguments are the same as those for the FastqPhredIterator.
1222
1223 For each sequence in Illumina 1.3+ FASTQ files there is a matching string
1224 encoding PHRED integer qualities using ASCII values with an offset of 64.
1225
1226 >>> from Bio import SeqIO
1227 >>> record = SeqIO.read(open("Quality/illumina_faked.fastq"), "fastq-illumina")
1228 >>> print record.id, record.seq
1229 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1230 >>> max(record.letter_annotations["phred_quality"])
1231 40
1232 >>> min(record.letter_annotations["phred_quality"])
1233 0
1234
1235 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
1236 with an ASCII offset of 64. They are approximately equal but only for high
1237 quality reads. If you have an old Solexa/Illumina file with negative
1238 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:
1239
1240 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina")
1241 Traceback (most recent call last):
1242 ...
1243 ValueError: Invalid character in quality string
1244
1245 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.
1246 """
1247 q_mapping = dict()
1248 for letter in range(0, 255):
1249 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET
1250 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1251 if title2ids:
1252 id, name, descr = title2ids(title_line)
1253 else:
1254 descr = title_line
1255 id = descr.split()[0]
1256 name = id
1257 record = SeqRecord(Seq(seq_string, alphabet),
1258 id=id, name=name, description=descr)
1259 qualities = [q_mapping[letter] for letter in quality_string]
1260 if qualities and (min(qualities) < 0 or max(qualities) > 62):
1261 raise ValueError("Invalid character in quality string")
1262
1263
1264 dict.__setitem__(record._per_letter_annotations,
1265 "phred_quality", qualities)
1266 yield record
1267
1268
1270 """For QUAL files which include PHRED quality scores, but no sequence.
1271
1272 For example, consider this short QUAL file::
1273
1274 >EAS54_6_R1_2_1_413_324
1275 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1276 26 26 26 23 23
1277 >EAS54_6_R1_2_1_540_792
1278 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1279 26 18 26 23 18
1280 >EAS54_6_R1_2_1_443_348
1281 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1282 24 18 18 18 18
1283
1284 Using this module directly you might run:
1285
1286 >>> handle = open("Quality/example.qual", "rU")
1287 >>> for record in QualPhredIterator(handle):
1288 ... print record.id, record.seq
1289 EAS54_6_R1_2_1_413_324 ?????????????????????????
1290 EAS54_6_R1_2_1_540_792 ?????????????????????????
1291 EAS54_6_R1_2_1_443_348 ?????????????????????????
1292 >>> handle.close()
1293
1294 Typically however, you would call this via Bio.SeqIO instead with "qual"
1295 as the format:
1296
1297 >>> from Bio import SeqIO
1298 >>> handle = open("Quality/example.qual", "rU")
1299 >>> for record in SeqIO.parse(handle, "qual"):
1300 ... print record.id, record.seq
1301 EAS54_6_R1_2_1_413_324 ?????????????????????????
1302 EAS54_6_R1_2_1_540_792 ?????????????????????????
1303 EAS54_6_R1_2_1_443_348 ?????????????????????????
1304 >>> handle.close()
1305
1306 Becase QUAL files don't contain the sequence string itself, the seq
1307 property is set to an UnknownSeq object. As no alphabet was given, this
1308 has defaulted to a generic single letter alphabet and the character "?"
1309 used.
1310
1311 By specifying a nucleotide alphabet, "N" is used instead:
1312
1313 >>> from Bio import SeqIO
1314 >>> from Bio.Alphabet import generic_dna
1315 >>> handle = open("Quality/example.qual", "rU")
1316 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
1317 ... print record.id, record.seq
1318 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
1319 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
1320 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
1321 >>> handle.close()
1322
1323 However, the quality scores themselves are available as a list of integers
1324 in each record's per-letter-annotation:
1325
1326 >>> print record.letter_annotations["phred_quality"]
1327 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1328
1329 You can still slice one of these SeqRecord objects with an UnknownSeq:
1330
1331 >>> sub_record = record[5:10]
1332 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
1333 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
1334
1335 As of Biopython 1.59, this parser will accept files with negatives quality
1336 scores but will replace them with the lowest possible PHRED score of zero.
1337 This will trigger a warning, previously it raised a ValueError exception.
1338 """
1339
1340 while True:
1341 line = handle.readline()
1342 if line == "":
1343 return
1344 if line[0] == ">":
1345 break
1346
1347 while True:
1348 if line[0] != ">":
1349 raise ValueError(
1350 "Records in Fasta files should start with '>' character")
1351 if title2ids:
1352 id, name, descr = title2ids(line[1:].rstrip())
1353 else:
1354 descr = line[1:].rstrip()
1355 id = descr.split()[0]
1356 name = id
1357
1358 qualities = []
1359 line = handle.readline()
1360 while True:
1361 if not line:
1362 break
1363 if line[0] == ">":
1364 break
1365 qualities.extend([int(word) for word in line.split()])
1366 line = handle.readline()
1367
1368 if qualities and min(qualities) < 0:
1369 warnings.warn(("Negative quality score %i found, " +
1370 "substituting PHRED zero instead.")
1371 % min(qualities), BiopythonParserWarning)
1372 qualities = [max(0, q) for q in qualities]
1373
1374
1375 record = SeqRecord(UnknownSeq(len(qualities), alphabet),
1376 id=id, name=name, description=descr)
1377
1378
1379 dict.__setitem__(record._per_letter_annotations,
1380 "phred_quality", qualities)
1381 yield record
1382
1383 if not line:
1384 return
1385 assert False, "Should not reach this line"
1386
1387
1389 """Class to write standard FASTQ format files (using PHRED quality scores).
1390
1391 Although you can use this class directly, you are strongly encouraged
1392 to use the Bio.SeqIO.write() function instead via the format name "fastq"
1393 or the alias "fastq-sanger". For example, this code reads in a standard
1394 Sanger style FASTQ file (using PHRED scores) and re-saves it as another
1395 Sanger style FASTQ file:
1396
1397 >>> from Bio import SeqIO
1398 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
1399 >>> out_handle = open("Quality/temp.fastq", "w")
1400 >>> SeqIO.write(record_iterator, out_handle, "fastq")
1401 3
1402 >>> out_handle.close()
1403
1404 You might want to do this if the original file included extra line breaks,
1405 which while valid may not be supported by all tools. The output file from
1406 Biopython will have each sequence on a single line, and each quality
1407 string on a single line (which is considered desirable for maximum
1408 compatibility).
1409
1410 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa
1411 quality scores) is converted into a standard Sanger style FASTQ file using
1412 PHRED qualities:
1413
1414 >>> from Bio import SeqIO
1415 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
1416 >>> out_handle = open("Quality/temp.fastq", "w")
1417 >>> SeqIO.write(record_iterator, out_handle, "fastq")
1418 5
1419 >>> out_handle.close()
1420
1421 This code is also called if you use the .format("fastq") method of a
1422 SeqRecord, or .format("fastq-sanger") if you prefer that alias.
1423
1424 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is
1425 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1426 warning is issued.
1427
1428 P.S. To avoid cluttering up your working directory, you can delete this
1429 temporary file now:
1430
1431 >>> import os
1432 >>> os.remove("Quality/temp.fastq")
1433 """
1434 assert SANGER_SCORE_OFFSET == ord("!")
1435
1437 """Write a single FASTQ record to the file."""
1438 assert self._header_written
1439 assert not self._footer_written
1440 self._record_written = True
1441
1442 if record.seq is None:
1443 raise ValueError("No sequence for record %s" % record.id)
1444 seq_str = str(record.seq)
1445 qualities_str = _get_sanger_quality_str(record)
1446 if len(qualities_str) != len(seq_str):
1447 raise ValueError("Record %s has sequence length %i but %i quality scores"
1448 % (record.id, len(seq_str), len(qualities_str)))
1449
1450
1451
1452 id = self.clean(record.id)
1453 description = self.clean(record.description)
1454 if description and description.split(None, 1)[0] == id:
1455
1456 title = description
1457 elif description:
1458 title = "%s %s" % (id, description)
1459 else:
1460 title = id
1461
1462 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1463
1464
1466 """Class to write QUAL format files (using PHRED quality scores).
1467
1468 Although you can use this class directly, you are strongly encouraged
1469 to use the Bio.SeqIO.write() function instead. For example, this code
1470 reads in a FASTQ file and saves the quality scores into a QUAL file:
1471
1472 >>> from Bio import SeqIO
1473 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
1474 >>> out_handle = open("Quality/temp.qual", "w")
1475 >>> SeqIO.write(record_iterator, out_handle, "qual")
1476 3
1477 >>> out_handle.close()
1478
1479 This code is also called if you use the .format("qual") method of a
1480 SeqRecord.
1481
1482 P.S. Don't forget to clean up the temp file if you don't need it anymore:
1483
1484 >>> import os
1485 >>> os.remove("Quality/temp.qual")
1486 """
1487 - def __init__(self, handle, wrap=60, record2title=None):
1488 """Create a QUAL writer.
1489
1490 Arguments:
1491 - handle - Handle to an output file, e.g. as returned
1492 by open(filename, "w")
1493 - wrap - Optional line length used to wrap sequence lines.
1494 Defaults to wrapping the sequence at 60 characters
1495 Use zero (or None) for no wrapping, giving a single
1496 long line for the sequence.
1497 - record2title - Optional function to return the text to be
1498 used for the title line of each record. By default
1499 a combination of the record.id and record.description
1500 is used. If the record.description starts with the
1501 record.id, then just the record.description is used.
1502
1503 The record2title argument is present for consistency with the
1504 Bio.SeqIO.FastaIO writer class.
1505 """
1506 SequentialSequenceWriter.__init__(self, handle)
1507
1508 self.wrap = None
1509 if wrap:
1510 if wrap < 1:
1511 raise ValueError
1512 self.wrap = wrap
1513 self.record2title = record2title
1514
1516 """Write a single QUAL record to the file."""
1517 assert self._header_written
1518 assert not self._footer_written
1519 self._record_written = True
1520
1521 handle = self.handle
1522 wrap = self.wrap
1523
1524 if self.record2title:
1525 title = self.clean(self.record2title(record))
1526 else:
1527 id = self.clean(record.id)
1528 description = self.clean(record.description)
1529 if description and description.split(None, 1)[0] == id:
1530
1531 title = description
1532 elif description:
1533 title = "%s %s" % (id, description)
1534 else:
1535 title = id
1536 handle.write(">%s\n" % title)
1537
1538 qualities = _get_phred_quality(record)
1539 try:
1540
1541
1542 qualities_strs = [("%i" % round(q, 0)) for q in qualities]
1543 except TypeError, e:
1544 if None in qualities:
1545 raise TypeError("A quality value of None was found")
1546 else:
1547 raise e
1548
1549 if wrap > 5:
1550
1551 data = " ".join(qualities_strs)
1552 while True:
1553 if len(data) <= wrap:
1554 self.handle.write(data + "\n")
1555 break
1556 else:
1557
1558
1559 i = data.rfind(" ", 0, wrap)
1560 handle.write(data[:i] + "\n")
1561 data = data[i + 1:]
1562 elif wrap:
1563
1564 while qualities_strs:
1565 line = qualities_strs.pop(0)
1566 while qualities_strs \
1567 and len(line) + 1 + len(qualities_strs[0]) < wrap:
1568 line += " " + qualities_strs.pop(0)
1569 handle.write(line + "\n")
1570 else:
1571
1572 data = " ".join(qualities_strs)
1573 handle.write(data + "\n")
1574
1575
1577 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities).
1578
1579 This outputs FASTQ files like those from the early Solexa/Illumina
1580 pipeline, using Solexa scores and an ASCII offset of 64. These are
1581 NOT compatible with the standard Sanger style PHRED FASTQ files.
1582
1583 If your records contain a "solexa_quality" entry under letter_annotations,
1584 this is used, otherwise any "phred_quality" entry will be used after
1585 conversion using the solexa_quality_from_phred function. If neither style
1586 of quality scores are present, an exception is raised.
1587
1588 Although you can use this class directly, you are strongly encouraged
1589 to use the Bio.SeqIO.write() function instead. For example, this code
1590 reads in a FASTQ file and re-saves it as another FASTQ file:
1591
1592 >>> from Bio import SeqIO
1593 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
1594 >>> out_handle = open("Quality/temp.fastq", "w")
1595 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa")
1596 5
1597 >>> out_handle.close()
1598
1599 You might want to do this if the original file included extra line breaks,
1600 which (while valid) may not be supported by all tools. The output file
1601 from Biopython will have each sequence on a single line, and each quality
1602 string on a single line (which is considered desirable for maximum
1603 compatibility).
1604
1605 This code is also called if you use the .format("fastq-solexa") method of
1606 a SeqRecord. For example,
1607
1608 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
1609 >>> print record.format("fastq-solexa")
1610 @Test PHRED qualities from 40 to 0 inclusive
1611 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1612 +
1613 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
1614 <BLANKLINE>
1615
1616 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
1617 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit,
1618 a warning is issued.
1619
1620 P.S. Don't forget to delete the temp file if you don't need it anymore:
1621
1622 >>> import os
1623 >>> os.remove("Quality/temp.fastq")
1624 """
1626 """Write a single FASTQ record to the file."""
1627 assert self._header_written
1628 assert not self._footer_written
1629 self._record_written = True
1630
1631
1632 if record.seq is None:
1633 raise ValueError("No sequence for record %s" % record.id)
1634 seq_str = str(record.seq)
1635 qualities_str = _get_solexa_quality_str(record)
1636 if len(qualities_str) != len(seq_str):
1637 raise ValueError("Record %s has sequence length %i but %i quality scores"
1638 % (record.id, len(seq_str), len(qualities_str)))
1639
1640
1641
1642 id = self.clean(record.id)
1643 description = self.clean(record.description)
1644 if description and description.split(None, 1)[0] == id:
1645
1646 title = description
1647 elif description:
1648 title = "%s %s" % (id, description)
1649 else:
1650 title = id
1651
1652 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1653
1654
1656 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores).
1657
1658 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
1659 using PHRED scores and an ASCII offset of 64. Note these files are NOT
1660 compatible with the standard Sanger style PHRED FASTQ files which use an
1661 ASCII offset of 32.
1662
1663 Although you can use this class directly, you are strongly encouraged to
1664 use the Bio.SeqIO.write() function with format name "fastq-illumina"
1665 instead. This code is also called if you use the .format("fastq-illumina")
1666 method of a SeqRecord. For example,
1667
1668 >>> from Bio import SeqIO
1669 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
1670 >>> print record.format("fastq-illumina")
1671 @Test PHRED qualities from 40 to 0 inclusive
1672 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1673 +
1674 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
1675 <BLANKLINE>
1676
1677 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
1678 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1679 warning is issued.
1680 """
1682 """Write a single FASTQ record to the file."""
1683 assert self._header_written
1684 assert not self._footer_written
1685 self._record_written = True
1686
1687
1688 if record.seq is None:
1689 raise ValueError("No sequence for record %s" % record.id)
1690 seq_str = str(record.seq)
1691 qualities_str = _get_illumina_quality_str(record)
1692 if len(qualities_str) != len(seq_str):
1693 raise ValueError("Record %s has sequence length %i but %i quality scores"
1694 % (record.id, len(seq_str), len(qualities_str)))
1695
1696
1697
1698 id = self.clean(record.id)
1699 description = self.clean(record.description)
1700 if description and description.split(None, 1)[0] == id:
1701
1702 title = description
1703 elif description:
1704 title = "%s %s" % (id, description)
1705 else:
1706 title = id
1707
1708 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1709
1710
1712 """Iterate over matched FASTA and QUAL files as SeqRecord objects.
1713
1714 For example, consider this short QUAL file with PHRED quality scores::
1715
1716 >EAS54_6_R1_2_1_413_324
1717 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1718 26 26 26 23 23
1719 >EAS54_6_R1_2_1_540_792
1720 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1721 26 18 26 23 18
1722 >EAS54_6_R1_2_1_443_348
1723 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1724 24 18 18 18 18
1725
1726 And a matching FASTA file::
1727
1728 >EAS54_6_R1_2_1_413_324
1729 CCCTTCTTGTCTTCAGCGTTTCTCC
1730 >EAS54_6_R1_2_1_540_792
1731 TTGGCAGGCCAAGGCCGATGGATCA
1732 >EAS54_6_R1_2_1_443_348
1733 GTTGCTTCTGGCGTGGGTGGGGGGG
1734
1735 You can parse these separately using Bio.SeqIO with the "qual" and
1736 "fasta" formats, but then you'll get a group of SeqRecord objects with
1737 no sequence, and a matching group with the sequence but not the
1738 qualities. Because it only deals with one input file handle, Bio.SeqIO
1739 can't be used to read the two files together - but this function can!
1740 For example,
1741
1742 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1743 ... open("Quality/example.qual", "rU"))
1744 >>> for record in rec_iter:
1745 ... print record.id, record.seq
1746 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1747 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1748 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1749
1750 As with the FASTQ or QUAL parsers, if you want to look at the qualities,
1751 they are in each record's per-letter-annotation dictionary as a simple
1752 list of integers:
1753
1754 >>> print record.letter_annotations["phred_quality"]
1755 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1756
1757 If you have access to data as a FASTQ format file, using that directly
1758 would be simpler and more straight forward. Note that you can easily use
1759 this function to convert paired FASTA and QUAL files into FASTQ files:
1760
1761 >>> from Bio import SeqIO
1762 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1763 ... open("Quality/example.qual", "rU"))
1764 >>> out_handle = open("Quality/temp.fastq", "w")
1765 >>> SeqIO.write(rec_iter, out_handle, "fastq")
1766 3
1767 >>> out_handle.close()
1768
1769 And don't forget to clean up the temp file if you don't need it anymore:
1770
1771 >>> import os
1772 >>> os.remove("Quality/temp.fastq")
1773 """
1774 from Bio.SeqIO.FastaIO import FastaIterator
1775 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet,
1776 title2ids=title2ids)
1777 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet,
1778 title2ids=title2ids)
1779
1780
1781
1782 while True:
1783 try:
1784 f_rec = fasta_iter.next()
1785 except StopIteration:
1786 f_rec = None
1787 try:
1788 q_rec = qual_iter.next()
1789 except StopIteration:
1790 q_rec = None
1791 if f_rec is None and q_rec is None:
1792
1793 break
1794 if f_rec is None:
1795 raise ValueError("FASTA file has more entries than the QUAL file.")
1796 if q_rec is None:
1797 raise ValueError("QUAL file has more entries than the FASTA file.")
1798 if f_rec.id != q_rec.id:
1799 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)."
1800 % (f_rec.id, q_rec.id))
1801 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]):
1802 raise ValueError("Sequence length and number of quality scores disagree for %s"
1803 % f_rec.id)
1804
1805 f_rec.letter_annotations[
1806 "phred_quality"] = q_rec.letter_annotations["phred_quality"]
1807 yield f_rec
1808
1809
1810
1811 if __name__ == "__main__":
1812 from Bio._utils import run_doctest
1813 run_doctest(verbose=0)
1814