
Source Code for Module Bio.SeqIO.SffIO

   1  # Copyright 2009-2010 by Peter Cock.  All rights reserved. 
   2  # Based on code contributed and copyright 2009 by Jose Blanca (COMAV-UPV). 
   3  # 
   4  # This code is part of the Biopython distribution and governed by its 
   5  # license.  Please see the LICENSE file that should have been included 
   6  # as part of this package. 
   7  """Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format. 
   8   
   9  SFF was designed by 454 Life Sciences (Roche), the Whitehead Institute for 
  10  Biomedical Research and the Wellcome Trust Sanger Institute. It was also 
  11  used as the native output format from early versions of Ion Torrent's PGM 
  12  platform. You are expected to use this module via the Bio.SeqIO functions under 
  13  the format name "sff" (or "sff-trim" as described below). 
  14   
  15  For example, to iterate over the records in an SFF file, 
  16   
  17      >>> from Bio import SeqIO 
  18      >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"): 
  19      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
  20      ... 
  21      E3MFGYR02JWQ7T 265 tcagGGTCTACATGTTGGTT... 
  22      E3MFGYR02JA6IL 271 tcagTTTTTTTTGGAAAGGA... 
  23      E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC... 
  24      E3MFGYR02GFKUC 299 tcagCGGCCGGGCCTCTCAT... 
  25      E3MFGYR02FTGED 281 tcagTGGTAATGGGGGGAAA... 
  26      E3MFGYR02FR9G7 261 tcagCTCCGTAAGAAGGTGC... 
  27      E3MFGYR02GAZMS 278 tcagAAAGAAGTAAGGTAAA... 
  28      E3MFGYR02HHZ8O 221 tcagACTTTCTTCTTTACCG... 
  29      E3MFGYR02GPGB1 269 tcagAAGCAGTGGTATCAAC... 
  30      E3MFGYR02F7Z7G 219 tcagAATCATCCACTTTTTA... 
  31   
  32  Each SeqRecord object will contain all the annotation from the SFF file, 
  33  including the PHRED quality scores. 
  34   
  35      >>> print("%s %i" % (record.id, len(record))) 
  36      E3MFGYR02F7Z7G 219 
  37      >>> print("%s..." % record.seq[:10]) 
  38      tcagAATCAT... 
  39      >>> print("%r..." % (record.letter_annotations["phred_quality"][:10])) 
  40      [22, 21, 23, 28, 26, 15, 12, 21, 28, 21]... 
  41   
  42  Notice that the sequence is given in mixed case; the central upper case region 
  43  corresponds to the trimmed sequence. This matches the output of the Roche 
  44  tools (and the 3rd party tool sff_extract) for SFF to FASTA conversion. 
  45   
  46      >>> print(record.annotations["clip_qual_left"]) 
  47      4 
  48      >>> print(record.annotations["clip_qual_right"]) 
  49      134 
  50      >>> print(record.seq[:4]) 
  51      tcag 
  52      >>> print("%s...%s" % (record.seq[4:20], record.seq[120:134])) 
  53      AATCATCCACTTTTTA...CAAAACACAAACAG 
  54      >>> print(record.seq[134:]) 
  55      atcttatcaacaaaactcaaagttcctaactgagacacgcaacaggggataagacaaggcacacaggggataggnnnnnnnnnnn 
  56   
  57  The annotations dictionary also contains any adapter clip positions 
  58  (usually zero), and information about the flows. e.g. 
  59   
  60      >>> len(record.annotations) 
  61      11 
  62      >>> print(record.annotations["flow_key"]) 
  63      TCAG 
  64      >>> print(record.annotations["flow_values"][:10]) 
  65      (83, 1, 128, 7, 4, 84, 6, 106, 3, 172) 
  66      >>> print(len(record.annotations["flow_values"])) 
  67      400 
  68      >>> print(record.annotations["flow_index"][:10]) 
  69      (1, 2, 3, 2, 2, 0, 3, 2, 3, 3) 
  70      >>> print(len(record.annotations["flow_index"])) 
  71      219 
  72   
  73  Note that to convert from a raw reading in flow_values to the corresponding 
  74  homopolymer stretch estimate, the value should be rounded to the nearest 100: 
  75   
  76      >>> print("%r..." % [int(round(value, -2)) // 100 
  77      ...                  for value in record.annotations["flow_values"][:10]]) 
  78      ... 
  79      [1, 0, 1, 0, 0, 1, 0, 1, 0, 2]... 
  80   
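The rounding rule above can be applied flow by flow to rebuild base calls. The following is an illustrative sketch, not part of this module: it assumes the usual 454 flow order ("TACG" repeated, consistent with the TCAG flow key) and reuses the ten flow values quoted above.

```python
# Sketch: convert flow values to base calls by rounding to the nearest 100.
# Assumes the conventional 454 flow cycle "TACG" (an assumption, not read
# from the file here); flow_values are the first ten flows quoted above.
flow_chars = "TACG" * 100  # the example file has 400 flows per read
flow_values = (83, 1, 128, 7, 4, 84, 6, 106, 3, 172)

def flows_to_bases(chars, values):
    """Emit each flow's base repeated by its homopolymer length estimate."""
    bases = []
    for char, value in zip(chars, values):
        # Round to the nearest 100 flow units, then take the multiple:
        bases.append(char * (int(round(value, -2)) // 100))
    return "".join(bases)

print(flows_to_bases(flow_chars, flow_values))  # TCAGAA
```

Note how the first four flows reproduce the TCAG key sequence, and the remaining flows begin the read itself ("tcagAA..." in the example record).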
  81  If a read name is exactly 14 alphanumeric characters, the annotations 
  82  dictionary will also contain meta-data about the read extracted by 
  83  interpreting the name as a 454 Sequencing System "Universal" Accession 
  84  Number. Note that if a read name happens to be exactly 14 alphanumeric 
  85  characters but was not generated automatically, these annotation records 
  86  will contain nonsense information. 
  87   
  88      >>> print(record.annotations["region"]) 
  89      2 
  90      >>> print(record.annotations["time"]) 
  91      [2008, 1, 9, 16, 16, 0] 
  92      >>> print(record.annotations["coords"]) 
  93      (2434, 1658) 
  94   
  95  As a convenience, you can read the file with SeqIO format name "sff-trim" 
  96  instead of "sff" to get just the trimmed sequences (without any annotation 
  97  except for the PHRED quality scores and anything encoded in the read names): 
  98   
  99      >>> from Bio import SeqIO 
 100      >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim"): 
 101      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 102      ... 
 103      E3MFGYR02JWQ7T 260 GGTCTACATGTTGGTTAACC... 
 104      E3MFGYR02JA6IL 265 TTTTTTTTGGAAAGGAAAAC... 
 105      E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG... 
 106      E3MFGYR02GFKUC 295 CGGCCGGGCCTCTCATCGGT... 
 107      E3MFGYR02FTGED 277 TGGTAATGGGGGGAAATTTA... 
 108      E3MFGYR02FR9G7 256 CTCCGTAAGAAGGTGCTGCC... 
 109      E3MFGYR02GAZMS 271 AAAGAAGTAAGGTAAATAAC... 
 110      E3MFGYR02HHZ8O 150 ACTTTCTTCTTTACCGTAAC... 
 111      E3MFGYR02GPGB1 221 AAGCAGTGGTATCAACGCAG... 
 112      E3MFGYR02F7Z7G 130 AATCATCCACTTTTTAACGT... 
 113   
 114  Looking at the final record in more detail, note how this differs from the 
 115  example above: 
 116   
 117      >>> print("%s %i" % (record.id, len(record))) 
 118      E3MFGYR02F7Z7G 130 
 119      >>> print("%s..." % record.seq[:10]) 
 120      AATCATCCAC... 
 121      >>> print("%r..." % record.letter_annotations["phred_quality"][:10]) 
 122      [26, 15, 12, 21, 28, 21, 36, 28, 27, 27]... 
 123      >>> len(record.annotations) 
 124      3 
 125      >>> print(record.annotations["region"]) 
 126      2 
 127      >>> print(record.annotations["coords"]) 
 128      (2434, 1658) 
 129      >>> print(record.annotations["time"]) 
 130      [2008, 1, 9, 16, 16, 0] 
 131   
 132  You might use the Bio.SeqIO.convert() function to convert the (trimmed) SFF 
 133  reads into a FASTQ file (or a FASTA file and a QUAL file), e.g. 
 134   
 135      >>> from Bio import SeqIO 
 136      >>> try: 
 137      ...     from StringIO import StringIO # Python 2 
 138      ... except ImportError: 
 139      ...     from io import StringIO # Python 3 
 140      ... 
 141      >>> out_handle = StringIO() 
 142      >>> count = SeqIO.convert("Roche/E3MFGYR02_random_10_reads.sff", "sff", 
 143      ...                       out_handle, "fastq") 
 144      ... 
 145      >>> print("Converted %i records" % count) 
 146      Converted 10 records 
 147   
 148  The output FASTQ file would start like this: 
 149   
 150      >>> print("%s..." % out_handle.getvalue()[:50]) 
 151      @E3MFGYR02JWQ7T 
 152      tcagGGTCTACATGTTGGTTAACCCGTACTGATT... 
 153   
 154  Bio.SeqIO.index() provides memory efficient random access to the reads in an 
 155  SFF file by name. SFF files can include an index within the file, which can 
 156  be read in, making this very fast. If the index is missing (or in a format not 
 157  yet supported in Biopython) the file is indexed by scanning all the reads - 
 158  which is a little slower. For example, 
 159   
 160      >>> from Bio import SeqIO 
 161      >>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 162      >>> record = reads["E3MFGYR02JHD4H"] 
 163      >>> print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 164      E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC... 
 165      >>> reads.close() 
 166   
 167  Or, using the trimmed reads: 
 168   
 169      >>> from Bio import SeqIO 
 170      >>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim") 
 171      >>> record = reads["E3MFGYR02JHD4H"] 
 172      >>> print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 173      E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG... 
 174      >>> reads.close() 
 175   
 176  You can also use the Bio.SeqIO.write() function with the "sff" format. Note 
 177  that this requires all the flow information etc, and thus is probably only 
 178  useful for SeqRecord objects originally from reading another SFF file (and 
 179  not the trimmed SeqRecord objects from parsing an SFF file as "sff-trim"). 
 180   
 181  As an example, let's pretend this example SFF file represents some DNA which 
 182  was pre-amplified with a PCR primer AAAGANNNNN. The following script would 
 183  produce a sub-file containing all those reads whose post-quality clipping 
 184  region (i.e. the sequence after trimming) starts with AAAGA exactly (the non- 
 185  degenerate bit of this pretend primer): 
 186   
 187      >>> from Bio import SeqIO 
 188      >>> records = (record for record in 
 189      ...            SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 190      ...            if record.seq[record.annotations["clip_qual_left"]:].startswith("AAAGA")) 
 191      ... 
 192      >>> count = SeqIO.write(records, "temp_filtered.sff", "sff") 
 193      >>> print("Selected %i records" % count) 
 194      Selected 2 records 
 195   
 196  Of course, for an assembly you would probably want to remove these primers. 
 197  If you want FASTA or FASTQ output, you could just slice the SeqRecord. However, 
 198  if you want SFF output we have to preserve all the flow information - the trick 
 199  is just to adjust the left clip position! 
 200   
 201      >>> from Bio import SeqIO 
 202      >>> def filter_and_trim(records, primer): 
 203      ...     for record in records: 
 204      ...         if record.seq[record.annotations["clip_qual_left"]:].startswith(primer): 
 205      ...             record.annotations["clip_qual_left"] += len(primer) 
 206      ...             yield record 
 207      ... 
 208      >>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 209      >>> count = SeqIO.write(filter_and_trim(records, "AAAGA"), 
 210      ...                     "temp_filtered.sff", "sff") 
 211      ... 
 212      >>> print("Selected %i records" % count) 
 213      Selected 2 records 
 214   
 215  We can check the results; note the lower case clipped region now includes the "AAAGA" 
 216  sequence: 
 217   
 218      >>> for record in SeqIO.parse("temp_filtered.sff", "sff"): 
 219      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 220      ... 
 221      E3MFGYR02JHD4H 310 tcagaaagaCAAGTGGTATC... 
 222      E3MFGYR02GAZMS 278 tcagaaagaAGTAAGGTAAA... 
 223      >>> for record in SeqIO.parse("temp_filtered.sff", "sff-trim"): 
 224      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 225      ... 
 226      E3MFGYR02JHD4H 287 CAAGTGGTATCAACGCAGAG... 
 227      E3MFGYR02GAZMS 266 AGTAAGGTAAATAACAAACG... 
 228      >>> import os 
 229      >>> os.remove("temp_filtered.sff") 
 230   
 231  For a description of the file format, please see the Roche manuals and: 
 232  http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats 
 233   
 234  """ 
 235   
 236  from __future__ import print_function 
 237   
 238  from Bio.SeqIO.Interfaces import SequenceWriter 
 239  from Bio import Alphabet 
 240  from Bio.Seq import Seq 
 241  from Bio.SeqRecord import SeqRecord 
 242  import struct 
 243  import sys 
 244  import re 
 245   
 246  from Bio._py3k import _bytes_to_string, _as_bytes 
 247  _null = b"\0" 
 248  _sff = b".sff" 
 249  _hsh = b".hsh" 
 250  _srt = b".srt" 
 251  _mft = b".mft" 
 252  _flag = b"\xff" 
 253   
 254  __docformat__ = "restructuredtext en" 
 255   
 256   
 257  def _check_mode(handle):
 258      """Ensure the handle is not opened in text mode.
 259  
 260      Ensures mode is not set for universal newlines,
 261      and ensures mode is binary for Windows.
 262      """
 263      # TODO - Does this need to be stricter under Python 3?
 264      mode = ""
 265      if hasattr(handle, "mode"):
 266          mode = handle.mode
 267          if mode == 1:
 268              # gzip.open(...) does this, fine
 269              return
 270          mode = str(mode)
 271  
 272      if mode and "U" in mode.upper():
 273          raise ValueError("SFF files must NOT be opened in universal new "
 274                           "lines mode. Binary mode is recommended (although "
 275                           "on Unix the default mode is also fine).")
 276      elif mode and "B" not in mode.upper() \
 277              and sys.platform == "win32":
 278          raise ValueError("SFF files must be opened in binary mode on Windows")
 279  
 280  
 281  def _sff_file_header(handle):
 282      """Read in an SFF file header (PRIVATE).
 283  
 284      Assumes the handle is at the start of the file, will read forwards
 285      through the header and leave the handle pointing at the first record.
 286      Returns a tuple of values from the header (header_length, index_offset,
 287      index_length, number_of_reads, flows_per_read, flow_chars, key_sequence)
 288  
 289      >>> with open("Roche/greek.sff", "rb") as handle:
 290      ...     values = _sff_file_header(handle)
 291      ...
 292      >>> print(values[0])
 293      840
 294      >>> print(values[1])
 295      65040
 296      >>> print(values[2])
 297      256
 298      >>> print(values[3])
 299      24
 300      >>> print(values[4])
 301      800
 302      >>> values[-1]
 303      'TCAG'
 304  
 305      """
 306      _check_mode(handle)
 307      # file header (part one)
 308      # use big endian encoding >
 309      # magic_number               I
 310      # version                    4B
 311      # index_offset               Q
 312      # index_length               I
 313      # number_of_reads            I
 314      # header_length              H
 315      # key_length                 H
 316      # number_of_flows_per_read   H
 317      # flowgram_format_code       B
 318      # [rest of file header depends on the number of flows and how many keys]
 319      fmt = '>4s4BQIIHHHB'
 320      assert 31 == struct.calcsize(fmt)
 321      data = handle.read(31)
 322      if not data:
 323          raise ValueError("Empty file.")
 324      elif len(data) < 13:
 325          raise ValueError("File too small to hold a valid SFF header.")
 326      magic_number, ver0, ver1, ver2, ver3, index_offset, index_length, \
 327          number_of_reads, header_length, key_length, number_of_flows_per_read, \
 328          flowgram_format = struct.unpack(fmt, data)
 329      if magic_number in [_hsh, _srt, _mft]:
 330          # Probably user error, calling Bio.SeqIO.parse() twice!
 331          raise ValueError("Handle seems to be at SFF index block, not start")
 332      if magic_number != _sff:  # 779314790
 333          raise ValueError("SFF file did not start '.sff', but %s"
 334                           % repr(magic_number))
 335      if (ver0, ver1, ver2, ver3) != (0, 0, 0, 1):
 336          raise ValueError("Unsupported SFF version in header, %i.%i.%i.%i"
 337                           % (ver0, ver1, ver2, ver3))
 338      if flowgram_format != 1:
 339          raise ValueError("Flowgram format code %i not supported"
 340                           % flowgram_format)
 341      if (index_offset != 0) ^ (index_length != 0):
 342          raise ValueError("Index offset %i but index length %i"
 343                           % (index_offset, index_length))
 344      flow_chars = _bytes_to_string(handle.read(number_of_flows_per_read))
 345      key_sequence = _bytes_to_string(handle.read(key_length))
 346      # According to the spec, the header_length field should be the total number
 347      # of bytes required by this set of header fields, and should be equal to
 348      # "31 + number_of_flows_per_read + key_length" rounded up to the next value
 349      # divisible by 8.
 350      assert header_length % 8 == 0
 351      padding = header_length - number_of_flows_per_read - key_length - 31
 352      assert 0 <= padding < 8, padding
 353      if handle.read(padding).count(_null) != padding:
 354          import warnings
 355          from Bio import BiopythonParserWarning
 356          warnings.warn("Your SFF file is invalid, post header %i byte "
 357                        "null padding region contained data." % padding,
 358                        BiopythonParserWarning)
 359      return header_length, index_offset, index_length, \
 360          number_of_reads, number_of_flows_per_read, \
 361          flow_chars, key_sequence
 362  
 363  
 364  def _sff_do_slow_index(handle):
 365      """Generate an index by scanning through all the reads in an SFF file (PRIVATE).
 366  
 367      This is a slow but generic approach if we can't parse the provided index
 368      (if present).
 369  
 370      Will use the handle seek/tell functions.
 371      """
 372      handle.seek(0)
 373      header_length, index_offset, index_length, number_of_reads, \
 374          number_of_flows_per_read, flow_chars, key_sequence \
 375          = _sff_file_header(handle)
 376      # Now on to the reads...
 377      read_header_fmt = '>2HI4H'
 378      read_header_size = struct.calcsize(read_header_fmt)
 379      # NOTE - assuming flowgram_format==1, which means struct type H
 380      read_flow_fmt = ">%iH" % number_of_flows_per_read
 381      read_flow_size = struct.calcsize(read_flow_fmt)
 382      assert 1 == struct.calcsize(">B")
 383      assert 1 == struct.calcsize(">s")
 384      assert 1 == struct.calcsize(">c")
 385      assert read_header_size % 8 == 0  # Important for padding calc later!
 386      for read in range(number_of_reads):
 387          record_offset = handle.tell()
 388          if record_offset == index_offset:
 389              # Found index block within reads, ignore it:
 390              offset = index_offset + index_length
 391              if offset % 8:
 392                  offset += 8 - (offset % 8)
 393              assert offset % 8 == 0
 394              handle.seek(offset)
 395              record_offset = offset
 396          # assert record_offset%8 == 0  # Worth checking, but slow
 397          # First the fixed header
 398          data = handle.read(read_header_size)
 399          read_header_length, name_length, seq_len, clip_qual_left, \
 400              clip_qual_right, clip_adapter_left, clip_adapter_right \
 401              = struct.unpack(read_header_fmt, data)
 402          if read_header_length < 10 or read_header_length % 8 != 0:
 403              raise ValueError("Malformed read header, says length is %i:\n%s"
 404                               % (read_header_length, repr(data)))
 405          # now the name and any padding (remainder of header)
 406          name = _bytes_to_string(handle.read(name_length))
 407          padding = read_header_length - read_header_size - name_length
 408          if handle.read(padding).count(_null) != padding:
 409              import warnings
 410              from Bio import BiopythonParserWarning
 411              warnings.warn("Your SFF file is invalid, post name %i byte "
 412                            "padding region contained data" % padding,
 413                            BiopythonParserWarning)
 414          assert record_offset + read_header_length == handle.tell()
 415          # now the flowgram values, flowgram index, bases and qualities
 416          size = read_flow_size + 3 * seq_len
 417          handle.seek(size, 1)
 418          # now any padding...
 419          padding = size % 8
 420          if padding:
 421              padding = 8 - padding
 422              if handle.read(padding).count(_null) != padding:
 423                  import warnings
 424                  from Bio import BiopythonParserWarning
 425                  warnings.warn("Your SFF file is invalid, post quality %i "
 426                                "byte padding region contained data" % padding,
 427                                BiopythonParserWarning)
 428          # print("%s %s %i" % (read, name, record_offset))
 429          yield name, record_offset
 430      if handle.tell() % 8 != 0:
 431          raise ValueError(
 432              "After scanning reads, did not end on a multiple of 8")
 433  
 434  
 435  def _sff_find_roche_index(handle):
 436      """Locate any existing Roche style XML meta data and read index (PRIVATE).
 437  
 438      Makes a number of hard coded assumptions based on reverse engineered SFF
 439      files from Roche 454 machines.
 440  
 441      Returns a tuple of read count, SFF "index" offset and size, XML offset
 442      and size, and the actual read index offset and size.
 443  
 444      Raises a ValueError for unsupported or non-Roche index blocks.
 445      """
 446      handle.seek(0)
 447      header_length, index_offset, index_length, number_of_reads, \
 448          number_of_flows_per_read, flow_chars, key_sequence \
 449          = _sff_file_header(handle)
 450      assert handle.tell() == header_length
 451      if not index_offset or not index_length:
 452          raise ValueError("No index present in this SFF file")
 453      # Now jump to the header...
 454      handle.seek(index_offset)
 455      fmt = ">4s4B"
 456      fmt_size = struct.calcsize(fmt)
 457      data = handle.read(fmt_size)
 458      if not data:
 459          raise ValueError("Premature end of file? Expected index of size %i at offset %i, found nothing"
 460                           % (index_length, index_offset))
 461      if len(data) < fmt_size:
 462          raise ValueError("Premature end of file? Expected index of size %i at offset %i, found %s"
 463                           % (index_length, index_offset, repr(data)))
 464      magic_number, ver0, ver1, ver2, ver3 = struct.unpack(fmt, data)
 465      if magic_number == _mft:  # 778921588
 466          # Roche 454 manifest index
 467          # This is typical from raw Roche 454 SFF files (2009), and includes
 468          # both an XML manifest and the sorted index.
 469          if (ver0, ver1, ver2, ver3) != (49, 46, 48, 48):
 470              # This is "1.00" as a string
 471              raise ValueError("Unsupported version in .mft index header, %i.%i.%i.%i"
 472                               % (ver0, ver1, ver2, ver3))
 473          fmt2 = ">LL"
 474          fmt2_size = struct.calcsize(fmt2)
 475          xml_size, data_size = struct.unpack(fmt2, handle.read(fmt2_size))
 476          if index_length != fmt_size + fmt2_size + xml_size + data_size:
 477              raise ValueError("Problem understanding .mft index header, %i != %i + %i + %i + %i"
 478                               % (index_length, fmt_size, fmt2_size, xml_size, data_size))
 479          return number_of_reads, header_length, \
 480              index_offset, index_length, \
 481              index_offset + fmt_size + fmt2_size, xml_size, \
 482              index_offset + fmt_size + fmt2_size + xml_size, data_size
 483      elif magic_number == _srt:  # 779317876
 484          # Roche 454 sorted index
 485          # I've had this from Roche tool sfffile when the read identifiers
 486          # had nonstandard lengths and there was no XML manifest.
 487          if (ver0, ver1, ver2, ver3) != (49, 46, 48, 48):
 488              # This is "1.00" as a string
 489              raise ValueError("Unsupported version in .srt index header, %i.%i.%i.%i"
 490                               % (ver0, ver1, ver2, ver3))
 491          data = handle.read(4)
 492          if data != _null * 4:
 493              raise ValueError(
 494                  "Did not find expected null four bytes in .srt index")
 495          return number_of_reads, header_length, \
 496              index_offset, index_length, \
 497              0, 0, \
 498              index_offset + fmt_size + 4, index_length - fmt_size - 4
 499      elif magic_number == _hsh:
 500          raise ValueError("Hash table style indexes (.hsh) in SFF files are "
 501                           "not (yet) supported")
 502      else:
 503          raise ValueError("Unknown magic number %s in SFF index header:\n%s"
 504                           % (repr(magic_number), repr(data)))
 505  
 506  
 507  def ReadRocheXmlManifest(handle):
 508      """Read any Roche style XML manifest data in the SFF "index".
 509  
 510      The SFF file format allows for multiple different index blocks, and Roche
 511      took advantage of this to define their own index block which also embeds
 512      an XML manifest string. This is not a publicly documented extension to
 513      the SFF file format; it was reverse engineered.
 514  
 515      The handle should be to an SFF file opened in binary mode. This function
 516      will use the handle seek/tell functions and leave the handle in an
 517      arbitrary location.
 518  
 519      Any XML manifest found is returned as a Python string, which you can then
 520      parse as appropriate, or reuse when writing out SFF files with the
 521      SffWriter class.
 522  
 523      Returns a string, or raises a ValueError if a Roche manifest could not be
 524      found.
 525      """
 526      number_of_reads, header_length, index_offset, index_length, xml_offset, \
 527          xml_size, read_index_offset, read_index_size = _sff_find_roche_index(
 528              handle)
 529      if not xml_offset or not xml_size:
 530          raise ValueError("No XML manifest found")
 531      handle.seek(xml_offset)
 532      return _bytes_to_string(handle.read(xml_size))
 533  
 534  
 535  # This is a generator function!
 536  def _sff_read_roche_index(handle):
 537      """Read any existing Roche style read index provided in the SFF file (PRIVATE).
 538  
 539      Will use the handle seek/tell functions.
 540  
 541      This works on ".srt1.00" and ".mft1.00" style Roche SFF index blocks.
 542  
 543      Roche SFF indices use base 255 not 256, meaning we see bytes in the
 544      range 0 to 254 only. This appears to be so that byte 0xFF (character 255)
 545      can be used as a marker character to separate entries (required if the
 546      read name lengths vary).
 547  
 548      Note that since only four bytes are used for the read offset, this is
 549      limited to 255^4 bytes (nearly 4GB). If you try to use the Roche sfffile
 550      tool to combine SFF files beyond this limit, they issue a warning and
 551      omit the index (and manifest).
 552      """
 553      number_of_reads, header_length, index_offset, index_length, xml_offset, \
 554          xml_size, read_index_offset, read_index_size = _sff_find_roche_index(
 555              handle)
 556      # Now parse the read index...
 557      handle.seek(read_index_offset)
 558      fmt = ">5B"
 559      for read in range(number_of_reads):
 560          # TODO - Be more aware of when the index should end?
 561          data = handle.read(6)
 562          while True:
 563              more = handle.read(1)
 564              if not more:
 565                  raise ValueError("Premature end of file!")
 566              data += more
 567              if more == _flag:
 568                  break
 569          assert data[-1:] == _flag, data[-1:]
 570          name = _bytes_to_string(data[:-6])
 571          off4, off3, off2, off1, off0 = struct.unpack(fmt, data[-6:-1])
 572          offset = off0 + 255 * off1 + 65025 * off2 + 16581375 * off3
 573          if off4:
 574              # Could in theory be used as a fifth piece of offset information,
 575              # i.e. offset += 4228250625*off4, but testing the Roche tools this
 576              # is not the case. They simply don't support such large indexes.
 577              raise ValueError("Expected a null terminator to the read name.")
 578          yield name, offset
 579      if handle.tell() != read_index_offset + read_index_size:
 580          raise ValueError("Problem with index length? %i vs %i"
 581                           % (handle.tell(), read_index_offset + read_index_size))
 582  
 583  _valid_UAN_read_name = re.compile(r'^[a-zA-Z0-9]{14}$')
 584  
 585  
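The base-255 offset encoding used by these Roche index blocks can be sketched in isolation. This is an illustrative example, not part of the module; the helper names (offset_to_base255, base255_to_offset) are made up for the demonstration, but the decoding arithmetic is the same formula used in _sff_read_roche_index above.

```python
# Sketch: pack and unpack a read offset as four base-255 digits.
# Each digit is 0-254, so the byte 0xFF never appears and can safely
# serve as the entry terminator described in the docstring above.

def offset_to_base255(offset):
    """Split an offset into four base-255 digits (most significant first)."""
    digits = []
    for _ in range(4):
        offset, digit = divmod(offset, 255)
        digits.insert(0, digit)
    return digits

def base255_to_offset(digits):
    """Inverse of the above; same arithmetic as the parser uses."""
    off3, off2, off1, off0 = digits
    return off0 + 255 * off1 + 65025 * off2 + 16581375 * off3

digits = offset_to_base255(65040)
print(digits)                     # [0, 1, 0, 15]
print(base255_to_offset(digits))  # 65040
```

Since each divmod remainder is strictly less than 255, no digit can ever collide with the 0xFF entry separator, which is the point of using base 255 rather than 256.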
 586  def _sff_read_seq_record(handle, number_of_flows_per_read, flow_chars,
 587                           key_sequence, alphabet, trim=False):
 588      """Parse the next read in the file, return data as a SeqRecord (PRIVATE)."""
 589      # Now on to the reads...
 590      # the read header format (fixed part):
 591      # read_header_length     H
 592      # name_length            H
 593      # seq_len                I
 594      # clip_qual_left         H
 595      # clip_qual_right        H
 596      # clip_adapter_left      H
 597      # clip_adapter_right     H
 598      # [rest of read header depends on the name length etc]
 599      read_header_fmt = '>2HI4H'
 600      read_header_size = struct.calcsize(read_header_fmt)
 601      read_flow_fmt = ">%iH" % number_of_flows_per_read
 602      read_flow_size = struct.calcsize(read_flow_fmt)
 603  
 604      read_header_length, name_length, seq_len, clip_qual_left, \
 605          clip_qual_right, clip_adapter_left, clip_adapter_right \
 606          = struct.unpack(read_header_fmt, handle.read(read_header_size))
 607      if clip_qual_left:
 608          clip_qual_left -= 1  # python counting
 609      if clip_adapter_left:
 610          clip_adapter_left -= 1  # python counting
 611      if read_header_length < 10 or read_header_length % 8 != 0:
 612          raise ValueError("Malformed read header, says length is %i"
 613                           % read_header_length)
 614      # now the name and any padding (remainder of header)
 615      name = _bytes_to_string(handle.read(name_length))
 616      padding = read_header_length - read_header_size - name_length
 617      if handle.read(padding).count(_null) != padding:
 618          import warnings
 619          from Bio import BiopythonParserWarning
 620          warnings.warn("Your SFF file is invalid, post name %i "
 621                        "byte padding region contained data" % padding,
 622                        BiopythonParserWarning)
 623      # now the flowgram values, flowgram index, bases and qualities
 624      # NOTE - assuming flowgram_format==1, which means struct type H
 625      flow_values = handle.read(read_flow_size)  # unpack later if needed
 626      temp_fmt = ">%iB" % seq_len  # used for flow index and quals
 627      flow_index = handle.read(seq_len)  # unpack later if needed
 628      seq = _bytes_to_string(handle.read(seq_len))  # TODO - Use bytes in Seq?
 629      quals = list(struct.unpack(temp_fmt, handle.read(seq_len)))
 630      # now any padding...
 631      padding = (read_flow_size + seq_len * 3) % 8
 632      if padding:
 633          padding = 8 - padding
 634          if handle.read(padding).count(_null) != padding:
 635              import warnings
 636              from Bio import BiopythonParserWarning
 637              warnings.warn("Your SFF file is invalid, post quality %i "
 638                            "byte padding region contained data" % padding,
 639                            BiopythonParserWarning)
 640      # Follow Roche and apply most aggressive of qual and adapter clipping.
 641      # Note Roche seems to ignore adapter clip fields when writing SFF,
 642      # and uses just the quality clipping values for any clipping.
 643      clip_left = max(clip_qual_left, clip_adapter_left)
 644      # Right clipping of zero means no clipping
 645      if clip_qual_right:
 646          if clip_adapter_right:
 647              clip_right = min(clip_qual_right, clip_adapter_right)
 648          else:
 649              # Typical case with Roche SFF files
 650              clip_right = clip_qual_right
 651      elif clip_adapter_right:
 652          clip_right = clip_adapter_right
 653      else:
 654          clip_right = seq_len
 655      # Now build a SeqRecord
 656      if trim:
 657          if clip_left >= clip_right:
 658              # Raise an error?
 659              import warnings
 660              from Bio import BiopythonParserWarning
 661              warnings.warn("Overlapping clip values in SFF record, trimmed to nothing",
 662                            BiopythonParserWarning)
 663              seq = ""
 664              quals = []
 665          else:
 666              seq = seq[clip_left:clip_right].upper()
 667              quals = quals[clip_left:clip_right]
 668          # Don't record the clipping values, flow etc, they make no sense now:
 669          annotations = {}
 670      else:
 671          if clip_left >= clip_right:
 672              import warnings
 673              from Bio import BiopythonParserWarning
 674              warnings.warn("Overlapping clip values in SFF record", BiopythonParserWarning)
 675              seq = seq.lower()
 676          else:
 677              # This use of mixed case mimics the Roche SFF tool's FASTA output
 678              seq = seq[:clip_left].lower() + \
 679                  seq[clip_left:clip_right].upper() + \
 680                  seq[clip_right:].lower()
 681          annotations = {"flow_values": struct.unpack(read_flow_fmt, flow_values),
 682                         "flow_index": struct.unpack(temp_fmt, flow_index),
 683                         "flow_chars": flow_chars,
 684                         "flow_key": key_sequence,
 685                         "clip_qual_left": clip_qual_left,
 686                         "clip_qual_right": clip_qual_right,
 687                         "clip_adapter_left": clip_adapter_left,
 688                         "clip_adapter_right": clip_adapter_right}
 689      if re.match(_valid_UAN_read_name, name):
 690          annotations["time"] = _get_read_time(name)
 691          annotations["region"] = _get_read_region(name)
 692          annotations["coords"] = _get_read_xy(name)
 693      record = SeqRecord(Seq(seq, alphabet),
 694                         id=name,
 695                         name=name,
 696                         description="",
 697                         annotations=annotations)
 698      # Dirty trick to speed up this line:
 699      # record.letter_annotations["phred_quality"] = quals
 700      dict.__setitem__(record._per_letter_annotations,
 701                       "phred_quality", quals)
 702      # Return the record and then continue...
 703      return record
 704  
 705  _powers_of_36 = [36 ** i for i in range(6)]
 706  
 707  
 708  def _string_as_base_36(string):
 709      """Interpret a string as a base-36 number as per 454 manual."""
 710      total = 0
 711      for c, power in zip(string[::-1], _powers_of_36):
 712          # For reference: ord('0') = 48, ord('9') = 57
 713          # For reference: ord('A') = 65, ord('Z') = 90
 714          # For reference: ord('a') = 97, ord('z') = 122
 715          if 48 <= ord(c) <= 57:
 716              val = ord(c) - 22  # equivalent to: - ord('0') + 26
 717          elif 65 <= ord(c) <= 90:
 718              val = ord(c) - 65
 719          elif 97 <= ord(c) <= 122:
 720              val = ord(c) - 97
 721          else:
 722              # Invalid character
 723              val = 0
 724          total += val * power
 725      return total
 726  
 727  
 728  def _get_read_xy(read_name):
 729      """Extract coordinates from last 5 characters of read name."""
 730      number = _string_as_base_36(read_name[9:])
 731      return divmod(number, 4096)
 732  
 733  _time_denominators = [13 * 32 * 24 * 60 * 60,
 734                        32 * 24 * 60 * 60,
 735                        24 * 60 * 60,
 736                        60 * 60,
 737                        60]
 738  
 739  
 740  def _get_read_time(read_name):
 741      """Extract time from first 6 characters of read name."""
 742      time_list = []
 743      remainder = _string_as_base_36(read_name[:6])
 744      for denominator in _time_denominators:
 745          this_term, remainder = divmod(remainder, denominator)
 746          time_list.append(this_term)
 747      time_list.append(remainder)
 748      time_list[0] += 2000
 749      return time_list
 750  
 751  
 752  def _get_read_region(read_name):
 753      """Extract region from read name."""
 754      return int(read_name[8])
 755  
 756  
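Putting the three helpers above together, here is a self-contained sketch of the 454 "Universal" Accession Number decoding (the function name base36 is illustrative, not part of this module); the example name and expected values come from the module docstring.

```python
# Sketch: decode a 14-character 454 read name into region, well
# coordinates, and run timestamp, mirroring the private helpers above.

def base36(string):
    """Interpret a string as base 36: A-Z/a-z map to 0-25, digits 0-9 to 26-35."""
    total = 0
    for power, c in enumerate(reversed(string)):
        if c.isdigit():
            val = ord(c) - ord('0') + 26
        else:
            val = ord(c.upper()) - ord('A')
        total += val * (36 ** power)
    return total

name = "E3MFGYR02F7Z7G"                 # example read from the docstring
region = int(name[8])                   # ninth character is the region digit
x, y = divmod(base36(name[9:]), 4096)   # last five characters encode X/Y
remainder = base36(name[:6])            # first six characters encode the time
time = []
for unit in (13 * 32 * 24 * 3600, 32 * 24 * 3600, 24 * 3600, 3600, 60):
    term, remainder = divmod(remainder, unit)
    time.append(term)
time.append(remainder)
time[0] += 2000                         # years are stored relative to 2000

print(region, (x, y), time)  # 2 (2434, 1658) [2008, 1, 9, 16, 16, 0]
```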
 757  def _sff_read_raw_record(handle, number_of_flows_per_read):
 758      """Extract the next read in the file as a raw (bytes) string (PRIVATE)."""
 759      read_header_fmt = '>2HI'
 760      read_header_size = struct.calcsize(read_header_fmt)
 761      read_flow_fmt = ">%iH" % number_of_flows_per_read
 762      read_flow_size = struct.calcsize(read_flow_fmt)
 763  
 764      raw = handle.read(read_header_size)
 765      read_header_length, name_length, seq_len \
 766          = struct.unpack(read_header_fmt, raw)
 767      if read_header_length < 10 or read_header_length % 8 != 0:
 768          raise ValueError("Malformed read header, says length is %i"
 769                           % read_header_length)
 770      # now the four clip values (4H = 8 bytes), and read name
 771      raw += handle.read(8 + name_length)
 772      # and any padding (remainder of header)
 773      padding = read_header_length - read_header_size - 8 - name_length
 774      pad = handle.read(padding)
 775      if pad.count(_null) != padding:
 776          import warnings
 777          from Bio import BiopythonParserWarning
 778          warnings.warn("Your SFF file is invalid, post name %i "
 779                        "byte padding region contained data" % padding,
 780                        BiopythonParserWarning)
 781      raw += pad
 782      # now the flowgram values, flowgram index, bases and qualities
 783      raw += handle.read(read_flow_size + seq_len * 3)
 784      padding = (read_flow_size + seq_len * 3) % 8
 785      # now any padding...
 786      if padding:
 787          padding = 8 - padding
 788          pad = handle.read(padding)
 789          if pad.count(_null) != padding:
 790              import warnings
 791              from Bio import BiopythonParserWarning
 792              warnings.warn("Your SFF file is invalid, post quality %i "
 793                            "byte padding region contained data" % padding,
 794                            BiopythonParserWarning)
 795          raw += pad
 796      # Return the raw bytes
 797      return raw
 798  
 799  
class _AddTellHandle(object):
    """Wrapper for handles which do not support the tell method (PRIVATE).

    Intended for use with things like network handles where tell (and reverse
    seek) are not supported. The SFF file needs to track the current offset in
    order to deal with the index block.
    """

    def __init__(self, handle):
        self._handle = handle
        self._offset = 0

    def read(self, length):
        data = self._handle.read(length)
        self._offset += len(data)
        return data

    def tell(self):
        return self._offset

    def seek(self, offset):
        if offset < self._offset:
            raise RuntimeError("Can't seek backwards")
        # Read (and discard) via self.read so that self._offset is updated,
        # keeping tell() accurate after the "seek":
        self.read(offset - self._offset)

    def close(self):
        return self._handle.close()


# This is a generator function!
def SffIterator(handle, alphabet=Alphabet.generic_dna, trim=False):
    """Iterate over Standard Flowgram Format (SFF) reads (as SeqRecord objects).

        - handle - input file, an SFF file, e.g. from Roche 454 sequencing.
          This must NOT be opened in universal read lines mode!
        - alphabet - optional alphabet, defaults to generic DNA.
        - trim - should the sequences be trimmed?

    The resulting SeqRecord objects should match those from a paired FASTA
    and QUAL file converted from the SFF file using the Roche 454 tool
    sffinfo, i.e. the sequence will be mixed case, with the trim regions
    shown in lower case.

    This function is used internally via the Bio.SeqIO functions:

    >>> from Bio import SeqIO
    >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"):
    ...     print("%s %i" % (record.id, len(record)))
    ...
    E3MFGYR02JWQ7T 265
    E3MFGYR02JA6IL 271
    E3MFGYR02JHD4H 310
    E3MFGYR02GFKUC 299
    E3MFGYR02FTGED 281
    E3MFGYR02FR9G7 261
    E3MFGYR02GAZMS 278
    E3MFGYR02HHZ8O 221
    E3MFGYR02GPGB1 269
    E3MFGYR02F7Z7G 219

    You can also call it directly:

    >>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle:
    ...     for record in SffIterator(handle):
    ...         print("%s %i" % (record.id, len(record)))
    ...
    E3MFGYR02JWQ7T 265
    E3MFGYR02JA6IL 271
    E3MFGYR02JHD4H 310
    E3MFGYR02GFKUC 299
    E3MFGYR02FTGED 281
    E3MFGYR02FR9G7 261
    E3MFGYR02GAZMS 278
    E3MFGYR02HHZ8O 221
    E3MFGYR02GPGB1 269
    E3MFGYR02F7Z7G 219

    Or, with the trim option:

    >>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle:
    ...     for record in SffIterator(handle, trim=True):
    ...         print("%s %i" % (record.id, len(record)))
    ...
    E3MFGYR02JWQ7T 260
    E3MFGYR02JA6IL 265
    E3MFGYR02JHD4H 292
    E3MFGYR02GFKUC 295
    E3MFGYR02FTGED 277
    E3MFGYR02FR9G7 256
    E3MFGYR02GAZMS 271
    E3MFGYR02HHZ8O 150
    E3MFGYR02GPGB1 221
    E3MFGYR02F7Z7G 130

    """
    if isinstance(Alphabet._get_base_alphabet(alphabet),
                  Alphabet.ProteinAlphabet):
        raise ValueError("Invalid alphabet, SFF files do not hold proteins.")
    if isinstance(Alphabet._get_base_alphabet(alphabet),
                  Alphabet.RNAAlphabet):
        raise ValueError("Invalid alphabet, SFF files do not hold RNA.")
    try:
        assert 0 == handle.tell(), "Not at start of file, offset %i" % handle.tell()
    except AttributeError:
        # Probably a network handle or something like that
        handle = _AddTellHandle(handle)
    header_length, index_offset, index_length, number_of_reads, \
        number_of_flows_per_read, flow_chars, key_sequence \
        = _sff_file_header(handle)
    # Now on to the reads...
    # the read header format (fixed part):
    # read_header_length     H
    # name_length            H
    # seq_len                I
    # clip_qual_left         H
    # clip_qual_right        H
    # clip_adapter_left      H
    # clip_adapter_right     H
    # [rest of read header depends on the name length etc]
    read_header_fmt = '>2HI4H'
    read_header_size = struct.calcsize(read_header_fmt)
    read_flow_fmt = ">%iH" % number_of_flows_per_read
    read_flow_size = struct.calcsize(read_flow_fmt)
    assert 1 == struct.calcsize(">B")
    assert 1 == struct.calcsize(">s")
    assert 1 == struct.calcsize(">c")
    assert read_header_size % 8 == 0  # Important for padding calc later!
    # The spec allows for the index block to be before or even in the middle
    # of the reads. We can check that if we keep track of our position
    # in the file...
    for read in range(number_of_reads):
        if index_offset and handle.tell() == index_offset:
            offset = index_offset + index_length
            if offset % 8:
                offset += 8 - (offset % 8)
            assert offset % 8 == 0
            handle.seek(offset)
            # Now that we've done this, we don't need to do it again. Clear
            # the index_offset so we can skip extra handle.tell() calls:
            index_offset = 0
        yield _sff_read_seq_record(handle,
                                   number_of_flows_per_read,
                                   flow_chars,
                                   key_sequence,
                                   alphabet,
                                   trim)
    _check_eof(handle, index_offset, index_length)

def _check_eof(handle, index_offset, index_length):
    """Check final padding is OK (8 byte alignment) and file ends (PRIVATE).

    Will attempt to spot apparent SFF file concatenation and give an error.

    Will not attempt to seek, only moves the handle forward.
    """
    offset = handle.tell()
    extra = b""
    padding = 0

    if index_offset and offset <= index_offset:
        # Index block then end of file...
        if offset < index_offset:
            raise ValueError("Gap of %i bytes after final record end %i, "
                             "before %i where index starts?"
                             % (index_offset - offset, offset, index_offset))
        # Doing read to jump the index rather than a seek
        # in case this is a network handle or similar
        handle.read(index_offset + index_length - offset)
        offset = index_offset + index_length
        assert offset == handle.tell(), \
            "Wanted %i, got %i, index is %i to %i" \
            % (offset, handle.tell(), index_offset, index_offset + index_length)

    if offset % 8:
        padding = 8 - (offset % 8)
        extra = handle.read(padding)

    if padding >= 4 and extra[-4:] == _sff:
        # Seen this in one user supplied file, should have been
        # four bytes of null padding but was actually .sff and
        # the start of a new concatenated SFF file!
        raise ValueError("Your SFF file is invalid, post index %i byte "
                         "null padding region ended '.sff' which could "
                         "be the start of a concatenated SFF file? "
                         "See offset %i" % (padding, offset))
    if padding and not extra:
        # TODO - Is this error harmless enough to just ignore?
        import warnings
        from Bio import BiopythonParserWarning
        warnings.warn("Your SFF file is technically invalid as it is missing "
                      "a terminal %i byte null padding region." % padding,
                      BiopythonParserWarning)
        return
    if extra.count(_null) != padding:
        import warnings
        from Bio import BiopythonParserWarning
        warnings.warn("Your SFF file is invalid, post index %i byte "
                      "null padding region contained data: %r"
                      % (padding, extra), BiopythonParserWarning)

    offset = handle.tell()
    assert offset % 8 == 0, \
        "Wanted offset %i %% 8 = %i to be zero" % (offset, offset % 8)
    # Should now be at the end of the file...
    extra = handle.read(4)
    if extra == _sff:
        raise ValueError("Additional data at end of SFF file, "
                         "perhaps multiple SFF files concatenated? "
                         "See offset %i" % offset)
    elif extra:
        raise ValueError("Additional data at end of SFF file, "
                         "see offset %i" % offset)
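The 8-byte alignment rule applied throughout the format (read headers, index block, end of file) reduces to one small calculation. A standalone sketch for clarity (the helper name `pad_to_8` is invented here, not part of Biopython):

```python
def pad_to_8(offset):
    """Bytes of null padding needed to reach the next 8-byte boundary."""
    # Equivalent to the two-branch `8 - (offset % 8)` form used in the
    # parser, but evaluates to 0 when the offset is already aligned.
    return -offset % 8


assert pad_to_8(0) == 0
assert pad_to_8(8) == 0
assert pad_to_8(10) == 6
# Agrees with the two-branch form for every small offset:
assert all(pad_to_8(n) == ((8 - n % 8) if n % 8 else 0) for n in range(64))
```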


# This is a generator function!
def _SffTrimIterator(handle, alphabet=Alphabet.generic_dna):
    """Iterate over SFF reads (as SeqRecord objects) with trimming (PRIVATE)."""
    return SffIterator(handle, alphabet, trim=True)

class SffWriter(SequenceWriter):
    """SFF file writer."""

    def __init__(self, handle, index=True, xml=None):
        """Creates the writer object.

            - handle - Output handle, ideally in binary write mode.
            - index - Boolean argument, should we try and write an index?
            - xml - Optional string argument, xml manifest to be recorded in
              the index block (see function ReadRocheXmlManifest for reading
              this data).
        """
        _check_mode(handle)
        self.handle = handle
        self._xml = xml
        if index:
            self._index = []
        else:
            self._index = None
    def write_file(self, records):
        """Use this to write an entire file containing the given records."""
        try:
            self._number_of_reads = len(records)
        except TypeError:
            self._number_of_reads = 0  # dummy value
            if not hasattr(self.handle, "seek") \
                    or not hasattr(self.handle, "tell"):
                raise ValueError("A handle with a seek/tell methods is "
                                 "required in order to record the total "
                                 "record count in the file header (once it "
                                 "is known at the end).")
        if self._index is not None and \
                not (hasattr(self.handle, "seek") and hasattr(self.handle, "tell")):
            import warnings
            warnings.warn("A handle with a seek/tell methods is required in "
                          "order to record an SFF index.")
            self._index = None
        self._index_start = 0
        self._index_length = 0
        if not hasattr(records, "next"):
            records = iter(records)
        # Get the first record in order to find the flow information
        # we will need for the header.
        try:
            record = next(records)
        except StopIteration:
            record = None
        if record is None:
            # No records -> empty SFF file (or an error)?
            # We can't write a header without the flow information.
            # return 0
            raise ValueError("Must have at least one sequence")
        try:
            self._key_sequence = _as_bytes(record.annotations["flow_key"])
            self._flow_chars = _as_bytes(record.annotations["flow_chars"])
            self._number_of_flows_per_read = len(self._flow_chars)
        except KeyError:
            raise ValueError("Missing SFF flow information")
        self.write_header()
        self.write_record(record)
        count = 1
        for record in records:
            self.write_record(record)
            count += 1
        if self._number_of_reads == 0:
            # Must go back and record the record count...
            offset = self.handle.tell()
            self.handle.seek(0)
            self._number_of_reads = count
            self.write_header()
            self.handle.seek(offset)  # not essential?
        else:
            assert count == self._number_of_reads
        if self._index is not None:
            self._write_index()
        return count
    def _write_index(self):
        assert len(self._index) == self._number_of_reads
        handle = self.handle
        self._index.sort()
        self._index_start = handle.tell()  # need for header
        # XML...
        if self._xml is not None:
            xml = _as_bytes(self._xml)
        else:
            from Bio import __version__
            xml = "<!-- This file was output with Biopython %s -->\n" % __version__
            xml += "<!-- This XML and index block attempts to mimic Roche SFF files -->\n"
            xml += "<!-- This file may be a combination of multiple SFF files etc -->\n"
            xml = _as_bytes(xml)
        xml_len = len(xml)
        # Write to the file...
        fmt = ">I4BLL"
        fmt_size = struct.calcsize(fmt)
        handle.write(_null * fmt_size + xml)  # fill this later
        fmt2 = ">6B"
        assert 6 == struct.calcsize(fmt2)
        self._index.sort()
        index_len = 0  # don't know yet!
        for name, offset in self._index:
            # Roche files record the offsets using base 255 not 256.
            # See comments for parsing the index block. There may be a faster
            # way to code this, but we can't easily use shifts due to odd base
            off3 = offset
            off0 = off3 % 255
            off3 -= off0
            off1 = off3 % 65025
            off3 -= off1
            off2 = off3 % 16581375
            off3 -= off2
            assert offset == off0 + off1 + off2 + off3, \
                "%i -> %i %i %i %i" % (offset, off0, off1, off2, off3)
            off3, off2, off1, off0 = off3 // 16581375, off2 // 65025, \
                off1 // 255, off0
            assert off0 < 255 and off1 < 255 and off2 < 255 and off3 < 255, \
                "%i -> %i %i %i %i" % (offset, off0, off1, off2, off3)
            handle.write(name + struct.pack(fmt2, 0,
                                            off3, off2, off1, off0, 255))
            index_len += len(name) + 6
        # Note any padding is not included:
        self._index_length = fmt_size + xml_len + index_len  # need for header
        # Pad out to an 8 byte boundary (although I have noticed some
        # real Roche SFF files neglect to do this despite their manual
        # suggesting this padding should be there):
        if self._index_length % 8:
            padding = 8 - (self._index_length % 8)
            handle.write(_null * padding)
        else:
            padding = 0
        offset = handle.tell()
        assert offset == self._index_start + self._index_length + padding, \
            "%i vs %i + %i + %i" % (offset, self._index_start,
                                    self._index_length, padding)
        # Must now go back and update the index header with index size...
        handle.seek(self._index_start)
        handle.write(struct.pack(fmt, 778921588,  # magic number
                                 49, 46, 48, 48,  # Roche index version, "1.00"
                                 xml_len, index_len) + xml)
        # Must now go back and update the header...
        handle.seek(0)
        self.write_header()
        handle.seek(offset)  # not essential?
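The base-255 offset packing used for the Roche index entries can be checked in isolation. A sketch for clarity (the helper name `base255_digits` is invented here), splitting an offset into its four digit values and verifying the round trip:

```python
def base255_digits(offset):
    """Split offset into four base-255 digits, most significant first."""
    off3 = offset
    off0 = off3 % 255
    off3 -= off0
    off1 = off3 % 65025       # 255 ** 2
    off3 -= off1
    off2 = off3 % 16581375    # 255 ** 3
    off3 -= off2
    digits = (off3 // 16581375, off2 // 65025, off1 // 255, off0)
    assert all(0 <= d < 255 for d in digits)
    return digits


d3, d2, d1, d0 = base255_digits(4228250624)  # largest storable offset
assert (d3, d2, d1, d0) == (254, 254, 254, 254)
# Round trip back to the offset:
assert d0 + 255 * d1 + 65025 * d2 + 16581375 * d3 == 4228250624
```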

    def write_header(self):
        # Do header...
        key_length = len(self._key_sequence)
        # file header (part one)
        # use big endian encoding >
        # magic_number               I
        # version                    4B
        # index_offset               Q
        # index_length               I
        # number_of_reads            I
        # header_length              H
        # key_length                 H
        # number_of_flows_per_read   H
        # flowgram_format_code       B
        # [rest of file header depends on the number of flows and how many keys]
        fmt = '>I4BQIIHHHB%is%is' % (
            self._number_of_flows_per_read, key_length)
        # According to the spec, the header_length field should be the total
        # number of bytes required by this set of header fields, and should be
        # equal to "31 + number_of_flows_per_read + key_length" rounded up to
        # the next value divisible by 8.
        if struct.calcsize(fmt) % 8 == 0:
            padding = 0
        else:
            padding = 8 - (struct.calcsize(fmt) % 8)
        header_length = struct.calcsize(fmt) + padding
        assert header_length % 8 == 0
        header = struct.pack(fmt, 779314790,  # magic number 0x2E736666
                             0, 0, 0, 1,  # version
                             self._index_start, self._index_length,
                             self._number_of_reads,
                             header_length, key_length,
                             self._number_of_flows_per_read,
                             1,  # the only flowgram format code we support
                             self._flow_chars, self._key_sequence)
        self.handle.write(header + _null * padding)
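The header-length arithmetic described in the comment above can be confirmed with `struct` directly. A sketch (the function name `sff_header_length` is invented here) assuming the same `'>I4BQIIHHHB'` fixed fields:

```python
import struct


def sff_header_length(number_of_flows_per_read, key_length):
    """Header length: 31 + flows + key length, rounded up to a multiple of 8."""
    fmt = '>I4BQIIHHHB%is%is' % (number_of_flows_per_read, key_length)
    size = struct.calcsize(fmt)
    # The fixed fields occupy 31 bytes, so size is 31 + flows + key length:
    assert size == 31 + number_of_flows_per_read + key_length
    return size + (-size % 8)  # round up to the next 8-byte boundary


# The fixed header fields alone occupy 31 bytes:
assert struct.calcsize('>I4BQIIHHHB') == 31
# e.g. 400 flows and a 4 base "TCAG" key: 435 bytes, padded to 440:
assert sff_header_length(400, 4) == 440
```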

    def write_record(self, record):
        """Write a single additional record to the output file.

        This assumes the header has been done.
        """
        # Basics
        name = _as_bytes(record.id)
        name_len = len(name)
        seq = _as_bytes(str(record.seq).upper())
        seq_len = len(seq)
        # Qualities
        try:
            quals = record.letter_annotations["phred_quality"]
        except KeyError:
            raise ValueError("Missing PHRED qualities information for %s" % record.id)
        # Flow
        try:
            flow_values = record.annotations["flow_values"]
            flow_index = record.annotations["flow_index"]
            if self._key_sequence != _as_bytes(record.annotations["flow_key"]) \
                    or self._flow_chars != _as_bytes(record.annotations["flow_chars"]):
                raise ValueError("Records have inconsistent SFF flow data")
        except KeyError:
            raise ValueError("Missing SFF flow information for %s" % record.id)
        except AttributeError:
            raise ValueError("Header not written yet?")
        # Clipping
        try:
            clip_qual_left = record.annotations["clip_qual_left"]
            if clip_qual_left < 0:
                raise ValueError("Negative SFF clip_qual_left value for %s" % record.id)
            if clip_qual_left:
                clip_qual_left += 1
            clip_qual_right = record.annotations["clip_qual_right"]
            if clip_qual_right < 0:
                raise ValueError("Negative SFF clip_qual_right value for %s" % record.id)
            clip_adapter_left = record.annotations["clip_adapter_left"]
            if clip_adapter_left < 0:
                raise ValueError("Negative SFF clip_adapter_left value for %s" % record.id)
            if clip_adapter_left:
                clip_adapter_left += 1
            clip_adapter_right = record.annotations["clip_adapter_right"]
            if clip_adapter_right < 0:
                raise ValueError("Negative SFF clip_adapter_right value for %s" % record.id)
        except KeyError:
            raise ValueError("Missing SFF clipping information for %s" % record.id)

        # Capture information for index
        if self._index is not None:
            offset = self.handle.tell()
            # Check the position of the final record (before sort by name)
            # Using a four-digit base 255 number, so the upper bound is
            # 254*(1)+254*(255)+254*(255**2)+254*(255**3) = 4228250624
            # or equivalently it overflows at 255**4 = 4228250625
            if offset > 4228250624:
                import warnings
                warnings.warn("Read %s has file offset %i, which is too large "
                              "to store in the Roche SFF index structure. No "
                              "index block will be recorded." % (name, offset))
                # No point recording the offsets now
                self._index = None
            else:
                self._index.append((name, self.handle.tell()))

        # the read header format (fixed part):
        # read_header_length     H
        # name_length            H
        # seq_len                I
        # clip_qual_left         H
        # clip_qual_right        H
        # clip_adapter_left      H
        # clip_adapter_right     H
        # [rest of read header depends on the name length etc]
        # name
        # flow values
        # flow index
        # sequence
        # padding
        read_header_fmt = '>2HI4H%is' % name_len
        if struct.calcsize(read_header_fmt) % 8 == 0:
            padding = 0
        else:
            padding = 8 - (struct.calcsize(read_header_fmt) % 8)
        read_header_length = struct.calcsize(read_header_fmt) + padding
        assert read_header_length % 8 == 0
        data = struct.pack(read_header_fmt,
                           read_header_length,
                           name_len, seq_len,
                           clip_qual_left, clip_qual_right,
                           clip_adapter_left, clip_adapter_right,
                           name) + _null * padding
        assert len(data) == read_header_length
        # now the flowgram values, flowgram index, bases and qualities
        # NOTE - assuming flowgram_format==1, which means struct type H
        read_flow_fmt = ">%iH" % self._number_of_flows_per_read
        read_flow_size = struct.calcsize(read_flow_fmt)
        temp_fmt = ">%iB" % seq_len  # used for flow index and quals
        data += struct.pack(read_flow_fmt, *flow_values) \
            + struct.pack(temp_fmt, *flow_index) \
            + seq \
            + struct.pack(temp_fmt, *quals)
        # now any final padding...
        padding = (read_flow_size + seq_len * 3) % 8
        if padding:
            padding = 8 - padding
        self.handle.write(data + _null * padding)


if __name__ == "__main__":
    print("Running quick self test")
    filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.sff"
    with open(filename, "rb") as handle:
        metadata = ReadRocheXmlManifest(handle)
    with open(filename, "rb") as handle:
        index1 = sorted(_sff_read_roche_index(handle))
    with open(filename, "rb") as handle:
        index2 = sorted(_sff_do_slow_index(handle))
    assert index1 == index2
    with open(filename, "rb") as handle:
        assert len(index1) == len(list(SffIterator(handle)))
    from Bio._py3k import StringIO
    from io import BytesIO
    with open(filename, "rb") as handle:
        assert len(index1) == len(list(SffIterator(BytesIO(handle.read()))))

    if sys.platform != "win32" and sys.version_info[0] < 3:
        # Can be lazy and treat as binary...
        with open(filename, "r") as handle:
            assert len(index1) == len(list(SffIterator(handle)))
        with open(filename) as handle:
            index2 = sorted(_sff_read_roche_index(handle))
        assert index1 == index2
        with open(filename, "r") as handle:
            index2 = sorted(_sff_do_slow_index(handle))
        assert index1 == index2
        with open(filename, "r") as handle:
            assert len(index1) == len(list(SffIterator(handle)))
        with open(filename, "r") as handle:
            assert len(index1) == len(list(SffIterator(BytesIO(handle.read()))))

    with open(filename, "rb") as handle:
        sff = list(SffIterator(handle))

    with open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb") as handle:
        sff2 = list(SffIterator(handle))
    assert len(sff) == len(sff2)
    for old, new in zip(sff, sff2):
        assert old.id == new.id
        assert str(old.seq) == str(new.seq)

    with open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb") as handle:
        sff2 = list(SffIterator(handle))
    assert len(sff) == len(sff2)
    for old, new in zip(sff, sff2):
        assert old.id == new.id
        assert str(old.seq) == str(new.seq)

    with open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb") as handle:
        sff2 = list(SffIterator(handle))
    assert len(sff) == len(sff2)
    for old, new in zip(sff, sff2):
        assert old.id == new.id
        assert str(old.seq) == str(new.seq)

    with open("../../Tests/Roche/E3MFGYR02_index_at_start.sff", "rb") as handle:
        sff2 = list(SffIterator(handle))
    assert len(sff) == len(sff2)
    for old, new in zip(sff, sff2):
        assert old.id == new.id
        assert str(old.seq) == str(new.seq)

    with open("../../Tests/Roche/E3MFGYR02_index_in_middle.sff", "rb") as handle:
        sff2 = list(SffIterator(handle))
    assert len(sff) == len(sff2)
    for old, new in zip(sff, sff2):
        assert old.id == new.id
        assert str(old.seq) == str(new.seq)

    with open(filename, "rb") as handle:
        sff_trim = list(SffIterator(handle, trim=True))

    with open(filename, "rb") as handle:
        print(ReadRocheXmlManifest(handle))

    from Bio import SeqIO
    filename = "../../Tests/Roche/E3MFGYR02_random_10_reads_no_trim.fasta"
    fasta_no_trim = list(SeqIO.parse(filename, "fasta"))
    filename = "../../Tests/Roche/E3MFGYR02_random_10_reads_no_trim.qual"
    qual_no_trim = list(SeqIO.parse(filename, "qual"))

    filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.fasta"
    fasta_trim = list(SeqIO.parse(filename, "fasta"))
    filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.qual"
    qual_trim = list(SeqIO.parse(filename, "qual"))

    for s, sT, f, q, fT, qT in zip(sff, sff_trim, fasta_no_trim,
                                   qual_no_trim, fasta_trim, qual_trim):
        # print("")
        print(s.id)
        # print(s.seq)
        # print(s.letter_annotations["phred_quality"])

        assert s.id == f.id == q.id
        assert str(s.seq) == str(f.seq)
        assert s.letter_annotations[
            "phred_quality"] == q.letter_annotations["phred_quality"]

        assert s.id == sT.id == fT.id == qT.id
        assert str(sT.seq) == str(fT.seq)
        assert sT.letter_annotations[
            "phred_quality"] == qT.letter_annotations["phred_quality"]

    print("Writing with a list of SeqRecords...")
    handle = BytesIO()
    w = SffWriter(handle, xml=metadata)
    w.write_file(sff)  # list
    data = handle.getvalue()
    print("And again with an iterator...")
    handle = BytesIO()
    w = SffWriter(handle, xml=metadata)
    w.write_file(iter(sff))
    assert data == handle.getvalue()
    # Check 100% identical to the original:
    filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.sff"
    with open(filename, "rb") as handle:
        original = handle.read()
    assert len(data) == len(original)
    assert data == original
    del data

    print("-" * 50)
    filename = "../../Tests/Roche/greek.sff"
    with open(filename, "rb") as handle:
        for record in SffIterator(handle):
            print(record.id)
    with open(filename, "rb") as handle:
        index1 = sorted(_sff_read_roche_index(handle))
    with open(filename, "rb") as handle:
        index2 = sorted(_sff_do_slow_index(handle))
    assert index1 == index2
    try:
        with open(filename, "rb") as handle:
            print(ReadRocheXmlManifest(handle))
        assert False, "Should fail!"
    except ValueError:
        pass

    with open(filename, "rb") as handle:
        for record in SffIterator(handle):
            pass
        try:
            for record in SffIterator(handle):
                print(record.id)
            assert False, "Should have failed"
        except ValueError as err:
            print("Checking what happens on re-reading a handle:")
            print(err)

    """
    # Ugly code to make test files...
    index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0"
    padding = len(index) % 8
    if padding:
        padding = 8 - padding
    index += chr(0) * padding
    assert len(index) % 8 == 0

    # Ugly bit of code to make a fake index at start
    records = list(SffIterator(
        open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb")))
    out_handle = open(
        "../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "w")
    index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0"
    padding = len(index) % 8
    if padding:
        padding = 8 - padding
    index += chr(0) * padding
    w = SffWriter(out_handle, index=False, xml=None)
    # Fake the header...
    w._number_of_reads = len(records)
    w._index_start = 0
    w._index_length = 0
    w._key_sequence = records[0].annotations["flow_key"]
    w._flow_chars = records[0].annotations["flow_chars"]
    w._number_of_flows_per_read = len(w._flow_chars)
    w.write_header()
    w._index_start = out_handle.tell()
    w._index_length = len(index)
    out_handle.seek(0)
    w.write_header()  # this time with index info
    w.handle.write(index)
    for record in records:
        w.write_record(record)
    out_handle.close()
    records2 = list(SffIterator(
        open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb")))
    for old, new in zip(records, records2):
        assert str(old.seq) == str(new.seq)
    i = list(_sff_do_slow_index(
        open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb")))

    # Ugly bit of code to make a fake index in middle
    records = list(SffIterator(
        open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb")))
    out_handle = open(
        "../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "w")
    index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0"
    padding = len(index) % 8
    if padding:
        padding = 8 - padding
    index += chr(0) * padding
    w = SffWriter(out_handle, index=False, xml=None)
    # Fake the header...
    w._number_of_reads = len(records)
    w._index_start = 0
    w._index_length = 0
    w._key_sequence = records[0].annotations["flow_key"]
    w._flow_chars = records[0].annotations["flow_chars"]
    w._number_of_flows_per_read = len(w._flow_chars)
    w.write_header()
    for record in records[:5]:
        w.write_record(record)
    w._index_start = out_handle.tell()
    w._index_length = len(index)
    w.handle.write(index)
    for record in records[5:]:
        w.write_record(record)
    out_handle.seek(0)
    w.write_header()  # this time with index info
    out_handle.close()
    records2 = list(SffIterator(
        open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb")))
    for old, new in zip(records, records2):
        assert str(old.seq) == str(new.seq)
    j = list(_sff_do_slow_index(
        open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb")))

    # Ugly bit of code to make a fake index at end
    records = list(SffIterator(
        open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb")))
    with open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "w") as out_handle:
        w = SffWriter(out_handle, index=False, xml=None)
        # Fake the header...
        w._number_of_reads = len(records)
        w._index_start = 0
        w._index_length = 0
        w._key_sequence = records[0].annotations["flow_key"]
        w._flow_chars = records[0].annotations["flow_chars"]
        w._number_of_flows_per_read = len(w._flow_chars)
        w.write_header()
        for record in records:
            w.write_record(record)
        w._index_start = out_handle.tell()
        w._index_length = len(index)
        out_handle.write(index)
        out_handle.seek(0)
        w.write_header()  # this time with index info
    records2 = list(SffIterator(
        open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb")))
    for old, new in zip(records, records2):
        assert str(old.seq) == str(new.seq)
    try:
        print(ReadRocheXmlManifest(
            open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb")))
        assert False, "Should fail!"
    except ValueError:
        pass
    k = list(_sff_do_slow_index(
        open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb")))
    """

    print("Done")