Package Bio :: Module bgzf
[hide private]
[frames] | no frames]

Source Code for Module Bio.bgzf

  1  #!/usr/bin/env python 
  2  # Copyright 2010-2011 by Peter Cock. 
  3  # All rights reserved. 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7  r"""Read and write BGZF compressed files (the GZIP variant used in BAM). 
  8   
  9  The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text 
 10  format (SAM), and a compressed binary format (BAM). The latter uses a 
 11  modified form of gzip compression called BGZF (Blocked GNU Zip Format), 
 12  which can be applied to any file format to provide compression with 
 13  efficient random access. BGZF is described together with the SAM/BAM 
 14  file format at http://samtools.sourceforge.net/SAM1.pdf 
 15   
 16  Please read the text below about 'virtual offsets' before using BGZF 
 17  files for random access. 
 18   
 19   
 20  Aim of this module 
 21  ------------------ 
 22   
 23  The Python gzip library can be used to read BGZF files, since for 
 24  decompression they are just (specialised) gzip files. What this 
 25  module aims to facilitate is random access to BGZF files (using the 
 26  'virtual offset' idea), and writing BGZF files (which means using 
 27  suitably sized gzip blocks and writing the extra 'BC' field in the 
 28  gzip headers). As in the gzip library, the zlib library is used 
 29  internally. 
 30   
 31  In addition to being required for random access to and writing of 
 32  BAM files, the BGZF format can also be used on other sequential 
 33  data (in the sense of one record after another), such as most of 
 34  the sequence data formats supported in Bio.SeqIO (like FASTA, 
 35  FASTQ, GenBank, etc) or large MAF alignments. 
 36   
 37  The Bio.SeqIO indexing functions use this module to support BGZF files. 
 38   
 39   
 40  Technical Introduction to BGZF 
 41  ------------------------------ 
 42   
 43  The gzip file format allows multiple compressed blocks, each of which 
 44  could be a stand alone gzip file. As an interesting bonus, this means 
 45  you can use Unix "cat" to combined to gzip files into one by 
 46  concatenating them. Also, each block can have one of several compression 
 47  levels (including uncompressed, which actually takes up a little bit 
 48  more space due to the gzip header). 
 49   
 50  What the BAM designers realised was that while random access to data 
 51  stored in traditional gzip files was slow, breaking the file into 
 52  gzip blocks would allow fast random access to each block. To access 
 53  a particular piece of the decompressed data, you just need to know 
 54  which block it starts in (the offset of the gzip block start), and 
 55  how far into the (decompressed) contents of the block you need to 
 56  read. 
 57   
 58  One problem with this is finding the gzip block sizes efficiently. 
 59  You can do it with a standard gzip file, but it requires every block 
 60  to be decompressed -- and that would be rather slow. Additionally 
 61  typical gzip files may use very large blocks. 
 62   
 63  All that differs in BGZF is that compressed size of each gzip block 
 64  is limited to 2^16 bytes, and an extra 'BC' field in the gzip header 
 65  records this size. Traditional decompression tools can ignore this, 
 66  and unzip the file just like any other gzip file. 
 67   
 68  The point of this is you can look at the first BGZF block, find out 
 69  how big it is from this 'BC' header, and thus seek immediately to 
 70  the second block, and so on. 
 71   
 72  The BAM indexing scheme records read positions using a 64 bit 
 73  'virtual offset', comprising coffset<<16|uoffset, where coffset is 
 74  the file offset of the BGZF block containing the start of the read 
 75  (unsigned integer using up to 64-16 = 48 bits), and uoffset is the 
 76  offset within the (decompressed) block (unsigned 16 bit integer). 
 77   
 78  This limits you to BAM files where the last block starts by 2^48 
 79  bytes, or 256 petabytes, and the decompressed size of each block 
 80  is at most 2^16 bytes, or 64kb. Note that this matches the BGZF 
 81  'BC' field size which limits the compressed size of each block to 
 82  2^16 bytes, allowing for BAM files to use BGZF with no gzip 
 83  compression (useful for intermediate files in memory to reduced 
 84  CPU load). 
 85   
 86   
 87  Warning about namespaces 
 88  ------------------------ 
 89   
 90  It is considered a bad idea to use "from XXX import *" in Python, because 
 91  it pollutes the namespace. This is a real issue with Bio.bgzf (and the 
 92  standard Python library gzip) because they contain a function called open 
 93  i.e. Suppose you do this: 
 94   
 95  >>> from Bio.bgzf import * 
 96  >>> print open.__module__ 
 97  Bio.bgzf 
 98   
 99  Or, 
100   
101  >>> from gzip import * 
102  >>> print open.__module__ 
103  gzip 
104   
105  Notice that the open function has been replaced. You can "fix" this if you 
106  need to by importing the built-in open function: 
107   
108  >>> from __builtin__ import open 
109   
110  However, what we recommend instead is to use the explicit namespace, e.g. 
111   
112  >>> from Bio import bgzf 
113  >>> print bgzf.open.__module__ 
114  Bio.bgzf 
115   
116   
117  Example 
118  ------- 
119   
120  This is an ordinary GenBank file compressed using BGZF, so it can 
121  be decompressed using gzip, 
122   
123  >>> import gzip 
124  >>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r") 
125  >>> assert 0 == handle.tell() 
126  >>> line = handle.readline() 
127  >>> assert 80 == handle.tell() 
128  >>> line = handle.readline() 
129  >>> assert 143 == handle.tell() 
130  >>> data = handle.read(70000) 
131  >>> assert 70143 == handle.tell() 
132  >>> handle.close() 
133   
134  We can also access the file using the BGZF reader - but pay 
135  attention to the file offsets which will be explained below: 
136   
137  >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r") 
138  >>> assert 0 == handle.tell() 
139  >>> print handle.readline().rstrip() 
140  LOCUS       NC_000932             154478 bp    DNA     circular PLN 15-APR-2009 
141  >>> assert 80 == handle.tell() 
142  >>> print handle.readline().rstrip() 
143  DEFINITION  Arabidopsis thaliana chloroplast, complete genome. 
144  >>> assert 143 == handle.tell() 
145  >>> data = handle.read(70000) 
146  >>> assert 987828735 == handle.tell() 
147  >>> print handle.readline().rstrip() 
148  f="GeneID:844718" 
149  >>> print handle.readline().rstrip() 
150       CDS             complement(join(84337..84771,85454..85843)) 
151  >>> offset = handle.seek(make_virtual_offset(55074, 126)) 
152  >>> print handle.readline().rstrip() 
153      68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat 
154  >>> handle.close() 
155   
156  Notice the handle's offset looks different as a BGZF file. This 
157  brings us to the key point about BGZF, which is the block structure: 
158   
159  >>> handle = open("GenBank/NC_000932.gb.bgz", "rb") 
160  >>> for values in BgzfBlocks(handle): 
161  ...     print "Raw start %i, raw length %i; data start %i, data length %i" % values 
162  Raw start 0, raw length 15073; data start 0, data length 65536 
163  Raw start 15073, raw length 17857; data start 65536, data length 65536 
164  Raw start 32930, raw length 22144; data start 131072, data length 65536 
165  Raw start 55074, raw length 22230; data start 196608, data length 65536 
166  Raw start 77304, raw length 14939; data start 262144, data length 43478 
167  Raw start 92243, raw length 28; data start 305622, data length 0 
168  >>> handle.close() 
169   
170  In this example the first three blocks are 'full' and hold 65536 bytes 
171  of uncompressed data. The fourth block isn't full and holds 43478 bytes. 
172  Finally there is a special empty fifth block which takes 28 bytes on 
173  disk and serves as an 'end of file' (EOF) marker. If this is missing, 
174  it is possible your BGZF file is incomplete. 
175   
176  By reading ahead 70,000 bytes we moved into the second BGZF block, 
177  and at that point the BGZF virtual offsets start to look different 
178  a simple offset into the decompressed data as exposed by the gzip 
179  library. 
180   
181  As an example, consider seeking to the decompressed position 196734. 
182  Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this 
183  is equivalent to jumping the first three blocks (which in this 
184  specific example are all size 65536 after decompression - which 
185  does not always hold) and starting at byte 126 of the fourth block 
186  (after decompression). For BGZF, we need to know the fourth block's 
187  offset of 55074 and the offset within the block of 126 to get the 
188  BGZF virtual offset. 
189   
190  >>> print 55074 << 16 | 126 
191  3609329790 
192  >>> print bgzf.make_virtual_offset(55074, 126) 
193  3609329790 
194   
195  Thus for this BGZF file, decompressed position 196734 corresponds 
196  to the virtual offset 3609329790. However, another BGZF file with 
197  different contents would have compressed more or less efficiently, 
198  so the compressed blocks would be different sizes. What this means 
199  is the mapping between the uncompressed offset and the compressed 
200  virtual offset depends on the BGZF file you are using. 
201   
202  If you are accessing a BGZF file via this module, just use the 
203  handle.tell() method to note the virtual offset of a position you 
204  may later want to return to using handle.seek(). 
205   
206  The catch with BGZF virtual offsets is while they can be compared 
207  (which offset comes first in the file), you cannot safely subtract 
208  them to get the size of the data between them, nor add/subtract 
209  a relative offset. 
210   
211  Of course you can parse this file with Bio.SeqIO using BgzfReader, 
212  although there isn't any benefit over using gzip.open(...), unless 
213  you want to index BGZF compressed sequence files: 
214   
215  >>> from Bio import SeqIO 
216  >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz") 
217  >>> record = SeqIO.read(handle, "genbank") 
218  >>> handle.close() 
219  >>> print record.id 
220  NC_000932.1 
221   
222  """ 
223   
224  import zlib 
225  import struct 
226  import __builtin__  # to access the usual open function 
227   
228  from Bio._py3k import _as_bytes, _as_string 
229   
230  #For Python 2 can just use: _bgzf_magic = '\x1f\x8b\x08\x04' 
231  #but need to use bytes on Python 3 
232  _bgzf_magic = _as_bytes("\x1f\x8b\x08\x04") 
233  _bgzf_header = _as_bytes("\x1f\x8b\x08\x04\x00\x00\x00\x00" 
234                           "\x00\xff\x06\x00\x42\x43\x02\x00") 
235  _bgzf_eof = _as_bytes("\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC" + 
236                        "\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00") 
237  _bytes_BC = _as_bytes("BC") 
238  _empty_bytes_string = _as_bytes("") 
239  _bytes_newline = _as_bytes("\n") 
240   
241   
242 -def open(filename, mode="rb"):
243 """Open a BGZF file for reading, writing or appending.""" 244 if "r" in mode.lower(): 245 return BgzfReader(filename, mode) 246 elif "w" in mode.lower() or "a" in mode.lower(): 247 return BgzfWriter(filename, mode) 248 else: 249 raise ValueError("Bad mode %r" % mode)
250 251
252 -def make_virtual_offset(block_start_offset, within_block_offset):
253 """Compute a BGZF virtual offset from block start and within block offsets. 254 255 The BAM indexing scheme records read positions using a 64 bit 256 'virtual offset', comprising in C terms: 257 258 block_start_offset<<16 | within_block_offset 259 260 Here block_start_offset is the file offset of the BGZF block 261 start (unsigned integer using up to 64-16 = 48 bits), and 262 within_block_offset within the (decompressed) block (unsigned 263 16 bit integer). 264 265 >>> make_virtual_offset(0,0) 266 0 267 >>> make_virtual_offset(0,1) 268 1 269 >>> make_virtual_offset(0, 2**16 - 1) 270 65535 271 >>> make_virtual_offset(0, 2**16) 272 Traceback (most recent call last): 273 ... 274 ValueError: Require 0 <= within_block_offset < 2**16, got 65536 275 276 >>> 65536 == make_virtual_offset(1,0) 277 True 278 >>> 65537 == make_virtual_offset(1,1) 279 True 280 >>> 131071 == make_virtual_offset(1, 2**16 - 1) 281 True 282 283 >>> 6553600000 == make_virtual_offset(100000,0) 284 True 285 >>> 6553600001 == make_virtual_offset(100000,1) 286 True 287 >>> 6553600010 == make_virtual_offset(100000,10) 288 True 289 290 >>> make_virtual_offset(2**48,0) 291 Traceback (most recent call last): 292 ... 293 ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656 294 295 """ 296 if within_block_offset < 0 or within_block_offset >= 65536: 297 raise ValueError("Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset) 298 if block_start_offset < 0 or block_start_offset >= 281474976710656: 299 raise ValueError("Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset) 300 return (block_start_offset<<16) | within_block_offset
301 302
303 -def split_virtual_offset(virtual_offset):
304 """Divides a 64-bit BGZF virtual offset into block start & within block offsets. 305 306 >>> (100000, 0) == split_virtual_offset(6553600000) 307 True 308 >>> (100000, 10) == split_virtual_offset(6553600010) 309 True 310 311 """ 312 start = virtual_offset>>16 313 return start, virtual_offset ^ (start<<16)
314 315
316 -def BgzfBlocks(handle):
317 """Low level debugging function to inspect BGZF blocks. 318 319 Returns the block start offset (see virtual offsets), the block 320 length (add these for the start of the next block), and the 321 decompressed length of the blocks contents (limited to 65536 in 322 BGZF). 323 324 >>> from __builtin__ import open 325 >>> handle = open("SamBam/ex1.bam", "rb") 326 >>> for values in BgzfBlocks(handle): 327 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values 328 Raw start 0, raw length 18239; data start 0, data length 65536 329 Raw start 18239, raw length 18223; data start 65536, data length 65536 330 Raw start 36462, raw length 18017; data start 131072, data length 65536 331 Raw start 54479, raw length 17342; data start 196608, data length 65536 332 Raw start 71821, raw length 17715; data start 262144, data length 65536 333 Raw start 89536, raw length 17728; data start 327680, data length 65536 334 Raw start 107264, raw length 17292; data start 393216, data length 63398 335 Raw start 124556, raw length 28; data start 456614, data length 0 336 >>> handle.close() 337 338 Indirectly we can tell this file came from an old version of 339 samtools because all the blocks (except the final one and the 340 dummy empty EOF marker block) are 65536 bytes. Later versions 341 avoid splitting a read between two blocks, and give the header 342 its own block (useful to speed up replacing the header). You 343 can see this in ex1_refresh.bam created using samtools 0.1.18: 344 345 samtools view -b ex1.bam > ex1_refresh.bam 346 347 >>> handle = open("SamBam/ex1_refresh.bam", "rb") 348 >>> for values in BgzfBlocks(handle): 349 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values 350 Raw start 0, raw length 53; data start 0, data length 38 351 Raw start 53, raw length 18195; data start 38, data length 65434 352 Raw start 18248, raw length 18190; data start 65472, data length 65409 353 Raw start 36438, raw length 18004; data start 130881, data length 65483 354 Raw start 54442, raw length 17353; data start 196364, data length 65519 355 Raw start 71795, raw length 17708; data start 261883, data length 65411 356 Raw start 89503, raw length 17709; data start 327294, data length 65466 357 Raw start 107212, raw length 17390; data start 392760, data length 63854 358 Raw start 124602, raw length 28; data start 456614, data length 0 359 >>> handle.close() 360 361 The above example has no embedded SAM header (thus the first block 362 is very small at just 38 bytes of decompressed data), while the next 363 example does (a larger block of 103 bytes). Notice that the rest of 364 the blocks show the same sizes (they contain the same read data): 365 366 >>> handle = open("SamBam/ex1_header.bam", "rb") 367 >>> for values in BgzfBlocks(handle): 368 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values 369 Raw start 0, raw length 104; data start 0, data length 103 370 Raw start 104, raw length 18195; data start 103, data length 65434 371 Raw start 18299, raw length 18190; data start 65537, data length 65409 372 Raw start 36489, raw length 18004; data start 130946, data length 65483 373 Raw start 54493, raw length 17353; data start 196429, data length 65519 374 Raw start 71846, raw length 17708; data start 261948, data length 65411 375 Raw start 89554, raw length 17709; data start 327359, data length 65466 376 Raw start 107263, raw length 17390; data start 392825, data length 63854 377 Raw start 124653, raw length 28; data start 456679, data length 0 378 >>> handle.close() 379 380 """ 381 data_start = 0 382 while True: 383 start_offset = handle.tell() 384 #This may raise StopIteration which is perfect here 385 block_length, data = _load_bgzf_block(handle) 386 data_len = len(data) 387 yield start_offset, block_length, data_start, data_len 388 data_start += data_len
389 390
391 -def _load_bgzf_block(handle, text_mode=False):
392 """Internal function to load the next BGZF function (PRIVATE).""" 393 magic = handle.read(4) 394 if not magic: 395 #End of file 396 raise StopIteration 397 if magic != _bgzf_magic: 398 raise ValueError(r"A BGZF (e.g. a BAM file) block should start with " 399 r"%r, not %r; handle.tell() now says %r" 400 % (_bgzf_magic, magic, handle.tell())) 401 gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = \ 402 struct.unpack("<LBBH", handle.read(8)) 403 404 block_size = None 405 x_len = 0 406 while x_len < extra_len: 407 subfield_id = handle.read(2) 408 subfield_len = struct.unpack("<H", handle.read(2))[0] # uint16_t 409 subfield_data = handle.read(subfield_len) 410 x_len += subfield_len + 4 411 if subfield_id == _bytes_BC: 412 assert subfield_len == 2, "Wrong BC payload length" 413 assert block_size is None, "Two BC subfields?" 414 block_size = struct.unpack("<H", subfield_data)[0] + 1 # uint16_t 415 assert x_len == extra_len, (x_len, extra_len) 416 assert block_size is not None, "Missing BC, this isn't a BGZF file!" 417 #Now comes the compressed data, CRC, and length of uncompressed data. 418 deflate_size = block_size - 1 - extra_len - 19 419 d = zlib.decompressobj(-15) # Negative window size means no headers 420 data = d.decompress(handle.read(deflate_size)) + d.flush() 421 expected_crc = handle.read(4) 422 expected_size = struct.unpack("<I", handle.read(4))[0] 423 assert expected_size == len(data), \ 424 "Decompressed to %i, not %i" % (len(data), expected_size) 425 #Should cope with a mix of Python platforms... 426 crc = zlib.crc32(data) 427 if crc < 0: 428 crc = struct.pack("<i", crc) 429 else: 430 crc = struct.pack("<I", crc) 431 assert expected_crc == crc, \ 432 "CRC is %s, not %s" % (crc, expected_crc) 433 if text_mode: 434 return block_size, _as_string(data) 435 else: 436 return block_size, data
437 438
439 -class BgzfReader(object):
440 r"""BGZF reader, acts like a read only handle but seek/tell differ. 441 442 Let's use the BgzfBlocks function to have a peak at the BGZF blocks 443 in an example BAM file, 444 445 >>> from __builtin__ import open 446 >>> handle = open("SamBam/ex1.bam", "rb") 447 >>> for values in BgzfBlocks(handle): 448 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values 449 Raw start 0, raw length 18239; data start 0, data length 65536 450 Raw start 18239, raw length 18223; data start 65536, data length 65536 451 Raw start 36462, raw length 18017; data start 131072, data length 65536 452 Raw start 54479, raw length 17342; data start 196608, data length 65536 453 Raw start 71821, raw length 17715; data start 262144, data length 65536 454 Raw start 89536, raw length 17728; data start 327680, data length 65536 455 Raw start 107264, raw length 17292; data start 393216, data length 63398 456 Raw start 124556, raw length 28; data start 456614, data length 0 457 >>> handle.close() 458 459 Now let's see how to use this block information to jump to 460 specific parts of the decompressed BAM file: 461 462 >>> handle = BgzfReader("SamBam/ex1.bam", "rb") 463 >>> assert 0 == handle.tell() 464 >>> magic = handle.read(4) 465 >>> assert 4 == handle.tell() 466 467 So far nothing so strange, we got the magic marker used at the 468 start of a decompressed BAM file, and the handle position makes 469 sense. Now however, let's jump to the end of this block and 4 470 bytes into the next block by reading 65536 bytes, 471 472 >>> data = handle.read(65536) 473 >>> len(data) 474 65536 475 >>> assert 1195311108 == handle.tell() 476 477 Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit 478 virtual offset, which means: 479 480 >>> split_virtual_offset(1195311108) 481 (18239, 4) 482 483 You should spot 18239 as the start of the second BGZF block, while 484 the 4 is the offset into this block. See also make_virtual_offset, 485 486 >>> make_virtual_offset(18239, 4) 487 1195311108 488 489 Let's jump back to almost the start of the file, 490 491 >>> make_virtual_offset(0, 2) 492 2 493 >>> handle.seek(2) 494 2 495 >>> handle.close() 496 497 Note that you can use the max_cache argument to limit the number of 498 BGZF blocks cached in memory. The default is 100, and since each 499 block can be up to 64kb, the default cache could take up to 6MB of 500 RAM. The cache is not important for reading through the file in one 501 pass, but is important for improving performance of random access. 502 """ 503
504 - def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100):
505 #TODO - Assuming we can seek, check for 28 bytes EOF empty block 506 #and if missing warn about possible truncation (as in samtools)? 507 if max_cache < 1: 508 raise ValueError("Use max_cache with a minimum of 1") 509 #Must open the BGZF file in binary mode, but we may want to 510 #treat the contents as either text or binary (unicode or 511 #bytes under Python 3) 512 if fileobj: 513 assert filename is None 514 handle = fileobj 515 assert "b" in handle.mode.lower() 516 else: 517 if "w" in mode.lower() \ 518 or "a" in mode.lower(): 519 raise ValueError("Must use read mode (default), not write or append mode") 520 handle = __builtin__.open(filename, "rb") 521 self._text = "b" not in mode.lower() 522 if self._text: 523 self._newline = "\n" 524 else: 525 self._newline = _bytes_newline 526 self._handle = handle 527 self.max_cache = max_cache 528 self._buffers = {} 529 self._block_start_offset = None 530 self._block_raw_length = None 531 self._load_block(handle.tell())
532
533 - def _load_block(self, start_offset=None):
534 if start_offset is None: 535 #If the file is being read sequentially, then _handle.tell() 536 #should be pointing at the start of the next block. 537 #However, if seek has been used, we can't assume that. 538 start_offset = self._block_start_offset + self._block_raw_length 539 if start_offset == self._block_start_offset: 540 self._within_block_offset = 0 541 return 542 elif start_offset in self._buffers: 543 #Already in cache 544 self._buffer, self._block_raw_length = self._buffers[start_offset] 545 self._within_block_offset = 0 546 self._block_start_offset = start_offset 547 return 548 #Must hit the disk... first check cache limits, 549 while len(self._buffers) >= self.max_cache: 550 #TODO - Implemente LRU cache removal? 551 self._buffers.popitem() 552 #Now load the block 553 handle = self._handle 554 if start_offset is not None: 555 handle.seek(start_offset) 556 self._block_start_offset = handle.tell() 557 try: 558 block_size, self._buffer = _load_bgzf_block(handle, self._text) 559 except StopIteration: 560 #EOF 561 block_size = 0 562 if self._text: 563 self._buffer = "" 564 else: 565 self._buffer = _empty_bytes_string 566 self._within_block_offset = 0 567 self._block_raw_length = block_size 568 #Finally save the block in our cache, 569 self._buffers[self._block_start_offset] = self._buffer, block_size
570
571 - def tell(self):
572 """Returns a 64-bit unsigned BGZF virtual offset.""" 573 if 0 < self._within_block_offset == len(self._buffer): 574 #Special case where we're right at the end of a (non empty) block. 575 #For non-maximal blocks could give two possible virtual offsets, 576 #but for a maximal block can't use 65536 as the within block 577 #offset. Therefore for consistency, use the next block and a 578 #within block offset of zero. 579 return (self._block_start_offset + self._block_raw_length) << 16 580 else: 581 #return make_virtual_offset(self._block_start_offset, 582 # self._within_block_offset) 583 #TODO - Include bounds checking as in make_virtual_offset? 584 return (self._block_start_offset<<16) | self._within_block_offset
585
586 - def seek(self, virtual_offset):
587 """Seek to a 64-bit unsigned BGZF virtual offset.""" 588 #Do this inline to avoid a function call, 589 #start_offset, within_block = split_virtual_offset(virtual_offset) 590 start_offset = virtual_offset>>16 591 within_block = virtual_offset ^ (start_offset<<16) 592 if start_offset != self._block_start_offset: 593 #Don't need to load the block if already there 594 #(this avoids a function call since _load_block would do nothing) 595 self._load_block(start_offset) 596 assert start_offset == self._block_start_offset 597 if within_block >= len(self._buffer) \ 598 and not (within_block == 0 and len(self._buffer)==0): 599 raise ValueError("Within offset %i but block size only %i" 600 % (within_block, len(self._buffer))) 601 self._within_block_offset = within_block 602 #assert virtual_offset == self.tell(), \ 603 # "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \ 604 # % (virtual_offset, start_offset, within_block, 605 # self.tell(), self._block_start_offset, self._within_block_offset) 606 return virtual_offset
607
608 - def read(self, size=-1):
609 if size < 0: 610 raise NotImplementedError("Don't be greedy, that could be massive!") 611 elif size == 0: 612 if self._text: 613 return "" 614 else: 615 return _empty_bytes_string 616 elif self._within_block_offset + size <= len(self._buffer): 617 #This may leave us right at the end of a block 618 #(lazy loading, don't load the next block unless we have too) 619 data = self._buffer[self._within_block_offset:self._within_block_offset + size] 620 self._within_block_offset += size 621 assert data # Must be at least 1 byte 622 return data 623 else: 624 data = self._buffer[self._within_block_offset:] 625 size -= len(data) 626 self._load_block() # will reset offsets 627 #TODO - Test with corner case of an empty block followed by 628 #a non-empty block 629 if not self._buffer: 630 return data # EOF 631 elif size: 632 #TODO - Avoid recursion 633 return data + self.read(size) 634 else: 635 #Only needed the end of the last block 636 return data
637
638 - def readline(self):
639 i = self._buffer.find(self._newline, self._within_block_offset) 640 #Three cases to consider, 641 if i==-1: 642 #No newline, need to read in more data 643 data = self._buffer[self._within_block_offset:] 644 self._load_block() # will reset offsets 645 if not self._buffer: 646 return data # EOF 647 else: 648 #TODO - Avoid recursion 649 return data + self.readline() 650 elif i + 1 == len(self._buffer): 651 #Found new line, but right at end of block (SPECIAL) 652 data = self._buffer[self._within_block_offset:] 653 #Must now load the next block to ensure tell() works 654 self._load_block() # will reset offsets 655 assert data 656 return data 657 else: 658 #Found new line, not at end of block (easy case, no IO) 659 data = self._buffer[self._within_block_offset:i+1] 660 self._within_block_offset = i + 1 661 #assert data.endswith(self._newline) 662 return data
663
664 - def next(self):
665 line = self.readline() 666 if not line: 667 raise StopIteration 668 return line
669
670 - def __iter__(self):
671 return self
672
673 - def close(self):
674 self._handle.close() 675 self._buffer = None 676 self._block_start_offset = None 677 self._buffers = None
678
679 - def seekable(self):
680 return True
681
682 - def isatty(self):
683 return False
684
685 - def fileno(self):
686 return self._handle.fileno()
687 688
689 -class BgzfWriter(object):
690
691 - def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6):
692 if fileobj: 693 assert filename is None 694 handle = fileobj 695 else: 696 if "w" not in mode.lower() \ 697 and "a" not in mode.lower(): 698 raise ValueError("Must use write or append mode, not %r" % mode) 699 if "a" in mode.lower(): 700 handle = __builtin__.open(filename, "ab") 701 else: 702 handle = __builtin__.open(filename, "wb") 703 self._text = "b" not in mode.lower() 704 self._handle = handle 705 self._buffer = _empty_bytes_string 706 self.compresslevel = compresslevel
707
708 - def _write_block(self, block):
709 #print "Saving %i bytes" % len(block) 710 start_offset = self._handle.tell() 711 assert len(block) <= 65536 712 #Giving a negative window bits means no gzip/zlib headers, -15 used in samtools 713 c = zlib.compressobj(self.compresslevel, 714 zlib.DEFLATED, 715 -15, 716 zlib.DEF_MEM_LEVEL, 717 0) 718 compressed = c.compress(block) + c.flush() 719 del c 720 assert len(compressed) < 65536, "TODO - Didn't compress enough, try less data in this block" 721 crc = zlib.crc32(block) 722 #Should cope with a mix of Python platforms... 723 if crc < 0: 724 crc = struct.pack("<i", crc) 725 else: 726 crc = struct.pack("<I", crc) 727 bsize = struct.pack("<H", len(compressed)+25) # includes -1 728 crc = struct.pack("<I", zlib.crc32(block) & 0xffffffffL) 729 uncompressed_length = struct.pack("<I", len(block)) 730 #Fixed 16 bytes, 731 # gzip magic bytes (4) mod time (4), 732 # gzip flag (1), os (1), extra length which is six (2), 733 # sub field which is BC (2), sub field length of two (2), 734 #Variable data, 735 #2 bytes: block length as BC sub field (2) 736 #X bytes: the data 737 #8 bytes: crc (4), uncompressed data length (4) 738 data = _bgzf_header + bsize + compressed + crc + uncompressed_length 739 self._handle.write(data)
740
741 - def write(self, data):
742 #TODO - Check bytes vs unicode 743 data = _as_bytes(data) 744 #block_size = 2**16 = 65536 745 data_len = len(data) 746 if len(self._buffer) + data_len < 65536: 747 #print "Cached %r" % data 748 self._buffer += data 749 return 750 else: 751 #print "Got %r, writing out some data..." % data 752 self._buffer += data 753 while len(self._buffer) >= 65536: 754 self._write_block(self._buffer[:65536]) 755 self._buffer = self._buffer[65536:]
756
757 - def flush(self):
758 while len(self._buffer) >= 65536: 759 self._write_block(self._buffer[:65535]) 760 self._buffer = self._buffer[65535:] 761 self._write_block(self._buffer) 762 self._buffer = _empty_bytes_string 763 self._handle.flush()
764
765 - def close(self):
766 """Flush data, write 28 bytes empty BGZF EOF marker, and close the BGZF file.""" 767 if self._buffer: 768 self.flush() 769 #samtools will look for a magic EOF marker, just a 28 byte empty BGZF block, 770 #and if it is missing warns the BAM file may be truncated. In addition to 771 #samtools writing this block, so too does bgzip - so we should too. 772 self._handle.write(_bgzf_eof) 773 self._handle.flush() 774 self._handle.close()
775
776 - def tell(self):
777 """Returns a BGZF 64-bit virtual offset.""" 778 return make_virtual_offset(self._handle.tell(), len(self._buffer))
779
780 - def seekable(self):
781 #Not seekable, but we do support tell... 782 return False
783
784 - def isatty(self):
785 return False
786
787 - def fileno(self):
788 return self._handle.fileno()
789 790 791 if __name__ == "__main__": 792 import sys 793 if len(sys.argv) > 1: 794 print "Call this with no arguments and pipe uncompressed data in on stdin" 795 print "and it will produce BGZF compressed data on stdout. e.g." 796 print 797 print "./bgzf.py < example.fastq > example.fastq.bgz" 798 print 799 print "The extension convention of *.bgz is to distinugish these from *.gz" 800 print "used for standard gzipped files without the block structure of BGZF." 801 print "You can use the standard gunzip command to decompress BGZF files," 802 print "if it complains about the extension try something like this:" 803 print 804 print "cat example.fastq.bgz | gunzip > example.fastq" 805 print 806 print "See also the tool bgzip that comes with samtools" 807 sys.exit(0) 808 809 sys.stderr.write("Producing BGZF output from stdin...\n") 810 w = BgzfWriter(fileobj=sys.stdout) 811 while True: 812 data = sys.stdin.read(65536) 813 w.write(data) 814 if not data: 815 break 816 #Doing close with write an empty BGZF block as EOF marker: 817 w.close() 818 sys.stderr.write("BGZF data produced\n") 819