Talk:Split fasta file

From Biopython
Revision as of 09:28, 18 April 2009 by Davidw (Talk | contribs)

I like the idea for this cookbook example, but I don't like the implementation so much, mainly because you start by loading the whole file into memory! An iterator approach could keep only one record in memory at a time (or at least, only one batch in memory at a time). Peter

  • That's what happens when you let a lab-rat write code! I should have thought about this, but I was keen to get something up as an example of how this would work. Your approach below is obviously a much better and more general one. I've set it up at Split_large_file and made this page redirect to that one. Feel free to comment on or edit that entry. --Davidw 09:28, 18 April 2009 (UTC)
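To make the memory point concrete, here is a minimal sketch of reading records lazily, one at a time. The name iter_fasta is hypothetical and this is not Biopython's parser; it only assumes a simple FASTA layout where each record starts with a ">" line.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples one at a time (illustrative only).

    Only the record currently being built is held in memory, unlike
    calling list(...) on the whole file.
    """
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts, so hand back the previous one first
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    # Do not forget the final record
    if title is not None:
        yield title, "".join(chunks)


# A small in-memory stand-in for a real file handle
example = io.StringIO(u">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
records = iter_fasta(example)
first = next(records)
second = next(records)
```

Because iter_fasta is a generator, nothing is read until next() is called, so a large file is never loaded in one go.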

We should retitle this example as the idea isn't FASTA specific - any SeqIO format would do.

How about this - note that the idea is actually very general:

def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            # Skip the empty batch left over when the input ends
            # exactly on a batch boundary
            yield batch


from Bio import SeqIO

record_iter = SeqIO.parse(open("SRR014849.fastq"), "fastq")
for i, batch in enumerate(batch_iterator(record_iter, 10000)):
    filename = "group_%i.fastq" % (i + 1)
    # The with block closes each output file once it has been written
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, "fastq")
    print("Wrote %i records to %s" % (count, filename))

And here is the output, using SRR014849.fastq (available as a compressed file from the NCBI):

Wrote 10000 records to group_1.fastq
Wrote 10000 records to group_2.fastq
Wrote 10000 records to group_3.fastq
Wrote 10000 records to group_4.fastq
Wrote 7348 records to group_5.fastq

You could tweak the final section to label the filenames as in your example, if you like. Peter
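As a quick check that the batching idea really is general, here is a minimal alternative sketch built on itertools.islice from the standard library. The name batch_iterator_islice is hypothetical, and this is a swapped-in variant rather than the function above; it is tried here on plain integers and on lines from a file-like object, so it does not need Biopython.

```python
import io
from itertools import islice


def batch_iterator_islice(iterable, batch_size):
    """Yield lists of up to batch_size entries from any iterable."""
    iterator = iter(iterable)
    while True:
        # islice pulls at most batch_size entries from the shared iterator
        batch = list(islice(iterator, batch_size))
        if not batch:
            # Input exhausted, so stop without yielding an empty list
            return
        yield batch


# Plain integers: the final batch is shorter than the rest
number_batches = list(batch_iterator_islice(range(7), 3))

# Lines from a file-like object work the same way
line_batches = list(batch_iterator_islice(io.StringIO(u"a\nb\nc\n"), 2))
```

The same call should work unchanged on the iterator returned by Bio.SeqIO.parse(...), since it only relies on the iterator protocol.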
