PDBParser

From Biopython
Revision as of 14:04, 12 May 2011 by Joaor (Talk | contribs)

This is a draft page for the PDBParser class.

Benchmark

A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.

Datasets

CATH Domain Collection - 11330 Structures containing only coordinate information (no Element assigned).

Protein Data Bank Collection - 72836 Structures containing both headers and coordinate information.

Versions Tested

Biopython 1.49 (Nov. 2008)

Biopython 1.57+ (May 2011) | Element column auto-assignment.

Biopython 1.58- (May 2011 @ Github) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.

Benchmarking Script

The following script was used to benchmark the parser. It calls the garbage collection module (gc) explicitly after each structure is parsed; this was necessary to avoid a memory leak that caused the machine to start swapping.

#!/usr/bin/env python
""" Script to Benchmark Bio.PDB PDBParser """
 
import sys, os, warnings
 
# Parsing Function
def parse_structure(path):
    """ Parses a PDB file """
 
    s = P.get_structure('test', path)
 
    return 0
 
def fancy_output(tps):
    """ Outputs the results in a nicer way """
 
    print "# Bio.PDB PDBParser Benchmark"
    print
    print "Structure \tLength \tTime Spent (ms)"
    for i,s in enumerate(tps):
        print " %s\t (%s) \t%3.3f" %(os.path.basename(pdb_library[i]), pdb_length[i], s)
    print 
    print "Total time spent: %5.3fs" %(sum(tps)/1000)
    print "Average time per structure: %5.3fms" %(sum(tps)/len(tps))
 
if __name__=='__main__':
 
    import time, gc
    from Bio.PDB import PDBParser
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).
 
    library_path = sys.argv[1]
 
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]
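    # Note: l[23:26] reads only three of the four residue-number columns and ignores
    # the chain ID, so this is an approximate count for multi-chain or large structures.
    pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues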
    sys.stderr.write("Loaded %s structures (Average Length: %4.3f residues)\n" %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))
 
    tps = []
    # Run the Test
    for i, pdb_file in enumerate(pdb_library):    
        sys.stderr.write( "[%s] %i Structure(s) Parsed \n" %(os.path.basename(pdb_file), i+1) )
        a = time.time()
        parse_structure(pdb_file)
        b = time.time()-a
        tps.append(b*1000)
        gc.collect()
    # Output Results
    fancy_output(tps)
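
The residue count in the script keys only on part of the residue number, so residues that share a number across chains are counted once. A stricter count could look like the sketch below. It assumes standard fixed-column ATOM records (chain ID in column 22, residue number in columns 23-26, insertion code in column 27). The helpers count_residues and atom_line are hypothetical and were not part of the benchmark.

```python
def count_residues(lines):
    """Count unique (chain, residue number, insertion code) triples.

    Hypothetical helper, not part of the original benchmark script.
    """
    residues = set()
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]         # column 22
        res_seq = int(line[22:26])  # columns 23-26
        i_code = line[26]           # column 27
        residues.add((chain_id, res_seq, i_code))
    return len(residues)


def atom_line(serial, name, res_name, chain, res_seq, i_code=" "):
    """Build the first 27 columns of an ATOM record (illustration only)."""
    return "ATOM  %5d %-4s %3s %s%4d%s" % (
        serial, name, res_name, chain, res_seq, i_code)


# Made-up records: two atoms of A/1, one of A/2, one of B/1.
example = [
    atom_line(1, " N", "MET", "A", 1),
    atom_line(2, " CA", "MET", "A", 1),
    atom_line(3, " N", "GLN", "A", 2),
    atom_line(4, " N", "MET", "B", 1),
]
print(count_residues(example))
```

Counting this way would shift some structures between length bins, so it should not be mixed with the lengths reported in the results on this page.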

Results

CATH Dataset

Average Structure Length: 147 residues

Biopython 1.49

Total Time Spent: 530.686s
Average Time per Structure: 46.84ms/structure
Average Structures per Second: 21.38 structures/s
Failed to parse 0 structures due to errors.
Length                N. Structures   Average Time Spent (ms)
< 100                 3663            25.11
100 =< x < 200        5295            44.68
200 =< x < 500        2328            83.41
500 =< x < 1000       44              180.35
TOTAL                 11330           46.84

Link to full results
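
The benchmarking script prints one time per structure, while the tables in this section group structures by length. How the grouping was done is not shown on this page; the sketch below is one possible way to do it, assuming the same bin boundaries as the tables. The names bin_label and summarize are hypothetical, and the input numbers are made up.

```python
from collections import defaultdict

# Exclusive upper bounds of the length bins used in the tables;
# the last bin is open-ended.
BIN_EDGES = [100, 200, 500, 1000]
BIN_LABELS = ["< 100", "100 =< x < 200", "200 =< x < 500",
              "500 =< x < 1000", ">= 1000"]


def bin_label(length):
    """Return the table label of the bin that contains `length`."""
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if length < edge:
            return label
    return BIN_LABELS[-1]


def summarize(lengths, times_ms):
    """Return {label: (n_structures, mean_time_ms)} for non-empty bins."""
    grouped = defaultdict(list)
    for length, t in zip(lengths, times_ms):
        grouped[bin_label(length)].append(t)
    return dict((label, (len(ts), sum(ts) / float(len(ts))))
                for label, ts in grouped.items())


# Made-up numbers, for illustration only.
summary = summarize([80, 150, 160, 1200], [20.0, 40.0, 50.0, 900.0])
for label in BIN_LABELS:
    if label in summary:
        print("%-20s %5d %10.2f" % ((label,) + summary[label]))
```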

Biopython 1.57+

Total Time Spent: 686.176s
Average Time per Structure: 60.56ms/structure
Average Structures per Second: 16.51 structures/s
Failed to parse 0 structures due to errors.
Length                N. Structures   Average Time Spent (ms)
< 100                 3663            32.57
100 =< x < 200        5295            57.76
200 =< x < 500        2328            107.56
500 =< x < 1000       44              242.30
TOTAL                 11330           60.56

Link to full results

Biopython 1.58-

Total Time Spent: 695.405s
Average Time per Structure: 61.37ms/structure
Average Structures per Second: 16.29 structures/s
Failed to parse 0 structures due to errors.
Length                N. Structures   Average Time Spent (ms)
< 100                 3663            33.24
100 =< x < 200        5295            58.46
200 =< x < 500        2328            108.77
500 =< x < 1000       44              247.171
TOTAL                 11330           61.38

Link to full results
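
To compare the versions directly, the total times reported above can be turned into relative changes against the 1.49 baseline. The short sketch below does only that arithmetic; percent_change is a hypothetical helper name. On the CATH totals it gives a slowdown of roughly 29% for 1.57+ and roughly 31% for 1.58-.

```python
# Total CATH parsing times in seconds, copied from the results above.
cath_total_s = {"1.49": 530.686, "1.57+": 686.176, "1.58-": 695.405}


def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


baseline = cath_total_s["1.49"]
for version in ("1.57+", "1.58-"):
    change = percent_change(cath_total_s[version], baseline)
    print("%s vs 1.49: %+.1f%%" % (version, change))
```

The difference between 1.57+ and 1.58- is a little over 1%. A single run per version gives no measure of run-to-run variance, so a gap that small may not be meaningful.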

PDB Dataset

Average Structure Length: 263 residues

Biopython 1.49

Total Time Spent: 27801.934s (7.72h)
Average Time per Structure: 381.706ms/structure
Average Structures per Second: 2.62 structures/s
Failed to parse 2 structures due to errors:
  1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN
  2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs
Length                N. Structures   Average Time Spent (ms)
< 100                 10986           392.227
100 =< x < 200        19641           371.716
200 =< x < 500        35299           301.699
500 =< x < 1000       6291            597.472
>= 1000               619             2881.520
TOTAL                 72836           381.706

Link to full results
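
The TOTAL row is a count-weighted mean of the per-bin averages, not a plain mean of the five bin values. As a consistency check, the sketch below recomputes it from the table above; the variable names are illustrative. The result agrees with the reported 381.706 ms and 27801.934 s up to rounding of the bin averages.

```python
# (number of structures, average time in ms) per length bin,
# copied from the Biopython 1.49 PDB table above.
pdb_bins = [
    (10986, 392.227),
    (19641, 371.716),
    (35299, 301.699),
    (6291, 597.472),
    (619, 2881.520),
]

n_total = sum(n for n, _ in pdb_bins)
total_ms = sum(n * avg for n, avg in pdb_bins)

print("Structures: %d" % n_total)
print("Total time: %.1f s" % (total_ms / 1000.0))
print("Weighted mean: %.3f ms" % (total_ms / n_total))
```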

Biopython 1.57+ (In progress)

In progress

Biopython 1.58- (In progress)

In progress
