PDBParser

From Biopython
(Difference between revisions)
m
(CATH Dataset: Corrected statistics for new lengths.)
 
(12 intermediate revisions by 2 users not shown)
Line 17: Line 17:
 
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.
 
[https://github.com/biopython/biopython Biopython 1.57+] (May 2011) | Element column auto-assignment.
  
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements] Biopython 1.58- (May 2011 @ Github) | Warnings module replaced _handle_pdb_exception() && Other minor changes
+
[https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Biopython PDB branch] (May 2011 @ Github) | Warnings module replaced _handle_pdb_exception() && Other minor changes
  
 
== Benchmarking Script ==
 
== Benchmarking Script ==
  
The following script was used to benchmark the parser. The garbage collection module - <b>gc</b> - was necessary to avoid a memory leak that caused the machine to start swapping.
+
The following script was used to benchmark the parser. The garbage collection module - <b>gc</b> - was necessary to avoid dead objects still in memory causing the machine to start swapping.
  
 
<python>
 
<python>
Line 58: Line 58:
  
 
     pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]
 
     pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]
     pdb_length = [len(set([int(l[23:26]) for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues
+
     pdb_length = [len(set([l[17:26] for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues
 
     sys.stderr.write("Loaded %s structures (Average Length: %4.3f residues)\n" %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))
 
     sys.stderr.write("Loaded %s structures (Average Length: %4.3f residues)\n" %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))
  
Line 78: Line 78:
 
=== CATH Dataset ===
 
=== CATH Dataset ===
  
Average Structure Length: 147 residues
+
Average Structure Length: 146 residues
  
 
==== Biopython 1.49 ====
 
==== Biopython 1.49 ====
Line 88: Line 88:
 
  Failed to parse 0 structures due to errors.
 
  Failed to parse 0 structures due to errors.
  
  Length                N. Structures  Average Time Spent (ms)
+
  Length                N. Structures  Average Time Spent ms
  < 100                3663           25.11
+
  < 100                3660           25.09
  100 =< x < 200        5295           44.68
+
  100 =< x < 200        5296           44.67
  200 =< x < 500        2328           83.41
+
  200 =< x < 500        2330           83.40
  500 =< x < 1000      44             180.35
+
  500 =< x < 1000      43             177.10
 +
>= 1000              1              320.10
  
 
  TOTAL                11330          46.84
 
  TOTAL                11330          46.84
 +
 +
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time Link to full results]
 +
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png Plot of the full results]
  
 
==== Biopython 1.57+ ====
 
==== Biopython 1.57+ ====
Line 104: Line 108:
 
  Failed to parse 0 structures due to errors.
 
  Failed to parse 0 structures due to errors.
  
  Length                N. Structures  Average Time Spent (ms)
+
  Length                N. Structures  Average Time Spent ms
  < 100                3663           32.57
+
  < 100                3660           32.55
  100 =< x < 200        5295           57.76
+
  100 =< x < 200        5296           57.75
  200 =< x < 500        2328           107.56
+
  200 =< x < 500        2330           107.54
  500 =< x < 1000      44             242.30
+
  500 =< x < 1000      43             236.62
 +
>= 1000              1              486.602
  
 
  TOTAL                11330          60.56
 
  TOTAL                11330          60.56
  
==== Biopython 1.58- ====
+
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time Link to full results]
 +
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png Plot of the full results]
 +
 
 +
==== Biopython PDB Branch ====
  
  
Line 120: Line 128:
 
  Failed to parse 0 structures due to errors.
 
  Failed to parse 0 structures due to errors.
  
  Length                N. Structures  Average Time Spent (ms)
+
  Length                N. Structures  Average Time Spent ms
  < 100                3663           33.24
+
  < 100                3660           33.21
  100 =< x < 200        5295           58.46
+
  100 =< x < 200        5296           58.45
  200 =< x < 500        2328           108.77
+
  200 =< x < 500        2330           108.76
  500 =< x < 1000      44             247.171
+
  500 =< x < 1000      43             234.37
 +
>= 1000              1              797.583
  
 
  TOTAL                11330          61.38
 
  TOTAL                11330          61.38
 +
 +
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time Link to full results]
 +
[http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png Plot of the full results]
  
 
=== PDB Dataset ===
 
=== PDB Dataset ===
  
Average Structure Length: 263 residues
+
Average Structure Length: 589 residues
 +
 
 +
Failed to parse 2 structures due to errors:
 +
  1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN (still being discussed)
 +
  2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs
  
 
==== Biopython 1.49 ====
 
==== Biopython 1.49 ====
Line 137: Line 153:
 
  Average Time per Structure: 381.706ms/structure
 
  Average Time per Structure: 381.706ms/structure
 
  Average Structures per Second: 2.62 structures/s
 
  Average Structures per Second: 2.62 structures/s
Failed to parse 2 structures due to errors:
 
  1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN
 
  2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs
 
  
  Length                N. Structures  Average Time Spent (ms)
+
  Length                N. Structures  Average Time Spent (ms)
  < 100                10986          392.227
+
  < 100                8402            410.270
  100 =< x < 200        19641           371.716
+
  100 =< x < 200        12226           461.744
  200 =< x < 500        35299           301.699
+
  200 =< x < 500        26493           182.027
  500 =< x < 1000      6291            597.472
+
  500 =< x < 1000      15878          290.693
  >= 1000              619            2881.520
+
  >= 1000              9837            942.513
  
 
  TOTAL                72836          381.706
 
  TOTAL                72836          381.706
  
==== Biopython 1.57+ (In progress) ====
+
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.49.time Link to full results]
 +
 
 +
==== Biopython 1.57+ ====
 +
 
 +
Total Time Spent: 29516.480s  (~8.20h)
 +
Average Time per Structure: 405.246 ms/structure
 +
Average Structures per Second: 2.47 structures/s
 +
 
 +
Length                N. Structures  Average Time Spent  (ms)
 +
< 100                8402            451.819
 +
100 =< x < 200        12226          505.933
 +
200 =< x < 500        26493          190.991
 +
500 =< x < 1000      15878          304.453
 +
>= 1000              9837            980.047
 +
 
 +
TOTAL                72836          405.246
  
In progress
+
[http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.57.time Link to full results]
  
==== Biopython 1.58- (In progress) ====
+
==== Biopython PDB Branch (In progress) ====
  
 
In progress
 
In progress

Latest revision as of 13:46, 16 May 2011

This is a draft page for the PDBParser class.


Benchmark

A performance benchmark of the parser was carried out to evaluate whether the development of new features degraded the overall parsing speed.

Datasets

CATH Domain Collection - 11330 Structures containing only coordinate information (no Element assigned).

Protein Data Bank Collection - 72836 Structures containing both headers and coordinate information.

Versions Tested

Biopython 1.49 (Nov. 2008)

Biopython 1.57+ (May 2011) | Element column auto-assignment.

Biopython PDB branch (May 2011 @ Github) | The warnings module replaced _handle_pdb_exception(), plus other minor changes.

Benchmarking Script

The following script was used to benchmark the parser. An explicit call to the garbage collection module (gc) after each structure was necessary: without it, dead objects lingering in memory caused the machine to start swapping.

#!/usr/bin/env python
""" Script to Benchmark Bio.PDB PDBParser """
 
import sys, os, warnings
 
# Parsing Function
def parse_structure(path):
    """ Parses a PDB file """
 
    s = P.get_structure('test', path)
 
    return 0
 
def fancy_output(tps):
    """ Outputs the results in a nicer way """
 
    # print() with a single argument behaves the same on Python 2 and 3
    print("# Bio.PDB PDBParser Benchmark")
    print("")
    print("Structure \tLength \tTime Spent (ms)")
    for i,s in enumerate(tps):
        print(" %s\t (%s) \t%3.3f" %(os.path.basename(pdb_library[i]), pdb_length[i], s))
    print("")
    print("Total time spent: %5.3fs" %(sum(tps)/1000))
    print("Average time per structure: %5.3fms" %(sum(tps)/len(tps)))
 
if __name__=='__main__':
 
    import time, gc
    from Bio.PDB import PDBParser
    P = PDBParser(PERMISSIVE=1) # For the pdb_enhancements branch benchmarking, PERMISSIVE was set to 2 (silence warnings).
 
    library_path = sys.argv[1]
 
    pdb_library = [os.path.join(library_path, f) for f in os.listdir(library_path)]
    pdb_length = [len(set([l[17:26] for l in open(f) if l.startswith('ATOM')])) for f in pdb_library] # Unique counting of residues
    sys.stderr.write("Loaded %s structures (Average Length: %4.3f residues)\n" %(len(pdb_length), (sum(pdb_length)/float(len(pdb_length)))))
 
    tps = []
    # Run the Test
    for i, pdb_file in enumerate(pdb_library):    
        sys.stderr.write( "[%s] %i Structure(s) Parsed \n" %(os.path.basename(pdb_file), i+1) )
        a = time.time()
        parse_structure(pdb_file)
        b = time.time()-a
        tps.append(b*1000)
        gc.collect()
    # Output Results
    fancy_output(tps)
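
The residue count in the script keys each residue on the slice [17:26] of every ATOM record, which in the fixed-column PDB format covers the residue name, the chain identifier and the residue number. A minimal sketch of that counting step in isolation is given below, written for Python 3; the helper name count_residues and the sample records are illustrative assumptions, not part of the benchmark script.

```python
def count_residues(lines):
    """Count unique residues among ATOM records.

    Each residue is keyed on columns 18-26 of the record (residue name,
    chain identifier and residue number), i.e. the slice [17:26].
    """
    return len({line[17:26] for line in lines if line.startswith("ATOM")})


# Hypothetical records: two atoms of one residue, one atom of the next
# residue, and a HETATM record that the ATOM filter ignores.
example = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
    "ATOM      3  N   GLN A   2      26.850  29.021   3.898  1.00  9.07           N",
    "HETATM    4  O   HOH A 101      10.000  10.000  10.000  1.00 20.00           O",
]
print(count_residues(example))
```

Including the chain identifier in the key matters for multi-chain files: keying on the residue number alone would merge residues from different chains that happen to share a number.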

Results

CATH Dataset

Average Structure Length: 146 residues

Biopython 1.49

Total Time Spent: 530.686s
Average Time per Structure: 46.84ms/structure
Average Structures per Second: 21.38 structures/s
Failed to parse 0 structures due to errors.
Length                N. Structures   Average Time Spent (ms)
< 100                 3660            25.09
100 =< x < 200        5296            44.67
200 =< x < 500        2330            83.40
500 =< x < 1000       43              177.10
>= 1000               1               320.10
TOTAL                 11330           46.84

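Link to full results: http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.time
Plot of the full results: http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_149.png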
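
The benchmark script prints one timing per structure, while the tables in this section group structures by length. The original page does not show how that grouping was done; the following is only a sketch of one way to do it in Python 3, assuming two parallel lists of residue counts and timings in milliseconds. The function name summarize_by_length and the sample values are made up for illustration.

```python
def summarize_by_length(lengths, times_ms):
    """Group per-structure timings into the length bins used in the tables.

    Returns a list of (label, number of structures, average time in ms).
    """
    bins = [
        ("< 100", 0, 100),
        ("100 =< x < 200", 100, 200),
        ("200 =< x < 500", 200, 500),
        ("500 =< x < 1000", 500, 1000),
        (">= 1000", 1000, float("inf")),
    ]
    rows = []
    for label, low, high in bins:
        selected = [t for n, t in zip(lengths, times_ms) if low <= n < high]
        average = sum(selected) / len(selected) if selected else 0.0
        rows.append((label, len(selected), average))
    overall = sum(times_ms) / len(times_ms) if times_ms else 0.0
    rows.append(("TOTAL", len(times_ms), overall))
    return rows


# Made-up values for illustration only, not benchmark data.
lengths = [50, 150, 160, 1200]
times = [20.0, 40.0, 60.0, 300.0]
for label, count, average in summarize_by_length(lengths, times):
    print("%-20s %-15d %.2f" % (label, count, average))
```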

Biopython 1.57+

Total Time Spent: 686.176s
Average Time per Structure: 60.56ms/structure
Average Structures per Second: 16.51 structures/s
Failed to parse 0 structures due to errors.
Length                N. Structures   Average Time Spent (ms)
< 100                 3660            32.55
100 =< x < 200        5296            57.75
200 =< x < 500        2330            107.54
500 =< x < 1000       43              236.62
>= 1000               1               486.602
TOTAL                 11330           60.56

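Link to full results: http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.time
Plot of the full results: http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_current.png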

Biopython PDB Branch

Total Time Spent: 695.405s
Average Time per Structure: 61.38ms/structure
Average Structures per Second: 16.29 structures/s
Failed to parse 0 structures due to errors.
Length                N. Structures   Average Time Spent (ms)
< 100                 3660            33.21
100 =< x < 200        5296            58.45
200 =< x < 500        2330            108.76
500 =< x < 1000       43              234.37
>= 1000               1               797.583
TOTAL                 11330           61.38

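Link to full results: http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.time
Plot of the full results: http://nmr.chem.uu.nl/~joao/f/benchmark_CATH-biopython_pdb_enhancements.png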

PDB Dataset

Average Structure Length: 589 residues

Failed to parse 2 structures due to errors:

 1. 3NH3 (negative occupancy) - fix: http://bit.ly/ks6PDN (still being discussed)
 2. 2WMW (invalid ANISOU field) - fix: http://bit.ly/ld9BWs
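
The benchmark script shown earlier does not catch exceptions, and the page does not say how the failed structures were tallied. One possible approach, sketched here for Python 3 under that assumption, is to wrap each parse in a try/except block and record the failures separately from the timings. The function time_parses and the stand-in fake_parse are hypothetical; with Biopython, the parse argument could be a small wrapper around PDBParser.get_structure.

```python
import gc
import time


def time_parses(paths, parse):
    """Time parse(path) for each path.

    Returns (timings_ms, failures): timings for the paths that parsed,
    and the error message for each path that raised an exception.
    """
    timings_ms = {}
    failures = {}
    for path in paths:
        start = time.perf_counter()
        try:
            parse(path)
        except Exception as error:  # record the failure and keep going
            failures[path] = str(error)
        else:
            timings_ms[path] = (time.perf_counter() - start) * 1000.0
        gc.collect()  # free dead objects between structures, as in the script
    return timings_ms, failures


def fake_parse(path):
    """Stand-in for a real parser call; fails on one hypothetical file."""
    if path == "bad.pdb":
        raise ValueError("negative occupancy")


timings, failures = time_parses(["good.pdb", "bad.pdb"], fake_parse)
print(sorted(timings), failures)
```

Keeping failed structures out of the timing dictionary means the reported averages only cover structures that were parsed successfully.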

Biopython 1.49

Total Time Spent: 27801.934s (7.72h)
Average Time per Structure: 381.706ms/structure
Average Structures per Second: 2.62 structures/s
Length                N. Structures   Average Time Spent  (ms)
< 100                 8402            410.270
100 =< x < 200        12226           461.744
200 =< x < 500        26493           182.027
500 =< x < 1000       15878           290.693
>= 1000               9837            942.513
TOTAL                 72836           381.706

Link to full results: http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.49.time

Biopython 1.57+

Total Time Spent: 29516.480s  (~8.20h)
Average Time per Structure: 405.246 ms/structure
Average Structures per Second: 2.47 structures/s
Length                N. Structures   Average Time Spent  (ms)
< 100                 8402            451.819
100 =< x < 200        12226           505.933
200 =< x < 500        26493           190.991
500 =< x < 1000       15878           304.453
>= 1000               9837            980.047
TOTAL                 72836           405.246

Link to full results: http://nmr.chem.uu.nl/~joao/f/benchmark_PDB-biopython_1.57.time
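
Since the stated goal of the benchmark is to check whether new features slowed the parser down, it can help to express the TOTAL rows as a relative change against Biopython 1.49. The snippet below is a small Python 3 sketch of that arithmetic using the averages reported on this page; the function name relative_slowdown is an assumption made for the example.

```python
def relative_slowdown(baseline_ms, candidate_ms):
    """Return the percentage increase of candidate over baseline."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# Average times per structure (ms), taken from the TOTAL rows on this page.
cath = {"1.49": 46.84, "1.57+": 60.56, "PDB branch": 61.38}
pdb = {"1.49": 381.706, "1.57+": 405.246}

for name, table in (("CATH", cath), ("PDB", pdb)):
    base = table["1.49"]
    for version, average in table.items():
        if version != "1.49":
            change = relative_slowdown(base, average)
            print("%s %s: %+.1f%% vs 1.49" % (name, version, change))
```

On these figures the newer versions are roughly 29% to 31% slower than 1.49 on the CATH dataset and about 6% slower on the PDB dataset.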

Biopython PDB Branch (In progress)

In progress
