Interfacing with PAML

This module provides an interface to the PAML (Phylogenetic Analysis by Maximum Likelihood) package of programs. Currently, interfaces to the programs codeml, baseml and yn00 as well as a Python re-implementation of chi2 have been included.

Availability

This module is available in Biopython 1.58 and later.

To use this module, you must have PAML version 4.1 or greater installed on your system.

codeml

The codeml interface is provided in the following module:

from Bio.Phylo.PAML import codeml

The Codeml object

Initialization

The interface is implemented as an object which maintains program options. In order to run codeml, typically one provides a control file which indicates the locations of an alignment file, a tree file and an output file, along with a series of options dictating how the software is to be run. In the Codeml object, the three file locations are stored as string attributes. The three file locations, as well as the location of the desired working directory, may be specified in the Codeml constructor as follows:

cml = codeml.Codeml(alignment = "align.phylip", tree = "species.tree",
                    out_file = "results.out", working_dir = "./scratch")

They may also be set individually:

cml = codeml.Codeml()
cml.alignment = "align.phylip"
cml.tree = "species.tree"
cml.out_file = "results.out"
cml.working_dir = "./scratch"

Note that all file locations are converted to locations relative to the working directory. PAML programs have a limit on the lengths of file location strings; converting to relative locations will shorten these strings, allowing you generally to sidestep this limitation.

Options

The codeml runtime options are stored in a dictionary object that is keyed by the option names. For more information on the usage of these options, please refer to the PAML user manual.

The complete set of options and their current values may be printed:

>>> cml.print_options()
verbose = None
CodonFreq = None
cleandata = None
fix_blength = None
NSsites = None
fix_omega = None
clock = None
ncatG = None
runmode = None
fix_kappa = None
fix_alpha = None
Small_Diff = None
method = None
Malpha = None
aaDist = None
RateAncestor = None
aaRatefile = None
icode = None
alpha = None
seqtype = None
omega = None
getSE = None
noisy = None
Mgene = None
kappa = None
model = None
ndata = None

Setting an option to a value of None will cause it to be ignored by codeml (it will be omitted from the final control file). Options may be set by the set_option() function and their values may be retrieved by the get_option() function:

>>> cml.set_options(clock=1)
>>> cml.set_options(NSsites=[0, 1, 2])
>>> cml.set_options(aaRatefile="wag.dat")
>>> cml.get_option("NSsites")
[0, 1, 2]

The NSsites option deserves special attention: in a codeml control file, NSsites models are entered as a space-delimited list of numbers, such as 0 1 2, whereas in the Codeml object it is stored as a Python list.

Finally, options may be read in from an existing codeml control file or they may be written to a new file. Writing to a file isn’t necessary, as this is done automatically when executing the run() method (see below). The control file to read is provided as an argument to the read_ctl_file() method, while the write_ctl_file() method writes to the Codeml object’s ctl_file attribute

>>> cml.read_ctl_file("codeml.ctl")
>>> cml.print_options()
verbose = 1
CodonFreq = 2
cleandata = 1
fix_blength = None
NSsites = 0
fix_omega = 0
clock = 0
ncatG = 8
runmode = 0
fix_kappa = 0
fix_alpha = 1
Small_Diff = 5e-07
method = 0
Malpha = 0
aaDist = 0
RateAncestor = 1
aaRatefile = dat/jones.dat
icode = 0
alpha = 0.0
seqtype = 2
omega = 0.4
getSE = 0
noisy = 9
Mgene = 0
kappa = 2
model = 2
ndata = None

>>> cml.ctl_file = "control2.ctl"
>>> cml.write_ctl_file()

Running the program

Executing the object’s run() method will run codeml with the current options and will return the parsed results in a dictionary object (see below). You can also pass a number of optional arguments to the run() method:

If the codeml process exits with an error, a PamlError exception will be raised.

read() and Results

As previously stated, the Codeml.run() method returns a dictionary containing results parsed from codeml’s main output file. Alternatively, an existing output file may be parsed using the read() function:

results = codeml.read()

The results dictionary is organized hierarchically and is (in most cases) keyed in accordance with the terminology used in the output file. Numeric values are automatically converted to numeric Python types. As the output of codeml varies widely depending on the parameters used, the contents of the results dictionary will vary accordingly. Similarly, in the case of a runtime error, the results dictionary may be empty or missing contents. Thus, it is advisable to use Python’s dict.get(key) method rather than dict[key] in order to handle missing elements gracefully.

All possible keys of the results dictionary are organized as follows:

baseml & yn00

Interfaces to baseml and yn00 are provided in the following modules:

from Bio.Phylo.PAML import baseml, yn00

bml = baseml.Baseml()
yn = yn00.Yn00()

Baseml and Yn00 share the same methods and attributes as Codeml, and are thus used in the same manner. It should be noted, however, that Yn00 does not have a tree attribute, as yn00 does not require a tree file.

Baseml

The parameters available in the Baseml options are:

>>> bml.print_options()
verbose = None
cleandata = None
fix_blength = None
nparK = None
model_options = None
clock = None
ncatG = None
runmode = None
fix_kappa = None
fix_alpha = None
noisy = None
method = None
fix_rho = None
Malpha = None
nhomo = None
RateAncestor = None
icode = None
rho = None
alpha = None
getSE = None
Small_Diff = None
Mgene = None
kappa = None
model = None
ndata = None

All of the possible contents of the Baseml results file are:

Yn00

The parameters available in the Yn00 options are:

>>> yn.print_options()
commonf3x4 = None
weighting = None
icode = None
ndata = None
verbose = None

All of the possible contents of the Yn00 results file are:

chi2

The chi2 module offers an easy method to retrieve p-values from a Chi-squared cumulative distribution function for likelihood ratio tests, which are performed frequently when using PAML programs. As of the current version of PAML, the chi2 program does not allow passing both a test statistic and the degrees of freedom as command-line arguments. Thus, the chi2 module currently consists of a re-implementation of Ziheng Yang’s original code in pure Python. For most cases, this should not affect you, however using very large numbers of degrees of freedom, such as in the FMutSel vs FMutSel0 model test in codeml (41 degrees of freedom), this Python version runs significantly slower than the original.

To retrieve a p-value, simply import the module

 from Bio.Phylo.PAML.chi2 import cdf_chi2

And use the cdf_chi2() function:

>>> df = 2
>>> statistic = 7.21
>>> cdf_chi2(df, statistic)
0.027187444813595696