Bio.HMM.Trainer module

Provide trainers which estimate parameters based on training sequences.

These should be used to ‘train’ a Markov Model prior to actually using it to decode state paths. When supplied training sequences and a model to work from, these classes will estimate parameters of the model.

This aims to estimate two parameters:

a_{kl} – the number of times there is a transition from k to l in the training data.
e_{k}(b) – the number of emissions of the state b from the letter k in the training data.

class Bio.HMM.Trainer.TrainingSequence(emissions, state_path)

Bases: object

Hold a training sequence with emissions and optionally, a state path.

__init__(emissions, state_path)

Initialize a training sequence.

Arguments:

emissions - An iterable (e.g., a tuple, list, or Seq object) containing the sequence of emissions in the training sequence.
state_path - An iterable (e.g., a tuple or list) containing the sequence of states. If there is no known state path, then the sequence of states should be an empty iterable.

class Bio.HMM.Trainer.AbstractTrainer(markov_model)

Bases: object

Provide generic functionality needed in all trainers.

__init__(markov_model): Initialize the class.

log_likelihood(probabilities)

Calculate the log likelihood of the training seqs.

Arguments:

probabilities – A list of the probabilities of each training sequence under the current parameters, calculated using the forward algorithm.

estimate_params(transition_counts, emission_counts)

Get a maximum likelihood estimation of transition and emission.

Arguments:

transition_counts – A dictionary with the total number of counts of transitions between two states.
emissions_counts – A dictionary with the total number of counts of emissions of a particular emission letter by a state letter.

This then returns the maximum likelihood estimators for the transitions and emissions, estimated by formulas 3.18 in Durbin et al:

a_{kl} = A_{kl} / sum(A_{kl'})
e_{k}(b) = E_{k}(b) / sum(E_{k}(b'))

Returns: Transition and emission dictionaries containing the maximum likelihood estimators.

ml_estimator(counts)

Calculate the maximum likelihood estimator.

This can calculate maximum likelihoods for both transitions and emissions.

Arguments:

counts – A dictionary of the counts for each item.

See estimate_params for a description of the formula used for calculation.

class Bio.HMM.Trainer.BaumWelchTrainer(markov_model)

Bases: AbstractTrainer

Trainer that uses the Baum-Welch algorithm to estimate parameters.

These should be used when a training sequence for an HMM has unknown paths for the actual states, and you need to make an estimation of the model parameters from the observed emissions.

This uses the Baum-Welch algorithm, first described in Baum, L.E. 1972. Inequalities. 3:1-8 This is based on the description in ‘Biological Sequence Analysis’ by Durbin et al. in section 3.3

This algorithm is guaranteed to converge to a local maximum, but not necessarily to the global maxima, so use with care!

__init__(markov_model)

Initialize the trainer.

Arguments:

markov_model - The model we are going to estimate parameters for. This should have the parameters with some initial estimates, that we can build from.

train(training_seqs, stopping_criteria, dp_method=ScaledDPAlgorithms)

Estimate the parameters using training sequences.

The algorithm for this is taken from Durbin et al. p64, so this is a good place to go for a reference on what is going on.

Arguments:

training_seqs – A list of TrainingSequence objects to be used for estimating the parameters.
stopping_criteria – A function, that when passed the change in log likelihood and threshold, will indicate if we should stop the estimation iterations.
dp_method – A class instance specifying the dynamic programming implementation we should use to calculate the forward and backward variables. By default, we use the scaling method.

update_transitions(transition_counts, training_seq, forward_vars, backward_vars, training_seq_prob)

Add the contribution of a new training sequence to the transitions.

Arguments:

transition_counts – A dictionary of the current counts for the transitions
training_seq – The training sequence we are working with
forward_vars – Probabilities calculated using the forward algorithm.
backward_vars – Probabilities calculated using the backwards algorithm.
training_seq_prob - The probability of the current sequence.

This calculates A_{kl} (the estimated transition counts from state k to state l) using formula 3.20 in Durbin et al.

update_emissions(emission_counts, training_seq, forward_vars, backward_vars, training_seq_prob)

Add the contribution of a new training sequence to the emissions.

Arguments:

emission_counts – A dictionary of the current counts for the emissions
training_seq – The training sequence we are working with
forward_vars – Probabilities calculated using the forward algorithm.
backward_vars – Probabilities calculated using the backwards algorithm.
training_seq_prob - The probability of the current sequence.

This calculates E_{k}(b) (the estimated emission probability for emission letter b from state k) using formula 3.21 in Durbin et al.

__annotations__ = {}

class Bio.HMM.Trainer.KnownStateTrainer(markov_model)

Bases: AbstractTrainer

Estimate probabilities with known state sequences.

This should be used for direct estimation of emission and transition probabilities when both the state path and emission sequence are known for the training examples.

__init__(markov_model): Initialize the class.

train(training_seqs)

Estimate the Markov Model parameters with known state paths.

This trainer requires that both the state and the emissions are known for all of the training sequences in the list of TrainingSequence objects. This training will then count all of the transitions and emissions, and use this to estimate the parameters of the model.

__annotations__ = {}