Cluster analysis
Cluster analysis is the grouping of items into clusters based on the similarity of the items to each other. In bioinformatics, clustering is widely used in gene expression data analysis to find groups of genes with similar gene expression profiles. This may identify functionally related genes, as well as suggest the function of presently unknown genes.
The Biopython module Bio.Cluster
provides commonly used clustering
algorithms and was designed with the application to gene expression data
in mind. However, this module can also be used for cluster analysis of
other types of data. Bio.Cluster
and the underlying C Clustering
Library are described by De Hoon et al. [DeHoon2004].
The following four clustering approaches are implemented in
Bio.Cluster
:
Hierarchical clustering (pairwise centroid-, single-, complete-, and average-linkage);
k-means, k-medians, and k-medoids clustering;
Self-Organizing Maps;
Principal Component Analysis.
Data representation
The data to be clustered are represented by a 2D Numerical Python array data. Within the context of gene expression
data clustering, typically the rows correspond to different genes
whereas the columns correspond to different experimental conditions. The
clustering algorithms in Bio.Cluster
can be applied both to rows
(genes) and to columns (experiments).
Missing values
The integer array mask indicates
if any of the values in data
are missing. If mask[i, j] == 0
,
then data[i, j]
is missing and is ignored in the analysis.
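For example, a small data set with one missing value could be set up as follows (a minimal sketch using NumPy; the names data and mask simply match the arguments used throughout this section):
>>> from numpy import array
>>> data = array([[1.2, 0.8, 9.0],
...               [0.5, 0.0, 2.1],
...               [0.9, 1.1, 4.0]])  # 3 genes (rows) x 3 conditions (columns)
>>> mask = array([[1, 1, 1],
...               [1, 0, 1],
...               [1, 1, 1]])  # data[1, 1] is missing and will be ignored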
Random number generator
The uniform random number generator in Bio.Cluster is based on the algorithm by L’Ecuyer [Lecuyer1988],
while random numbers following the binomial distribution are generated
using the BTPE algorithm by Kachitvichyanukul and Schmeiser
[Kachitvichyanukul1988]. The random number generator
is initialized automatically during its first call. As this random
number generator uses a combination of two multiplicative linear
congruential generators, two (integer) seeds are needed for
initialization, for which we use the system-supplied random number
generator rand
(in the C standard library). We initialize this
generator by calling srand
with the epoch time in seconds, and use
the first two random numbers generated by rand
as seeds for the
uniform random number generator in Bio.Cluster
.
Distance functions
In order to cluster items into groups based on their similarity, we
should first define what exactly we mean by similar. Bio.Cluster
provides eight distance functions, indicated by a single character, to
measure similarity, or conversely, distance:
'e': Euclidean distance;
'b': City-block distance;
'c': Pearson correlation coefficient;
'a': Absolute value of the Pearson correlation coefficient;
'u': Uncentered Pearson correlation (equivalent to the cosine of the angle between two data vectors);
'x': Absolute uncentered Pearson correlation;
's': Spearman’s rank correlation;
'k': Kendall’s τ.
The first two are true distance functions that satisfy the triangle inequality, d(u, v) ≤ d(u, w) + d(w, v), and are therefore referred to as metrics. In everyday language, this means that the shortest distance between two points is a straight line.
The remaining six distance measures are related to the correlation coefficient, where the distance d is defined in terms of the correlation r as d = 1 − r.
Euclidean distance
In Bio.Cluster, we define the Euclidean distance as

d = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - y_i \right)^2 .

Only those terms are included in the summation for which both x_i and y_i are present, and the denominator n is chosen accordingly. As the expression data x_i and y_i are subtracted directly from each other, we should make sure that the expression data are properly normalized when using the Euclidean distance.
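As a quick check of this convention (the average of the squared differences, without taking a square root), the distance between two small vectors can be computed directly with NumPy; this is only an illustration, not how Bio.Cluster evaluates it internally:
>>> import numpy as np
>>> x = np.array([0.0, 1.0, 2.0, 3.0])
>>> y = np.array([4.0, 5.0, 6.0, 7.0])
>>> d = np.mean((x - y) ** 2)  # Euclidean distance as defined in Bio.Cluster
>>> print(d)
16.0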
City-block distance
The city-block distance, alternatively known as the Manhattan distance,
is related to the Euclidean distance. Whereas the Euclidean distance
corresponds to the length of the shortest path between two points, the
city-block distance is the sum of distances along each dimension. As
gene expression data tend to have missing values, in Bio.Cluster
we
define the city-block distance as the sum of distances divided by the
number of dimensions:

d = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right| .
This is equal to the distance you would have to walk between two points in a city, where you have to walk along city blocks. As for the Euclidean distance, the expression data are subtracted directly from each other, and we should therefore make sure that they are properly normalized.
The Pearson correlation coefficient
The Pearson correlation coefficient is defined as

r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right),

in which \bar{x}, \bar{y} are the sample means of x and y respectively, and \sigma_x, \sigma_y are their sample standard deviations. The Pearson distance is then defined as

d_P \equiv 1 - r .
As the Pearson correlation coefficient lies between -1 and 1, the Pearson distance lies between 0 and 2.
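The definition can be checked against NumPy’s corrcoef function, which computes the same Pearson correlation coefficient (again only an illustration, not the library’s internal code):
>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0])
>>> y = np.array([1.0, 2.0, 2.5, 5.0])
>>> r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
>>> d = 1.0 - r                  # Pearson distance; lies between 0 and 2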
Absolute Pearson correlation
By taking the absolute value of the Pearson correlation, we find a
number between 0 and 1. If the absolute value is 1, all the points in
the scatter plot lie on a straight line with either a positive or a
negative slope. If the absolute value is equal to zero, there is no
correlation between the two data vectors. The corresponding distance is defined as

d_A \equiv 1 - \left| r \right| ,

where r is the Pearson correlation coefficient.
In the context of gene expression experiments, the absolute correlation is equal to 1 if the gene expression profiles of two genes are either exactly the same or exactly opposite. The absolute correlation coefficient should therefore be used with care.
Uncentered correlation (cosine of the angle)
In some cases, it may be preferable to use the uncentered correlation instead of the regular Pearson correlation coefficient. The uncentered correlation is defined as
r_U = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i}{\sigma_x^{(0)}} \right) \left( \frac{y_i}{\sigma_y^{(0)}} \right),

where

\sigma_x^{(0)} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} x_i^2 } ; \qquad \sigma_y^{(0)} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} y_i^2 } .
This is the same expression as for the regular Pearson correlation
coefficient, except that the sample means \bar{x}, \bar{y} are set equal to zero.
The distance corresponding to the uncentered correlation coefficient is defined as
d_U \equiv 1 - r_U ,

where r_U is the uncentered correlation coefficient.
The uncentered correlation is equal to the cosine of the angle of the
two data vectors in n-dimensional space.
Absolute uncentered correlation
As for the regular Pearson correlation, we can define a distance measure using the absolute value of the uncentered correlation:
d_{AU} \equiv 1 - \left| r_U \right| ,

where r_U is the uncentered correlation coefficient.
Geometrically, the absolute value of the uncentered correlation is equal to the cosine between the supporting lines of the two data vectors (i.e., the angle without taking the direction of the vectors into consideration).
Spearman rank correlation
The Spearman rank correlation is an example of a non-parametric similarity measure, and tends to be more robust against outliers than the Pearson correlation.
To calculate the Spearman rank correlation, we replace each data value by its rank, determined by ordering the data in each vector by their value. We then calculate the Pearson correlation between the two rank vectors instead of the data vectors.
As in the case of the Pearson correlation, we can define a distance measure corresponding to the Spearman rank correlation as
d_S \equiv 1 - r_S ,

where r_S is the Spearman rank correlation.
Kendall’s τ
Kendall’s τ is another example of a non-parametric similarity measure. It is similar to the Spearman rank correlation, but instead of the ranks themselves, only the relative ranks are used to calculate τ.
We can define a distance measure corresponding to Kendall’s τ as

d_K \equiv 1 - \tau .

As Kendall’s τ is always between −1 and 1, the corresponding distance d_K lies between 0 and 2.
Weighting
For most of the distance functions available in Bio.Cluster
, a
weight vector can be applied. The weight vector contains weights for the
items in the data vector. If the weight for item i is w_i, then that item is treated as if it occurred w_i times in the data. The weights do not have to be integers.
Calculating the distance matrix
The distance matrix is a square matrix with all pairwise distances
between the items in data
, and can be calculated by the function
distancematrix
in the Bio.Cluster
module:
>>> from Bio.Cluster import distancematrix
>>> matrix = distancematrix(data)
where the following arguments are defined:
data (required): Array containing the data for the items.
mask (default: None): Array of integers showing which data are missing. If mask[i, j] == 0, then data[i, j] is missing. If mask is None, then all data are present.
weight (default: None): The weights to be used when calculating distances. If weight is None, then equal weights are assumed.
transpose (default: 0): Determines if the distances between the rows of data are to be calculated (transpose is False), or between the columns of data (transpose is True).
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
To save memory, the distance matrix is returned as a list of 1D arrays. The number of columns in each row is equal to the row number. Hence, the first row has zero elements. For example,
>>> from numpy import array
>>> from Bio.Cluster import distancematrix
>>> data = array([[0, 1, 2, 3],
... [4, 5, 6, 7],
... [8, 9, 10, 11],
... [1, 2, 3, 4]]) # fmt: skip
...
>>> distances = distancematrix(data, dist="e")
yields a distance matrix
>>> distances
[array([], dtype=float64), array([ 16.]), array([ 64., 16.]), array([ 1., 9., 49.])]
which can be rewritten as
[array([], dtype=float64), array([16.0]), array([64.0, 16.0]), array([1.0, 9.0, 49.0])]
This corresponds to the distance matrix

\begin{pmatrix} 0 & 16 & 64 & 1 \\ 16 & 0 & 16 & 9 \\ 64 & 16 & 0 & 49 \\ 1 & 9 & 49 & 0 \end{pmatrix} .
Calculating cluster properties
Calculating the cluster centroids
The centroid of a cluster can be defined either as the mean or as the
median of each dimension over all cluster items. The function
clustercentroids
in Bio.Cluster
can be used to calculate either:
>>> from Bio.Cluster import clustercentroids
>>> cdata, cmask = clustercentroids(data)
where the following arguments are defined:
data (required): Array containing the data for the items.
mask (default: None): Array of integers showing which data are missing. If mask[i, j] == 0, then data[i, j] is missing. If mask is None, then all data are present.
clusterid (default: None): Vector of integers showing to which cluster each item belongs. If clusterid is None, then all items are assumed to belong to the same cluster.
method (default: 'a'): Specifies whether the arithmetic mean (method=='a') or the median (method=='m') is used to calculate the cluster center.
transpose (default: 0): Determines if the centroids of the rows of data are to be calculated (transpose is False), or the centroids of the columns of data (transpose is True).
This function returns the tuple (cdata, cmask)
. The centroid data
are stored in the 2D Numerical Python array cdata
, with missing data
indicated by the 2D Numerical Python integer array cmask
. The
dimensions of these arrays are (number of clusters, number of columns) if transpose is 0, or (number of rows, number of clusters) if transpose is 1
. Each row (if transpose
is 0
) or
column (if transpose
is 1
) contains the averaged data
corresponding to the centroid of each cluster.
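For example, with an explicit cluster assignment the centroids of two clusters of rows can be calculated as follows (a sketch; the clusterid values are chosen arbitrarily here):
>>> from numpy import array
>>> from Bio.Cluster import clustercentroids
>>> data = array([[0.0, 1.0, 2.0, 3.0],
...               [4.0, 5.0, 6.0, 7.0],
...               [8.0, 9.0, 10.0, 11.0],
...               [1.0, 2.0, 3.0, 4.0]])
>>> clusterid = array([0, 1, 1, 0])  # rows 0 and 3 form cluster 0; rows 1 and 2 form cluster 1
>>> cdata, cmask = clustercentroids(data, clusterid=clusterid)
>>> cdata.shape  # one centroid (averaged row) per cluster
(2, 4)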
Calculating the distance between clusters
Given a distance function between items, we can define the distance
between two clusters in several ways. The distance between the
arithmetic means of the two clusters is used in pairwise
centroid-linkage clustering and in k-means clustering.
To calculate the distance between two clusters, use
>>> from Bio.Cluster import clusterdistance
>>> distance = clusterdistance(data)
where the following arguments are defined:
data (required): Array containing the data for the items.
mask (default: None): Array of integers showing which data are missing. If mask[i, j] == 0, then data[i, j] is missing. If mask is None, then all data are present.
weight (default: None): The weights to be used when calculating distances. If weight is None, then equal weights are assumed.
index1 (default: 0): A list containing the indices of the items belonging to the first cluster. A cluster containing only one item can be represented either as a list [i], or as an integer i.
index2 (default: 0): A list containing the indices of the items belonging to the second cluster. A cluster containing only one item can be represented either as a list [i], or as an integer i.
method (default: 'a'): Specifies how the distance between clusters is defined:
'a': Distance between the two cluster centroids (arithmetic mean);
'm': Distance between the two cluster centroids (median);
's': Shortest pairwise distance between items in the two clusters;
'x': Longest pairwise distance between items in the two clusters;
'v': Average over the pairwise distances between items in the two clusters.
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
transpose (default: 0): If transpose is False, calculate the distance between the rows of data. If transpose is True, calculate the distance between the columns of data.
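For instance, using the 4 × 4 array data defined above, the distance between the cluster formed by rows 0 and 3 and the cluster formed by rows 1 and 2 can be calculated as follows (with the default Euclidean distance this should equal the distance between the two cluster centroids, 30.25):
>>> from Bio.Cluster import clusterdistance
>>> distance = clusterdistance(data, index1=[0, 3], index2=[1, 2], method="a")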
Partitioning algorithms
Partitioning algorithms divide items into k clusters such that the sum of distances of the items to their cluster centers is minimal. The number of clusters k is specified by the user. Three partitioning algorithms are available in Bio.Cluster:
k-means clustering;
k-medians clustering;
k-medoids clustering.
These algorithms differ in how the cluster center is defined. In k-means clustering, the cluster center is defined as the mean data vector averaged over all items in the cluster. In k-medians clustering, the median is calculated for each dimension in the data vector instead. Finally, in k-medoids clustering the cluster center is defined as the item which has the smallest sum of distances to the other items in the cluster. This last algorithm is suitable for cases in which the distance matrix is known but the original data matrix is not available, for example when clustering proteins based on their structural similarity.
The expectation-maximization (EM) algorithm is used to find this partitioning into k groups. In the initialization of the EM algorithm, we randomly assign items to clusters.
We then iterate:
Calculate the centroid of each cluster, defined as either the mean, the median, or the medoid of the cluster;
Calculate the distances of each item to the cluster centers;
For each item, determine which cluster centroid is closest;
Reassign each item to its closest cluster, or stop the iteration if no further item reassignments take place.
To avoid clusters becoming empty during the iteration, in k-means and k-medians clustering the algorithm keeps track of the number of items in each cluster, and prohibits the last remaining item in a cluster from being reassigned to a different cluster.
As the initial assignment of items to clusters is done randomly, usually
a different clustering solution is found each time the EM algorithm is
executed. To find the optimal clustering solution, the k-means algorithm is repeated many times, each time starting from a different initial random clustering. The sum of distances of the items to their cluster center is saved for each run, and the solution with the smallest value of this sum is returned as the overall clustering solution.
How often the EM algorithm should be run depends on the number of items being clustered. As a rule of thumb, we can consider how often the optimal solution was found; this number is returned by the partitioning algorithms as implemented in this library. If the optimal solution was found many times, it is unlikely that better solutions exist than the one that was found. However, if the optimal solution was found only once, there may well be other solutions with a smaller within-cluster sum of distances. If the number of items is large (more than several hundreds), it may be difficult to find the globally optimal solution.
The EM algorithm terminates when no further reassignments take place. We noticed that for some sets of initial cluster assignments, the EM algorithm fails to converge due to the same clustering solution reappearing periodically after a small number of iteration steps. We therefore check for the occurrence of such periodic solutions during the iteration. After a given number of iteration steps, the current clustering result is saved as a reference. By comparing the clustering result after each subsequent iteration step to the reference state, we can determine if a previously encountered clustering result is found. In such a case, the iteration is halted. If after a given number of iterations the reference state has not yet been encountered, the current clustering solution is saved to be used as the new reference state. Initially, ten iteration steps are executed before resaving the reference state. This number of iteration steps is doubled each time, to ensure that periodic behavior with longer periods can also be detected.
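The following sketch illustrates the EM iteration described above for k-means with the Euclidean distance, using plain NumPy; it is only meant to clarify the steps and, unlike Bio.Cluster, it does not guard against clusters becoming empty or check for periodic solutions:
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> data = rng.normal(size=(20, 4))        # toy data: 20 items, 4 conditions
>>> k = 3
>>> clusterid = rng.integers(k, size=20)   # random initial assignment of items to clusters
>>> for _ in range(100):
...     # centroid of each cluster (arithmetic mean over its items)
...     centers = np.array([data[clusterid == j].mean(axis=0) for j in range(k)])
...     # distance of every item to every cluster center (mean of squared differences)
...     distances = ((data[:, None, :] - centers[None, :, :]) ** 2).mean(axis=2)
...     newid = distances.argmin(axis=1)   # closest cluster center for each item
...     if np.array_equal(newid, clusterid):
...         break                          # no reassignments took place; stop iterating
...     clusterid = newid
...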
k-means and k-medians
The kcluster function in Bio.Cluster performs k-means or k-medians clustering:
>>> from Bio.Cluster import kcluster
>>> clusterid, error, nfound = kcluster(data)
where the following arguments are defined:
data (required): Array containing the data for the items.
nclusters (default: 2): The number of clusters k.
mask (default: None): Array of integers showing which data are missing. If mask[i, j] == 0, then data[i, j] is missing. If mask is None, then all data are present.
weight (default: None): The weights to be used when calculating distances. If weight is None, then equal weights are assumed.
transpose (default: 0): Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
npass (default: 1): The number of times the k-means/k-medians clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value of npass is ignored and the clustering algorithm is run only once, as it behaves deterministically in that case.
method (default: 'a'): describes how the center of a cluster is found:
method=='a': arithmetic mean (k-means clustering);
method=='m': median (k-medians clustering).
For other values of method, the arithmetic mean is used.
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions). Whereas all eight distance measures are accepted by kcluster, from a theoretical viewpoint it is best to use the Euclidean distance for the k-means algorithm, and the city-block distance for k-medians.
initialid (default: None): Specifies the initial clustering to be used for the EM algorithm. If initialid is None, then a different random initial clustering is used for each of the npass runs of the EM algorithm. If initialid is not None, then it should be equal to a 1D array containing the cluster number (between 0 and nclusters-1) for each item. Each cluster should contain at least one item. With the initial clustering specified, the EM algorithm is deterministic.
This function returns a tuple (clusterid, error, nfound)
, where
clusterid
is an integer array containing the number of the cluster
to which each row or column was assigned, error
is the
within-cluster sum of distances for the optimal clustering solution, and
nfound
is the number of times this optimal solution was found.
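For example, to group the rows of the 4 × 4 array data used earlier into two clusters (a sketch; because the initial assignment is random, the cluster labels may differ between runs, but rows 0 and 3 should end up in the same cluster):
>>> from Bio.Cluster import kcluster
>>> clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)
>>> len(clusterid)  # one cluster number per row
4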
k-medoids clustering
The kmedoids routine performs k-medoids clustering on a given set of items, using the distance matrix and the number of clusters passed by the user:
>>> from Bio.Cluster import kmedoids
>>> clusterid, error, nfound = kmedoids(distance)
where the following arguments are defined:
distance (required): The matrix containing the distances between the items; this matrix can be specified in three ways:
as a 2D Numerical Python array (in which only the left-lower part of the array will be accessed):
distance = array([[0.0, 1.1, 2.3], [1.1, 0.0, 4.5], [2.3, 4.5, 0.0]])
as a 1D Numerical Python array containing consecutively the distances in the left-lower part of the distance matrix:
distance = array([1.1, 2.3, 4.5])
as a list containing the rows of the left-lower part of the distance matrix:
distance = [array([]), array([1.1]), array([2.3, 4.5])]
These three expressions correspond to the same distance matrix.
nclusters (default: 2): The number of clusters k.
npass (default: 1): The number of times the k-medoids clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value of npass is ignored, as the clustering algorithm behaves deterministically in that case.
initialid (default: None): Specifies the initial clustering to be used for the EM algorithm. If initialid is None, then a different random initial clustering is used for each of the npass runs of the EM algorithm. If initialid is not None, then it should be equal to a 1D array containing the cluster number (between 0 and nclusters-1) for each item. Each cluster should contain at least one item. With the initial clustering specified, the EM algorithm is deterministic.
This function returns a tuple (clusterid, error, nfound)
, where
clusterid
is an array containing the number of the cluster to which
each item was assigned, error
is the within-cluster sum of distances
for the optimal k-medoids clustering solution, and nfound is
the number of times the optimal solution was found. Note that the
cluster number in clusterid
is defined as the item number of the
item representing the cluster centroid.
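For example, using the 1D representation of the 3 × 3 distance matrix shown above (a sketch; items 0 and 1, which are closest to each other, should end up sharing a medoid, while item 2 forms a cluster by itself):
>>> from numpy import array
>>> from Bio.Cluster import kmedoids
>>> distance = array([1.1, 2.3, 4.5])  # distances (1,0), (2,0), (2,1)
>>> clusterid, error, nfound = kmedoids(distance, nclusters=2, npass=10)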
Hierarchical clustering
Hierarchical clustering methods are inherently different from the k-means clustering method.
The first step in hierarchical clustering is to calculate the distance
matrix, specifying all the distances between the items to be clustered.
Next, we create a node by joining the two closest items. Subsequent
nodes are created by pairwise joining of items or nodes based on the
distance between them, until all items belong to the same node. A tree
structure can then be created by retracing which items and nodes were
merged. Unlike the EM algorithm, which is used in k-means clustering, the complete process of hierarchical clustering is deterministic.
Several flavors of hierarchical clustering exist, which differ in how
the distance between subnodes is defined in terms of their members. In
Bio.Cluster
, pairwise single, maximum, average, and centroid linkage
are available.
In pairwise single-linkage clustering, the distance between two nodes is defined as the shortest distance among the pairwise distances between the members of the two nodes.
In pairwise maximum-linkage clustering, alternatively known as pairwise complete-linkage clustering, the distance between two nodes is defined as the longest distance among the pairwise distances between the members of the two nodes.
In pairwise average-linkage clustering, the distance between two nodes is defined as the average over all pairwise distances between the items of the two nodes.
In pairwise centroid-linkage clustering, the distance between two nodes is defined as the distance between their centroids. The centroids are calculated by taking the mean over all the items in a cluster. As the distance from each newly formed node to existing nodes and items need to be calculated at each step, the computing time of pairwise centroid-linkage clustering may be significantly longer than for the other hierarchical clustering methods. Another peculiarity is that (for a distance measure based on the Pearson correlation), the distances do not necessarily increase when going up in the clustering tree, and may even decrease. This is caused by an inconsistency between the centroid calculation and the distance calculation when using the Pearson correlation: Whereas the Pearson correlation effectively normalizes the data for the distance calculation, no such normalization occurs for the centroid calculation.
For pairwise single-, complete-, and average-linkage clustering, the distance between two nodes can be found directly from the distances between the individual items. Therefore, the clustering algorithm does not need access to the original gene expression data, once the distance matrix is known. For pairwise centroid-linkage clustering, however, the centroids of newly formed subnodes can only be calculated from the original data and not from the distance matrix.
The implementation of pairwise single-linkage hierarchical clustering is based on the SLINK algorithm [Sibson1973], which is much faster and more memory-efficient than a straightforward implementation of pairwise single-linkage clustering. The clustering result produced by this algorithm is identical to the clustering solution found by the conventional single-linkage algorithm. The single-linkage hierarchical clustering algorithm implemented in this library can be used to cluster large gene expression data sets, for which conventional hierarchical clustering algorithms fail due to excessive memory requirements and running time.
Representing a hierarchical clustering solution
The result of hierarchical clustering consists of a tree of nodes, in
which each node joins two items or subnodes. Usually, we are not only
interested in which items or subnodes are joined at each node, but also
in their similarity (or distance) as they are joined. To store one node
in the hierarchical clustering tree, we make use of the class Node
,
which is defined in Bio.Cluster
. An instance of Node
has three
attributes:
left
right
distance
Here, left
and right
are integers referring to the two items or
subnodes that are joined at this node, and distance
is the distance
between them. The items being clustered are numbered from 0 to (number of items − 1), while clusters are numbered from −1 to −(number of items − 1).
To create a new Node
object, we need to specify left
and
right
; distance
is optional.
>>> from Bio.Cluster import Node
>>> Node(2, 3)
(2, 3): 0
>>> Node(2, 3, 0.91)
(2, 3): 0.91
The attributes left
, right
, and distance
of an existing
Node
object can be modified directly:
>>> node = Node(4, 5)
>>> node.left = 6
>>> node.right = 2
>>> node.distance = 0.73
>>> node
(6, 2): 0.73
An error is raised if left
and right
are not integers, or if
distance
cannot be converted to a floating-point value.
The Python class Tree
represents a full hierarchical clustering
solution. A Tree
object can be created from a list of Node
objects:
>>> from Bio.Cluster import Node, Tree
>>> nodes = [Node(1, 2, 0.2), Node(0, 3, 0.5), Node(-2, 4, 0.6), Node(-1, -3, 0.9)]
>>> tree = Tree(nodes)
>>> print(tree)
(1, 2): 0.2
(0, 3): 0.5
(-2, 4): 0.6
(-1, -3): 0.9
The Tree
initializer checks if the list of nodes is a valid
hierarchical clustering result:
>>> nodes = [Node(1, 2, 0.2), Node(0, 2, 0.5)]
>>> Tree(nodes)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ValueError: Inconsistent tree
Individual nodes in a Tree
object can be accessed using square
brackets:
>>> nodes = [Node(1, 2, 0.2), Node(0, -1, 0.5)]
>>> tree = Tree(nodes)
>>> tree[0]
(1, 2): 0.2
>>> tree[1]
(0, -1): 0.5
>>> tree[-1]
(0, -1): 0.5
As a Tree
object is immutable, we cannot change individual nodes in
a Tree
object. However, we can convert the tree to a list of nodes,
modify this list, and create a new tree from this list:
>>> tree = Tree([Node(1, 2, 0.1), Node(0, -1, 0.5), Node(-2, 3, 0.9)])
>>> print(tree)
(1, 2): 0.1
(0, -1): 0.5
(-2, 3): 0.9
>>> nodes = tree[:]
>>> nodes[0] = Node(0, 1, 0.2)
>>> nodes[1].left = 2
>>> tree = Tree(nodes)
>>> print(tree)
(0, 1): 0.2
(2, -1): 0.5
(-2, 3): 0.9
This guarantees that any Tree
object is always well-formed.
To display a hierarchical clustering solution with visualization
programs such as Java Treeview, it is better to scale all node distances
such that they are between zero and one. This can be accomplished by
calling the scale
method on an existing Tree
object:
>>> tree.scale()
This method takes no arguments, and returns None
.
Before drawing the tree, you may also want to reorder the tree nodes. A hierarchical clustering solution of n items can be drawn as 2^(n−1) different but equivalent dendrograms, obtained by switching the left and right subnode at each node. The tree.sort(order) method
visits each node in the hierarchical clustering tree and verifies if the
average order value of the left subnode is less than or equal to the
average order value of the right subnode. If not, the left and right
subnodes are exchanged. Here, the order values of the items are given by
the user. In the resulting dendrogram, items in the left-to-right order
will tend to have increasing order values. The method will return the
indices of the elements in the left-to-right order after sorting:
>>> indices = tree.sort(order)
such that item indices[i] will occur at position i in the dendrogram.
After hierarchical clustering, the items can be grouped into k clusters based on the tree structure stored in the Tree object by cutting the tree:
>>> clusterid = tree.cut(nclusters=1)
where nclusters (defaulting to 1) is the desired number of clusters k. This method returns an array clusterid containing the number of the cluster to which each item is assigned. Clusters are numbered 0 to nclusters−1.
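For example, a minimal sketch using the treecluster function described in the next section (which of the two groups receives label 0 and which receives label 1 depends on the tree):
>>> from numpy import array
>>> from Bio.Cluster import treecluster
>>> data = array([[0.0, 1.0], [0.1, 1.1], [5.0, 6.0], [5.1, 6.2]])
>>> tree = treecluster(data)           # hierarchical clustering of the four rows
>>> clusterid = tree.cut(nclusters=2)  # rows 0 and 1 get one label, rows 2 and 3 the other
>>> len(clusterid)
4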
Performing hierarchical clustering
To perform hierarchical clustering, use the treecluster
function in
Bio.Cluster
.
>>> from Bio.Cluster import treecluster
>>> tree = treecluster(data)
where the following arguments are defined:
data: Array containing the data for the items.
mask (default: None): Array of integers showing which data are missing. If mask[i, j] == 0, then data[i, j] is missing. If mask is None, then all data are present.
weight (default: None): The weights to be used when calculating distances. If weight is None, then equal weights are assumed.
transpose (default: 0): Determines if rows (transpose is False) or columns (transpose is True) are to be clustered.
method (default: 'm'): defines the linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='c': pairwise centroid-linkage clustering
method=='a': pairwise average-linkage clustering
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
To apply hierarchical clustering on a precalculated distance matrix,
specify the distancematrix
argument when calling the treecluster
function instead of the data
argument:
>>> from Bio.Cluster import treecluster
>>> tree = treecluster(distancematrix=distance)
In this case, the following arguments are defined:
distancematrix: The distance matrix, which can be specified in three ways:
as a 2D Numerical Python array (in which only the left-lower part of the array will be accessed):
distance = array([[0.0, 1.1, 2.3], [1.1, 0.0, 4.5], [2.3, 4.5, 0.0]])
as a 1D Numerical Python array containing consecutively the distances in the left-lower part of the distance matrix:
distance = array([1.1, 2.3, 4.5])
as a list containing the rows of the left-lower part of the distance matrix:
distance = [array([]), array([1.1]), array([2.3, 4.5])]
These three expressions correspond to the same distance matrix. As treecluster may shuffle the values in the distance matrix as part of the clustering algorithm, be sure to save this array in a different variable before calling treecluster if you need it later.
method: The linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='a': pairwise average-linkage clustering
While pairwise single-, maximum-, and average-linkage clustering can be calculated from the distance matrix alone, pairwise centroid-linkage cannot.
When calling treecluster
, either data
or distancematrix
should be None
.
This function returns a Tree object. This object contains (number of items − 1) nodes, where the number of items is the number of rows if rows were clustered, or the number of columns if columns were clustered. Each node describes a pairwise linking event, where the node attributes left and right each contain the number of one item or subnode, and distance the distance between them. Items are numbered from 0 to (number of items − 1), while clusters are numbered −1 to −(number of items − 1).
Self-Organizing Maps
Self-Organizing Maps (SOMs) were invented by Kohonen to describe neural networks (see for instance Kohonen, 1997 [Kohonen1997]). Tamayo (1999) first applied Self-Organizing Maps to gene expression data [Tamayo1999].
SOMs organize items into clusters that are situated in some topology. Usually a rectangular topology is chosen. The clusters generated by SOMs are such that neighboring clusters in the topology are more similar to each other than clusters far from each other in the topology.
The first step to calculate a SOM is to randomly assign a data vector to each cluster in the topology. If rows are being clustered, then the number of elements in each data vector is equal to the number of columns.
An SOM is then generated by taking rows one at a time, and finding which cluster in the topology has the closest data vector. The data vector of that cluster, as well as those of the neighboring clusters, are adjusted using the data vector of the row under consideration. The adjustment is given by

\Delta x_{\mathrm{cell}} = \tau \cdot \left( x_{\mathrm{row}} - x_{\mathrm{cell}} \right) .

The parameter \tau decreases at each iteration step as a simple linear function of the iteration step:

\tau = \tau_{\mathrm{init}} \cdot \left( 1 - \frac{i}{n} \right),

in which \tau_{\mathrm{init}} is the initial value of \tau as specified by the user, i is the number of the current iteration step, and n is the total number of iteration steps to be performed. While changes are made rapidly at the beginning of the iteration, only small changes are made at the end.

All clusters within a radius R of the winning cluster are adjusted to the gene under consideration. This radius decreases as the calculation progresses as

R = R_{\mathrm{max}} \cdot \left( 1 - \frac{i}{n} \right),

in which the maximum radius is defined as

R_{\mathrm{max}} = \sqrt{ N_x^2 + N_y^2 },

where (N_x, N_y) are the dimensions of the rectangle defining the topology.
The function somcluster
implements the complete algorithm to
calculate a Self-Organizing Map on a rectangular grid. First it
initializes the random number generator. The node data are then
initialized using the random number generator. The order in which genes
or samples are used to modify the SOM is also randomized. The total
number of iterations in the SOM algorithm is specified by the user.
To run somcluster
, use
>>> from Bio.Cluster import somcluster
>>> clusterid, celldata = somcluster(data)
where the following arguments are defined:
data (required): Array containing the data for the items.
mask (default: None): Array of integers showing which data are missing. If mask[i, j] == 0, then data[i, j] is missing. If mask is None, then all data are present.
weight (default: None): contains the weights to be used when calculating distances. If weight is None, then equal weights are assumed.
transpose (default: 0): Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
nxgrid, nygrid (default: 2, 1): The number of cells horizontally and vertically in the rectangular grid on which the Self-Organizing Map is calculated.
inittau (default: 0.02): The initial value for the parameter τ that is used in the SOM algorithm. The default value for inittau is 0.02, which was used in Michael Eisen’s Cluster/TreeView program.
niter (default: 1): The number of iterations to be performed.
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
This function returns the tuple (clusterid, celldata)
:
clusterid: An array with two columns, where the number of rows is equal to the number of items that were clustered. Each row contains the x and y coordinates of the cell in the rectangular SOM grid to which the item was assigned.
celldata: An array with dimensions (nxgrid, nygrid, number of columns) if rows are being clustered, or (nxgrid, nygrid, number of rows) if columns are being clustered. Each element [ix][iy] of this array is a 1D vector containing the gene expression data for the centroid of the cluster in the grid cell with coordinates [ix][iy].
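For example, to place the rows of data on a 3 × 2 grid (a sketch; the actual cell assignments depend on the random initialization of the SOM):
>>> from Bio.Cluster import somcluster
>>> clusterid, celldata = somcluster(data, nxgrid=3, nygrid=2, niter=100)
>>> len(clusterid) == len(data)  # one (x, y) grid cell per row of data
True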
Principal Component Analysis
Principal Component Analysis (PCA) is a widely used technique for analyzing multivariate data. A practical example of applying Principal Component Analysis to gene expression data is presented by Yeung and Ruzzo (2001) [Yeung2001].
In essence, PCA is a coordinate transformation in which each row in the
data matrix is written as a linear sum over basis vectors called
principal components, which are ordered and chosen such that each
maximally explains the remaining variance in the data vectors. For
example, an n × 3 data matrix can be represented as an ellipsoidal cloud of n points in three-dimensional space. The first principal component is the longest axis of the ellipsoid, the second principal component is the second longest axis, and the third principal component is the shortest axis. Each row in the data matrix can be reconstructed as a suitable linear combination of the principal components.
The principal components can be found by calculating the eigenvectors of the covariance matrix of the data. The corresponding eigenvalues determine how much of the variance present in the data is explained by each principal component.
Before applying principal component analysis, typically the mean is subtracted from each column in the data matrix. In the example above, this effectively centers the ellipsoidal cloud around its centroid in 3D space, with the principal components describing the variation of points in the ellipsoidal cloud with respect to their centroid.
The function pca
below first uses the singular value decomposition
to calculate the eigenvalues and eigenvectors of the data matrix. The
singular value decomposition is implemented as a translation in C of the
Algol procedure svd
[Golub1971], which uses
Householder bidiagonalization and a variant of the QR algorithm. The
principal components, the coordinates of each data vector along the
principal components, and the eigenvalues corresponding to the principal
components are then evaluated and returned in decreasing order of the
magnitude of the eigenvalue. If data centering is desired, the mean
should be subtracted from each column in the data matrix before calling
the pca
routine.
To apply Principal Component Analysis to a rectangular matrix data
,
use
>>> from Bio.Cluster import pca
>>> columnmean, coordinates, components, eigenvalues = pca(data)
This function returns a tuple
columnmean, coordinates, components, eigenvalues
:
columnmean: Array containing the mean over each column in data.
coordinates: The coordinates of each row in data with respect to the principal components.
components: The principal components.
eigenvalues: The eigenvalues corresponding to each of the principal components.
The original matrix data
can be recreated by calculating
columnmean + dot(coordinates, components)
.
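That relation can be verified numerically on a random matrix (a sketch for illustration only):
>>> import numpy as np
>>> from Bio.Cluster import pca
>>> data = np.random.default_rng(1).standard_normal((10, 4))
>>> columnmean, coordinates, components, eigenvalues = pca(data)
>>> np.allclose(data, columnmean + np.dot(coordinates, components))
True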
Handling Cluster/TreeView-type files
Cluster/TreeView are GUI-based codes for clustering gene expression
data. They were originally written by Michael
Eisen while at Stanford University
[Eisen1998]. Bio.Cluster
contains functions for
reading and writing data files that correspond to the format specified
for Cluster/TreeView. In particular, by saving a clustering result in
that format, TreeView can be used to visualize the clustering results.
We recommend using Alok Saldanha’s Java TreeView program (http://jtreeview.sourceforge.net/) [Saldanha2004], which can display hierarchical as well as k-means clustering results.
An object of the class Record
contains all information stored in a
Cluster/TreeView-type data file. To store the information contained in
the data file in a Record
object, we first open the file and then
read it:
>>> from Bio import Cluster
>>> with open("mydatafile.txt") as handle:
... record = Cluster.read(handle)
...
This two-step process gives you some flexibility in the source of the data. For example, you can use
>>> import gzip # Python standard library
>>> handle = gzip.open("mydatafile.txt.gz", "rt")
to open a gzipped file, or
>>> from urllib.request import urlopen
>>> from io import TextIOWrapper
>>> url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Cluster/cyano.txt"
>>> handle = TextIOWrapper(urlopen(url))
to open a file stored on the Internet before calling read
.
The read
command reads the tab-delimited text file
mydatafile.txt
containing gene expression data in the format
specified for Michael Eisen’s Cluster/TreeView program. In this file
format, rows represent genes and columns represent samples or
observations. For a simple time course, a minimal input file would look
like this:
YORF      0 minutes   30 minutes   1 hour   2 hours   4 hours
YAL001C   1           1.3          2.4      5.8       2.4
YAL002W   0.9         0.8          0.7      0.5       0.2
YAL003W   0.8         2.1          4.2      10.1      10.1
YAL005C   1.1         1.3          0.8      0.4
YAL010C   1.2         1            1.1      4.5       8.3
Each row (gene) has an identifier that always goes in the first column. In this example, we are using yeast open reading frame codes. Each column (sample) has a label in the first row. In this example, the labels describe the time at which a sample was taken. The first column of the first row contains a special field that tells the program what kind of objects are in each row. In this case, YORF stands for yeast open reading frame. This field can be any alphanumeric value. The remaining cells in the table contain data for the appropriate gene and sample. The 5.8 in row 2 column 4 means that the observed value for gene YAL001C at 2 hours was 5.8. Missing values are acceptable and are designated by empty cells (e.g. YAL005C at 4 hours).
The input file may contain additional information. A maximal input file would look like this:
YORF      NAME                         GWEIGHT   GORDER   0     30    1     2      4
EWEIGHT                                                   1     1     1     1      0
EORDER                                                    5     3     2     1      1
YAL001C   TFIIIC 138 KD SUBUNIT        1         1        1     1.3   2.4   5.8    2.4
YAL002W   UNKNOWN                      0.4       3        0.9   0.8   0.7   0.5    0.2
YAL003W   ELONGATION FACTOR EF1-BETA   0.4       2        0.8   2.1   4.2   10.1   10.1
YAL005C   CYTOSOLIC HSP70              0.4       5        1.1   1.3   0.8   0.4
The added columns NAME, GWEIGHT, and GORDER and rows EWEIGHT and EORDER are optional. The NAME column allows you to specify a label for each gene that is distinct from the ID in column 1.
A Record
object has the following attributes:
data: The data array containing the gene expression data. Genes are stored row-wise, while samples are stored column-wise.
mask: This array shows which elements in the data array, if any, are missing. If mask[i, j] == 0, then data[i, j] is missing. If no data were found to be missing, mask is set to None.
geneid: This is a list containing a unique description for each gene (i.e., ORF numbers).
genename: This is a list containing a description for each gene (i.e., gene name). If not present in the data file, genename is set to None.
gweight: The weights that are to be used to calculate the distance in expression profile between genes. If not present in the data file, gweight is set to None.
gorder: The preferred order in which genes should be stored in an output file. If not present in the data file, gorder is set to None.
expid: This is a list containing a description of each sample, e.g. experimental condition.
eweight: The weights that are to be used to calculate the distance in expression profile between samples. If not present in the data file, eweight is set to None.
eorder: The preferred order in which samples should be stored in an output file. If not present in the data file, eorder is set to None.
uniqid: The string that was used instead of UNIQID in the data file.
After loading a Record
object, each of these attributes can be
accessed and modified directly. For example, the data can be
log-transformed by taking the logarithm of record.data
.
Calculating the distance matrix
To calculate the distance matrix between the items stored in the record, use
>>> matrix = record.distancematrix()
where the following arguments are defined:
transpose (default: 0): Determines if the distances between the rows of data are to be calculated (transpose is False), or between the columns of data (transpose is True).
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
This function returns the distance matrix as a list of rows, where the number of columns of each row is equal to the row number (see section Calculating the distance matrix).
Calculating the cluster centroids
To calculate the centroids of clusters of items stored in the record, use
>>> cdata, cmask = record.clustercentroids()
where the following arguments are defined:
clusterid (default: None): Vector of integers showing to which cluster each item belongs. If clusterid is not given, then all items are assumed to belong to the same cluster.
method (default: 'a'): Specifies whether the arithmetic mean (method=='a') or the median (method=='m') is used to calculate the cluster center.
transpose (default: 0): Determines if the centroids of the rows of data are to be calculated (transpose is False), or the centroids of the columns of data (transpose is True).
This function returns the tuple cdata, cmask
; see section
Calculating the cluster centroids for a description.
Calculating the distance between clusters
To calculate the distance between clusters of items stored in the record, use
>>> distance = record.clusterdistance()
where the following arguments are defined:
index1 (default: 0): A list containing the indices of the items belonging to the first cluster. A cluster containing only one item can be represented either as a list [i], or as an integer i.
index2 (default: 0): A list containing the indices of the items belonging to the second cluster. A cluster containing only one item can be represented either as a list [i], or as an integer i.
method (default: 'a'): Specifies how the distance between clusters is defined:
'a': Distance between the two cluster centroids (arithmetic mean);
'm': Distance between the two cluster centroids (median);
's': Shortest pairwise distance between items in the two clusters;
'x': Longest pairwise distance between items in the two clusters;
'v': Average over the pairwise distances between items in the two clusters.
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
transpose (default: 0): If transpose is False, calculate the distance between the rows of data. If transpose is True, calculate the distance between the columns of data.
Performing hierarchical clustering
To perform hierarchical clustering on the items stored in the record, use
>>> tree = record.treecluster()
where the following arguments are defined:
transpose (default: 0): Determines if genes or samples are being clustered. If transpose is False, genes (rows) are clustered. If transpose is True, samples (columns) are clustered.
method (default: 'm'): defines the linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='c': pairwise centroid-linkage clustering
method=='a': pairwise average-linkage clustering
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
This function returns a Tree object. This object contains (number of items − 1) nodes, where the number of items is the number of rows if rows were clustered, or the number of columns if columns were clustered. Each node describes a pairwise linking event, where the node attributes left and right each contain the number of one item or subnode, and distance the distance between them. Items are numbered from 0 to (number of items − 1), while clusters are numbered −1 to −(number of items − 1).
Performing k-means or k-medians clustering
To perform k-means or k-medians clustering on the items stored in the record, use
>>> clusterid, error, nfound = record.kcluster()
where the following arguments are defined:
nclusters (default: 2): The number of clusters k.
transpose (default: 0): Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
npass (default: 1): The number of times the k-means/k-medians clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value of npass is ignored and the clustering algorithm is run only once, as it behaves deterministically in that case.
method (default: 'a'): describes how the center of a cluster is found:
method=='a': arithmetic mean (k-means clustering);
method=='m': median (k-medians clustering).
For other values of method, the arithmetic mean is used.
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
This function returns a tuple (clusterid, error, nfound)
, where
clusterid
is an integer array containing the number of the cluster
to which each row or column was assigned, error
is the
within-cluster sum of distances for the optimal clustering solution, and
nfound
is the number of times this optimal solution was found.
Calculating a Self-Organizing Map
To calculate a Self-Organizing Map of the items stored in the record, use
>>> clusterid, celldata = record.somcluster()
where the following arguments are defined:
transpose (default: 0): Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
nxgrid, nygrid (default: 2, 1): The number of cells horizontally and vertically in the rectangular grid on which the Self-Organizing Map is calculated.
inittau (default: 0.02): The initial value for the parameter τ that is used in the SOM algorithm. The default value for inittau is 0.02, which was used in Michael Eisen’s Cluster/TreeView program.
niter (default: 1): The number of iterations to be performed.
dist (default: 'e', Euclidean distance): Defines the distance function to be used (see Distance functions).
This function returns the tuple (clusterid, celldata)
:
clusterid: An array with two columns, where the number of rows is equal to the number of items that were clustered. Each row contains the x and y coordinates of the cell in the rectangular SOM grid to which the item was assigned.
celldata: An array with dimensions (nxgrid, nygrid, number of columns) if rows are being clustered, or (nxgrid, nygrid, number of rows) if columns are being clustered. Each element [ix][iy] of this array is a 1D vector containing the gene expression data for the centroid of the cluster in the grid cell with coordinates [ix][iy].
Saving the clustering result
To save the clustering result, use
>>> record.save(jobname, geneclusters, expclusters)
where the following arguments are defined:
jobname: The string jobname is used as the base name for names of the files that are to be saved.
geneclusters: This argument describes the gene (row-wise) clustering result. In case of k-means clustering, this is a 1D array containing the number of the cluster each gene belongs to. It can be calculated using kcluster. In case of hierarchical clustering, geneclusters is a Tree object.
expclusters: This argument describes the (column-wise) clustering result for the experimental conditions. In case of k-means clustering, this is a 1D array containing the number of the cluster each experimental condition belongs to. It can be calculated using kcluster. In case of hierarchical clustering, expclusters is a Tree object.
This method writes the text files jobname.cdt
, jobname.gtr
,
jobname.atr
, jobname*.kgg
, and/or jobname*.kag
for
subsequent reading by the Java TreeView program. If geneclusters
and
expclusters
are both None
, this method only writes the text file
jobname.cdt
; this file can subsequently be read into a new
Record
object.
Example calculation
This is an example of a hierarchical clustering calculation, using
single linkage clustering for genes and maximum linkage clustering for
experimental conditions. As the Euclidean distance is being used for
gene clustering, it is necessary to scale the node distances
in genetree
such that they are all between zero and one. This is needed
for the Java TreeView code to display the tree diagram correctly. To
cluster the experimental conditions, the uncentered correlation is being
used. No scaling is needed in this case, as the distances in exptree
are already between zero and two.
The example data cyano.txt
can be found in Biopython’s
Tests/Cluster
subdirectory and is from the paper
Hihara et al. 2001 [Hihara2001].
>>> from Bio import Cluster
>>> with open("cyano.txt") as handle:
... record = Cluster.read(handle)
...
>>> genetree = record.treecluster(method="s")
>>> genetree.scale()
>>> exptree = record.treecluster(dist="u", transpose=1)
>>> record.save("cyano_result", genetree, exptree)
This will create the files cyano_result.cdt
, cyano_result.gtr
,
and cyano_result.atr
.
Similarly, we can save a k-means clustering solution:
>>> from Bio import Cluster
>>> with open("cyano.txt") as handle:
... record = Cluster.read(handle)
...
>>> (geneclusters, error, ifound) = record.kcluster(nclusters=5, npass=1000)
>>> (expclusters, error, ifound) = record.kcluster(nclusters=2, npass=100, transpose=1)
>>> record.save("cyano_result", geneclusters, expclusters)
This will create the files cyano_result_K_G5_A2.cdt,
cyano_result_K_G5.kgg
, and cyano_result_K_A2.kag
.