|
|
Analysis Techniques
This section begins with some gene
filtering tools that are useful in reducing the dimension of the data set
and narrowing the investigation to the most interesting genes. Methods for
finding similar gene expression
patterns and for clustering
using hierarchical clustering, k-means clustering, and self-organizing maps are also discussed.
Simple gene-wise comparison
possibilities are described, as are techniques for locating regulatory sequences and functional similarities for gene lists of
interest. Higher-level analysis are
possible and may require advanced biostatistical help.
Gene Filtering
Typically, about 40% of the genes on a microarray are expressed in a given
sample of interest. Many of the gene expression values from a collection of
chips may therefore be uninformative, and ignoring these genes in the analysis
might be perfectly appropriate. Ad hoc gene filters are often applied to
reduce the number of genes under investigation, enabling the investigator to
focus on those genes that are most likely to be informative in the setting of
interest. Which filtering techniques are appropriate will depend on the design
of the study. Most software has simple filtering tools such as the three
described below.
Presence call > X%
This narrows our analysis to genes with a positive presence call in a certain
percentage (> X%) of the samples.
A < st. dev./mean < B
This requires that a gene be variable enough compared to its mean expression
level to contain interesting information (> A), but not so variable
that nothing can be learned (< B).
Expression level > Y in X%
Since low expression estimates are sometimes unreliable, we may want to limit
our analysis to genes that are expressed above some threshold (>Y) in a
certain percentage (X%) of the samples.
Similar Gene Expression Patterns
One analysis possibility is to find genes with similar expression patterns to a
specified gene of interest. Some type of distance metric (often Pearson's
correlation coefficient) is used to quantify the similarity between two
expression profiles. For example, one might construct a list of genes with
Pearson's correlation coefficient >95% to a gene of interest, and search for
functional similarities in this list. Constructing lists of genes with similar
expression patterns is possible in dChip, Bioconductor, and GeneSpring.
Clustering
Clustering is one of the most widely used techniques in the analysis of
microarray data. It looks to identify groups of genes with similar expression
patterns over a set of observations or groups of samples with similar gene
expression profiles. Three of the most widely used clustering techniques are
hierarchical clustering, k-means clustering, and self-organizing maps.
Hierarchical Clustering
The hierarchical clustering algorithm computes a dendrogram that assembles all
elements into a single tree by specifying a similarity measure such as
Euclidean distance, in which similar elements are placed on closer branches of
the tree.
There are two major hierarchical clustering approaches:
Agglomerative Methods (bottom up):
Start with each object in its own cluster, then agglomerate the closest pair of
clusters at each stage, successively combining clusters until all of the
objects are in one cluster.
Divisive Methods (top-down):
Start with all objects in a single class. Increase the number of classes at
each step of the algorithm by dividing an existing class into (usually two)
sub-classes.
There are many ways of defining an inter-cluster distance:
single linkage based on the shortest distance between members
complete linkage based on the largest distance between members
average linkage based on the average distance between members
Ward's method based on the sum-of-squares
centroid method based on the distance between cluster centroids
Example
Let's use an artificial data set to illustrate agglomerative hierarchical
clustering with single linkage. Consider the hypothetical distances between
five pairs of genes given below.
|
| |
g1 |
0 |
|
|
|
|
| |
g2 |
7 |
0 |
|
|
|
| D = dik = |
g3 |
8 |
10 |
0 |
|
|
| |
g4 |
11 |
12 |
6 |
0 |
|
| |
g5 |
15 |
3 |
17 |
20 |
0 |
|
Since d25=3 is the minimum of dik for all
i, k, combine g2 and g5, and recalculate the
distances:
- d(25)1=min(d12,d15)=min(7,13)=7,
- d(25)3=min(d23,d53)=min(10,17)=10,
- d(25)4=min(d24,d25)=min(12,20)=12.
Now delete the row and column in D containing g2 and g5,
and add a row and column for the cluster (25).
|
| |
g(25) |
0 |
|
|
|
| |
g1 |
7 |
0 |
|
|
| |
g3 |
10 |
8 |
0 |
|
| |
g4 |
12 |
11 |
6 |
0 |
|
Now the minimum distance is between g3 and g4,
d34=6, so merge clusters 3 and 4 and calculate:
d(25)(34)=min(d(25)3,d(25)4)=min(10,12)=10,
d1(34)=min(d13,d14)=min(8,11)=8.
So the next distance matrix becomes:
|
| |
g(34) |
0 |
|
|
| |
g(25) |
10 |
0 |
|
| |
g1 |
8 |
7 |
0 |
|
Now the minimum distance is between g1 and g(25),
d1(25)=7, so merge cluster g(25) and g1 and
calculate:
d(251)(34)=min(d(25)1,d(25)(34))=min(7,10)=7,
and the final matrix becomes:
|
|
|
Finally the two clusters (251) and (34) merge into one single cluster. The
dendrogram for the example is drawn below.
K-Means Clustering
Given a criterion measuring the accuracy of a partition of a set of objects
into k disjoint classes, the k-means algorithm contains two main stages that
search for 'optimal' partitions of the objects.
Specify k initial centroids.
Assign an object to the cluster whose centroid is the nearest. Once
each object is assigned to the cluster whose centroid it is closest to,
recalculate the centroid for each cluster.
Repeat 2 until the cluster membership stabilizes.
Example
Suppose we measured four genes on two individuals. The data is given below:
|
| |
Individuals |
| Genes |
1 |
2 |
| g1 |
5 |
3 |
| g2 |
-1 |
1 |
| g3 |
1 |
-2 |
| g4 |
-3 |
-2 |
|
The object is to divide the data into k=2 clusters such that each gene
within a cluster is more similar than the genes in different clusters. Take
m1=(2,2) and m2=(-1,-2) to be the initial
centroids. Then compute the Euclidean distance of each gene from the centroid
and reassign each gene to the nearest group.
d2(g1,m1)=(5-2)2+(3-2)2=10,
d2(g1,m2)=(5+1)2+(3+2)2=61.
Since
d2(g1,m1)<d2(g1,m2),
g1 is assigned to the cluster whose centroid is at (2,2). A similar
procedure would assign g2 to the cluster whose centroid is at
m2, g3 to m2, and g4 to
m2. Now recalculate the new centroids by taking the mean of
the elements in the cluster, m1=(5,3),
m2=(-1,-1), and repeat the procedure. Now g1 forms
one cluster and g2, g3, g4 again form a second cluster.
Hence the cluster membership stabilizes and the process stops.
Self-Organizing Maps
The self-organizing map (SOM), developed by Kohonen, is a very popular clustering method.
The SOM approach is representative of an unsupervised learning approach, i.e.,
cluster properties are to be estimated or learned without prior information.
SOMs are constructed as follows: One chooses a geometry of "nodes", e.g., a 3x2
grid. The nodes are mapped into k-dimensional space, initially at
random, and then iteratively adjusted. Each iteration involves randomly
selecting a data point P and moving nodes in the direction of P. The closest
node Np is moved the most, whereas other nodes are moved by
smaller amounts depending on their distances from Np in the
initial geometry. In this fashion, neighboring points in the initial geometry
tend to be mapped to nearby points in k-dimensional space. The process
continues for 20,000-50,000 iterations. To be more specific, consider a weight
vector wi associated with note Ni. The
weight vector at lattice element i at iteration number t is
denoted by wi(t). Further, define some metric
||•||, e.g. Euclidean distance. The SOM proceeds as follows:
Step 1: Randomly select an input vector x from the set of input vectors.
Step 2: Determine the node c=i such that ||x -
wi(t)|| is minimized over all i.
Step 3: For node i, set
|
wi(t+1)
|
= wi(t) + α(t)(x-
wi(t))
|
|
if i ∈ Nc(t)
|
| |
= wi(t)
|
|
if i ∉ Nc(t)
|
where α(t) is a small fraction, used for controlling
convergence; and Nc(t) is the neighborhood of lattice element
c.
Step 4: Increment t; return to Step 1, unless the
stopping condition is reached.
Example: GeneCluster software
Tamayo, et al.2 applied SOMs to hematopoetic differentiation in four
well-studied models (HL=60,U937, Jurkat, and NB4 cells). The Affymetrix® HU6800 array contains probe sets for 6416 human genes. The
mapping nodes are adjusted by moving toward data point P by the formula
- fi+1(N) = fi(N) +
τ(d(N,Np),i)(P- fi(N)),
where d(N1,N2) is a distance function, the
position of node N at iteration i is denoted by
fi(N), the initial mapping f0 is random,
and data point P is identified. The point P used at each iteration is
determined by a random ordering of the data points and recycled as needed. The
function τ is defined by τ(x,i)=0.2T/(T+100i) for
x=ρ(i) and τ(x,i)=0 otherwise, where the radius
ρ(i) decreases linearly with i (ρ(0)=3) and
eventually becomes zero, and T is the maximum number of iterations.
This algorithm is implemented in the software GeneCluster written in Java.
SOMs are well suited to exploratory data analysis, allowing one to impose
partial structure on the clusters and facilitating easy visualization and
interpretation. But like hierarchical clustering and k-means clustering, they
tend to find genes that are positively correlated. Clustering results are
usually not very good for negatively correlated genes.
Gene-wise Comparison
When two sample types are being compared, a list of induced or suppressed genes
in the two samples may be constructed by selecting genes based on a certain
fold change ratio or absolute difference. dChip easily accommodates this type
of gene list construction in the Analysis/Compare Samples... menu. In this
menu, you may mark your samples as either Baseline (B) or Experiment (E) and
select genes with certain fold change (E/B or B/E >value) or difference (E-B
or B-E > value) criteria.
Regulatory Sequences
Once a gene list of interest is available, either through clustering or
similarity patterns or other criteria, it may be useful to determine if these
genes have any upstream sequences in common that may serve as regulatory
sequences. GeneSpring currently has this capability for the many genomes.
Functional Similarities
For a gene list of interest, GeneOntology, ProteinDomain, or Cytoband
information can be used to check for known functional similarities among the
genes. This type of analysis is available in dChip, GeneSpring, and GeneSight.
Generally, a p-value is calculated according to the hypergeometric distribution
to measure the statistical significance of the proposed functional similarity.
Higher-Level Analysis
There are several options for higher-level analysis that are not currently
provided by the biostatistics staff in the Microarray Core Facility. Some
examples are discussed in the following references. The Core Facility can make
referrals for advanced bioinformatics help upon request.
- Irene M. Ong, Jeremy D. Glasner, and David Page. Modelling regulatory
pathways in E.coli from time series expression profiles. Bioinformatics 2002
18:241S-248S.
(Available here)
- Xie L. Xu, James M. Olson, and Lue Ping Zhao
A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model
Hum. Mol. Genet. 2002 11: 1977-1985.
|
|
|
|