Analysis Techniques

This section begins with some gene filtering tools that are useful in reducing the dimension of the data set and narrowing the investigation to the most interesting genes. Methods for finding similar gene expression patterns and for clustering using hierarchical clustering, k-means clustering, and self-organizing maps are also discussed. Simple gene-wise comparison possibilities are described, as are techniques for locating regulatory sequences and functional similarities for gene lists of interest. Higher-level analysis are possible and may require advanced biostatistical help.

Gene Filtering

Typically, about 40% of the genes on a microarray are expressed in a given sample of interest. Many of the gene expression values from a collection of chips may therefore be uninformative, and ignoring these genes in the analysis might be perfectly appropriate. Ad hoc gene filters are often applied to reduce the number of genes under investigation, enabling the investigator to focus on those genes that are most likely to be informative in the setting of interest. Which filtering techniques are appropriate will depend on the design of the study. Most software has simple filtering tools such as the three described below.

  • Presence call > X%

    This narrows our analysis to genes with a positive presence call in a certain percentage (> X%) of the samples.

  • A < st. dev./mean < B

    This requires that a gene be variable enough compared to its mean expression level to contain interesting information (> A), but not so variable that nothing can be learned (< B).

  • Expression level > Y in X%

    Since low expression estimates are sometimes unreliable, we may want to limit our analysis to genes that are expressed above some threshold (>Y) in a certain percentage (X%) of the samples.

    Similar Gene Expression Patterns

    One analysis possibility is to find genes with similar expression patterns to a specified gene of interest. Some type of distance metric (often Pearson's correlation coefficient) is used to quantify the similarity between two expression profiles. For example, one might construct a list of genes with Pearson's correlation coefficient >95% to a gene of interest, and search for functional similarities in this list. Constructing lists of genes with similar expression patterns is possible in dChip, Bioconductor, and GeneSpring.

    Clustering

    Clustering is one of the most widely used techniques in the analysis of microarray data. It looks to identify groups of genes with similar expression patterns over a set of observations or groups of samples with similar gene expression profiles. Three of the most widely used clustering techniques are hierarchical clustering, k-means clustering, and self-organizing maps.

    Hierarchical Clustering

    The hierarchical clustering algorithm computes a dendrogram that assembles all elements into a single tree by specifying a similarity measure such as Euclidean distance, in which similar elements are placed on closer branches of the tree.

    There are two major hierarchical clustering approaches:

    • Agglomerative Methods (bottom up): Start with each object in its own cluster, then agglomerate the closest pair of clusters at each stage, successively combining clusters until all of the objects are in one cluster.

    • Divisive Methods (top-down): Start with all objects in a single class. Increase the number of classes at each step of the algorithm by dividing an existing class into (usually two) sub-classes.

    There are many ways of defining an inter-cluster distance:

    • single linkage based on the shortest distance between members

    • complete linkage based on the largest distance between members

    • average linkage based on the average distance between members

    • Ward's method based on the sum-of-squares

    • centroid method based on the distance between cluster centroids

    Example

    Let's use an artificial data set to illustrate agglomerative hierarchical clustering with single linkage. Consider the hypothetical distances between five pairs of genes given below.

  •   g1 0        
      g2 7 0      
    D = dik =  g3 8 10 0    
      g4 11 12 6 0  
      g5 15 3 17 20 0

    Since d25=3 is the minimum of dik for all i, k, combine g2 and g5, and recalculate the distances:

    • d(25)1=min(d12,d15)=min(7,13)=7,
    • d(25)3=min(d23,d53)=min(10,17)=10,
    • d(25)4=min(d24,d25)=min(12,20)=12.

    Now delete the row and column in D containing g2 and g5, and add a row and column for the cluster (25).

      g(25) 0      
      g1 7 0    
      g3 10 8 0  
      g4 12 11 6 0

    Now the minimum distance is between g3 and g4, d34=6, so merge clusters 3 and 4 and calculate:

    • d(25)(34)=min(d(25)3,d(25)4)=min(10,12)=10,
    • d1(34)=min(d13,d14)=min(8,11)=8.

    So the next distance matrix becomes:

      g(34) 0    
      g(25) 10 0  
      g1 8 7 0
    Now the minimum distance is between g1 and g(25), d1(25)=7, so merge cluster g(25) and g1 and calculate:
    • d(251)(34)=min(d(25)1,d(25)(34))=min(7,10)=7,

    and the final matrix becomes:

      g(34) 0    
      g(251) 7   0

    Finally the two clusters (251) and (34) merge into one single cluster. The dendrogram for the example is drawn below.

    q1   
       
    q2       
           
    q5       
         
    q3     
           
    q4       
       

    K-Means Clustering

    Given a criterion measuring the accuracy of a partition of a set of objects into k disjoint classes, the k-means algorithm contains two main stages that search for 'optimal' partitions of the objects.

    1. Specify k initial centroids.
    2. Assign an object to the cluster whose centroid is the nearest. Once each object is assigned to the cluster whose centroid it is closest to, recalculate the centroid for each cluster.
    3. Repeat 2 until the cluster membership stabilizes.
    Example

    Suppose we measured four genes on two individuals. The data is given below:

      Individuals
    Genes 1 2
    g1 5 3
    g2 -1 1
    g3 1 -2
    g4 -3 -2

    The object is to divide the data into k=2 clusters such that each gene within a cluster is more similar than the genes in different clusters. Take m1=(2,2) and m2=(-1,-2) to be the initial centroids. Then compute the Euclidean distance of each gene from the centroid and reassign each gene to the nearest group.

    • d2(g1,m1)=(5-2)2+(3-2)2=10,
    • d2(g1,m2)=(5+1)2+(3+2)2=61.
    Since d2(g1,m1)<d2(g1,m2), g1 is assigned to the cluster whose centroid is at (2,2). A similar procedure would assign g2 to the cluster whose centroid is at m2, g3 to m2, and g4 to m2. Now recalculate the new centroids by taking the mean of the elements in the cluster, m1=(5,3), m2=(-1,-1), and repeat the procedure. Now g1 forms one cluster and g2, g3, g4 again form a second cluster. Hence the cluster membership stabilizes and the process stops.

    Self-Organizing Maps

    The self-organizing map (SOM), developed by Kohonen, is a very popular clustering method. The SOM approach is representative of an unsupervised learning approach, i.e., cluster properties are to be estimated or learned without prior information.

    SOMs are constructed as follows: One chooses a geometry of "nodes", e.g., a 3x2 grid. The nodes are mapped into k-dimensional space, initially at random, and then iteratively adjusted. Each iteration involves randomly selecting a data point P and moving nodes in the direction of P. The closest node Np is moved the most, whereas other nodes are moved by smaller amounts depending on their distances from Np in the initial geometry. In this fashion, neighboring points in the initial geometry tend to be mapped to nearby points in k-dimensional space. The process continues for 20,000-50,000 iterations. To be more specific, consider a weight vector wi associated with note Ni. The weight vector at lattice element i at iteration number t is denoted by wi(t). Further, define some metric ||•||, e.g. Euclidean distance. The SOM proceeds as follows:

    • Step 1: Randomly select an input vector x from the set of input vectors.
    • Step 2: Determine the node c=i such that ||x - wi(t)|| is minimized over all i.
    • Step 3: For node i, set

      wi(t+1) = wi(t) + α(t)(x- wi(t))   if i ∈ Nc(t)
        = wi(t)   if i ∉ Nc(t)

      where α(t) is a small fraction, used for controlling convergence; and Nc(t) is the neighborhood of lattice element c.

    • Step 4: Increment t; return to Step 1, unless the stopping condition is reached.
    Example: GeneCluster software

    Tamayo, et al.2 applied SOMs to hematopoetic differentiation in four well-studied models (HL=60,U937, Jurkat, and NB4 cells). The Affymetrix® HU6800 array contains probe sets for 6416 human genes. The mapping nodes are adjusted by moving toward data point P by the formula

    • fi+1(N) = fi(N) + τ(d(N,Np),i)(P- fi(N)),

    where d(N1,N2) is a distance function, the position of node N at iteration i is denoted by fi(N), the initial mapping f0 is random, and data point P is identified. The point P used at each iteration is determined by a random ordering of the data points and recycled as needed. The function τ is defined by τ(x,i)=0.2T/(T+100i) for x=ρ(i) and τ(x,i)=0 otherwise, where the radius ρ(i) decreases linearly with i (ρ(0)=3) and eventually becomes zero, and T is the maximum number of iterations. This algorithm is implemented in the software GeneCluster written in Java.

    SOMs are well suited to exploratory data analysis, allowing one to impose partial structure on the clusters and facilitating easy visualization and interpretation. But like hierarchical clustering and k-means clustering, they tend to find genes that are positively correlated. Clustering results are usually not very good for negatively correlated genes.

    Gene-wise Comparison

    When two sample types are being compared, a list of induced or suppressed genes in the two samples may be constructed by selecting genes based on a certain fold change ratio or absolute difference. dChip easily accommodates this type of gene list construction in the Analysis/Compare Samples... menu. In this menu, you may mark your samples as either Baseline (B) or Experiment (E) and select genes with certain fold change (E/B or B/E >value) or difference (E-B or B-E > value) criteria.

    Regulatory Sequences

    Once a gene list of interest is available, either through clustering or similarity patterns or other criteria, it may be useful to determine if these genes have any upstream sequences in common that may serve as regulatory sequences. GeneSpring currently has this capability for the many genomes.

    Functional Similarities

    For a gene list of interest, GeneOntology, ProteinDomain, or Cytoband information can be used to check for known functional similarities among the genes. This type of analysis is available in dChip, GeneSpring, and GeneSight. Generally, a p-value is calculated according to the hypergeometric distribution to measure the statistical significance of the proposed functional similarity.

    Higher-Level Analysis

    There are several options for higher-level analysis that are not currently provided by the biostatistics staff in the Microarray Core Facility. Some examples are discussed in the following references. The Core Facility can make referrals for advanced bioinformatics help upon request.

    • Irene M. Ong, Jeremy D. Glasner, and David Page. Modelling regulatory pathways in E.coli from time series expression profiles. Bioinformatics 2002 18:241S-248S. (Available here)

    • Xie L. Xu, James M. Olson, and Lue Ping Zhao A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model Hum. Mol. Genet. 2002 11: 1977-1985.