Data Preparation

Before you begin analyzing your data, it needs to go through some pre-processing. This section describes the normalization procedures available in MAS and dChip (other software has its own normalization procedures that may be used). The algorithms for calculating presence calls in MAS 5.0 and expression estimates in MAS 5.0 and dChip are also described. Once the expression estimates are computed, it is helpful to think of the data as a matrix when beginning your analysis.

Normalization

Normalization of expression values ensures the comparability of gene expression estimates across different samples. There are several different techniques available for normalization, and software generally has its own tools available. Conceptually, normalization corrects for overall chip brightness and other factors that may influence the numerical value of expression intensity, enabling the user to more confidently compare gene expression estimates between samples.

For a comparison analysis, MAS 5.0 handles normalization by multiplying the output of the Experimental array by a Normalization Factor (NF) so that its Average Intensity is the same as that of the Baseline array. MAS also requires Scaling, in which the output of any array is multiplied by a Scaling Factor (SF) to make its Average Intensity equal to an arbitrarily defined Target Intensity. Once normalization and scaling are complete, comparison analyses in MAS can be performed. In the absolute analysis that is run in the Microarray Core, both NF and SF are set to 1.0, indicating no normalization or scaling. It is up to the user to perform these tasks before performing higher-level analyses.
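
As a minimal sketch of the arithmetic described above, the Python snippet below computes NF and SF from plain arithmetic means. The function names and intensity values are hypothetical, and the use of a plain mean is a simplifying assumption; the way MAS 5.0 itself defines Average Intensity may differ in detail.

```python
from statistics import mean

def normalization_factor(experimental, baseline):
    """NF that makes the experimental array's average equal the baseline's."""
    return mean(baseline) / mean(experimental)

def scaling_factor(intensities, target_intensity):
    """SF that makes an array's average equal the chosen target intensity."""
    return target_intensity / mean(intensities)

def apply_factor(intensities, factor):
    """Multiply every intensity on the array by the same factor."""
    return [x * factor for x in intensities]

# Hypothetical intensities for two small arrays.
baseline = [120.0, 80.0, 400.0, 200.0]      # average 200
experimental = [60.0, 40.0, 200.0, 100.0]   # average 100

nf = normalization_factor(experimental, baseline)   # 200 / 100
sf = scaling_factor(baseline, 500.0)                # 500 / 200
normalized = apply_factor(experimental, nf)
print(nf, sf, mean(normalized))
```

With NF and SF both set to 1.0, apply_factor returns the intensities unchanged, which corresponds to the absolute analysis described above.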

In dChip, normalization is done before the calculation of Model-Based Expression Indices (MBEIs, see Expression Estimates). By default, the chip with median overall brightness in a set of samples is chosen as the baseline, and all other arrays are normalized to it. The Invariant Set Normalization method is used, in which a subset of Perfect Match (PM) probes with small within-subset rank difference in the two arrays serves as the basis for the normalization. For more information on the smoothing process involved in Invariant Set Normalization, please refer to Li and Wong 2001b.
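
The idea of normalizing against rank-invariant probes can be illustrated with a heavily simplified Python sketch. This is not dChip's algorithm: it selects the invariant set in a single pass with a fixed, hypothetical rank-difference threshold (max_rank_diff), and it replaces the smoothing step described in Li and Wong 2001b with plain piecewise-linear interpolation that is clamped outside the range of the invariant set. All names are illustrative assumptions.

```python
from bisect import bisect_left

def ranks(values):
    """Rank of each value (0 = smallest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for position, k in enumerate(order):
        result[k] = position
    return result

def invariant_set(array, baseline, max_rank_diff=0.05):
    """Indices of probes whose ranks in the two arrays are nearly the same."""
    n = len(array)
    ra, rb = ranks(array), ranks(baseline)
    return [k for k in range(n) if abs(ra[k] - rb[k]) <= max_rank_diff * n]

def normalize_to_baseline(array, baseline, max_rank_diff=0.05):
    """Map array intensities onto the baseline scale via the invariant set.

    Assumes the invariant set is not empty.
    """
    keep = invariant_set(array, baseline, max_rank_diff)
    # Sorting each side separately gives a monotone normalization curve.
    xs = sorted(array[k] for k in keep)
    ys = sorted(baseline[k] for k in keep)

    def curve(x):
        if x <= xs[0]:
            return ys[0]                 # clamp below the invariant range
        if x >= xs[-1]:
            return ys[-1]                # clamp above the invariant range
        hi = bisect_left(xs, x)          # xs[hi - 1] < x <= xs[hi]
        lo = hi - 1
        t = (x - xs[lo]) / (xs[hi] - xs[lo])
        return ys[lo] + t * (ys[hi] - ys[lo])

    return [curve(x) for x in array]

# Hypothetical PM intensities: the second array is uniformly half as bright.
baseline = [100.0, 200.0, 300.0, 400.0, 500.0]
dim_array = [50.0, 100.0, 150.0, 200.0, 250.0]
print(normalize_to_baseline(dim_array, baseline))
```

Because the dim array preserves the rank order of the baseline, every probe is in the invariant set here and the curve maps the array back onto the baseline intensities.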

Presence Call

Affymetrix® MAS 5.0 reports a Presence Call for each gene as a measure of whether or not the mRNA transcript was actually present in the sample. The Presence Call is either Present (P), Marginal (M), or Absent (A); the algorithm to determine these calls proceeds as follows. This information is taken from the Affymetrix® MAS 5.0 User's Guide.

For each of the J PM/MM pairs in a probe set, calculate a Discrimination score Rj:

  • Rj =  (PMj-MMj)/(PMj+MMj)  for  j=1,...,J.

Note that if the PMj intensity is much larger than the MMj intensity, Rj will be close to 1. On the other hand, if the PMj and MMj intensities are close to each other, Rj will be close to 0, and it is negative whenever MMj exceeds PMj. The Discrimination scores are then compared to a user-defined threshold τ (by default τ=0.015) and are ranked according to their distance from τ. A one-sided Wilcoxon signed-rank test is then used to test whether the scores are significantly greater than τ, yielding a Detection ρ-value for the gene. Probe sets with many Rj values close to 1 will yield lower (more significant) Detection ρ-values, whereas those with many Rj values close to 0 will yield higher (less significant) Detection ρ-values.
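
The calculation above can be sketched in Python as follows. This is an illustrative exact one-sided signed-rank test written with the standard library only, not the MAS 5.0 implementation: dropping scores exactly equal to τ and averaging the ranks of tied distances are assumptions, and any special handling that MAS 5.0 applies (for example to saturated probes) is left out. The function names and intensities are hypothetical.

```python
def discrimination_scores(pm, mm):
    """R_j = (PM_j - MM_j) / (PM_j + MM_j) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]

def detection_p_value(pm, mm, tau=0.015):
    """Exact one-sided signed-rank p-value for 'scores exceed tau'."""
    diffs = [r - tau for r in discrimination_scores(pm, mm)]
    diffs = [d for d in diffs if d != 0]        # assumption: zeros are dropped
    n = len(diffs)

    # Rank |R_j - tau|; ranks are doubled so that averaged ties stay integers.
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    doubled_rank = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled_rank[order[k]] = i + j + 2  # 2 * average of ranks i+1..j+1
        i = j + 1

    # Test statistic: sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(doubled_rank, diffs) if d > 0)

    # Null distribution: every rank is positive or negative with probability 1/2.
    total = sum(doubled_rank)
    counts = [0] * (total + 1)                  # counts[s] = sign patterns with sum s
    counts[0] = 1
    for r in doubled_rank:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[w_plus:]) / 2 ** n

# Hypothetical intensities for a probe set with eight probe pairs.
pm = [850, 1200, 640, 980, 1500, 720, 1100, 930]
mm = [210, 300, 590, 260, 400, 700, 280, 250]
print(detection_p_value(pm, mm))
```

When every PM is far above its MM, all differences from τ are positive and the ρ-value reaches its minimum of 1/2^J; when PM and MM are equal throughout, every score falls below τ and the ρ-value is 1.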

Once a Detection ρ-value is calculated for the gene, a Presence Call is made based on two user-defined Detection ρ-value cutoffs, α1 and α2 (by default, α1=0.04 and α2=0.06).

If Detection ρ-value < α1, the call is Present.
If α1 ≤ Detection ρ-value < α2, the call is Marginal.
If Detection ρ-value ≥ α2, the call is Absent.

Presence Calls are often used for gene filtering before performing higher-level analyses.

gene1 and gene2 are Present since ρ-value < α1
gene3 is Marginal since α1 < ρ-value < α2
gene4 and gene5 are Absent since ρ-value > α2
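
The call rules and the filtering step can be written as a short Python sketch. The default cutoffs come from the text above; the function name, the probe set names, and the ρ-values are hypothetical, and assigning a value that falls exactly on a cutoff to the less confident call is an assumption.

```python
def presence_call(p_value, alpha1=0.04, alpha2=0.06):
    """Return 'P', 'M' or 'A' for a Detection p-value."""
    if p_value < alpha1:
        return "P"      # Present
    if p_value < alpha2:
        return "M"      # Marginal
    return "A"          # Absent

# Hypothetical Detection p-values for three probe sets.
detection = {"probe_set_a": 0.001, "probe_set_b": 0.05, "probe_set_c": 0.30}
calls = {name: presence_call(p) for name, p in detection.items()}

# A simple filter: drop probe sets called Absent before further analysis.
kept = [name for name, call in calls.items() if call != "A"]
print(calls, kept)
```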

Expression Estimates

  • MAS 5.0

    MAS 5.0 uses the One-Step Tukey's Biweight Estimate as a quantitative measure of the mean mRNA transcript abundance for each gene. The log of the background-adjusted PM-MM difference is computed for each probe pair. Each value is then weighted according to its distance from the median of the probe set, so that values far from the median contribute less, and the weighted mean of the log(PM-MM) values is calculated. This mean is converted back to the linear scale and is output as the Signal by MAS. The MM value is subtracted from the PM value because the MM probe is intended to measure stray signal due to cross-hybridization. This information is taken from the Affymetrix® MAS 5.0 User's Guide.

    For some probe pairs, the MM intensity may be very close to, or even larger than, the PM intensity. When this happens, MAS 5.0 applies the following rules so as to provide a better Signal estimate.

    Rule 1: If the MM value is less than the PM value, then the MM value is considered informative and the intensity value is used directly as an estimate of stray signal.

    Rule 2: If the MM probe cells are generally informative across the probe set except for a few MMs, an adjusted MM value is used for uninformative MMs based on the biweight mean of the PM and MM ratio.

    Rule 3: If the MM probe cells are generally uninformative, the uninformative MMs are replaced with a value that is slightly smaller than the PM. These probe sets generally have an Absent detection call.
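
    The Signal calculation and the three rules can be sketched in Python as shown below. The text above does not give the tuning constants or the exact replacement formulas for Rules 2 and 3; the values used here (c=5 and eps=0.0001 for the biweight, contrast_tau=0.03 and scale_tau=10 for the adjusted MM) are assumptions based on Affymetrix's published description of its statistical algorithms and should be checked against that documentation. Background adjustment is omitted, and all function names are hypothetical.

```python
import math
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a mean that down-weights values far from the median."""
    m = median(values)
    s = median([abs(v - m) for v in values])    # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * s + eps)             # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def adjusted_mismatch(pm, mm, contrast_tau=0.03, scale_tau=10.0):
    """Stray-signal estimate for each probe pair, following Rules 1 to 3."""
    # Typical log2(PM/MM) ratio across the probe set.
    sb = tukey_biweight([math.log2(p) - math.log2(m) for p, m in zip(pm, mm)])
    adjusted = []
    for p, m in zip(pm, mm):
        if m < p:                   # Rule 1: MM is informative, use it directly
            adjusted.append(m)
        elif sb > contrast_tau:     # Rule 2: MMs are mostly informative
            adjusted.append(p / 2 ** sb)
        else:                       # Rule 3: value slightly smaller than PM
            shrink = contrast_tau / (1 + (contrast_tau - sb) / scale_tau)
            adjusted.append(p / 2 ** shrink)
    return adjusted

def signal(pm, mm):
    """Biweight mean of log2(PM - adjusted MM), converted back to linear scale."""
    adjusted = adjusted_mismatch(pm, mm)
    return 2 ** tukey_biweight([math.log2(p - a) for p, a in zip(pm, adjusted)])

# Hypothetical probe set in which the fourth MM exceeds its PM.
pm = [800, 1200, 950, 400, 1500]
mm = [200, 300, 250, 450, 350]
print(adjusted_mismatch(pm, mm))
print(signal(pm, mm))
```

    In this example only the fourth probe pair triggers Rule 2, so its MM is replaced by a value derived from the typical PM/MM ratio of the other pairs, and every PM minus adjusted MM difference stays positive before the log is taken.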

  • dChip: PM-MM model

    dChip software, designed by Li and Wong (2001a), provides model-based estimates of gene expression levels as an alternative to the Signal measure from MAS. Their method is motivated by ANOVA analyses of experiments involving several arrays, which demonstrate that the residual mean square due to the individual gene-specific probes is five times or more that due to the arrays themselves. Thus, their model-based method accounts for probe variability.

    Suppose I (>1) arrays have been used in an experiment, and each gene has J (>1) probe pairs associated with it. We then have 2×I×J PM and MM intensity values with which to estimate the amount of mRNA transcript. Let θi represent the expression level for a gene in the ith sample. It is assumed that the intensity value for a particular probe increases linearly with the amount of transcript, but at different rates for different probes. It is also assumed that the rate of increase for the PM probes is higher than that for the MM probes. Thus, we have the following model for the PM and MM intensities:

    MMij = νj + θiαj + ε ,
    PMij = νj + θiαj + θiφj + ε .

    In the equations above, νj represents the background intensity for the jth probe pair due to non-specific hybridization, αj ≥ 0 is the rate of increase of the jth MM probe, φj ≥ 0 is the additional rate of increase for the jth PM probe (also called the Probe Sensitivity Index or PSI), and ε represents random error. Given these models, suppose we look at PM-MM differences. Specifically,

    yij = PMij - MMij = θiφj + εij

    where εij is N(0,σ²). Experiments suggest that the average difference is linear in the true expression level, and so only the yij values will be considered further (Lockhart et al., 1996). As written, the model is not identifiable: multiplying every θi by a constant and dividing every φj by the same constant leaves the products θiφj unchanged, so a constraint must be imposed. If we take

    Σj φj² = J

    then least squares estimates for the parameters can be obtained by an iterative algorithm.
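
    One way to picture such an iterative algorithm is alternating least squares: hold the φs fixed and solve for each θi, hold the θs fixed and solve for each φj, then rescale so that the squared probe sensitivities sum to J. The Python sketch below follows that scheme under simplifying assumptions. It is not dChip's implementation, it performs no outlier detection or standard error calculation, and the function name, iteration count, and data are hypothetical.

```python
import math

def fit_multiplicative_model(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] by alternating least squares.

    y is a list of I rows (arrays), each holding J PM-MM differences.
    Assumes the data are not all zero.
    """
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes                      # start with sum(phi^2) = J
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Least squares estimate of each theta_i with phi held fixed.
        denom = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / denom
                 for i in range(n_arrays)]
        # Least squares estimate of each phi_j with theta held fixed.
        denom = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / denom
               for j in range(n_probes)]
        # Re-impose the constraint sum(phi^2) = J without changing theta*phi.
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
        theta = [t / scale for t in theta]
    return theta, phi

# Hypothetical noise-free data: 3 arrays, 4 probe pairs.
true_theta = [100.0, 250.0, 400.0]
true_phi = [0.5, 1.0, 1.5, 1.0]
y = [[t * p for p in true_phi] for t in true_theta]

theta, phi = fit_multiplicative_model(y)
print(theta)
print(phi)
```

    Because the hypothetical true_phi values do not satisfy the constraint themselves, the fitted θs and φs differ from them by a common scale factor, while the products θiφj reproduce the data.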

    If multiple arrays from the same experiment are available, this model provides estimates of the θs and φs together with their standard errors. The standard errors can be used to identify outlier arrays and probes, which are then excluded from the final estimation of the probe response pattern. For each array i, the model yields an expression level θi. If θi for a particular array has a large standard error relative to the other arrays, possibly due to external factors like the imaging process, that array is called an outlier array. Similarly, if the estimate of φj for the jth probe has a large standard error, possibly due to non-specific cross-hybridization, that probe is called an outlier probe. Individual PM-MM differences may also be flagged when their residuals from the fit are large; these single outliers are treated as missing values in the model-fitting algorithm.

  • dChip: PM-only model

    Cross-hybridization is more likely to occur at the MM probes, rather than the PM probes, and so a PM-only model exists that calculates expression values that are always positive (Li and Wong 2001b). Studies suggest that the PM-only model is more robust to cross-hybridization than the PM-MM difference model.

    For background-subtracted and normalized arrays, the multiplicative model below is fit; the expression level in the ith array, θi, and the jth probe sensitivity index (PSI), φj, are estimated using an iterative algorithm.

    yij = PMij = θiφj + εij

    where εij is N(0,σ²). Techniques similar to those described for the PM-MM model are applied to detect probe, array, and single outliers.

Data as a Matrix

Once you have computed expression estimates for each gene using either MAS 5.0 or dChip, it may be helpful to think of your data as being organized in an n×p matrix, where n is the number of genes and p is the number of samples. In the matrix below, yij represents the expression level for the ith gene (i=1,...,n) in the jth sample (j=1,...,p).

                        Samples
               S1    S2    S3    ...   Sp
         G1    y11   y12   y13   ...   y1p
Genes    G2    y21   y22   y23   ...   y2p
         G3    y31   y32   y33   ...   y3p
         ...   ...   ...   ...   yij   ...
         Gn    yn1   yn2   yn3   ...   ynp
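
In code, this layout corresponds to a list of rows with one row per gene and one column per sample. The short Python sketch below uses hypothetical gene names, sample names, and expression values purely to show the indexing convention.

```python
genes = ["G1", "G2", "G3"]
samples = ["S1", "S2"]

# y[i][j] is the expression level of gene i in sample j (0-based in Python).
y = [
    [120.5, 98.2],
    [15.0, 310.7],
    [802.3, 790.1],
]

def gene_profile(i):
    """Expression of gene i across all samples (one row of the matrix)."""
    return y[i]

def sample_column(j):
    """Expression of all genes in sample j (one column of the matrix)."""
    return [row[j] for row in y]

print(gene_profile(genes.index("G2")))
print(sample_column(samples.index("S1")))
```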