Data Preparation
Before you begin analyzing your data, it needs to go through some
pre-processing. This section describes the procedures available for normalization in both MAS and dChip
(other software has its own normalization procedures that may be used). The
algorithms for calculating presence calls
in MAS 5.0 and expression estimates
in MAS 5.0 and dChip are also described. Once the expression estimates are
computed, it is helpful to think of the data as a matrix when beginning your
analysis.
Normalization
Normalization of expression values ensures the comparability of gene expression
estimates across different samples. There are several different techniques
available for normalization, and software generally has its own tools
available. Conceptually, normalization corrects for overall chip brightness
and other factors that may influence the numerical value of expression
intensity, enabling the user to more confidently compare gene expression
estimates between samples.
For a comparison analysis, MAS 5.0 handles normalization by simply multiplying
the output of the Experimental array by a Normalization Factor (NF) so that
its Average Intensity is the same as that of the Baseline array. MAS also
requires Scaling, in which the output of any array is multiplied by a Scaling
Factor (SF) to make its Average Intensity equal to an arbitrarily defined
Target Intensity. Once normalization and scaling are complete, comparison
analyses in MAS can be performed. In the absolute analysis that is run in the
Microarray Core, both NF and SF are set to 1.0, indicating no normalization or
scaling. It is up to the user to perform these tasks before performing
higher-level analyses.
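Since the absolute analysis leaves both factors at 1.0, the arithmetic may need to be done by hand. The following is a minimal sketch of the two multiplications described above, assuming that Average Intensity is a plain arithmetic mean (MAS may define it differently, for example with trimming). The function names, the example intensities, and the Target Intensity of 500 are illustrative assumptions, not values from MAS.

```python
def mean(values):
    """Plain arithmetic mean; assumed stand-in for the MAS Average Intensity."""
    return sum(values) / len(values)


def scaling_factor(intensities, target_intensity):
    """SF: brings the array's average intensity to the chosen target."""
    return target_intensity / mean(intensities)


def normalization_factor(experiment, baseline):
    """NF: brings the Experimental array's average to that of the Baseline."""
    return mean(baseline) / mean(experiment)


def apply_factor(intensities, factor):
    """Multiply every intensity on the array by the given factor."""
    return [x * factor for x in intensities]


# Hypothetical intensities for two tiny arrays.
baseline = [120.0, 300.0, 480.0]
experiment = [60.0, 150.0, 240.0]

nf = normalization_factor(experiment, baseline)
normalized = apply_factor(experiment, nf)

sf = scaling_factor(baseline, target_intensity=500.0)
scaled_baseline = apply_factor(baseline, sf)
```

After normalization the Experimental array has the same average as the Baseline array, and after scaling the Baseline array's average equals the target.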
In dChip, normalization is done before the calculation of Model-Based
Expression Indices (MBEIs, see Expression
Estimates). By default, in a set of samples, the chip with median overall
brightness is chosen, and all other arrays are normalized to it. The Invariant
Set Normalization method is used in which a subset of Perfect Match (PM) probes
with small within-subset rank difference in the two arrays is used as a basis
for the normalization. For more information on the smoothing process involved
in Invariant Set Normalization, please refer to Li and Wong 2001b.
Presence Call
Affymetrix® MAS 5.0 reports a Presence Call for each gene as a
measure of whether or not the mRNA transcript was actually present in the
sample. The Presence Call is either Present (P), Marginal (M), or Absent (A);
the algorithm to determine these calls proceeds as follows. This information
is taken from the Affymetrix® MAS 5.0 User's Guide.
For each of the J PM/MM pairs in a probe set, calculate a Discrimination
score Rj:
Rj = (PMj - MMj)/(PMj + MMj), for j = 1,...,J.
Note that if the PMj intensity is much larger than the
MMj intensity, Rj will be close to 1. On
the other hand, if the PMj and MMj
intensities are close to each other, Rj will be close to 0,
and possibly negative. The Discrimination scores are then compared to a
user-defined threshold τ (by default τ=0.015) and are ranked according
to their distance from τ. The one-sided Wilcoxon signed-rank test is
then used to test whether the scores are significantly greater than τ, yielding a Detection
p-value for the gene. Probe sets with many Rj
values close to 1 will yield lower (more significant) Detection
p-values, whereas those with many Rj values close
to 0 will yield higher (less significant) Detection p-values.
Once a Detection p-value is calculated for the gene, a Presence Call
is then made based on user-defined Detection p-value cutoffs,
α1 and α2; by default,
α1=0.04 and α2=0.06.
If the Detection p-value < α1,
the call is Present.
If α1 < Detection
p-value < α2, the call is
Marginal.
If α2 < Detection p-value,
the call is Absent.
Presence Calls are often used for gene filtering before performing higher-level
analyses.
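The Presence Call procedure described above can be sketched in a few lines. This is a simplified illustration, not the MAS 5.0 implementation: it computes an exact one-sided Wilcoxon signed-rank p-value by enumerating the null distribution, uses average ranks for ties, and omits any additional handling MAS may apply (such as saturated probe pairs). The function names are hypothetical, and the treatment of p-values exactly equal to a cutoff is an assumption, since the rules above use strict inequalities.

```python
def discrimination_scores(pm, mm):
    """R_j = (PM_j - MM_j) / (PM_j + MM_j) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def detection_p_value(scores, tau=0.015):
    """Exact one-sided Wilcoxon signed-rank p-value for scores > tau."""
    diffs = [r - tau for r in scores if r != tau]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    # Doubled average ranks, so tied ranks stay integers.
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * mean of ranks i+1 .. j+1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    # Null distribution: each rank is positive or negative with probability 1/2.
    total = sum(ranks2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[w_plus:]) / 2 ** n


def presence_call(p_value, alpha1=0.04, alpha2=0.06):
    """Assumption: a p-value equal to a cutoff falls into the less confident call."""
    if p_value < alpha1:
        return "P"
    if p_value < alpha2:
        return "M"
    return "A"


# Hypothetical probe set with 11 pairs where PM is well above MM.
pm = [200.0 + 10 * k for k in range(11)]
mm = [50.0] * 11
p_strong = detection_p_value(discrimination_scores(pm, mm))
call_strong = presence_call(p_strong)
```

In this example every score exceeds τ, so the p-value is the smallest one attainable with 11 probe pairs and the call is Present.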
gene1 and gene2 are Present since
p-value < α1
gene3 is Marginal since α1 <
p-value < α2
gene4 and gene5 are Absent since p-value > α2
Expression Estimates
MAS 5.0
MAS 5.0 uses the One-Step Tukey's Biweight Estimate as a quantitative measure
of the mean mRNA transcript abundance for each gene. The log of the
background-adjusted PM-MM difference for each probe pair is weighted according to its
distance from the median value for the entire probe set, and the weighted
mean of these log(PM-MM) values is then calculated. This
mean is converted back to the linear scale and is output as the Signal by MAS.
The MM value is subtracted from the PM value since it is intended to measure
stray signal due to cross-hybridization. This information is taken from the
Affymetrix® MAS 5.0 User's Guide. For some probe pairs, the
MM intensity may be very close to, or even larger than, the PM intensity. When
this happens, MAS 5.0 applies the following rules so as to provide a better
Signal estimate.
Rule 1: If the MM value is less than the PM value, then the MM value is
considered informative and the intensity value is used directly as an estimate
of stray signal.
Rule 2: If the MM probe cells are generally informative across the probe set
except for a few MMs, an adjusted MM value is used for uninformative MMs based
on the biweight mean of the PM and MM ratio.
Rule 3: If the MM probe cells are generally uninformative, the uninformative
MMs are replaced with a value that is slightly smaller than the PM. These
probe sets generally have an Absent detection call.
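The biweight step can be illustrated with a short sketch. It assumes the MM values have already been adjusted by the rules above so that every PM-MM difference is positive, and it takes the log to be base 2. The tuning constant `c=5` and the small `epsilon` are assumed defaults that do not come from this document, and the function names are hypothetical.

```python
import math
from statistics import median


def one_step_tukey_biweight(values, c=5.0, epsilon=1e-4):
    """Weighted mean in which points far from the median get less weight."""
    m = median(values)
    mad = median([abs(x - m) for x in values])  # median absolute deviation
    num = 0.0
    den = 0.0
    for x in values:
        u = (x - m) / (c * mad + epsilon)  # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0  # bisquare weight
        num += w * x
        den += w
    return num / den


def signal(pm, adjusted_mm):
    """Biweight mean of log2(PM - MM), converted back to the linear scale."""
    logs = [math.log2(p - m) for p, m in zip(pm, adjusted_mm)]
    return 2 ** one_step_tukey_biweight(logs)


# One wild value (50.0) barely moves the estimate away from about 10.
robust = one_step_tukey_biweight([10.0, 10.1, 9.9, 10.05, 9.95, 50.0])
```

The example shows why a biweight is used instead of an ordinary mean: the outlying probe pair receives zero weight, so it does not distort the estimate.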
dChip: PM-MM model
dChip software, designed by Li and Wong
(2001a), provides model-based estimates of expression levels of genes as an
alternative to the Signal measure from MAS. Their method is motivated by ANOVA
analyses of experiments involving several arrays, which demonstrate that the
residual mean square due to the individual gene-specific probes is five times
or more that due to the arrays themselves. Thus, their model-based method accounts
for probe variability.
Suppose I (>1) arrays have been used in an experiment, and each gene
has J (>1) probe pairs associated with it. We then have
2×I×J PM and MM intensity values with which to estimate the
amount of mRNA transcript. Let θi represent the expression level
for a gene in the ith sample. It is assumed that the intensity value
for a particular probe increases linearly with the amount of transcript, but at
different rates for different probes. It is also assumed that the rate of
increase for the PM probes is higher than that for the MM probes. Thus, we
have the following model for the PM and MM intensities:
MMij = νj + θiαj + ε ,
PMij = νj + θiαj + θiφj + ε .
In the equations above, νj represents the background
intensity in the jth probe due to non-specific hybridization,
αj ≥ 0 is the rate of increase of the jth MM
probe, φj ≥ 0 is the additional rate of increase
for the jth PM probe (also called the Probe Sensitivity Index or PSI),
and ε represents random error. Given these models, suppose we look at
PM-MM differences. Specifically,
yij = PMij - MMij = θiφj + εij
where εij is N(0,σ²).
Experiments suggest that the average difference is linear in the true
expression level, and so only the yij values will be
considered further (Lockhart et al., 1996). To make the model identifiable,
there must be some constraint imposed on the model. If we take
Σj φj² = J
then least squares estimates for the parameters can be obtained by an iterative
algorithm.
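One way such an iterative algorithm can work is alternating least squares: hold the probe sensitivities φj fixed and solve for the expression levels θi, then hold the θi fixed and solve for the φj, rescaling so that the squared φj sum to J. The sketch below illustrates this idea for a single gene. It is not dChip's implementation: it runs a fixed number of iterations, does no outlier exclusion, and does not enforce non-negativity. The function name and the example data are hypothetical.

```python
import math


def fit_multiplicative_model(y, n_iter=50):
    """Least squares fit of y[i][j] ~ theta[i] * phi[j], with sum(phi^2) = J."""
    n_arrays = len(y)
    n_probes = len(y[0])
    phi = [1.0] * n_probes  # starting values already satisfy the constraint

    def update_theta(phi):
        # Least squares solution for each array with phi held fixed.
        pp = sum(p * p for p in phi)
        return [sum(y[i][j] * phi[j] for j in range(n_probes)) / pp
                for i in range(n_arrays)]

    for _ in range(n_iter):
        theta = update_theta(phi)
        tt = sum(t * t for t in theta)
        # Least squares solution for each probe with theta held fixed.
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / tt
               for j in range(n_probes)]
        # Rescale so the identifiability constraint holds.
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
    theta = update_theta(phi)  # make theta consistent with the rescaled phi
    return theta, phi


# Hypothetical noise-free PM-MM differences: 3 arrays by 3 probes.
true_theta = [100.0, 200.0, 400.0]
raw_phi = [0.5, 1.0, 1.5]
y = [[t * p for p in raw_phi] for t in true_theta]
theta, phi = fit_multiplicative_model(y)
```

Because the example data contain no noise, the fitted products θiφj reproduce the input, and the fitted φj differ from `raw_phi` only by the rescaling that the constraint requires.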
If there are multiple arrays from the same experiment available, this model
provides estimates of the θs and φs together with their standard
errors. The standard error estimates of the
θs and φs can be used to identify outlier arrays and
probes that will consequently be excluded from the final estimation of the
probe response pattern. For the ith array, the model computes an expression
level θi. If the estimate for a specific array
has a large standard error relative to other arrays, possibly due to external
factors like the imaging process, then it is called an outlier array.
Similarly, if the estimate of φj for the jth
probe has a large standard error, possibly due to non-specific
cross-hybridization, it is called an outlier probe. Individual PM-MM
differences might also be identified by large residuals compared with the fit;
these single outliers are regarded as missing values in the model-fitting
algorithm.
dChip: PM-only model
Cross-hybridization is more likely to occur at the MM probes, rather than the
PM probes, and so a PM-only model exists that calculates expression values that
are always positive (Li and Wong 2001b). Studies suggest that the PM-only
model is more robust to cross-hybridization than the PM-MM difference model.
For background-subtracted and normalized arrays, the multiplicative model below
is fit; the expression value for the gene in the ith array,
θi, as well as the jth probe sensitivity index
(PSI), φj, are estimated using an iterative algorithm.
yij = PMij = θiφj + εij
where εij is N(0,σ²).
Similar techniques to those described for the PM-MM model are applied to detect
probe, array, and single outliers.
Data as a Matrix
Once you have computed expression estimates for each gene using either MAS 5.0
or dChip, it may be helpful to think of your data as being organized in an
n×p matrix, where n is the number of genes and p is
the number of samples. In the matrix below, yij represents
the expression level for the ith gene (i=1,...,n) in the
jth sample (j=1,...,p).
      |     | Samples
      |     | S1  | S2  | S3  | ... | Sp
      | G1  | y11 | y12 | y13 | ... | y1p
Genes | G2  | y21 | y22 | y23 | ... | y2p
      | G3  | y31 | y32 | y33 | ... | y3p
      | ... | ... | ... | ... | yij | ...
      | Gn  | yn1 | yn2 | yn3 | ... | ynp
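As a small illustration of this layout, the matrix can be held as a list of rows, one row per gene and one column per sample. The gene and sample labels and all expression values below are made up for the example, and the helper name is hypothetical. Note that Python indexes from 0, so the helper converts from the 1-based i and j used in the text.

```python
genes = ["G1", "G2", "G3"]
samples = ["S1", "S2"]

# Rows are genes, columns are samples (hypothetical expression values).
expression = [
    [120.5, 98.2],
    [15.0, 22.4],
    [310.7, 290.1],
]

n = len(expression)     # number of genes
p = len(expression[0])  # number of samples


def expression_level(i, j):
    """Return y_ij for gene i and sample j, both counted from 1."""
    return expression[i - 1][j - 1]


gene_profile = expression[1]                    # all samples for G2
sample_column = [row[0] for row in expression]  # all genes for S1
```

Reading along a row gives one gene's profile across samples, while reading down a column gives one sample's profile across genes.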