<p><strong>Probe (cell) level data from the CEL file: </strong>These CEL values produced by <a class="fs14" href="http://www.affymetrix.com/support/technical/product_updates/gcos_download.affx" target="_blank">GCOS</a> are 75% quantiles from a set of 91 pixel values per cell.</p>

<ul>
	<li>Step 1: We added an offset of 1.0 unit to each cell signal to ensure that all values could be logged without generating negative values. We then computed the log base 2 of each cell.</li>
	<li>Step 2: We performed a quantile normalization of the log base 2 values for the total set of 105 arrays (processed as two batches) using the same initial steps used by the RMA transform.</li>
	<li>Step 3: We computed the Z scores for each cell value.</li>
	<li>Step 4: We multiplied all Z scores by 2.</li>
	<li>Step 5: We added 8 to the value of all Z scores. The consequence of this simple set of transformations is to produce a set of Z scores that have a mean of 8, a variance of 4, and a standard deviation of 2. The advantage of this modified Z score is that a two-fold difference in expression level corresponds approximately to a 1 unit difference.</li>
	<li>Step 6: We eliminated much of the systematic technical variance introduced by the batches at the probe level. To do this we calculated the ratio of each batch mean to the mean of both batches and used this as a single multiplicative probe-specific batch correction factor. The consequence of this simple correction is that the mean probe signal value for each batch is the same.</li>
	<li>Step 7: Finally, we computed the arithmetic mean of the values for the set of microarrays for each strain. Technical replicates were averaged before computing the mean for independent biological samples. Note that we have not (yet) corrected for variance introduced by differences in sex or for any interaction terms. We have not corrected for background beyond the background correction implemented by Affymetrix in generating the CEL file. We eventually hope to add statistical controls and adjustments for some of these variables.</li>
</ul>
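<p>The arithmetic in Steps 1 through 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to build the data set: the function names (<code>log2_offset</code>, <code>quantile_normalize</code>, <code>modified_z</code>, <code>batch_correct</code>) are hypothetical, Z scores are assumed to be computed within each array using the population standard deviation, the "mean of both batches" is assumed to be the mean of the two batch means, and the quantile normalization is a simplified version that breaks ties by position.</p>

```python
import math
from statistics import mean, pstdev


def log2_offset(cells, offset=1.0):
    """Step 1: add an offset to each cell value, then take log base 2."""
    return [math.log2(v + offset) for v in cells]


def quantile_normalize(arrays):
    """Step 2 (simplified): force every array onto the same distribution.

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position (an assumption of this sketch).
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(s[i] for s in sorted_arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def modified_z(values, scale=2.0, shift=8.0):
    """Steps 3-5: Z score each value, multiply by 2, then add 8."""
    mu = mean(values)
    sd = pstdev(values)  # population SD; an assumption of this sketch
    return [scale * (v - mu) / sd + shift for v in values]


def batch_correct(values_by_batch):
    """Step 6 for one probe: rescale each batch so all batch means agree.

    The batch factor is batch_mean / overall_mean; each value is divided
    by that factor, so every batch ends up with the overall mean.
    """
    batch_means = {b: mean(v) for b, v in values_by_batch.items()}
    overall = mean(list(batch_means.values()))
    return {
        b: [x * overall / batch_means[b] for x in v]
        for b, v in values_by_batch.items()
    }
```

<p>Applied to any array with non-zero spread, <code>modified_z</code> returns values with a mean of 8 and a standard deviation of 2, which is the property described in Step 5.</p>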

<p><strong>Probe set data: </strong>The expression data were processed by Yanhua Qu (UTHSC). Probe set data were generated from the fully normalized CEL files (quantile and batch corrected) using the standard MAS 5 Tukey biweight procedure. A 1-unit difference represents roughly a two-fold difference in expression level. Expression levels below 5 are usually close to background noise levels.</p>

<p><strong>Data quality control: </strong>A total of 62 samples passed RNA quality control.</p>
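<p>For readers unfamiliar with the Tukey biweight named in the probe set description above, the following is a minimal sketch of a one-step Tukey biweight estimate of a robust average. It is illustrative only and is not the MAS 5 implementation: the function name <code>tukey_biweight</code> is hypothetical, and the tuning constant <code>c=5</code> and the small <code>eps</code> term are assumed defaults that should be checked against the Affymetrix documentation.</p>

```python
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers.

    Values far from the median (relative to the median absolute deviation)
    receive a weight of zero; values near the median receive weights near 1.
    """
    m = median(values)
    mad = median(abs(v - m) for v in values)  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)  # scaled distance from the median
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

<p>With probe values such as <code>[7.9, 8.0, 8.1, 20.0]</code>, the value 20.0 receives zero weight, so the estimate stays close to 8 rather than being pulled toward the arithmetic mean of 11.</p>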

<p>Probe level QC: Log2 probe data of all arrays were inspected in DataDesk before and after quantile normalization. Inspection involved examining scatterplots of pairs of arrays for signal homogeneity (i.e., high correlation and linearity of the bivariate plots) and reviewing the correlation coefficients for all pairs of arrays (62 x 61 / 2 = 1,891 pairs). Arrays with probe data that were not homogeneous when compared to any other arrays were flagged. If the correlation at the probe level was less than approximately 0.92, we deleted that array data set. Three arrays were lost during this process (BXD19_M_Str_Batch03, BXD23_F_Str_Batch03, and BXD24_F_Str_Batch03).</p>
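<p>The pairwise correlation screen described above can be approximated as follows. This is a sketch under assumptions, not the DataDesk procedure that was actually used: the names <code>pearson</code> and <code>flag_arrays</code> are hypothetical, and the rule of flagging an array when its median correlation with all other arrays falls below the threshold is one possible reading of the text, which does not state exactly how the 0.92 cutoff was applied.</p>

```python
import math
from itertools import combinations
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_arrays(arrays, threshold=0.92):
    """Return names of arrays whose median correlation with the others is low.

    `arrays` maps an array name to its list of log2 probe values.
    All n * (n - 1) / 2 pairs are compared once.
    """
    names = list(arrays)
    corr = {name: [] for name in names}
    for a, b in combinations(names, 2):
        r = pearson(arrays[a], arrays[b])
        corr[a].append(r)
        corr[b].append(r)
    return [name for name in names if median(corr[name]) < threshold]
```

<p>Using the median rather than the minimum means that a single poor array does not cause every array it is paired with to be flagged as well.</p>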

<p>Probe set level QC: The final normalized strain averages were evaluated for outliers. This involved counting the number of times that the probe set value for a particular strain was more than two standard deviations from the mean of all strains. (We used the PDNN transform as our reference probe set data for this QC step.) Two strains, each represented by a single array, generated more than 5,000 outliers (10% of the number of probe sets). These two arrays generated a large number of outliers across the entire range of expression, and since we do not yet have replicate arrays for either of these two strains, we opted to delete them from the final April 2005 striatum data sets.</p>
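<p>The outlier count used in this step can be sketched as below. The function names <code>outlier_counts</code> and <code>strains_to_drop</code> are hypothetical, the sample standard deviation is an assumption (the text does not say which estimator was used), and the 10% cutoff is taken from the description above.</p>

```python
from statistics import mean, stdev


def outlier_counts(expr, n_sd=2.0):
    """Count, per strain, the probe sets lying more than n_sd SDs from the mean.

    `expr` maps a strain name to its list of probe set values; every list
    uses the same probe set order.
    """
    strains = list(expr)
    counts = dict.fromkeys(strains, 0)
    n_probe_sets = len(expr[strains[0]])
    for i in range(n_probe_sets):
        column = [expr[s][i] for s in strains]  # one probe set, all strains
        mu, sd = mean(column), stdev(column)
        for s, v in zip(strains, column):
            if abs(v - mu) > n_sd * sd:
                counts[s] += 1
    return counts


def strains_to_drop(counts, n_probe_sets, fraction=0.10):
    """Strains whose outlier count exceeds the given fraction of probe sets."""
    return [s for s, c in counts.items() if c > fraction * n_probe_sets]
```

<p>Note that with very few strains no value can exceed two sample standard deviations, so a screen of this kind is only informative when many strains are present.</p>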