path: root/general/datasets/HC_M2_1206_R/processing.rtf
author Bonface 2024-02-09 09:41:28 -0600
committer Munyoki Kilyungi 2024-08-09 13:30:43 +0300
commit d029d5d7f8ead1f1de8d318045004a4a6f68f5fb (patch)
tree 33c7ff40e3f953d030ed08f468f7afb1dfcba9e6 /general/datasets/HC_M2_1206_R/processing.rtf
parent 769ff7825f5d8d36d541e90534c07f1985899973 (diff)
Update dataset RTF Files.
Diffstat (limited to 'general/datasets/HC_M2_1206_R/processing.rtf')
-rw-r--r-- general/datasets/HC_M2_1206_R/processing.rtf | 41
1 files changed, 41 insertions, 0 deletions
diff --git a/general/datasets/HC_M2_1206_R/processing.rtf b/general/datasets/HC_M2_1206_R/processing.rtf
new file mode 100644
index 0000000..bfc45d4
--- /dev/null
+++ b/general/datasets/HC_M2_1206_R/processing.rtf
@@ -0,0 +1,41 @@
+<blockquote><a class="fs14" href="http://www.biomedcentral.com/1471-2105/6/65" target="_empty">Harshlight</a> was used to examine the image quality of the array (CEL files). Bad areas (bubbles, scratches, blemishes) of arrays were masked.
+<p>First pass data quality control: Affymetrix GCOS provides useful array quality control data including:</p>
+
+<ol>
+ <li>The scale factor used to normalize mean probe intensity. This averaged 3.3 for the 179 arrays that passed and 6.2 for arrays that were excluded. The scale factor is not a particularly critical parameter.</li>
+ <li>The average background level. Values averaged 54.8 units for the data sets that passed and 55.8 for data sets that were excluded. This factor is not important for quality control.</li>
+ <li>The percentage of probe sets that are associated with good signal (&quot;present&quot; calls). This averaged 50% for the 179 data sets that passed and 42% for those that failed. Values for passing data sets extended from 43% to 55%. This is a particularly important criterion.</li>
+ <li>The 3&#39;:5&#39; signal ratios of actin and Gapdh. Values for passing data sets averaged 1.5 for actin and 1.0 for Gapdh. Values for excluded data sets averaged 12.9 for actin and 9.6 for Gapdh. This is a highly discriminative QC criterion, although one must keep in mind that only two transcripts are being tested. Sequence variation among strains (particularly wild derivative strains such as CAST/Ei) may affect these ratios.</li>
+</ol>
+
+<p>The second step in our post-processing QC involves a count of the number of probe sets in each array that are more than 2 standard deviations (z score units) from the mean across the entire set of 206 array data sets. This was the most important criterion used to eliminate &quot;bad&quot; data sets. All 206 arrays were processed together using standard RMA and PDNN methods. The count and percentage of probe sets in each array that were beyond the 2 z threshold were computed. Using the RMA transform, the average percentage of probe sets beyond the 2 z threshold for the 179 arrays that finally passed the QC procedure was 1.76% (median of 1.18%). In contrast, the 2 z percentage was more than 10-fold higher (mean of 22.4% and median of 20.2%) for those arrays that were excluded. This method is not very sensitive to the transformation method that is used. Using the PDNN transform, the average percentage of probe sets exceeding the threshold was 1.31% for good arrays and 22.6% for those that were excluded. In our opinion, this 2 z criterion is the most useful criterion for the final decision of whether or not to include arrays, although again, allowances need to be made for wild strains that one expects to be different from the majority of conventional inbred strains. For example, if a data set has excellent characteristics on all of the Affymetrix GCOS metrics listed above, but generates a high 2 z percentage, then one should include the sample if one can verify that there are no problems in sample and data set identification.</p>
+
+<p>The entire procedure can be reapplied once the initial outlier data sets have been eliminated to detect any remaining outlier data sets.</p>
+
+<p><a class="fs14" href="http://www.datadesk.com/products/data_analysis/datadesk/" target="_empty">DataDesk</a> was used to examine the statistical quality of the probe level (CEL) data after step 5 below. DataDesk allows rapid detection of subsets of probes that are particularly sensitive to still unknown factors in array processing. Arrays can then be categorized at the probe level into &quot;reaction classes.&quot; A reaction class is a group of arrays for which the expression of essentially all probes is collinear over the full range of log2 values. A single but large group of arrays (n = 32) processed in essentially the identical manner by a single operator can produce arrays belonging to as many as four different reaction classes. Reaction classes are NOT related to strain, age, sex, treatment, or any known biological parameter (technical replicates can belong to different reaction classes). We do not yet understand the technical origins of reaction classes. The number of probes that contribute to the definition of reaction classes is quite small (&lt;10% of all probes). We have categorized all arrays in this data set into one of 5 reaction classes. These have then been treated as if they were separate batches. Probes in these data type &quot;batches&quot; have been aligned to a common mean as described below.</p>
+
+<p><strong>Probe set data with custom CDF mapping:</strong> The original Affymetrix annotation often has multiple probe sets mapping to a single gene. Some of these redundancies represent alternative splicing products, while some reflect our changing knowledge of the mouse genome. This transformation uses an annotation generated by the <a href="http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp">Microarray Group at the University of Michigan</a> where each probe has been checked against the latest mouse genome build (Build 36, mm8) and then collated into a new probe set based on its placement within a gene sequence in the Entrez Gene database. The following quote from their <a href="http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/cdfreadme.htm">Brainarray</a> website explains in more detail:</p>
+
+<p><q>Affymetrix GeneChips were based on the best UniGene clustering and genomic sequence information available at the time of chip design. Due to the significant increase in EST/cDNA/Genomic sequence information in the last couple of years, some oligonucleotide probes in these old designs can now be assigned to different genes/transcripts based on the current UniGene clustering and genome annotation. While Affymetrix&#39;s current annotation system maps each probe set to the latest UniGene build every couple of months, it does not deal with situations where a subset of oligonucleotide probes in a probe set may be assigned to another gene or more than one gene based on the current UniGene clustering and genome annotation. In addition, a significant portion of UniGene clusters can be represented by more than one oligonucleotide probe set on GeneChips but there is no standard approach to deal with signals from different probe sets representing the same gene. It will be highly desirable to have one probe set-one target relationship for the interpretation of the data. </q></p>
+
+<ol>
+ <li>CEL values produced by <a class="fs14" href="http://www.affymetrix.com/support/technical/product_updates/gcos_download.affx" target="_blank">GCOS</a> are 75% quantiles from a set of 91 pixel values per cell.</li>
+ <li>Probe level data from the CEL files were transformed with the RMA transform using the <a href="http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download_v8.asp">Mm74Bv2_Mm_ENTREZG_8</a> (Version 8) CDF mapping. Data transformation was done in Bioconductor using the <code>justRMA()</code> function from the affy package and the Mm430_Mm_ENTREZG file as contained in the Bioconductor repository. This yields only one unique probe set for each Entrez GeneID.</li>
+ <li>We computed the Z scores for each array.</li>
+ <li>The arithmetic mean of the values for the set of microarrays for each strain was computed.
+ <ul>
+ <li>The Z scores were recomputed for each strain.</li>
+ <li>We multiplied all Z scores by 2.</li>
+ <li>We added 8 to the value of all Z scores. The consequence of this simple set of transformations is to produce a set of Z scores that have a mean of 8, a variance of 4, and a standard deviation of 2. The advantage of this modified Z score is that a two-fold difference in expression level (probe brightness level) corresponds approximately to a 1 unit difference.</li>
+ </ul>
+ </li>
+</ol>
+
+<p>Probe level QC: Log2 probe data of all arrays were inspected in DataDesk before and after quantile normalization. Inspection involved examining scatterplots of pairs of arrays for signal homogeneity (i.e., high correlation and linearity of the bivariate plots) and looking at all pairs of correlation coefficients. XY plots of probe expression and signal variance were also examined. Probe level array data sets were organized into reaction groups. Arrays with probe data that were not homogeneous when compared to other arrays were flagged.</p>
+
+<p>Probe set level QC: The final normalized individual array data were evaluated for outliers. This involved counting the number of times that the probe set value for a particular array was beyond two standard deviations of the mean. This outlier analysis was carried out using the PDNN, RMA and MAS5 transforms, and outliers were assessed across different levels of expression. Arrays that were associated with an average of more than 8% outlier probe sets across all transforms and at all expression levels were eliminated. In contrast, most other arrays generated fewer than 5% outliers.</p>
+
+<p>Validation of strains and sex of each array data set: A subset of probes and probe sets with a Mendelian pattern of inheritance was used to construct an expression correlation matrix for all arrays, together with the ideal Mendelian expectation for each strain constructed from the genotypes. There should naturally be a very high correlation in the expression patterns of transcripts with Mendelian phenotypes within each strain, as well as with the genotype strain distribution pattern of markers for the strain.</p>
+
+<p>Sex of the samples was validated using sex-specific probe sets such as <em>Xist</em> and <em>Dby</em>.</p>
+</blockquote>
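
Note on the added text: the 2 z outlier count and the modified Z score (2z + 8) described in processing.rtf can be sketched in a few lines. This is a minimal illustration, not the pipeline that produced the data set. The function names, the dict-of-lists input layout, the choice of population SD within an array and sample SD across arrays, and the skipping of zero-variance probe sets are all assumptions made for the example.

```python
from statistics import mean, pstdev, stdev


def modified_z_scores(values):
    """Rescale one array's log2 values to mean 8 and SD 2 (2 * z + 8).

    Assumption: z is computed with the population SD of the array.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("all values identical; Z scores are undefined")
    return [2.0 * (v - mu) / sd + 8.0 for v in values]


def percent_beyond_2z(arrays, threshold=2.0):
    """Percentage of probe sets per array lying beyond `threshold` SDs.

    `arrays` maps a (hypothetical) array name to a list of probe set
    values, with probe sets in the same order for every array. For each
    probe set, the mean and sample SD are taken across all arrays.
    """
    names = list(arrays)
    n_probe_sets = len(arrays[names[0]])
    counts = {name: 0 for name in names}
    for i in range(n_probe_sets):
        column = [arrays[name][i] for name in names]
        mu = mean(column)
        sd = stdev(column)
        if sd == 0:
            # Assumption: a probe set with no variation has no outliers.
            continue
        for name, v in zip(names, column):
            if abs(v - mu) / sd > threshold:
                counts[name] += 1
    return {name: 100.0 * counts[name] / n_probe_sets for name in names}
```

Arrays whose percentage is far above the rest would be candidates for exclusion, subject to the caveat in the text about wild strains and to checking sample identification first.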