# Glossary of Terms and Features
[A](#a) | [B](#b) | [C](#c)| [D](#d) | [E](#e) | [F](#f) | [G](#g) | [H](#h) | [I](#i) | [J](#j) | [K](#k) | [L](#l) | [M](#m) | [N](#n) | [O](#o) | [P](#p) | [Q](#q) | [R](#r) | [S](#s) | [T](#t) | [U](#u) | [V](#v) | [W](#w) | [X](#x) | [Y](#y) | [Z](#z) You are welcome to cite or reproduce these glossary definitions. Please cite or link: Author AA. "Insert Glossary Term Here." From The WebQTL Glossary--A GeneNetwork Resource. gn1.genenetwork.org/glossary.html ## A #### Additive Allele Effect: The additive allele effect is an estimate of the change in the average phenotype that would be produced by substituting a single allele of one type with that of another type (e.g., a replaced by A) in a population. In a standard F2 intercross between two inbred parental lines there are two alleles at every polymorphic locus that are often referred to as the little "a" allele and big "A" allele. F2 progeny inherit the a/a, a/A, or A/A genotypes at every genetic locus in a ratio close to 1:2:1. The additive effect is half of the difference between the mean of all cases that are homozygous for one parental allele (aa) compared to the mean of all cases that are homozygous for the other parental allele (AA): GeneNetwork displays the additive values on the far right of many trait/QTL maps, usually as red or green lines along the maps. The units of measurement of additive effects (and dominance effects) are defined by the trait itself and are shown in **Trait Data and Analysis** windows. For mRNA estimates these units are usually normalized log2 expression values. For this reason an additive effect of 0.5 units indicates that the A/A and a/a genotypes at a particular locus or marker differ by 1 unit (twice the effect of swapping a single A allele for an a allele). On this log2 scale this is equivalent to a 2-fold difference (2 raised to the power of 1). On the QTL map plots the polarity of allele effects is represented by the color of the line. For example, in mouse BXD family maps, if the DBA/2J allele produces higher values than the C57BL/6J allele then the additive effect line is colored in green. In contrast, if the C57BL/6J allele produces higher values then the line is colored in red. For computational purposes, C57BL/6J red values are considered negative. The dominance effects of alleles are also computed on maps for F2 populations (e.g., B6D2F2 and B6BTBRF2). Orange and purple line colors are used to distinguish the polarity of effects. Purple is the positive dominance effect that matches the polarity of the green additive effect, whereas orange is the negative dominance effect that matches the polarity of the red additive effect. [Please also see entry on **Dominance Effects**: Williams RW, Oct 15, 2004; Sept 3, 2005; Dec 4, 2005; Oct 25, 2011] [Go back to index](#index) #### Bootstrap: A [bootstrap sample](http://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29) is a randomly drawn sample (or resample) that is taken from the original data set and that has the same number of samples as the original data set. In a single bootstrap sample, some cases will by chance be represented one or more times; other cases may not be represented at all (in other words, the sampling is done "with replacement" after each selection). To get a better intuitive feel for the method, imagine a bag of 26 Scrabble pieces that contain each letter of the English alphabet. In a bootstrap sample of these 26 pieces, you would shake the bag, insert your hand, and draw out one piece. You would then write down that letter on a piece of paper, and the place that Scrabble piece back in the bag in preparation for the next random selection. You would repeat this process (shake, draw, replace) 25 more times to generate a single bootstrap resample of the alphabet. Some letters will be represented several time in each sample and others will not be represented at al. If you repeat this procedure 1000 times you would have a set of bootstrap resamples of the type that GN uses to remap data sets. Bootstrap resampling is a method that can be used to estimate statistical parameters and error terms. GeneNetwork uses a bootstrap procedure to evaluate approximate confidence limits of QTL peaks using a method proposed by Peter Visscher and colleagues ([1996](http://www.genetics.org/content/143/2/1013.full.pdf)). We generate 2000 bootstraps, remap each, and keep track of the location of the single locus with the highest LRS score locations (equivalent to a "letter" in the Scrabble example). The 2000 "best" locations are used to produce the yellow histograms plotted on some of the QTL maps. If the position of a QTL is firm, then the particular composition of the sample, will not shift the position of the QTL peak by very much. In such a case, the histogram of "best QTLs" (yellow bars in the maps) that is displayed in WebQTL maps will tend to have a sharp peak (the scale is the percentage of bootstrap resamples that fall into each bar of the bootstrap histogram). In contrast, if the the yellow bootstrap histograms are spread out along a chromosome, then the precise location of a QTL may not be accurate, even in the original correct data set. Bootstrap results naturally vary between runs due to the random generation of the samples. See the related entry "Frequency of Peak LRS." KNOWN PROBLEMS and INTERPRETATION of BOOTSTRAP RESULTS: The reliability of bootstrap analysis of QTL confidence intervals has been criticized by Manichaikul and colleagues ([2006](http://www.genetics.org/cgi/content/full/174/1/481)). Their work applies in particular to standard intercrosses and backcrosses in which markers are spaced every 2 cM. They recommend that confidence intervals be estimated either by a conventional 1.5 to 2.0 LOD drop-off interval or by a Bayes credible Interval method. There is a known flaw in the way in which GeneNetwork displays bootstrap results (Sept 2011). If a map has two or more adjacent markers with identical LOD score and identical strain distribution patterns, all of the bootstrap results are assigned incorrectly to just one of the "twin" markers. This results in a false perception of precision. QTL mapping methods can be highly sensitive to cases with very high or very low phenotype values (outliers). The bootstrap method does not provide protection against the effects of outliers and their effects on QTL maps. Make sure you review your data for outliers before mapping. Options include (1) Do nothing, (2) Delete the outliers and see what happens to your maps, (3) [Winsorize](http://en.wikipedia.org/wiki/Winsorising) the values of the outliers. You might try all three options and determine if your main results are stable or not. With small samples or extreme outliers, you may find the mapping results to be quite volatile. In general, if the results (QTL position or value) depend highly on one or two outliers (5-10% of the samples) then you should probably delete or winsorize the outliers. [Williams RW, Oct 15, 2004, Mar 15, 2008, Mar 26, 2008; Sept 2011] [Go back to index](#index) #### CEL and DAT Files (Affymetrix): Array data begin as raw image files that are generated using a confocal microscope and video system. Affymetrix refers to these image data files as DAT files. The DAT image needs to be registered to a template that assigns pixel values to expected array coordinates (cells). The result is an assignment of a set of image intensity values (pixel intensities) to each probe. For example, each cell/probe value generated using Affymetrix arrays is associated with approximately 36 pixels (a 6x6 set of pixels, usually with an effective 11 or 12-bit range of intensity). Affymetrix uses a method that simply ranks the values of these pixels and picks as the "representative value" the pixel that is has rank 24 from low to high. The range of variation in intensity amoung these ranked pixels provides a way to estimate the error of the estimate. The Affymetrix CEL files therefore consist of XY coordinates, the consensus value, and an error term. [Williams RW, April 30, 2005] #### Cluster Map or QTL Cluster Map: Cluster maps are sets of QTL maps for a group of traits. The QTL maps for the individual traits (up to 100) are run side by side to enable easy detection of common and unique QTLs. Traits are clustered along one axis of the map by phenotypic similarity (hierarchical clustering) using the Pearson product-moment correlation r as a measurement of similarity (we plot 1-r as the distance). Traits that are positively correlated will be located near to each other. The genome location is shown along the other, long axis of the cluster map, marker by marker, from Chromosome 1 to Chromosome X. Colors are used to encode the probability of linkage, as well as the additive effect polarity of alleles at each marker. These QTL maps are computed using the fast Marker Regression algorithm. P values for each trait are computed by permuting each trait 1000 times. Cluster maps could be considered trait gels because each lane is loaded with a trait that is run out along the genome. Cluster maps are a unique feature of the GeneNetwork developed by Elissa Chesler and implemented in WebQTL by J Wang and RW Williams, April 2004. [Williams RW, Dec 23, 2004, rev June 15, 2006 RWW]. #### Collections and Trait Collections: One of the most powerful features of GeneNetwork (GN) is the ability to study large sets of traits that have been measured using a common genetic reference population or panel (GRP). This is one of the key requirements of systems genetics--many traits studied in common. Under the main GN menu **Search** heading you will see a link to **Trait Collections**. You can assemble you own collection for any GRP by simply adding items using the Add to Collection button that you will find in many windows. Once you have a collection you will have access to a new set of tools for analysis of your collection, including **QTL Cluster Map, Network Graph, Correlation Matrix**, and **Compare Correlates**. [Williams RW, April 7, 2006] #### Complex Trait Analysis: Complex trait analysis is the study of multiple causes of variation of phenotypes within species. Essentially all traits that vary within a population are modulated by a set of genetic and environmental factors. Finding and characterizing the multiple genetic sources of variation is referred to as "genetic dissection" or "QTL mapping." In comparison, complex trait analysis has a slightly broader focus and includes the analysis of the effects of environmental perturbation, and gene-by-environment interactions on phenotypes; the "norm of reaction." Please also see the glossary term "Systems Genetics." [Williams RW, April 12, 2005] #### Composite Interval Mapping: Composite interval mapping is a method of mapping chromosomal regions that controls for some fraction of the genetic variability in a quantitative trait. Unlike simple interval mapping, composite interval mapping usually controls for variation produced at one or more background marker loci. These background markers are generally chosen because they are already known to be close to the location of a significant QTL. By factoring out a portion of the genetic variance produced by a major QTL, one can occasionally detect secondary QTLs. WebQTL allows users to control for a single background marker. To select this marker, first run the **Marker Regression** analysis (and if necessary, check the box labeled display all LRS, select the appropriate locus, and the click on either **Composite Interval Mapping** or **Composite Regression**. A more powerful and effective alternative to composite interval mapping is pair-scan analysis. This latter method takes into accounts (models) both the independent effects of two loci and possible two-locus epistatic interactions. [Williams RW, Dec 20, 2004] #### Correlations: Pearson and Spearman: GeneNetwork provides tools to compute both Pearson product-moment correlations (the standard type of correlation), Spearman rank order correlations. [Wikipedia](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) and introductory statistics text will have a discussion of these major types of correlation. The quick advice is to use the more robust Spearman rank order correlation if the number of pairs of observations in a data set is less than about 30 and to use the more powerful but much more sensitive Pearson product-moment correlation when the number of observations is greater than 30 AND after you have dealt with any outliers. GeneNetwork automatically flags outliers for you in the Trait Data and Analysis form. GeneNetwork also allows you to modify values by either deleting or winsorising them. That means that you can use Pearson correlations even with smaller sample sizes after making sure that data are well distributed. Be sure to view the scatterplots associated with correlation values (just click on the value to generate a plot). Look for bivariate outliers. #### Cross: The term Cross refers to a group of offspring made by mating (crossing) one strain with another strain. There are several types of crosses including intercrosses, backcrosses, advanced intercrosses, and recombinant inbred intercrosses. Genetic crosses are almost always started by mating two different but fully inbred strains to each other. For example, a B6D2F2 cross is made by breeding C57BL/6J females (B6 or B for short) with DBA/2J males (D2 or D) and then intercrossing their F1 progeny to make the second filial generation (F2). By convention the female is always listed first in cross nomenclature; B6D2F2 and D2B6F2 are therefore so-called reciprocal F2 intercrosses (B6D2F1 females to B6D2F1 males or D2B6F1 females to D2B6F1 males). A cross may also consist of a set of recombinant inbred (RI) strains such as the BXD strains, that are actually inbred progeny of a set of B6D2F2s. Crosses can be thought of as a method to randomize the assignment of blocks of chromosomes and genetic variants to different individuals or strains. This random assignment is a key feature in testing for causal relations. The strength with which one can assert that a causal relation exists between a chromosomal location and a phenotypic variant is measured by the LOD score or the LRS score (they are directly convertable, where LOD = LRS/4.61) [Williams RW, Dec 26, 2004; Dec 4, 2005]. [Go back to index](#index) #### Dominance Effects: The term dominance indicates that the phenotype of intercross progeny closely resemble one of the two parental lines, rather than having an intermediate phenotype. Geneticists commonly refer to an allele as having a dominance effect or dominance deviation on a phenotype. Dominance deviation at a particular marker are calculated as the difference between the average phenotype of all cases that have the Aa genotype at that marker and the expected value half way between the all casese that have the aa genotype and the AA genotype. For example, if the average phenotype value of 50 individuals with the aa genotype is 10 units whereas that of 50 individuals with the AA genotype is 20 units, then we would expect the average of 100 cases with the Aa genotype to be 15 units. We are assuming a linear and perfectly additive model of how the a and A alleles interact. If these 100 Aa cases actually have a mean of 11 units, then this additive model would be inadequate. A non-linear dominance terms is now needed. In this case the low a alleles is almost perfectly dominant (or semi-dominant) and the dominance deviation is -4 units. The dominance effects are computed at each location on the maps generated by the WebQTL module for F2 populations (e.g., B6D2F2 and B6BTBRF2). Orange and purple line colors are used to distinguish the polarity of the dominance effects. Purple is the positive dominance effect that matches the polarity of the green additive effect, whereas orange is the negative dominance effect that matches the polarity of the red additive effect. Note that dominance deviations cannot be computed from a set of recombinant inbred strains because there are only two classes of genotypes at any marker (aa and AA, more usuually written AA and BB). However, when data for F1 hybrids are available one can estimate the dominance of the trait. This global phenotypic dominance has almost nothing to do with the dominance deviation at a single marker in the genome. In other words, the dominance deviation detected at a single marker may be reversed or neutralized by the action of many other polymorphic genes. [Williams RW, Dec 21, 2004; Sept 3, 2005] [Go back to index](#index) #### Epistasis: Epistasis means that combined effects of two or more different loci or polymorphic genes are not what one would expect given the addition of their individual effects. There is, in other words, evidence for non-linear interactions among two or more loci. This is similar to the dominance effects mentioned above, but now generalized to two or more distinct loci, rather than to two or more alleles at a single locus. For example, if QTL 1 has an A allele that has an additive effects of +5 and QTL 2 has an A alleles that has an additive effect of +2, then the two locus genotype combination A/A would be expected to boost the mean by +7 units. But if the value of these A/A individuals was actually -7 we would be quite surprised and would refer to this as an epistatic interaction between QTL 1 and QTL 2. WebQTL will search for all possible epistatic interactions between pairs of loci in the genome. This function is called a **Pair Scan** becasue the software analyzes the LRS score for all possible pairs of loci. Instead of viewing an LRS plot along a single dimension, we now view a two-dimensional plot that shows a field of LRS scores computed for pairs of loci. Pair scan plots are extremely sensitive to outlier data. Be sure to review the primary data carefully using **Basic Statistics**. Also note that this more sophisiticated method also demands a significantly larger sample size. While 25 to 50 cases may be adequate for a conventional LRS plot (sometimes called a "main scan"), a **Pair-Scan** is hard to apply safely with fewer than 60 cases. [Williams RW, Dec 21, 2004; Dec 5, 2005] #### Effect Size of a QTL: QTLs can be ranked by the amount of variance that they explain--their so-called "effect size"--when they are included in a statistical model. The concept of a genetic model may seem odd to some users of GeneNetwork. A model is just an explicit hypothesis of how QTLs and other factors cause variation in a trait. QTL mapping involves comparisons of the explanatory power of different models. Effect sizes can be measured in different units including (1) the percentage of total or genetic variance that is explained by adding the QTL into the model, (2) the mean shift in Z score, or (3) the additive effect size expressed in the original measurement scale. Effects of single QTLs are often dependent on genetic background (i.e., other QTLs and their interactions) and on the numbers and types of cases used in a study. For example, the variance explained is influenced strongly by whether the sample are from a family cohort, a case-control cohort, a group of fully inbred strains such as recombinant inbred lines, an outcross or backcross population. Please note that the functional importance of a locus, QTL, or GWAS hit can not be predicted by the size of its effect on the trait in one environment, at one stage of development, and in one population. Estimates of the effect size of QTLs are usually both noisy and upwardly biased (overestimated), and both of these problems are particularly acute when sample sizes are small. Estimates of effect size for families of inbred lines, such as the BXD, HXB, CC, and hybrid diversity panels (e.g. the hybrid mouse diversity panel and the hybrid rat diversity panel) are typically (and correctly) much higher than those measured in otherwise similar analysis of intercrosses, heterogeneous stock (HS), or diversity outbred stock. Two factors contribute to the much higher level of explained variance of QTLs when using inbred strain panels. 1. **Replication Rate:** The variance that can be explained by a locus is increased by sampling multiple cases that have identical genomes and by using the strain mean for genetic analysis. Increasing replication rates from 1 to 6 can easily double the apparent heritability of a trait and therefore the effect size of a locus. The reason is simple—resampling decrease the standard error of mean, boosting the effective heritability (see Glossary entry on *Heritability* and focus on figure 1 from the Belknap [1998](http://gn1.genenetwork.org/images/upload/Belknap_Heritability_1998.pdf) paper reproduced below).lodToPval <- function(x) { pchisq(x*(2*log(10)),df=1,lower.tail=FALSE)/2 } # (from https://www.biostars.org/p/88495/ )In the two likelihoods, one has maximized over the various nuisance parameters (the mean phenotypes for each genotype group, or overall for the null hypothesis, and the residual variance). Or one can say, one has plugged in the maximum likelihood estimates for these nuisance parameters. With complete genotype data for a marker, the log likelihood for the normal model reduces to (-n/2) times the log of the residual sum of squares. LOD values can be converted to LRS scores (likelihood ratio statistics) by multiplying by 4.61. The LOD is also roughly equivalent to the -log(P) when the degrees of freedom of the mapping has two degrees of freedom, as in a standard F2 intercross. In such as case, where P is the probability of linkage (P = 0.001) the –logP => 3), will also equal a LOD of 3. The LOD itself is not a precise measurement of the probability of linkage, but in general for F2 crosses and RI strains, values above 3.3 will usually be worth attention for simple interval maps. LOD scores and –logP scores are only interchangable when models have two degrees of freedom (2 df). Let us begin with an example. Suppose we have a LOD score of 3 from an F2 cross; this test has 2 df. Let us calculate the p-value (and the logp) corresponding to this LOD score. We know that $$ LOD = \log_{10}(LR),$$ where $LR$ is the likelihood ratio. We also know that $2\log(LR)$ follows a $\chi^2$ distribution with 2 df. Note that when unqualified, we will mean natural logarithms with base $e$. Thus, $$ LOD = \log_{10}(LR) = \log(LR)/log(10).$$ Therefore $$ 2 \times \log(10) \times LOD = 2 \log(LR)$$ which follows a $\chi^2$ distribution with 2 df. Using this information, we will calculate the p-value and logp. In [1]: # load package load package using Distributions # LOD score LOD = 3.0 # p-value from chi-square with 2df calculated from # cdf (cumulative distribution function) p = 1-cdf(Chisq(2),2*log(10)*LOD) Out[1]: 0.0010000000000000009 We can now calculate the logp value which is the negative of the base 10 logarithm of the p-value. In [2]: # now logp -log10(p) Out[2]: 2.9999999999999996 [Williams RW, June 15, 2005, updated with text from Karl Broman, Oct 28, 2010, updated Apr 21, 2020 with Nyholt reference, updated Dec 30, 2020]. #### LRS: In the setting of mapping traits, the likelihood ratio statistic is used as a measurement of the association or linkage between differences in traits and differences in particular genotype markers. LRS or LOD values are usually plotted on the y-axis, whereas chromosomal location of the marker are usually plotted on the x-axis. In the case of a whole genome scan--a sequential analysis of many markers and locations across the entire genome--LRS values above 10 to 15 will usually be worth attention for when mapping with standard experimental crosses (e.g., F2 intercrosses or recombinant inbred strains). The term "likelihood ratio" is used to describe the relative probability (likelihood) of two different explanations of the variation in a trait. The first explanation (or model or hypothesis H1) is that the differences in the trait ARE associated with that particular DNA sequence difference or marker. Very small probability values indicate that H1 is probably true. The second "null" hypothesis (Hnull or H0) is that differences in the trait are NOT associated with that particular DNA sequence. We can use the ratio of these two probabilities and models (H1 divided by H0) as our score. The math is a little bit more complicated and the LRS score is actually equal to -2 times the ratio of the natural logarithms of the two probabilities. For example, if the probability of H0 is 0.05 (only a one-in-twenty probability that the marker is associated with the trait by chance), whereas and the probability of H1 is 1 (the marker is certainly not linked to the trait), then the LRS value is 5.991. In Excel the equation giving the LRS result of 5.991 would look like this "=-2*(LN(0.05)-LN(1)). [Williams RW, Dec 13, 2004, updated Nov 18, 2009, updated Dec 19, 2012] [Go back to index](#index) ## M Marker Regression: The relationship between differences in a trait and differences in alleles at a marker (or gene variants) can be computed using a regression analysis (genotype vs phenotype) or as a simple Pearson product moment correlation. Here is a simple example that you can try in Excel to understand marker-phenotype regression or marker-phenotype correlation: enter a row of phenotype and genotype data for 20 strains in an Excel spreadsheet labeled "Brain weight." The strains are C57BL/6J, DBA/2J, and 20 BXD strains of mice (1, 2, 5, 6, 8, 9, 12, 13, 14, 15, 16, 18, 21, 22, 23, 24, 25, 27, 28, and 29. The brains of these strains weigh an average (in milligrams) of 465, 339, 450, 390, 477, 361, 421, 419, 412, 403, 429, 429, 436, 427, 409, 431, 432, 380, 394, 381, 389, and 375. (These values are taken from BXD Trait 10032; data by John Belknap and colleagues, 1992. Notice that data are missing for several strains including the extinct lines BXD3, 4, and 7. Data for BXD11 and BXD19 (not extinct) are also missing. In the second row enter the genotypes at a single SNP marker on Chr 4 called "rs13478021" for the subset of strains for which we have phenotype data. The genotypes at rs1347801 are as follows for 20 BXDs listed above: D B D B D B D D D D D B D B D B D B D B. This string of alleles in the parents and 20 BXDs is called a strains distribution pattern (SDP). Let's convert these SDP letters into more useful numbers, so that we can "compute" with genotypes. Each B allele gets converted into a -1 and each D allele gets converted into a +1. In the spreadsheet, the data set of phenotypes and genotypes should look like this.
Strain BXD1 BXD2 BXD5 6 8 9 12 13 14 15 16 18 21 22 23 24 25 27 28 29 Brain_weight 450 390 477 361 421 419 412 403 429 429 436 427 409 431 432 380 394 381 389 375 Marker_rs1347801 D B D B D B D D D D D B D B D B D B D B Marker_code 1 -1 1 -1 1 -1 1 1 1 1 1 -1 1 -1 1 -1 1 -1 1 -1To compute the marker regression (or correlation) we just compare values in Rows 2 and 4. A Pearson product moment correlation gives a value of r = 0.494. A regression analysis indicates that on average those strains with a D allele have a heavier brain with roughly a 14 mg increase for each 1 unit change in genotype; that is a total of about 28 mg if all B-type strains are compared to all D-type strains at this particular marker. This difference is associated with a p value of 0.0268 (two-tailed test) and an LRS of about 9.8 (LOD = 9.8/4.6 or about 2.1). Note that the number of strains is modest and the results are therefore not robust. If you were to add the two parent strains (C57BL/6J and DBA/2J) back into this analysis, which is perfectly fair, then the significance of this maker is lost (r = 0.206 and p = 0.3569). Bootstrap and permutation analyses can help you decide whether results are robust or not and whether a nominally significant p value for a single marker is actually significant when you test many hundreds of markers across the whole genome (a so-called genome-wide test with a genome-wide p value that is estimated by permutation testing). [RWW, Feb 20, 2007, Dec 14, 2012] [Go back to index](#index) ## N #### Normal Probability Plot: A [normal probability plot](http://en.wikipedia.org/wiki/Normal_probability_plot) is a powerful tool to evaluate the extent to which a distribution of values conforms to (or deviates from) a normal Gaussian distribution. The Basic Statistics tools in GeneNetwork provides these plots for any trait. If a distribution of numbers is normal then the actual values and the predicted values based on a z score (units of deviation from the mean measured in standard deviation units) will form a nearly straight line. These plots can also be used to efficiently flag outlier samples in either tail of the distribution. In genetic studies, the probability plot can be used to detect the effects of major effect loci. A classical Mendelian locus will typically be associated with either a bimodal or trimodal distribution. In the plot below based on 99 samples, the points definitely do not fall on a single line. Three samples (green squares) have unusually high values; the majority of samples fall on a straight line between z = -0.8 to z = 2; and 16 values have much lower trait values than would be predicted based on a single normal distribution (a low mode group). The abrupt discontinuity in the distribution at -0.8 z is due to the effect of a single major Mendelian effect. Deviations from normality of the sort in the figure below should be considered good news from the point of view of likely success of tracking down the locations of QTLs. However, small numbers of outliers may require special statistical handling, such as their exclusion or [winsorising](http://en.wikipedia.org/wiki/Winsorising) (see more below on "Winsorizing"). [RWW June 2011] [Go back to index](#index) ## O #### Outliers: (also see [Wikipedia](http://en.wikipedia.org/wiki/Outlier)) Statistical methods often assume that the distribution of trait values is close to a Gaussian normal bell-shaped curve and that there are no outlier values that are extremely high or low compared to the average. Some traits can be clearly split into two or more groups (affected cases and unaffected cases) and this is not a problem as long as the number of cases in each group is close to the number that you expected by chance and that your sample size is reasonable high (40 or more for recombinant inbred strains). Mapping functions and most statistical procedure in GeneNetwork should work reasonable well (the pair scan function for epistatic interactions is one possible exception). However, correlations and QTL mapping methods can be highly sensitive to outlier values. Make sure you review your data for outliers before mapping. GeneNetwork flags all outliers for you in the Trait Data and Analysis window and gives you the option of zapping these extreme values. Options include (1) do nothing, (2) delete the outliers and see what happens to your maps, (3) [Winsorize](http://en.wikipedia.org/wiki/Winsorising) the values of the outliers. You might try all three options and determine if your main results are stable or not. With small samples or extreme outliers, you may find the correlation and mapping results to be volatile. In general, if results (correlations, QTL positions or QTL LRS score) depend highly on one or two outliers (5-10% of the samples) then you should probably delete or winsorize the outliers. In order to calculate outliers, we first determine the Q1(25%) and Q3(75%) values and then multiply by a constant (in our case 1.5; a higher constant is less sensitive to outliers). This value is then subtracted from the Q1 value and added to the Q3 value in order to determine the lower and upper bounds. Values that fall above the upper bound or below the lower bound are considered outliers. The method is summarized [here](http://www.wikihow.com/Calculate-Outliers). [Sloan ZA, Oct 2013] [Go back to index](#index) ## P #### Pair-Scan, 2D Genome Scan, or Two-QTL Model: The pair scan function evaluates pairs of intervals (loci) across the genome to determine how much of the variability in the trait can be explained jointly by two putative QTLs. The pair scan function in GeneNetwork is used to detect effects of pairs of QTLs that have epistatic interactions, although this function also evaluates summed additive effects of two loci. Trait variance is evaluated using a general linear model that has this structure (called a "model"): (where the = sign should be read "a function of" This model is also known as the Full Model (LRS Full in the output table), where QTL1 and QTL2 are the independent additive effects associated with two unlinked loci (the so-called main effects) and QTL1xQTL2 is the interaction term (LRS Interact in the output table). An LRS score is computed for this full model. This is computation identical to computing an ANOVA that allows for an interaction term between two predictors. The additive model that neglects the QTL1XQTL2 term is also computed. The output table in GeneNetwork list the the two intervals at the top of the table (Interval 1 to the left and Interval 2 to the far right). The LRS values for different components of the model are shown in the middle of the table (LRS Full, LRS Additive, LRS Interact, LRS 1, and LRS 2). Note that LRS 1 and LRS 2 will usually NOT sum to LRS Additive. CAUTIONS and LIMITATIONS: Pair-scan is only implemented for recombinant inbred strains. We do not recommend the use of this function with sample sizes of less than 60 recombinant inbred strains. Pair-scan procedures need careful diagnostics and an be very sensitive to outliers and to the balance among the four possible two-locus genotype classes among a set of RI strains. Pair-scan is not yet implemented for F2 progeny. GeneNetwork implements a rapid but non-exhaustive DIRECT algorithm (Lundberg et al., [2004](http://bioinformatics.oxfordjournals.org/content/20/12/1887.full.pdf)) that efficiently searches for epistatic interactions. This method is so fast that it is possible to compute 500 permutations to evaluate non-parametric significance of the joint LRS value within a minute. This makes DIRECT ideal for an interactive web service. Karl Broman's [R/qtl](http://www.rqtl.org/tutorials/rqtltour.pdf) implements an exhaustive search using the "scantwo" function. [RWW, May 2011] #### Partial Correlation: Partial correlation is the correlation between two variables that remains after controlling for one or more other variables. Idea and techniques used to compute partial correlations are important in testing causal models ([Cause and Correlation in Biology](http://www.amazon.com/Cause-Correlation-Biology-Structural-Equations/dp/0521529212), Bill Shipley, 2000). For instance, r1,2||3,4 is the partial correlation between variables 1 and 2, while controlling for variables 3 and 4 (the || symbol is equivalent to "while controlling for"). We can compare partial correlations (e.g., r1,2||3,4) with original correlations (e.g., r1,2). If there is an insignificant difference, we infer that the controlled variables have minimal effect and may not influence the variables or even be part of the model. In contrast, if the partial correlations change significantly, the inference is that the causal link between the two variables is dependent to some degree on the controlled variables. These control variables are either anteceding causes or intervening variables. (text adapted from D Garson's original by RWW). For more on [partial correlation](http://faculty.chass.ncsu.edu/garson/PA765/partialr.htm) please link to this great site by David Garson at NC State. For more on dependence separation ([d-separation](http://www.andrew.cmu.edu/user/scheines/tutor/d-sep.html)) and constructing causal models see Richard Scheines' site. Why would you use of need partial correlations in GeneNetwork? It is often useful to compute correlations among traits while controlling for additional variables. Partial correlations may reveal more about the causality of relations. In a genetic context, partial correlations can be used to remove much of the variance associated with linkage and linkage disequilibrium. You can also control for age, age, and other common cofactors. Please see the related Glossary terms "Tissue Correlation". [RWW, Aug 21, 2009; Jan 30, 2010] #### PCA Trait or Eigentrait: If you place a number of traits in a Trait Collection you can carry out some of the key steps of a principal component analysis, including defining the variance directed along specific principal component eigenvectors. You can also plot the positions of cases against the first two eigenvectors; in essence a type of scatterplot. Finally, GeneNetwork allows you to exploit PCA methods to make new "synthetic" eigentraits from collections of correlated traits. These synthetic traits are the values of cases along specific eigenvectors and they may be less noisy than single traits. If this seems puzzling, then have a look at these useful PCA explanation by [G. Dallas](http://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/) and by [Powell and Lehe](http://setosa.io/ev/principal-component-analysis/). **How to do it:** You can select and assemble many different traits into a single **Trait Collection** window using the check boxes and **Add To Collection** buttons. One of the most important function buttons in the **Collection** window is labeled **Correlation Matrix**. This function computes Pearson product moment correlations and Spearman rank order correlations for all possible pairs of traits in the Collection window. It also perfoms a principal component or factor analysis. For example, if you have 20 traits in the Collection window, the correlation matrix will consist of 20*19 or 190 correlations and the identity diagonal. Principal components analysis is a linear algebraic procedure that finds a small number of independent factors or principal components that efficiently explain variation in the original 20 traits. It is a effective method to reduce the dimensionality of a group of traits. If the 20 traits share a great deal of variation, then only two or three factors may explain variation among the traits. Instead of analyzing 20 traits as if they were independent, we can now analyze the main principal components labeled PC01, PC02, etc. PC01 and PC02 can be treated as new synthetic traits that represent the main sources of variation among original traits. You can treat a PC trait like any other trait except that it is not stored permanently in a database table. You can put a PC trait in your Collection window and see how well correlated each of the 20 original traits is with this new synthetic trait. You can also map a PC trait. [RWW, Aug 23, 2005] #### Permutation Test: A permutation test is a computationally intensive but conceptually simple method used to evaluate the statisical significance of findings. Permutation tests are often used to evaluate QTL significance. _Some background_: In order to detect parts of chromosomes that apparently harbor genes that contribute to differences in a trait's value, it is common to search for associations (linkage) across the entire genome. This is referred to as a "whole genome" scan, and it usually involves testing hundreds of independently segregating regions of the genome using hundreds, or even thousands of genetic markers (SNPs and microsatellites). A parametric test such as a conventional t test of F test can be used to estimate the probability of the null hypothesis at any single location in the genome (the null hypothesis is that there is no QTL at this particular location). But a parametric test of this type makes assumptions about the distribution of the trait (its normality), and also does not provide a way to correct for the large number of independent tests that are performed while scanning the whole genome. We need protection against many false discoveries as well as some assurance that we are not neglecting truly interesting locations. A permutation test is an elegant solution to both problems. The procedure involves randomly reassigning (permuting) traits values and genotypes of all cases used in the analysis. The permuted data sets have the same set of phenotypes and genotypes (in other words, distributions are the same), but obviously the permutation procedure almost invariably obliterates genuine gene-to-phenotype relation in large data sets. We typically generate several thousand permutations of the data. Each of these is analyzed using precisely the same method that was used to analyze the correctly ordered data set. We then compare statistical results of the original data set with the collection of values generated by the many permuted data sets. The hope is that the correctly ordered data are associated with larger LRS and LOD values than more than 95% of the permuted data sets. This is how we define the p = .05 whole genome significance threshold for a QTL. Please see the related Glossary terms "Significant threshold" and "Suggestive threshold". [RWW, July 15, 2005] #### Power to detect QTLs: An analysis of statistical power is useful to estimate numbers of replicates and strains needed to detect and resolve major sources of trait variance and covariance. A versatile method has been developed by Sen and colleagues (Sen et al., 2007) and implemented in the R program. qtlDesign. David Ashbrook implemented a version of this within Shiny that can help you estimate power for different QTL effect sizes, cohort sizes, and replication rates: #### [Power Calculator (D. Ashbrook)](https://dashbrook1.shinyapps.io/bxd_power_calculator_app/) We can see that in all situations power is increased more by increasing the number of lines than by increasing the number of biological replicates. Dependent upon the heritability of the trait, there is little gain in power when going above 4-6 biological replicates. [DGA, Mar 3, 2018] #### Probes and Probe Sets: In microarray experiments the probe is the immobilized sequence on the array that is complementary to the target message washed over the array surface. Affymetrix probes are 25-mer DNA sequences synthesized on a quartz substrate. There are a few million of these 25-mers in each 120-square micron cell of the array. The abundance of a single transcript is usualy estimated by as many as 16 perfect match probes and 16 mismatch probes. The collection of probes that targets a particular message is called a probe set. [RWW, Dec 21, 2004] [Go back to index](#index) ## Q #### QTL: A quantitative trait locus is a chromosome region that contains one or more sequence variants that modulates the distribution of a variable trait measured in a sample of genetically diverse individuals from an interbreeding population. Variation in a quantitative trait may be generated by a single QTL with the addition of some environmental noise. Variation may be oligogenic and be modulated by a few independently segregating QTLs. In many cases however, variation in a trait will be polygenic and influenced by large number of QTLs distributed on many chromosomes. Environment, technique, experimental design and a host of other factors also affect the apparent distribution of a trait. Most quantitative traits are therefore the product of complex interactions of genetic factors, developmental and epigenetics factors, environmental variables, and measurement error. [Williams RW, Dec 21, 2004] [Go back to index](#index) ## R #### Recombinant Inbred Strain (RI or RIS) or Recombinant Inbred Line (RIL): An inbred strain whose chromosomes incorporate a fixed and permanent set of recombinations of chromosomes originally descended from two or more parental strains. Sets of RI strains (from 10 to 5000) are often used to map the chromosomal positions of polymorphic loci that control variance in phenotypes. For a terrific short summary of the uses of RI strains see [2007](http://www.informatics.jax.org/silverbook/chapters/9-2.shtml)). Chromosomes of RI strains typically consist of alternating haplotypes of highly variable length that are inherited intact from the parental strains. In the case of a typical rodent RI strain made by crossing maternal strain C with paternal strain B (called a CXB RI strain), a chromosome will typically incorporate 3 to 5 alternating haplotype blocks with a structure such as BBBBBCCCCBBBCCCCCCCC, where each letter represents a genotype, series of similar genotype represent haplotypes, and where a transition between haplotypes represents a recombination. Both pairs of each chromosome will have the same alternating pattern, and all markers will be homozygous. Each of the different chromosomes (Chr 1, Chr 2, etc.) will have a different pattern of haplotypes and recombinations. The only exception is that the Y chromosome and the mitochondial genome, both of which are inherited intact from the paternal and maternal strain, respectively. For an RI strain to be useful for mapping purposes, the approximate position of recombinations along each chromsome need to be well defined either in terms of centimorgan or DNA basepair position. The precision with which these recombinations are mapped is a function of the number and position of the genotypes used to type the chromosomes--20 in the example above. Because markers and genotypes are often space quite far apart, often more than 500 Kb, the actual data entered into GeneNetwork will have some ambiguity at each recombination locus. The haplotype block BBBBBCCCCBBBCCCCCCCC will be entered as BBBBB?CCCC?BBB?CCCCCCCC where the ? mark indicates incomplete information over some (we hope) short interval. RI strains are almost always studied in sets or panels. All else being equal, the larger the set of RI strains, the greater the power and precision with which phenotypes can be mapped to chromosomal locations. The first set of eight RIs, the CXB RIs, were generated by Donald Bailey (By) from an intercross between a female BALB/cBy mouse (abbreviated C) and a male C57BL/6By mouse in the 1960s. The small panel of 8 CXB strains was originally used to determine if the major histocompatibility (MHC) locus on proximal Chr 17 was a key factor accounting for different immune responses such as tissue rejection. The methods used to determine the locations of recombinations relied on visible markers (coat color phenotypes such as the C and B loci) and the electrophoretic mobility of proteins. Somewhat larger RI sets were generated by Benjamin Taylor to map Mendelian and other major effect loci. In the 1990s the utility of RI sets for mapping was significantly improved thanks to higher density genotypes made possible by the use of microsatellite markers. Between 2005 and 2017, virtually all extant mouse and rat RI strains were regenotyped at many thousands of SNP markers, providing highly accurate maps of recombinations. While the potential utility of RI strains in mapping complex polygenic traits was obvious from the outset, the small number of strains only made it feasible to map quantitative traits with large effects. The first large RI sets were generated by plant geneticists (Burr et al. [2000](http://demeter.bio.bnl.gov/RIchap_rev.pdf)) and this the plant genetics community holds a strong lead in the production of very large RI sets to study multigenic and polygenic traits and trait covariance and pleiotropy. By 2010 the number of mouse RI strains had increased to the point where defining causal gene and sequence variant was more practical. As of 2018 there are about 150 BXD strains (152 have been fully sequenced), ~100 Collaborative Cross strains (also all fully sequenced), and at least another 100 RI strains belonging to smaller sets that have been extremely well genotyped. **Making RI strains**: The usual procedure typically involves sib mating of the progeny of an F1 intercross for more than 20 generations. Even by the 5th filial (F) generation of successive matings, the RI lines are homozygous at 50% of loci and by F13, the value is above 90%. At F20 the lines are nearly fully inbred (~98%) and by convention are now referred to as inbred strains rather than inbred lines. [Go back to index](#index) Legend:from [Silver, L. (1995) Oxford University Press](http://www.informatics.jax.org/silverbook/frames/frame3-3.shtml) [Williams RW, June 20, 2005; significant extension, Sept 21, 2007, added Crow ref, Oct 2009] ## S #### Scree Plots: GeneNetwork will often automatically generate a [Scree Plot](http://www.improvedoutcomes.com/docs/WebSiteDocs/PCA/Creating_a_Scree_Plot.htm) and the associated principal components (PCs) when you compute a Correlation Matrix for a group of traits that you have placed in your Trait Collection (a set of phenotypes and/or expression data for a specific population). Here is a nice definition of what a Scree plot is trying to tell you adopted and adapted from IOS (www.improvedoutcomes.com). A Scree Plot is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each PC. The PCs are ordered, and by definition are therefore assigned a number label, by decreasing order of contribution to total variance. The PC with the largest fraction contribution is labeled PC01. Such a plot when read left-to-right across the abscissa can often show a clear separation in fraction of total variance where the 'most important' components cease and the 'least important' components begin. The point of separation is often called the 'elbow'. (In the PCA literature, the plot is called a 'Scree' Plot because it often looks like a 'scree' slope, where rocks have fallen down and accumulated on the side of a mountain.) [Williams RW, Dec 20, 2008] #### Significant threshold: The significant threshold represents the approximate LRS value that corresponds to a genome-wide p-value of 0.05, or a 5% probability of falsely rejecting the null hypothesis that there is no linkage anywhere in the genome. This threshold is computed by evaluating the distribution of highest LRS scores generated by a set of 2000 random permutations of strain means. For example, a random permutation of the correctly ordered data may give a peak LRS score of 10 somewhere across the genome. The set of 1000 or more of these highest LRS scores is then compared to the actual LRS obtained for the correctly ordered (real) data at any location in the genome. If fewer than 50 (5%) of the 1000 permutations have peak LRS scores anywhere in the genome that exceed that obtained at a particular locus using the correctly ordered data, then one can usually claim that a QTL has been defined at a genome-wide p-value of .05. The threshold will vary slightly each time it is recomputed due to the random generation of the permutations. You can view the actual histogram of the permutation results by selecting the "Marker Regression" function in the **Analysis Tools** area of the **Trait Data and Editing Form**. WebQTL does make it possible to search through hundreds of traits for those that may have significant linkage somewhere in the genome. Keep in mind that this introduces a second tier of multiple testing problems for which the permutation test will not usually provide adequate protection. If you anticipate mapping many independent traits, then you will need to correct for the number of traits you have tested. [Williams RW, Nov 14, 2004] #### SNP Seismograph Track: SNP is an acronym for single nucleotide polymorphisms (SNPs). SNPs are simple one base pair variants that distinguish individuals and strains. The SNP Seismograph track is a unique feature of physical maps in the GeneNetwork. Each track is customized for a particular cross and shows only those SNPs that differ between the two parental strains. For example, on mouse BXD maps, only the SNPs that differ between C57BL/6J and DBA/2J will be displayed. Regions with high numbers of SNPs are characterised by wider excursions of the yellow traces that extends along the x axis. Since these regions have many SNPs they have a higher prior probability of containing functional sequence differences that might have downstream effects on phenotypes. Large genes with many SNPs close to the peak LRS and that also have a biological connection with the trait ypu are studying are high priority candidate genes. The SNP track in WebQTL exploits the complete Celera Discovery System SNP set but adds an additional 500,000 inferred SNPs in both BXD and AXB/BXA crosses. These SNPs were inferred based on common haplotype structure using an Monte Carlo Markov chain algorithm developed by Gary Churchill and Natalie Blades and implemented by Robert Crowell, and RWW in July 2004. Raw data used to generate the SNP seismograph track were generated by Alex Williams and Chris Vincent, July 2003. The BXD track exploits a database of 1.75 million B vs D SNPs, whereas the AXB/BXA track exploits a database of 1.80 million A vs B SNPs. The names, sequences, and precise locations of most of these SNPs are the property of Celera Discovery Systems, whom we thank for allowing us to provide this level of display in WebQTL. Approximately 2.8 million additional SNPs generated by Perlegen for the NIEHS have been added to the SNP track by Robert Crowell (July-Aug 2005). We have also added all Wellcome-CTC SNPs and all relevant mouse SNPs from dbSNP. [Williams RW, Dec 25, 2004; Sept 3, 2005] #### Standard Error of the Mean (SE or SEM): In most GeneNetwork data sets, the SEM is computed as: Standard Deviation (SD) divided by the square root of n - 1 where n is the number of independent biological samples used to estimate the population mean. What this means in practice is that when n = 2 (as in many microarray data sets), the SEM and the SD are identical. This method of computing the SEM is conservative, but corrects to some extent for well known bias of the SEM discussed by Gurland and Tripathi (1971, A simple approximation for unbiased estimation of the standard deviation. Amer Stat 25:30-32). [Williams RW, Dec 17, 2008] #### Strain Distribution Pattern: A marker such as a SNP or microsatellite is genotyped using DNA obtained from each member of the mapping population. In the case of a genetic reference population, such as the BXD strains or the BayXSha Arabadopsis lines, this results in a text string of genotypes (e.g., BDDDBDBBBBDDBDDDBBBB... for BXD1 through BXD100). Each marker is associated with its own particular text string of genotypes that is often called the **strain distribution pattern** of the marker. (A more appropriate term would be the **marker genotype string**.) This string is converted to a numerical version, a genotype vector: -1111-11-1-1-1-111-1111-1-1-1-1..., where D=1, B=-1, H=0. Mapping a trait boils down to performing correlations between each trait and all of the genotype vectors. The genotype vector with the highest correlation (absolute value) is a good candidate for a QTL. [Williams RW, June 18, 2005] #### Suggestive Threshold: The suggestive threshold represents the approximate LRS value that corresponds to a genome-wide p-value of 0.63, or a 63% probability of falsely rejecting the null hypothesis that there is no linkage anywhere in the genome. This is not a typographical error. The Suggestive LRS threshold is defined as that which yields, on average, one false positive per genome scan. That is, roughly one-third of scans at this threshold will yield no false positive, one-third will yield one false positive, and one-third will yield two or more false positives. This is a very permissive threshold, but it is useful because it calls attention to loci that may be worth follow-up. Regions of the genome in which the LRS exceeds the suggestive threshold are often worth tracking and screening. They are particularly useful in combined multicross metaanalysis of traits. If two crosses pick up the same suggestive locus, then that locus may be significant when the joint probability is computed. The suggestive threshold may vary slightly each time it is recomputed due to the random generation of permutations. You can view the actual histogram of the permutation results by selecting the "Marker Regression" function in the **Analysis Tools** area of the **Trait Data and Editing Form**. [Williams RW and Manly KF, Nov 15, 2004] #### Systems Genetics: Systems genetics or "network genetics" is an emerging new branch of genetics that aims to understand complex causal networks of interactions at multiple levels of biological organization. To put this in a simple context: Mendelian genetics can be defined as the search for linkage between a single trait and a single gene variant (1 to 1); complex trait analysis can be defined as the search for linkage between a single trait and a set of gene variants (QTLs, QTGs, and QTNs) and environmental cofactors (1 to many); and systems genetics can be defined as the search for linkages among networks of traits and networks of gene and environmental variants (many to many). A hallmark of systems genetics is the simultaneous consideration of groups (systems) of phenotypes from the primary level of molecular and cellular interactions that ultimately modulate global phenotypes such as blood pressure, behavior, or disease resistance. Changes in environment are also often important determinants of multiscalar phenotypes; reversing the standard notion of causality as flowing inexorably upward from the genome. Scientists who use a systems genetics approach often have a broad interest in modules of linked phenotypes. Causality in these complex dynamic systems is often contingent on environmental or temporal context, and often will involve feedback modulation. A systems genetics approach can be unusually powerful, but does require the use of large numbers of observations (large sample size), and more advanced statistical and computational models. Systems genetics is not really a new field and traces back to [Sewall Wright's](http://www.amphilsoc.org/library/mole/w/wrights.htm) classical paper (Wright, 1921, "Correlation and Causation") that introduced path analysis to study systems of related phenotypes. Two factors have invigorated this field. The first factor is the advent of more sophisticated statistical methods including Structural [Equation Modeling](http://userwww.sfsu.edu/~efc/classes/biol710/path/SEMwebpage.htm) (SEM), [System Dynamics Modeling](http://www.public.asu.edu/~kirkwood/sysdyn/SDIntro/SDIntro.htm), and [Bayesian Network Modeling](http://bnj.sourceforge.net/) combined with powerful computer systems and efficient algorithms. The second factor is the relative ease with which it is now possible to acquire extensive and diverse phenotype data sets across genetic reference populations such as the BXD set of mice, the HXB set of rats, and the BayXSha lines of Arabidopsis (data are incorporated in the GeneNetwork). In the case of the BXD strains, a large research community has collectively generated hundreds of thousands of transcript phenotypes in different tissues and cells (level of expression), as well as hundreds of protein, cellular, pharmacological, and behavioral data types across a single genetic reference panel. Evaluating and modeling the associative and causal relations among these phenotypes is a major, and still relatively new area of research. Complex trait analysis and QTL mapping are both part of systems genetics in which causality is inferred using conventional genetic linkage (Li et al., [2005](http://hmg.oupjournals.org/cgi/content/abstract/ddi124v1)). One can often assert with confidence that a particular module of phenotypes (component of the variance and covariance) is modulated by sequence variants at a common locus. This provides a causal constraint that can be extremely helpful in more accurately modeling network architecture. Most models are currently static, but as the field matures, more sophisticated dynamic models will supplant steady-state models. The term "systems genetics" was coined by Grant Morahan, October 2004, during a visit to Memphis, as a more general and appropriate term to use instead of "genetical genomics." [Williams RW, April 11, 2005, revised Oct 22, 2005, April, 2008] [Go back to index](#index) ## T #### Tissue Correlation: The tissue correlation is an estimate of the similarity of expression of two genes across different cells, tissues, or organs. In order to compute this type of correlation we first generate expression data for multiple different cell types, tissues, or whole organs from a single individual. There will be significant differences in gene expression across this sample and this variation can then be used to compute either Pearson product-moment correlations (r) or Spearman rank order correlations (rho) between any pair of genes, transcripts, or even exons. Since the samples are ideally all from one individual there should not be any genetic or environmental differences among samples. The difficulty in computing tissue correlations is that samples are not independent. For example, three samples of the small intestine (jejunum, ilieum, and duodenum) will have expression patterns that are quite similar to each other in comparison to three other samples, such as heart, brain, and bone. For this reason the nature of the sampling and how those samples are combined will greatly affect the correlation values. The tissue correlations in GeneNetwork were computed in a way that attempts to reduce the impact of this fact by combining closely related sample types. For example multiple data sets for different brain region were combined to generate a single average CNS tissue sample (generating a whole brain sample would have been an alternative method). However, there is really not optimal way to minimize the effects of this type of non-independence of samples. Some genes will have high expression in only a few tissues, for example the cholinergic receptor, nicotinic, alpha polypeptide 1 gene Chrna1 has high expression in muscle tissues (skeletal muscle = Mus, tongue = Ton, and esophagus = Eso) but lower expression in most other tissues. The very high correlation between Chrna1 and other genes with high expression only in muscle reflects their joint bimodality of expression. It does not mean that these genes or their proteins necessarily cooperate directly in molecular processes. [Williams RW, Dec 26, 2008] #### Transcript Location: The small orange triangle on the x-axis indicates the approximate position of the gene that corresponds to the transcript. These values were taken from the latest assembly of genome of the particular species. #### Transform: Most of the data sets in the GeneNetwork are ultimately derived from high resolution images of the surfaces of microarrays. Estimates the gene expression therefore involves extensive low-level image analysis. These processesing steps attempt to compensate for low spatial frequency "background" variation in image intensity that cannot be related to the actual hybridization signal, for example, a gradation of intensity across the whole array surface due to illumination differences, uneven hybridization, optical performance, scanning characteristics, etc. High spatial frequeny artifacts are also removed if they are likely to be artifacts: dust, scrathes on the array surface, and other "sharp" blemishes. The raw image data (for example, the Affymetrix DAT file) also needs to be registered to a template that assigns pixel values to expected array spots (cells). This image registration is an important process that users can usually take for granted. The end result is the reliable assignment of a set of image intensity values (pixels) to each probe. Each cell value generated using the Affymetrix U74Av2 array is associated with approximately 36 pixel intensity values (a 6x6 set of pixels, usually an effective 11 or 12-bit range of intensity). Affymetrix uses a method that simply ranks the values of these pixels and picks as the "representative value" the pixel that is closest to a particular rank order value, for example, the 24th highest of 36 pixels. The range of variation in intensity values amoung these ranked pixels provides a way to estimate the error of the estimate. The Affymetrix CEL files therefore consist of XY coordinates, the consensus value, and an error term. [Williams RW, April 30, 2005] #### Transgression: Most of us are familiar with the phrase "regression toward the mean." This refers to the tendency of progeny of a cross to have phenotype that are intermediate to those of the parents. Transgression refers to the converse: progeny that have more phenotypes that are higher and lower than those of either parent. Transgression is common, and provided that a trait is influenced by many independent sequence variants (a polygenic trait), transgression is the expectation. This is particularly true if the parents are different genetically, but by chance have similar phenotypes. Consider a trait that is controlled by six independent genes, A through F. The "0" allele at these size genes lowers body weight whereas the "1" allele increases body weight. If one parent has a 000111 6-locus genotype and the other parent has 111000 genotype, then they will have closely matched weight. But their progeny may inherit combinations as extreme as 000000 and 111111. Transgression means that you can rarely predict the distribution of phenotypes among a set of progeny unless you already have a significant amount of information about the genetic architecture of a trait (numbers of segregating variants that affect the trait, either interactions, and GXE effects). In practical terms this means that if the parents of a cross do NOT differ and you have good reasons to believe that the trait you are interested in is genetically complex, then you can be fairly confident that the progeny will display a much wider range of variation that the parents. [May 2011 by RWW]. [Go back to index](#index) ## U [Go back to index](#index) ## V [Go back to index](#index) ## W #### Winsorize, Winsorise: QTL mapping results can be greatly affected by inclusion of outlier data. GeneNetwork will do its best to flag outliers for you in the **Trait Data and Analysis** pages (yellow highlighting). Before mapping, review the data, and if necessary, change values. Options for handling outliers include: (1) do nothing, (2) delete the outliers (trimming), (3) transform the data (e.g., logarithmic, arcsine, or logistic regression transforms), or (4) [winsorize](http://en.wikipedia.org/wiki/Winsorising) the distribution of values. Winsorizing is usually the easiest method to implement directly in GeneNetwork. **How to winsorize**: First review the distribution of values and define outliers. You should only do this one time, so think before you leap. Look at the **Probability Plot** of the trait by going to **Trait Data and Analysis** page and selecting **Basic Statistics**). For example, the figure below from GeneNetwork shows that at many as seven cases have relatively high values and as many as three have relatively low values (this trait is taken from Species = Mouse, Group = LXS, Type = Phenotype, Trait 10182). GeneNetwork code only declares the highest two values to be outliers, but you can use a more liberal definition and give all seven high values a haircut. It is advisable to winsorizes equal numbers of cases on each side of the distribution (high and low cases). In this case, the seven highest values were changed to match that of the 8th highest value (0.860). To retain the original rank order I added an incremental value of 0.01 to each (0.861, 0.862, etc). I did the same thing to the lowest seven values. Adding this increment is not necessary. The result in this case: a suggestive QTL on Chr 16 now reaches the significance threshold. The **danger of winsorizing** is doing it multiple times in different ways. You should transform or winsorize the data before mapping. And you should ideally only do any transformation/correction one time. If you fool around with different methods of transforming your data then you are asking for trouble by adding yet another level of multiple testing. If you feel compelled to experiment with different transforms, then you should/must report this in publications and explain why you did so. Demonstrating that mapping results are robust even using multiple transforms is one good excuse. [Williams RW, Jan 2, 2014] [Go back to index](#index) ## X [Go back to index](#index) ## Y [Go back to index](#index) ## Z [Go back to index](#index)