From 334017cd1659437c8710ae7a88fdac1a666c3164 Mon Sep 17 00:00:00 2001 From: Rob Williams Date: Fri, 14 Nov 2025 18:03:45 +0000 Subject: Fixed error. * Commit made via the GN Markdown Editor --- general/glossary/glossary.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/general/glossary/glossary.md b/general/glossary/glossary.md index 01ca605..b717f11 100644 --- a/general/glossary/glossary.md +++ b/general/glossary/glossary.md @@ -108,7 +108,7 @@ Please note that the functional importance of a locus, QTL, or GWAS hit can not Estimates of effect size for families of inbred lines, such as the BXD, HXB, CC, and hybrid diversity panels (e.g. the hybrid mouse diversity panel and the hybrid rat diversity panel) are typically (and correctly) much higher than those measured in otherwise similar analysis of intercrosses, heterogeneous stock (HS), or diversity outbred stock. Two factors contribute to the much higher level of explained variance of QTLs when using inbred strain panels. -1. **Replication Rate:** The variance that can be explained by a locus is increased by sampling multiple cases that have identical genomes and by using the strain mean for genetic analysis. Increasing replication rates from 1 to 6 can easily double the apparent heritability of a trait and therefore the effect size of a locus. The reason is simple—resampling decrease the standard error of mean, boosting the effective heritability (see Glossary entry on *Heritability* and focus on figure 1 from the Belknap [1998](http://gn1.genenetwork.org/images/upload/Belknap_Heritability_1998.pdf) paper reproduced below).
Compare the genetically explained variance (labeled h2RI in this figure) of a single case (no replication) on the x-axis with the function at a replication rate of 4 on the y-axis. If the explained variance is 0.1 (10% of all variance explained) then the value is boosted to 0.3 (30% of strain mean variance explained) with n = 4. +1. **Replication Rate:** The variance that can be explained by a locus is increased by sampling multiple cases that have identical genomes and by using the strain mean for genetic analysis. Increasing replication rates from 1 to 6 can easily double the apparent heritability of a trait and therefore the effect size of a locus. The reason is simple: resampling decreases the standard error of the mean, boosting the effective heritability (see Glossary entry on *Heritability* and focus on figure 1 from the Belknap [1998](http://gn1.genenetwork.org/images/upload/Belknap_Heritability_1998.pdf) paper reproduced below).
Compare the genetically explained variance (labeled h2RI in this figure) of a single case (no replication) on the x-axis with the function at a replication rate of 4 on the y-axis. If the explained variance is 0.1 (10% of all variance explained) then the value is boosted to 0.3 (30% of strain mean variance explained) with n = 4. 2. **Homozygosity:** The second factor has to do with the inherent genetic variance of populations. Recombinant inbred lines are homozygous at nearly all loci. This doubles the genetic variance in a family of recombinant inbred lines compared to a matched number of F2s. This also quadruples the variance compared to a matched number of backcross cases. As a result 40 BXDs sampled just one per genometype will average 2X the genetic variance and 2X the heritability of 40 BDF2 cases. Note that panels made up of isogenic F1 hybrids (so-called diallel crosses, DX) made by crossing recombinant inbred strains (BXD, CC, or HXB) are no longer homozygous at all loci, and while they do expose important new sources of variance associated with dominance, they do not benefit from the 2X gain in genetic variance relative to an F2 intercross. @@ -191,7 +191,7 @@ Text here [Williams RW, July 15, 2010] #### Heritability, h2: -Heritability is a rough measure of the ability to use genetic information to predict the level of variation in phenotypes among progeny. Values range from 0 to 1 (or 0 to 100%). A value of 1 or 100% means that a trait is entirely predictable based on paternal/materinal and genetic data (in other words, a Mendelian trait), whereas a value of 0 means that a trait is not at all predictable from information on gene variants. Estimates of heritability are highly dependent on the environment, stage, and age. +Heritability is a rough measure of the ability to use genetic information to predict the level of variation in phenotypes among progeny. Values range from 0 to 1 (or 0 to 100%). 
A value of 1 or 100% means that a trait is entirely predictable based on paternal/materinal and genetic data (for example, a strong Mendelian trait), whereas a value of 0 means that a trait is not at all predictable from information on gene variants. Estimates of heritability are highly dependent on the environment, stage, and age. Important traits that affect fitness often have low heritabilities because stabilizing selection reduces the frequency of DNA variants that produce suboptimal phenotypes. Conversely, less critical traits for which substantial phenotypic variation is well tolerated, may have high heritability. The environment of laboratory rodents is unnatural, and this allows the accumulation of somewhat deleterious mutations (for example, mutations that lead to albinism). This leads to an upward trend in heritability of unselected traits in laboratory populations--a desirable feature from the point of view of the biomedical analysis of the genetic basis of trait variance. Heritability is a useful parameter to measure at an early stage of a genetic analysis, because it provides a rough gauge of the likelihood of successfully understanding the allelic sources of variation. Highly heritable traits are more amenable to mapping studies. There are numerous ways to estimate heritability, a few of which are described below. [Williams RW, Dec 23, 2004] @@ -217,10 +217,10 @@ However, this estimate of h2 cannot be compared directly to those calculated usi The factor 0.5 is applied to Va to adjust for the overestimation of additive genetic variance among inbred strains. This estimate of heritability also does not make allowances for the within-strain error term. The 0.5 adjustment factor is not recommended any more because h2 is severely **underestimated**. This adjustment is really only needed if the goal is to compare h2 between intercrosses and those generated using panels of inbred strains. -#### h2RIx̅ +#### h2RIx̅
-Finally, heritability calculations using strain means, such as those listed above, do not provide estimates of the effective heritability achieved by resampling a given line, strain, or genometype many times. Belknap ([1998](http://gn1.genenetwork.org/images/upload/Belknap_Heritability_1998.pdf)) provides corrected estimates of the effective heritability. Figure 1 from his paper (reproduced below) illustrates how resampling helps a great deal. Simply resampling each strain 8 times can boost the effective heritability from 0.2 to 0.8. The graph also illustrates why it often does not make sense to resample much beyond 4 to 8, depending on heritability. Belknap used the term h2RIx̅ in this figure and paper, since he was focused on data generated using recombinant inbred (RI) strains, but the logic applies equally well to any panel of genomes for which replication of individual genometypes is practical. This h2RIx̅ can be calculated simply by: -h2RIx̅ = Va / (Va+(Ve/n)) where Va is the genetic variability (variability between strains), Ve is the environmental variability (variability within strains), and n is the number of within strain replicates. Of course, with many studies the number of within strain replicates will vary between strains, and this needs to be dealt with. A reasonable approach is to use the harmonic mean of n across all strains. +Finally, heritability calculations using strain means, such as those listed above, do not provide estimates of the effective heritability achieved by resampling a given line, strain, or genometype many times. Belknap ([1998](http://gn1.genenetwork.org/images/upload/Belknap_Heritability_1998.pdf)) provides corrected estimates of the effective heritability. Figure 1 from his paper (reproduced below) illustrates how resampling helps a great deal. Simply resampling each strain 8 times can boost the effective heritability from 0.2 to 0.8. 
The graph also illustrates why it often does not make sense to resample much beyond 4 to 8, depending on heritability. Belknap used the term h2RIx̅ in this figure and paper, since he was focused on data generated using recombinant inbred (RI) strains, but the logic applies equally well to any panel of genomes for which replication of individual genometypes is practical. This h2RIx̅ can be calculated simply by: +h2RIx̅ = Va / (Va+(Ve/n)) where Va is the genetic variability (variability between strains), Ve is the environmental variability (variability within strains), and n is the number of within-strain replicates. Of course, with many studies the number of within-strain replicates will vary between strains, and this needs to be dealt with. A reasonable approach is to use the harmonic mean of n across all strains. Homozygosity @@ -270,7 +270,7 @@ The interquartile range is the difference between the 75% and 25% percentiles of #### Interval Mapping: -Interval mapping is a process in which the statistical significance of a hypothetical QTL is evaluated at regular points across a chromosome, even in the absence of explicit genotype data at those points. In the case of WebQTL, significance is calculated using an efficient and very rapid regression method, the Haley-Knott regression equations ([Haley CS, Knott SA. 1992. A simple regression method for mapping quantitative trait loci in line crosses using flanking markers; Heredity 69:315–324](http://www.ncbi.nlm.nih.gov/pubmed/16718932)), in which trait values are compared to the known genotype at a marker or to the probability of a specific genotype at a test location between two flanking markers. (The three genotypes are coded as -1, 0, and +1 at known markers, but often have fractional values in the intervals between markers.) The inferred probability of the genotypes in regions that have not been genotyped can be estimated from genotypes of the closest flanking markers.
GeneNetwork/WebQTL compute linkage at intervals of 1 cM or less. As a consequence of this approach to computing linkage statistics, interval maps often have a characteristic shape in which the markers appear as sharply defined inflection points, and the intervals between nodes are smooth curves. [Chesler EJ, Dec 20, 2004; RWW April 2005; RWW Man 2014] +Interval mapping is a process in which the statistical significance of a hypothetical QTL is evaluated at regular points across a chromosome, even in the absence of explicit genotype data at those points. In the case of WebQTL, significance is calculated using an efficient and very rapid regression method, the Haley-Knott regression equations ([Haley CS, Knott SA. 1992. A simple regression method for mapping quantitative trait loci in line crosses using flanking markers; Heredity 69:315–324](http://www.ncbi.nlm.nih.gov/pubmed/16718932)), in which trait values are compared to the known genotype at a marker or to the probability of a specific genotype at a test location between two flanking markers. (The three genotypes are coded as -1, 0, and +1 at known markers, but often have fractional values in the intervals between markers.) The inferred probability of the genotypes in regions that have not been genotyped can be estimated from genotypes of the closest flanking markers. GeneNetwork/WebQTL compute linkage at intervals of 1 cM or less. As a consequence of this approach to computing linkage statistics, interval maps often have a characteristic shape in which the markers appear as sharply defined inflection points, and the intervals between nodes are smooth curves.
[Chesler EJ, Dec 20, 2004; RWW April 2005; RWW Man 2014] #### Interval Mapping Options: @@ -310,7 +310,7 @@ Interval mapping is a process in which the statistical significance of a hypothe #### Literature Correlation: -The literature correlation is a unique feature in GeneNetwork that quantifies the similarity of words used to describe genes and their functions. Sets of words associated with genes were extracted from MEDLINE/PubMed abstracts (Jan 2017 by Ramin Homayouni, Diem-Trang Pham, and Sujoy Roy). For example, about 2500 PubMed abstracts contain reference to the gene "Sonic hedgehog" (Shh) in mouse, human, or rat. The words in all of these abstracts were extracted and categorize by their information content. A word such as "the" is not interesting, but words such as "dopamine" or "development" are useful in quantifying similarity. Sets of informative words are then compared—one gene's word set is compared the word set for all other genes. Similarity values are computed for a matrix of about 20,000 genes using latent semantic indexing [(see Xu et al., 2011)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0018851). Similarity values are also known as literature correlations. These values are always positive and range from 0 to 1. Values between 0.5 and 1.0 indicate moderate-to-high levels of overlap of vocabularies. +The literature correlation is a unique feature in GeneNetwork that quantifies the similarity of words used to describe genes and their functions. Sets of words associated with genes were extracted from MEDLINE/PubMed abstracts (Jan 2017 by Ramin Homayouni, Diem-Trang Pham, and Sujoy Roy). For example, about 2500 PubMed abstracts contain a reference to the gene "Sonic hedgehog" (Shh) in mouse, human, or rat. The words in all of these abstracts were extracted and categorized by their information content. A word such as "the" is not interesting, but words such as "dopamine" or "development" are useful in quantifying similarity.
Sets of informative words are then compared: one gene's word set is compared with the word sets for all other genes. Similarity values are computed for a matrix of about 20,000 genes using latent semantic indexing [(see Xu et al., 2011)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0018851). Similarity values are also known as literature correlations. These values are always positive and range from 0 to 1. Values between 0.5 and 1.0 indicate moderate-to-high levels of overlap of vocabularies. The literature correlation can be used to compare the "semantic" signal-to-noise of different measurements of gene, mRNA, and protein expression. Consider this common situation: There are three probe sets that measure Kit gene expression (1459588\_at, 1415900\_a\_at, and 1452514\_a\_at) in the Mouse BXD Lung mRNA data set (HZI Lung M430v2 (Apr08) RMA). Which one of these three gives the best measurement of Kit expression? It is impractical to perform quantitative rtPCR studies to answer this question, but there is a solid statistical answer that relies on **Literature Correlation**. Do the following: For each of the three probe sets, generate the top 1000 literature correlates. This will generate three apparently identical lists of genes that are known from the PubMed literature to be associated with the Kit oncogene. But the three lists are NOT actually identical when we look at the **Sample Correlation** column. To answer the question "which of the three probe sets is best", review the actual performance of the probe sets against this set of 1000 "friends of Kit". Do this by sorting all three lists by their Sample Correlation column (high to low). The clear winner is probe set 1415900_a_at. The 100th row in this probe set's list has a Sample Correlation of 0.620 (absolute value). In comparison, the 100th row for probe set 1452514_a_at has a Sample Correlation of 0.289. The probe set that targets the intron comes in last at 0.275.
In conclusion, the probe set that targets the proximal half of the 3' UTR (1415900_a_at) has the highest "agreement" between Literature Correlation and Sample Correlation, and is our preferred measurement of Kit expression in the lung in this data set. (Updated by RWW and Ramin Homayouni, April 2017.) @@ -334,9 +334,9 @@ In the two likelihoods, one has maximized over the various nuisance parameters ( With complete genotype data for a marker, the log likelihood for the normal model reduces to (-n/2) times the log of the residual sum of squares. -LOD values can be converted to LRS scores (likelihood ratio statistics) by multiplying by 4.61. The LOD is also roughly equivalent to the -log(P) when the degrees of freedom of the mapping has two degrees of freedom, as in a standard F2 intercross. In such as case, where P is the probability of linkage (P = 0.001) the –logP => 3), will also equal a LOD of 3. The LOD itself is not a precise measurement of the probability of linkage, but in general for F2 crosses and RI strains, values above 3.3 will usually be worth attention for simple interval maps. +LOD values can be converted to LRS scores (likelihood ratio statistics) by multiplying by 4.61. The LOD is also roughly equivalent to the -log(P) when the mapping model has two degrees of freedom, as in a standard F2 intercross. In such a case, where P is the probability of linkage, P = 0.001 gives a -logP of 3, which also equals a LOD of 3. The LOD itself is not a precise measurement of the probability of linkage, but in general for F2 crosses and RI strains, values above 3.3 will usually be worth attention for simple interval maps. -LOD scores and –logP scores are only interchangable when models have two degrees of freedom (2 df). +LOD scores and -logP scores are only interchangeable when models have two degrees of freedom (2 df). Let us begin with an example. Suppose we have a LOD score of 3 from an F2 cross; this test has 2 df.
Let us calculate the p-value (and the logp) corresponding to this LOD score. -- cgit 1.4.1 From 0f7057bd38451ebc9c6ab96eec48b40aee4c2194 Mon Sep 17 00:00:00 2001 From: Rob Williams Date: Fri, 14 Nov 2025 18:05:10 +0000 Subject: More typos * Commit made via the GN Markdown Editor --- general/glossary/glossary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/general/glossary/glossary.md b/general/glossary/glossary.md index b717f11..883e338 100644 --- a/general/glossary/glossary.md +++ b/general/glossary/glossary.md @@ -191,7 +191,7 @@ Text here [Williams RW, July 15, 2010] #### Heritability, h2: -Heritability is a rough measure of the ability to use genetic information to predict the level of variation in phenotypes among progeny. Values range from 0 to 1 (or 0 to 100%). A value of 1 or 100% means that a trait is entirely predictable based on paternal/materinal and genetic data (for example, a strong Mendelian trait), whereas a value of 0 means that a trait is not at all predictable from information on gene variants. Estimates of heritability are highly dependent on the environment, stage, and age. +Heritability is a rough measure of the ability to use genetic information to predict the level of variation in phenotypes among progeny. Values range from 0 to 1 (or 0 to 100%). A value of 1 or 100% means that a trait is entirely predictable based on paternal and maternal genetic information (for example, a strong Mendelian trait), whereas a value of 0 means that a trait is not at all predictable from information on gene variants. Estimates of heritability are highly dependent on the environment, stage, and age. Important traits that affect fitness often have low heritabilities because stabilizing selection reduces the frequency of DNA variants that produce suboptimal phenotypes. Conversely, less critical traits for which substantial phenotypic variation is well tolerated, may have high heritability. 
The environment of laboratory rodents is unnatural, and this allows the accumulation of somewhat deleterious mutations (for example, mutations that lead to albinism). This leads to an upward trend in heritability of unselected traits in laboratory populations--a desirable feature from the point of view of the biomedical analysis of the genetic basis of trait variance. Heritability is a useful parameter to measure at an early stage of a genetic analysis, because it provides a rough gauge of the likelihood of successfully understanding the allelic sources of variation. Highly heritable traits are more amenable to mapping studies. There are numerous ways to estimate heritability, a few of which are described below. [Williams RW, Dec 23, 2004] -- cgit 1.4.1 From 6022ffa41857c0cb6949c9d43c595357087418ac Mon Sep 17 00:00:00 2001 From: Rob Williams Date: Thu, 8 Jan 2026 14:34:53 +0000 Subject: Rob with. Claude Opus 4.5 help using Pjotr's letter on MouseFS of 08Jan2026 * Commit made via the GN Markdown Editor --- general/help/facilities.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/general/help/facilities.md b/general/help/facilities.md index 5047155..aef64f6 100644 --- a/general/help/facilities.md +++ b/general/help/facilities.md @@ -1,6 +1,6 @@ # Equipment -The core [GeneNetwork team](https://github.com/genenetwork/) and [Pangenome team](https://github.com/pangenome) at UTHSC maintains modern Linux servers and storage systems for genetic, genomic, pangenome, pangenetics and phenome analyses. +The core [GeneNetwork team](https://github.com/genenetwork/) and [Pangenome team](https://github.com/pangenome) at UTHSC maintains modern Linux servers and storage systems for genetic, genomic, pangenome, pangenetics, and phenome analyses. Machines are located in in the main UTHSC machine room of the Lamar Alexander Building at UTHSC (Memphis TN campus). This is a physically secure location with raised floors and an advanced fire extinguishing system. 
@@ -20,13 +20,13 @@ In 2023 we added two machines to upgrade from tux01 and tux02 -- named tux04 and In 2020 we installed a powerful HPC cluster (Octopus) dedicated to [pangenomic](https://www.biorxiv.org/content/10.1101/2021.11.10.467921v1) and [genetic](https://genenetwork.org/) computations, consisting of 11 PowerEdge R6515 AMD EPYC 7402P 24-core CPUs (264 real cores). In 2023 we added 4 new R6625 AMD Genoa machines adding a total of 192 real CPU cores running at 4GHz (total of 438 real CPU cores). Nine of these machines are equipped with 378 GB RAM, four R6625 have 768 GB and two have 1 TB of memory. -All machines have large SSD storage (~10TB) driving the lizard shared network storage. +All machines have large SSD storage (~10TB) driving the MooseFS shared network storage. MooseFS is configured with three storage classes: 2CP (default) with one copy on SSD and one on RAID5 spinning HDD for redundancy, scratch for fast non-redundant SSD access, and raid5 for archival storage on spinning disks. All Octopus nodes run Debian and GNU Guix and use Slurm for batch submission. -We run lizardfs for distributed network file storage and we run the common workflow language (CWL) and Docker containers. +We run MooseFS for distributed network file storage and we run the common workflow language (CWL), Docker, and Apptainer containers. The racks have dedicated 10Gbs high-speed Cisco switches and firewalls that are maintained by UTHSC IT staff. This heavily used cluster, notably, is almost self-managed by its users and features on the GNU Guix High Performance Computing [2020](https://hpc.guix.info/blog/2021/02/guix-hpc-activity-report-2020/) and [2022](https://hpc.guix.info/blog/2023/02/guix-hpc-activity-report-2022/) activity reports! -The total number of cores for Octopus has essentially doubled to a total of 456 real CPU cores and the Lizardfs SSD distributed network storage is getting close to 200TB with fiber optic interconnect. 
+The total number of cores for Octopus has essentially doubled to a total of 456 real CPU cores and the MooseFS SSD distributed network storage is getting close to 200TB with fiber optic interconnect. -- cgit 1.4.1 From 079f92f1278166940a1c0e8b1e42e3bcd642010a Mon Sep 17 00:00:00 2001 From: Rob Williams Date: Thu, 8 Jan 2026 14:42:03 +0000 Subject: Just some more minor fixes * Commit made via the GN Markdown Editor --- general/help/facilities.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/general/help/facilities.md b/general/help/facilities.md index aef64f6..7283027 100644 --- a/general/help/facilities.md +++ b/general/help/facilities.md @@ -38,7 +38,7 @@ The total number of cores for Octopus has essentially doubled to a total of 456 ## Lambda server -We also run a 128 real core AMD EPYC 7713 Lambda server (2023) with 1TB RAM, 40TB nvme storage AND 8x NVIDIA RTX6000: a total of approx. 144,000 compute cores for large language models (LLMs) and AI. +We also run a 128 real core AMD EPYC 7713 Lambda server (2023) with 1TB RAM, 40TB nvme storage AND 8x NVIDIA RTX6000: a total of approx. 145,000 compute cores for large language models (LLMs) and AI. ## Backups @@ -55,7 +55,7 @@ have also two RISC-V [SiFive](https://www.sifive.com/blog/the-heart-of-risc-v-development-is-unmatched) computers for development purposes. -Additionally, together with Chris Batten of Cornell and Michael Taylor of the University of Washington, Erik Garrison and Pjotr Prins are UTHSC PIs responsible for leading the NSF-funded [RISC-V supercomputer for pangenomics](https://news.cornell.edu/stories/2021/11/5m-grant-will-tackle-pangenomics-computing-challenge). This RISC-V supercomputer 'in a rack' will come online in 2025. 
+Additionally, together with Chris Batten of Cornell and Michael Taylor of the University of Washington, Erik Garrison and Pjotr Prins are UTHSC PIs responsible for leading the NSF-funded [RISC-V supercomputer for pangenomics](https://news.cornell.edu/stories/2021/11/5m-grant-will-tackle-pangenomics-computing-challenge). This RISC-V supercomputer 'in a rack' will come online in 2026. ## ISAAC access -- cgit 1.4.1 From 5635cba9ce2abb7aa7806edf79af2b7d4580c7bd Mon Sep 17 00:00:00 2001 From: pjotr.pbl@gmail.com Date: Fri, 9 Jan 2026 13:07:07 +0000 Subject: test * Commit made via the GN Markdown Editor --- general/help/facilities.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/general/help/facilities.md b/general/help/facilities.md index 7283027..093f42e 100644 --- a/general/help/facilities.md +++ b/general/help/facilities.md @@ -1,6 +1,6 @@ # Equipment -The core [GeneNetwork team](https://github.com/genenetwork/) and [Pangenome team](https://github.com/pangenome) at UTHSC maintains modern Linux servers and storage systems for genetic, genomic, pangenome, pangenetics, and phenome analyses. +The x core [GeneNetwork team](https://github.com/genenetwork/) and [Pangenome team](https://github.com/pangenome) at UTHSC maintains modern Linux servers and storage systems for genetic, genomic, pangenome, pangenetics, and phenome analyses. Machines are located in in the main UTHSC machine room of the Lamar Alexander Building at UTHSC (Memphis TN campus). This is a physically secure location with raised floors and an advanced fire extinguishing system. 
-- cgit 1.4.1 From b43a5dfc0431a838aa201e457a813a461a2f9736 Mon Sep 17 00:00:00 2001 From: pjotr.pbl@gmail.com Date: Fri, 9 Jan 2026 13:09:28 +0000 Subject: Added Rob's changes * Commit made via the GN Markdown Editor --- general/help/facilities.md | 138 ++++++++++++++++++++++++++++++++++----------- 1 file changed, 104 insertions(+), 34 deletions(-) diff --git a/general/help/facilities.md b/general/help/facilities.md index 093f42e..7234059 100644 --- a/general/help/facilities.md +++ b/general/help/facilities.md @@ -1,75 +1,145 @@ # Equipment -The x core [GeneNetwork team](https://github.com/genenetwork/) and [Pangenome team](https://github.com/pangenome) at UTHSC maintains modern Linux servers and storage systems for genetic, genomic, pangenome, pangenetics, and phenome analyses. -Machines are located in in the main UTHSC machine room of the Lamar Alexander Building at UTHSC (Memphis TN campus). This is a physically secure location with raised +The core [GeneNetwork team](https://github.com/genenetwork/) and [Pangenome +team](https://github.com/pangenome) at UTHSC maintain modern Linux servers +and storage systems for genetic, genomic, pangenome, pangenetics, and +phenome analyses. +Machines are located in the main UTHSC machine room of the Lamar +Alexander Building at UTHSC (Memphis TN campus). This is a physically +secure location with raised floors and an advanced fire extinguishing system. We have access to this space for upgrades and hardware maintenance. -We use remote racadm and/or ipmi to all machines for out-of-band maintenance. -Issues and work packages are tracked through our 'tissue' [tracker board](https://issues.genenetwork.org/) and we use git repositories for documentation, issue tracking and planning (mostly public and some private repos available on request). -We also run [continuous integration](https://ci.genenetwork.org/) and [continuous deployment](https://cd.genenetwork.org/) services online (CI and CD).
At FOSDEM 2023 -Arun Isaac presented tissue, our [minimalist git+plain text issue tracker](https://archive.fosdem.org/2023/schedule/event/tissue/) that allows us to move away from github soure code hosting and issue trackers. - -The computing facility has four computer racks dedicated to GeneNetwork-related work and pangenomics. -Each rack has a mix of Dell PowerEdge servers (from a few older low-end R610s, R6515, and two R7425 AMD Epyc 64-core 256GB RAM systems - tux01 and tux02 - running the GeneNetwork web services). -We also support several experimental systems, including a 40-core R7425 system with 196 GB RAM and 2x NVIDIA V100 GPU (tux03), and one Penguin Computing Relion 2600GT systems (Penguin2) with NVIDIA Tesla K80 GPU used for software development and to serve outside-facing less secure R/shiny and Python services that run in isolated containers. Effectively, we have three outward facing servers that are fully utilized by the GeneNetwork team with a total of 64+64+40+28 = 196 real cores. -In 2023 we added two machines to upgrade from tux01 and tux02 -- named tux04 and tux05 resp. --- that have the latest Dell Poweredge R6625 AMD Genoa EPYC processors adding a total of 96 real CPU cores running at 4GHz. These two machines have 768Gb RAM each. +We use remote racadm and/or IPMI access to all machines for out-of-band +maintenance. +Issues and work packages are tracked through our 'tissue' [tracker board]( +https://issues.genenetwork.org/) and we use git repositories for +documentation, issue tracking and planning (mostly public and some private +repos available on request). +We also run [continuous integration](https://ci.genenetwork.org/) and +[continuous deployment](https://cd.genenetwork.org/) services online (CI +and CD). At FOSDEM 2023 +Arun Isaac presented tissue, our [minimalist git+plain text issue tracker](https://archive.fosdem.org/2023/schedule/event/tissue/) that allows us to +move away from GitHub source code hosting and issue trackers.
+ +The computing facility has four computer racks dedicated to +GeneNetwork-related work and pangenomics. +Each rack has a mix of Dell PowerEdge servers (from a few older low-end +R610s, R6515, and two R7425 AMD Epyc 64-core 256GB RAM systems - tux01 and +tux02 - running the GeneNetwork web services). +We also support several experimental systems, including a 40-core R7425 +system with 196 GB RAM and 2x NVIDIA V100 GPU (tux03), and one Penguin +Computing Relion 2600GT system (Penguin2) with NVIDIA Tesla K80 GPU used +for software development and to serve outside-facing less secure R/shiny +and Python services that run in isolated containers. Effectively, we have +three outward facing servers that are fully utilized by the GeneNetwork +team with a total of 64+64+40+28 = 196 real cores. +In 2023 we added two machines to upgrade from tux01 and tux02 -- named +tux04 and tux05 respectively -- that have the latest Dell PowerEdge R6625 AMD +Genoa EPYC processors adding a total of 96 real CPU cores running at 4GHz. +These two machines have 768 GB RAM each. ## Octopus HPC cluster -In 2020 we installed a powerful HPC cluster (Octopus) dedicated to [pangenomic](https://www.biorxiv.org/content/10.1101/2021.11.10.467921v1) and [genetic](https://genenetwork.org/) computations, consisting of 11 PowerEdge R6515 AMD EPYC 7402P 24-core CPUs (264 real cores). -In 2023 we added 4 new R6625 AMD Genoa machines adding a total of 192 real CPU cores running at 4GHz (total of 438 real CPU cores). -Nine of these machines are equipped with 378 GB RAM, four R6625 have 768 GB and two have 1 TB of memory. -All machines have large SSD storage (~10TB) driving the MooseFS shared network storage. MooseFS is configured with three storage classes: 2CP (default) with one copy on SSD and one on RAID5 spinning HDD for redundancy, scratch for fast non-redundant SSD access, and raid5 for archival storage on spinning disks. -All Octopus nodes run Debian and GNU Guix and use Slurm for batch submission.
-We run MooseFS for distributed network file storage and we run the common workflow language (CWL), Docker, and Apptainer containers. -The racks have dedicated 10Gbs high-speed Cisco switches and firewalls that are maintained by UTHSC IT staff. -This heavily used cluster, notably, is almost self-managed by its users and features on the GNU Guix High Performance Computing [2020](https://hpc.guix.info/blog/2021/02/guix-hpc-activity-report-2020/) and [2022](https://hpc.guix.info/blog/2023/02/guix-hpc-activity-report-2022/) activity reports! - -The total number of cores for Octopus has essentially doubled to a total of 456 real CPU cores and the MooseFS SSD distributed network storage is getting close to 200TB with fiber optic interconnect. +In 2020 we installed a powerful HPC cluster (Octopus) dedicated to +[pangenomic](https://www.biorxiv.org/content/10.1101/2021.11.10.467921v1) +and [genetic](https://genenetwork.org/) computations, consisting of 11 +PowerEdge R6515 AMD EPYC 7402P 24-core CPUs (264 real cores). +In 2023 we added 4 new R6625 AMD Genoa machines adding a total of 192 real +CPU cores running at 4GHz (total of 456 real CPU cores). +Nine of these machines are equipped with 378 GB RAM, four R6625 have 768 GB +and two have 1 TB of memory. +All machines have large SSD storage (~10TB) driving the MooseFS shared +network storage. MooseFS is configured with three storage classes: 2CP +(default) with one copy on SSD and one on RAID5 spinning HDD for +redundancy, scratch for fast non-redundant SSD access, and raid5 for +archival storage on spinning disks. +All Octopus nodes run Debian and GNU Guix and use Slurm for batch +submission. +We run MooseFS for distributed network file storage and we run the common +workflow language (CWL), Docker, and Apptainer containers. +The racks have dedicated 10 Gb/s high-speed Cisco switches and firewalls that +are maintained by UTHSC IT staff. 
+This heavily used cluster, notably, is almost self-managed by its users and + features on the GNU Guix High Performance Computing [2020](https://hpc.guix.info/blog/2021/02/guix-hpc-activity-report-2020/), + [2022](https://hpc.guix.info/blog/2023/02/guix-hpc-activity-report-2022/), + [2023](https://hpc.guix.info/blog/2024/02/guix-hpc-activity-report-2023/), + and + [2024](https://hpc.guix.info/blog/2025/02/guix-hpc-activity-report-2024/) +activity reports! + +The total number of cores for Octopus has essentially doubled to a total of +456 real CPU cores and the MooseFS SSD distributed network storage is +getting close to 200TB with fiber optic interconnect.
- Octopus HPC + Octopus HPC
## Lambda server -We also run a 128 real core AMD EPYC 7713 Lambda server (2023) with 1TB RAM, 40TB nvme storage AND 8x NVIDIA RTX6000: a total of approx. 145,000 compute cores for large language models (LLMs) and AI. +We also run a 128 real core AMD EPYC 7713 Lambda server (2023) with 1TB +RAM, 40TB NVMe storage and 8x NVIDIA RTX6000: a total of approx. 145,000 +compute cores for large language models (LLMs) and AI. ## Backups For backups we run three Synology servers with a total of 300TB of storage. -On demand we also deploy an off-site fallback server and encrypted backups in the Amazon cloud for the main web-service databases and files. +On demand we also deploy an off-site fallback server and encrypted backups +in the Amazon cloud for the main web-service databases and files. ## Specials We run some 'specials' including an ARM-based NVIDIA Jetson and a RISC-V [PolarFire -SOC](https://www.cnx-software.com/2020/07/20/polarfire-soc-icicle-64-bit-risc-v-and-fpga-development-board-runs-linux-or-freebsd/). +SOC](https://www.cnx-software.com/2020/07/20/polarfire-soc-icicle-64-bit-risc-v-and-fpga-development-board-runs-linux-or-freebsd/ +). We have also two RISC-V [SiFive](https://www.sifive.com/blog/the-heart-of-risc-v-development-is-unmatched) computers for development purposes. -Additionally, together with Chris Batten of Cornell and Michael Taylor of the University of Washington, Erik Garrison and Pjotr Prins are UTHSC PIs responsible for leading the NSF-funded [RISC-V supercomputer for pangenomics](https://news.cornell.edu/stories/2021/11/5m-grant-will-tackle-pangenomics-computing-challenge). This RISC-V supercomputer 'in a rack' will come online in 2026. 
+Additionally, together with Chris Batten of Cornell and Michael Taylor of +the University of Washington, Erik Garrison and Pjotr Prins are UTHSC PIs +responsible for leading the NSF-funded [RISC-V supercomputer for +pangenomics](https://news.cornell.edu/stories/2021/11/5m-grant-will-tackle-pangenomics-computing-challenge). +This RISC-V supercomputer 'in a rack' will come online in 2026. ## ISAAC access -In addition to above hardware the GeneNetwork team has batch submission access to the HIPAA complient cluster computing resource at the ISAAC computing facility operated by the UT Joint Institute for Computational Sciences in a secure setup at the DOE Oak Ridge National Laboratory (ORNL) and on the UT Knoxville campus. -We have a 10 Gbit connection from the machine room at UTHSC to data transfer nodes at ISAAC. ISAAC is continuously being upgraded (see [ISAAC system overview](https://oit.utk.edu/hpsc/available-resources/)) and has over 7 PB of high-performance Lustre DDN storage and contains over 20,000 cores with some large RAM nodes and 32 GPU nodes. -Drs. Prins, Garrison, Colonna, Chen, Ashbrook and other team members use ISAAC systems to analyze genomic and genetic data sets. -Note that we can not use ISAAC and storage facilities for public-facing web services because of stringent security requirements. -ISAAC however, can be highly useful for precomputed genomics and genetics results using standardized pipelines. +In addition to the above hardware the GeneNetwork team has batch submission +access to the HIPAA compliant cluster computing resource at the ISAAC +computing facility operated by the UT Joint Institute for Computational +Sciences in a secure setup at the DOE Oak Ridge National Laboratory (ORNL) +and on the UT Knoxville campus. +We have a 10 Gbit connection from the machine room at UTHSC to data +transfer nodes at ISAAC. 
ISAAC is continuously being upgraded (see [ISAAC +system overview](https://oit.utk.edu/hpsc/available-resources/)) and has +over 7 PB of high-performance Lustre DDN storage and contains over 20,000 +cores with some large RAM nodes and 32 GPU nodes. +Drs. Prins, Garrison, Colonna, Chen, Ashbrook and other team members use +ISAAC systems to analyze genomic and genetic data sets. +Note that we cannot use ISAAC and storage facilities for public-facing web +services because of stringent security requirements. +ISAAC, however, can be highly useful for precomputed genomics and genetics +results using standardized pipelines. ## Deployment -The software stack is maintained and deployed throughout with GNU Guix, a modern software package manager that allows running Docker and Apptainer (formerly Singularity) containers as well as full system containers and VMs. -All current tools are maintained on [https://gitlab.com/genenetwork/guix-bioinformatics](https://gitlab.com/genenetwork/guix-bioinformatics). Dr Garrison's pangenome tools are packaged on [https://github.com/ekg/guix-genomics](https://github.com/ekg/guix-genomics). +The software stack is maintained and deployed throughout with GNU Guix, a +modern software package manager that allows running Docker and Apptainer +(formerly Singularity) containers as well as full system containers and VMs. +All current tools are maintained on [ +https://gitlab.com/genenetwork/guix-bioinformatics](https://gitlab.com/genenetwork/guix-bioinformatics). +Dr Garrison's pangenome tools are packaged on [https://github.com/ekg/guix-genomics](https://github.com/ekg/guix-genomics). ## Cloud computing -In addition the the "bare metal" described above we increasingly use cloud services for running VMs for teaching and fallbacks, as well as for storing data, including backups. 
+In addition to the "bare metal" described above, we increasingly use cloud +services for running VMs for teaching and fallbacks, as well as for +storing data, including backups. -- cgit 1.4.1