Doc: remove old files

author: Pjotr Prins 2024-12-09 11:40:27 +0100
committer: Pjotr Prins 2024-12-09 11:40:33 +0100
commit: 2001612005c6c37ef7f87582ecbce63576c88c9e (patch)
tree: 41b90c32e6efa63b5f103aebd9e883eca7710aa4 /doc
parent: 50ee8aff3335d15081680bb0e60016fd8be3d127 (diff)
download: pangemma-2001612005c6c37ef7f87582ecbce63576c88c9e.tar.gz
3 files changed, 0 insertions, 388 deletions
diff --git a/doc/developers/design.org b/doc/developers/design.org
deleted file mode 100644
index 859e3f6..0000000
--- a/doc/developers/design.org
+++ /dev/null
@@ -1,52 +0,0 @@
-* GEMMA Design Document
-
-** Introduction
-
-With the v0.98 release GEMMA has stabilized and contains extensive
-error checking. To move faster we are moving towards integrating the
-faster-lmm-d code base which is written in the D programming
-language. We are also add interfaces for Python and R (and other
-languages). We will try to keep a legacy C++ based GEMMA as long as
-possible, but for performance and features it is likely a D compiler
-is required. The good news is that most distributions contain D
-compilers today.
-
-** Faster-lmm-d integration
-
-Faster-lmm-d is mostly a rewrite of GEMMA univariate LMM and
-multivariate LMM resolvers. We compile faster-lmm-d as a library that
-can be linked against GEMMA. For computing K, for example, there are
-two modes: (1) that has all genotype data in RAM and (2) that loads
-the genotype data directly from a geno file.
-
-** Improved data formats
-
-The original data formats are somewhat lacking because they make error
-correction hard. In collaboration with the R/qtl2 project we aim for
-supporting newer formats.
-
-*** Kinship format
-
-As the kinship mastrix K is symmetric we only need to store half the
-data. Also we want to be able to filter and validate on the names of
-individuals/samples. Next we compress it. A comparison of formats is
-[[https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO][here]]. Decompression speed is most critical and [[https://github.com/lz4/lz4][lz4]] does a great job
-there (lz4 is used in CRAM and sambamba). According to [[https://www.dummeraugust.com/main/content/blog/posts.php?pid=173][this comparison]]
-text processing is fairly similar between gzip and lz4. lz4 files are
-a bit larger, so decompression gains may be offset by network speeds.
-
-To recognise the tab dilimited file we'll add a header with nind's:
-
-#+BEGIN_SRC
-# GRMv1.0
-# nind=900
-ind1 0.1436717816 0.006341902008 0.007596806816 ...
-ind2 0.007996662028 0.008741860935 0.008489758779 ...
-...
-ind900 0.002311556029
-#+END_SRC
-
-where each row is one value shorter describing the right top half of
-K. This setup allows one to use a K with for exaple ind1 missing -
-just remove that row and column. The data will be stored as
-name.cXX.txt.lz4 (later add the alternative name.cxx.txt.gz).
diff --git a/doc/developers/profiling.md b/doc/developers/profiling.md
deleted file mode 100644
index 0d26453..0000000
--- a/doc/developers/profiling.md
+++ /dev/null
@@ -1,30 +0,0 @@
-# Profiling
-
-gperftools (formerly the Google profiler) is included in the .guix-dev
-startup script. Compile gemma for profiling:
-
-    make clean
-    make profile
-
-Run the profiler
-
-     env CPUPROFILE=/tmp/prof.out ./bin/gemma -g ./example/mouse_hs1940.geno.txt.gz -p ./example/mouse_hs1940.pheno.txt     -gk -o mouse_hs1940
-     pprof ./bin/gemma /tmp/prof.out
-
-and `top` shows
-
-```
-Welcome to pprof!  For help, type 'help'.
-(pprof) top
-Total: 720 samples
-     103  14.3%  14.3%      103  14.3% dgemm_kernel_ZEN
-      39   5.4%  19.7%       79  11.0% ____strtod_l_internal
-      37   5.1%  24.9%       53   7.4% __printf_fp_l
-      36   5.0%  29.9%       36   5.0% __sched_yield
-      34   4.7%  34.6%       34   4.7% __strlen_avx2
-      31   4.3%  38.9%       31   4.3% __strspn_sse42
-      26   3.6%  42.5%      116  16.1% ReadFile_geno
-      25   3.5%  46.0%       26   3.6% _int_malloc
-      23   3.2%  49.2%       23   3.2% gsl_vector_set
-      18   2.5%  51.7%       18   2.5% __strcspn_sse42
-```
diff --git a/doc/example/data-munging.org b/doc/example/data-munging.org
deleted file mode 100644
index 62dfb24..0000000
--- a/doc/example/data-munging.org
+++ /dev/null
@@ -1,306 +0,0 @@
-* Preparing Data for GEMMA
-
-GEMMA requires data to be presented in a certain way to run
-successfully. In this document we describe how you can prepare
-genotype and phenotype data so it passes through GEMMA correctly.
-
-Note that GEMMA uses spaces, tabs and comma's as valid field
-separators. Here we use tabs.
-
-** Genotypes
-
-In this example we have a genotype spreadsheet containing something like this
-
-```
-@type:riset
-@mat:B
-@pat:D
-@het:H
-@unk:U
-Chr     Locus   cM      Mb      BXD1    BXD2    BXD5    BXD6    BXD8    BXD9    BXD11   ...
-1       rs31443144      1.50    3.010274        B       B       D       D       D       ...
-1       rs6269442       1.50    3.492195        B       B       D       D       D       ...
-(...)
-```
-
-GEMMA, instead, requires a BIMBAM 'mean genotyped' formatted file
-which looks like
-
-```
-rs31443144, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
-rs6269442, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
-(...)
-```
-
-where the first three positions refer to SNP id, minor allele, major
-allele (both unused) and the mean genotypes where B has the value 1, D
-has the value 0, H has the value 0.5 and U is marked as NA. Note there
-is no header line.
-
-At this point you may decide to use a spreadsheet or editor to modify
-these by hand. The problem is that such a procedure is error prone and
-hard to reproduce. Always choose a programming method if you can! For
-further information check out these
-[suggestions](http://kbroman.org/steps2rr/pages/scripts.html).
-
-For most people using R or Python would be the best way to convert the
-data using a small script. We will add those instructions later, but
-here I am going to use a few Unix tools and the
-[bio-table](https://github.com/pjotrp/bioruby-table) tool to manage
-this conversion from the command line. First, make sure Ruby is
-available
-
-```sh
-ruby -v
-ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux]
-```
-
-and install bioruby-table
-
-```sh
-gem install bio-table
-```
-
-On success you should be able to run
-
-```
-bio-table
-```
-
-In the first step we need to strip the header lines. This can be done with --skip, e.g.
-
-```sh
-bio-table --skip 20 BXD.geno > BXD_noheader.geno
-```
-
-We'll keep the header line for some column processing.  Next we drop
-the first column and keep the rest with --columns. For this exercise
-we are only interested in a subset of BXDs, so
-
-```sh
-bio-table --columns Locus,BXD100,BXD101,BXD102,BXD24,BXD32,BXD34,BXD40,BXD43,BXD44,BXD45,BXD48,BXD48a,BXD49,BXD50,BXD51,BXD55,BXD60,BXD61,BXD62,BXD63,BXD65,BXD65b,BXD66,BXD68,BXD69,BXD70,BXD71,BXD73,BXD73a,BXD73b,BXD75,BXD77,BXD78,BXD79,BXD83,BXD86,BXD87,BXD90,BXD98,C57BL/6J,DBA/2J BXD_noheader.geno > BXD_cols.geno
-```
-
-selects those columns by name. If a name is a mis-match bio-table will
-balk. In this case the parents C57BL/6J,DBA/2J are missing and had to
-be taken out.
-
-It is also possible to use regex's to select the columns (with
---column-filter). Note that bio-table takes tab delimited files by
-default, to use a different delimiter use something like --in-format
-split --split-on ',' or a regex. Next we inject columns 1 and 2 by
-
-```sh
-bio-table --rewrite 'field[0] = "X\tY\t" + field[0]' BXD_cols.geno > BXD_cols2.geno
-```
-
-Next rewrite the field contents with something like
-
-```sh
-bio-table BXD_cols2.geno --rewrite 'fields = fields.map { |f| h={"X"=>"X", "Y"=>"Y", "D"=> 0, "B"=>1, "H"=>0.5} ; h[f] }' > BXD_mine.geno
-```
-
-And the resulting BIMBAM format looks like
-
-```
-rs31443144      X       Y       B       B       B       B       B       B       B       B       D       B       B       B       D       D
-       B       B       B       D       D       D       D       D       B       D       B       B       B       D       D       D       D       D       D       D       B       D       B       B       B
-rs6269442       X       Y       1       1       1       1       1       1       1       1       0       1       1       1       0       0
-       1       1       1       0       0       0       0       0       1       0       1       1       1       0       0       0       0       0       0       0       1       0       1       1       1
-```
-
-[bio-table](https://github.com/pjotrp/bioruby-table) has an extensive
-README and many other options, check it out.
-
-** Annotation
-
-To run gemma an annotation file is optional. This file again has no
-header and looks like
-
-```
-rs31443144      3010274 1
-rs6269442       3492195 1
-rs32285189      3511204 1
-```
-
-where columns refer to the SNP name (same as in the genotype file),
-the position and the chromosome number.  If you don't supply this
-information and run gemma with the -debug switch it will complain
-
-```
-**** DEBUG: Can't figure out position for rs31443144 in src/io.cpp at line 683 in ReadFile_geno
-**** DEBUG: Can't figure out position for rs6269442 in src/io.cpp at line 683 in ReadFile_geno
-```
-
-This is harmless until you want to run LOCO.
-
-** Phenotypes
-
-The phenotypes are simply a matrix. Again, without headers. Missing
-values are marked as NA. Save the file so it looks like
-[this](https://github.com/genetics-statistics/GEMMA/blob/master/example/mouse_hs1940.pheno.txt).
-When running GEMMA the column number gets passed in.
-
-** Running Gemma
-
-*** Relatedness/kinship matrix
-
-The first step is to compute a kinship matrix K. With our files we try
-
-```
-gemma -g BXD_mine.geno -p BXD_pheno.csv -gk
-GEMMA 0.97 (2017/12/19) by Xiang Zhou and team (C) 2012-2017
-Reading Files ...
-Segmentation fault
-```
-
-Unfortunately there is an immediate problem. Try running with -debug
-
-```
-gemma -g BXD_mine.geno -p BXD_pheno.csv -gk -debug
-GEMMA 0.97 (2017/12/19) by Xiang Zhou and team (C) 2012-2017
-Reading Files ...
-**** DEBUG: entered in src/io.cpp at line 353 in ReadFile_pheno
-**** DEBUG: entered in src/io.cpp at line 608 in ReadFile_geno
-**** DEBUG: Can't figure out position for rs31443144 in src/io.cpp at line 683 in ReadFile_geno
-strtok failed in ReadFile_geno in src/io.cpp at line 706
-```
-
-and we get a slightly more informative error. There is a parsing
-problem. GEMMA is not very good at data format checking so we need to
-do some checks. First see if the number of individuals matches in the
-phenotype file and genotype file. In this case the phenotype file was
-too large. In still included a header line and the parent phenotypes.
-
-After removing those it was fine.
-
-Gemma contains helpful information, particularly
-
-```sh
-gemma -h 2
-```
-
-where we can select the column in the phenotype file with -n. This is important
-because (at this point) K accounts for missing values:
-
-```sh
-gemma -g BXD_mine.geno -p BXD_pheno.csv -gk -n 2
-```
-
-by default it writes K to output/result.cXX.txt.
-
-Gemma also has a leave one chromosoe out (LOCO) option, but it is best
-to use
-[gemma-wrapper](https://github.com/genetics-statistics/gemma-wrapper)
-for that because gemma-wrapper iterates through all chromosomes. More
-on that below.
-
-Now we have K, let's run an LMM:
-
-** LMM
-
-Running an LMM on the phenotype in column 2 (they are numbered 1,2,...)
-
-```
-gemma -g BXD_mine.geno \
-    -p BXD_pheno.csv \
-    -a example/BXD_snps.txt \
-    -k output/result.cXX.txt \
-    -lmm 1 -maf 0.1 -n 2 \
-    -o n2_assoc.txt
-```
-
-gets a result file in output/result.assoc.txt
-
-** LOCO
-
-For LOCO we use gemma-wrapper because it facilitates running through
-the chromosomes. Essentially the command is the same (after -- gets
-passed to gemma):
-
-```
-gemma-wrapper --json \
-    --loco 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X -- \
-    -g BXD_mine.geno \
-    -p BXD_pheno.csv \
-    -a example/BXD_snps.txt \
-    -gk \
-    -n 2 \
-    -debug > K.json
-```
-
-Note that the annotation file now is *required* to find the chromosomes. Gemma-wrapper computes
-the K's for every chromosome and the filenames are passed into K.json. A file which can be fed
-to the LMM:
-
-```
-gemma-wrapper --loco --input K.json -- \
-    -g BXD_mine.geno \
-    -p BXD_pheno.csv \
-    -a example/BXD_snps.txt \
-    -lmm 1 -maf 0.1 -n 2 \
-    -debug > GWA.json
-```
-
-Inside the resulting GWA.json file there is a list of results. Check
-out the paths and compile the result into one file with, for example,
-
-```
-bio-table /tmp/aa2a1c1a67fe2289d6a23afcc025818402f97521.*.assoc.txt.assoc.txt > LOCO_n2_assoc.txt
-```
-
-** Permutations
-
-To get an idea of what is a significant hit we can also use run gemma
-1000x after shuffling the phenotypes. gemma-wrapper also has an option
-for that. First create K
-
-```
-gemma-wrapper --json -- \
-    -g BXD_mine.geno \
-    -p BXD_pheno.csv \
-    -a example/BXD_snps.txt \
-    -gk \
-    -n 2 \
-    -debug > K.json
-```
-
-Now run once with
-
-```
-gemma-wrapper --json --input K.json -- \
-    -g BXD_mine.geno \
-    -p BXD_pheno.csv \
-    -a example/BXD_snps.txt \
-    -lmm 2 -maf 0.1 -n 2 \
-    -debug > GWA.json
-```
-
-GWA.json should point to the result set.
-
-To run the permutations add one option and move the -p option to
---phenotypes *before* the double dash
-
-```
-gemma-wrapper --permutate 1000 --phenotypes BXD_pheno.csv --input K.json -- \
-    -g BXD_mine.geno \
-    -a example/BXD_snps.txt \
-    -lmm 1 -maf 0.1 -n 2 \
-    -debug > GWA.json
-```
-
-gemma-wrapper prints out the 95 and 67 percentiles where the first may
-be considered the 'significant' threshold and the latter is the
-'suggestive' threshold. For example
-
-```
-["95 percentile (significant) ", 3.382905e-05, 4.5]
-["67 percentile (suggestive)  ", 0.0003651852, 3.4]
-```
-
-where 4.5 is the LOD, i.e., -log10(3.382905e-05)
-
-** Annotate
-
-Once there is a list of associations out of GEMMA you may use the gemma-annotate
-tool as described [here](https://github.com/pjotrp/gemma-annotate).
author	Pjotr Prins	2024-12-09 11:40:27 +0100
committer	Pjotr Prins	2024-12-09 11:40:33 +0100
commit	2001612005c6c37ef7f87582ecbce63576c88c9e (patch)
tree	41b90c32e6efa63b5f103aebd9e883eca7710aa4 /doc
parent	50ee8aff3335d15081680bb0e60016fd8be3d127 (diff)
download	pangemma-2001612005c6c37ef7f87582ecbce63576c88c9e.tar.gz