1 files changed, 306 insertions, 0 deletions
diff --git a/doc/example/data-munging.org b/doc/example/data-munging.org
new file mode 100644
index 0000000..62dfb24
--- /dev/null
+++ b/doc/example/data-munging.org
@@ -0,0 +1,306 @@
+* Preparing Data for GEMMA
+
+GEMMA requires data to be presented in a certain way to run
+successfully. In this document we describe how you can prepare
+genotype and phenotype data so it passes through GEMMA correctly.
+
+Note that GEMMA uses spaces, tabs and comma's as valid field
+separators. Here we use tabs.
+
+** Genotypes
+
+In this example we have a genotype spreadsheet containing something like this
+
+```
+@type:riset
+@mat:B
+@pat:D
+@het:H
+@unk:U
+Chr     Locus   cM      Mb      BXD1    BXD2    BXD5    BXD6    BXD8    BXD9    BXD11   ...
+1       rs31443144      1.50    3.010274        B       B       D       D       D       ...
+1       rs6269442       1.50    3.492195        B       B       D       D       D       ...
+(...)
+```
+
+GEMMA, instead, requires a BIMBAM 'mean genotyped' formatted file
+which looks like
+
+```
+rs31443144, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
+rs6269442, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
+(...)
+```
+
+where the first three positions refer to SNP id, minor allele, major
+allele (both unused) and the mean genotypes where B has the value 1, D
+has the value 0, H has the value 0.5 and U is marked as NA. Note there
+is no header line.
+
+At this point you may decide to use a spreadsheet or editor to modify
+these by hand. The problem is that such a procedure is error prone and
+hard to reproduce. Always choose a programming method if you can! For
+further information check out these
+[suggestions](http://kbroman.org/steps2rr/pages/scripts.html).
+
+For most people using R or Python would be the best way to convert the
+data using a small script. We will add those instructions later, but
+here I am going to use a few Unix tools and the
+[bio-table](https://github.com/pjotrp/bioruby-table) tool to manage
+this conversion from the command line. First, make sure Ruby is
+available
+
+```sh
+ruby -v
+ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux]
+```
+
+and install bioruby-table
+
+```sh
+gem install bio-table
+```
+
+On success you should be able to run
+
+```
+bio-table
+```
+
+In the first step we need to strip the header lines. This can be done with --skip, e.g.
+
+```sh
+bio-table --skip 20 BXD.geno > BXD_noheader.geno
+```
+
+We'll keep the header line for some column processing.  Next we drop
+the first column and keep the rest with --columns. For this exercise
+we are only interested in a subset of BXDs, so
+
+```sh
+bio-table --columns Locus,BXD100,BXD101,BXD102,BXD24,BXD32,BXD34,BXD40,BXD43,BXD44,BXD45,BXD48,BXD48a,BXD49,BXD50,BXD51,BXD55,BXD60,BXD61,BXD62,BXD63,BXD65,BXD65b,BXD66,BXD68,BXD69,BXD70,BXD71,BXD73,BXD73a,BXD73b,BXD75,BXD77,BXD78,BXD79,BXD83,BXD86,BXD87,BXD90,BXD98,C57BL/6J,DBA/2J BXD_noheader.geno > BXD_cols.geno
+```
+
+selects those columns by name. If a name is a mis-match bio-table will
+balk. In this case the parents C57BL/6J,DBA/2J are missing and had to
+be taken out.
+
+It is also possible to use regex's to select the columns (with
+--column-filter). Note that bio-table takes tab delimited files by
+default, to use a different delimiter use something like --in-format
+split --split-on ',' or a regex. Next we inject columns 1 and 2 by
+
+```sh
+bio-table --rewrite 'field[0] = "X\tY\t" + field[0]' BXD_cols.geno > BXD_cols2.geno
+```
+
+Next rewrite the field contents with something like
+
+```sh
+bio-table BXD_cols2.geno --rewrite 'fields = fields.map { |f| h={"X"=>"X", "Y"=>"Y", "D"=> 0, "B"=>1, "H"=>0.5} ; h[f] }' > BXD_mine.geno
+```
+
+And the resulting BIMBAM format looks like
+
+```
+rs31443144      X       Y       B       B       B       B       B       B       B       B       D       B       B       B       D       D
+       B       B       B       D       D       D       D       D       B       D       B       B       B       D       D       D       D       D       D       D       B       D       B       B       B
+rs6269442       X       Y       1       1       1       1       1       1       1       1       0       1       1       1       0       0
+       1       1       1       0       0       0       0       0       1       0       1       1       1       0       0       0       0       0       0       0       1       0       1       1       1
+```
+
+[bio-table](https://github.com/pjotrp/bioruby-table) has an extensive
+README and many other options, check it out.
+
+** Annotation
+
+To run gemma an annotation file is optional. This file again has no
+header and looks like
+
+```
+rs31443144      3010274 1
+rs6269442       3492195 1
+rs32285189      3511204 1
+```
+
+where columns refer to the SNP name (same as in the genotype file),
+the position and the chromosome number.  If you don't supply this
+information and run gemma with the -debug switch it will complain
+
+```
+**** DEBUG: Can't figure out position for rs31443144 in src/io.cpp at line 683 in ReadFile_geno
+**** DEBUG: Can't figure out position for rs6269442 in src/io.cpp at line 683 in ReadFile_geno
+```
+
+This is harmless until you want to run LOCO.
+
+** Phenotypes
+
+The phenotypes are simply a matrix. Again, without headers. Missing
+values are marked as NA. Save the file so it looks like
+[this](https://github.com/genetics-statistics/GEMMA/blob/master/example/mouse_hs1940.pheno.txt).
+When running GEMMA the column number gets passed in.
+
+** Running Gemma
+
+*** Relatedness/kinship matrix
+
+The first step is to compute a kinship matrix K. With our files we try
+
+```
+gemma -g BXD_mine.geno -p BXD_pheno.csv -gk
+GEMMA 0.97 (2017/12/19) by Xiang Zhou and team (C) 2012-2017
+Reading Files ...
+Segmentation fault
+```
+
+Unfortunately there is an immediate problem. Try running with -debug
+
+```
+gemma -g BXD_mine.geno -p BXD_pheno.csv -gk -debug
+GEMMA 0.97 (2017/12/19) by Xiang Zhou and team (C) 2012-2017
+Reading Files ...
+**** DEBUG: entered in src/io.cpp at line 353 in ReadFile_pheno
+**** DEBUG: entered in src/io.cpp at line 608 in ReadFile_geno
+**** DEBUG: Can't figure out position for rs31443144 in src/io.cpp at line 683 in ReadFile_geno
+strtok failed in ReadFile_geno in src/io.cpp at line 706
+```
+
+and we get a slightly more informative error. There is a parsing
+problem. GEMMA is not very good at data format checking so we need to
+do some checks. First see if the number of individuals matches in the
+phenotype file and genotype file. In this case the phenotype file was
+too large. In still included a header line and the parent phenotypes.
+
+After removing those it was fine.
+
+Gemma contains helpful information, particularly
+
+```sh
+gemma -h 2
+```
+
+where we can select the column in the phenotype file with -n. This is important
+because (at this point) K accounts for missing values:
+
+```sh
+gemma -g BXD_mine.geno -p BXD_pheno.csv -gk -n 2
+```
+
+by default it writes K to output/result.cXX.txt.
+
+Gemma also has a leave one chromosoe out (LOCO) option, but it is best
+to use
+[gemma-wrapper](https://github.com/genetics-statistics/gemma-wrapper)
+for that because gemma-wrapper iterates through all chromosomes. More
+on that below.
+
+Now we have K, let's run an LMM:
+
+** LMM
+
+Running an LMM on the phenotype in column 2 (they are numbered 1,2,...)
+
+```
+gemma -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -k output/result.cXX.txt \
+    -lmm 1 -maf 0.1 -n 2 \
+    -o n2_assoc.txt
+```
+
+gets a result file in output/result.assoc.txt
+
+** LOCO
+
+For LOCO we use gemma-wrapper because it facilitates running through
+the chromosomes. Essentially the command is the same (after -- gets
+passed to gemma):
+
+```
+gemma-wrapper --json \
+    --loco 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X -- \
+    -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -gk \
+    -n 2 \
+    -debug > K.json
+```
+
+Note that the annotation file now is *required* to find the chromosomes. Gemma-wrapper computes
+the K's for every chromosome and the filenames are passed into K.json. A file which can be fed
+to the LMM:
+
+```
+gemma-wrapper --loco --input K.json -- \
+    -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -lmm 1 -maf 0.1 -n 2 \
+    -debug > GWA.json
+```
+
+Inside the resulting GWA.json file there is a list of results. Check
+out the paths and compile the result into one file with, for example,
+
+```
+bio-table /tmp/aa2a1c1a67fe2289d6a23afcc025818402f97521.*.assoc.txt.assoc.txt > LOCO_n2_assoc.txt
+```
+
+** Permutations
+
+To get an idea of what is a significant hit we can also use run gemma
+1000x after shuffling the phenotypes. gemma-wrapper also has an option
+for that. First create K
+
+```
+gemma-wrapper --json -- \
+    -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -gk \
+    -n 2 \
+    -debug > K.json
+```
+
+Now run once with
+
+```
+gemma-wrapper --json --input K.json -- \
+    -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -lmm 2 -maf 0.1 -n 2 \
+    -debug > GWA.json
+```
+
+GWA.json should point to the result set.
+
+To run the permutations add one option and move the -p option to
+--phenotypes *before* the double dash
+
+```
+gemma-wrapper --permutate 1000 --phenotypes BXD_pheno.csv --input K.json -- \
+    -g BXD_mine.geno \
+    -a example/BXD_snps.txt \
+    -lmm 1 -maf 0.1 -n 2 \
+    -debug > GWA.json
+```
+
+gemma-wrapper prints out the 95 and 67 percentiles where the first may
+be considered the 'significant' threshold and the latter is the
+'suggestive' threshold. For example
+
+```
+["95 percentile (significant) ", 3.382905e-05, 4.5]
+["67 percentile (suggestive)  ", 0.0003651852, 3.4]
+```
+
+where 4.5 is the LOD, i.e., -log10(3.382905e-05)
+
+** Annotate
+
+Once there is a list of associations out of GEMMA you may use the gemma-annotate
+tool as described [here](https://github.com/pjotrp/gemma-annotate).