doc: data-munging

author: Pjotr Prins 2018-05-02 11:23:45 +0000
committer: Pjotr Prins 2018-05-02 11:23:45 +0000
commit: cee92876e456299e14afd51e0b719384c48c5286 (patch)
tree: f96b864fa3668618408a559e86c7030257279e78
parent: 135c977cf25a688846c0f8180720c7880a90a7dd (diff)
download: pangemma-cee92876e456299e14afd51e0b719384c48c5286.tar.gz
2 files changed, 321 insertions, 4 deletions
diff --git a/README.md b/README.md
index 355a3a9..c48d0d6 100644
--- a/README.md
+++ b/README.md
@@ -159,6 +159,16 @@ hardware. See [INSTALL.md](INSTALL.md) for speeding up tips.
 More information on source code, dependencies and installation can be
 found in [INSTALL.md](INSTALL.md).
 
+## Input data formats
+
+Currently GEMMA takes two types of input formats
+
+1. BIMBAM format (preferred)
+2. PLINK format
+
+See this [example](./doc/example/data-munging.org) where we convert some
+spreadsheets for use in GEMMA.
+
 ## Reporting a GEMMA bug or issue
 
 For bugs GEMMA has an
@@ -170,10 +180,11 @@ Before posting an issue search the issue tracker and mailing list
 first. It is likely someone may have encountered something
 similiar. Also try running the latest version of GEMMA to make sure it
 has not been fixed already. Support/installation questions should be
-aimed at the mailing list. The issue tracker is for development issues
-around the software itself. When reporting an issue include the output
-of the program and the contents of the .log.txt file in the output
-directory.
+aimed at the mailing list - it is the best resource to get answers.
+
+The issue tracker is specifically meant for development issues around
+the software itself. When reporting an issue include the output of the
+program and the contents of the .log.txt file in the output directory.
 
 ### Check list:
 
diff --git a/doc/example/data-munging.org b/doc/example/data-munging.org
new file mode 100644
index 0000000..62dfb24
--- /dev/null
+++ b/doc/example/data-munging.org
@@ -0,0 +1,306 @@
+* Preparing Data for GEMMA
+
+GEMMA requires data to be presented in a certain way to run
+successfully. In this document we describe how you can prepare
+genotype and phenotype data so it passes through GEMMA correctly.
+
+Note that GEMMA uses spaces, tabs and comma's as valid field
+separators. Here we use tabs.
+
+** Genotypes
+
+In this example we have a genotype spreadsheet containing something like this
+
+```
+@type:riset
+@mat:B
+@pat:D
+@het:H
+@unk:U
+Chr     Locus   cM      Mb      BXD1    BXD2    BXD5    BXD6    BXD8    BXD9    BXD11   ...
+1       rs31443144      1.50    3.010274        B       B       D       D       D       ...
+1       rs6269442       1.50    3.492195        B       B       D       D       D       ...
+(...)
+```
+
+GEMMA, instead, requires a BIMBAM 'mean genotyped' formatted file
+which looks like
+
+```
+rs31443144, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
+rs6269442, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
+(...)
+```
+
+where the first three positions refer to SNP id, minor allele, major
+allele (both unused) and the mean genotypes where B has the value 1, D
+has the value 0, H has the value 0.5 and U is marked as NA. Note there
+is no header line.
+
+At this point you may decide to use a spreadsheet or editor to modify
+these by hand. The problem is that such a procedure is error prone and
+hard to reproduce. Always choose a programming method if you can! For
+further information check out these
+[suggestions](http://kbroman.org/steps2rr/pages/scripts.html).
+
+For most people using R or Python would be the best way to convert the
+data using a small script. We will add those instructions later, but
+here I am going to use a few Unix tools and the
+[bio-table](https://github.com/pjotrp/bioruby-table) tool to manage
+this conversion from the command line. First, make sure Ruby is
+available
+
+```sh
+ruby -v
+ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux]
+```
+
+and install bioruby-table
+
+```sh
+gem install bio-table
+```
+
+On success you should be able to run
+
+```
+bio-table
+```
+
+In the first step we need to strip the header lines. This can be done with --skip, e.g.
+
+```sh
+bio-table --skip 20 BXD.geno > BXD_noheader.geno
+```
+
+We'll keep the header line for some column processing.  Next we drop
+the first column and keep the rest with --columns. For this exercise
+we are only interested in a subset of BXDs, so
+
+```sh
+bio-table --columns Locus,BXD100,BXD101,BXD102,BXD24,BXD32,BXD34,BXD40,BXD43,BXD44,BXD45,BXD48,BXD48a,BXD49,BXD50,BXD51,BXD55,BXD60,BXD61,BXD62,BXD63,BXD65,BXD65b,BXD66,BXD68,BXD69,BXD70,BXD71,BXD73,BXD73a,BXD73b,BXD75,BXD77,BXD78,BXD79,BXD83,BXD86,BXD87,BXD90,BXD98,C57BL/6J,DBA/2J BXD_noheader.geno > BXD_cols.geno
+```
+
+selects those columns by name. If a name is a mis-match bio-table will
+balk. In this case the parents C57BL/6J,DBA/2J are missing and had to
+be taken out.
+
+It is also possible to use regex's to select the columns (with
+--column-filter). Note that bio-table takes tab delimited files by
+default, to use a different delimiter use something like --in-format
+split --split-on ',' or a regex. Next we inject columns 1 and 2 by
+
+```sh
+bio-table --rewrite 'field[0] = "X\tY\t" + field[0]' BXD_cols.geno > BXD_cols2.geno
+```
+
+Next rewrite the field contents with something like
+
+```sh
+bio-table BXD_cols2.geno --rewrite 'fields = fields.map { |f| h={"X"=>"X", "Y"=>"Y", "D"=> 0, "B"=>1, "H"=>0.5} ; h[f] }' > BXD_mine.geno
+```
+
+And the resulting BIMBAM format looks like
+
+```
+rs31443144      X       Y       B       B       B       B       B       B       B       B       D       B       B       B       D       D
+       B       B       B       D       D       D       D       D       B       D       B       B       B       D       D       D       D       D       D       D       B       D       B       B       B
+rs6269442       X       Y       1       1       1       1       1       1       1       1       0       1       1       1       0       0
+       1       1       1       0       0       0       0       0       1       0       1       1       1       0       0       0       0       0       0       0       1       0       1       1       1
+```
+
+[bio-table](https://github.com/pjotrp/bioruby-table) has an extensive
+README and many other options, check it out.
+
+** Annotation
+
+To run gemma an annotation file is optional. This file again has no
+header and looks like
+
+```
+rs31443144      3010274 1
+rs6269442       3492195 1
+rs32285189      3511204 1
+```
+
+where columns refer to the SNP name (same as in the genotype file),
+the position and the chromosome number.  If you don't supply this
+information and run gemma with the -debug switch it will complain
+
+```
+**** DEBUG: Can't figure out position for rs31443144 in src/io.cpp at line 683 in ReadFile_geno
+**** DEBUG: Can't figure out position for rs6269442 in src/io.cpp at line 683 in ReadFile_geno
+```
+
+This is harmless until you want to run LOCO.
+
+** Phenotypes
+
+The phenotypes are simply a matrix. Again, without headers. Missing
+values are marked as NA. Save the file so it looks like
+[this](https://github.com/genetics-statistics/GEMMA/blob/master/example/mouse_hs1940.pheno.txt).
+When running GEMMA the column number gets passed in.
+
+** Running Gemma
+
+*** Relatedness/kinship matrix
+
+The first step is to compute a kinship matrix K. With our files we try
+
+```
+gemma -g BXD_mine.geno -p BXD_pheno.csv -gk
+GEMMA 0.97 (2017/12/19) by Xiang Zhou and team (C) 2012-2017
+Reading Files ...
+Segmentation fault
+```
+
+Unfortunately there is an immediate problem. Try running with -debug
+
+```
+gemma -g BXD_mine.geno -p BXD_pheno.csv -gk -debug
+GEMMA 0.97 (2017/12/19) by Xiang Zhou and team (C) 2012-2017
+Reading Files ...
+**** DEBUG: entered in src/io.cpp at line 353 in ReadFile_pheno
+**** DEBUG: entered in src/io.cpp at line 608 in ReadFile_geno
+**** DEBUG: Can't figure out position for rs31443144 in src/io.cpp at line 683 in ReadFile_geno
+strtok failed in ReadFile_geno in src/io.cpp at line 706
+```
+
+and we get a slightly more informative error. There is a parsing
+problem. GEMMA is not very good at data format checking so we need to
+do some checks. First see if the number of individuals matches in the
+phenotype file and genotype file. In this case the phenotype file was
+too large. In still included a header line and the parent phenotypes.
+
+After removing those it was fine.
+
+Gemma contains helpful information, particularly
+
+```sh
+gemma -h 2
+```
+
+where we can select the column in the phenotype file with -n. This is important
+because (at this point) K accounts for missing values:
+
+```sh
+gemma -g BXD_mine.geno -p BXD_pheno.csv -gk -n 2
+```
+
+by default it writes K to output/result.cXX.txt.
+
+Gemma also has a leave one chromosoe out (LOCO) option, but it is best
+to use
+[gemma-wrapper](https://github.com/genetics-statistics/gemma-wrapper)
+for that because gemma-wrapper iterates through all chromosomes. More
+on that below.
+
+Now we have K, let's run an LMM:
+
+** LMM
+
+Running an LMM on the phenotype in column 2 (they are numbered 1,2,...)
+
+```
+gemma -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -k output/result.cXX.txt \
+    -lmm 1 -maf 0.1 -n 2 \
+    -o n2_assoc.txt
+```
+
+gets a result file in output/result.assoc.txt
+
+** LOCO
+
+For LOCO we use gemma-wrapper because it facilitates running through
+the chromosomes. Essentially the command is the same (after -- gets
+passed to gemma):
+
+```
+gemma-wrapper --json \
+    --loco 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X -- \
+    -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -gk \
+    -n 2 \
+    -debug > K.json
+```
+
+Note that the annotation file now is *required* to find the chromosomes. Gemma-wrapper computes
+the K's for every chromosome and the filenames are passed into K.json. A file which can be fed
+to the LMM:
+
+```
+gemma-wrapper --loco --input K.json -- \
+    -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -lmm 1 -maf 0.1 -n 2 \
+    -debug > GWA.json
+```
+
+Inside the resulting GWA.json file there is a list of results. Check
+out the paths and compile the result into one file with, for example,
+
+```
+bio-table /tmp/aa2a1c1a67fe2289d6a23afcc025818402f97521.*.assoc.txt.assoc.txt > LOCO_n2_assoc.txt
+```
+
+** Permutations
+
+To get an idea of what is a significant hit we can also use run gemma
+1000x after shuffling the phenotypes. gemma-wrapper also has an option
+for that. First create K
+
+```
+gemma-wrapper --json -- \
+    -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -gk \
+    -n 2 \
+    -debug > K.json
+```
+
+Now run once with
+
+```
+gemma-wrapper --json --input K.json -- \
+    -g BXD_mine.geno \
+    -p BXD_pheno.csv \
+    -a example/BXD_snps.txt \
+    -lmm 2 -maf 0.1 -n 2 \
+    -debug > GWA.json
+```
+
+GWA.json should point to the result set.
+
+To run the permutations add one option and move the -p option to
+--phenotypes *before* the double dash
+
+```
+gemma-wrapper --permutate 1000 --phenotypes BXD_pheno.csv --input K.json -- \
+    -g BXD_mine.geno \
+    -a example/BXD_snps.txt \
+    -lmm 1 -maf 0.1 -n 2 \
+    -debug > GWA.json
+```
+
+gemma-wrapper prints out the 95 and 67 percentiles where the first may
+be considered the 'significant' threshold and the latter is the
+'suggestive' threshold. For example
+
+```
+["95 percentile (significant) ", 3.382905e-05, 4.5]
+["67 percentile (suggestive)  ", 0.0003651852, 3.4]
+```
+
+where 4.5 is the LOD, i.e., -log10(3.382905e-05)
+
+** Annotate
+
+Once there is a list of associations out of GEMMA you may use the gemma-annotate
+tool as described [here](https://github.com/pjotrp/gemma-annotate).
author	Pjotr Prins	2018-05-02 11:23:45 +0000
committer	Pjotr Prins	2018-05-02 11:23:45 +0000
commit	cee92876e456299e14afd51e0b719384c48c5286 (patch)
tree	f96b864fa3668618408a559e86c7030257279e78
parent	135c977cf25a688846c0f8180720c7880a90a7dd (diff)
download	pangemma-cee92876e456299e14afd51e0b719384c48c5286.tar.gz