aboutsummaryrefslogtreecommitdiff
path: root/doc/example/data-munging.org
blob: 62dfb24238033bcd8fa9c6992524cb72080cae0b (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
* Preparing Data for GEMMA

GEMMA requires data to be presented in a certain way to run
successfully. In this document we describe how you can prepare
genotype and phenotype data so it passes through GEMMA correctly.

Note that GEMMA uses spaces, tabs and comma's as valid field
separators. Here we use tabs.

** Genotypes

In this example we have a genotype spreadsheet containing something like this

```
@type:riset
@mat:B
@pat:D
@het:H
@unk:U
Chr     Locus   cM      Mb      BXD1    BXD2    BXD5    BXD6    BXD8    BXD9    BXD11   ...
1       rs31443144      1.50    3.010274        B       B       D       D       D       ...
1       rs6269442       1.50    3.492195        B       B       D       D       D       ...
(...)
```

GEMMA, instead, requires a BIMBAM 'mean genotyped' formatted file
which looks like

```
rs31443144, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
rs6269442, X, Y, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, ...
(...)
```

where the first three positions refer to SNP id, minor allele, major
allele (both unused) and the mean genotypes where B has the value 1, D
has the value 0, H has the value 0.5 and U is marked as NA. Note there
is no header line.

At this point you may decide to use a spreadsheet or editor to modify
these by hand. The problem is that such a procedure is error prone and
hard to reproduce. Always choose a programming method if you can! For
further information check out these
[suggestions](http://kbroman.org/steps2rr/pages/scripts.html).

For most people using R or Python would be the best way to convert the
data using a small script. We will add those instructions later, but
here I am going to use a few Unix tools and the
[bio-table](https://github.com/pjotrp/bioruby-table) tool to manage
this conversion from the command line. First, make sure Ruby is
available

```sh
ruby -v
ruby 2.4.0p0 (2016-12-24 revision 57164) [x86_64-linux]
```

and install bioruby-table

```sh
gem install bio-table
```

On success you should be able to run

```
bio-table
```

In the first step we need to strip the header lines. This can be done with --skip, e.g.

```sh
bio-table --skip 20 BXD.geno > BXD_noheader.geno
```

We'll keep the header line for some column processing.  Next we drop
the first column and keep the rest with --columns. For this exercise
we are only interested in a subset of BXDs, so

```sh
bio-table --columns Locus,BXD100,BXD101,BXD102,BXD24,BXD32,BXD34,BXD40,BXD43,BXD44,BXD45,BXD48,BXD48a,BXD49,BXD50,BXD51,BXD55,BXD60,BXD61,BXD62,BXD63,BXD65,BXD65b,BXD66,BXD68,BXD69,BXD70,BXD71,BXD73,BXD73a,BXD73b,BXD75,BXD77,BXD78,BXD79,BXD83,BXD86,BXD87,BXD90,BXD98,C57BL/6J,DBA/2J BXD_noheader.geno > BXD_cols.geno
```

selects those columns by name. If a name is a mis-match bio-table will
balk. In this case the parents C57BL/6J,DBA/2J are missing and had to
be taken out.

It is also possible to use regex's to select the columns (with
--column-filter). Note that bio-table takes tab delimited files by
default, to use a different delimiter use something like --in-format
split --split-on ',' or a regex. Next we inject columns 1 and 2 by

```sh
bio-table --rewrite 'field[0] = "X\tY\t" + field[0]' BXD_cols.geno > BXD_cols2.geno
```

Next rewrite the field contents with something like

```sh
bio-table BXD_cols2.geno --rewrite 'fields = fields.map { |f| h={"X"=>"X", "Y"=>"Y", "D"=> 0, "B"=>1, "H"=>0.5} ; h[f] }' > BXD_mine.geno
```

And the resulting BIMBAM format looks like

```
rs31443144      X       Y       B       B       B       B       B       B       B       B       D       B       B       B       D       D
       B       B       B       D       D       D       D       D       B       D       B       B       B       D       D       D       D       D       D       D       B       D       B       B       B
rs6269442       X       Y       1       1       1       1       1       1       1       1       0       1       1       1       0       0
       1       1       1       0       0       0       0       0       1       0       1       1       1       0       0       0       0       0       0       0       1       0       1       1       1
```

[bio-table](https://github.com/pjotrp/bioruby-table) has an extensive
README and many other options, check it out.

** Annotation

To run gemma an annotation file is optional. This file again has no
header and looks like

```
rs31443144      3010274 1
rs6269442       3492195 1
rs32285189      3511204 1
```

where columns refer to the SNP name (same as in the genotype file),
the position and the chromosome number.  If you don't supply this
information and run gemma with the -debug switch it will complain

```
**** DEBUG: Can't figure out position for rs31443144 in src/io.cpp at line 683 in ReadFile_geno
**** DEBUG: Can't figure out position for rs6269442 in src/io.cpp at line 683 in ReadFile_geno
```

This is harmless until you want to run LOCO.

** Phenotypes

The phenotypes are simply a matrix. Again, without headers. Missing
values are marked as NA. Save the file so it looks like
[this](https://github.com/genetics-statistics/GEMMA/blob/master/example/mouse_hs1940.pheno.txt).
When running GEMMA the column number gets passed in.

** Running Gemma

*** Relatedness/kinship matrix

The first step is to compute a kinship matrix K. With our files we try

```
gemma -g BXD_mine.geno -p BXD_pheno.csv -gk
GEMMA 0.97 (2017/12/19) by Xiang Zhou and team (C) 2012-2017
Reading Files ...
Segmentation fault
```

Unfortunately there is an immediate problem. Try running with -debug

```
gemma -g BXD_mine.geno -p BXD_pheno.csv -gk -debug
GEMMA 0.97 (2017/12/19) by Xiang Zhou and team (C) 2012-2017
Reading Files ...
**** DEBUG: entered in src/io.cpp at line 353 in ReadFile_pheno
**** DEBUG: entered in src/io.cpp at line 608 in ReadFile_geno
**** DEBUG: Can't figure out position for rs31443144 in src/io.cpp at line 683 in ReadFile_geno
strtok failed in ReadFile_geno in src/io.cpp at line 706
```

and we get a slightly more informative error. There is a parsing
problem. GEMMA is not very good at data format checking so we need to
do some checks. First see if the number of individuals matches in the
phenotype file and genotype file. In this case the phenotype file was
too large. In still included a header line and the parent phenotypes.

After removing those it was fine.

Gemma contains helpful information, particularly

```sh
gemma -h 2
```

where we can select the column in the phenotype file with -n. This is important
because (at this point) K accounts for missing values:

```sh
gemma -g BXD_mine.geno -p BXD_pheno.csv -gk -n 2
```

by default it writes K to output/result.cXX.txt.

Gemma also has a leave one chromosoe out (LOCO) option, but it is best
to use
[gemma-wrapper](https://github.com/genetics-statistics/gemma-wrapper)
for that because gemma-wrapper iterates through all chromosomes. More
on that below.

Now we have K, let's run an LMM:

** LMM

Running an LMM on the phenotype in column 2 (they are numbered 1,2,...)

```
gemma -g BXD_mine.geno \
    -p BXD_pheno.csv \
    -a example/BXD_snps.txt \
    -k output/result.cXX.txt \
    -lmm 1 -maf 0.1 -n 2 \
    -o n2_assoc.txt
```

gets a result file in output/result.assoc.txt

** LOCO

For LOCO we use gemma-wrapper because it facilitates running through
the chromosomes. Essentially the command is the same (after -- gets
passed to gemma):

```
gemma-wrapper --json \
    --loco 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X -- \
    -g BXD_mine.geno \
    -p BXD_pheno.csv \
    -a example/BXD_snps.txt \
    -gk \
    -n 2 \
    -debug > K.json
```

Note that the annotation file now is *required* to find the chromosomes. Gemma-wrapper computes
the K's for every chromosome and the filenames are passed into K.json. A file which can be fed
to the LMM:

```
gemma-wrapper --loco --input K.json -- \
    -g BXD_mine.geno \
    -p BXD_pheno.csv \
    -a example/BXD_snps.txt \
    -lmm 1 -maf 0.1 -n 2 \
    -debug > GWA.json
```

Inside the resulting GWA.json file there is a list of results. Check
out the paths and compile the result into one file with, for example,

```
bio-table /tmp/aa2a1c1a67fe2289d6a23afcc025818402f97521.*.assoc.txt.assoc.txt > LOCO_n2_assoc.txt
```

** Permutations

To get an idea of what is a significant hit we can also use run gemma
1000x after shuffling the phenotypes. gemma-wrapper also has an option
for that. First create K

```
gemma-wrapper --json -- \
    -g BXD_mine.geno \
    -p BXD_pheno.csv \
    -a example/BXD_snps.txt \
    -gk \
    -n 2 \
    -debug > K.json
```

Now run once with

```
gemma-wrapper --json --input K.json -- \
    -g BXD_mine.geno \
    -p BXD_pheno.csv \
    -a example/BXD_snps.txt \
    -lmm 2 -maf 0.1 -n 2 \
    -debug > GWA.json
```

GWA.json should point to the result set.

To run the permutations add one option and move the -p option to
--phenotypes *before* the double dash

```
gemma-wrapper --permutate 1000 --phenotypes BXD_pheno.csv --input K.json -- \
    -g BXD_mine.geno \
    -a example/BXD_snps.txt \
    -lmm 1 -maf 0.1 -n 2 \
    -debug > GWA.json
```

gemma-wrapper prints out the 95 and 67 percentiles where the first may
be considered the 'significant' threshold and the latter is the
'suggestive' threshold. For example

```
["95 percentile (significant) ", 3.382905e-05, 4.5]
["67 percentile (suggestive)  ", 0.0003651852, 3.4]
```

where 4.5 is the LOD, i.e., -log10(3.382905e-05)

** Annotate

Once there is a list of associations out of GEMMA you may use the gemma-annotate
tool as described [here](https://github.com/pjotrp/gemma-annotate).