# R/qtl2 and GEMMA Format Notes

This document is meant to help other non-biologists find their way around the format(s) of the R/qtl2 files. It deals mainly with the meaning and significance of the various fields.

From the R/qtl2 format documentation:

> The comma-delimited (CSV) files are each in the form of a simple matrix, with the first column being a set of IDs and the first row being a set of variable names.

and

> All of these CSV files may be transposed relative to the form described below.

We are going to consider the "non-transposed" form here, for ease of documentation: simply flip the meanings as appropriate for the transposed files.
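
As a rough sketch of what "flipping" means, a transposed file can be turned back into the non-transposed form by swapping rows and columns. The helper name transpose_csv and the tiny sample data are made up for illustration, and the code assumes every row has the same number of fields.

```python
import csv
import io


def transpose_csv(text):
    """Swap rows and columns of a small CSV given as a string."""
    rows = list(csv.reader(io.StringIO(text)))
    # zip(*rows) groups the n-th field of every row, i.e. it yields columns
    flipped = zip(*rows)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(flipped)
    return out.getvalue()


# Hypothetical transposed geno file: markers in rows, individuals in columns
transposed = "id,ind1,ind2\nmarker1,A,B\nmarker2,B,B\n"
print(transpose_csv(transposed))
```

After the flip, the individuals are in the first column and the marker names in the first row, which is the layout described in the rest of this document.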

To convert between formats we should probably use Python, as that is what we can use as an 'esperanto'.

## Control files

Both GN (GeneNetwork) and R/qtl2 have control files. For GN, the control file basically describes the individuals (genometypes) and looks like:

```js
{
        "mat": "C57BL/6J",
        "pat": "DBA/2J",
        "f1s": ["B6D2F1", "D2B6F1"],
        "genofile" : [{
                "title" : "WGS-based (Mar2022)",
                "location" : "BXD.8.geno",
                "sample_list" : ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18", "BXD19", "BXD20", "BXD21", "BXD22", "BXD23", "BXD24", "BXD24a", "BXD25", "BXD27", "BXD28", "BXD29", "BXD30", "BXD31", "BXD32", "BXD33", "BXD34", "BXD35", "BXD36", "BXD37", "BXD38", "BXD39", "BXD40", "BXD41", "BXD42", "BXD43", "BXD44",
 ...]}]}
```

In gn-guile this gets parsed in gn/data/genotype.scm to fetch the individuals that match the genotype and phenotype layouts.
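
For comparison, a minimal Python sketch of reading such a control file could look like the following. This is not what gn-guile does internally; the sample list is cut down to three entries for illustration, and the variable names are made up.

```python
import json

# Shortened copy of the control file shown above (sample list truncated
# to three entries for illustration)
control_text = """
{
    "mat": "C57BL/6J",
    "pat": "DBA/2J",
    "f1s": ["B6D2F1", "D2B6F1"],
    "genofile": [{
        "title": "WGS-based (Mar2022)",
        "location": "BXD.8.geno",
        "sample_list": ["BXD1", "BXD2", "BXD5"]
    }]
}
"""

control = json.loads(control_text)

# Each entry in "genofile" names a genotype file and the samples it covers
for geno in control["genofile"]:
    print(geno["location"], geno["sample_list"])
```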

## pheno files and phenotypes

The standard GEMMA input files are not very good for troubleshooting. R/qtl2 at least has the individual or genometype ID on every line:

```
id,bolting_days,seed_weight,seed_area,ttl_seedspfruit,branches,height,pc_seeds_aborted,fruit_length
MAGIC.1,15.33,17.15,0.64,45.11,10.5,NA,0,14.95
MAGIC.2,22,22.71,0.75,49.11,4.33,42.33,1.09,13.27
MAGIC.3,23,21.03,0.68,57,4.67,50,0,13.9
```

This is a good standard, and the IDs can be matched against the control files.
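
A small sketch of how the per-line IDs help: the pheno file can be read into a dictionary keyed by individual and then checked against a sample list. The data is a three-column subset of the example above, the function name read_pheno and the wanted list are hypothetical, and the code assumes that NA is the only missing-value code (real R/qtl2 data sets declare theirs in the control file).

```python
import csv
import io

# Subset of the pheno example above
pheno_text = """id,bolting_days,seed_weight,height
MAGIC.1,15.33,17.15,NA
MAGIC.2,22,22.71,42.33
MAGIC.3,23,21.03,50
"""


def read_pheno(text):
    """Return {individual id: {phenotype name: float or None}}."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        ind = row.pop("id")
        # Assumption: "NA" marks a missing value
        result[ind] = {k: (None if v == "NA" else float(v)) for k, v in row.items()}
    return result


pheno = read_pheno(pheno_text)

# Hypothetical sample list, e.g. taken from a control file
wanted = ["MAGIC.1", "MAGIC.3", "MAGIC.9"]
missing = [s for s in wanted if s not in pheno]
print("samples without phenotypes:", missing)
```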

## geno files

> The genotype data file is a matrix of individuals × markers. The first column is the individual IDs; the first row is the marker names.

For GeneNetwork, this means that the first column contains the Sample names (previously "strain names"). The first row would be a list of markers.
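
To make the layout concrete, here is a sketch that splits a miniature geno file into sample names, marker names and the genotype matrix. The file contents, including the genotype codes B and D, are made up for illustration.

```python
import csv
import io

# Made-up miniature geno file: samples in rows, markers in columns
geno_text = """id,marker1,marker2,marker3
BXD1,B,B,D
BXD2,D,B,D
"""

rows = list(csv.reader(io.StringIO(geno_text)))
markers = rows[0][1:]               # first row, minus the corner cell
samples = [r[0] for r in rows[1:]]  # first column, minus the corner cell
matrix = [r[1:] for r in rows[1:]]  # genotype calls, samples x markers

# Look up one genotype call by sample name and marker name
call = matrix[samples.index("BXD2")][markers.index("marker1")]
print(call)
```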

## gmap and pmap files

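The first column of the gmap/pmap file contains the marker names; the remaining columns give each marker's chromosome and its position (genetic position for gmap, physical position for pmap). There are no individuals/samples (or strains) here.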
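
As a sketch, a map file can be grouped by chromosome as shown below. It assumes the three-column layout (marker, chr, pos) from the R/qtl2 input file documentation; the marker names and positions are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up miniature gmap file
gmap_text = """marker,chr,pos
marker1,1,3.0
marker2,1,10.5
marker3,X,1.2
"""

# Chromosome names stay strings, because "X" is not a number
by_chr = defaultdict(list)
for row in csv.DictReader(io.StringIO(gmap_text)):
    by_chr[row["chr"]].append((row["marker"], float(row["pos"])))

for chrom, entries in by_chr.items():
    print(chrom, entries)
```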

## phenocovar files

These seem to contain extra metadata for the phenotypes.

The first column is the list of phenotype identifiers, whereas the first row is a list of metadata headers (phenotype covariates).

As an example,
=> https://github.com/rqtl/qtl2data/blob/main/BXD/bxd_phenocovar.csv The phenocovar file for BXD mice

We see here that this contains the phenotype identifier (id) and a description for each phenotype.
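
Assuming the first column holds phenotype identifiers that match the column names of the pheno file, the metadata can be looked up per phenotype as in this sketch. The column names id and description, as well as the descriptions themselves, are assumptions made for illustration.

```python
import csv
import io

# Made-up phenocovar file: one row per phenotype
phenocovar_text = """id,description
bolting_days,Days to bolting (made-up description)
seed_weight,Seed weight (made-up description)
"""

# Header of a pheno file; every column after "id" is a phenotype
pheno_header = ["id", "bolting_days", "seed_weight", "height"]

descriptions = {
    row["id"]: row["description"]
    for row in csv.DictReader(io.StringIO(phenocovar_text))
}

for name in pheno_header[1:]:
    print(name, "->", descriptions.get(name, "(no metadata)"))
```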

# References

=> https://kbroman.org/qtl2/assets/vignettes/input_files.html
=> https://github.com/rqtl/qtl2data