# Speed Up QC on R/qtl2 Bundles

## Tags

## Description

The default format for the CSV files in a R/qtl2 bundle is:

```
matrix of individuals × (markers/phenotypes/covariates/phenotype covariates/etc.)
```

One or more files in the R/qtl2 bundle could, however, be transposed:
=> https://kbroman.org/qtl2/assets/vignettes/input_files.html#csv-files R/qtl2 input file formats: CSV files
This means the system needs to "un-transpose" such files before processing.
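
As a rough illustration of what "un-transposing" a single file involves, here is a minimal sketch. The function names are hypothetical and not taken from the uploader's code, and it assumes a rectangular CSV with no comment lines.

```python
import csv
import io


def untranspose(rows):
    """Swap rows and columns of a rectangular table (a list of lists)."""
    # zip(*rows) stops at the shortest row, so ragged input would be
    # silently truncated; a real check should reject ragged files first.
    return [list(column) for column in zip(*rows)]


def read_untransposed(stream, transposed):
    """Read one CSV stream, flipping it only when it is marked transposed."""
    rows = [row for row in csv.reader(stream) if row]
    return untranspose(rows) if transposed else rows


# A transposed genotype file: markers as rows, individuals as columns.
sample = io.StringIO("id,ind1,ind2\nmk1,A,B\nmk2,H,A\n")
table = read_untransposed(sample, transposed=True)
# table now has one row per individual and one column per marker.
```

Because the function only needs one file at a time, the flip does not require loading every file of the same type first.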

Currently, the system reads all the files of a particular type and then "un-transposes" the combined data in one go, which makes processing very slow.

This issue proposes running the quality control/assurance (QC) processing on each file in isolation, where possible. This will allow parallelisation/multiprocessing of the QC checks.
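
A minimal sketch of per-file QC, assuming each file can be checked without reference to the others. `check_file` is a hypothetical placeholder and not an actual uploader check. The executor is passed in so that the caller decides between threads and processes.

```python
from concurrent.futures import ThreadPoolExecutor


def check_file(path):
    """Hypothetical per-file check: return (path, list of error messages)."""
    errors = []
    if not path.endswith(".csv"):
        errors.append("not a CSV file")
    return path, errors


def qc_files(paths, executor):
    """Run check_file on every path independently and collect the results."""
    return dict(executor.map(check_file, paths))


with ThreadPoolExecutor(max_workers=4) as pool:
    results = qc_files(["geno1.csv", "geno2.csv", "notes.txt"], pool)
```

For CPU-bound checks, a `concurrent.futures.ProcessPoolExecutor` could be passed instead, provided the check function is defined at the top level of an importable module so that it can be pickled.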

The main considerations that need to be handled are as follows:

* Do QC on (founder) genotype files (when present) before any of the other files
* Genetic and physical maps (if present) can have QC run on them after the genotype files
* Do QC on phenotype files (when present) after genotype files but before any other files
* Covariate and phenotype covariate files come after the phenotype files
* Cross information files … ?
* Sex information files … ?
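
The ordering above could be sketched as a list of stages, where the files within a stage are independent of one another and could be checked in parallel. The stage layout is an assumption: in particular, placing the map files in the same stage as the phenotype files is only one reading of the list, and the cross information and sex information files are left out because their place is still undecided. The keys follow the R/qtl2 control file names; `plan_qc` is a hypothetical helper.

```python
# Assumed ordering of file types, using R/qtl2 control-file keys.
# "cross_info" and "sex" are omitted: their position is still undecided.
QC_STAGES = [
    ("founder_geno", "geno"),
    ("gmap", "pmap", "pheno"),
    ("covar", "phenocovar"),
]


def plan_qc(bundle_files):
    """Group the files present in a bundle into ordered QC stages.

    bundle_files maps a file type to the list of file paths of that type.
    Stages with no files are dropped from the plan.
    """
    plan = []
    for stage in QC_STAGES:
        paths = [p for key in stage for p in bundle_files.get(key, [])]
        if paths:
            plan.append(paths)
    return plan


plan = plan_qc({
    "pheno": ["pheno.csv"],
    "geno": ["geno1.csv", "geno2.csv"],
    "covar": ["covar.csv"],
})
```

Each inner list of the plan would be finished before the next one starts.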

We should probably detail the types of QC checks done for each type of file.