# GeneNetwork upload API

## Upload API

The REST API accepts a gzipped tarball that contains multiple
files:

1. A metadata file (JSON)
2. A phenotype file

On upload, the API unpacks the data into files in a temporary
directory. Next we compute an MD5 sum over the directory and rename
the directory to the hash value. So, after unpacking, we first have

```
GENENETWORK_UPLOAD_DIR/tempdir/metadata.json
                              /phenotypes.tsv
```

The upload directory is set in the GN2
[config](https://github.com/genenetwork/genenetwork2/blob/testing/etc/default_settings.py). After
computing the hash we rename the temporary directory to

```
GENENETWORK_UPLOAD_DIR/e524ee7ea9b1f452c58abe560960a60f/metadata.json
                                                       /phenotypes.tsv
```
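
The unpack, hash and rename steps could look roughly like the sketch below. This document does not say how the MD5 sum over a directory is computed, so hashing the sorted relative file names together with the file contents is an assumption, as are the function names `md5_of_directory` and `unpack_upload`.

```python
import hashlib
import os
import shutil
import tarfile
import tempfile


def md5_of_directory(directory):
    """Return one MD5 hex digest over file names and contents (assumed scheme)."""
    digest = hashlib.md5()
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # visit subdirectories in a deterministic order
        for name in sorted(files):
            path = os.path.join(root, name)
            # Include the relative path so that renaming a file changes the hash
            digest.update(os.path.relpath(path, directory).encode("utf-8"))
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()


def unpack_upload(tarball_path, upload_dir):
    """Unpack a gzipped tarball into upload_dir/<md5> and return the hash."""
    tmpdir = tempfile.mkdtemp(dir=upload_dir)
    with tarfile.open(tarball_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Refuse absolute paths and parent references inside the archive
            if os.path.isabs(member.name) or ".." in member.name.split("/"):
                shutil.rmtree(tmpdir)
                raise ValueError("unsafe path in archive: " + member.name)
        archive.extractall(tmpdir)
    token = md5_of_directory(tmpdir)
    target = os.path.join(upload_dir, token)
    if os.path.exists(target):
        shutil.rmtree(tmpdir)  # identical data was uploaded before
    else:
        os.rename(tmpdir, target)
    return token
```

Because only names and contents are hashed, uploading the same files twice yields the same directory name under this scheme.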

On success the upload REST API returns this HASH to the invoker:

```
{
  "status": 0,
  "token": "e524ee7ea9b1f452c58abe560960a60f"
}
```

On error the result should include the error output:

```
{
  "status": 128,
  "error": "gzip failed to unpack file"
}
}
```
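
On the invoker side, the two reply shapes could be handled as in the following sketch. It assumes that a `status` of 0 means success and any other value means failure, which is inferred from the two examples only; `UploadError` and `token_from_response` are hypothetical names.

```python
import json


class UploadError(Exception):
    """Raised when the upload reply reports a non-zero status."""


def token_from_response(body):
    """Return the token from a reply body, or raise UploadError."""
    reply = json.loads(body)
    if reply.get("status") != 0:
        # Assumption: every non-zero status is an error
        raise UploadError(reply.get("error", "unknown error"))
    return reply["token"]
```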

## Metadata

For metadata we will follow the R/qtl2 input
[format](https://kbroman.org/qtl2/assets/vignettes/input_files.html).
See the examples at https://kbroman.org/qtl2/pages/sampledata.html.
One example is
[here](https://github.com/kbroman/qtl2/blob/gh-pages/assets/sampledata/iron/iron.yaml). We
will not be using YAML though! Use JSON instead.


The files that are uploaded are listed in the metadata.

The metadata file is a simple JSON file containing:

```js
{
  "title": "This is my dataset for testing the REST API",
  "description": "Longer description",
  "date": "20210127",
  "authors": [
    "R.W. Williams"
  ],
  "crosstype": "BXD",
  "geno": "iron_geno.csv",
  "pheno": "iron_pheno.csv"
}
```
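
A minimal sketch of reading such a file follows. Note that Python's `json` module is strict: a missing or trailing comma makes the whole file invalid. Which keys are mandatory is not defined in this document, so `REQUIRED_KEYS` is a hypothetical choice, as are the function names.

```python
import json

# Hypothetical choice of required keys; the document does not define any.
REQUIRED_KEYS = ("title", "crosstype", "pheno")
# Keys whose values name data files, taken from the example metadata.
FILE_KEYS = ("geno", "pheno")


def load_metadata(text):
    """Parse metadata JSON and check the assumed required keys."""
    meta = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(meta, dict):
        raise ValueError("metadata must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in meta]
    if missing:
        raise ValueError("metadata is missing: " + ", ".join(missing))
    return meta


def listed_files(meta):
    """Return the data files named in the metadata, keyed by kind."""
    return {key: meta[key] for key in FILE_KEYS if key in meta}
```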

## File lookup (resolving)

Files can be in three places:

1. On the internet (URL) - fixme later
2. Uploaded in the local dir (file, but no path)
3. In the database (identifier) - fixme later

For genotype files a file name is passed in. We first look in the
upload directory. If the file is not there we look in the
genotype_files directory of GeneNetwork.

For phenotype files we currently only look in the upload directory.
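
The lookup order for local files might be sketched as below. The function names are hypothetical, `upload_dir` stands for the hash-named directory of one upload, and `genotype_dir` stands for wherever the GeneNetwork genotype_files directory is configured.

```python
import os


def _find(name, directories):
    """Return the first existing path for a bare file name."""
    if os.path.basename(name) != name:
        # Case 2 in the list above: a file name, but no path
        raise ValueError("expected a file name without a path: " + name)
    for directory in directories:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(name)


def resolve_genotype_file(name, upload_dir, genotype_dir):
    """Look in the upload directory first, then in genotype_files."""
    return _find(name, (upload_dir, genotype_dir))


def resolve_phenotype_file(name, upload_dir):
    """Phenotype files are only looked up in the upload directory."""
    return _find(name, (upload_dir,))
```

Rejecting names that contain a path keeps a lookup from escaping the two known directories.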

## Phenotype file

The phenotype file is a tab-delimited 'spreadsheet' where the columns
contain phenotypes and the rows contain individuals. Example:

```
      pheno
BXD01 5.060
BXD02 307.866
BXD03 185.400
BXD04 380.729
BXD05 150.066
BXD06 94.483
BXD07 438.700
BXD08 NA
BXD09 130.457
BXD10 184.900
BXD11 223.400
BXD12 167.250
BXD13 313.950
BXD14 219.383
BXD15 277.800
BXD16 6.467
BXD17 364.967
BXD18 132.016
BXD19 468.133
BXD20 309.500
```

Missing values are coded as 'NA'. Multiple pheno columns are possible.
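
A reader for this layout could look like the sketch below. It assumes the header row starts with an empty cell above the individual names, so the phenotype names begin in the second column; `read_phenotypes` is a hypothetical name.

```python
import csv
import io


def read_phenotypes(handle):
    """Read a tab-delimited phenotype table; 'NA' becomes None."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    names = header[1:]  # assumed: the first header cell is empty
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        individual, values = row[0], row[1:]
        table[individual] = {
            name: (None if value == "NA" else float(value))
            for name, value in zip(names, values)
        }
    return names, table


example = "\tpheno\nBXD01\t5.060\nBXD08\tNA\n"
names, table = read_phenotypes(io.StringIO(example))
```

With more than one pheno column, each individual simply gets one entry per column name.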

## Running GEMMA

Using the hash received earlier we can run GEMMA against the uploaded
data. Example: `/api/gemma/lmm2/e524ee7ea9b1f452c58abe560960a60f`
should do the trick and return a status or error. The resulting output
files should be fetchable with something like:

`/api/gemma/lmm2/loco/e524ee7ea9b1f452c58abe560960a60f/results.log.txt`
and
`/api/gemma/lmm2/loco/e524ee7ea9b1f452c58abe560960a60f/results.assoc.txt`
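
A client could build these paths from the token as sketched here. The path layout is copied from the examples above; checking that the token is 32 lowercase hexadecimal characters is an added assumption based on the MD5 example, and the function names are hypothetical.

```python
import re


def _check_token(token):
    """Accept only a 32-character lowercase hex string (MD5 hex digest)."""
    if not re.fullmatch(r"[0-9a-f]{32}", token):
        raise ValueError("not an MD5 hex digest: " + token)
    return token


def gemma_run_path(token):
    """Path that starts an lmm2 run on the uploaded data."""
    return "/api/gemma/lmm2/" + _check_token(token)


def gemma_result_path(token, filename):
    """Path of one output file, following the loco examples above."""
    return "/api/gemma/lmm2/loco/" + _check_token(token) + "/" + filename
```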