# Pangenotypes

Here we discuss different storage solutions for pangenotypes.

## GRG format


While looking into graph genotyping I ran into Genotype Representation Graphs (GRG):

=> https://pmc.ncbi.nlm.nih.gov/articles/PMC11071416/

It has a binary storage format. Written out as text, the graph it represents would look something like:

```
# GRG file example: genotype graph
# Nodes section: NODE <id> <label> allele=<genotype>
NODE 1 GeneA allele=AA
NODE 2 GeneB allele=AG
NODE 3 GeneC allele=GG
NODE 4 GeneD allele=AA
NODE 5 GeneE allele=AG

# Edges section: EDGE <from_id> <to_id>
EDGE 1 2
EDGE 1 3
EDGE 2 4
EDGE 3 4
EDGE 4 5
EDGE 5 1
```
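
To make the structure concrete, here is a minimal Python sketch that reads the text illustration above into plain dictionaries. Note the assumptions: the real GRG format is binary, so this only parses the NODE/EDGE text shown here, and the function name parse_graph is made up for this example.

```python
from collections import defaultdict

# The illustrative text from above (not the real binary GRG format).
EXAMPLE = """\
# Nodes section: NODE <id> <label> allele=<genotype>
NODE 1 GeneA allele=AA
NODE 2 GeneB allele=AG
NODE 3 GeneC allele=GG
NODE 4 GeneD allele=AA
NODE 5 GeneE allele=AG

# Edges section: EDGE <from_id> <to_id>
EDGE 1 2
EDGE 1 3
EDGE 2 4
EDGE 3 4
EDGE 4 5
EDGE 5 1
"""


def parse_graph(text):
    """Parse NODE/EDGE lines into (nodes, edges) dictionaries.

    nodes maps id -> (label, allele); edges maps from_id -> [to_id, ...].
    Hypothetical helper, written only for the text illustration.
    """
    nodes = {}
    edges = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "NODE" and len(fields) == 4:
            key, _, value = fields[3].partition("=")
            if key != "allele":
                raise ValueError(f"unexpected attribute: {line}")
            nodes[int(fields[1])] = (fields[2], value)
        elif fields[0] == "EDGE" and len(fields) == 3:
            edges[int(fields[1])].append(int(fields[2]))
        else:
            raise ValueError(f"unknown record: {line}")
    return nodes, dict(edges)


nodes, edges = parse_graph(EXAMPLE)
```

Adjacency lists keyed by node id keep the sketch short. A real in-memory structure would likely want compact integer arrays instead.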

The tooling

=> https://github.com/aprilweilab/grgl.git

builds inside the following Guix shell:

```
guix shell -C -N coreutils gcc-toolchain make cmake openssl nss-certs git pkg-config zlib
```

I did some tests and read the source code. The nice thing is that their ideas are very similar to ours. Unfortunately the implementation is not what we want. I wonder why people always reinvent data structures :/. To get an idea, see the serialization code:

=> https://github.com/aprilweilab/grgl/blob/main/src/serialize.cpp

I would like to take similar ideas and build them into an efficient in-memory graph structure that is easily extensible. RDF is key for extensions (and queries). A fast RDF implementation we are going to try is pyoxigraph:

=> https://pyoxigraph.readthedocs.io/en/stable/index.html

Toshiaki pointed out we should look at QLever instead:

=> https://github.com/ad-freiburg/qlever
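
Whichever store we end up with, the graph has to be expressed as RDF triples first. Below is a rough Python sketch that writes a few nodes and edges from the GRG text example as N-Triples, a line-based RDF serialization that is also valid Turtle. Everything here is an assumption for illustration: the example.org namespace, the predicate names (label, allele, edge) and the function names are invented, not an agreed vocabulary.

```python
# Hypothetical namespace, only for illustration.
BASE = "http://example.org/pangenotype/"


def iri(local):
    """Wrap a local name in the (made-up) base namespace."""
    return f"<{BASE}{local}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(nodes, edges):
    """Serialize nodes {id: (label, allele)} and edges {id: [id, ...]}."""
    lines = []
    for node_id, (label, allele) in sorted(nodes.items()):
        subject = iri(f"node/{node_id}")
        lines.append(f"{subject} {iri('label')} {literal(label)} .")
        lines.append(f"{subject} {iri('allele')} {literal(allele)} .")
    for src, targets in sorted(edges.items()):
        for dst in targets:
            source = iri(f"node/{src}")
            target = iri(f"node/{dst}")
            lines.append(f"{source} {iri('edge')} {target} .")
    return "\n".join(lines) + "\n"


# Two nodes and one edge taken from the GRG text example.
sample_nodes = {1: ("GeneA", "AA"), 2: ("GeneB", "AG")}
sample_edges = {1: [2]}
ntriples = to_ntriples(sample_nodes, sample_edges)
print(ntriples, end="")
```

The resulting text can be saved to a file and loaded into an RDF store. New annotations then become extra predicates on the same node IRIs, which is the kind of extensibility we are after.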