# Test pangenome derived genotypes
Here we follow up on the work we did on precompute PublishData:
=> ../systems/mariadb/precompute-publishdata
But now run against pangenome derived genotypes.
For the BXD we have 23M markers(!), of which 8M are *not* on the reference genome.
# Tasks
* [ ] Document lmdb geno and marker information
* [ ] Extract epoch information
* [ ] Add BED file and link SNPS
* [ ] Check MAF filter - it may be too stringent
* [ ] Use ravanan/CWL to push to Octopus
* [ ] Reintroduce nodes that were not annotated for position (Flavia)
* [ ] GWA plotter
* [ ] Speed up IO for GEMMA by using lmdb for genotypes and marker file
* [ ] Use 1.5LOD score to compute QTLs instead of using 50M distance
* [ ] Reduce GEMMA GRM RAM requirements (not urgent)
* [ ] Fix -lmm 4 ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt
# Summary
To get the mapping and generate the assoc output in mdb format we run a variant of gemma-wrapper.
The workflow essentially is:
* capture the significant markers from GEMMA's mdb output (as created by gemma-wrapper)
* transform these into RDF using the 'gemma-mdb-to-rdf.rb' script
* upload that RDF into virtuoso
* download a table of start-stop data from virtuoso using SPARQL
* compute QTL locations using 'sparql-qtl-detect.rb'
* upload that RDF into virtuoso as well
For mapping, virtuoso contains four important TTL files, each loaded into its own graph:
* marker positions in pangenome-marker graph
* mapped markers in pangenome-mapped graph
* computed QTL positions in pangenome-qtl graph
* trait values in traits graph (not yet implemented)
The batch mapping itself is run with:
```
gemma-batch-run.sh
```
Next we convert that output to RDF with
```
../bin/gemma-mdb-to-rdf.rb --header > output.ttl
time ../bin/gemma-mdb-to-rdf.rb --anno snps-matched.txt.mdb tmp/panlmm/*-gemma-GWA.tar.xz >> output.ttl # two hours for 7000 traits
time serdi -i turtle -o ntriples output.ttl > output.n3
```
(Note that n3 files are less error-prone and serdi handles huge files better than rapper.) Then copy the file to the virtuoso instance and load it with isql (it may be worth search-replacing the gnt:run tag to something descriptive).
```
cd /export/guix-containers/virtuoso/data/virtuoso/ttl/
guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
SQL> ld_dir('/export/data/virtuoso/ttl','test-run-3000.n3','http://pan-test.genenetwork.org');
Done. -- 3 msec.
# check validity and optionally delete problematic entries:
SQL> SELECT * FROM DB.DBA.load_list;
SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
SQL> DELETE from DB.DBA.LOAD_LIST where LL_STATE = 1;
# commit changes
SQL> rdf_loader_run (); // about 1 min per GB n3
SQL> checkpoint;
Done. -- 16 msec.
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
34200686
```
Note it may be a good idea to drop graphs first. That is why we have separate subgraph spaces for every large TTL file:
```
log_enable(3,1);
SQL> SPARQL CLEAR GRAPH <http://pan-test.genenetwork.org>;
SQL> SPARQL CLEAR GRAPH <http://pan-mapped.genenetwork.org>; // 10 min
SQL> SPARQL CLEAR GRAPH <http://pangenome-marker.genenetwork.org>;
SQL> ld_dir('/export/data/virtuoso/ttl','pangenome-markers.n3','http://pangenome-marker.genenetwork.org');
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
```
For pangenomes we have a marker file and a QTL file.
As a test, fetch a table of the traits with their SNPs:
```
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pangenome-mapped.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test .
?snp gnt:mappedSnp ?traitid ;
gnt:locus ?locus ;
gnt:lodScore ?lod ;
gnt:af ?af .
?locus rdfs:label ?nodeid ;
gnt:chr ?chr ;
gnt:pos ?pos .
FILTER (contains(?nodeid,"Marker") && ?pos < 1000)
} LIMIT 100
```
OK, we are ready to run a little workflow. First create a sorted list of IDs.
```
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT DISTINCT ?trait FROM <http://pangenome-mapped.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId ?trait.
}
```
See also
=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/list-traits.sparql
Sort that list and save as 'pan-ids-sorted.txt'. Next run
```
../../bin/workflow/qtl-detect-batch-run.sh
```
and load those into virtuoso. List the new QTL:
```
SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId ?t .
MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
# ?traitid gnt:run gn:test . # use if you want the new QTL
?qtl gnt:mappedQTL ?traitid ;
gnt:qtlChr ?chr ;
gnt:qtlLOD ?lod ;
gnt:qtlStart ?s ;
gnt:qtlStop ?e .
?qtl gnt:mappedSnp ?snp .
FILTER (?t = "10002" && ?lod >= 5.0 ) .
} LIMIT 100
```
# Prior work
For the first traits (presented at CTC'25) gemma was run as
```
echo "[$(date)] Starting kinship matrix calculation for PERCENTILE..."
gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
-p ${PHENO_FILE} \
-gk \
-o percentile_result > percentile.kinship.143.txt
echo "[$(date)] Kinship matrix calculation completed for PERCENTILE."
echo "[$(date)] Starting association analysis for PERCENTILE..."
gemma -g ${BIMBAM_DIR}/143samples.percentile.bimbam.bimbam.gz \
-p ${PHENO_FILE} \
-k ./output/percentile_result.cXX.txt \
-lmm 4 \
-maf 0.05 \
-o percentile_association > percentile.assoc.143.txt
```
Note no LOCO.
The genotype BIMBAM file is 45G uncompressed. Even though GEMMA does not load everything in RAM, it is a bit large for my workstation. I opted to use tux04 since no one is using it. Had to reboot the machine because it is unreliable and had crashed.
There I rebuilt gemma and set up a first run:
```
tux04:/export/data/wrk/iwrk/opensource/code/genetics/gemma/tmp$
/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -gk
```
Without LOCO this took about 18 minutes (186% CPU), 110Gb of RAM. We ought to work on this ;) Next
```
/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 143samples.percentile.bimbam.pheno.gz -k output/result.cXX.txt -lmm 9 -maf 0.05
```
To run gemma on the current 23M BXD pangenome derived genotypes takes 2.5 hours (@ 200% CPU). That is a bit long :). 13K traits would be 43 months on a single machine. We'll need something better. As Rob writes:
> The huge majority of variants will have r2 of 1 with hundreds or thousands of neighbors. This is just a monster distraction. We just want proximal and distal haplotype boundaries for each BXD. Then we want to layer on the weird non-SNP variants and inversions.
A few days later I had to rerun gemma because the output was wrong (I should have checked!). It shows:
```
chr rs ps n_miss allele1 allele0 af beta se logl_H1 l_remle l_mle p_wald p_lrt p_score
-9 A1-0 -9 0 A T 0.171 -nan -nan -nan 1.000000e+05 1.000000e+05 -nan -nan -nan
-9 A2-0 -9 0 A T 0.170 -nan -nan -nan 1.000000e+05 1.000000e+05 -nan -nan -nan
```
Turns out I was using the wrong pheno file. Let's try again.
```
/bin/time -v ../bin/gemma -g 143samples.percentile.bimbam.bimbam.gz -p 10354082_143.list.pang.txt -k output/result.cXX.txt -lmm 9 -maf 0.05
```
As a check I can diff against the original output. So, I replicated the original run! It also ran faster at 400% CPU in 35 minutes.
(btw tux04 crashed, so I upgraded the BIOS and iDRAC remotely, let's see if this improves things).
## Moving to gemma-wrapper
gemma-wrapper has extra facilities, such as LOCO, caching and lmdb output. Last time we used it in
=> ../genetics/systems/mariadb/precompute-publishdata
in a guix container it looked like
```
#!/usr/bin/env sh
export TMPDIR=./tmp
curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
jq ".[] | .Id" < bxd-publish.json > ids.txt
./bin/gemma-wrapper --force --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
for id in $(cat ids.txt) ; do
echo Precomputing $id
if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
./bin/gn-pheno-to-gemma.rb --phenotypes pheno.json --geno-json BXD.geno.json > BXD_pheno.txt
./bin/gemma-wrapper --json --lmdb --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
fi
done
```
Let's try running the big stuff instead:
```
./bin/gemma-wrapper --force --json --loco -- -g tmp/143samples.percentile.bimbam.bimbam.gz -p tmp/143samples.percentile.bimbam.pheno.gz -gk
```
## Individuals
gemma does not really track individuals. The order of the genotype columns simply has to match the order of individuals in the pheno file.
In this case a sample list is provided and we'll generate a geno-json version that we can give to gemma-wrapper. Basically such a file lists the following
```
{
"type": "gn-geno-to-gemma",
"genofile": "BXD.geno",
"samples": [
"BXD1",
"BXD2",
"BXD5",
...
],
"numsamples": 237,
"header": [
"# File name: BXD_experimental_DGA_7_Dec_2021",
...
```
To get this
```
cut -f 1 143samples.pc-list.tsv | sed -e 's,_.*,,' -e 's,^,",' -e 's,$,"\,,' > bxd_inds.list.txt
"BXD100",
"BXD101",
"BXD102",
```
Next I turned it into a JSON file by hand as 'bxd_inds.list.json'.
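Instead of hand-editing, a small throwaway Ruby sketch could emit the JSON directly (the field names follow the gn-geno-to-gemma example above; the helper itself is hypothetical):

```ruby
require 'json'

# Convert the first column of a sample list (e.g. 143samples.pc-list.tsv)
# into a gn-geno-to-gemma style JSON: strip the _suffix from each
# sample name, deduplicate, and count.
def samples_to_json(lines)
  samples = lines.map { |l| l.split("\t").first.sub(/_.*/, "") }.uniq
  JSON.generate({ "type" => "gn-geno-to-gemma",
                  "samples" => samples,
                  "numsamples" => samples.size })
end

# puts samples_to_json(File.readlines("143samples.pc-list.tsv"))
```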
## Markers
With GEMMA, marker names are listed in the geno file. GEMMA can also use a SNP file that gives the chromosome and location.
Without the SNP file, gemma-wrapper complains it needs the SNP/marker annotation file. This is logical, because for LOCO it needs to know which chromosome a marker is on.
The next step is to take the nodes file and extract all rows from the genotype file that match nodes with chromosomes defined. Andrea is going to deliver all positions for all nodes, but for now we can use what we have. Currently we have nodes annotated in mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt:
```
mm10#1#chr3 23209565 93886997
mm10#1#chr3 23209564 93886999
mm10#1#chr3 23209563 93887016
...
```
In the genotype file we find, for example
```
A23209564-0, A, T, 1.919141867395325, 0.9306930597711228, 1.8201319833577734, 0.7607260422339468, 1.427392726736106, 1.2310230984252724, 1.6633662444541875, 0.6105610229068721, ...
```
A bit funny, but you get the idea. So we can take the mm10 file, write out the genotype file again for all matching nodes, and create a matching SNP file that for this node should contain:
```
A23209564-0 93886999 3
```
To rewrite the above mm10+C57BL_6+DBA_2J.p98.s10k.matrix-pos.txt file we can do something like
```
#!/usr/bin/env ruby
# Convert "tag<TAB>name<TAB>pos" lines into "A<name>-0<TAB>pos<TAB>chrom" SNP lines
ARGF.each_line do |line|
  tag,name,pos = line.strip.split(/\t/)
  tag =~ /chr(.*)$/
  chrom = $1
  print "A#{name}-0\t#{pos}\t#{chrom}\n"
end
```
Now, another problem is that not all SNPs in the genotype file have a position (yet). As we can't display them, I can drop them at this stage. So we take the SNP file and rewrite the BIMBAM file using that information. The throwaway script looks like
```
# Keep only BIMBAM genotype lines whose marker has a SNP annotation
bimbamfn = ARGV.shift
snpfn = ARGV.shift
snps = {}
open(snpfn).each_line do |snpl|
  name = snpl.split(/\t/)[0]
  snps[name] = 1
end
open(bimbamfn).each_line do |line|
  marker = line.split(/[,\s]/)[0]
  print line if snps[marker]
end
```
It takes a while to run, but as this is a one-off that does not matter. Reducing the file leads to 13667900 markers with genotypes. The original SNP file has 14927024 lines. Hmmm. The overlap is therefore not perfect (we have more annotations than genotypes now). To check this I'll run a diff.
```
cut -f 1 -d "," 143samples.percentile.bimbam.bimbam-reduced > 143samples.percentile.bimbam.bimbam-reduced-markers
sort 143samples.percentile.bimbam.bimbam-reduced-markers > markers-sorted.txt
diff --speed-large-files 143samples.percentile.bimbam.bimbam-reduced-markers markers-sorted.txt
< A80951-0
< A80952-0
< A80953-0
...
cut -f 1 snps.txt |sort > snps-col1-sorted.txt
diff --speed-large-files snps-col1-sorted.txt markers-sorted.txt
241773d228996
< A10314686-0
241777d228999
< A10314689-0
241781d229002
< A10314692-0
grep A10314686 snps-col1-sorted.txt markers-sorted.txt
snps-col1-sorted.txt:A10314686-0
snps-col1-sorted.txt:A10314686-0
markers-sorted.txt:A10314686-0
```
Ah, we have duplicate annotation lines in the SNP file.
```
grep A10314686-0 snps.txt
A10314686-0 20257882 8
A10314686-0 20384895 8
grep A10314692-0 snps.txt
A10314692-0 20257575 8
A10314692-0 20384588 8
```
So, the same node is considered two SNPs. This is due to the node covering multiple individuals (paths). Turns out a chunk of them map on different chromosomes too. I think we ought to drop them until we have a better understanding of what they represent (they may be mismapping artifacts).
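Dropping those duplicated nodes can be sketched as a two-pass filter over the SNP file (layout as shown above: name, position, chromosome; this helper is an illustration, not the actual updated script):

```ruby
# Keep only markers whose node name occurs exactly once in the
# SNP annotation file; duplicated nodes (multiple positions or
# chromosomes for the same node) are dropped entirely.
def unique_markers(snp_lines)
  counts = Hash.new(0)
  snp_lines.each { |l| counts[l.split("\t").first] += 1 }
  snp_lines.select { |l| counts[l.split("\t").first] == 1 }
end
```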
I updated the script. Now I see it skips A280000 because there is no marker annotation for that node. Good. Also the number of genotype markers got further reduced to 13209385.
I checked the gemma code and the SNP annotation file should match the genotype file line for line. Unsurprising, perhaps, but now I need to rewrite both. After adapting the script we now have two files with the same number of lines.
Rerunning with the new files:
```
gemma -g new-genotypes.txt -p pheno_filtered_143.txt -gk
gemma -g new-genotypes.txt -p pheno_filtered_143.txt -k output/result.cXX.txt -maf 0.05 -lmm 4 -a snps-matched.txt
```
And, even though the results differ somewhat in size -- due to the different number of markers -- the results look very similar to what was produced before. Good!
Now that we have confirmation and all the pieces, we can run the same set with gemma-wrapper and LOCO.
## gemma-wrapper
The first 'challenge' is that gemma-wrapper computes hash values using a Ruby lib which is rather slow. This is also something we encounter in guix. I replaced that by using our pfff hashing for larger files.
```
/bin/time -v ../bin/gemma-wrapper --json --loco --jobs 8 -v -- -g new-genotypes.txt -p pheno_filtered_143.txt -gk -a snps-matched.txt > K.json
```
For this computation each gemma process maxed out at 80Gb RAM (640Gb total). We are really hitting limits here. In the near future we need to check why so much data is retained; as we only have ~150 individuals, the memory use must be driven by the number of markers.
```
/bin/time -v ../bin/gemma-wrapper -v --json --lmdb --loco --input K.json -- -g new-genotypes.txt -p pheno_filtered_143.txt -a snps-matched.txt -debug -maf 0.05 -lmm 9 > GWA.json
```
This time gemma requires only 25Gb per chromosome, so we can run it in one go in RAM on this large server. Much of the time is spent in IO, so I think that when we start using mmap (lmdb) we can speed it up significantly.
gemma-wrapper has a wall clock time of 10 minutes utilizing 17 cores.
Some chromosomes failed with 'ERROR: Enforce failed for Trying to take the sqrt of nan in src/mathfunc.cpp at line 127 in safe_sqrt2'. Running the same with -lmm 9 passed. I'll need to keep an eye on that one.
After some fixes we now have LOCO in lmdb output. The mdb file comes in at 693Mb. That will make 9TB for 13K traits. Storing the full vector is probably not wise here (and arguably we won't ever use it at this size - we should use the smoothed haplotypes). Only storing the significant values (LOD ≥ 4.0) brought the size down to 17Mb, or 215Gb total. Which is manageable. I made it even smaller by removing the (superfluous) hits from the metadata. Now down to 7Mb, and 3.2Mb compressed. That'll total less than 100Gb for 13K traits. Good.
## Final hookup
Now that gemma-wrapper works (and test results are confirmed) we have to wire it up to fetch traits from the DB. We also have to make sure the trait values align with the individuals in the genotype file. Earlier I was running the script gemma-batch-run.sh:
=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma-batch-run.sh
which looks like:
```
export TMPDIR=./tmp
curl http://127.0.0.1:8092/dataset/bxd-publish/list > bxd-publish.json
jq ".[] | .Id" < bxd-publish.json > ids.txt
# ---- Compute GRM
./bin/gemma-wrapper --json --loco -- -g BXD.geno.txt -p BXD_pheno.txt -a BXD.8_snps.txt -n 2 -gk > K.json
# ---- For all entries run LMM
for id in $(cat ids.txt) ; do
echo Precomputing $id
if [ ! -e tmp/*-BXDPublish-$id-gemma-GWA.tar.xz ] ; then
curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json
./bin/gemma-wrapper --json --lmdb --geno-json BXD.geno.json --phenotypes pheno.json --population BXD --name BXDPublish --trait $id --loco --input K.json -- -g BXD.geno.txt -a BXD.8_snps.txt -lmm 9 -maf 0.1 -n 2 -debug > GWA.json
fi
done
```
We already have ids.txt and the GRM. What is still required are the trait values from the DB, so we need to run gn-guile somewhere with access to the DB. Also I need to make sure the current gemma-wrapper tar-balls up the result.
OK, we are running. Looks like the smaller datasets only use 11Gb RES RAM per chromosome. Which means we can run two computes in parallel on this machine.
The first run came through! I forgot the --reduce flag, so it came in at 190Mb. I'll fix that. 34 individuals ran in 7 minutes.
We are currently running at one trait per 6 minutes. We can double that on this machine.
The following puzzles me a bit
```
## number of analyzed individuals = 31
## number of covariates = 1
## number of phenotypes = 1
## leave one chromosome out (LOCO) = 14
## number of total SNPs/var = 13209385
## number of SNPS for K = 12322657
## number of SNPS for GWAS = 886728
## number of analyzed SNPs = 13122153
```
Why is the number of SNPs for GWAS so low? Perhaps a 10% threshold for maf is a bit stringent. See below.
Anyway, we are running traits and the first 500 we'll use for analysis.
Meanwhile I'll look at deploying on octopus and maybe speeding up GEMMA. See
=> issues/genetics/speeding-up-gemma
# MAF
GEMMA has a MAF filter. For every SNP a maf is computed by summing the geno values:
```
maf += geno
```
When all genotype values are added up, the sum is divided by 2× the number of individuals (minus missing):
```
maf /= 2.0 * (double)(ni_test - n_miss);
```
and this is held against the maf passed on the command line. The 2.0 therefore assumes all genotype values are between 0 and 2.
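As a toy check of that logic (not GEMMA's actual code, just the arithmetic described above):

```ruby
# Allele frequency as GEMMA computes it: sum the 0..2 genotype
# dosages and divide by 2 * the number of non-missing individuals.
def allele_freq(genos)
  present = genos.compact        # nil marks a missing genotype
  present.sum / (2.0 * present.size)
end

# allele_freq([2.0, 2.0, 0.0, nil]) is 4.0 / 6 ≈ 0.667; with
# -maf 0.05 the SNP passes when min(af, 1 - af) >= 0.05.
```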
Actually I now realise we are using LOCO. So the SNPs for GWAS are just the ones on the single left-out chromosome. That makes sense!
Still we have to be careful about the MAF range. In our genotype file the values are between 0 and 2. So that is fine in itself.
# RDF
Next step is to generate RDF. The SNP annotation was slow, so I moved that to lmdb. Parsing 400 traits now takes 3 minutes. The RDF file is under 1Gb and the SNP annotation RDF is 330Mb. Not too bad!
```
guix shell -C -N --expose=/export/guix-containers/virtuoso/data/virtuoso/ttl/=/export/data/virtuoso/ttl virtuoso-ose -- isql -S 8891
SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-snps-400.n3','http://pan-test.genenetwork.org');
Done. -- 3 msec.
# check validity and optionally delete problematic entries:
SQL> SELECT * FROM DB.DBA.load_list;
SQL> DELETE from DB.DBA.LOAD_LIST where ll_error IS NOT NULL ;
# commit changes
SQL> rdf_loader_run ();
SQL> ld_dir('/export/data/virtuoso/ttl','pan-test-400.n3','http://pan-test.genenetwork.org');
SQL> rdf_loader_run ();
SQL> checkpoint;
Done. -- 16 msec.
SQL> SPARQL SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o } LIMIT 10;
34200686
```
Or in the web interface:
```
SELECT count(*) FROM <http://pan-test.genenetwork.org> WHERE { ?s ?p ?o }
```
## Query
The RDF is formed as:
```
gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedTrait;
rdfs:label "GEMMA BXDPublish trait 10383 mapped with LOCO (defaults)";
gnt:trait gn:publishXRef_10383;
gnt:loco true;
gnt:run gn:test;
gnt:time "2025/11/10 08:12";
gnt:belongsToGroup gn:setBxd;
gnt:name "BXDPublish";
gnt:traitId "10383";
gnt:nind 14;
gnt:mean 18.0;
gnt:std 10.9479;
gnt:skew 0.3926;
gnt:kurtosis -1.1801;
skos:altLabel "BXD_10383";
gnt:filename "0233fa0cf277ee7d749de08b32f97c8be6478639-BXDPublish-10383-gemma-GWA.tar.xz";
gnt:hostname "napoli";
gnt:user "wrk".
gn:A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 a gnt:mappedLocus;
gnt:mappedSnp gn:GEMMAMapped_test_LOCO_BXDPublish_10383_gemma_GWA_e6478639;
gnt:locus gn:A8828461_0;
gnt:lodScore 4.8;
gnt:af 0.536;
gnt:effect -32.859.
```
and SNPs are annotated as
```
gn:A8828461_0 a gnt:marker;
rdfs:label "A8828461-0";
gnt:chr "1";
gnt:pos 3304440.
gn:A8828464_0 a gnt:marker;
rdfs:label "A8828464-0";
gnt:chr "1";
gnt:pos 3304500.
```
To get all tested traits you can list:
```
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?trait a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId ?traitid ;
gnt:kurtosis ?kurtosis .
} limit 100
```
To get all SNPs for trait "10381"
```
SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId "10381" .
?snp gnt:mappedSnp ?traitid ;
gnt:locus ?locus .
?locus rdfs:label ?nodeid ;
gnt:chr ?chr ;
gnt:pos ?pos .
}
```
Lists:
```
| http://genenetwork.org/id/A8828461_0_BXDPublish_10383_gemma_GWA_e6478639 | "A8828461-0" | "1" | 3304440 |
```
## Scoring/annotating QTL
Next step is annotating the QTL in RDF. Earlier I wrote a script rdf-analyse-gemma-hits. It uses rapper to read two RDF files (two runs) and annotates the QTL and differences between the files. The code is not pretty:
=> https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/bin/rdf-analyse-gemma-hits.rb#L1
The supporting library is a bit better:
=> https://github.com/genetics-statistics/gemma-wrapper/blob/6d667ac97284013867b6cac451ec7e7a22ffbf4b/lib/qtlrange.rb#L1
Basically we have a QTL locus (QLocus) that tracks chr,pos,af and lod for each hit.
QRange is a set of QLocus which also tracks some stats chr,min,max,snps,max_af,lod.
It can compute whether two QTL (QRange) overlap.
Next we have a container that tracks the QTL (QRanges) on a chromosome.
Finally there is a diff function that can show the differences on a chromosome (QRanges) for two mapped traits.
Maybe the naming could be a bit better, but the code is clear as it stands. One thing to note is that we use a fixed distance MAX_SNP_DISTANCE_BPS of 50M to decide whether a SNP falls in the same QTL. It would be worth basing this on the drop in LOD scores (1.5 from the top) instead, as Rob and Flavia pointed out.
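A 1.5-LOD support interval could look roughly like this (a hypothetical helper, not part of qtlrange.rb): walk outward from the peak while the LOD stays within 1.5 of it.

```ruby
# Hypothetical 1.5-LOD drop interval: hits is an array of
# [pos, lod] pairs on one chromosome, sorted by position.
# Returns the [start, stop] positions around the highest peak.
def lod_drop_interval(hits, drop = 1.5)
  peak_idx = (0...hits.size).max_by { |i| hits[i][1] }
  threshold = hits[peak_idx][1] - drop
  lo = peak_idx
  lo -= 1 while lo > 0 && hits[lo - 1][1] >= threshold
  hi = peak_idx
  hi += 1 while hi < hits.size - 1 && hits[hi + 1][1] >= threshold
  [hits[lo][0], hits[hi][0]]
end
```

This is a sketch only; real data would need tie-breaking between equal peaks and handling of LOD dips inside the interval.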
So, the library is fine, but the calling program is not great. The reason is that I parse RDF directly, teasing apart the logic we express in the SPARQL above. I track state in dictionaries (hashes of hashes) and the result ends up convoluted, with a lot of state in RAM. I chose direct RDF parsing because it makes for easier development. The downside is that I need to parse the whole file to make sure I have everything related to a trait. But fetching SNP results from SPARQL directly is slow too. I am in a bind.
Using curl:
```
time curl -G http://sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId ?trait ; gnt:kurtosis ?k . }
```
```
time curl -G http://sparql -H "Accept: application/json; charset=utf-8" --data-urlencode query="
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT * FROM <http://pan-test.genenetwork.org> WHERE { ?traitid a gnt:mappedTrait ; gnt:traitId \"10001\" ; gnt:kurtosis ?k . ?snp gnt:mappedSnp ?traitid ; gnt:locus ?locus . }
" > test.out
real 0m1.612s
user 0m0.020s
sys 0m0.000s
```
To get the trait info for 400 traits takes a second, so that is no big deal. But getting the 6K SNPs for one trait also takes a second; at that rate the full set takes hours, compared to the minutes for direct RDF parsing. Before lmdb comes to the rescue we should try running it on the virtuoso server itself. There, with curl, we get 0.5s per trait, which makes it two hours for 13K traits. But when we run the query using isql it runs in 70ms, which totals 15 minutes. That is perfectly fine for running the whole set!
One way is to simply script isql from the command line. Meanwhile, it also turns out the ODBC interface can be used from python or ruby. Here an example in R:
=> https://cran.stat.auckland.ac.nz/web/packages/virtuoso/index.html
Not sure if that is fast enough, but perhaps worth trying.
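Scripting isql from Ruby could be as simple as building the command line per trait. A sketch: the -S 8891 port comes from the isql calls above, isql's EXEC= option runs a single statement, and the query shape follows the SPARQL examples (prefixes abbreviated); the helper itself is hypothetical.

```ruby
# Build an isql command that fetches the mapped SNPs for one trait.
# EXEC= runs the statement and exits; redirect stdout to capture it.
def isql_cmd(trait_id, graph: "http://pan-test.genenetwork.org")
  q = "SPARQL PREFIX gnt: <http://genenetwork.org/term/> " \
      "SELECT ?lod ?chr ?pos FROM <#{graph}> WHERE { " \
      "?t a gnt:mappedTrait ; gnt:traitId \"#{trait_id}\" . " \
      "?snp gnt:mappedSnp ?t ; gnt:lodScore ?lod ; gnt:locus ?l . " \
      "?l gnt:chr ?chr ; gnt:pos ?pos . };"
  ["isql", "-S", "8891", "EXEC=#{q}"]
end

# Example: system(*isql_cmd("10002"), out: "trait-10002.tsv")
```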
So, now we have a way to query the data around a trait in seconds. This means I can rewrite the QTL generator to go by trait. This also allows for a quick turnaround during development (good!). Also I want two scripts: one for computing the QTL and one for annotating the differences.
Alright. The first script simply fetches a trait with its markers from SPARQL and scores the QTL (as RDF output). The new script is at
=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/sparql-qtl-detect.rb
First, the query for one trait looks like:
```
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX gn: <http://genenetwork.org/id/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX gnc: <http://genenetwork.org/category/>
PREFIX gnt: <http://genenetwork.org/term/>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX xkos: <http://rdf-vocabulary.ddialliance.org/xkos#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
SELECT ?lod ?af ?nodeid ?chr ?pos FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId "10002" .
?snp gnt:mappedSnp ?traitid ;
gnt:locus ?locus ;
gnt:lodScore ?lod ;
gnt:af ?af .
?locus rdfs:label ?nodeid ;
gnt:chr ?chr ;
gnt:pos ?pos .
} ORDER BY DESC(?lod)
```
which renders some 22K markers for trait 10002 as a TSV:
```
"lod" "af" "nodeid" "chr" "pos"
7.5 0.547 "A13459298-0" "8" 98658490
7.1 0.154 "A13402313-0" "8" 96798487
7 0.432 "A13446492-0" "8" 97355019
7 0.263 "A13387873-0" "8" 94934820
7 0.585 "A4794343-0" "1" 172265488
...
```
Earlier with precompute for trait 10002 we got:
```
[10002,HK] =>{"1"=>[#<QRange Chr1 𝚺14 179.862..181.546 LOD=3.07..3.07>], "8"=>[#<QRange Chr8 𝚺102 94.3743..112.929 LOD=3.1..5.57>]}
[10002,LOCO] =>{"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>, #<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>], "8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
```
so the hits are in range, but the LOD scores may be inflated because of the number of markers. Anyway, at this point we are merely concerned with scoring QTL. The first script is simply:
```
require "csv"

qtls = QTL::QRanges.new("10002", "test")
# fn is the TSV file with the SPARQL results shown above
CSV.foreach(fn, headers: true, col_sep: "\t") do |hit|
  qlocus = QTL::QLocus.new(hit["nodeid"], hit["chr"], hit["pos"].to_i,
                           hit["af"].to_f, hit["lod"].to_f)
  qtls.add_locus(qlocus)
end
print qtls
```
and prints a long list of QTL, each containing a single hit:
```
[10002,test] =>{"1"=>[#<QRange Chr1 𝚺1 3099543..3099543 AF=0.583 LOD=5.8..5.8>, #<QRange Chr1 𝚺1 65908328..65908328 AF=0.627 LOD=5.7..5.7>, #<QRange Chr1 𝚺1 81604902..81604902 AF=0.451 LOD=5.5..5.5>, #<QRange Chr1 𝚺2 85087169..85087177 AF=0.781 LOD=5.5..5.6>, #<QRange Chr1 𝚺1 93740525..93740525 AF=0.762 LOD=6.5..6.5>, #<QRange Chr1 𝚺1 114086053..114086053 AF=0.568 LOD=5.7..5.7>,...
```
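These single-hit QTL get merged by rebinning: neighbouring hits on a chromosome are collapsed into ranges. A simplified sketch of the idea (this is not the actual QRanges implementation, and the 5 Mbp window is an arbitrary illustration value):

```ruby
# Merge per-chromosome hits into ranges when neighbouring hits fall
# within a window (here 5 Mbp -- an arbitrary illustration value).
# Each hit is [pos_in_bp, lod].
def rebin(hits, window: 5_000_000)
  ranges = []
  hits.sort_by { |pos, _| pos }.each do |pos, lod|
    if ranges.any? && pos - ranges.last[:stop] <= window
      r = ranges.last            # extend the current range
      r[:stop] = pos
      r[:lod_max] = [r[:lod_max], lod].max
      r[:snps] += 1
    else                         # start a new range
      ranges << { start: pos, stop: pos, lod_max: lod, snps: 1 }
    end
  end
  ranges
end

hits = [[3_099_543, 5.8], [65_908_328, 5.7], [66_000_000, 6.5]]
p rebin(hits)
# two ranges: the first hit alone, the last two merged
```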
For trait 10002, after tweaking thresholds and rebinning, we get
```
#<QRange Chr8 𝚺2 34.303454..35.675301 AF=0.571 LOD=5.7..5.8>
#<QRange Chr8 𝚺621 91.752748..102.722635 AF=0.663 LOD=5.6..7.5>
#<QRange Chr1 𝚺16 65.908328..175.232335 AF=0.781 LOD=5.6..7.0>
#<QRange Chr4 𝚺5 56.498971..126.135422 AF=0.657 LOD=5.6..6.4>
#<QRange Chr12 𝚺3 23.037869..58.306731 AF=0.643 LOD=5.8..6.2>
#<QRange Chr10 𝚺2 13.442071..13.442088 AF=0.641 LOD=5.8..6.0>
#<QRange Chr10 𝚺3 94.246536..103.438796 AF=0.608 LOD=5.9..6.2>
#<QRange Chr3 𝚺2 47.644513..82.451061 AF=0.548 LOD=5.7..6.2>
#<QRange Chr9 𝚺2 97.445077..120.263403 AF=0.717 LOD=5.8..5.8>
#<QRange Chr11 𝚺2 27.4058..56.30011 AF=0.559 LOD=5.7..5.7>
```
with a LOD>5.5 cut-off. That seems justified because LOD scores are inflated. Compare this with the earlier mapping using 'traditional' genotypes:
```
[10002,LOCO] =>{
"1"=>[#<QRange Chr1 𝚺15 72.2551..73.3771 AF=0.574 LOD=4.0..5.1>,
#<QRange Chr1 𝚺91 171.172..183.154 AF=0.588 LOD=4.5..5.3>],
"8"=>[#<QRange Chr8 𝚺32 94.4792..97.3382 AF=0.441 LOD=4.5..4.8>]}
```
we can see that the significance of chr8 has gone up with pangenome mapping (relative to chr1) and that we now find 2 QTL on chr8, with a new one to the left. Chr1 looks similar. We have some other candidates that may or may not be relevant (note they are all narrow!).
Note this *is* a random trait(!), which suggests the landscape of QTL will change pretty dramatically. Note also that Andrea will provide new genotypes, with smoothing to follow. But it is encouraging.
I played a bit with the QTL output, and for now settled on tracking nodes that have a LOD>5.0. We drop QTL based on the following:
```
qtl.lod.max < 6.0 or (qtl.lod.max < 7.5 - qtl.snps.size/2)
```
I.e. a single-SNP QTL has to have a LOD of at least 7.0 and a 2-SNP QTL a LOD of at least 6.5 (assuming float division for snps.size/2). This yields
```
[10002,test] =>{
"1"=>[#<QRange Chr1 𝚺69 3.099543..192.718161 AF=0.781 LOD=5.1..7.0>],
"4"=>[#<QRange Chr4 𝚺12 56.498971..147.86044 AF=0.676 LOD=5.1..6.4>],
"8"=>[#<QRange Chr8 𝚺2774 34.303454..116.023702 AF=0.899 LOD=5.1..7.5>],
"10"=>[#<QRange Chr10 𝚺7 82.334108..105.062097 AF=0.623 LOD=5.1..6.2>],
"12"=>[#<QRange Chr12 𝚺9 21.707644..72.57041 AF=0.77 LOD=5.1..6.2>]}
```
which are all worth considering (I think). Obviously we could annotate all QTL in RDF triples and filter on that using SPARQL. But this makes processing a bit faster without having to deal with too much noise. We can fine tune later.
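The drop rule above can be written as a small predicate. A minimal sketch; note it uses /2.0 so that the thresholds match the 7.0 and 6.5 quoted (integer division would shift them):

```ruby
# Drop rule: a QTL survives only if its peak LOD clears a threshold
# that relaxes by 0.5 per extra SNP, floored at 6.0.
# Note the /2.0: float division gives single SNP -> 7.0, two SNPs -> 6.5.
def drop_qtl?(lod_max, n_snps)
  lod_max < 6.0 || lod_max < 7.5 - n_snps / 2.0
end

p drop_qtl?(6.9, 1)  # single-SNP QTL below 7.0 -> dropped (true)
p drop_qtl?(7.0, 1)  # meets the single-SNP threshold -> kept (false)
p drop_qtl?(6.5, 2)  # two-SNP threshold is 6.5 -> kept (false)
```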
Now two more steps to go:
* [X] Fetch all mapped traits using SPARQL and write RDF
* [X] Compare QTL between datasets and annotate new hits
## Fetch all mapped traits
```
SELECT * FROM <http://pan-test.genenetwork.org> WHERE {
?traitid a gnt:mappedTrait;
gnt:run gn:test ;
gnt:traitId "10002" .
?snp gnt:mappedSnp ?traitid ;
gnt:locus ?locus ;
gnt:lodScore ?lod ;
gnt:af ?af .
?locus rdfs:label ?nodeid ;
gnt:chr ?chr ;
gnt:pos ?pos .
} ORDER BY DESC(?lod)
```
The first step is to fetch this data. Let's try SPARQL over the web first.
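For SPARQL over the web, a minimal Ruby sketch of building the HTTP request (the endpoint URL is the GN test server, and the format parameter for CSV output is a Virtuoso convention; treat both as assumptions):

```ruby
require "uri"

# Build a SPARQL GET request URI asking for CSV output.
# Endpoint and format parameter are assumptions for a Virtuoso server.
def sparql_uri(endpoint, query)
  uri = URI(endpoint)
  uri.query = URI.encode_www_form(query: query, format: "text/csv")
  uri
end

uri = sparql_uri("http://sparql-test.genenetwork.org/sparql/",
                 "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
puts uri
# Fetch with: require "net/http"; Net::HTTP.get(uri)
```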
## Compare QTL sets
The previous code I wrote to compare QTLs essentially walks the QTLs and annotates a QTL as new if there is no overlap between the two sets. Again, this code is too convoluted:
=> https://github.com/genetics-statistics/gemma-wrapper/blob/18e7a3ac8a11becba84325499116621ad095f28e/lib/qtlrange.rb#L190
The principle is straightforward, however. The code for reading the SPARQL output for a trait is
```
require "csv"

CSV.foreach(fn, headers: true, col_sep: "\t") do |hit|
  trait_id ||= hit["traitid"]
  lod = hit["lod"].to_f
  if lod > 5.0 # cut-off set for pangenome input
    qlocus = QTL::QLocus.new(hit["snp"], hit["chr"], hit["pos"].to_f/10**6,
                             hit["af"].to_f, lod)
    qtls.add_locus(qlocus)
  end
end
```
So we can use SPARQL to build two sets on the fly and then run the diff.
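The diff itself reduces to an interval-overlap walk per chromosome. A minimal sketch (a simplified stand-in for the qtlrange.rb code linked above):

```ruby
# Two intervals [start, stop] (in Mbp) overlap when each starts before
# the other ends.
def overlap?(a, b)
  a[0] <= b[1] && b[0] <= a[1]
end

# Return the QTL from +candidates+ (hash of chr => intervals) that have
# no overlapping QTL on the same chromosome in +reference+.
def new_qtls(candidates, reference)
  candidates.flat_map do |chr, ivs|
    ref = reference.fetch(chr, [])
    ivs.reject { |iv| ref.any? { |r| overlap?(iv, r) } }
       .map { |iv| [chr, iv] }
  end
end

ref  = { "1" => [[72.2551, 73.3771], [171.172, 183.154]] }
cand = { "1" => [[172.235, 172.235], [30.0, 35.0]] }
p new_qtls(cand, ref)  # => [["1", [30.0, 35.0]]]
```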
Actually, thinking about this, I realised it should not be too hard to find the 'new' QTL directly in SPARQL.
```
SELECT * WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId "10002" .
}
http://genenetwork.org/id/GEMMAMapped_LOCO_BXDPublish_10002_gemma_GWA_7c00f36d
http://genenetwork.org/id/HK_trait_BXDPublish_10002_gemma_GWA_hk_assoc_txt
http://genenetwork.org/id/GEMMAMapped_test_LOCO_BXDPublish_10002_gemma_GWA_82087f23
```
lists the three compute versions for this trait. To fetch all QTL for the first mapping:
```
SELECT ?qtl ?lod ?chr ?start ?stop (count(?snp) as ?snps) WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId "10002" .
?qtl gnt:mappedQTL ?traitid ;
gnt:qtlChr ?chr ;
gnt:qtlStart ?start ;
gnt:qtlStop ?stop ;
gnt:qtlLOD ?lod .
?qtl gnt:mappedSnp ?snp .
}
```
gets 3 QTL. Now, I did not store HK in RDF, but to show the filtering principle we can fetch two traits and compare their QTL.
The following gets two QTL from trait "10002" on chr1 and holds them against those of trait "10079":
```
SELECT ?t ?s1 ?e1 ?t2 ?s2 ?e2 WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId ?t .
?qtl gnt:mappedQTL ?traitid ;
gnt:qtlChr ?chr ;
gnt:qtlStart ?s1 ;
gnt:qtlStop ?e1 .
{
SELECT * WHERE {
?tid a gnt:mappedTrait ;
gnt:traitId "10079" ;
gnt:traitId ?t2 .
?qtl2 gnt:mappedQTL ?tid ;
gnt:qtlChr ?chr ;
gnt:qtlStart ?s2 ;
gnt:qtlStop ?e2 .
}
}
FILTER (?t = "10002") .
} LIMIT 10
"10002",171.172,183.154,"10079",172.235,172.235
"10002",72.2551,73.3771,"10079",172.235,172.235
```
Note we pivot on two traits and one chromosome, so we find all pairs.
To say whether a QTL is *new* or different we can add another FILTER
```
FILTER ((?s2 > ?s1 && ?e2 > ?e1) || (?s2 < ?s1 && ?e2 < ?e1)) .
"t","s1","e1","t2","s2","e2"
"10002",72.2551,73.3771,"10079",172.235,172.235
```
which says that this ?qtl2 does not overlap with ?qtl, i.e. here it is a new QTL!
This new insight means we should store *all* QTL in RDF, including the single-SNP ones, because it is easy to filter on them. Note that there may be a more elegant way to query traits pairwise; this is just the first thing that worked. It may need more tuning when there are more than two QTL on a chromosome. E.g. the comparison between 10002 and 10413 finds:
```
"t","s1","e1","t2","s2","e2"
"10002",72.2551,73.3771,"10413",32.3113,42.4624
"10002",171.172,183.154,"10413",171.04,171.041
"10002",171.172,183.154,"10413",32.3113,42.4624
"10002",72.2551,73.3771,"10413",171.04,171.041
```
I.e. it does find the new QTL here, but you still need to do a little set analysis. In words: you should be able to "remove all overlapping QTL from a chromosome". Maybe we can filter the other way around - select the overlapping QTL and remove those from the result set.
```
BIND ((?s2 >= ?s1 && ?e2 <= ?e1) || (?s1 >= ?s2 && ?e1 <= ?e2) as ?overlap) .
"10002",171.172,183.154,"10079",172.235,172.235,1
"10002",72.2551,73.3771,"10079",172.235,172.235,0
```
and now drop all ?t's that are overlapping. It appears to work with:
=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/doc/examples/show-qtls-two-traits.sparql
I'll need to test it on the pangenome set.
# Listing QTL
To get all QTL from a run you can use something like
```
SELECT DISTINCT ?t ?lod (count(?snp) as ?snps) ?chr ?s ?e WHERE {
?traitid a gnt:mappedTrait ;
gnt:traitId ?t .
MINUS { ?traitid gnt:run gn:test } # use if you want the original GEMMA QTL
# ?traitid gnt:run gn:test . # use if you want the new QTL
?qtl gnt:mappedQTL ?traitid ;
gnt:qtlChr ?chr ;
gnt:qtlLOD ?lod ;
gnt:qtlStart ?s ;
gnt:qtlStop ?e .
?qtl gnt:mappedSnp ?snp .
FILTER (?t = "10002" && ?lod >= 5.0 ) .
} LIMIT 100
```
Note we filter on a trait name and LOD score.
For panQTL (gnt:run == gn:test) this results in
```
"t" "lod" "snps" "chr" "start" "end"
"10002" 6.4 3 "15" 87.671663 98.028911
"10002" 6.4 12 "4" 56.498971 147.86044
"10002" 7 69 "1" 3.099543 192.718161
"10002" 7.5 2774 "8" 34.303454 116.023702
"10002" 6.2 7 "10" 82.334108 105.062097
"10002" 6.2 2 "3" 47.644513 82.451061
"10002" 6.2 1 "3" 130.145235 130.145235
"10002" 6 2 "10" 13.442071 13.442088
"10002" 6.2 9 "12" 21.707644 72.57041
```
For the traditional genotypes (gnt:run != gn:test)
```
"t" "lod" "snps" "chr" "start" "end"
"10002" 5.3 91 "1" 171.172 183.154
"10002" 5.1 15 "1" 72.2551 73.3771
```
# Listing SNPs
Now that we have all QTL in the DB, as well as the underlying SNPs, one interesting question to ask is which SNPs are repeated across our traits. This, if you remember, is the key idea of reverse genetics.
Of course, with our pangenome-derived genotypes we now have thousands of SNPs per trait. Let's see if we can rank them by the number of traits they appear in.
For our 1000 traits we map about 7.7M SNPs with a LOD>5.
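A query along these lines should rank SNPs by the number of traits they appear in. This is a sketch; the grouping assumes the gnt:mappedSnp/gnt:locus/gnt:lodScore layout used in the queries above:

```
SELECT ?locus (COUNT(DISTINCT ?traitid) AS ?ntraits) WHERE {
  ?traitid a gnt:mappedTrait .
  ?snp gnt:mappedSnp ?traitid ;
       gnt:locus ?locus ;
       gnt:lodScore ?lod .
  FILTER (?lod > 5.0)
} GROUP BY ?locus ORDER BY DESC(?ntraits) LIMIT 20
```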
# Using sparql from emacs
Note: if you are doing SPARQL quite a bit, I recommend using sparql-mode in Emacs! It is easy, faster, and you can use git :)
=> https://github.com/ljos/sparql-mode
```
M-x sparql-query-region [ENTER] http://sparql-test.genenetwork.org/sparql/ [ENTER]
```