# Build an AI system for GN
## Tags
* type: feature
* assigned: johannesm
* priority: medium
* status: in progress
* keywords: llm, rag, ai, agent
## Description
The aim is to build an AI system/agent/RAG able to digest mapping results and metadata in GN for scaling analysis. This is not quite possible at the moment, given that one still needs to dig through and compare that type of information manually. And the data in GN is rather big for such an approach :)
I made an attempt using deep learning for my Masters project. It could work, but it required further processing of the results for interpretation. Not quite handy! Instead, we want a system that takes care of all the work (or at least most of it) and that we can understand. This is how transformers and LLMs came into the picture.
This work is an extension of the GNQA system initiated by Shelby and Pjotr.
## Tasks
* [X] Look for transformer model ready for use and try
* [X] Build a RAG system and test with small corpus of mapping results
* [X] Experiment with actual mapping results and metadata
* [X] Move from RAG to agent
* [ ] Scale analysis to more data
* [ ] Compare performance of open LLMs with Claude in the system
### Look for transformer model ready for use and try
Given the success of transformers, I was first encouraged by Pjotr to look for a model that can support different types of data, i.e. numerical (mapping results) vs textual (metadata).
I found TAPAS which:
* takes data of different types in tabular format
* takes a query or question in the form of text
* performs operations on rows of the data table
* retrieves relevant information
* returns an answer to the original query
Experiments were ongoing when Rob found, with the help of Claude, that this architecture would not go far. I know, we used an AI to assist our work on AI (at least we did not ask an AI to do the whole job from the start :))
But it was a good point. TAPAS is relatively old, and a lot of progress has been made with LLMs and agents since!
To take advantage of all the progress made with LLMs, I needed a way to work with text data only, since LLMs are trained to understand and work with text. The metadata, being RDF, is already in text format. I only needed to convert the mapping results to text. It is a detour worth the shot if it gives more flexibility and saves development time!
### Build a RAG system and test with a small corpus of mapping results
I read a number of books and found that RAG systems are pretty easy to design with LangChain. A RAG system is made of two components:
* search and retrieval -> need a keyword search algorithm or embedding model
* response generation -> need an LLM
The system takes a query and returns a response based on the retrieved information. Very efficient at increasing the factuality of LLM responses!
I wrote a first draft with:
1. BM25 keyword search algorithm and Qwen3-Embedding-0.6B model
This is a hybrid search subsystem. It searches both by keyword and by meaning, and should be more scalable. The subsystem:
* takes the query
* processes it
* extracts the most similar documents or data chunks
* passes the results to the LLM for further processing
2. Qwen2.5-72B-Instruct LLM
This is the response subsystem. It:
* takes the original query
* takes retrieved information
* reasons using a tree-of-thought mechanism for validation
* generates an answer to the query using the retrieved information
Small bonus
I suspected that more than one query might be required to address some questions. I needed a way to make the system conversational, so that one could ask several questions without information being lost in between. This came down to adding memory to the system. So, I added a small model for memory propagation. The model was:
3. Phi-3-mini-4k-instruct
It uses the model together with the ConversationSummary class from LangChain to summarize key information from interactions, keeping it accessible between sessions.
Time to test!
* Query 1
```
Question: What is the lod for trait leptin receptor EPFLMouseLiverCDEx0413 at position 100?
Response:
**Expert 1:**
The question asks for the LOD score at position 100. From the relevant information, I see that there is a LOD score of 1 at position 100.
**Expert 2:**
I agree with Expert 1. The relevant information clearly states that the LOD score at position 100 is 1.
**Expert 3:**
I concur with both Experts 1 and 2. The LOD score for the trait leptin receptor EPFLMouseLiverCDEx0413 at position 100 is indeed 1.
**Conclusion:**
All experts agree that the LOD score for the trait leptin receptor EPFLMouseLiverCDEx0413 at position 100 is 1. <|end|>
```
Since the corpus used for testing was small, I could inspect it to confirm the accuracy of the response generated by the RAG. The lookup revealed that it was correct, yay!
* Query 2
```
Question: Which position of leptin receptor EPFLMouseLiverCDEx0413 was examined in the previous prompt?
Response:
**Expert 1:**
The previous prompt asked for the LOD score at position 100. Therefore, the position examined in the previous prompt is 100.
**Expert 2:**
I agree with Expert 1. The previous prompt specifically asked about the LOD score at position 100. Thus, the position examined is 100.
**Expert 3:**
I concur with both Experts 1 and 2. The previous prompt clearly specified that it was asking for the LOD score at position 100. Therefore, the position examined in the previous prompt is 100.
**Conclusion:**
All experts agree that the position of leptin receptor EPFLMouseLiverCDEx0413 examined in the previous prompt is 100. <|end|>
```
This was also correct. I was very excited, Rob too!
I received feedback that made me clarify the meaning of position. I rebuilt the small corpus using a BXD dataset of traits I ran GEMMA on for my Masters project, making sure to use actual marker positions this time and also including marker names.
Let's experiment again!
I got results similar to the ones above, with the exception that marker positions are now real and marker names are supported.
I faced a challenge though :(
For queries that require combining different data chunks or documents (non-atomic queries), the system does not perform well. For example, for the query
* How many traits hepatic nuclear factor 4 are in the datasets?
the system was confused. Even after prompt engineering, the generated answer was not accurate. For the query
* Identify 2 traits that have similar lod values on chromosome 1 position 3010274
the system sometimes missed both traits or caught only one trait having a LOD value at that position.
This is probably because the system cannot execute more than one retrieval run. To get there, I needed to make the RAG more autonomous: this is how the concept of an agent came up.
### Experiment with actual mapping results and metadata
Building an agent required more reading. In the meantime, I decided to get actual mapping results and metadata for experimentation. It would be sad to proceed if the system turned out to be incompatible with the data to be used in production :)
I waited for Pjotr to precompute GEMMA association results and export them with metadata to an endpoint. The RDF schema was very interesting to learn, and Bonz did some work on that in the past :)
You can check out recent developments of Pjotr's work here:
=> https://issues.genenetwork.org/topics/systems/mariadb/precompute-publishdata
For Bonz's work, see:
=> https://github.com/genenetwork/gn-docs/blob/master/rdf-documentation/phenotype-metadata.md
Anyway, it took some time, but I finally got a look at the data.
This started with the metadata from an old endpoint created by Bonz. I also had to learn SPARQL - I was quite new to it!
We thought LLMs could make sense of RDF data directly (it is still text), but it turns out they cannot. They can recognize that it is RDF, but in between all the URIs they start making mistakes quite quickly. Instead of using RDF natively, we decided to use LLMs to first convert RDF data - both metadata and mapping results - to natural text before feeding it to the RAG system. The system should do better that way, and we confirmed that!
Pjotr made available the first version of the global endpoint. Nothing should stop me now :) I wrote a script to fetch metadata from the endpoint. I had not been sharing my code so far; let me fix that right now. You can find the script here:
=> https://github.com/johanmed/gn-rag/blob/main/fetch_metadata.py
Pjotr also made available the ttl files in my home directory on balg01 - full flexibility!
I naturalized some RDF triples. The corpus now looked like this:
```
The phenotype identified as BXD_19926, or abbreviated as TAG(48:3)_HFD, is part of the EPFL LISP3 Study. It's a measurement of Liver Lipidomics: TAG48:3_HFD, which has side chain FA composition [TG12:0_18:1_18:2,TG14:0_16:0_18:3,TG14:0_16:1_18:2,TG14:0_16:2_18:1,TG14:1_16:0_18:2,TG14:1_16:1_18:1,TG16:0_16:1_16:2,TG16:1_16:1_16:1] belonging to the lipid class triacylglycerol TAG. This phenotype was measured by MS [R.U.AUC] in overnight fasted male mice at 29 weeks of age, who were on a high fat diet (60% kCal/fat HarlanTD.06414).
The phenotype was submitted by Jha P and contributed to by Auwerx J. It belongs to the group setBxd and has a sequence number of 1. The mean value for this phenotype is 0.00410489, with an additive effect of 0.00066854. The LOD score for this phenotype is 2.91779.
The locus associated with this phenotype is Rsm10000001653.
```
It was time to proceed to testing. The results I show below are not exactly from the RAG system I described above; I was improving the RAG in parallel. So watch out, I will explain everything in the next task :)
The goal was to try a complex query the previous system failed on. You can see that the question was indeed not atomic.
```
Question: Identify traits having a lod score > 4.0
Response:
In addressing the query to identify traits with a LOD score exceeding 4.0, it's pertinent to highlight three mapped loci that meet these criteria: Rsm10000011038_BXDPublish_17086_gemma_GWA_19ca1aeb (LOD: 4.5), Rsm10000000185_BXDPublish_12641_gemma_GWA_8c8d0548 (LOD: 4.14) and Rsm10000000559 (LOD: 4.61). These findings underscore the significance of these loci in genetic studies, given their high LOD scores indicative of strong linkage evidence.
```
It is already better. At least the system combined information from different data or document chunks. The retrieved chunks were:
```
[[Document(metadata={}, page_content='\nThe mapped locus associated with Rsm10000001536 has a LOD score of 2.07, and the mapped SNP is identified as GEMMAMapped_LOCO_BXDPublish_24451_gemma_GWA_aecf628e. The effect size for this locus is 0.01, with an allele frequency (AF) of 0.333.\n '),
 Document(metadata={}, page_content='\nThe mapped locus Rsm10000011536 is associated with a LOD score of 5.69, an effect size of 0.385 and an allele frequency of 0.526. This locus has been mapped to the SNP GEMMAMapped_LOCO_BXDPublish_20320_gemma_GWA_6832c0e4.\n '),
 Document(metadata={}, page_content='\nThe mapped locus, Rsm10000000185_BXDPublish_12641_gemma_GWA_8c8d0548, has an effect size of -3.137 and a LOD score of 4.14. This locus is associated with the mapped SNP GEMMAMapped_LOCO_BXDPublish_12641_gemma_GWA_8c8d0548, and it has an allele frequency of 0.556.\n '),
 Document(metadata={}, page_content='\nIn plain English, this data refers to a mapped locus associated with the Rsm10000011038_BXDPublish_17086_gemma_GWA_19ca1aeb identifier. This locus is linked to the Rsm10000011038 identifier, has an effect size of -0.048, a LOD score of 4.5, and an allele frequency (AF) of 0.167. The mapped SNP associated with this data can be found under the GEMMAMapped_LOCO_BXDPublish_17086_gemma_GWA_19ca1aeb identifier.\n '),
 Document(metadata={}, page_content='\nIn plain English, the data describes a genetic locus identified as Rsm10000000559. This locus was mapped through an effect size of -34.191, with an allele frequency of 0.438. The mapping achieved a LOD score of 4.61, indicating the statistical significance of this genetic association. The mapped locus is associated with a specific SNP (Single Nucleotide Polymorphism) identified as GEMMAMapped_LOCO_BXDPublish_12016_gemma_GWA_bc6adcae.\n ')]]
```
### Move from RAG to agent
This is where I made the system more autonomous, i.e. agentic. I am now going to explain how I did it.