From 3dd0254e1f8bb482b0abc6f461c6f1b31f2ed681 Mon Sep 17 00:00:00 2001
From: Munyoki Kilyungi
Date: Wed, 18 Feb 2026 17:33:24 +0300
Subject: Update issue.

Signed-off-by: Munyoki Kilyungi
---
 issues/ai/search.gmi | 87 ++++++++++++++++++++++++----------------------------
 1 file changed, 40 insertions(+), 47 deletions(-)

diff --git a/issues/ai/search.gmi b/issues/ai/search.gmi
index 084b3ba2..54443c30 100644
--- a/issues/ai/search.gmi
+++ b/issues/ai/search.gmi
@@ -66,19 +66,15 @@ LIMIT 10;
 
 ### Build search corpus wth phenotype metadata
 
-Given the extensive training of LLM on text data, a naive approach would be converting the RDF graph related to phenotypes to text. We agreed on that with Bonz.
-
-I wrote something similar some time back for GNAgent:
+* Naive approach (old way). Convert RDF graph from ttl-files -> json output:
 => https://github.com/genenetwork/gn-ai/blob/383f89441d7787023eaf1e2926c0dedca256fe1a/gnagent/utils/rdf2partial_text.py
 
 This script uses ttl files extracted from the SPARQL endpoint. As such, prefixes were replaced with full namespaces in the results.
 
-In our case, I think I could just work with the ttl files originally generated by Bonz. For this, I however need to adapt the script because of the reason mentioned above. Basically, I need to remove namespaces from the logic and only work with prefixes. The adapted code is at:
-=> https://github.com/genenetwork/gn-ai/blob/main/ai_search/utils/new_rdf2partial_text.py
-
-Running it on balg01 generated a dictionary where keys represent subjects and values are list of predicates associated to the corresponding subject. Redundant objects for a specific subject were discarded.
+* [X] Adapted code (need to remove namespace from logic and only work prefixes):
+=> https://github.com/genenetwork/gn-ai/blob/383f89441d7787023eaf1e2926c0dedca256fe1a/aisearch/utils/new_rdf2partial_text.py
 
-Next, subjects and objects were linked to form English-like sentences with the following logic - at the first build time:
+Output: keys -> subjects; values -> list of predicates. Removes redundant objects. Subjects and objects are linked to form English-like sentences:
 ```
 docs = []
 for key in tqdm(collection):
@@ -89,8 +85,7 @@ for key in tqdm(collection):
     docs.append(concat)
 ```
 
-See function corpus_to_docs of:
-=> https://github.com/genenetwork/gn-ai/commit/6edd0a8cd997d4d384d880a771f9c2e817888909
+=> https://github.com/genenetwork/gn-ai/blob/383f89441d7787023eaf1e2926c0dedca256fe1a/aisearch/src/rag.py#L82 See function corpus_to_docs
 
 Documents look like:
 
@@ -107,14 +102,15 @@ gnt:family is/has rdfs:domain gnc:set . ", "gnt:short_name is/has a owl:ObjectPr
 
 ### Design RAG
 
-I built a simple RAG system that answers a question based on a corpus. Given the fragility of LLM system, I leveraged DSPy. This should also make it easy to switch between proprietary and open models.
+* [X] (johannesm) Simple RAG that answers questions based off a corpus. Used DSPs to switch between different providers:
 
 You can inspect full implementation details at:
-=> https://github.com/genenetwork/gn-ai/commit/6edd0a8cd997d4d384d880a771f9c2e817888909
+=> https://github.com/genenetwork/gn-ai/blob/383f89441d7787023eaf1e2926c0dedca256fe1a/aisearch/src/rag.py rag.py
 
 ### Create system prompt
 
-To get the system return a concise answer and point to specific URLs for verification, the system prompt is probably the most important and dynamic part. My first draft had the following instructions:
+
+* [X] (johannesm) First system level prompt draft. Aim: get the system to return concise answer with links for URLs.
 ```
 You are an expert in biology and genomics.
 You excel at leveraging the data or context you have been given to address any user query.
@@ -145,28 +141,23 @@ Do not make any mistakes.
 
 ```
 
-I provided mapping between prefix and namespace to teach the model how to generate the URLs. I will probably have to do a number of experimentations :)
+* [X] (johannesm) Provided mapping between prefix and namespace to teach the model how to generate the URLs.
+* [X] (johannesm) Do a number of experimentations to improve above.
 
 
 ### Design a proper JSON output format for the system
 
-It is extremely useful to control the system by defining an output format. This should also help parse output to other tools when the time comes.
-
-Reviewing the options...
-
-- Asking the LLM to format its output as JSON is one way
-
-I could just let the LLM format the output as JSON. But the JSON generated might not be valid.
+It is useful to control the system by defining an output format. This should also help parse output to other tools when the time comes.
 
-- Passing JSON format as example in prompt
+Reviewing the options:
 
-Another option is to predefine the format of the JSON and pass it in the prompt to the system. Also, some models might deviate from the instructions.
+* (a) Asking the LLM to format its output as JSON. Delegate the JSON formatting to LLM. Risk: output JSON may be invalid.
+* (b) Passing JSON format as example in prompt. Pre-define JSON format; pass it in the prompt. Risk: Some models may deviate from the instructions.
+* (c) Defining an output schema the LLM needs to comply to. DSPy offers an adapter (JSONAdapter). Model independent.
 
-- Defining a schema
+Went with Option (c).
 
-Finally, I could define an output schema the LLM needs to comply to. DSPy offers an adapter (JSONAdapter) that facilitates its implementation. This is regardless of the model used with the system.
-
-I decided to go for the last option because of robustness. I created a schema using pydantic BaseModel and used with the DSPy predictor as below:
+* [X] Create schema using pydantic ""BaseModel"" and used it with the DSPy predictor:
 
 ```
 class Information(BaseModel):
@@ -188,9 +179,9 @@ class Generate(dspy.Signature):
 ```
 
 See the workings at:
-=> https://github.com/genenetwork/gn-ai/commit/be82b4d19e6a56f2fe04872bd65da84d7805d824
+=> https://github.com/genenetwork/gn-ai/blob/383f89441d7787023eaf1e2926c0dedca256fe1a/aisearch/src/config.py
 
-I also iterated on the system prompt. Now it is:
+* [X] (johannesm) Iterate on the system prompt. Now it is:
 
 ```
 You excel at addressing search query using the context you have. You do not mistakes.
@@ -221,21 +212,22 @@ geoSeries => http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc
 
 ### Teach model how to build link to trait result page in CD
 
-From conversations with Bonz, trait id and dataset are coded and link to the result page. We can leverage that to teach the system how to build CD link when it has access to the trait id and the dataset name for a specific trait in GN.
+Trait Id and dataset can be linked to a result page. See this URL:
+
+=> https://cd.genenetwork.org/show_trait?trait_id=10027&dataset=BXDPublish
 
-RDF codes trait as "dataset name" + "trait id" under a specific namespace.
+Trait id: 10027; dataset name: BXDPublish. In RDF this is is:
 
-Example: https://rdf.genenetwork.org/v1/id/trait_BXDPublish_16339
+=> https://rdf.genenetwork.org/v1/id/trait_BXDPublish_NOwdrwlHIC
 
-This specific trait has an id of 16339 in the BXDPublish GN table (dataset name).
+That trait has an alias encoded as "owl:equivalentClass BXDPublish_10027."
 
-The corresponding link in CD to the Trait result page is: https://cd.genenetwork.org/show_trait?trait_id=16339&dataset=BXDPublish
+To build a result page from RDF, we need a trait's unique identifer which can be queried from RDF.
 
-It is just a matter of replacing the trait id and the dataset name in the URL parameters.
+* [X] (johannesm) use prompt engineering to get LLM to make above substitution. Model finetuning for this is too expensive.
 
-The only ways (at least those I can think of) to get an LLM make that substitution in the URL is through prompt engineering and model finetuning. Model finetuning seems a bit too much given that we only want to modify a few links in the output. In addition, it is very expensive. On the other hand, prompt engineering is quick to implement. I am going to provide examples like the previous one in the system prompt to help LLM translate RDF links for traits to valid CD links to trait result page.
+* [X] (johannes) Provide more system prompting examples (I.e. translate RDF links for traits to valid trait result page):
 
-New system prompt:
 ```
 You excel at addressing search query using the context you have. You do not make mistakes.
 Extract answers to the query from the context and provide links associated with each RDF entity.
@@ -268,9 +260,9 @@ New trait link: https://cd.genenetwork.org/show_trait?trait_id=16339&dataset=BXD
 \n
 ```
 
-This was enough to get the system return valid CD links with Claude models :)
+Above was enough to get the system to return valid CD links with Claude models :)
 
-Here is an example.
+Another example:
 
 ```
 Query: What are the traits related to the BXD?
@@ -319,16 +311,17 @@ System feedback:
 Next thing we want to do is packaging. Previous setup had logic and execution codes mixed. I cleaned that by moving all execution codes to `main.py`. Check it out at:
 => https://github.com/genenetwork/gn-ai/commit/8193d6adcd210b94de88fbeadeaf4353d6df3923
 
-I realized that `main.py` is not a good module name. Changing it to `search.py`. I also take the opportunity to do a few cleaning.
-New code is at:
+* [X] (johannesm) Name "main.py" -> "search.py". Clean-up:
 => https://github.com/genenetwork/gn-ai/commit/9bbbca60c91a69db66e57688fd7879682ac7ce5b
 
-Now that everything is well modularized I can attempt to packaging. I used poetry :)
-
-After a number of tests and fixes, I have managed to pull it off. Dependencies can be inspected at:
-=> https://github.com/genenetwork/gn-ai/blob/main/aisearch/pyproject.toml
+* [X] (johannesm) Use poetry for packaging.
+=> https://github.com/genenetwork/gn-ai/blob/main/aisearch/pyproject.toml Poetry Dependencies
 
-I also tried uploading to TestPyPI and finally PyPI. Package can now be installed with pip. See documentation:
+* [X] (johannesm) Upload package to PyPI:
 => https://github.com/genenetwork/gn-ai/blob/main/aisearch/README.md
 
-I think AI search (GNAIS) can be loaded as a module in any GeneNetwork code and used, provided that parameters for the search are defined.
\ No newline at end of file
+AI search (GNAIS) can be loaded as a module in any GeneNetwork code and used, provided that parameters for the search are defined.
+
+## For Later (Nice To Haves)
+
+* [ ] (bonfacem) Add package to guix-bioinformatics. Questions: langchain support?
--
cgit 1.4.1
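
Note on the trait-link step in the patch above: the issue has the LLM rewrite RDF trait references into CD result-page URLs by way of prompt examples. For comparison only, the same substitution can be sketched deterministically in Python. This is a minimal sketch and not code from gn-ai: the function name `trait_alias_to_cd_link` is hypothetical, and it assumes the alias always has the form `<dataset name>_<trait id>` (as in `BXDPublish_10027`), with the trait id following the last underscore.

```python
from urllib.parse import urlencode

# Base URL of the trait result page, taken from the example link in the issue.
CD_SHOW_TRAIT = "https://cd.genenetwork.org/show_trait"


def trait_alias_to_cd_link(alias: str) -> str:
    """Build a CD trait result-page URL from an alias such as 'BXDPublish_10027'.

    Hypothetical helper: assumes the alias is '<dataset>_<trait id>'.
    """
    # Split on the LAST underscore, in case a dataset name contains underscores
    # (an assumption; the issue only shows 'BXDPublish').
    dataset, sep, trait_id = alias.rpartition("_")
    if not sep or not dataset or not trait_id:
        raise ValueError(f"unexpected trait alias: {alias!r}")
    # urlencode keeps the insertion order of the dict and escapes the values.
    query = urlencode({"trait_id": trait_id, "dataset": dataset})
    return f"{CD_SHOW_TRAIT}?{query}"


print(trait_alias_to_cd_link("BXDPublish_10027"))
# -> https://cd.genenetwork.org/show_trait?trait_id=10027&dataset=BXDPublish
```

A helper of this kind could serve as a check on links the model returns, under the same assumption about the alias format.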