# Large Language Models (LLMs) & Metadata

* assigned: soloshelby, priscilla
* contact: bonfacem
* keywords: gnsoc, LLMs, metadata

## Integrate an LLM Q&A system into gn.genenetwork.org

This development will be done in stages:

* [X] 1 - get API access to FahamuAI GeneNetwork Q&A system
* [X] 2 - create local Python Flask sandbox
* [X] 3 - build placeholder UI
* [X] 3.5 - integrate FahamuAI API into placeholder @ github.com/ShelbySolomonDarnell/GN-LLMs
* [X] 4 - create Guix package for Flask site
* [X] 5 - serve GNQA on tux02
* [-] 5.5 - get feedback from testers of GNQA
* [ ] 6 - improve GNQA after evaluating feedback
* [ ] 6.5 - add reference rating to GNQA
* [X] 7 - create UI for Q&A window that fits into current GN framework
* [X] 8 - put GNQA GN UI on cd.genenetwork.org for internal researcher access
* [ ] 9 - create CI/CD tests for new module
* [-] 10 - integrate new functionality into GN{2-3}
* [ ] 11 - use db to save queries with answers and references for users

## Tasks for Priscilla

* [X] 1 - Acquire 1000 research documents on genetics and genomics research on diabetes
* [X] 2 - Acquire 1000 more research documents on GeneNetwork.org
* [X] 3 - Get bib data for documents and put in JSON format

## Task for collaborator

* [X] Build feedback into API for qualifying/rating references and answers
* [X] Associate references with their titles rather than their document ids
* [ ] Build better document referencing using a document's bibliographic information

## Add GN metadata

* [ ] export GN RDF triples
* [ ] convert data of triples into plain English sentences
* [ ] submit triples-based sentences to Q&A LLM
* [ ] submit RDF metadata to an Oracle to support Q&A system truthfulness

## Set up system update protocol

These are all living systems that must be kept up to date. GN is in constant use for research, and we keep improving its design and functionality so that this remains true.
To keep the Q&A system up to date we must:

* [ ] create protocol to get new publications
* [ ] query web for new publications utilizing GN
* [ ] pull links to the newly found documents
* [ ] acquire the documents
* [ ] process documents for LLM

The National Library of Medicine's PubMed is a National Institutes of Health (NIH) system and one of the most widely used resources for researchers. PubMed is consistently updated by the NIH, so we must build a script to:

* [ ] poll its API on a regular basis
* [ ] download new citations
* [ ] parse citations and metadata for input into LLM
* [ ] upload new data into LLM

By keeping the main information sources of the GeneNetwork Q&A system up to date, the system grows with the knowledge base.

Add functionality that allows someone to submit documentation to the system, which is added after being reviewed by a specialist.

Improvement suggestions from CTC:

* [ ] Ontology annotations in GN
* [ ] Gene prioritization
* [ ] Improving PDF to text algorithm

Build Lisp tools for PDF document processing. A probable library to use can be found at https://github.com/archimag/cl-pdf

* [X] read directory structure
* [ ] filter PDF files
* [ ] call library to extract text from PDF
* [ ] create rules to remove headings, references, and appendices
* [ ] create JSON document with extracted data with bibliographical information
* [ ] take file text and run through a tokenizer
* [ ] make string tokenizer plug-n-play
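
The PDF processing steps listed above (filter PDF files, remove headings, references and appendices, build a JSON document, run a pluggable tokenizer) can be illustrated with a minimal sketch. It is written in Python only to match the existing Flask sandbox; the plan itself calls for Lisp tools, and the text extraction step is left out because it depends on the library chosen. All function names, the heading heuristic, and the record layout are assumptions made for illustration, not an agreed design.

```python
import json
import re
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Assumed rule: a line that is only "References", "Bibliography",
# "Appendix <label>" or "Appendices" marks the start of back matter.
BACK_MATTER_HEADING = re.compile(
    r"^\s*(references|bibliography|appendix(\s+\w+)?|appendices)\s*:?\s*$",
    re.IGNORECASE,
)


def filter_pdf_files(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose extension is .pdf (case-insensitive)."""
    return sorted(p for p in paths if Path(p).suffix.lower() == ".pdf")


def strip_back_matter(text: str) -> str:
    """Drop everything from the first back-matter heading onwards."""
    kept = []
    for line in text.splitlines():
        if BACK_MATTER_HEADING.match(line):
            break
        kept.append(line)
    return "\n".join(kept)


def drop_headings(text: str, max_words: int = 6) -> str:
    """Assumed heuristic: a short line without closing punctuation is a heading."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        is_short = len(stripped.split()) <= max_words
        ends_like_sentence = stripped[-1] in ".?!:;,"
        if is_short and not ends_like_sentence:
            continue
        kept.append(stripped)
    return "\n".join(kept)


def whitespace_tokenizer(text: str) -> List[str]:
    """Default tokenizer; any callable str -> list of str can replace it."""
    return text.split()


def build_record(
    text: str,
    bib: Dict[str, str],
    tokenizer: Callable[[str], List[str]] = whitespace_tokenizer,
) -> Dict[str, object]:
    """Combine cleaned text, tokens and bibliographic data in one JSON-ready dict."""
    body = drop_headings(strip_back_matter(text))
    return {"bibliography": bib, "text": body, "tokens": tokenizer(body)}
```

Passing the tokenizer as an argument is what makes it plug-n-play: `build_record(text, bib, tokenizer=my_tokenizer)` swaps in a different tokenizer without touching the cleaning rules, and `json.dumps(record)` serializes the result. The heading rule is deliberately crude and would also drop short genuine lines, so real rules would need tuning against the actual documents.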