
# Large Language Models (LLMs) & Metadata

* assigned: soloshelby, priscilla
* contact: bonfacem
* keywords: gnsoc, LLMs, metadata

## Integrate an LLM Q&A system into gn.genenetwork.org
This development will be done in stages:
* [X] 1 - get API access to FahamuAI GeneNetwork Q&A system
* [X] 2 - create local python Flask sandbox
* [X] 3 - build placeholder UI
* [X] 3.5 - integrate FahamuAI API into placeholder @ github.com/ShelbySolomonDarnell/GN-LLMs
* [X] 4 - create guix package for flask site
* [X] 5 - Serve GNQA on tux02 
* [-] 5.5 - Get feedback from testers of GNQA
* [ ] 6 - Improve GNQA after evaluating feedback
* [ ] 6.5 - Add reference rating to GNQA
* [X] 7 - create UI for Q&A window that fits into current GN framework
* [X] 8 - put GNQA GN UI on cd.genenetwork.org for internal researcher access
* [ ] 9 - create CI/CD tests for new module
* [-] 10 - integrate new functionality into GN{2-3}
* [ ] 11 - Use db to save queries with answers and references for users
 
## Tasks for Priscilla
* [X] 1 - Acquire 1000 research documents on genetics and genomics research into diabetes
* [X] 2 - Acquire 1000 more research documents on GeneNetwork.org
* [X] 3 - Get bibliographic data for the documents and put it in JSON format

## Tasks for collaborator
* [X] Build feedback into api for qualifying/rating references and answers
* [X] Associate references with their titles rather than their document ids
* [ ] Build better document referencing using a document's bibliographic information

## Add GN metadata
* [ ] export GN RDF triples
* [ ] convert data of triples into plain English sentences
* [ ] submit triples-based sentences to Q&A LLM
* [ ] submit RDF metadata to an Oracle to support Q&A system truthfulness
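
The conversion step above (triples into plain English sentences) could be sketched as follows. This is a minimal illustration, not the project's implementation: the predicate names, the phrase table, and the helper names are all hypothetical, and the real GN RDF vocabulary may differ.

```python
import re

# Hypothetical phrase table: the predicates GN actually exports may differ.
PREDICATE_PHRASES = {
    "rdfs:label": "is labelled",
}

def local_name(term):
    """Return the part of an IRI or prefixed name after the last '#', '/' or ':'."""
    return re.split(r"[#/:]", term)[-1]

def humanize(name):
    """Turn 'usesTissue' or 'uses_tissue' into 'uses tissue'."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name.replace("_", " "))
    return spaced.lower()

def triple_to_sentence(subject, predicate, obj):
    """Render one (subject, predicate, object) triple as an English sentence.

    Assumption: literal objects arrive wrapped in double quotes,
    everything else is an IRI or a prefixed name.
    """
    phrase = PREDICATE_PHRASES.get(predicate) or humanize(local_name(predicate))
    if obj.startswith('"') and obj.endswith('"'):
        obj_text = obj[1:-1]
    else:
        obj_text = local_name(obj)
    return f"{local_name(subject)} {phrase} {obj_text}."

if __name__ == "__main__":
    print(triple_to_sentence("gn:BXD", "rdfs:label", '"BXD family"'))
    print(triple_to_sentence("gn:HC_M2_0606_P", "gn:usesTissue", "gn:hippocampus"))
```

Predicates missing from the phrase table fall back to a humanized form of their local name, so the table only needs entries where the automatic wording reads badly.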

## Set up system update protocol
These are all living systems that must be kept up-to-date.
GN is in constant use for research, and we keep improving its design and functionality so that it stays that way.
In order to keep the Q&A system up-to-date we must:
* [ ] create protocol to get new publications
* [ ] query web for new publications utilizing GN
* [ ] pull links to the newly found documents
* [ ] acquire the documents
* [ ] process documents for LLM

The National Library of Medicine's PubMed is a National Institutes of Health (NIH) system and one of the most widely used resources for researchers.
PubMed is consistently updated by the NIH, so we must build a script to:
* [ ] poll its API on a regular basis
* [ ] download new citations
* [ ] parse citations and metadata for input into LLM
* [ ] upload new data into LLM
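
The polling and download steps above might start from something like this sketch, which targets the NCBI E-utilities `esearch` endpoint. The search term, the seven-day window, and the function names are assumptions for illustration; the network fetch itself is left out so the example runs offline.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, days_back=7, max_results=100):
    """Build an esearch query for PubMed records added in the last `days_back` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days_back,   # only records from the last N days
        "datetype": "edat",     # filter on the Entrez (entry) date
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"

def parse_pmids(response_text):
    """Extract the list of PubMed IDs from an esearch JSON response."""
    payload = json.loads(response_text)
    return payload.get("esearchresult", {}).get("idlist", [])

def new_pmids(response_text, seen):
    """Keep only the IDs that were not already processed (`seen` is a set)."""
    return [pmid for pmid in parse_pmids(response_text) if pmid not in seen]

if __name__ == "__main__":
    # "genenetwork" is an assumed search term, not a vetted query.
    print(build_esearch_url("genenetwork"))
```

A scheduled job could fetch the URL, pass the response body to `new_pmids` together with the set of IDs already stored, and hand the remaining IDs to the citation download step.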

By keeping the main information sources of the GeneNetwork Q&A system up-to-date, the system grows with the knowledge base.

Add functionality that allows someone to submit documentation to the system, which is added after being reviewed by a specialist.

Improvement suggestions from CTC:
* [ ] Ontology annotations in GN
* [ ] Gene prioritization
* [ ] Improving pdf to text algorithm

Build Lisp tools for PDF document processing.
A candidate library is cl-pdf, found at https://github.com/archimag/cl-pdf
* [X] read directory structure
* [ ] filter pdf files
* [ ] call library to extract text from pdf
* [ ] create rules to remove headings, references, and appendices
* [ ] create JSON document containing the extracted text and its bibliographical information
* [ ] take file text and run through a tokenizer
* [ ] make string tokenizer plug-n-play
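
The shape of the pipeline above can be sketched in a few lines. Note that this illustration is in Python rather than Lisp, it skips the actual PDF text extraction (the text is assumed to be already extracted), and its single rule only cuts a document at the first references or appendix heading; the rule, the record layout, and all function names are assumptions.

```python
import json
import re
from pathlib import Path

def find_pdfs(root):
    """Walk a directory tree and keep only files with a .pdf extension."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Assumed rule: a line holding only a back-matter heading ends the main text.
BACK_MATTER = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(?:references|bibliography|appendix(?:[ \t]+\w+)?|appendices)"
    r"[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def strip_back_matter(text):
    """Drop everything from the first references/appendix heading onwards."""
    match = BACK_MATTER.search(text)
    return text[:match.start()].rstrip() if match else text

def whitespace_tokenizer(text):
    """Default tokenizer; any callable mapping str to a list of str can replace it."""
    return text.split()

def build_record(text, bibliography, tokenizer=whitespace_tokenizer):
    """Combine cleaned text, tokens and bibliographic data into one JSON-ready dict."""
    body = strip_back_matter(text)
    return {"bibliography": bibliography, "text": body, "tokens": tokenizer(body)}

if __name__ == "__main__":
    sample = "Intro\nGene X affects trait Y.\nReferences\n1. Smith 2020"
    print(json.dumps(build_record(sample, {"title": "Example"}), indent=2))
```

Passing the tokenizer as an argument is one way to make it plug-n-play: a different tokenizer can be swapped in without touching the rest of the pipeline.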