From 6ff452c240839f5eb4a8d29dcc2acfeb96cba9bd Mon Sep 17 00:00:00 2001
From: SidiBlak
Date: Wed, 20 Dec 2023 02:41:02 -0600
Subject: create lisp pdf document management tools.

---
 topics/lmms/llm-metadata.gmi | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/topics/lmms/llm-metadata.gmi b/topics/lmms/llm-metadata.gmi
index 9fd2732..41dc44b 100644
--- a/topics/lmms/llm-metadata.gmi
+++ b/topics/lmms/llm-metadata.gmi
@@ -20,9 +20,9 @@ This development will be done in stages:
 * [ ] 9 - integrate new functionality into GN{2-3}
 
 ## Tasks for Priscilla
-* [-] 1 - Acquire 1000 research documents w.r.t. Genetics, genomics research on diabetes
+* [X] 1 - Acquire 1000 research documents w.r.t. Genetics, genomics research on diabetes
 * [X] 2 - Acquire 1000 more research documents w.r.t GeneNetwork.org
-* [-] 3 - Get bib data for documents and put in json format
+* [X] 3 - Get bib data for documents and put in json format
 
 ## Task for collaborator
 * [X] Build feedback into api for qualifying/rating references and answers
@@ -59,3 +59,14 @@ improvement suggestions from CTC
 * [ ] Ontology annotations in GN
 * [ ] Gene prioritization
 * [ ] Improving pdf to text algorithm
+
+Build Lisp tools for PDF document processing.
+A candidate library is cl-pdf: https://github.com/archimag/cl-pdf
+* [X] read directory structure
+* [ ] filter for PDF files
+* [ ] call the library to extract text from each PDF
+* [ ] create rules to remove headings, references, and appendices
+* [ ] create a JSON document containing the extracted text and bibliographic information
+* [ ] run the extracted text through a tokenizer
+* [ ] make the string tokenizer plug-and-play
+
-- 
cgit v1.2.3