author: SidiBlak 2023-12-20 02:41:02 -0600
committer: GitHub 2023-12-20 02:41:02 -0600
commit: 6ff452c240839f5eb4a8d29dcc2acfeb96cba9bd
tree: 2fbffca4693e0241994dc95e0b435f43ecd6acc9
parent: 798deb388638f13ed40ecc19eed8c53d44b6ab99
create lisp pdf document management tools.
-rw-r--r-- topics/lmms/llm-metadata.gmi 15
1 files changed, 13 insertions, 2 deletions
diff --git a/topics/lmms/llm-metadata.gmi b/topics/lmms/llm-metadata.gmi
index 9fd2732..41dc44b 100644
--- a/topics/lmms/llm-metadata.gmi
+++ b/topics/lmms/llm-metadata.gmi
@@ -20,9 +20,9 @@ This development will be done in stages:
* [ ] 9 - integrate new functionality into GN{2-3}
## Tasks for Priscilla
-* [-] 1 - Acquire 1000 research documents w.r.t. Genetics, genomics research on diabetes
+* [X] 1 - Acquire 1000 research documents w.r.t. Genetics, genomics research on diabetes
* [X] 2 - Acquire 1000 more research documents w.r.t GeneNetwork.org
-* [-] 3 - Get bib data for documents and put in json format
+* [X] 3 - Get bib data for documents and put in json format
## Task for collaborator
* [X] Build feedback into api for qualifying/rating references and answers
@@ -59,3 +59,14 @@ improvement suggestions from CTC
* [ ] Ontology annotations in GN
* [ ] Gene prioritization
* [ ] Improving pdf to text algorithm
+
+Build lisp tools for pdf document processing.
+A probable library to use can be found at https://github.com/archimag/cl-pdf
+* [X] read directory structure
+* [ ] filter pdf files
+* [ ] call library to extract text from pdf
+* [ ] create rules to remove headings, references, and appendices
+* [ ] create json document with extracted data with bibliographical information
+* [ ] take file text and run through a tokenizer
+* [ ] make string tokenizer plug-n-play
+
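
The checklist added in this commit describes a small pipeline: walk a directory, keep the PDF files, extract text, drop back matter, tokenize, and emit JSON with bibliographic data. A minimal sketch of that flow follows. It is written in Python purely for illustration and is not the planned Lisp implementation. All names here (`find_pdfs`, `strip_back_matter`, `build_record`, `whitespace_tokenizer`) are hypothetical, the back-matter rule is a crude heuristic that covers only reference and appendix sections (not general heading removal), and PDF text extraction is left out because it depends on the library that is finally chosen.

```python
import json
from pathlib import Path
from typing import Callable

# A tokenizer is any function from text to a list of tokens, which is
# what makes it "plug-n-play": callers can pass their own.
Tokenizer = Callable[[str], list[str]]


def whitespace_tokenizer(text: str) -> list[str]:
    """Default tokenizer (assumption): split on runs of whitespace."""
    return text.split()


def find_pdfs(root: Path) -> list[Path]:
    """Walk the directory tree and keep only files ending in .pdf."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def strip_back_matter(text: str) -> str:
    """Drop everything from the first reference or appendix heading on.

    Heuristic only: a line counts as such a heading if it is exactly
    'references' or 'bibliography', or starts with 'appendix'.
    """
    kept = []
    for line in text.splitlines():
        marker = line.strip().lower()
        if marker in ("references", "bibliography") or marker.startswith("appendix"):
            break
        kept.append(line)
    return "\n".join(kept)


def build_record(path: Path, text: str, bib: dict,
                 tokenizer: Tokenizer = whitespace_tokenizer) -> dict:
    """Combine cleaned text, tokens, and bibliographic data for one file."""
    body = strip_back_matter(text)
    return {
        "file": path.name,
        "bib": bib,
        "text": body,
        "tokens": tokenizer(body),
    }


if __name__ == "__main__":
    # Already-extracted text stands in for the output of a PDF library.
    sample = "Diabetes and genetics\nSome findings.\nReferences\n[1] A paper"
    record = build_record(Path("example.pdf"), sample, {"title": "Example"})
    print(json.dumps(record, indent=2))
```

Passing the tokenizer as a plain function argument keeps it swappable without touching the rest of the pipeline, which is one simple way to meet the last item on the checklist.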