author | SidiBlak | 2023-12-20 02:41:02 -0600 |
---|---|---|
committer | GitHub | 2023-12-20 02:41:02 -0600 |
commit | 6ff452c240839f5eb4a8d29dcc2acfeb96cba9bd (patch) | |
tree | 2fbffca4693e0241994dc95e0b435f43ecd6acc9 | |
parent | 798deb388638f13ed40ecc19eed8c53d44b6ab99 (diff) | |
create lisp pdf document management tools.
-rw-r--r-- | topics/lmms/llm-metadata.gmi | 15 |
1 files changed, 13 insertions, 2 deletions
diff --git a/topics/lmms/llm-metadata.gmi b/topics/lmms/llm-metadata.gmi
index 9fd2732..41dc44b 100644
--- a/topics/lmms/llm-metadata.gmi
+++ b/topics/lmms/llm-metadata.gmi
@@ -20,9 +20,9 @@ This development will be done in stages:
 * [ ] 9 - integrate new functionality into GN{2-3}
 
 ## Tasks for Priscilla
-* [-] 1 - Acquire 1000 research documents w.r.t. Genetics, genomics research on diabetes
+* [X] 1 - Acquire 1000 research documents w.r.t. Genetics, genomics research on diabetes
 * [X] 2 - Acquire 1000 more research documents w.r.t GeneNetwork.org
-* [-] 3 - Get bib data for documents and put in json format
+* [X] 3 - Get bib data for documents and put in json format
 
 ## Task for collaborator
 * [X] Build feedback into api for qualifying/rating references and answers
@@ -59,3 +59,14 @@ improvement suggestions from CTC
 * [ ] Ontology annotations in GN
 * [ ] Gene prioritization
 * [ ] Improving pdf to text algorithm
+
+Build lisp tools for pdf document processing.
+A probable library to use can be found at https://github.com/archimag/cl-pdf
+* [X] read directory structure
+* [ ] filter pdf files
+* [ ] call library to extract text from pdf
+* [ ] create rules to remove headings, references, and appendices
+* [ ] create json document with extracted data with bibliographical information
+* [ ] take file text and run through a tokenizer
+* [ ] make string tokenizer plug-n-play
+
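
The checklist added in the second hunk describes a small pipeline: walk a directory, keep the PDF files, extract text, drop trailing sections, tokenize, and emit JSON. The following is a minimal sketch of that flow, written in Python for illustration only; the commit itself plans Lisp tools and names cl-pdf as a probable library, neither of which is used here. Every name below (`find_pdfs`, `strip_tail_sections`, `build_record`, `whitespace_tokenizer`) is hypothetical, the text extractor is passed in as a placeholder callable, and the single regex rule is an assumed stand-in for the "rules to remove headings, references, and appendices" that the checklist leaves undefined.

```python
import json
import re
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Hypothetical plug-in types: any callable with these shapes can be swapped in.
Tokenizer = Callable[[str], List[str]]
Extractor = Callable[[Path], str]


def find_pdfs(root: Path) -> List[Path]:
    """Walk the directory tree and keep only files with a .pdf suffix."""
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Assumed rule: a line consisting only of one of these headings starts the
# part of the document to discard. Real documents will need more rules.
TAIL_HEADING = re.compile(
    r"^[ \t]*(?:references|bibliography|appendix\b.*|appendices)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_tail_sections(text: str) -> str:
    """Cut the text at the first references/appendix heading, if any."""
    match = TAIL_HEADING.search(text)
    return text[: match.start()].rstrip() if match else text


def whitespace_tokenizer(text: str) -> List[str]:
    """Default tokenizer: split on runs of whitespace."""
    return text.split()


def build_record(
    path: Path,
    extract: Extractor,
    tokenize: Tokenizer = whitespace_tokenizer,
    bib: Optional[Dict[str, str]] = None,
) -> dict:
    """Combine extracted text, tokens, and bibliographic data into one dict."""
    body = strip_tail_sections(extract(path))
    return {
        "file": path.name,
        "bib": bib or {},
        "text": body,
        "tokens": tokenize(body),
    }


def records_to_json(records: List[dict]) -> str:
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)
```

Passing the tokenizer and the extractor as arguments is one way to get the "plug-n-play" behaviour the last checklist item asks for: a different tokenizer or PDF backend can be supplied without touching `build_record`.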