author Pjotr Prins 2026-03-27 11:38:42 +0100
committer Pjotr Prins 2026-03-27 11:38:42 +0100
commit 088240be9ef1c014bf10fb64a8a80fdc278f19db (patch)
tree b6c7871632209cf5e15952ba39f1e2b2e8475432
parent afa3fd534a558fb2ea11f8c40df968635d4291c7 (diff)
Add instruction (README) to install punkt
-rw-r--r-- README.md 14
1 files changed, 13 insertions, 1 deletions
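
Note on the README steps in the patch below: with `NLTK_DATA=./minipubmed`, NLTK resolves the tokenizer as `tokenizers/punkt` under that directory, so the unzip step should leave a `minipubmed/tokenizers/punkt/` directory behind. A minimal sketch of a sanity check follows. `check_punkt` is a hypothetical helper, not part of this commit or of GeneCup, and it only inspects the directory layout rather than loading the tokenizer.

```sh
# Hypothetical helper (not part of this commit): succeed only when the punkt
# tokenizer has been unpacked under the given NLTK data directory.
check_punkt() {
    # Default to the data directory used in the README; this is an assumption.
    data_dir="${1:-./minipubmed}"
    if [ -d "$data_dir/tokenizers/punkt" ]; then
        echo "punkt found in $data_dir/tokenizers/punkt"
    else
        echo "punkt missing: expected $data_dir/tokenizers/punkt" >&2
        return 1
    fi
}
```

Calling `check_punkt ./minipubmed` from the GeneCup checkout before starting the server would likely catch a misplaced unzip earlier than a tokenizer lookup failure at request time.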
diff --git a/README.md b/README.md
index d824f9c..08676b3 100644
--- a/README.md
+++ b/README.md
@@ -44,12 +44,24 @@ cat pmid.list |fetch-pubmed  -path PubMed/Archive/ >test.xml
 
 You should see 2473 abstracts in the test.xml file.
 
+## NLTK tokens
+
+You also need to fetch punkt.zip from https://www.nltk.org/nltk_data/
+
+```sh
+cd minipubmed
+mkdir tokenizers
+cd tokenizers
+wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
+unzip punkt.zip
+```
+
 # Run the server
 
 You can use the [guix.scm](./guix.scm) container to run genecup:
 
 ```sh
-GeneCup$ guix shell -L . -C -N -F genecup-gemini coreutils edirect -- env EDIRECT_PUBMED_MASTER=./minipubmed GEMINI_API_KEY="AIza****" ./server.py --port 4201
+GeneCup$ guix shell -L . -C -N -F genecup-gemini coreutils edirect -- env EDIRECT_PUBMED_MASTER=./minipubmed NLTK_DATA=./minipubmed GEMINI_API_KEY="AIza****" ./server.py --port 4201
 ```
 
 ## Development