# GeneCup: Mining gene relationships from PubMed using custom ontology

URL: [https://genecup.org](https://genecup.org)

GeneCup automatically extracts information from PubMed and the NHGRI-EBI GWAS Catalog on the relationship of any gene with a custom list of keywords hierarchically organized into an ontology. Users create an ontology by identifying categories of concepts and a list of keywords for each concept.

As an example, we created an ontology for drug addiction related concepts. Over 300 keywords are organized into six categories:
* names of abused drugs, e.g., opioids
* terms describing addiction, e.g., relapse
* key brain regions implicated in addiction, e.g., ventral striatum
* neurotransmission, e.g., dopaminergic
* synaptic plasticity, e.g., long term potentiation
* intracellular signaling, e.g., phosphorylation
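
To make the structure concrete, here is a minimal sketch of how such an ontology could be held in memory and matched against a sentence. The dictionary layout, the category labels, and the `categories_for` helper are assumptions for illustration, not GeneCup's actual data format; only the example keywords listed above are included.

```python
# Hypothetical in-memory form of an ontology: category -> list of keywords.
# Only the example keywords named in this README are included here.
ontology = {
    "drug": ["opioids"],
    "addiction": ["relapse"],
    "brain region": ["ventral striatum"],
    "neurotransmission": ["dopaminergic"],
    "synaptic plasticity": ["long term potentiation"],
    "intracellular signaling": ["phosphorylation"],
}


def categories_for(sentence, ontology):
    """Return the sorted categories whose keywords occur in the sentence."""
    text = sentence.lower()
    return sorted(
        category
        for category, keywords in ontology.items()
        if any(keyword.lower() in text for keyword in keywords)
    )
```

Plain substring matching keeps the sketch short; a real matcher would likely need word boundaries and keyword variants.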

Live searches are conducted through PubMed to get relevant PMIDs, which are then used to retrieve the abstracts from a local archive. The relationships are presented as an interactive Cytoscape graph. The nodes can be moved around to better reveal the connections. Clicking on the links will bring up the corresponding sentences in a new browser window. Stress-related sentences for addiction keywords are further classified as either systemic or cellular stress using a convolutional neural network.
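
As a rough illustration of how sentence-level co-occurrence could be turned into a graph, the sketch below counts sentences that mention both a gene and a keyword and emits node and edge dictionaries in the element style used by Cytoscape.js. The function names and the `weight` attribute are hypothetical, and this is not necessarily how GeneCup builds its graph.

```python
import re
from collections import Counter


def cooccurrence_counts(gene, sentences, keywords):
    """Count sentences that mention both the gene and each keyword."""
    gene_pattern = re.compile(r"\b" + re.escape(gene) + r"\b", re.IGNORECASE)
    counts = Counter()
    for sentence in sentences:
        if not gene_pattern.search(sentence):
            continue  # the gene must appear in the same sentence
        lowered = sentence.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


def to_graph_elements(gene, counts):
    """Build Cytoscape.js-style node and edge dicts (weight is hypothetical)."""
    nodes = [{"data": {"id": gene}}]
    edges = []
    for keyword, n in sorted(counts.items()):
        nodes.append({"data": {"id": keyword}})
        edges.append({"data": {"source": gene, "target": keyword, "weight": n}})
    return {"nodes": nodes, "edges": edges}
```

The resulting dictionary can be serialized with `json.dumps` and handed to a graph front end.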

## Top addiction-related genes for the addiction ontology

0. Extract the gene symbol, aliases, and name from NCBI gene_info for taxid 9606.
1. Search PubMed to get a count for these names and aliases, combined with the addiction keywords and drug names.
2. Sort the genes with the top counts, retrieve the abstracts, and extract sentences that contain 1) a symbol or alias and 2) one of the keywords. Manually check whether any stop words need to be removed.
3. Sort the genes by the number of abstracts with useful sentences.
4. Generate the final list, including symbol, aliases, and name.
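
Steps 0 and 3 can be sketched as follows. This is an illustrative outline, not the script used to build the list: it assumes the tab-separated NCBI gene_info layout (tax_id in column 1, Symbol in column 3, pipe-separated Synonyms in column 5 with `-` when empty, description in column 9), and the helper names `parse_gene_info` and `rank_by_count` are hypothetical.

```python
def parse_gene_info(lines, taxid="9606"):
    """Yield (symbol, aliases, name) for gene_info rows matching the taxid.

    Assumes the tab-separated gene_info column order described above.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[0] != taxid:
            continue
        aliases = [] if fields[4] == "-" else fields[4].split("|")
        yield fields[2], aliases, fields[8]


def rank_by_count(counts, top=None):
    """Sort genes by descending count, breaking ties by symbol."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]
```

`rank_by_count` takes a mapping from gene symbol to the number of abstracts with useful sentences, which would come from the search and sentence extraction in steps 1 and 2.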

## Dependencies

* [local copy of PubMed](https://dataguide.nlm.nih.gov/edirect/archive.html)
* python == 3.8
* see requirements.txt for list of packages and versions

## Deploy with GNU Guix

The main genecup.org service is deployed deterministically (and self-contained) using GNU Guix. See https://issues.genenetwork.org/topics/deploy/genecup and https://git.genenetwork.org/guix-bioinformatics/

## Development

The source code and data are in a git repository: https://git.genenetwork.org/genecup/

Unpack minipubmed and punkt (see below), then run, for example, using GNU Guix:

```sh
guix shell -C -N -F python python-flask coreutils-minimal python-bcrypt python-nltk python-numpy python-pandas python-regex python-flask-sqlalchemy edirect inetutils python-keras tensorflow sed -- env EDIRECT_PUBMED_MASTER=minipubmed/ NLTK_DATA=`pwd`/minipubmed ./server.py
```

and the service should be listening on port 4200.
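
One quick way to confirm that something is accepting connections on that port is a plain TCP connect. The helper below is a generic sketch (the name `is_listening` is hypothetical and it is not part of GeneCup); it only checks that the port is open, not that the application behind it is healthy.

```python
import socket


def is_listening(host="127.0.0.1", port=4200, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `is_listening()` with the defaults checks port 4200 on the local machine.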

## Mini PubMed for testing

For testing or code development, it is useful to have a small collection of PubMed abstracts in the same format as the local PubMed mirror. We provide 2473 abstracts that can be used to test four gene symbols (gria1, crhr1, drd2, and penk).

1. install [edirect](https://dataguide.nlm.nih.gov/edirect/install.html) (make sure you refresh your shell after install so the PATH is updated)
2. unpack the minipubmed.tgz file
3. test the installation by running:
```sh
cd minipubmed
cat pmid.list | fetch-pubmed -path PubMed/Archive/ > test.xml
```
You should see 2473 abstracts in the test.xml file.
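
To check that number without opening the file by hand, a small counting helper can be used. This sketch assumes each record in test.xml is wrapped in a `<PubmedArticle>` element; the function name `count_articles` is hypothetical.

```python
import re


def count_articles(xml_text):
    """Count PubmedArticle records by their opening tags.

    The pattern requires whitespace or '>' after the tag name, so an
    enclosing <PubmedArticleSet> element, if present, is not counted.
    """
    return len(re.findall(r"<PubmedArticle[\s>]", xml_text))
```

Reading test.xml into a string and passing it to `count_articles` should give 2473 if the archive was unpacked correctly.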

## NLTK tokens

You also need to fetch punkt.zip from https://www.nltk.org/nltk_data/ and unpack it under `minipubmed/tokenizers`:

```sh
cd minipubmed
mkdir tokenizers
cd tokenizers
wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
unzip punkt.zip
```
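
After unpacking, the tokenizer data is expected in a `tokenizers/punkt` directory under the path given in `NLTK_DATA`. The check below is a minimal sketch of that layout test using only the standard library; the helper name `punkt_installed` is hypothetical.

```python
from pathlib import Path


def punkt_installed(nltk_data_dir):
    """Return True if tokenizers/punkt exists under the NLTK data directory."""
    return (Path(nltk_data_dir) / "tokenizers" / "punkt").is_dir()
```

For the setup described here, `punkt_installed("minipubmed")` run from the repository root should return `True`.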

## Source code

https://git.genenetwork.org/genecup/

## Support

E-mail [Pjotr Prins](https://thebird.nl) or [Hao Chen](https://www.uthsc.edu/neuroscience-institute/about/faculty/chen.php).

## License

GeneCup source code is published under the liberal free software MIT license (also known as the Expat license).

## Cite

[GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships](https://academic.oup.com/g3journal/article/12/5/jkac059/6548160) by
Gunturkun MH, Flashner E, Wang T, Mulligan MK, Williams RW, Prins P, and Chen H.

G3 (Bethesda). 2022 May 6;12(5):jkac059. doi: 10.1093/g3journal/jkac059. PMID: 35285473; PMCID: PMC9073678.

```
@article{GeneCup,
  pmid         = {35285473},
  author       = {Gunturkun, M. H. and Flashner, E. and Wang, T. and Mulligan, M. K. and Williams, R. W. and Prins, P. and Chen, H.},
  title        = {{GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships}},
  journal      = {G3 (Bethesda)},
  year         = {2022},
  doi          = {10.1093/g3journal/jkac059},
  url          = {http://www.ncbi.nlm.nih.gov/pubmed/35285473}
}
```