step 1 in topgene

author: Hao Chen 2019-05-14 22:47:16 -0500
committer: Hao Chen 2019-05-14 22:47:16 -0500
commit: 7699060b030394399fd7aff9c1f3e2840f5b834f (patch)
tree: 69531f1b33e651f863634061d41f22fd757d233f
parent: 72bb5dd7ff7e9b2c098042843d62b96e6c09f497 (diff)
download: genecup-7699060b030394399fd7aff9c1f3e2840f5b834f.tar.gz
2 files changed, 52 insertions, 0 deletions
diff --git a/Readme.md b/Readme.md
index 90efc4f..48a6d38 100644
--- a/Readme.md
+++ b/Readme.md
@@ -10,8 +10,23 @@ This app searches PubMed to find sentences that contain the query terms (e.g., g
 
 Live searches are conducted through PubMed to get relevant PMIDs, which are then used to retrieve the abstracts from a local archive. The relationships are presented as an interactive cytoscape graph. The nodes can be moved around to better reveal the connections. Clicking on the links will bring up the corresponding sentences in a new browser window.
 
+## top addiction-related genes
+
+0. extract gene symbol, alias and name from NCBI gene_info for taxid 9606.
+```
+grep ^9606 ~/Downloads/gene_info |cut -f 3,5,12|sed "s/\t-//"|sed "s/\t/|/2"|sed "s/\t-//"|grep -v ^LOC|grep -v -i pseudogene|sed "s/(|)\// /g" |sort >ncbi_gene_symb_syno_name_txid9606.txt 
+```
+
+1. search PubMed to count the abstracts that mention these names or aliases together with an addiction keyword and a drug name
+2. sort the genes by count; for the top genes, retrieve the abstracts and extract sentences that contain 1) a symbol or alias and 2) one of the keywords. manually check whether any stop words need to be removed.
+3. sort the genes by the number of abstracts with useful sentences.
+4. generate the final list, including symbol, aliases, and name
+
 ## dependencies
 
 * [local copy of PubMed](https://dataguide.nlm.nih.gov/edirect/archive.html)
 * python flask
 * python nltk
+
+## planned
+* NLP analysis of the sentences (topic modeling, ranking, etc.)
diff --git a/topGene_step1_cnt_abstracts.py b/topGene_step1_cnt_abstracts.py
new file mode 100755
index 0000000..0880aff
--- /dev/null
+++ b/topGene_step1_cnt_abstracts.py
@@ -0,0 +1,37 @@
+#!/usr/bin/env python3
+import os
+import re
+import time
+from ratspub  import *
+
+def gene_addiction_cnt(gene):
+    q="\"(" + addiction.replace("|", "[tiab] OR ")  + ") AND (" + drug.replace("|", "[tiab] OR ", ) + ") AND (" + gene + ")\""
+    count=os.popen('esearch -db pubmed  -query ' + q + ' | xtract -pattern ENTREZ_DIRECT -element Count ').read()
+    if (len(count)==0):
+        print("pause")
+        time.sleep(15)
+        return gene_addiction_cnt(gene)
+    else:
+        return (count)
+
+out=open("gene_addiction_abstract_cnt_result.tab", "w+")
+
+with open ("./ncbi_gene_symb_syno_name_txid9606.txt", "r") as f:
+    for line in f:
+        line=re.sub(r"\)|\(|\[|\]|\*|\'","",line.strip())
+        if "\t" in line:
+            (gene, synostring)=line.strip().split("\t")
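+            # split() yields a one-element list when no "|" is present, so a
+            # lone synonym stays whole instead of being iterated per character
+            # (synonyms of 3 characters or fewer are filtered in the loop below)
+            synos=synostring.split("|")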
+            for syno in synos:
+                if len(syno)>3:
+                    gene+="|"+syno
+        else:
+            gene=line.strip()
+        gene_q=gene.replace("|", " [tiab] OR ")
+        gene_q+="[tiab]"
+        count=gene_addiction_cnt(gene_q)
+        print(gene+"\t"+count)
+        out.write(gene+"\t"+count)
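Review note on the line handling in `topGene_step1_cnt_abstracts.py`: the following is a minimal sketch, not the committed code, of how one line of `ncbi_gene_symb_syno_name_txid9606.txt` could be turned into a `[tiab]` query fragment. The helper names `parse_gene_line` and `to_tiab_query` and the `min_len` parameter are hypothetical. The sketch assumes the layout implied by the script (symbol, one tab, then pipe-separated synonyms) and, unlike the script, splits only on the first tab.

```python
import re

# Same characters the committed script removes before building a query.
_STRIP = re.compile(r"[()\[\]*']")


def parse_gene_line(line, min_len=4):
    """Return "SYMBOL|syn1|syn2", keeping synonyms of at least min_len characters."""
    line = _STRIP.sub("", line.strip())
    if "\t" not in line:
        # Symbol only, no synonyms on this line.
        return line
    symbol, synostring = line.split("\t", 1)
    # split("|") returns a one-element list when there is a single synonym.
    synos = [s for s in synostring.split("|") if len(s) >= min_len]
    return "|".join([symbol] + synos)


def to_tiab_query(terms):
    """Tag every pipe-separated term with [tiab] and join the terms with OR."""
    return " OR ".join(t + "[tiab]" for t in terms.split("|"))


if __name__ == "__main__":
    gene = parse_gene_line("DRD2\tD2R|D2DR|dopamine receptor D2\n")
    print(gene)
    print(to_tiab_query(gene))
```

Building the list in one place keeps `synos` from carrying over between lines, and tagging each term in `to_tiab_query` ensures the last term is not left without a field tag.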
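Review note on `gene_addiction_cnt`: the committed function calls itself again whenever the count comes back empty, so a persistent failure would recurse without limit. The sketch below shows one possible bounded alternative; it is an illustration under assumptions, not the author's method. `count_with_retry` and `run_edirect` are hypothetical names, the default of 5 retries is arbitrary, and the 15-second pause is taken from the script. The `esearch` and `xtract` arguments mirror those in the commit.

```python
import subprocess
import time


def run_edirect(query):
    """Run the same EDirect pipeline as the script and return the raw Count output."""
    # Argument lists avoid shell quoting of the query string.
    esearch = subprocess.run(
        ["esearch", "-db", "pubmed", "-query", query],
        capture_output=True, text=True,
    )
    xtract = subprocess.run(
        ["xtract", "-pattern", "ENTREZ_DIRECT", "-element", "Count"],
        input=esearch.stdout, capture_output=True, text=True,
    )
    return xtract.stdout


def count_with_retry(query, run=run_edirect, retries=5, pause=15, sleep=time.sleep):
    """Return the hit count as a string, or None if every attempt came back empty."""
    for _ in range(retries):
        count = run(query).strip()
        if count:
            return count
        # Empty output is treated as a transient failure, as in the script.
        sleep(pause)
    return None
```

Passing `run` and `sleep` as parameters lets the retry logic be exercised without network access, for example `count_with_retry(q, run=lambda _: "7\n")`.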