# Inspect Discrepancies Between XAPIAN and SQL Search. * assigned: bonfacem, rookie101 ## Description When doing a XAPIAN search, we miss some data that is unavailable from the SQL Search. The searches we tested: => https://cd.genenetwork.org/search?species=mouse&group=BXD&type=Hippocampus+mRNA&dataset=HC_M2_0606_P&search_terms_or=WIKI%3Dglioma&search_terms_and=&accession_id=None&FormID=searchResulto Normal SQL search for dataset=HC_M2_0606_P species=mouse group=BXD WIKI=glioma For the above search, we get 31 results. => https://cd.genenetwork.org/gsearch?type=gene&terms=species%3Amouse+group%3Abxd+dataset%3Ahc_m2_0606_p+wiki%3Aglioma species:mouse group:bxd dataset:hc_m2_0606_p wiki:glioma For the above search, we get 26 results. We miss the following entries from the XAPIAN search: ``` 15 1423803_s_at Gltscr2 glioma tumor suppressor candidate region gene 2 16 1451121_a_at Gltscr2 glioma tumor suppressor candidate region 2; exons 8 and 9 17 1452409_at Gltscr2 glioma tumor suppressor candidate region gene 2 25 1416556_at Sas sarcoma amplified sequence 26 1430029_a_at Sas sarcoma amplified sequence ``` We want to work out why the above miss in the xapian documents for the given trait. To do that we first use quest to search for one of the symbols to get the exact doc-id: ``` quest --msize=2 -s en --boolean-prefix="iden:Qgene:" \ "iden:"1423803_s_at:hc_m2_0606_p"" --db=/export/data/genenetwork-xapian/ Parsed Query: Query(0 * Qgene:1423803_s_at:hc_m2_0606_p) Exactly 1 matches MSet: 9665867: [0] {"name": "1423803_s_at", "symbol": "Gltscr2", "description": "glioma tumor suppressor candidate region gene 2", "chr": "1", "mb": 4.687986, "dataset": "HC_M2_0606_P", "dataset_fullname": "Hippocampus Consortium M430v2 (Jun06) PDNN", "species": "mouse", "group": "BXD", "tissue": "Hippocampus mRNA", "mean": 11.749030303030299, "lrs": 11.3847971289981, "additive": -0.0650828877005346, "geno_chr": "5", "geno_mb": 137.010795} ``` Inspecting the doc-id in XAPIAN, see: ``` bonfacem@tux02 /export5/xapian-test/xapian-07-04 $ xapian-delve -r 9665867 -d /export/data/genenetwork-xapian/ Data for record #9665867: {"name": "1423803_s_at", "symbol": "Gltscr2", "description": "glioma tumor suppressor candidate region gene 2", "chr": "1", "mb": 4.687986, "dataset": "HC_M2_0606_P", "dataset_fullname": "Hippocampus Consortium M430v2 (Jun06) PDNN", "species": "mouse", "group": "BXD", "tissue": "Hippocampus mRNA", "mean": 11.749030303030299, "lrs": 11.3847971289981, "additive": -0.0650828877005346, "geno_chr": "5", "geno_mb": 137.010795} Term List for record #9665867: 1423803_s_at 2 5330430h08rik 9430097c02rik Qgene:1423803_s_at:hc_m2_0606_p XC1 XDShc_m2_0606_p XGbxd XIhippocampus XImrna XPC5 XSmouse XTgene XYgltscr2 ZXDShc_m2_0606_p ZXGbxd ZXIhippocampus ZXImrna ZXSmous ZXYgltscr2 Zbc017637 Zbxd Zcandid Zgene Zglioma Zgltscr2 Zhc_m2_0606_p Zhippocampus Zmous Zmrna Zregion Zsuppressor Ztumor bc017637 bxd candidate gene glioma gltscr2 hc_m2_0606_p hippocampus mouse mrna region suppressor tumor ```