From d79f4ebc6176185660e8e69218a0e2583b8e960f Mon Sep 17 00:00:00 2001 From: Pjotr Prins Date: Sat, 15 Oct 2016 15:47:21 +0000 Subject: Doc: info on PublishData phenotypes --- doc/database.org | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 157 insertions(+), 5 deletions(-) (limited to 'doc/database.org') diff --git a/doc/database.org b/doc/database.org index 624174a4..c6f03c24 100644 --- a/doc/database.org +++ b/doc/database.org @@ -1,9 +1,19 @@ -- github Document reduction issue +* Database Information + +WARNING: This document contains information on the GN databases which +will change over time. The GN database is currently MySQL based and, +while efficient, contains a number of design choices we want to grow +'out' of. Especially with an eye on reproducibility we want to +introduce versioning. + +So do not treat the information in this document as a final way of +accessing data. It is better to use the +[[https://github.com/genenetwork/gn_server/blob/master/doc/API.md][REST API]]. * The small test database (2GB) The default install comes with a smaller database which includes a -number of the BSD's and the Human liver dataset (GSE9588). +number of the BXD's and the Human liver dataset (GSE9588). * GeneNetwork database @@ -750,9 +760,30 @@ show indexes from ProbeSetFreeze; | 1 | 5 | 0.303492 | +--------+----------+----------+ -** Publication and publishdata (all pheno) +** Publication + +Publication: + +| Id | PubMed_ID | Abstract | Title | Pages | Month | Year | + -Phenotype pubs +** Publishdata (all pheno) + +One of three phenotype tables. + +mysql> select * from PublishData limit 5; ++---------+----------+-------+ +| Id | StrainId | value | ++---------+----------+-------+ +| 8966353 | 349 | 29.6 | +| 8966353 | 350 | 27.8 | +| 8966353 | 351 | 26.6 | +| 8966353 | 352 | 28.5 | +| 8966353 | 353 | 24.6 | ++---------+----------+-------+ +5 rows in set (0.25 sec) + +See below for phenotype access. ** QuickSearch @@ -1073,7 +1104,37 @@ select * from ProbeSetXRef limit 5; i.e., for Strain Id 1 (DataId) 1, the locus '10.095.400' has a phenotype value of 5.742. -GeneNetwork1 already has a limited REST interface, if you do +Interestingly ProbeData and PublishData have the same layout as +ProbeSetData. ProbeData is only in use for Affy assays - and not used +for computations. PublishData contains trait values. ProbeSetData.id +matches ProbeSetXRef.DataId while PublishData.id matches +PublishXRef.DataId. + +select * from PublishXRef limit 3; ++-------+-------------+-------------+---------------+---------+----------------+------------------+-----------+----------+-------------------------------------------------------+ +| Id | InbredSetId | PhenotypeId | PublicationId | DataId | Locus | LRS | additive | Sequence | comments | ++-------+-------------+-------------+---------------+---------+----------------+------------------+-----------+----------+-------------------------------------------------------+ +| 10001 | 8 | 1 | 1 | 8966353 | D2Mit5 | 10.18351644706 | -1.20875 | 1 | | +| 10001 | 7 | 2 | 53 | 8966813 | D7Mit25UT | 9.85534330983917 | -2.86875 | 1 | | +| 10001 | 4 | 3 | 81 | 8966947 | CEL-6_57082524 | 11.7119505898121 | -23.28875 | 1 | elissa modified Abstract at Tue Jun 7 11:38:00 2005 | ++-------+-------------+-------------+---------------+---------+----------------+------------------+-----------+----------+-------------------------------------------------------+ +3 rows in set (0.00 sec) + +ties the trait data (PublishData) with the inbredsetid (matching +PublishFreeze.InbredSetId), locus and publication. + +select * from PublishFreeze -> ; ++----+------------+--------------------------+-------------+------------+--------+-------------+-----------------+-----------------+ +| Id | Name | FullName | ShortName | CreateTime | public | InbredSetId | confidentiality | AuthorisedUsers | ++----+------------+--------------------------+-------------+------------+--------+-------------+-----------------+-----------------+ +| 1 | BXDPublish | BXD Published Phenotypes | BXDPublish | 2004-07-17 | 2 | 1 | 0 | NULL | +| 18 | HLCPublish | HLC Published Phenotypes | HLC Publish | 2012-02-20 | 2 | 34 | 0 | NULL | ++----+------------+--------------------------+-------------+------------+--------+-------------+-----------------+-----------------+ +2 rows in set (0.02 sec) + +which gives us the datasets. + +GeneNetwork1 has a limited REST interface, if you do : curl "http://robot.genenetwork.org/webqtl/main.py?cmd=get&probeset=1443823_s_at&db=HC_M2_0606_P" @@ -1082,6 +1143,9 @@ we get : ProbeSetID B6D2F1 C57BL/6J DBA/2J BXD1 BXD2 BXD5 BXD6 BXD8 BXD9 BXD11 BXD12 BXD13 BXD15 BXD16 BXD19 BXD20 BXD21 BXD22 BXD23 BXD24 BXD27 BXD28 BXD29 BXD31 BXD32 BXD33 BXD34 BXD38 BXD39 BXD40 BXD42 BXD67 BXD68 BXD43 BXD44 BXD45 BXD48 BXD50 BXD51 BXD55 BXD60 BXD61 BXD62 BXD63 BXD64 BXD65 BXD66 BXD69 BXD70 BXD73 BXD74 BXD75 BXD76 BXD77 BXD79 BXD73a BXD83 BXD84 BXD85 BXD86 BXD87 BXD89 BXD90 BXD65b BXD93 BXD94 A/J AKR/J C3H/HeJ C57BL/6ByJ CXB1 CXB2 CXB3 CXB4 CXB5 CXB6 CXB7 CXB8 CXB9 CXB10 CXB11 CXB12 CXB13 BXD48a 129S1/SvImJ BALB/cJ BALB/cByJ LG/J NOD/ShiLtJ PWD/PhJ BXD65a BXD98 BXD99 CAST/EiJ KK/HlJ WSB/EiJ NZO/HlLtJ PWK/PhJ D2B6F1 : 1443823_s_at 15.251 15.626 14.716 15.198 14.918 15.057 15.232 14.968 14.87 15.084 15.192 14.924 15.343 15.226 15.364 15.36 14.792 14.908 15.344 14.948 15.08 15.021 15.176 15.14 14.796 15.443 14.636 14.921 15.22 15.62 14.816 15.39 15.428 14.982 15.05 15.13 14.722 14.636 15.242 15.527 14.825 14.416 15.125 15.362 15.226 15.176 15.328 14.895 15.141 15.634 14.922 14.764 15.122 15.448 15.398 15.089 14.765 15.234 15.302 14.774 14.979 15.212 15.29 15.012 15.041 15.448 14.34 14.338 14.809 15.046 14.816 15.232 14.933 15.255 15.21 14.766 14.8 15.506 15.749 15.274 15.599 15.673 14.651 14.692 14.552 14.563 14.164 14.546 15.044 14.695 15.162 14.772 14.645 15.493 14.75 14.786 15.003 15.148 15.221 +(see https://github.com/genenetwork/gn_server/blob/master/doc/API.md +for the latest REST API). + getTraitData is defined in the file [[https://github.com/genenetwork/genenetwork/blob/master/web/webqtl/textUI/cmdClass.py#L134][web/webqtl/textUI/cmdClass.py]]. probe is None, so the code at line 199 is run @@ -1165,6 +1229,94 @@ select * from ProbeSetData limit 5; 5 rows in set (0.00 sec) linked by ProbeSetXRef.dataid. + +*** For PublishData: + +List datasets for BXD (InbredSetId=1): + +select * from PublishXRef where InbredSetId=1 limit 3; ++-------+-------------+-------------+---------------+---------+-----------+------------------+------------------+----------+--------------------------------------------------------------------------------+ +| Id | InbredSetId | PhenotypeId | PublicationId | DataId | Locus | LRS | additive | Sequence | comments | ++-------+-------------+-------------+---------------+---------+-----------+------------------+------------------+----------+--------------------------------------------------------------------------------+ +| 10001 | 1 | 4 | 116 | 8967043 | rs8253516 | 13.4974914158039 | 2.39444444444444 | 1 | robwilliams modified post_publication_description at Mon Jul 30 14:58:10 2012 + | +| 10002 | 1 | 10 | 116 | 8967044 | rs3666069 | 22.0042692151629 | 2.08178571428572 | 1 | robwilliams modified phenotype at Thu Oct 28 21:43:28 2010 + | +| 10003 | 1 | 15 | 116 | 8967045 | D18Mit4 | 15.5929163293343 | 19.0882352941176 | 1 | robwilliams modified phenotype at Mon May 23 20:52:19 2011 + | ++-------+-------------+-------------+---------------+---------+-----------+------------------+------------------+----------+--------------------------------------------------------------------------------+ + +where ID is the 'record' or, effectively, dataset. + +select distinct(publicationid) from PublishXRef where InbredSetId=1 limit 3; ++---------------+ +| publicationid | ++---------------+ +| 116 | +| 117 | +| 118 | ++---------------+ + +select distinct PublishXRef.id,publicationid,phenotypeid,Phenotype.post_publication_description from PublishXRef,Phenotype where InbredSetId=1 and phenotypeid=Phenotype.id limit 3; ++-------+---------------+-------------+----------------------------------------------------------------------------------------------------------------------------+ +| id | publicationid | phenotypeid | post_publication_description | ++-------+---------------+-------------+----------------------------------------------------------------------------------------------------------------------------+ +| 10001 | 116 | 4 | Central nervous system, morphology: Cerebellum weight [mg] | +| 10002 | 116 | 10 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg] | +| 10003 | 116 | 15 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] | ++-------+---------------+-------------+----------------------------------------------------------------------------------------------------------------------------+ + +The id field is the same that is used in the GN2 web interface and the +PublicationID ties the datasets together. + +To list trait values: + +SELECT Strain.Name, PublishData.id, PublishData.value from +(Strain,PublishData, PublishXRef) Where PublishData.StrainId = +Strain.id limit 3; + ++------+---------+-------+ +| Name | id | value | ++------+---------+-------+ +| CXB1 | 8966353 | 29.6 | +| CXB1 | 8966353 | 29.6 | +| CXB1 | 8966353 | 29.6 | ++------+---------+-------+ + +here id should match dataid again: + +SELECT Strain.Name, PublishData.id, PublishData.value from +(Strain,PublishData, PublishXRef) Where PublishData.StrainId = +Strain.id and PublishXRef.dataid=8967043 and +PublishXRef.dataid=PublishData.id limit 3; ++------+---------+-------+ +| Name | id | value | ++------+---------+-------+ +| BXD1 | 8967043 | 61.4 | +| BXD2 | 8967043 | 49 | +| BXD5 | 8967043 | 62.5 | ++------+---------+-------+ + +*** Datasets + +The REST API aims to present a unified interface for genotype and +phenotype data. Phenotype datasets appear in two major forms in the +database and we want to present them as one resource. + +Dataset names are defined in ProbeSetFreeze.name and Published.id -> +publication (we'll ignore the probe dataset that uses +ProbeFreeze.name). These tables should be meshed. It looks like the +ids are non-overlapping with the publish record IDs starting at 10,001 +(someone has been smart, though it sets the limit of probesets now to +10,000). + +The datasets are organized differently in these tables. All published +BXD data is grouped on BXDpublished with the publications as +'datasets'. So, that is how we list them in the REST API. + +To fetch all the datasets we first list ProbeSetFreeze entries. Then +we list the Published entries. + ** Fetch genotype information *** SNPs -- cgit v1.2.3