1 files changed, 160 insertions, 5 deletions
diff --git a/doc/database.org b/doc/database.org
index 624174a4..5107b660 100644
--- a/doc/database.org
+++ b/doc/database.org
@@ -1,9 +1,19 @@
-- github Document reduction issue
+* Database Information
+
+WARNING: This document contains information on the GN databases which
+will change over time. The GN database is currently MySQL based and,
+while efficient, contains a number of design choices we want to grow
+'out' of. Especially with an eye on reproducibility we want to
+introduce versioning.
+
+So do not treat the information in this document as a final way of
+accessing data. It is better to use the
+[[https://github.com/genenetwork/gn_server/blob/master/doc/API.md][REST API]].
 
 * The small test database (2GB)
 
 The default install comes with a smaller database which includes a
-number of the BSD's and the Human liver dataset (GSE9588).
+number of the BXD's and the Human liver dataset (GSE9588).
 
 * GeneNetwork database
 
@@ -750,9 +760,30 @@ show indexes from ProbeSetFreeze;
 |      1 |        5 | 0.303492 |
 +--------+----------+----------+
 
-** Publication and publishdata (all pheno)
+** Publication
+
+Publication:
+
+| Id   | PubMed_ID | Abstract    | Title   | Pages   | Month | Year |
+
 
-Phenotype pubs
+** Publishdata (all pheno)
+
+One of three phenotype tables.
+
+mysql> select * from PublishData limit 5;
++---------+----------+-------+
+| Id      | StrainId | value |
++---------+----------+-------+
+| 8966353 |      349 |  29.6 |
+| 8966353 |      350 |  27.8 |
+| 8966353 |      351 |  26.6 |
+| 8966353 |      352 |  28.5 |
+| 8966353 |      353 |  24.6 |
++---------+----------+-------+
+5 rows in set (0.25 sec)
+
+See below for phenotype access.
 
 ** QuickSearch
 
@@ -1073,7 +1104,37 @@ select * from ProbeSetXRef limit 5;
 i.e., for Strain Id 1 (DataId) 1, the locus '10.095.400' has a
 phenotype value of 5.742.
 
-GeneNetwork1 already has a limited REST interface, if you do
+Interestingly ProbeData and PublishData have the same layout as
+ProbeSetData. ProbeData is only in use for Affy assays - and not used
+for computations. PublishData contains trait values. ProbeSetData.id
+matches ProbeSetXRef.DataId while PublishData.id matches
+PublishXRef.DataId.
+
+select * from PublishXRef limit 3;
++-------+-------------+-------------+---------------+---------+----------------+------------------+-----------+----------+-------------------------------------------------------+
+| Id    | InbredSetId | PhenotypeId | PublicationId | DataId  | Locus          | LRS              | additive  | Sequence | comments                                              |
++-------+-------------+-------------+---------------+---------+----------------+------------------+-----------+----------+-------------------------------------------------------+
+| 10001 |           8 |           1 |             1 | 8966353 | D2Mit5         |   10.18351644706 |  -1.20875 |        1 |                                                       |
+| 10001 |           7 |           2 |            53 | 8966813 | D7Mit25UT      | 9.85534330983917 |  -2.86875 |        1 |                                                       |
+| 10001 |           4 |           3 |            81 | 8966947 | CEL-6_57082524 | 11.7119505898121 | -23.28875 |        1 | elissa modified Abstract at Tue Jun  7 11:38:00 2005  |
++-------+-------------+-------------+---------------+---------+----------------+------------------+-----------+----------+-------------------------------------------------------+
+3 rows in set (0.00 sec)
+
+ties the trait data (PublishData) with the inbredsetid (matching
+PublishFreeze.InbredSetId), locus and publication.
+
+select * from PublishFreeze -> ;
++----+------------+--------------------------+-------------+------------+--------+-------------+-----------------+-----------------+
+| Id | Name       | FullName                 | ShortName   | CreateTime | public | InbredSetId | confidentiality | AuthorisedUsers |
++----+------------+--------------------------+-------------+------------+--------+-------------+-----------------+-----------------+
+|  1 | BXDPublish | BXD Published Phenotypes | BXDPublish  | 2004-07-17 |      2 |           1 |               0 | NULL            |
+| 18 | HLCPublish | HLC Published Phenotypes | HLC Publish | 2012-02-20 |      2 |          34 |               0 | NULL            |
++----+------------+--------------------------+-------------+------------+--------+-------------+-----------------+-----------------+
+2 rows in set (0.02 sec)
+
+which gives us the datasets.
+
+GeneNetwork1 has a limited REST interface, if you do
 
 : curl "http://robot.genenetwork.org/webqtl/main.py?cmd=get&probeset=1443823_s_at&db=HC_M2_0606_P"
 
@@ -1082,6 +1143,9 @@ we get
 : ProbeSetID      B6D2F1  C57BL/6J        DBA/2J  BXD1    BXD2    BXD5    BXD6   BXD8     BXD9    BXD11   BXD12   BXD13   BXD15   BXD16   BXD19   BXD20   BXD21  BXD22    BXD23   BXD24   BXD27   BXD28   BXD29   BXD31   BXD32   BXD33   BXD34  BXD38    BXD39   BXD40   BXD42   BXD67   BXD68   BXD43   BXD44   BXD45   BXD48  BXD50    BXD51   BXD55   BXD60   BXD61   BXD62   BXD63   BXD64   BXD65   BXD66  BXD69    BXD70   BXD73   BXD74   BXD75   BXD76   BXD77   BXD79   BXD73a  BXD83  BXD84    BXD85   BXD86   BXD87   BXD89   BXD90   BXD65b  BXD93   BXD94   A/J    AKR/J    C3H/HeJ C57BL/6ByJ      CXB1    CXB2    CXB3    CXB4    CXB5    CXB6   CXB7     CXB8    CXB9    CXB10   CXB11   CXB12   CXB13   BXD48a  129S1/SvImJ    BALB/cJ  BALB/cByJ       LG/J    NOD/ShiLtJ      PWD/PhJ BXD65a  BXD98   BXD99  CAST/EiJ KK/HlJ  WSB/EiJ NZO/HlLtJ       PWK/PhJ D2B6F1
 : 1443823_s_at    15.251  15.626  14.716  15.198  14.918  15.057  15.232  14.968 14.87    15.084  15.192  14.924  15.343  15.226  15.364  15.36   14.792  14.908 15.344   14.948  15.08   15.021  15.176  15.14   14.796  15.443  14.636  14.921 15.22    15.62   14.816  15.39   15.428  14.982  15.05   15.13   14.722  14.636 15.242   15.527  14.825  14.416  15.125  15.362  15.226  15.176  15.328  14.895 15.141   15.634  14.922  14.764  15.122  15.448  15.398  15.089  14.765  15.234 15.302   14.774  14.979  15.212  15.29   15.012  15.041  15.448  14.34   14.338 14.809   15.046  14.816  15.232  14.933  15.255  15.21   14.766  14.8    15.506 15.749   15.274  15.599  15.673  14.651  14.692  14.552  14.563  14.164  14.546 15.044   14.695  15.162  14.772  14.645  15.493  14.75   14.786  15.003  15.148 15.221
 
+(see https://github.com/genenetwork/gn_server/blob/master/doc/API.md
+for the latest REST API).
+
 getTraitData is defined in the file [[https://github.com/genenetwork/genenetwork/blob/master/web/webqtl/textUI/cmdClass.py#L134][web/webqtl/textUI/cmdClass.py]].
 probe is None, so the code at line 199 is run
 
@@ -1165,6 +1229,97 @@ select * from ProbeSetData limit 5;
 5 rows in set (0.00 sec)
 
 linked by ProbeSetXRef.dataid.
+
+*** For PublishData:
+
+List datasets for BXD (InbredSetId=1):
+
+select * from PublishXRef where InbredSetId=1 limit 3;
++-------+-------------+-------------+---------------+---------+-----------+------------------+------------------+----------+--------------------------------------------------------------------------------+
+| Id    | InbredSetId | PhenotypeId | PublicationId | DataId  | Locus     | LRS              | additive         | Sequence | comments                                                                       |
++-------+-------------+-------------+---------------+---------+-----------+------------------+------------------+----------+--------------------------------------------------------------------------------+
+| 10001 |           1 |           4 |           116 | 8967043 | rs8253516 | 13.4974914158039 | 2.39444444444444 |        1 | robwilliams modified post_publication_description at Mon Jul 30 14:58:10 2012
+ |
+| 10002 |           1 |          10 |           116 | 8967044 | rs3666069 | 22.0042692151629 | 2.08178571428572 |        1 | robwilliams modified phenotype at Thu Oct 28 21:43:28 2010
+                    |
+| 10003 |           1 |          15 |           116 | 8967045 | D18Mit4   | 15.5929163293343 | 19.0882352941176 |        1 | robwilliams modified phenotype at Mon May 23 20:52:19 2011
+                    |
++-------+-------------+-------------+---------------+---------+-----------+------------------+------------------+----------+--------------------------------------------------------------------------------+
+
+where ID is the 'record' or, effectively, dataset.
+
+select distinct(publicationid) from PublishXRef where InbredSetId=1 limit 3;
++---------------+
+| publicationid |
++---------------+
+|           116 |
+|           117 |
+|           118 |
++---------------+
+
+select distinct
+PublishXRef.id,publicationid,phenotypeid,Phenotype.post_publication_description
+from PublishXRef,Phenotype where InbredSetId=1 and
+phenotypeid=Phenotype.id limit 3;
++-------+---------------+-------------+----------------------------------------------------------------------------------------------------------------------------+
+| id    | publicationid | phenotypeid | post_publication_description                                                                                               |
++-------+---------------+-------------+----------------------------------------------------------------------------------------------------------------------------+
+| 10001 |           116 |           4 | Central nervous system, morphology: Cerebellum weight [mg]                                                                 |
+| 10002 |           116 |          10 | Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]                 |
+| 10003 |           116 |          15 | Central nervous system, morphology: Brain weight, male and female adult average, unadjusted for body weight, age, sex [mg] |
++-------+---------------+-------------+----------------------------------------------------------------------------------------------------------------------------+
+
+The id field is the same that is used in the GN2 web interface and the
+PublicationID ties the datasets together.
+
+To list trait values:
+
+SELECT Strain.Name, PublishData.id, PublishData.value from
+(Strain,PublishData, PublishXRef) Where PublishData.StrainId =
+Strain.id limit 3;
+
++------+---------+-------+
+| Name | id      | value |
++------+---------+-------+
+| CXB1 | 8966353 |  29.6 |
+| CXB1 | 8966353 |  29.6 |
+| CXB1 | 8966353 |  29.6 |
++------+---------+-------+
+
+here id should match dataid again:
+
+SELECT Strain.Name, PublishData.id, PublishData.value from
+(Strain,PublishData, PublishXRef) Where PublishData.StrainId =
+Strain.id and PublishXRef.dataid=8967043 and
+PublishXRef.dataid=PublishData.id limit 3;
++------+---------+-------+
+| Name | id      | value |
++------+---------+-------+
+| BXD1 | 8967043 |  61.4 |
+| BXD2 | 8967043 |    49 |
+| BXD5 | 8967043 |  62.5 |
++------+---------+-------+
+
+*** Datasets
+
+The REST API aims to present a unified interface for genotype and
+phenotype data. Phenotype datasets appear in two major forms in the
+database and we want to present them as one resource.
+
+Dataset names are defined in ProbeSetFreeze.name and Published.id ->
+publication (we'll ignore the probe dataset that uses
+ProbeFreeze.name). These tables should be meshed. It looks like the
+ids are non-overlapping with the publish record IDs starting at 10,001
+(someone has been smart, though it sets the limit of probesets now to
+10,000).
+
+The datasets are organized differently in these tables. All published
+BXD data is grouped on BXDpublished with the publications as
+'datasets'. So, that is how we list them in the REST API.
+
+To fetch all the datasets we first list ProbeSetFreeze entries. Then
+we list the Published entries.
+
 ** Fetch genotype information
 
 *** SNPs