Update architecture

author: Pjotr Prins 2024-06-30 06:49:19 -0500
committer: Pjotr Prins 2024-06-30 06:49:19 -0500
commit: 5d59f357e00133f2260e5019cf8b6a39ab533d2f (patch)
tree: d6550e90f73cafc4dc88f4716c5fd81b1862713f /topics
parent: 566b22ac8b62fc967ba5b3231dc61fff64bccc8e (diff)
download: gn-gemtext-5d59f357e00133f2260e5019cf8b6a39ab533d2f.tar.gz
1 files changed, 59 insertions, 20 deletions
diff --git a/topics/database/mariadb-database-architecture.gmi b/topics/database/mariadb-database-architecture.gmi
index c725173..b5954f9 100644
--- a/topics/database/mariadb-database-architecture.gmi
+++ b/topics/database/mariadb-database-architecture.gmi
@@ -29,25 +29,26 @@ select table_name,round(((data_length + index_length) / 1024 / 1024), 2) `Size i
 
 ```
 +-------------------------+------------+
-| table_name              | Size in MB | Should be:
+| table_name              | Size in MB | Should be named:
 +-------------------------+------------+
-| PublishData             |      22.54 | PublishStrainValues <- id, strainId, value (3M traits)
-| PublishSE               |       4.71 | Published SE (300K traits)
-| PublishXRef             |       2.18 | List of changes on editing data (mostly Rob)
-| ProbeSetData            |   59358.80 | Trait data          <- id, strain, value
-| ProbeSetSE              |   14551.02 | Trait Error         <- SE values aligns with ProbeSetData
+| PublishData             |      22.54 | ClassicTraitValues  <- data-id, strain-id, value (3M traits)
+| PublishSE               |       4.71 | ClassicTraitValueError (300K traits) <- data-id, strain-id, value
+| PublishXRef             |       2.18 | List of publications <- id, data-id, inbred-id, pheno-id, pub-id
+| ProbeSetData            |   59358.80 | BulkTraitValues     <- id, strain, value
+| ProbeSetSE              |   14551.02 | BulkTraitValueError <- SE values aligns with ProbeSetData
 | ProbeSetXRef            |    4532.89 | PrecomputedLRS      <- precomputed LRS values, pointing to dataset+trait
-| ProbeSet                |    2880.21 | ProbeSetInfo        <- mRNA probeset description, e.g. 100001_at comes with sequence info
-| ProbeSetFreeze          |       0.22 | DatasetInfo         <- dataset description, e.g. "Hippocampus_BXD_Jun06" - probesetfreezeid points to dataset
-| Probe                   |    2150.30 | ProbeInfo           <- Probe trait info incl sequence
-| ProbeFreeze             |       0.06 | Dataset names       <- Similar to ProbesetFreeze
-| Phenotype               |       6.50 | PhenotypeMeta       <- "Hippocampus weight"
-| ProbeXRef               |     743.38 | ProbeFreezeDataIDs  <- link ProbeFreezeId,ProbeId with DataId
-| Datasets                |       2.31 | DatasetMeta         <- "Data generated by..."
-| NStrain                 |       4.80 | StrainCountDataId   <- Strains used in dataset
-| Strain                  |       1.07 | StrainNames         <- with species ID and alias
+| ProbeSet                |    2880.21 | ProbeSetInfo        <- over utilized mRNA probeset description, e.g. 100001_at comes with sequence info
+| ProbeSetFreeze          |       0.22 | DatasetInfo         <- dataset description, e.g. "Hippocampus_BXD_Jun06" - probesetfreezeid points to dataset, shortname, public?
+| Probe                   |    2150.30 | ProbeInfo           <- Probe trait info incl sequence, id, probeset-id
+| ProbeFreeze             |       0.06 | Dataset names       <- Similar to ProbesetFreeze, id, chip-id, inbredset-id, tissue-id
+| Phenotype               |       6.50 | PhenotypeMeta       <- "Hippocampus weight", id, prepublish short-name, postpublish short-name
+| ProbeXRef               |     743.38 | ProbeFreezeDataIDs  <- link ProbeFreeze-Id,Probe-Id with Data-Id
+| Datasets                |       2.31 | DatasetMeta         <- "Data generated by...", investigator-id, publication title
+| NStrain                 |       4.80 | StrainCountDataId   <- Strains used in dataset, count, strain-id, data-id
+| Strain                  |       1.07 | StrainNames         <- with species ID and alias, id, species-id, name
 | TissueProbeSetData      |      74.42 |                     <- link Id,TissueID with value
-| TissueProbeSetXRef      |      14.73 |
+| TissueProbeSetXRef      |      14.73 | TissueGeneTable? <- data-id, gene-id, mean, symbol, TissueProbeSetFreezeId | ProbesetId | DataId
+| TissueProbeSetFreeze    |       0.01 | tissueprobefreeze-id
 | InbredSet               |       0.01 | InbredSetMeta -> Id,SpeciesId,FullName
 | ProbeData               |   22405.44 | (OLD?) mRNAStrainValues used for partial correlations <- id, strain, value = individual probe data (mRNA) [GN1,GN3]
 | ProbeSE                 |    6263.83 | (OLD?) Trait Error  <- trait SE aligns with ProbeData? [GN3]
@@ -122,7 +123,6 @@ Less commonly used tables:
 | Dataset_mbat            |       0.00 |
 | TissueProbeFreeze       |       0.00 |
 | EnsemblChip             |       0.00 |
-| TissueProbeSetFreeze    |       0.01 |
 | UserPrivilege           |       0.00 |
 | CaseAttribute           |       0.00 |
 | MappingMethod           |       0.00 |
@@ -173,7 +173,7 @@ MariaDB [db_webqtl]> select * from ProbeData limit 5;
 
 ProbeSet describes ProbeSetData. I.e., every probe ID comes with a sequence (microarray) etc.
 
-As for duplicated data - I'm pretty sure duplicated or "detached"* data happens sometimes, though that's not related to the PublishData/ProbeSetData distinction (unless this is done deliberately for some reason). I believe that whether data is entered as one or the other primarily comes down to the desire/need to divide it into datasets (or by tissue) within a group (with mRNA expression data just being the most common reason for this). I've encountered a situation before with Arthur where there was data in ProbeSetData that wasn't also in ProbeSetXRef
+As for duplicated data: duplicated or "detached"* data happens sometimes, though that's not related to the PublishData/ProbeSetData distinction (unless this is done deliberately for some reason). I believe that whether data is entered as one or the other primarily comes down to the desire/need to divide it into datasets (or by tissue) within a group (with mRNA expression data just being the most common reason for this). I've encountered a situation before with Arthur where there was data in ProbeSetData that wasn't also in ProbeSetXRef
 
 an you give an example of exactly what you mean? PublishData would be stuff like sex, weight, etc (is this what you mean?) while ProbeSetData is used for mRNA expression data (except for a few situations where it isn't lol).
 
@@ -189,9 +189,10 @@ Group -> Tissue -> Dataset -> ProbeSetData
 [OBSOLETE] ProbeData meanwhile is a table with fine-grained probe level Affymetrix data only. Contains 1 billion rows March 2016. This table may be *deleted* later since it is only used by the Probe Table display in GN1. Not used in GN2
 "ProbeData" should probably be "AssayData" or something more neutral.
 
-In comparison the "ProbeSetData" table contains more molecular assay data, including probe set data, RNA-seq data, proteomic data, and metabolomic data. 2.5 billion rows March 2016. In comparison, ProbeData contains data only for Affymetrix probe level data (e.g. Exon array probes and M430 probes).
+In comparison the "ProbeSetData" table contains more molecular assay data, including probe set data, RNA-seq data, proteomic data, and metabolomic data. 2.5 billion rows March 2016.
+ProbeData contains data only for Affymetrix probe level data (e.g. Exon array probes and M430 probes).
 
-"StrainId" should be "CaseId" or "SampleId".
+"StrainId" should be "CaseId" or "SampleId" or "GenometypeId".
 
 ```
 select * from ProbeData limit 2;
@@ -212,6 +213,44 @@ select count(*) from ProbeData limit 2;
 1 row in set (0.00 sec)
 ```
 
+## PublishData
+
+```
+MariaDB [db_webqtl]> select * from PublishData where StrainId=5 limit 5;
++---------+----------+------------+
+| Id      | StrainId | value      |
++---------+----------+------------+
+| 8967043 |        5 |  49.000000 |
+| 8967044 |        5 |  50.099998 |
+| 8967045 |        5 | 403.000000 |
+| 8967046 |        5 |  45.500000 |
+| 8967047 |        5 |  44.900002 |
++---------+----------+------------+
+5 rows in set (0.265 sec)
+MariaDB [db_webqtl]> select * from PublishSE where StrainId=5 limit 5;
++---------+----------+-------+
+| DataId  | StrainId | error |
++---------+----------+-------+
+| 8967043 |        5 |  1.25 |
+| 8967044 |        5 |  0.71 |
+| 8967045 |        5 |   8.6 |
+| 8967046 |        5 |  1.23 |
+| 8967047 |        5 |  1.42 |
++---------+----------+-------+
+5 rows in set (0.203 sec)
+MariaDB [db_webqtl]> select * from PublishXRef limit 2;
++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Id    | InbredSetId | PhenotypeId | PublicationId | DataId  | mean              | Locus          | LRS              | additive         | Sequence | comments                                                                                                                                                     |
++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| 10001 |           1 |           4 |           116 | 8967043 | 52.13529418496525 | rs48756159     | 13.4974911471087 | 2.39444435069444 |        1 | robwilliams modified post_publication_description at Mon Jul 30 14:58:10 2012
+robwilliams modified post_publication_description at Sat Jan 30 13:48:49 2016
+ |
+| 10002 |           1 |          10 |           116 | 8967044 | 52.22058767430923 | rsm10000005699 |  22.004269639323 | 2.08178575714286 |        1 | robwilliams modified phenotype at Thu Oct 28 21:43:28 2010
+                                                                                                  |
++-------+-------------+-------------+---------------+---------+-------------------+----------------+------------------+------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
+2 rows in set (0.328 sec)
+```
+
 ## ProbeSet
author	Pjotr Prins	2024-06-30 06:49:19 -0500
committer	Pjotr Prins	2024-06-30 06:49:19 -0500
commit	5d59f357e00133f2260e5019cf8b6a39ab533d2f (patch)
tree	d6550e90f73cafc4dc88f4716c5fd81b1862713f /topics
parent	566b22ac8b62fc967ba5b3231dc61fff64bccc8e (diff)
download	gn-gemtext-5d59f357e00133f2260e5019cf8b6a39ab533d2f.tar.gz