diff options
Diffstat (limited to 'topics/data')
| -rw-r--r-- | topics/data/R-qtl2-format-notes.gmi | 72 | ||||
| -rw-r--r-- | topics/data/epochs.gmi | 153 | ||||
| -rw-r--r-- | topics/data/precompute/steps.gmi | 28 |
3 files changed, 246 insertions, 7 deletions
diff --git a/topics/data/R-qtl2-format-notes.gmi b/topics/data/R-qtl2-format-notes.gmi new file mode 100644 index 0000000..3397b5e --- /dev/null +++ b/topics/data/R-qtl2-format-notes.gmi @@ -0,0 +1,72 @@ +# R/qtl2 and GEMMA Format Notes + +This document is mostly to help other non-biologists figure out their way around the format(s) of the R/qtl2 files. It mostly deals with the meaning/significance of the various fields. + +From the R/qtl2 format documentation: + +> The comma-delimited (CSV) files are each in the form of a simple matrix, with the first column being a set of IDs and the first row being a set of variable names. + +and + +> All of these CSV files may be transposed relative to the form described below. + +We are going to consider the "non-transposed" form here, for ease of documentation: simply flip the meanings as appropriate for the transposed files. + +To convert between formats we should probably use python as that is what can use as 'esperanto'. + +## Control files + +Both GN and R/qtl2 have control files. For GN it basically describes the individuals (genometypes) and looks like: + +```js +{ + "mat": "C57BL/6J", + "pat": "DBA/2J", + "f1s": ["B6D2F1", "D2B6F1"], + "genofile" : [{ + "title" : "WGS-based (Mar2022)", + "location" : "BXD.8.geno", + "sample_list" : ["BXD1", "BXD2", "BXD5", "BXD6", "BXD8", "BXD9", "BXD11", "BXD12", "BXD13", "BXD14", "BXD15", "BXD16", "BXD18", "BXD19", "BXD20", "BXD21", "BXD22", "BXD23", "BXD24", "BXD24a", "BXD25", "BXD27", "BXD28", "BXD29", "BXD30", "BXD31", "BXD32", "BXD33", "BXD34", "BXD35", "BXD36", "BXD37", "BXD38", "BXD39", "BXD40", "BXD41", "BXD42", "BXD43", "BXD44", + ...]}]} +``` + +In gn-guile this gets parsed in gn/data/genotype.scm to fetch the individuals that match the genotype and phenotype layouts. + +## pheno files and phenotypes + +The standard GEMMA input files are not very good for trouble shooting. R/qtl2 has at least the individual or genometype ID for every line: + +``` +id,bolting_days,seed_weight,seed_area,ttl_seedspfruit,branches,height,pc_seeds_aborted,fruit_length +MAGIC.1,15.33,17.15,0.64,45.11,10.5,NA,0,14.95 +MAGIC.2,22,22.71,0.75,49.11,4.33,42.33,1.09,13.27 +MAGIC.3,23,21.03,0.68,57,4.67,50,0,13.9 +``` + +This is a good standard and can match with the control files. + +## geno files + +> The genotype data file is a matrix of individuals × markers. The first column is the individual IDs; the first row is the marker names. + +For GeneNetwork, this means that the first column contains the Sample names (previously "strain names"). The first row would be a list of markers. + +## gmap and pmap files + +The first column of the gmap/pmap file contains genetic marker values. There are no Individuals/samples (or strains) here. + +## phenocovar files + +These seem to contain extra metadata for the phenotypes. + +The first column is the list of phenotype identifiers whereas the first column is a list of metadata headers (phenotype covariates). + +As an example, +=> https://github.com/rqtl/qtl2data/blob/main/BXD/bxd_phenocovar.csv The phenocovar file for BXD mice + +We see here that this contains the individual identifier (id), and a description for each individual/sample. + +# References + +=> https://kbroman.org/qtl2/assets/vignettes/input_files.html +=> https://github.com/rqtl/qtl2data diff --git a/topics/data/epochs.gmi b/topics/data/epochs.gmi new file mode 100644 index 0000000..3e8b676 --- /dev/null +++ b/topics/data/epochs.gmi @@ -0,0 +1,153 @@ +# Epochs + +In the 2019 BXD paper epochs are brought up. Basically, even though the BXD are 'immortal' with identical children, mutations do creep in. An epoch is a period of mice and we track the years a mouse was used. So a BXD1 breeding started at 1971 and production in 2001. In GN we don't make a distinction (per se), but obviously these are (slightly) different mice today. Ashbrook et al. find some interesting results that differ in epochs. + +In GN epochs are currently handled as a trait. This can help with covariate mapping. For a different epoch, however, the genotypes should also be adapted. The effect on the kinship matrix will be minor, but genotypes can be used for fine mapping. With pangenome derived genotypes it should get even more interesting. + +# Fetching data + +Tracking the epochs is happening in spreadsheet. According to track changes only one item was changed in two years - BXD10 was marked as extinct. + +In the GN SQL database Epoch with its RRID is stored as a CaseAttribute: + +``` +MariaDB [db_webqtl]> select * from CaseAttribute LIMIT 3; ++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| InbredSetId | CaseAttributeId | Name | Description + | ++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+ +| 1 | 1 | Status | Live= Available at JAX, Cryo=Cryopreserved only, Extinct + | +| 1 | 36 | RRID | Research resource identifier given by SciCrunch.org + | +| 1 | 37 | Epoch | BXD family subgroups. Each number with common parents. Epoch1(BXD1-32), Epoch2-6 (BXD33-220). See Ashbrook et al. https://pubmed.ncbi.nlm.nih.gov/33472028/ | ++-------------+-----------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+ +``` + +And + +``` +MariaDB [db_webqtl]> select * from CaseAttributeXRefNew LIMIT 40; ++-------------+----------+-----------------+------------+ +| InbredSetId | StrainId | CaseAttributeId | Value | ++-------------+----------+-----------------+------------+ +| 1 | 1 | 1 | Live | +| 1 | 1 | 36 | JAX:100006 | +| 1 | 1 | 37 | 0 | +| 1 | 1 | 40 | | +| 1 | 2 | 1 | Live | +| 1 | 2 | 36 | JAX:000664 | +| 1 | 2 | 37 | 0 | +| 1 | 2 | 40 | 69 | +| 1 | 3 | 1 | Live | +| 1 | 3 | 36 | JAX:000671 | +| 1 | 3 | 37 | 0 | +| 1 | 3 | 40 | 108 | +| 1 | 4 | 1 | Live +``` + +I am not going to comment on this table architecture, other than that RDF is a much better fit. + +For extracting this data, the SQL table is probably the best source of 'truth' as it is seen by users on a regular basis. But, at this point, we'll just use the spreadsheet. Generating something like: + +``` +gn:Bxd14 + dct:description "BXD014/TyJ" ; + gnt:epoch 1 ; + gnt:availability "Cryorecovery" ; + gnt:method "B6 female to D2 male F2 intercross" ; + gnt:M_origin "B6" ; + gnt:Y_origin "D2" ; + gnt:JAX "000329" ; + gnt:start_year 1971 ; + gnt:age_seq_ind 271 ; + gnt:birth_seq_ind "2/18/2016" ; + gnt:availability_2023 "Cryorecovery" ; + gnt:has_genotypes true ; + rdfs:label "BXD14" . +gn:Bxd65 + dct:description "BXD065/RwwJ" ; + gnt:epoch 3 ; + gnt:availability "Available" ; + gnt:method "Advanced intercross progeny of B6 female to D2 male" ; + gnt:M_origin "B6" ; + gnt:Y_origin "D2" ; + gnt:JAX "007110" ; + gnt:start_year 1999 ; + gnt:age_seq_ind 46 ; + gnt:birth_seq_ind "9/18/2016" ; + gnt:availability_2023 "Available" ; + gnt:has_genotypes true ; + rdfs:label "BXD65" . +etc. +``` + +# Approach + +## Fetching data + +To get at the epochs we'll need to fetch the sample/ind names (such as BXD73b) from GN. + +For every dataset we can fetch samples+values with + +``` +curl http://127.0.0.1:8092/dataset/bxd-publish/values/$id.json > pheno.json +{"BXD40":-1.631969,"BXD68":-2.721761,"BXD43":-2.290135,"BXD44":-2.512057,"BXD48":-3.128819 ... +``` + +These are also stored in the pangemma output lmdb files. We don't want to store all values in RDF as these are only used for compute and can be easily fetched on demand from GN. We do want to access the sample names, but that is a list that is not necessarily unique to a single trait. In fact a trait should be referencing an experiment/dataset that has the samples/inds. Usually they will use the same animals. To not complicate things we'll just point to the samples with something like + +``` +traitid gn:sample gn:BXD40 . +``` + +Currently RDF contains + +``` +gn:Bxd12 rdfs:label "BXD12" . +gn:Bxd12 rdf:type gnc:strain . +gn:Bxd12 gnt:belongsToSpecies gn:Mus_musculus . +``` + +and traits have + +``` +gn:traitBxd_10002 rdf:type gnc:Phenotype . +gn:traitBxd_10002 gnt:belongsToGroup gn:setBxd . +gn:traitBxd_10002 gnt:traitId "10002" . +gn:traitBxd_10002 skos:altLabel "BXD_10002" . +gn:traitBxd_10002 dct:description "Central nervous system, morphology: Cerebellum weight after adjustment for covariance with brain size [mg]" . +gn:traitBxd_10002 gnt:abbreviation "ADJCBLWT" . +gn:traitBxd_10002 gnt:submitter "robwilliams" . +gn:traitBxd_10002 gnt:mean "52.22058767430923"^^xsd:double . +gn:traitBxd_10002 gnt:locus gn:Rsm10000005699 . +gn:traitBxd_10002 gnt:lodScore "4.779380894726979"^^xsd:double . +gn:traitBxd_10002 gnt:additive "2.0817857571428617"^^xsd:double . +gn:traitBxd_10002 gnt:sequence "1"^^xsd:integer . +gn:traitBxd_10002 dct:isReferencedBy pubmed:11438585 . +``` + +ignore the capitalization and some naming - gnc:strain should be gnc:sample - we'll fix that. But for now we can find some trait info and we can link the individuals up with a trait. + +The query we want to write is something like + +``` +SELECT * WHERE { + ?traitid a gnc:Phenotype; + gnt:traitId "10002" ; + gnt:belongsToGroup gn:setBxd ; + gnt:traitId ?trait ; + dct:isReferencedBy ?pubmed . + OPTIONAL { + ?traitid dct:description ?descr ; + gnt:sample_id ?sampleid . + ?sampleid rdfs:label ?sample . + } +} LIMIT 10 +``` + +So, for every trait/sample combination we need to add + +``` +gn:traitBxd_10002 gnt:sample_id gn:Bxd12 . +``` diff --git a/topics/data/precompute/steps.gmi b/topics/data/precompute/steps.gmi index 75e3bfd..ac03d1a 100644 --- a/topics/data/precompute/steps.gmi +++ b/topics/data/precompute/steps.gmi @@ -2,7 +2,8 @@ At this stage precompute fetches a trait from the DB and runs GEMMA. Next it tar balls up the vector for later use. It also updates the database with the latest info. -To actually kick off compute on machines that do not access the DB I realize now we need a step-wise approach. Basically you want to shift files around without connecting to a DB. And then update the DB whenever it is convenient. So we are going to make it a multi-step procedure. I don't have to write all code because we have a working runner. I just need to chunk the work. +To actually kick off compute on machines that do not access the DB I realize now we need a step-wise approach. Basically you want to shift files around without connecting to a DB. And then update the DB whenever it is convenient. So we are going to make it a multi-step procedure. +We need to chunk the work. We will track precompute steps here. We will have: @@ -13,8 +14,18 @@ We will track precompute steps here. We will have: Trait archives will have steps for * [X] step p1: list-traits-to-compute -* [+] step p2: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper -* [ ] step p3: gemma-to-lmdb: create a clean vector +* [X] step p2: gemma-lmm9-loco-output: Compute standard GEMMA lmm9 LOCO vector with gemma-wrapper +* [X] step p3: gemma-to-lmdb: create a clean vector + +Start precompute + +* [ ] Fetch traits on tux04 +* [ ] Set up runner on tux04 and others +* [ ] Run on Octopus + +Work on published data + +* [ ] Fetch traits The DB itself can be updated from these @@ -22,8 +33,11 @@ The DB itself can be updated from these Later +* [ ] Rqtl2: Compute Rqtl2 vector * [ ] bulklmm: Compute bulklmm vector +Interestingly this work coincides with Arun's work on CWL. Rather than trying to write a workflow in bash, we'll use ccwl and accompanying tools to scale up the effort. + # Tags * assigned: pjotrp @@ -36,10 +50,10 @@ Later * [ ] Check Artyoms LMDB version for kinship and maybe add LOCO * [+] Create JSON metadata controller for every compute incl. type of content -* [+] Create genotype archive -* [+] Create kinship archive +* [X] Create genotype archive +* [X] Create kinship archive * [+] Create trait archives -* [+] Kick off lmm9 step +* [X] Kick off lmm9 step * [ ] Update DB step v1 # Step p1: list traits to compute @@ -62,7 +76,7 @@ At this point we can write {"2":9.40338,"3":10.196,"4":10.1093,"5":9.42362,"6":9.8285,"7":10.0808,"8":9.17844,"9":10.1527,"10":10.1167,"11":9.88551,"13":9.58127,"15":9.82312,"17":9.88005,"19":10.0761,"20":10.2739,"21":9.54171,"22":10.1056,"23":10.5702,"25":10.1433,"26":9.68685,"28":9.98464,"29":10.132,"30":9.96049,"31":10.2055,"35":10.1406,"36":9.94794,"37":9.96864,"39":9.31048} ``` -Note that it (potentially) includes the parents. Also the strain-id is a string and we may want to plug in the strain name. To allow for easy comparison downstream. Finally we may want to store a checksum of sorts. In Guile this can be achieved with: +Note that it (potentially) includes the parents and that is corrected when generating the phenotype file for GEMMA. Also the strain-id is a string and we may want to plug in the strain name. To allow for easy comparison downstream. Finally we may want to store a checksum of sorts. In Guile this can be achieved with: ```scheme (use-modules (rnrs bytevectors) |
