# Precompute mapping input data

GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for compute and multiple significant scores for queries. At this stage we precompute GEMMA and tarball or lmdb it. As a project is never complete we need to add a metadata record in each tarball that tracks the status of the 'package'. Also, to offload compute to machines without DB access we need to prepare a first step that contains genotypes and phenotypes for compute. The genotypes have to be shared, as well as the computed kinship with and without LOCO. See

=> /topics/data/precompute/steps

The mariadb database gets (re)updated from the computed data, parsing metadata and forcing an update.

We only need to handle state to track creating batches for (offline) compute. As this can happen on a single machine we don't need to do anything real fancy. Just write out the last updated trait id, i.e., DataId from ProbeSetXRef, which is indexed. Before doing anything we use Locus_old to identify updates by setting it to NULL. See the header of list-traits-to-compute.scm.

# Tags

* assigned: pjotrp
* type: precompute, gemma
* status: in progress
* priority: high
* keywords: ui, correlations

# Tasks

See

=> topics/data/precompute/steps

Next, for running the full batch:

* [X] Store all GEMMA values efficiently
* [ ] Include metadata record in lmdb and as JSON file
* [ ] Include metadata record on compute status
* [ ] Remove junk from tarball - use lmdb?
* [ ] List significant markers as metadata in lmdb
* [ ] Reread below info
* [ ] Submit jobs to PBS using CCWL
* [ ] Report results to mariadb

And after:

* [ ] Track metadata of computed datasets (in RDF?)
* [ ] Compute significance with GEMMA or other LMM (bulkLMM?)
* [ ] Store significance and significant values for processing
* [ ] Update search & correlations to use these
* [ ] Further optimize computations so they can run continuously in the background

# Info

## Original qtlreaper version

The original reaper precompute lives in

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py

This script first fetches the inbred sets

```
select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet LIMIT 5;
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName | Name | SpeciesId | FullName | public | MappingMethodId | GeneticType | Family | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
| 1 | 1 | BXD | BXD | 1 | BXD Family | 2 | 1 | riset | Reference Populations (replicate average, SE, N) | 1 | 0 | BXD |
| 2 | 2 | B6D2F2 OHSU Brain | B6D2F2 | 1 | B6D2F2 OHSU Brain | 2 | 1 | intercross | Crosses, AIL, HS | 3 | 0 | NULL |
| 4 | 4 | AXB/BXA | AXBXA | 1 | AXB/BXA Family | 2 | 1 | NULL | Reference Populations (replicate average, SE, N) | 1 | 0 | AXB |
| 5 | 5 | AKXD | AKXD | 1 | AKXD Family | 2 | 1 | NULL | Reference Populations (replicate average, SE, N) | 1 | 0 | AKD |
| 6 | 6 | B6BTBRF2 | B6BTBRF2 | 1 | B6BTBRF2 | 2 | 1 | intercross | Crosses, AIL, HS | 3 | 0 | BBT |
+----+-------------+-------------------+----------+-----------+-------------------+--------+-----------------+-------------+--------------------------------------------------+-------------+-------------+---------------+
```

```
MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
| Id | Name     |
+----+----------+
|  1 | BXD      |
|  2 | B6D2F2   |
|  4 | AXBXA    |
|  5 | AKXD     |
|  6 | B6BTBRF2 |
+----+----------+
```

and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today (let alone the latest?). Next it sets the Id for ProbeSetFreeze which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == InbredSet.Id.
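A minimal Python sketch of that first step - walking the inbred sets and deriving the .geno file name each one expands to. This is not the original QTL_Reaper_v6.py code; the connection parameters are placeholders:

```python
# Sketch only (not the original script): list inbred sets and the .geno file
# each one would expand to. Connection parameters are placeholders.
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="webqtlout",
                       passwd="webqtlout", db="db_webqtl")
cur = conn.cursor()
cur.execute("SELECT Id, Name FROM InbredSet")
for inbredset_id, name in cur.fetchall():
    print(inbredset_id, f"{name}.geno")   # e.g. 1 BXD.geno
```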
There are groups/collections, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06"

```
select max(distinct(ProbeSetFreezeId)) from ProbeSetXRef limit 5;
| 1054 |
```

Next, for this Id we fetch what is known as `ProbeSetXRefInfos`:

```
MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5;
+------------+----------------+--------+
| ProbeSetId | Locus          | DataId |
+------------+----------------+--------+
|          1 | rs13480619     |      1 |
|          2 | rs29535974     |      2 |
|          3 | rs49742109     |      3 |
|          4 | rsm10000002321 |      4 |
|          5 | rsm10000019445 |      5 |
+------------+----------------+--------+
```

Then we fetch the trait values:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.Id = 1 limit 5;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
|  1 |        5 | 4.885 |
+----+----------+-------+
```

with names:

```
MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5;
+----------+-------+
| Name     | value |
+----------+-------+
| B6D2F1   | 5.742 |
| C57BL/6J | 5.006 |
| DBA/2J   | 6.079 |
| BXD1     | 6.414 |
| BXD2     | 4.885 |
+----------+-------+
```

35 strains were used for this dataset:

```
select count(ProbeSetData.value) from ProbeSetData where ProbeSetData.Id = 1;
+---------------------------+
| count(ProbeSetData.value) |
+---------------------------+
|                        35 |
+---------------------------+
```

With these genotypes (from files) and these phenotypes (from MySQL) qtlreaper is started. Next we update the values for

```
select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5;
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | LRS_old | pValue_old | mean | se | Locus | LRS | pValue | additive | h2 |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| 1 | 1 | 1 | 10.095.400 | 13.3971627898894 | 0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 | 12.590069931048 | 0.269 | -0.28515625 | NULL |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
```

The actual update is

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

so the Locus, LRS and additive fields are updated.
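A hedged Python sketch of that update step, using the example values from the row above; the connection details are placeholders and this is an illustration, not the original maintenance script:

```python
# Illustrative only: after mapping a trait, only the Locus, LRS and additive
# columns are written back to ProbeSetXRef. Values below are from the example
# row above; connection parameters are placeholders.
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="webqtlout",
                       passwd="webqtlout", db="db_webqtl")
cur = conn.cursor()
cur.execute(
    "UPDATE ProbeSetXRef SET Locus=%s, LRS=%s, additive=%s "
    "WHERE ProbeSetId=%s AND ProbeSetFreezeId=%s",
    ("rs13480619", 12.590069931048, -0.28515625, 1, 1))
conn.commit()
```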
The old reaper scores are in

```
MariaDB [db_webqtl]> select ProbeSetId,ProbeSetFreezeId,Locus,LRS,additive from ProbeSetXRef limit 10;
+------------+------------------+----------------+------------------+--------------------+
| ProbeSetId | ProbeSetFreezeId | Locus          | LRS              | additive           |
+------------+------------------+----------------+------------------+--------------------+
|          1 |                1 | rs13480619     | 12.590069931048  | -0.28515625        |
|          2 |                1 | rs29535974     | 10.5970737900941 | -0.116783333333333 |
|          3 |                1 | rs49742109     | 6.0970532702754  | 0.112957489878542  |
|          4 |                1 | rsm10000002321 | 11.7748675511731 | -0.157113725490196 |
|          5 |                1 | rsm10000019445 | 10.9232633740162 | 0.114764705882353  |
|          6 |                1 | rsm10000017255 | 8.45741703245224 | -0.200034412955466 |
|          7 |                1 | rs4178505      | 7.46477918183565 | 0.104331983805668  |
|          8 |                1 | rsm10000144086 | 12.1201771258006 | -0.134278431372548 |
|          9 |                1 | rsm10000014965 | 11.8837168740735 | 0.341458333333334  |
|         10 |                1 | rsm10000020208 | 10.2809848009836 | -0.173866666666667 |
+------------+------------------+----------------+------------------+--------------------+
10 rows in set (0.000 sec)
```

This means that for every dataset one single maximum score gets stored(!)

From this exercise we can conclude:

* Existing precomputed values are by linear regression (old QTLreaper)
* Only one highest score is stored
* Every time the script is run *all* datasets are recomputed
* The '_old' named values in the table are not touched by the script
* h2 and mean are not updated
* No significance score is computed (needs permutations)

Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in mariadb, but we can retain GEMMA output if it is stored efficiently.

## Notes on ProbeSetXRef

ProbeSetXRef is pretty small, currently at 5.6Gb and 48,307,650 rows, so we could decide to add columns to track different mappers.

Something funny:

```
select count(LRS) from ProbeSetXRef;
+------------+
| count(LRS) |
+------------+
|   28335955 |
+------------+
```

only about half the LRS fields are filled. And what to think of

```
MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5;
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | LRS_old | pValue_old | mean | se | Locus | LRS | pValue | additive | h2 |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| 589 | 8599010 | 70606737 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL |
| 589 | 8593710 | 70606738 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL |
| 589 | 8607637 | 70606739 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL |
| 589 | 8594490 | 70606740 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL |
| 589 | 8591994 | 70606741 | NULL | NULL | NULL | 0 | NULL | rsm10000000001 | 0 | NULL | 0 | NULL |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
```

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001";
+----------+
| count(*) |
+----------+
|   291447 |
+----------+
```

There is obviously more. I think this table can use some cleaning up?

Looking at the GN1 code base, ProbeSetXRef is used in a great number of files.
In GN2 it is:

```
wqflask/maintenance/quantile_normalize.py
wqflask/maintenance/generate_probesetfreeze_file.py
wqflask/base/mrna_assay_tissue_data.py
wqflask/wqflask/update_search_results.py
wqflask/wqflask/correlation/rust_correlation.py
wqflask/base/trait.py
wqflask/wqflask/do_search.py
wqflask/base/data_set/mrnaassaydataset.py
wqflask/base/data_set/dataset.py
wqflask/wqflask/correlation/pre_computes.py
wqflask/wqflask/api/router.py
wqflask/wqflask/show_trait/show_trait.py
scripts/insert_expression_data.py
scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
scripts/maintenance/readProbeSetSE_v7.py
scripts/maintenance/QTL_Reaper_v6.py
scripts/maintenance/readProbeSetMean_v7.py
wqflask/tests/unit/base/test_mrna_assay_tissue_data.py
```

Let's visit these one by one to make sure there are no side effects.

### wqflask/maintenance/quantile_normalize.py

Appears to be a one-off to normalize a certain dataset in `/home/zas1024/cfw_data/`. The last time it was used is probably 2018.

### wqflask/maintenance/generate_probesetfreeze_file.py

This one is even older and probably from 2013.

### wqflask/base/mrna_assay_tissue_data.py

Another dinosaur; it actually uses TissueProbeSetXRef, so it is not relevant.

### wqflask/wqflask/update_search_results.py

This is for the global 'gene' search, making sure ProbeSet.Id = ProbeSetXRef.ProbeSetId AND ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id. LRS is passed on.

### wqflask/wqflask/correlation/rust_correlation.py

Again making sure IDs match. LRS is passed on. This requires scrutiny FIXME

### wqflask/base/trait.py

The Trait class defines a trait in webqtl; it can be a microarray, published phenotype, genotype, or user-input trait. Fetches LRS.

### wqflask/wqflask/do_search.py

Searches for genes with a QTL within the given LRS values. LRS searches can take 3 different forms:

- LRS > (or <) min/max_LRS
- LRS=(min_LRS max_LRS)
- LRS=(min_LRS max_LRS chromosome start_Mb end_Mb)

where min/max_LRS represent the range of LRS scores and start/end_Mb represent the range in megabases on the given chromosome.

### wqflask/base/data_set/mrnaassaydataset.py

Also search results.

### wqflask/base/data_set/dataset.py

ID sync

### wqflask/wqflask/correlation/pre_computes.py

ID sync

### wqflask/wqflask/api/router.py

Fetch traits and sample data API (we may use this!)

### wqflask/wqflask/show_trait/show_trait.py

Same.
### The following are a bunch of scripts:

* scripts/insert_expression_data.py
* scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
* scripts/maintenance/readProbeSetSE_v7.py
* scripts/maintenance/QTL_Reaper_v6.py
* scripts/maintenance/readProbeSetMean_v7.py
* wqflask/tests/unit/base/test_mrna_assay_tissue_data.py

## Using LRS

* wqflask/wqflask/static/new/javascript/lod_chart.js

LRS is used to display a LOD chart, and in a bunch of python files that match the above list for ProbeSetXRef:

* scripts/maintenance/QTL_Reaper_v6.py
* wqflask/base/webqtlConfig.py
* wqflask/base/trait.py
* wqflask/tests/unit/wqflask/marker_regression/test_display_mapping_results.py
* wqflask/tests/unit/base/test_trait.py
* wqflask/wqflask/parser.py
* wqflask/tests/unit/wqflask/correlation/test_show_corr_results.py
* wqflask/wqflask/update_search_results.py
* wqflask/wqflask/correlation/show_corr_results.py
* wqflask/wqflask/marker_regression/run_mapping.py
* wqflask/wqflask/export_traits.py
* wqflask/wqflask/gsearch.py
* wqflask/base/data_set/phenotypedataset.py
* wqflask/base/data_set/markers.py
* wqflask/wqflask/api/router.py
* wqflask/base/data_set/mrnaassaydataset.py
* wqflask/wqflask/correlation/rust_correlation.py
* wqflask/wqflask/marker_regression/display_mapping_results.py
* wqflask/wqflask/collect.py
* wqflask/wqflask/do_search.py
* wqflask/wqflask/views.py
* wqflask/utility/Plot.py
* test/requests/main_web_functionality.py
* test/requests/correlation_tests.py

From the above it can be concluded that these precomputed values are used for display and for getting search results (filtering on datasets that have a QTL). It looks safe to start replacing qtlreaper results with GEMMA results. The only thing I am not sure about is correlations. A cursory inspection suggests LRS is only used for final output, and that makes sense if correlations are done on (expression) phenotypes. Anyway, if things go haywire we'll find out soon (enough).

At the next stage we ignore all this and start precompute with GEMMA on the BXD.

## Precompute DB

We will use a database to track precompute updates (see below).

On the computing host (or client) we should track the following (see the JSON sketch below):

* Dataset (phenotypes)
* time: MySQL time the ProbeSetData table was last updated (if identical we can skip)
* Hash on DB inputs (for DB table updates)
* Genotypes
* Algorithm
* Hash on run inputs (phenotypes, genotypes, algorithm, invocation)
* time: of initiating the run
* time: of completion
* Hostname of run (this)
* File path (this)
* Hash on output data (for validation)
* array of rec: DB hostnames + time stamps: successfully updated DB table for these servers

The logic is that if the DB table was changed we should recompute the hash on inputs. Note the ProbeSetData table is the largest at 200G including indices. If that hash changed the mapping algorithm should rerun. A new record will be created on the new inputs. A flag is set for the updated DB for precomputed values. We can garbage collect when multiple entries for `Dataset' exist.

What database should we use? Ultimately precompute is part of the main GN setup, so one could argue MariaDB is a good choice. We would like to share precompute results with other setups, however. Also, we typically do not want to run the precompute on the production host. That means we should be able to rebuild the database from a precompute output directory and feed the update to the running (production) server.

We want to track compute so we can distribute running the algorithms across servers and/or PBS.
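To make the tracking record concrete, here is a minimal sketch of what such a per-run JSON state record could look like. All field names and values are illustrative assumptions, not a fixed schema:

```python
# Illustrative sketch of a per-run precompute state record (field names and
# values are assumptions, not a fixed schema). One such JSON record would be
# written per compute and later pushed into a GN instance.
import json, socket, time

record = {
    "dataset": "HC_M2_0606_P",                      # phenotype dataset
    "probesetdata_updated": "2023-11-11 06:39:00",   # MySQL time ProbeSetData last changed
    "hash_db_inputs": "46bfba37...",                 # hash over the relevant DB rows
    "genotypes": "BXD.8",
    "algorithm": "gemma-loco",
    "hash_run_inputs": "779a54a5...",                # phenotypes+genotypes+algorithm+invocation
    "time_start": time.strftime("%Y-%m-%dT%H:%M:%S"),
    "time_end": None,
    "host": socket.gethostname(),
    "path": "/export2/precompute-gemma",
    "hash_output": None,                             # filled in after the run, for validation
    "db_updates": []                                 # list of {hostname, timestamp} records
}
print(json.dumps(record, indent=2))
```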
This implies the compute machines have to be able to query the DB in some way. Basically a machine has a 'runner' that checks the DB for updates and fetches phenotypes and genotypes. A run is started and on completion the DB is notified and updated. Note that a runner can not run in parallel on a single results directory, so there is one runner per target output directory. We can have different runners: one for the local machine, one for PBS and one for remotes. This also implies that 'state' is handled with the output dir (on a machine). We will create a JSON record for every compute. This state can be pushed into 'any' GN instance at any point.

For running jobs the DB can be queried through a (same) REST API. The REST API has to handle the DB, so it has to be able to access DB and files, i.e. it should be on the same (production) machine.

On the DB we'll create a hash table on inputs of ProbeSetData. This way we don't have to walk the giant table for every query. This can be handled through a CRON job. So there will be a table:

* Dataset
* Hash on relevant phenotypes (ProbeSetData)
* time: Job started
* host: host running job
* status: (final) job status
* time: DB updated

## Tracking updates to DB

This brings us to CRON jobs. There are several things that ought to be updated when the DB changes - Xapian being one example and now this table. These should run on a regular basis and only update what really changed. We need to support that type of work out of the box with an hourly runner (for precompute) and a daily runner (for xapian).

One possibility is to use MySQL triggers to track changes to a database table. Even a hash can be computed, as suggested here:

=> https://stackoverflow.com/questions/38732185/hash-of-two-columns-in-mysql

The problem is that this relatively expensive computation is done on every row update. I don't think it is feasible for updating phenotypes - Arthur may find the updates go very slow. It is possible, however, to just update a boolean value, or even a time stamp; that is cheap. That will probably increase the update time by 50-100% and that may be acceptable. We can only find out by trying. But first I am going to simply generate the hashes from the running DB by scanning the full table. If that runs in minutes we should be good with a CRON job for now. Also, in the medium term, we may replace these tables with lmdb files and it is no good to depend on MySQL/MariaDB for this type of service.

Note that ProbeSetXRef is indexed on DataId, ProbeSetFreezeId, ProbeSetId and Locus. The combination ProbeSetFreezeId+ProbeSetId should be unique and, indexed by unique DataId, it identifies the actual dataset, so that should be reasonably fast. `ProbeSetFreezeId` is a unique identifier for a dataset, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06". `ProbeSetId` is the trait in the collection, e.g. '100001_at'. `DataId` is the counter in the ProbeSetXRef table. So now this should be clear on update:

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

Every trait gets an LRS and effect size this way.

## Writing script to Hash ProbeSetData

tux02 has a running database, so it is good for trials. The first step is to add hashes to ProbeSetData and compute them. Two questions we can ask: (1) should we host that data in the same table and (2) should we use another DB? For (1) we should *not* use the same table. The current table is reasonably small and if we expand the 28,335,955 rows of ProbeSetXRef it is going to be huge.
This table is used a lot, so we should create a new table for this specific purpose. Now that means, even though this table is obviously linked to the running DB, we can actually use a different storage. And for this particular purpose I think sqlite will do great.

Our preferred 'transient' databases these days are lmdb for disk b-trees, redis for in-RAM b-trees, sqlite for tables and virtuoso for RDF. This may change again, but for now these are great default choices. 'Transient' means that the DB can be regenerated from some source of truth. This is not necessarily the case today - particularly authorization and session management are now using sqlite and redis as primary data sources. That is fine, as long as we are fully aware of the choices and what that implies.

We already use SQLite for authorization. Fred has been adding that to GN3 and now to its own authorization service:

=> https://github.com/genenetwork/gn-auth

In the README you can see the DB is located with a parameter 'AUTH_DB'. There is no example, so I checked our production setup. Zach has it set up with

```
LOGLEVEL = "DEBUG"
SECRET_KEY = '9a4a2a...'
SQL_URI = "mysql://user:pwd@127.0.0.1:3306/db_webqtl"
AUTH_DB = "/home/gn2/auth_copy.db"
AUTH_MIGRATIONS = "/home/zas1024/gn-auth/migrations"
```

The DB file is world writeable - and we should probably change that. Also it needs a better storage location. Essentially GN should have a limited number of directories that handle state:

* SQL server (mysql) - read-write
* Other databases - read-write
* Upload files - read-write
* Reference files - read-only
* Secret files - read-only
* Cache files - read-write

Within these directories there should be clear subdirectories for each use case. So, for authorization we could have `/data/gn/databases/auth/production/auth.db'. Likewise, the precompute hash table will be in `/data/gn/databases/precompute/production/hashes.db' and we'll introduce an environment variable "GN_HASH_DB". We'll take care of that later.

For our hashing purposes we'll first create a DB named 'hashes.db' (see the sketch below). This table can also be used for correlations etc. down the line by using the 'algorithm' field and its combined input hash. It is generic, in other words.

## Getting precompute running

Now we have pretty much figured out what to do, it is time to set up precompute. We can start with one single server that has access to mariadb. The new Genoa Tux04 will be our replacement for Tux01. It has some larger Crucial P3 NVMe SSDs rated at 3.5GB/s. The Dell drive is 1.4TB and is rated to be twice as fast, so we'll use that for MariaDB now. The first step is to make space for all the GN2+GN3 stuff. Next we test if we can query tux01 ports. tux04 can see tux01 because they are both inside the UTHSC firewall. I just need to allow the mysql port on tux01 and we keep using that DB for now.

The first phase is to simply start running gemma on tux05 and tux04 using the BXD. I mean, before computing 28,335,955 phenotypes we'll have to have a more serious compute setup!
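For the hashes.db idea, here is a minimal sketch under stated assumptions: a sqlite3 file with an illustrative table layout, filled by scanning the relevant ProbeSetData rows per DataId. The table and column names, and the dataset used, are placeholders; the real schema is still to be decided, and this is not optimized:

```python
# Minimal sketch (assumed table/column names) of generating per-trait input
# hashes by scanning the running MariaDB, so a CRON job can later detect
# changed phenotypes without walking the 200G table on every query.
import hashlib
import sqlite3
import MySQLdb

maria = MySQLdb.connect(host="127.0.0.1", user="webqtlout",
                        passwd="webqtlout", db="db_webqtl")
lite = sqlite3.connect("hashes.db")
lite.execute("""CREATE TABLE IF NOT EXISTS hashes
                (dataset TEXT, algorithm TEXT, hash TEXT, time_db_updated TEXT)""")

cur = maria.cursor()
cur.execute("SELECT DataId FROM ProbeSetXRef WHERE ProbeSetFreezeId=%s", (112,))
for (data_id,) in cur.fetchall():
    c2 = maria.cursor()
    c2.execute("SELECT StrainId, value FROM ProbeSetData WHERE Id=%s ORDER BY StrainId",
               (data_id,))
    digest = hashlib.sha256(repr(c2.fetchall()).encode()).hexdigest()
    lite.execute("INSERT INTO hashes VALUES (?,?,?,datetime('now'))",
                 ("HC_M2_0606_P:%d" % data_id, "phenotype-input", digest))
lite.commit()
```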
```
MariaDB [db_webqtl]> select * from Strain where SpeciesId=1 limit 200;
+-----+------------+------------+-----------+--------+------------+
| Id  | Name       | Name2      | SpeciesId | Symbol | Alias      |
+-----+------------+------------+-----------+--------+------------+
|   1 | B6D2F1     | B6D2F1     |         1 | NULL   | NULL       |
|   2 | C57BL/6J   | C57BL/6J   |         1 | B6J    | P          |
|   3 | DBA/2J     | DBA/2J     |         1 | D2J    | P          |
|   4 | BXD1       | BXD1       |         1 | NULL   | NULL       |
|   5 | BXD2       | BXD2       |         1 | NULL   | NULL       |
|   6 | BXD5       | BXD5       |         1 | NULL   | NULL       |
|   7 | BXD6       | BXD6       |         1 | NULL   | NULL       |

select * from Strain where SpeciesId=1 and Name like '%BXD%';
returns 1371 rows
```

The originally named StrainXRef table links StrainId with InbredSetId:

```
MariaDB [db_webqtl]> select * from StrainXRef limit 3;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
+-------------+----------+---------+------------------+----------------+
```

## Getting all BXD datasets

Together with the InbredSet table we have the information we need to fetch the BXD datasets.

```
select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType,Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet limit 3;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName | Name | SpeciesId | FullName | public | MappingMethodId | GeneticType | Family | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
| 1 | 1 | BXD | BXD | 1 | BXD Family | 2 | 1 | 0 | 1 | 0 | BXD |
| 2 | 2 | B6D2F2 OHSU Brain | B6D2F2 | 1 | B6D2F2 OHSU Brain | 2 | 1 | 0 | 3 | 0 | NULL |
| 4 | 4 | AXB/BXA | AXBXA | 1 | AXB/BXA Family | 2 | 1 | NULL | 1 | 0 | AXB |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
```

So

```
MariaDB [db_webqtl]> select * from StrainXRef where InbredSetID=1 limit 5;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
|           1 |        4 |      40 | Y                | RI_progeny     |
|           1 |        5 |      50 | Y                | RI_progeny     |
+-------------+----------+---------+------------------+----------------+
```

and

```
select * from ProbeSetData where StrainId = 4 limit 5;
```

fetches all data where strain `BXD1' was used. That is 14,829,435 data points! Now it should be clear the database is designed to go from dataset -> strain and not the other way round. We can, however, walk ProbeSetXRef and get the ProbeSetId. Use that to query ProbeSetData and use the StrainId to make sure it belongs to the BXD for compute.

Next, the fields `Locus_old | LRS_old | pValue_old' are not used, as far as I can tell from the source code. So, we could use these to track updates. I am going to set these _old fields in ProbeSetXRef. After checking the backups we will first set all Locus_old to NULL.
If a value is NULL we can precompute. After precompute, the current Locus_old value is taken from the reaper compute, as well as LRS_old and pValue_old. We will add a column for additive_old.

Let's fetch one where Locus_old is now NULL:

```
select ProbeSetFreezeId,ProbeSetId,DataId,Locus_old,Locus from ProbeSetXRef where Locus_old is NULL limit 1;
+------------------+------------+--------+-----------+-----------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | Locus     |
+------------------+------------+--------+-----------+-----------+
|               31 |      48552 | 458384 | NULL      | D10Mit194 |
+------------------+------------+--------+-----------+-----------+
1 row in set (0.00 sec)
```

DataId is really ProbeSetDataId, so:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.id = 458384 limit 1;
+--------+----------+--------+
| Id     | StrainId | value  |
+--------+----------+--------+
| 458384 |       42 | 10.306 |
+--------+----------+--------+
1 row in set (0.00 sec)
```

And the StrainId can be looked up with

```
MariaDB [db_webqtl]> select * from Strain where Strain.id=42;
+----+---------+---------+-----------+--------+-------+
| Id | Name    | Name2   | SpeciesId | Symbol | Alias |
+----+---------+---------+-----------+--------+-------+
| 42 | CASE001 | CASE001 |         1 | NULL   | NULL  |
+----+---------+---------+-----------+--------+-------+
```

and

```
MariaDB [db_webqtl]> select * from StrainXRef where StrainId=42;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           2 |       42 |       1 | N                | NULL           |
+-------------+----------+---------+------------------+----------------+
```

```
MariaDB [db_webqtl]> select * from InbredSet where Id=2;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
| Id | InbredSetId | InbredSetName | Name | SpeciesId | FullName | public | MappingMethodId | GeneticType | Family | FamilyOrder | MenuOrderId | InbredSetCode | Description |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
| 2 | 2 | B6D2F2 OHSU Brain | B6D2F2 | 1 | B6D2F2 OHSU Brain | 2 | 1 | intercross | Crosses, AIL, HS | 3 | 0 | NULL | NULL |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
```

which is not part of the BXD (InbredSetId=1). So, StrainXRef is crucial to get to the actual BXD. Let's look at how it is used in existing code. In GN2 it is used in 2 places only: the router and the sample list:

```
SELECT Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId = 1;
```

lists all strain *names* based on Strain.Id. The nomenclature is bogglingly bad. Anyway, it confirms that we can use this to see if something is part of the BXD.

Get all the BXD strains used for mapping:

```
SELECT StrainId,Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId = 1 AND Used_for_mapping="Y" ORDER BY StrainId;
```

After some thought I decided we will create a new table that mirrors what we'll do with sqlite down the line.
The new table will also allow for referring to time stamps and multiple hits. So, in addition to time stamps and hash values, we'll add to the above update record:

* top hit info (locus, LRS, p-value, effect)
* number of other hits
* file with other hits

And down the line we can point to a different DB table that records significant hits. As a first check it will be interesting to see how these values differ between qtlreaper and GEMMA!

## Create Hash table

This new hash table can be called 'SignificantHits' or something. ProbeSetXRef is defined as

```
CREATE TABLE ProbeSetXRef (
  ProbeSetFreezeId smallint(5) unsigned NOT NULL DEFAULT 0,
  ProbeSetId int(10) unsigned NOT NULL DEFAULT 0,
  DataId int(10) unsigned NOT NULL DEFAULT 0,
  Locus_old char(20) DEFAULT NULL,
  LRS_old double DEFAULT NULL,
  pValue_old double DEFAULT NULL,
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (DataId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

Not all choices are that sane, but we'll leave it like it is for now (it is a small table). The new table will look like (LRS = likelihood ratio statistic)

```
CREATE TABLE SignificantHits (
  Id INT UNSIGNED NOT NULL DEFAULT 0,
  ProbeSetFreezeId INT UNSIGNED NOT NULL DEFAULT 0,  -- points to dataset info
  ProbeSetId INT UNSIGNED NOT NULL DEFAULT 0,        -- points to trait info
  ProbeSetDataId INT UNSIGNED NOT NULL DEFAULT 0,    -- id, strain value OR ID of the table?
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  significant double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (Id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

Note the prepended Id and the rename from DataId to ProbeSetDataId, because they refer to the same thing:

```
MariaDB [db_webqtl]> select max(DataId) from ProbeSetXRef limit 10;
| max(DataId) |
|    92213693 |

select max(Id) from ProbeSetData limit 3;
| max(Id) |
| 92213693 |
```

In the future this table will be the replacement for the badly named ProbeSetXRef.

## Creating a list of work on tux04

We'll use a test table on tux02 first. It should not matter for precompute because we'll regenerate hashes from the DB input and GEMMA output. I.e., we only need to recompute if something has changed in the phenotypes or genotypes, and that is quite rare.

We will work on tux04 and I'll first install the database there. In the process we'll need to get the basic environment going that we need for work. First I thought I'd copy the backup from space - I thought it had a 10Gbs link, but actually it uses 1Gbs and the backup is a month old (need to fix that too!). Alright, let's use tux02 for the DB.

On the new machine tux04 we disable /etc/profile and /etc/bash.bashrc - people can add those themselves and they mess up guix profiles. After creating my environment it looks like any other machine I use and I have the same keyboard shortcuts. Note that my UID may be wrong for Octopus, but that requires more heavy surgery anyway.
Next we create a drop for the backups from another machine following:

=> ../backup_drops.gmi

Copy the backup, which takes a while over the slow network, and unpack it with the great tool borg:

```
time borg extract --progress /export/backup/bacchus/drop/tux01/borg-tux01::borg-backup-mariadb-20231111-06:39-Sat
```

Next install a recent Mariadb from guix as a container using the instructions in

=> mariadb.gmi
=> setting-up-local-development-database.gmi

Move the database files into, for example, /export/mysql/var/lib/mysql and chown the files to your user account. Next

```
cd /export/mysql
mkdir tmp
mkdir run
tux04:/export/mysql$ ~/opt/guix-pull/bin/guix shell -C -N coreutils sed mariadb --share=/export/mysql/var=/var --share=/export/mysql/tmp=/tmp
export TMPDIR=/tmp
mysqld_safe --datadir='/var/lib/mysql/' --protocol tcp --port=3307 --user=$USER --group=users --nowatch
```

and a client with:

```
/export/mysql$ ~/opt/guix-pull/bin/guix shell mysql -- mysql --socket=var/run/mysqld/mysqld.sock -uwebqtlout -pwebqtlout db_webqtl
```

### Modifying the tables

Set the unused 'Locus_old' to NULL because we use that to update the existing table with the new values for production (note this is the quick hack discussed earlier):

```
update ProbeSetXRef set Locus_old=NULL;
```

```
SELECT DISTINCT DataId from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId = ProbeSetData.Id where StrainId>45 AND Locus_old is NULL limit 10;
```

## Preparing for GEMMA

Meanwhile I have prepared tux04 and tux05 for the new runs. The next step is to query the DB and run GEMMA.

### Testing a script to access the database

A first script on tux02 runs:

```scheme
(use-modules (dbi dbi))
(define db_webqtl (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))
(dbi-query db_webqtl "SELECT * FROM ProbeSetXRef LIMIT 1")
(display (dbi-get_row db_webqtl))

((ProbeSetFreezeId . 1) (ProbeSetId . 1) (DataId . 1) (Locus_old . 10.095.400) (LRS_old . 13.3971627898894) (pValue_old . 0.163) (mean . 5.48794285714286) (se . 0.08525787814808819) (Locus . rs13480619) (LRS . 12.590069931048) (pValue . 0.269) (additive . -0.28515625) (h2 . 0.0))
```

Note that you can run this over a guile-geiser remote connection.

### Install and run gemma

Now that we have a working DB query method on a compute node we can start running GEMMA. GEMMA is packaged as part of GNU Guix, so that part is straightforward. After the above guix pull

```
. ~/opt/guix-pull/etc/profile  <- don't do this, it does not work on Debian Guix images!
git clone https://git.genenetwork.org/guix-bioinformatics/
git clone https://gitlab.inria.fr/guix-hpc/guix-past
~/opt/guix-pull/bin/guix package -L ~/guix-bioinformatics/ -L ~/guix-past/modules -A gemma
gemma-gn2    0.98.5-8cd4cdb    out    /home/wrk/guix-bioinformatics/gn/packages/gemma.scm:31:2
```

and install gemma with

```
wrk@tux05:~$ ~/opt/guix-pull/bin/guix package -L ~/guix-bioinformatics/ -L ~/guix-past/modules -i gemma -p ~/opt/gemma
```

### Creating the precompute-runner

The precompute-runner can be part of `gn-guile/scripts/precompute`, i.e., the new GN guile repo's maintenance script dir:

=> https://git.genenetwork.org/gn-guile/tree/scripts/precompute

See the README.md for more information on running the script etc. For development I'll tunnel to the Tux02 database.

As we are doing the BXDs first, we fetch a record from ProbeSetXRef that has Locus_old set to NULL AND matches a BXD trait. First we fetch all BXD strain ids. I wrote a function `bxd-strain-id-names` for that (sketched below).
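A hedged sketch of the SQL behind fetching the BXD strain ids and names; the actual implementation is the Guile function `bxd-strain-id-names` in gn-guile, so treat this Python version as an illustration only (connection details are placeholders):

```python
# Illustration of the underlying query (the real code is Guile Scheme in gn-guile):
# fetch the BXD strains that are used for mapping.
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="webqtlout",
                       passwd="webqtlout", db="db_webqtl")
cur = conn.cursor()
cur.execute("""SELECT StrainId, Strain.Name
               FROM Strain, StrainXRef
               WHERE StrainXRef.StrainId = Strain.Id
                 AND StrainXRef.InbredSetId = 1
                 AND Used_for_mapping = 'Y'
               ORDER BY StrainId""")
bxd_strains = dict(cur.fetchall())   # StrainId -> name, e.g. 4 -> 'BXD1'
```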
Next, using

=> https://git.genenetwork.org/gn-guile/commit/?id=b1db013cc01c94e27edf982be9b027a2b0bb9712

we fetch the first BXD dataset where all strains are members of the BXD:

```
((Locus . rs13480619) (DataId . 1) (ProbeSetId . 1))
WE HAVE OUR FIRST BXD DATASET for precompute!
((1 . 5.742) (2 . 5.006) (3 . 6.079) (4 . 6.414) (5 . 4.885) (6 . 4.719) (7 . 5.761) (8 . 5.604) (9 . 5.661) (10 . 5.708) (11 . 5.628) (12 . 6.325) (13 . 5.37) (14 . 6.544) (15 . 5.476) (16 . 5.248) (17 . 5.528) (19 . 5.51) (20 . 5.886) (21 . 5.177) (22 . 5.655) (23 . 5.522) (24 . 5.549) (25 . 4.588) (26 . 5.618) (28 . 6.335) (29 . 5.569) (30 . 4.422) (31 . 5.194) (35 . 4.784) (36 . 5.056) (37 . 5.869) (39 . 5.175) (40 . 5.207) (41 . 5.264))
```

If it is not a match we'll need to iterate to the next one. The current version of precompute can fetch all the BXD datasets by name and trait name:

=> https://git.genenetwork.org/gn-guile/tree/scripts/precompute/precompute-hits.scm

For GEMMA we need to write a phenotype file which is simply a vector of values. Meanwhile, the genotype file and SNP files for the BXD are fixed (currently version 8 with 21,056 markers).

To run GEMMA we can take an example from `gemma-wrapper`, a tool I wrote to cache results:

=> https://github.com/genetics-statistics/gemma-wrapper

gemma-wrapper does a bit more work than we really need and is a bit too generic for this precompute exercise. The following things matter:

* handling of hashes to prevent recompute
* handling of the kinship matrix (K)
* handling of leave one chromosome out (LOCO)

all of which are handled by gemma-wrapper really well. We'll use gemma-wrapper for the time being to do the actual computations.

BTW the cache for GN production is currently at 190G with 40K files. That is 3 months worth of caching GEMMA as we clean up older files! For precompute we may need to be smarter because that only represents 970 association results and 400 kinship matrices. We should improve GEMMA by:

* by default only keeping the required results
* providing a more compact storage of results

We should be using something like lmdb as that will speed up lookups and GEMMA itself. Simple compression reduces output 5x for one 7Mb output file. Some K matrices are huge too - the largest one runs at 0.5 Gb. GEMMA does crash once in a while, unsurprisingly.

For the BXD we need to compute at least 100K traits - I have not done a proper count yet - and we can delete output files when the database gets updated. We should, however, hang on to hits that are potentially significant, so I want to retain output, or create a new mysql table.

## Run gemma-wrapper

We can run a local version of gemma-wrapper with

```
.guix-shell --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper ruby -- ruby /gemma-wrapper/bin/gemma-wrapper
```

This is useful because we want to change gemma-wrapper itself - mostly to produce less bulky output and to add some metadata for downstream processing (i.e. to update the database itself). Note how versatile `guix shell` is for development!

gemma-wrapper shows the following output:

```
NOTE: gemma-wrapper is soon to be replaced
Using GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020
Found 0.98.3, comparing against expected v0.98.4
GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020
WARNING: GEMMA version is out of date. Update GEMMA to 0.98.4!
```

Some irony there. We are going to replace gemma-wrapper at some point. And the gemma that is packaged in guix by default is older than what we want to use.
Though the release notes point out mostly maintenance updates:

=> https://github.com/genetics-statistics/GEMMA/releases

We can pull in a local gemma too using the same `--expose` trick as above. So, let's use that for now:

```
.guix-shell --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper --expose=/home/wrk/iwrk/opensource/code/genetics/gemma/=/gemma ruby -- /gemma/bin/gemma
```

This does not work out of the box because I typically build gemma in a different Guix profile, so it won't find the libraries (RPATH is fixed). There are two options: build a static version of gemma or override LD_LIBRARY_PATH. The first used to work, but Guix no longer carries static libs for openblas, so I would need to provide those myself. Overriding LD_LIBRARY_PATH is a bit inconvenient, but in this case doable. We have to run from inside the container and set:

```
env LD_LIBRARY_PATH=$GUIX_ENVIRONMENT/lib/ /gemma/bin/gemma

GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022

 type ./gemma -h [num] for detailed help
 options:
  1: quick guide
  2: file I/O related
  3: SNP QC
...
```

and success. With gemma-wrapper we can pick up the latest gemma by setting GEMMA_COMMAND:

```
(system "env GEMMA_COMMAND=/gemma/bin/gemma /gemma-wrapper/bin/gemma-wrapper")

gemma-wrapper 0.99.6 (Ruby 3.2.2) by Pjotr Prins 2017-2022
NOTE: gemma-wrapper is soon to be replaced
Using GEMMA 0.98.6 (2022-08-05) by Xiang Zhou, Pjotr Prins and team (C) 2012-2022
```

All set for actual compute!

## Running gemma

The following command fails because we don't have a pheno file yet:

```
gemma-wrapper --debug -- -gk -g BXD.8_geno.txt.gz -p pheno.txt -a BXD.8_snps.txt
```

Now the pheno file needs to match the number of genotypes in the geno file which, presumably, is generated from the file `BXD.8.geno`.
It has the header:

```
BXD1 BXD2 BXD5 BXD6 BXD8 BXD9 BXD11 BXD12 BXD13 BXD14 BXD15 BXD16 BXD18 BXD19 BXD20 BXD21 BXD22 BXD23 BXD24 BXD24a BXD25 BXD27 BXD28 BXD29 BXD30 BXD31 BXD32 BXD33 BXD34 BXD35 BXD36 BXD37 BXD38 BXD39 BXD40 BXD41 BXD42 BXD43 BXD44 BXD45 BXD48 BXD48a BXD49 BXD50 BXD51 BXD52 BXD53 BXD54 BXD55 BXD56 BXD59 BXD60 BXD61 BXD62 BXD63 BXD64 BXD65 BXD65a BXD65b BXD66 BXD67 BXD68 BXD69 BXD70 BXD71 BXD72 BXD73 BXD73a BXD73b BXD74 BXD75 BXD76 BXD77 BXD78 BXD79 BXD81 BXD83 BXD84 BXD85 BXD86 BXD87 BXD88 BXD89 BXD90 BXD91 BXD93 BXD94 BXD95 BXD98 BXD99 BXD100 BXD101 BXD102 BXD104 BXD105 BXD106 BXD107 BXD108 BXD109 BXD110 BXD111 BXD112 BXD113 BXD114 BXD115 BXD116 BXD117 BXD119 BXD120 BXD121 BXD122 BXD123 BXD124 BXD125 BXD126 BXD127 BXD128 BXD128a BXD130 BXD131 BXD132 BXD133 BXD134 BXD135 BXD136 BXD137 BXD138 BXD139 BXD141 BXD142 BXD144 BXD145 BXD146 BXD147 BXD148 BXD149 BXD150 BXD151 BXD152 BXD153 BXD154 BXD155 BXD156 BXD157 BXD160 BXD161 BXD162 BXD165 BXD168 BXD169 BXD170 BXD171 BXD172 BXD173 BXD174 BXD175 BXD176 BXD177 BXD178 BXD180 BXD181 BXD183 BXD184 BXD186 BXD187 BXD188 BXD189 BXD190 BXD191 BXD192 BXD193 BXD194 BXD195 BXD196 BXD197 BXD198 BXD199 BXD200 BXD201 BXD202 BXD203 BXD204 BXD205 BXD206 BXD207 BXD208 BXD209 BXD210 BXD211 BXD212 BXD213 BXD214 BXD215 BXD216 BXD217 BXD218 BXD219 BXD220 C57BL/6JxBXD037F1 BXD001xBXD065aF1 BXD009xBXD170F1 BXD009xBXD172F1 BXD012xBXD002F1 BXD012xBXD021F1 BXD020xBXD012F1 BXD021xBXD002F1 BXD024xBXD034F1 BXD032xBXD005F1 BXD032xBXD028F1 BXD032xBXD65bF1 BXD034xBXD024F1 BXD034xBXD073F1 BXD055xBXD074F1 BXD055xBXD65bF1 BXD061xBXD071F1 BXD062xBXD077F1 BXD065xBXD077F1 BXD069xBXD090F1 BXD071xBXD061F1 BXD073bxBXD065F1 BXD073bxBXD077F1 BXD073xBXD034F1 BXD073xBXD065F1 BXD073xBXD077F1 BXD074xBXD055F1 BXD077xBXD062F1 BXD083xBXD045F1 BXD087xBXD100F1 BXD065bxBXD055F1 BXD102xBXD077F1 BXD102xBXD73bF1 BXD170xBXD172F1 BXD172xBXD197F1 BXD197xBXD009F1 BXD197xBXD170F1
```

with 235 named BXD mice. Now when we check the DB we get

```
select * from StrainXRef WHERE InbredSetId=1 AND Used_for_mapping="Y"
276 rows
```

so it looks like we need some extra logic to fetch the used individuals from the actual genotype file(s).

For development I am running:

```
.guix-shell ruby --expose=/home/wrk/iwrk/opensource/code/genetics/gemma-wrapper/=/gemma-wrapper --expose=/home/wrk/iwrk/opensource/code/genetics/gemma/=/gemma -- guile -L . -s ./scripts/precompute/precompute-hits.scm
```

## Writing the phenotype file

For gemma we need to feed a phenotype file that has only the individuals that are in the genotype file (the other 'missing' phenotype values should be NAs). The genotype header in GN2 is read from a `BXD.json` file. Code is at wqflask/wqflask/marker_regression/run_mapping.py:get_genofile_samplelist(dataset)

Zach writes:

> The JSON files were basically just a "stopgap" solution to allow for
> multiple genotype files for any given group (which previously wasn't
> possible), and they don't even exist for most groups - the code uses
> them only if they exist for a given group. They contain the
> title/label, filename, and sample lists for situations where the other
> genotype files only contain a subset of the samples from the full
> sample list, but otherwise (and usually) the samplelist is just fetched
> from the .geno file.
> It's also possible to fetch that subset of samples
> from the other/alternate .geno files, but it's easier to just fetch the
> samples as a list from that .json file than it is to parse it out of
> the other .geno files (and the JSON file would still be used to get
> the "title" and filenames, so it's already being read in the code).
> Basically the JSON files are just a sort of "genotype file metadata"
> thing to deal with the existence of multiple genotype files for a
> single group and provide the "attributes" of the files in question
> (namely the title and filename - the sample lists were only added later
> once we started to get genotype files that didn't contain the full
> samplelist for the group in question). The "titles" appear in the
> drop-down on the trait page* - in their absence it will just use
> "{group_name}.geno" (for example BXD.geno). So if you deleted BXD.json,
> mapping would still work for BXD, but it would just default to BXD.geno
> and wouldn't give the option of other files. This information could
> probably also be stored somehow in the .geno file, but I'm not sure how
> to easily do so since that's a weird proprietary format and would
> seemingly require some sort of direct parsing in the GN code (while
> JSON is more simple to deal with).

=> https://genenetwork.org/show_trait?trait_id=1427571_at&dataset=HC_M2_0606_P

> If you scroll down to mapping and Inspect Element on the
> Genotypes drop-down you'll see how the values are the actual filenames
> while the text is the title from the JSON file

So this is genotype metadata which could go into virtuoso (where all metadata belongs). Anyway, we'll get there when the time comes, but for now I can use it to see what genotypes we have in the genotype file and write the phenotype file accordingly.

I note we also have a GenoData table in mariadb which has allele info. The script

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/load_genotypes.py

loads these from the BXD.geno-type files. As far as I can tell GenoData is not used in GN2+GN3. Other Geno files are basically used to fetch metadata on genotypes. Genotype state lives in 4 places. Time to create a 5th one with lmdb ;).

At least the BXD.json file lines up with BXD.8.geno and the BIMBAM version with 235 inds. Using this information we created our first phenotype file and GEMMA run!

# Successfully running gemma-wrapper

Running the wrapper is a two step process:

```
env TMPDIR=tmp ruby ./bin/gemma-wrapper --force --json \
    --loco -- \
    -g test/data/input/BXD_geno.txt.gz \
    -p test/data/input/BXD_pheno.txt \
    -a test/data/input/BXD_snps.txt \
    -gk > K.json

env TMPDIR=tmp ruby ./bin/gemma-wrapper --json --input K.json \
    -- \
    -g test/data/input/BXD_geno.txt.gz \
    -p test/data/input/BXD_pheno.txt \
    -a test/data/input/BXD_snps.txt \
    -lmm -maf 0.1 > G.json
```

The current LOCO approach leads to two files for every chromosome's GRM and two files for every chromosome's association output. In this case 82 files with a total of 13Mb (4Mb compressed). That is a bit insane if you know the input is 300K, even knowing disk space is cheap(!) Cheap is not always cheap because we still need to process the data, and with growing datasets the size grows rapidly overall. So, this is the right time to put gemma-wrapper on a diet. The GRM files are largest. Currently we create kinship files for every population subset that is used, and that may change once we simply reduce the final GRM by removing cols/rows.
But that is one exercise we want to prove first using our precompute exercise. In this case we will simply compress the kinship files, which halves the size with zip. xz compression brings it down to 1/4. That is impressive by itself. I also checked lzma and bzip2 and they were no better. So, with gemma-wrapper we can now store the GRMs in an xz archive.

For the assoc files we will cat them into one file and compress that too, reducing the size to 1/7th. As noted above, the current cache size for GN is 190Gb for 3 months. We can reduce that significantly and that will speed up lookups. Decompression with xz is very fast.

# Storing GRM output

The current version of gemma-wrapper stores per-chromosome GRMs in separate files. The first fix was to store them in an xz archive. gemma-wrapper already uses a temporary directory, so that was straightforward. Next I had to tell gemma-wrapper not to recompute when the xz archive exists. This now works. The next step is to use the archive to check the GWA run. In the final step we will archive results and write an lmdb file for further processing. If we can make it really small we'll retain that instead of the full archive. This code is part of gemma-wrapper right now:

=> https://github.com/genetics-statistics/gemma-wrapper/blob/master/bin/gemma2lmdb.py

I looked at using SQLITE - which would have been a bit easier - but the size would have been at least double that of lmdb (floats are stored as 8 bytes in SQLITE).

Inside the lmdb file we also store the results of the log files as a JSON 'meta' record. A metadata record is important because it allows for quick assessments later, such as how long the compute took at each step and how much memory was used. We collect metadata at every step. This means that we pass around a growing JSON object. Also we should be able to expose the JSON easily so it can be parsed with python, ruby, jq etc.
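As a minimal illustration of the idea (not the gemma2lmdb.py implementation itself): association values and the JSON 'meta' record can live side by side in one lmdb file. The key/value layout below is an assumption for the sketch:

```python
# Sketch only: store per-SNP association values plus a JSON 'meta' record in a
# single lmdb file. The key/value layout (key = "chr:pos", value = packed floats)
# is an assumption, not the layout used by gemma2lmdb.py.
import json
import struct
import lmdb

env = lmdb.open("assoc.mdb", map_size=2**30)   # 1GB max size; adjust for real data
meta = {"type": "gemma-wrapper", "population": "BXD", "trait": "101500_at"}

with env.begin(write=True) as txn:
    txn.put(b"meta", json.dumps(meta).encode())
    # af, beta, se, -log10(p) for one SNP (values from the Notes section below)
    txn.put(b"1:69315198", struct.pack("<4f", 0.28, 0.495, 0.134, 3.20))

with env.begin() as txn:
    print(json.loads(txn.get(b"meta")))        # the meta record is plain JSON
```

Keeping the meta record as plain JSON inside the lmdb file means it can still be dumped and inspected with python, ruby or jq without any special tooling.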
With metadata we can store some extra information on the ProbeSet:

```
MariaDB [db_webqtl]> select Id,Chr,Mb,Name,Symbol,description from ProbeSet where Id=1 limit 1;
+----+------+-----------+-----------+--------+---------------------------------+
| Id | Chr  | Mb        | Name      | Symbol | description                     |
+----+------+-----------+-----------+--------+---------------------------------+
|  1 | 9    | 44.970689 | 100001_at | Cd3g   | CD3d antigen, gamma polypeptide |
+----+------+-----------+-----------+--------+---------------------------------+
```

and the dataset:

```
MariaDB [db_webqtl]> select * from ProbeSetFreeze WHERE Name='HC_M2_0606_P' limit 5;
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
| Id | ProbeFreezeId | AvgID | Name | Name2 | FullName | ShortName | CreateTime | OrderList | public | confidentiality | AuthorisedUsers | DataScale |
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
| 112 | 30 | 2 | HC_M2_0606_P | Hippocampus_M430_V2_BXD_PDNN_Jun06 | Hippocampus Consortium M430v2 (Jun06) PDNN | Hippocampus M430v2 BXD 06/06 PDNN | 2016-02-11 | 1 | 2 | 0 | | log2 |
+-----+---------------+-------+--------------+------------------------------------+--------------------------------------------+-----------------------------------+------------+-----------+--------+-----------------+-----------------+-----------+
```

gemma-wrapper outputs JSON - but it is fairly elaborate, so we'll reduce it to a minimum. At the time of lmdb creation we will pass in a small JSON file that describes the gemma-wrapper run.

# Storing assoc output

To kick off precompute we added new nodes to the Octopus cluster, doubling its capacity. In the next step we have to compress the output of GEMMA so we can keep it forever. For this we want to have the peaks (obviously), but we also want to retain the 'shape' of the distribution - i.e., the QTL with sign. This shape we can use for correlations and potentially some AI-style mining, the way it is presented in AraQTL.

For the sign we can use the SNP additive effect estimate. The se of Beta is a function of the MAF of the SNP. So if you want to present Beta as the SNP additive effect for a standardized genotype, then you want to use Beta/se; otherwise, Beta is the SNP additive effect for the original, unstandardized genotype. The Beta is obtained by controlling for population structure. For the effect sign we need to check the incoming genotypes because they may have been switched.

Anyway, we can consider compressing the shape the way a CD-ROM is compressed. For now we will compress the storage using xz - as discussed above. In the next step, after running the precompute and storing the highest hit in the DB, we will start using lmdb to store assoc values.

The precompute runs on tux04 with

```
tux04:~/services/gn-guile$ . .guix-shell ruby --expose=/home/wrk/services/gemma-wrapper=/gemma-wrapper --share=/export2/precompute-gemma -- env TMPDIR=/export2/precompute-gemma guile -L . -s ./scripts/precompute/precompute-hits.scm
```

## Final tweaks

The precompute runs and updates the DB. We are creating a new precompute table that contains the highest hits, so we can track the status of the runs.
From this we can update ProbeSetXRef - rather than doing it directly. That way we can also track the top hits for different computations. For one result we have

```js
"meta": {
  "type": "gemma-wrapper",
  "version": "0.99.7-pre1",
  "population": "BXD",
  "name": "HC_U_0304_R",
  "trait": "101500_at",
  "url": "https://genenetwork.org/show_trait?trait_id=101500_at&dataset=HC_U_0304_R",
  "archive_GRM": "46bfba373fe8c19e68be6156cad3750120280e2e-gemma-cXX.tar.xz",
  "archive_GWA": "779a54a59e4cd03608178db4068791db4ca44ab3-gemma-GWA.tar.xz",
  "dataid": 75629,
  "probesetid": 1097,
  "probesetfreezeid": 7
}
```

Continue with the steps:

=> /topics/data/precompute/steps

# Notes

1 rsm10000000577 3.20 Chr1: 69.315198 -0.248

=> chr,pos,af,beta,se,l_mle,p_lrt

1,69315198,0.2800000011920929,0.49527430534362793,0.13411599397659302,100000.0,0.0006254952750168741

Math.log(0.0006254952750168741,10)
=> -3.203775966613006

-0.4952743053436279/2
=> -0.24763715267181394

5 rsm10000001990 3.11 Chr4: 81.464858 0.282

4,81464858,0.36500000953674316,-0.5631462931632996,0.15581850707530975,100000.0,0.0007759517757222056

Math.log(0.0007759517757222056,10)
=> -3.1101652686754857

0.563146/2
=> 0.281573

The likelihood ratio is bounded between zero and one: the p_lrt runs between 0.0 and 1.0. The smaller the number, the higher the -log10 value (basically the number of digits; 0.000001 gives about 6). The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. In our case the l_mle is always exactly 10^5 or -10^5 for some reason. According to the GEMMA doc that means it is a pure genetic effect when positive.

## Xapian

We want to add the log values, effect size and number of individuals to the xapian search.

## NAs in GN

A note from Zach:

On Sat, Dec 09, 2023 at 06:09:56PM -0600, Zachary Sloan wrote:

> (After typing the rest of this out, I realized that part of the
> confusion might be about how locations are stored. We don't actually
> database locations in the ProbeSetXRef table - we only database the
> peak Locus marker name. This is then cross-referenced against the Geno
> table, where the actual Location is stored. This is the main source of
> the problem. So I think the best short-term solution might be to just
> directly database the locations in the ProbeSetXRef table. Those
> locations might become out of date, but as you mention they'd still
> probably be in the same ballpark.)

It is logical to store the location with the peak - if it changes we should recompute. That also adds the idea that we should track the version of the genotypes in that table.

## More complicated datasets

A good dataset to take apart is

=> http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P

because it has 71 BXD samples and 32 other samples.
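For datasets like this, the GEMMA phenotype file has to keep the genotype-file ordering, with NA for individuals that have no phenotype value (e.g. the 32 non-BXD samples). A minimal sketch, with assumed variable names and toy values:

```python
# Sketch only (assumed names and toy values): write a GEMMA phenotype file with
# one value per genotyped individual, in genotype-file order, and NA where the
# dataset has no phenotype for that individual.
geno_individuals = ["BXD1", "BXD2", "BXD5"]        # order from BXD.json / BXD.8.geno
phenotypes = {"BXD1": 6.414, "BXD5": 4.885}        # strain name -> trait value

with open("pheno.txt", "w") as out:
    for ind in geno_individuals:
        value = phenotypes.get(ind)
        out.write(("NA" if value is None else str(value)) + "\n")
```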
## Variations

This paper discusses a number of approaches that may be interesting:

=> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-023-00331-3 Automated quantitative trait locus analysis (AutoQTL)

## Fetch strain IDs

The following join will fetch the StrainIds in a dataset:

```
MariaDB [db_webqtl]> select StrainId, Locus, DataId, ProbeSetId, ProbeSetFreezeId from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId=ProbeSetData.Id where DataId>0 AND Locus_old is NULL ORDER BY DataId LIMIT 5;
+----------+-----------+--------+------------+------------------+
| StrainId | Locus     | DataId | ProbeSetId | ProbeSetFreezeId |
+----------+-----------+--------+------------+------------------+
|        1 | rs6394483 | 115467 |      13825 |                8 |
|        2 | rs6394483 | 115467 |      13825 |                8 |
|        3 | rs6394483 | 115467 |      13825 |                8 |
|        4 | rs6394483 | 115467 |      13825 |                8 |
|        5 | rs6394483 | 115467 |      13825 |                8 |
+----------+-----------+--------+------------+------------------+
5 rows in set (0.205 sec)
```

## Count data sets

Using the above line we can count the number of times BXD1 was used:

```
MariaDB [db_webqtl]> select count(StrainId) from ProbeSetXRef INNER JOIN ProbeSetData ON ProbeSetXRef.DataId=ProbeSetData.Id where DataId>0 AND Locus_old is NULL AND StrainId=1 ORDER BY DataId LIMIT 5;
+-----------------+
| count(StrainId) |
+-----------------+
|        10432197 |
+-----------------+
1 row in set (39.545 sec)
```

Or the main BXDs