# Precompute mapping input data

GN relies on precomputed mapping scores for search and other functionality. Here we prepare for a new generation of functionality that introduces LMMs for the compute and multiple significant scores for queries.

# Tags

* assigned: pjotrp
* type: precompute, gemma
* status: in progress
* priority: high
* keywords: ui, correlations

# Tasks

* [ ] Start using GEMMA for precomputed values as a background pipeline on a different machine
* [ ] Update the table values using GEMMA output (single highest score)

Above is the quick win for plugging in GEMMA values. We will make sure not to recompute values that are already up to date. This is achieved by naming the input and output files as a hash on their DB inputs.

Next:

* [ ] Store all GEMMA values efficiently
* [ ] Track metadata of computed datasets (in RDF?)
* [ ] Compute significance with GEMMA or another LMM (bulkLMM?)
* [ ] Store significance and significant values for processing
* [ ] Update search & correlations to use these
* [ ] Further optimize computations so they can run continuously in the background

# Info

## Original qtlreaper version

The original reaper precompute lives in

=> https://github.com/genenetwork/genenetwork2/blob/testing/scripts/maintenance/QTL_Reaper_v6.py

This script first fetches the inbred sets

```
MariaDB [db_webqtl]> select Id, Name from InbredSet limit 5;
+----+----------+
| Id | Name     |
+----+----------+
|  1 | BXD      |
|  2 | B6D2F2   |
|  4 | AXBXA    |
|  5 | AKXD     |
|  6 | B6BTBRF2 |
+----+----------+
```

and expands them to a .geno file, e.g. BXD.geno. Note that the script does not compute with the many variations of .geno files we have today. Next it sets the Id for ProbeSetFreeze, which is the same as the InbredSet Id. So, ProbeSetFreeze.Id == InbredSet.Id.
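For orientation, this first step can be replayed with guile-dbi, the library used by the test script further down in this document. A minimal sketch only: the connection string is copied from that example, and the loop assumes dbi-get_row returns an alist (with string keys) per row and #f when the result set is exhausted:

```scheme
(use-modules (dbi dbi)
             (ice-9 format))

;; Connection string as in the guile-dbi test script below; user/password
;; are placeholders for whatever the runner will actually use.
(define db
  (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))

(dbi-query db "SELECT Id, Name FROM InbredSet")

;; Walk the result set row by row (dbi-get_row -> alist, #f at the end).
(let loop ((row (dbi-get_row db)))
  (when row
    (format #t "~a\t~a\n"
            (assoc-ref row "Id")
            (assoc-ref row "Name"))
    (loop (dbi-get_row db))))

(dbi-close db)
```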
There are groups/collections, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06":

```
select max(distinct(ProbeSetFreezeId)) from ProbeSetXRef limit 5;
| 1054 |
```

Next, for this Id we fetch what the script calls `ProbeSetXRefInfos`:

```
MariaDB [db_webqtl]> select ProbeSetId, Locus, DataId from ProbeSetXRef where ProbeSetFreezeId=1 limit 5;
+------------+----------------+--------+
| ProbeSetId | Locus          | DataId |
+------------+----------------+--------+
|          1 | rs13480619     |      1 |
|          2 | rs29535974     |      2 |
|          3 | rs49742109     |      3 |
|          4 | rsm10000002321 |      4 |
|          5 | rsm10000019445 |      5 |
+------------+----------------+--------+
```

Then we fetch the trait values:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.Id = 1 limit 5;
+----+----------+-------+
| Id | StrainId | value |
+----+----------+-------+
|  1 |        1 | 5.742 |
|  1 |        2 | 5.006 |
|  1 |        3 | 6.079 |
|  1 |        4 | 6.414 |
|  1 |        5 | 4.885 |
+----+----------+-------+
```

with names:

```
MariaDB [db_webqtl]> select Strain.Name, ProbeSetData.value from Strain, ProbeSetData where Strain.Id = ProbeSetData.StrainId and ProbeSetData.Id = 1 limit 5;
+----------+-------+
| Name     | value |
+----------+-------+
| B6D2F1   | 5.742 |
| C57BL/6J | 5.006 |
| DBA/2J   | 6.079 |
| BXD1     | 6.414 |
| BXD2     | 4.885 |
+----------+-------+
```

35 strains were used for this dataset:

```
select count(ProbeSetData.value) from ProbeSetData where ProbeSetData.Id = 1;
+---------------------------+
| count(ProbeSetData.value) |
+---------------------------+
|                        35 |
+---------------------------+
```

With these genotypes (from files) and these phenotypes (from MySQL) qtlreaper is started, and next we update the values in

```
select * from ProbeSetXRef where ProbeSetId=1 and ProbeSetFreezeId=1 limit 5;
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old  | LRS_old          | pValue_old | mean             | se                  | Locus      | LRS             | pValue | additive    | h2   |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
|                1 |          1 |      1 | 10.095.400 | 13.3971627898894 |      0.163 | 5.48794285714286 | 0.08525787814808819 | rs13480619 | 12.590069931048 |  0.269 | -0.28515625 | NULL |
+------------------+------------+--------+------------+------------------+------------+------------------+---------------------+------------+-----------------+--------+-------------+------+
```

The actual update is

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

so the Locus, LRS and additive fields are updated.
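In the precompute runner this update can be replayed per trait along the following lines. A minimal guile-dbi sketch: the locus, LRS and additive values stand in for whatever the (GEMMA) mapping run returns, the connection string copies the test script further down, and real code should escape or validate the interpolated values:

```scheme
(use-modules (dbi dbi)
             (ice-9 format))

(define db
  (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))

;; Store the top hit of one mapping run for a trait identified by
;; (ProbeSetId, ProbeSetFreezeId).  guile-dbi takes a plain SQL string,
;; so we format it ourselves; production code must escape these values.
(define (store-top-hit! probeset-id freeze-id locus lrs additive)
  (dbi-query db
             (format #f "UPDATE ProbeSetXRef
                         SET Locus='~a', LRS=~a, additive=~a
                         WHERE ProbeSetId=~a AND ProbeSetFreezeId=~a"
                     locus lrs additive probeset-id freeze-id)))

;; Hypothetical values for trait 1 in dataset 1 (cf. the table above).
(store-top-hit! 1 1 "rs13480619" 12.59 -0.285)

(dbi-close db)
```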
The old reaper scores are in

```
MariaDB [db_webqtl]> select ProbeSetId,ProbeSetFreezeId,Locus,LRS,additive from ProbeSetXRef limit 10;
+------------+------------------+----------------+------------------+--------------------+
| ProbeSetId | ProbeSetFreezeId | Locus          | LRS              | additive           |
+------------+------------------+----------------+------------------+--------------------+
|          1 |                1 | rs13480619     | 12.590069931048  | -0.28515625        |
|          2 |                1 | rs29535974     | 10.5970737900941 | -0.116783333333333 |
|          3 |                1 | rs49742109     | 6.0970532702754  | 0.112957489878542  |
|          4 |                1 | rsm10000002321 | 11.7748675511731 | -0.157113725490196 |
|          5 |                1 | rsm10000019445 | 10.9232633740162 | 0.114764705882353  |
|          6 |                1 | rsm10000017255 | 8.45741703245224 | -0.200034412955466 |
|          7 |                1 | rs4178505      | 7.46477918183565 | 0.104331983805668  |
|          8 |                1 | rsm10000144086 | 12.1201771258006 | -0.134278431372548 |
|          9 |                1 | rsm10000014965 | 11.8837168740735 | 0.341458333333334  |
|         10 |                1 | rsm10000020208 | 10.2809848009836 | -0.173866666666667 |
+------------+------------------+----------------+------------------+--------------------+
10 rows in set (0.000 sec)
```

This means that for every trait in every dataset only a single maximum score gets stored(!)

From this exercise we can conclude:

* Existing precomputed values come from linear regression (the old qtlreaper)
* Only one highest score is stored
* Every time the script is run *all* datasets are recomputed
* The '_old' named values in the table are not touched by the script
* h2 and mean are not updated
* No significance score is computed (needs permutations)

Rob voiced a wish to retain all scores (at least those above 1.0). This is not feasible in MariaDB, but we can retain GEMMA output if stored efficiently.

## Notes on ProbeSetXRef

ProbeSetXRef is pretty small, currently at 5.6 GB and 48,307,650 rows, so we could decide to add columns to track different mappers.

Something odd:

```
select count(LRS) from ProbeSetXRef;
+------------+
| count(LRS) |
+------------+
|   28335955 |
+------------+
```

Only a bit more than half of the LRS fields are filled (28,335,955 of 48,307,650 rows). And what to think of

```
MariaDB [db_webqtl]> select * from ProbeSetXRef where LRS=0 limit 5;
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
| ProbeSetFreezeId | ProbeSetId | DataId   | Locus_old | LRS_old | pValue_old | mean | se   | Locus          | LRS  | pValue | additive | h2   |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
|              589 |    8599010 | 70606737 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
|              589 |    8593710 | 70606738 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
|              589 |    8607637 | 70606739 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
|              589 |    8594490 | 70606740 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
|              589 |    8591994 | 70606741 | NULL      | NULL    | NULL       |    0 | NULL | rsm10000000001 |    0 | NULL   |        0 | NULL |
+------------------+------------+----------+-----------+---------+------------+------+------+----------------+------+--------+----------+------+
```

```
MariaDB [db_webqtl]> select count(*) from ProbeSetXRef where LRS=0 and Locus="rsm10000000001";
+----------+
| count(*) |
+----------+
|   291447 |
+----------+
```

There is obviously more like this. I think this table could use some cleaning up.

Looking at the GN1 code base, ProbeSetXRef is used in a great number of files.
In GN2 it is used in:

```
wqflask/maintenance/quantile_normalize.py
wqflask/maintenance/generate_probesetfreeze_file.py
wqflask/base/mrna_assay_tissue_data.py
wqflask/wqflask/update_search_results.py
wqflask/wqflask/correlation/rust_correlation.py
wqflask/base/trait.py
wqflask/wqflask/do_search.py
wqflask/base/data_set/mrnaassaydataset.py
wqflask/base/data_set/dataset.py
wqflask/wqflask/correlation/pre_computes.py
wqflask/wqflask/api/router.py
wqflask/wqflask/show_trait/show_trait.py
scripts/insert_expression_data.py
scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
scripts/maintenance/readProbeSetSE_v7.py
scripts/maintenance/QTL_Reaper_v6.py
scripts/maintenance/readProbeSetMean_v7.py
wqflask/tests/unit/base/test_mrna_assay_tissue_data.py
```

Let's visit these one by one to make sure there are no side-effects.

### wqflask/maintenance/quantile_normalize.py

Appears to be a one-off to normalize a certain dataset in `/home/zas1024/cfw_data/`. The last time it was used was probably in 2018.

### wqflask/maintenance/generate_probesetfreeze_file.py

This one is even older and probably dates from 2013.

### wqflask/base/mrna_assay_tissue_data.py

Another dinosaur; it actually uses TissueProbeSetXRef, so it is not relevant here.

### wqflask/wqflask/update_search_results.py

This is for the global 'gene' search, making sure that ProbeSet.Id = ProbeSetXRef.ProbeSetId AND ProbeSetXRef.ProbeSetFreezeId = ProbeSetFreeze.Id. LRS is passed on.

### wqflask/wqflask/correlation/rust_correlation.py

Again it makes sure the IDs match. LRS is passed on. This requires scrutiny (FIXME).

### wqflask/base/trait.py

The Trait class defines a trait in webqtl; it can be a microarray, published phenotype, genotype, or user-input trait. It fetches LRS.

### wqflask/wqflask/do_search.py

Searches for genes with a QTL within the given LRS values. LRS searches can take 3 different forms:

- LRS > (or <) min/max_LRS
- LRS=(min_LRS max_LRS)
- LRS=(min_LRS max_LRS chromosome start_Mb end_Mb)

where min/max_LRS represent the range of LRS scores and start/end_Mb represent the range in megabases on the given chromosome.

### wqflask/base/data_set/mrnaassaydataset.py

Also used for search results.

### wqflask/base/data_set/dataset.py

ID sync.

### wqflask/wqflask/correlation/pre_computes.py

ID sync.

### wqflask/wqflask/api/router.py

The fetch traits and sample data API (we may use this!).

### wqflask/wqflask/show_trait/show_trait.py

Same.
### The following are a bunch of scripts:

* scripts/insert_expression_data.py
* scripts/maintenance/Update_Case_Attributes_MySQL_tab.py
* scripts/maintenance/readProbeSetSE_v7.py
* scripts/maintenance/QTL_Reaper_v6.py
* scripts/maintenance/readProbeSetMean_v7.py
* wqflask/tests/unit/base/test_mrna_assay_tissue_data.py

## Using LRS

* wqflask/wqflask/static/new/javascript/lod_chart.js

LRS is used to display a LOD chart, and it appears in a bunch of Python files that match the above list for ProbeSetXRef:

* scripts/maintenance/QTL_Reaper_v6.py
* wqflask/base/webqtlConfig.py
* wqflask/base/trait.py
* wqflask/tests/unit/wqflask/marker_regression/test_display_mapping_results.py
* wqflask/tests/unit/base/test_trait.py
* wqflask/wqflask/parser.py
* wqflask/tests/unit/wqflask/correlation/test_show_corr_results.py
* wqflask/wqflask/update_search_results.py
* wqflask/wqflask/correlation/show_corr_results.py
* wqflask/wqflask/marker_regression/run_mapping.py
* wqflask/wqflask/export_traits.py
* wqflask/wqflask/gsearch.py
* wqflask/base/data_set/phenotypedataset.py
* wqflask/base/data_set/markers.py
* wqflask/wqflask/api/router.py
* wqflask/base/data_set/mrnaassaydataset.py
* wqflask/wqflask/correlation/rust_correlation.py
* wqflask/wqflask/marker_regression/display_mapping_results.py
* wqflask/wqflask/collect.py
* wqflask/wqflask/do_search.py
* wqflask/wqflask/views.py
* wqflask/utility/Plot.py
* test/requests/main_web_functionality.py
* test/requests/correlation_tests.py

From the above it can be concluded that these precomputed values are used for display and for getting search results (filtering on datasets that have a QTL). It looks safe to start replacing qtlreaper results with GEMMA results. The only thing I am not sure about is correlations. A cursory inspection suggests LRS is only used for final output, and that makes sense if correlations are done on (expression) phenotypes. Anyway, if things go haywire we'll find out soon (enough).

At the next stage we ignore all this and start precompute with GEMMA on the BXD.

## Precompute DB

We will use a database to track precompute updates (see below). On the computing host (or client) we should track the following:

* Dataset (phenotypes)
* time: MySQL time the ProbeSetData table was last updated (if identical we can skip)
* Hash on DB inputs (for DB table updates)
* Genotypes
* Algorithm
* Hash on run inputs (phenotypes, genotypes, algorithm, invocation)
* time: run initiated
* time: run completed
* Hostname of run (this)
* File path (this)
* Hash on output data (for validation)
* array of records: DB hostnames + time stamps of successful DB table updates for these servers

The logic is that if the DB table was changed we should recompute the hash on inputs. Note that the ProbeSetData table is the largest at 200G including indices. If that hash changed, the mapping algorithm should rerun. A new record will be created for the new inputs. A flag is set in the updated DB for the precomputed values. We can garbage collect when multiple entries for `Dataset' exist.

What database should we use? Ultimately precompute is part of the main GN setup. So, one could argue MariaDB is a good choice. We would like to share precompute with other setups, however. Also, we typically do not want to run the precompute on the production host. That means we should be able to rebuild the database from a precompute output directory and feed the update to the running (production) server. We want to track compute so we can distribute running the algorithms across servers and/or PBS.
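To make the tracked fields listed above concrete, here is a sketch of what one per-run record could look like, written from Guile and serialized with guile-json (an assumption; the field names and all values are illustrative only, not a fixed schema):

```scheme
(use-modules (json))   ; guile-json: scm->json-string

;; With guile-json, alists with string keys become JSON objects and
;; vectors become JSON arrays.  Values below are made up.
(define run-record
  '(("dataset"            . "HC_M2_0606_P")
    ("probesetdata-time"  . "2023-11-11 06:39:00")   ; last MySQL table update
    ("hash-db-inputs"     . "sha256:ab12...")
    ("genotypes"          . "BXD.geno")
    ("algorithm"          . "gemma-0.98.5")
    ("hash-run-inputs"    . "sha256:cd34...")        ; phenotypes+genotypes+invocation
    ("time-started"       . "2023-11-20 10:00:00")
    ("time-completed"     . "2023-11-20 10:02:11")
    ("host"               . "tux05")
    ("file-path"          . "/export/precompute/cd34.../")
    ("hash-output"        . "sha256:ef56...")
    ("db-updates"         . #((("host" . "tux01")
                               ("time" . "2023-11-21 03:00:00"))))))

(display (scm->json-string run-record #:pretty #t))
```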
This implies the compute machines have to be able to query the DB in some way. Basically a machine has a 'runner' that checks the DB for updates and fetches phenotypes and genotypes. A run is started and on completion the DB is notified and updated. Note that a runner cannot run in parallel on a single results directory, so there is one runner per target output directory. We can have different runners: one for the local machine, one for PBS and one for remote machines. This also implies that 'state' is handled with the output dir (on a machine). We will create a JSON record for every compute. This state can be pushed into 'any' GN instance at any point.

For running jobs the DB can be queried through the (same) REST API. The REST API has to handle the DB, so it has to be able to access the DB and files, i.e. it should run on the same (production) machine.

On the DB we'll create a hash table on the inputs of ProbeSetData. This way we don't have to walk the giant table for every query. This can be handled through a CRON job. So there will be a table:

* Dataset
* Hash on relevant phenotypes (ProbeSetData)
* time: Job started
* host: host running job
* status: (final) job status
* time: DB updated

## Tracking updates to DB

This brings us to CRON jobs. There are several things that ought to be updated when the DB changes. Xapian is one example, and now this table is another. These should run on a regular basis and only update what really changed. We need to support that type of work out of the box with an hourly runner (for precompute) and a daily runner (for Xapian).

One possibility is to use MySQL triggers to track changes to the database table. Even a hash can be computed, as suggested here:

=> https://stackoverflow.com/questions/38732185/hash-of-two-columns-in-mysql

The problem is that this relatively expensive computation is done on every row update. I don't think it is feasible for updating phenotypes. Arthur may find that updates go very slowly. It is possible, however, to just update a boolean value, or even a time stamp; that is cheap. It will probably still increase the update time by 50-100%, and that may be acceptable. We can only find out by trying. But first I am going to simply generate the hashes from the running DB by scanning the full table. If that runs in minutes we should be good with a CRON job for now. Also, in the medium term, we may replace these tables with lmdb files, and it is no good to depend on MySQL/MariaDB for this type of service.

Note that ProbeSetXRef is indexed on DataId, ProbeSetFreezeId, ProbeSetId and Locus. The combination ProbeSetFreezeId+ProbeSetId should be unique and, together with the unique DataId index, identifies the actual dataset, so lookups should be reasonably fast. `ProbeSetFreezeId` is a unique identifier for a dataset, such as "Hippocampus_M430_V2_BXD_PDNN_Jun06". `ProbeSetId` is the trait in the collection, e.g. '100001_at'. `DataId` is the counter in the ProbeSetXRef table.

So now this update should be clear:

```
update ProbeSetXRef set Locus=%s, LRS=%s, additive=%s where ProbeSetId=%s and ProbeSetFreezeId=%s
```

Every trait gets an LRS and an effect size this way.

## Writing script to hash ProbeSetData

tux02 has a running database, so it is good for trials. The first step is to add hashes to ProbeSetData and compute them. Two questions we can ask: (1) should we host that data in the same table and (2) should we use another DB?

For (1) we should *not* use the same table. The current table is reasonably small, and if we expand the 28,335,955 rows of ProbeSetXRef it is going to be huge.
This table is used a lot, so we should create a new table for this specific purpose. That means, even though this table is obviously linked to the running DB, we can actually use different storage. And for this particular purpose I think sqlite will do great. Our preferred 'transient' databases, these days, are lmdb for disk b-trees, redis for in-RAM b-trees, sqlite for tables and virtuoso for RDF. This may change again, but for now these are great default choices. 'Transient' means that the DB can be regenerated from some source of truth. This is not necessarily the case today - particularly authorization and session management are now using sqlite and redis as primary data sources. That is fine, as long as we are fully aware of the choices and what they imply.

We already use SQLite for authorization. Fred has been adding that to GN3 and now to its own authorization service:

=> https://github.com/genenetwork/gn-auth

In the README you can see the DB is located with a parameter 'AUTH_DB'. There is no example, so I checked our production setup. Zach has it set up with

```
LOGLEVEL = "DEBUG"
SECRET_KEY = '9a4a2a...'
SQL_URI = "mysql://user:pwd@127.0.0.1:3306/db_webqtl"
AUTH_DB = "/home/gn2/auth_copy.db"
AUTH_MIGRATIONS = "/home/zas1024/gn-auth/migrations"
```

The DB file is world-writable - we should probably change that. Also, it needs a better storage location. Essentially GN should have a limited number of directories that handle state:

* SQL server (mysql) - read-write
* Other databases - read-write
* Upload files - read-write
* Reference files - read-only
* Secret files - read-only
* Cache files - read-write

Within these directories there should be clear subdirectories for each use case. So, for authorization we could have `/data/gn/databases/auth/production/auth.db'. Likewise, the precompute hash table will be in `/data/gn/databases/precompute/production/hashes.db' and we'll introduce an environment variable "GN_HASH_DB". We'll take care of that later.

For our hashing purposes we'll first create a DB named 'hashes.db'. This table can also be used for correlations etc. down the line by using the 'algorithm' field and its combined input hash. It is generic, in other words.

## Getting precompute running

Now that we have pretty much figured out what to do, it is time to set up precompute. We can start with a single server that has access to MariaDB. The new Genoa Tux04 will be our replacement for Tux01. It has some larger Crucial P3 NVMe SSDs rated at 3.5 GB/s. The Dell drive is 1.4 TB and is rated to be twice as fast. So we'll use that one for MariaDB now. The first step is to make space for all the GN2+GN3 stuff.

Next we test whether we can reach the tux01 ports. tux04 can see tux01 because they are both inside the UTHSC firewall. I just need to open the MySQL port on tux01 and we can keep using that DB for now.

The first phase is to simply start running gemma on tux05 and tux04 using the BXD. I mean, before computing 28,335,955 phenotypes we'll have to have a more serious compute setup!
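Before moving on to the BXD queries below, here is a rough sketch of the 'hashes.db' idea from earlier in this section: create the generic hash table in sqlite and key entries by a SHA256 over the inputs. It assumes the sqlite3 backend of guile-dbi and guile-gcrypt for the hashing; the schema, the path and the way inputs are serialized before hashing are placeholders, not settled design:

```scheme
(use-modules (dbi dbi)
             (gcrypt hash)       ; guile-gcrypt: sha256
             (gcrypt base16)     ; bytevector->base16-string
             (rnrs bytevectors)
             (ice-9 format))

;; Hash the phenotype values of one trait together with the algorithm name.
;; A stand-in for the real "hash on DB inputs": SHA256 over a canonical string.
(define (input-hash algorithm values)
  (bytevector->base16-string
   (sha256 (string->utf8
            (format #f "~a:~{~a,~}" algorithm values)))))

;; sqlite3 backend of guile-dbi; the path is the proposed location, not final.
(define hashdb
  (dbi-open "sqlite3" "/data/gn/databases/precompute/production/hashes.db"))

(dbi-query hashdb "CREATE TABLE IF NOT EXISTS hashes (
  dataset    TEXT,
  algorithm  TEXT,
  hash       TEXT,
  started    TEXT,
  host       TEXT,
  status     TEXT,
  db_updated TEXT)")

;; Record one queued job, keyed by the hash of its inputs (values made up).
(dbi-query hashdb
           (format #f "INSERT INTO hashes (dataset, algorithm, hash, started, host, status)
                       VALUES ('~a','~a','~a', datetime('now'), 'tux05', 'queued')"
                   "HC_M2_0606_P" "gemma"
                   (input-hash "gemma" '(5.742 5.006 6.079 6.414 4.885))))

(dbi-close hashdb)
```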
```
MariaDB [db_webqtl]> select * from Strain where SpeciesId=1 limit 200;
+-----+------------+------------+-----------+--------+------------+
| Id  | Name       | Name2      | SpeciesId | Symbol | Alias      |
+-----+------------+------------+-----------+--------+------------+
|   1 | B6D2F1     | B6D2F1     |         1 | NULL   | NULL       |
|   2 | C57BL/6J   | C57BL/6J   |         1 | B6J    | P          |
|   3 | DBA/2J     | DBA/2J     |         1 | D2J    | P          |
|   4 | BXD1       | BXD1       |         1 | NULL   | NULL       |
|   5 | BXD2       | BXD2       |         1 | NULL   | NULL       |
|   6 | BXD5       | BXD5       |         1 | NULL   | NULL       |
|   7 | BXD6       | BXD6       |         1 | NULL   | NULL       |

select * from Strain where SpeciesId=1 and Name like '%BXD%';
returns 1371 rows
```

The (oddly named) StrainXRef table links StrainId with InbredSetId:

```
MariaDB [db_webqtl]> select * from StrainXRef limit 3;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
+-------------+----------+---------+------------------+----------------+
```

## Getting all BXD datasets

Together with the InbredSet table we have the information we need to fetch BXD datasets.

```
select Id,InbredSetId,InbredSetName,Name,SpeciesId,FullName,public,MappingMethodId,GeneticType | Family,FamilyOrder,MenuOrderId,InbredSetCode from InbredSet limit 3;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
| Id | InbredSetId | InbredSetName     | Name   | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family | FamilyOrder | MenuOrderId | InbredSetCode |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
|  1 |           1 | BXD               | BXD    |         1 | BXD Family        |      2 |               1 |                    0 |           1 |           0 | BXD           |
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2 |         1 | B6D2F2 OHSU Brain |      2 |               1 |                    0 |           3 |           0 | NULL          |
|  4 |           4 | AXB/BXA           | AXBXA  |         1 | AXB/BXA Family    |      2 |               1 |                 NULL |           1 |           0 | AXB           |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+----------------------+-------------+-------------+---------------+
```

(Note the stray `|` in the query: `GeneticType | Family` is evaluated as a bitwise OR of the two columns, hence the single column of 0/NULL values; a `,` was intended.)

So

```
MariaDB [db_webqtl]> select * from StrainXRef where InbredSetID=1 limit 5;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           1 |        1 |      10 | N                | MP_F1          |
|           1 |        2 |      20 | N                | Mat            |
|           1 |        3 |      30 | N                | Pat            |
|           1 |        4 |      40 | Y                | RI_progeny     |
|           1 |        5 |      50 | Y                | RI_progeny     |
+-------------+----------+---------+------------------+----------------+
```

```
select * from ProbeSetData where StrainId = 4 limit 5;
```

fetches all data where strain `BXD1' was used. That is 14,829,435 data points! Now it should be clear the database is designed to go from dataset -> strain and not the other way round. We can, however, walk ProbeSetXRef and get the ProbeSetId. We use that to query ProbeSetData and use the StrainId to make sure it belongs to the BXD before computing.

Next, the fields `Locus_old', `LRS_old' and `pValue_old' are not used, as far as I can tell from the source code. So, we could use these to track updates. I am going to use these _old fields in ProbeSetXRef for that. After checking the backups, we will first set all Locus_old values to NULL.
If a value is NULL we can precompute. After a precompute, Locus_old is set to the value from the reaper compute, as are LRS_old and pValue_old. We will add a column for additive_old.

Let's fetch one where Locus_old is now NULL:

```
select ProbeSetFreezeId,ProbeSetId,DataId,Locus_old,Locus from ProbeSetXRef where Locus_old is NULL limit 1;
+------------------+------------+--------+-----------+-----------+
| ProbeSetFreezeId | ProbeSetId | DataId | Locus_old | Locus     |
+------------------+------------+--------+-----------+-----------+
|               31 |      48552 | 458384 | NULL      | D10Mit194 |
+------------------+------------+--------+-----------+-----------+
1 row in set (0.00 sec)
```

DataId is really ProbeSetDataId, so:

```
MariaDB [db_webqtl]> select * from ProbeSetData where ProbeSetData.id = 458384 limit 1;
+--------+----------+--------+
| Id     | StrainId | value  |
+--------+----------+--------+
| 458384 |       42 | 10.306 |
+--------+----------+--------+
1 row in set (0.00 sec)
```

And the StrainId can be resolved via Strain and StrainXRef:

```
MariaDB [db_webqtl]> select * from Strain where Strain.id=42;
+----+---------+---------+-----------+--------+-------+
| Id | Name    | Name2   | SpeciesId | Symbol | Alias |
+----+---------+---------+-----------+--------+-------+
| 42 | CASE001 | CASE001 |         1 | NULL   | NULL  |
+----+---------+---------+-----------+--------+-------+
```

and

```
MariaDB [db_webqtl]> select * from StrainXRef where StrainId=42;
+-------------+----------+---------+------------------+----------------+
| InbredSetId | StrainId | OrderId | Used_for_mapping | PedigreeStatus |
+-------------+----------+---------+------------------+----------------+
|           2 |       42 |       1 | N                | NULL           |
+-------------+----------+---------+------------------+----------------+
```

```
MariaDB [db_webqtl]> select * from InbredSet where Id=2;
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
| Id | InbredSetId | InbredSetName     | Name   | SpeciesId | FullName          | public | MappingMethodId | GeneticType | Family           | FamilyOrder | MenuOrderId | InbredSetCode | Description |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
|  2 |           2 | B6D2F2 OHSU Brain | B6D2F2 |         1 | B6D2F2 OHSU Brain |      2 |               1 | intercross  | Crosses, AIL, HS |           3 |           0 | NULL          | NULL        |
+----+-------------+-------------------+--------+-----------+-------------------+--------+-----------------+-------------+------------------+-------------+-------------+---------------+-------------+
```

which is not part of the BXD (InbredSetId=1). So, StrainXRef is crucial to get to the actual BXD. Let's look at how it is used in existing code. In GN2 it is used in 2 places only: the router and the sample list:

```
SELECT Strain.Name FROM Strain, StrainXRef WHERE StrainXRef.StrainId = Strain.Id AND StrainXRef.InbredSetId = 1;
```

lists all strain *names* based on Strain.Id. The nomenclature is mind-bogglingly bad. Anyway, it confirms that we can use this to see if something is part of the BXD.

After some thought I decided we will create a new table that mirrors what we'll do with sqlite down the line. The new table will also allow for referring to time stamps and multiple hits.
So, in addition to time stamps and hash values, we'll add to the above update record:

* top hit info (locus, LRS, p-value, effect)
* number of other hits
* file with other hits

And down the line we can point to a different DB table that records significant hits. As a first check it will be interesting to see how these values differ between qtlreaper and GEMMA!

## Create Hash table

This new table can be called 'SignificantHits' or something. ProbeSetXRef is defined as

```
CREATE TABLE ProbeSetXRef (
  ProbeSetFreezeId smallint(5) unsigned NOT NULL DEFAULT 0,
  ProbeSetId int(10) unsigned NOT NULL DEFAULT 0,
  DataId int(10) unsigned NOT NULL DEFAULT 0,
  Locus_old char(20) DEFAULT NULL,
  LRS_old double DEFAULT NULL,
  pValue_old double DEFAULT NULL,
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (DataId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

Not all choices are that sane, but we'll leave it like it is for now (it is a small table). The new table will look like this (LRS = likelihood ratio statistic):

```
CREATE TABLE SignificantHits (
  Id INT UNSIGNED NOT NULL DEFAULT 0,
  ProbeSetFreezeId INT UNSIGNED NOT NULL DEFAULT 0,   -- points to dataset info
  ProbeSetId INT UNSIGNED NOT NULL DEFAULT 0,         -- points to trait info
  ProbeSetDataId INT UNSIGNED NOT NULL DEFAULT 0,     -- id, strain value OR ID of the table?
  mean double DEFAULT NULL,
  se double DEFAULT NULL,
  Locus varchar(50) DEFAULT NULL,
  LRS double DEFAULT NULL,
  pValue double DEFAULT NULL,
  significant double DEFAULT NULL,
  additive double DEFAULT NULL,
  h2 float DEFAULT NULL,
  PRIMARY KEY (Id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
```

Note the prepended Id and the rename from DataId to ProbeSetDataId, because they refer to the same thing:

```
MariaDB [db_webqtl]> select max(DataId) from ProbeSetXRef limit 10;
| max(DataId) |
|    92213693 |

select max(Id) from ProbeSetData limit 3;
| max(Id)  |
| 92213693 |
```

In the future this table will be the replacement for the badly named ProbeSetXRef.

## Creating a list of work on tux04

We'll use a test table first on tux02. It should not matter for precompute, because we'll regenerate hashes from the DB input and the GEMMA output. I.e., we only need to recompute if something has changed in phenotypes or genotypes, and that is quite rare.

We will work on tux04 and I'll first install the database there. In the process we'll need to get the basic environment going that we need for work. First I thought I'd copy the backup from space, as I thought it has a 10Gbps link, but it actually uses 1Gbps and the backup is a month old (need to fix that too!). Alright, let's use tux02 for the DB.

On the new machine tux04 we disable /etc/profile and /etc/bash.bashrc - people can add that themselves and it messes up Guix profiles. After creating my environment it looks like any other machine I use and I have the same keyboard shortcuts. Note that my UID may be wrong for Octopus, but that requires heavier surgery anyway.
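Circling back to the SignificantHits table defined under 'Create Hash table': once a GEMMA run finishes, the runner could record the top hit along these lines. A rough guile-dbi sketch with made-up values and a hypothetical helper name, not existing code:

```scheme
(use-modules (dbi dbi)
             (ice-9 format))

(define db
  (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))

;; Insert one top hit into the (proposed) SignificantHits table; the column
;; list follows the DDL sketched above and the values are placeholders for
;; whatever the GEMMA run produced.
(define (record-top-hit! id freeze-id probeset-id data-id
                         locus lrs p-value significant additive)
  (dbi-query db
             (format #f "INSERT INTO SignificantHits
                         (Id, ProbeSetFreezeId, ProbeSetId, ProbeSetDataId,
                          Locus, LRS, pValue, significant, additive)
                         VALUES (~a,~a,~a,~a,'~a',~a,~a,~a,~a)"
                     id freeze-id probeset-id data-id
                     locus lrs p-value significant additive)))

(record-top-hit! 1 1 1 1 "rs13480619" 12.59 0.269 18.5 -0.285)
(dbi-close db)
```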
Next we create a drop for the backups from another machine, following:

=> ../backup_drops.gmi

Copy the backup, which takes a while over the slow network, and unpack it with the great tool borg:

```
time borg extract --progress /export/backup/bacchus/drop/tux01/borg-tux01::borg-backup-mariadb-20231111-06:39-Sat
```

Next, install a recent MariaDB from Guix as a container using the instructions in

=> mariadb.gmi
=> setting-up-local-development-database.gmi

Move the database files into, for example, /export/mysql/var/lib/mysql and chown the files to your user account. Next:

```
cd /export/mysql
mkdir tmp
mkdir run
tux04:/export/mysql$ ~/opt/guix-pull/bin/guix shell -C -N coreutils sed mariadb --share=/export/mysql/var=/var --share=/export/mysql/tmp=/tmp
export TMPDIR=/tmp
mysqld_safe --datadir='/var/lib/mysql/' --protocol tcp --port=3307 --user=$USER --group=users --nowatch
```

and a client with:

```
/export/mysql$ ~/opt/guix-pull/bin/guix shell mysql -- mysql --socket=var/run/mysqld/mysqld.sock -uwebqtlout -pwebqtlout db_webqtl
```

## Preparing for GEMMA

Meanwhile I have prepared tux04 and tux05 for the new runs. The next step is to query the DB and run GEMMA.

### Testing a script to access the database

A first script on tux02 runs:

```scheme
(use-modules (dbi dbi))
(define db_webqtl (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))
(dbi-query db_webqtl "SELECT * FROM ProbeSetXRef LIMIT 1")
(display (dbi-get_row db_webqtl))

((ProbeSetFreezeId . 1) (ProbeSetId . 1) (DataId . 1) (Locus_old . 10.095.400) (LRS_old . 13.3971627898894) (pValue_old . 0.163) (mean . 5.48794285714286) (se . 0.08525787814808819) (Locus . rs13480619) (LRS . 12.590069931048) (pValue . 0.269) (additive . -0.28515625) (h2 . 0.0))
```

Note that you can run this remotely over guile-geiser.

### Install and run gemma

Now that we have a working DB query method on a compute node, we can start running GEMMA. GEMMA is packaged as part of GNU Guix, so that part is straightforward. After the above guix pull:

```
. ~/opt/guix-pull/etc/profile <- don't do this, it does not work on Debian Guix images!
git clone https://git.genenetwork.org/guix-bioinformatics/
git clone https://gitlab.inria.fr/guix-hpc/guix-past
~/opt/guix-pull/bin/guix package -L ~/guix-bioinformatics/ -L ~/guix-past/modules -A gemma
gemma-gn2  0.98.5-8cd4cdb  out  /home/wrk/guix-bioinformatics/gn/packages/gemma.scm:31:2
```

and install gemma with

```
wrk@tux05:~$ ~/opt/guix-pull/bin/guix package -L ~/guix-bioinformatics/ -L ~/guix-past/modules -i gemma -p ~/opt/gemma
```

### Creating the precompute-runner

The precompute-runner can be part of `gn-guile/scripts/precompute`, i.e., the new GN guile repo's maintenance script dir:

=> https://git.genenetwork.org/gn-guile/tree/scripts/precompute

See the README.md for more information on running the script etc. As we are doing the BXDs first, we fetch a record from ProbeSetXRef that has Locus_old set to NULL AND matches a BXD trait.

## More complicated datasets

A good dataset to take apart is

=> http://genenetwork.org/show_trait?trait_id=1436869_at&dataset=HC_M2_0606_P

because it has 71 BXD samples and 32 other samples.
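Putting the pieces of this page together, the core fetch of the precompute-runner could look roughly like this: take a ProbeSetXRef row with Locus_old still NULL, pull its sample values restricted to BXD strains marked Used_for_mapping, and write them out as a phenotype file for GEMMA. The queries are the ones shown earlier in this document; the helper names, the output file name and the assumption that guile-dbi returns alists with string keys are all illustrative, and the real gn-guile script will need to order the values to match the genotype file:

```scheme
(use-modules (dbi dbi)
             (ice-9 format))

(define db
  (dbi-open "mysql" "webqtlout:webqtlout:db_webqtl:tcp:127.0.0.1:3306"))

;; Collect all rows of a query as a list of alists (dbi-get_row -> #f at end).
(define (query-rows sql)
  (dbi-query db sql)
  (let loop ((row (dbi-get_row db)) (acc '()))
    (if row
        (loop (dbi-get_row db) (cons row acc))
        (reverse acc))))

;; BXD strains used for mapping (InbredSetId=1, see StrainXRef above).
(define bxd-mapping-strains
  (map (lambda (r) (assoc-ref r "Name"))
       (query-rows "SELECT Strain.Name FROM Strain, StrainXRef
                    WHERE StrainXRef.StrainId = Strain.Id
                      AND StrainXRef.InbredSetId = 1
                      AND StrainXRef.Used_for_mapping = 'Y'")))

;; Next trait that still needs a precompute run.
(define next-trait
  (car (query-rows "SELECT ProbeSetFreezeId, ProbeSetId, DataId
                    FROM ProbeSetXRef WHERE Locus_old IS NULL LIMIT 1")))

;; Its sample values, keeping only BXD mapping strains.
(define trait-values
  (filter (lambda (r) (member (assoc-ref r "Name") bxd-mapping-strains))
          (query-rows (format #f "SELECT Strain.Name, ProbeSetData.value
                                  FROM Strain, ProbeSetData
                                  WHERE Strain.Id = ProbeSetData.StrainId
                                    AND ProbeSetData.Id = ~a"
                              (assoc-ref next-trait "DataId")))))

;; Write a simple one-value-per-line phenotype file for GEMMA.
(with-output-to-file "pheno.txt"
  (lambda ()
    (for-each (lambda (r) (format #t "~a\n" (assoc-ref r "value")))
              trait-values)))

(dbi-close db)
```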