Diffstat (limited to 'issues/gn-uploader')
9 files changed, 135 insertions, 24 deletions
diff --git a/issues/gn-uploader/AuthorisationError-gn-uploader.gmi b/issues/gn-uploader/AuthorisationError-gn-uploader.gmi
index 50a236d..262ad19 100644
--- a/issues/gn-uploader/AuthorisationError-gn-uploader.gmi
+++ b/issues/gn-uploader/AuthorisationError-gn-uploader.gmi
@@ -2,7 +2,7 @@
 ## Tags
 * assigned: fredm
-* status: open
+* status: closed, obsoleted
 * priority: critical
 * type: error
 * key words: authorisation, permission
@@ -64,3 +64,7 @@ Genetic type: intercross
 And when pressed the `Create Population` icon, it led to the error above.
+## Closed as Obsolete
+
+* The service this was happening on (https://staging-uploader.genenenetwork.org) is no longer running
+* Most of the authorisation issues are resolved in newer code
diff --git a/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi b/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi
new file mode 100644
index 0000000..3ef05cd
--- /dev/null
+++ b/issues/gn-uploader/export-uploaded-data-to-RDF-store.gmi
@@ -0,0 +1,88 @@
+# Export Uploaded Data to LMDB and RDF Stores
+
+## Tags
+
+* assigned: fredm, bonz
+* priority: medium
+* type: feature-request
+* status: open
+* keywords: API, data upload, gn-uploader
+
+## Description
+
+With the QC/Data Upload project nearing completion, and being placed in front of the initial user-testing cohort, we need a way for exporting all data that is uploaded into the RDF store, either at upload time, or a short time after.
+
+
+Users will use the QC/Data upload project[1] to upload their data to GeneNetwork. This will mostly be numerical data in Tab-Separated-Values (.tsv) files.
+
+Once this is done, we do want to have this data available to the user on GeneNetwork as soon as possible so that they can start doing their analyses with the data.
+
+Following @Munyoki's work[2] on getting the data endpoints on GN3, it should, hypothetically, be possible for the user to simply upload the data, and using the GN3 API, immediately begin their analyses on the data. In practice, however, it will need that we export the uploaded data into LMDB, and possibly any related metadata into virtuoso to enable this to work.
+
+This document explores what is needed to get that to work.
+
+## Exporting Sample Data
+
+We can export the sample (numeric) data to LMDB with the "dataset->lmdb" project[3].
+
+The project (as of 2023-11-14T10:12+03:00UTC) does not define an installable binary/script, and therefore cannot be simply added to the data upload project[1] as a dependency and invoked in the background.
+
+### Data Differences
+
+The first line of the .tsv file uploaded is a header line indicating what each field is.
+The first field of the .tsv is a trait's name/identifier. All other fields are numerical strain/sample values for each line/record in the file.
+
+A sample of a .tsv for upload
+=> https://gitlab.com/fredmanglis/gnqc_py/-/blob/main/tests/test_data/average.tsv?ref_type=heads can be found here
+
+From
+=> https://github.com/BonfaceKilz/gn-dataset-dump/blob/main/README.org the readme
+it looks like the each record/line/trait from the .tsv file will correspond to a "db-path" in the LMDB data store. This path could be of the form:
+
+```
+/path/to/lmdb/storage/directory/<group-or-inbredset>/<trait-name-or-identifier>/
+```
+
+where
+
+* `<group-or-inbredset>` is a population/group of sorts, e.g. BXD, BayXSha, etc
+* `<trait-name-or-identifier>` is the value in the first field for each and every line
+
+**NB**: Verify this with @Munyoki
+
+### TODOs
+
+* [ ] build an entrypoint binary/script to invoke from other projects
+* [ ] verify initial inference/assumptions regarding data with @Munyoki
+* [ ] translate the uploaded data into a form ingestable by the export program. This could be done in either one of the projects -- I propose the QC/Data Upload project
+* [ ] figure out and document new GN3 data endpoints for users
+* [ ]
+
+## Exporting Metadata
+
+Immediately after upload of the data from the .tsv files, the data will most likely have very little metadata attached. Some of the metadata that is assured to be present is:
+
+* Species: The species that the data regards
+* Group/InbredSet
+* Dataset: The dataset that the data is attached to
+
+The metadata is useful for searching for the data. The "metadata->rdf" project[4] is used for exporting the metadata to RDF and will need to be used to initialise the metadata for newly uploaded data.
+
+### TODOs
+
+* [ ] How do we handle this?
+
+
+## Related Issues and Topics
+
+=> https://issues.genenetwork.org/topics/next-gen-databases/design-doc
+=> https://issues.genenetwork.org/topics/lmms/rqtl2/using-rqtl2-lmdb-adapter
+=> https://issues.genenetwork.org/issues/dump-sample-data-to-lmdb
+=> https://issues.genenetwork.org/topics/database/genotype-database
+
+## Footnotes
+
+=> https://git.genenetwork.org/gn-uploader/ 1: QC/Data upload project (gn-uploader) repository
+=> https://github.com/genenetwork/genenetwork3/pull/130 2: Munyoki's Pull request
+=> https://github.com/BonfaceKilz/gn-dataset-dump 3: Dataset -> LMDB export repository
+=> https://git.genenetwork.org/gn-transform-databases/ 4: Metadata -> RDF export repository
diff --git a/issues/gn-uploader/guix-build-gn-uploader-error.gmi b/issues/gn-uploader/guix-build-gn-uploader-error.gmi
index 44a5c4b..aeb6308 100644
--- a/issues/gn-uploader/guix-build-gn-uploader-error.gmi
+++ b/issues/gn-uploader/guix-build-gn-uploader-error.gmi
@@ -86,7 +86,7 @@ Filesystem Size Used Avail Use% Mounted on
 so we know that's not a problem.
-A similar thing had shown up on space.uthsc.edu.
+A similar thing had shown up on our space server.
 ### More Troubleshooting Efforts
diff --git a/issues/gn-uploader/handling-tissues-in-uploader.gmi b/issues/gn-uploader/handling-tissues-in-uploader.gmi
index 826af15..0c43040 100644
--- a/issues/gn-uploader/handling-tissues-in-uploader.gmi
+++ b/issues/gn-uploader/handling-tissues-in-uploader.gmi
@@ -2,11 +2,11 @@
 ## Tags
-* status: open
+* status: closed, wontfix
 * priority: high
 * assigned: fredm
 * type: feature-request
-* keywords: gn-uploader, tissues
+* keywords: gn-uploader, tissues, archived
 ## Description
@@ -112,3 +112,9 @@ ALTER TABLE Tissue MODIFY Id INT(5) UNIQUE NOT NULL;
 * [1] https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#ProbeFreeze
 * [2] https://gn1.genenetwork.org/webqtl/main.py?FormID=schemaShowPage#Tissue
+
+## Closed as WONTFIX
+
+I am closing this issue because it was created (2024-03-28) while I had a fundamental misunderstanding of the way data is laid out in the database.
+
+The information on the schema/layout of the tables is still useful, but chances are, we'll look at the tables themselves anyway should we need to figure out the schema.
diff --git a/issues/gn-uploader/link-authentication-authorisation.gmi b/issues/gn-uploader/link-authentication-authorisation.gmi
index 90b8e5e..b64f887 100644
--- a/issues/gn-uploader/link-authentication-authorisation.gmi
+++ b/issues/gn-uploader/link-authentication-authorisation.gmi
@@ -2,7 +2,7 @@
 ## Tags
-* status: open
+* status: closed, completed
 * assigned: fredm
 * priority: critical
 * type: feature request, feature-request
@@ -13,3 +13,9 @@ The last chain in the link to the uploads is the authentication/authorisation.
 Once the user uploads their data, they need access to it. The auth system, by default, will deny anyone/everyone access to any data that is not linked to a resource and which no user has any roles allowing them access to the data. We, currently, assign such data to the user manually, but that is not a sustainable way of working, especially as the uploader is exposed to more and more users.
+
+### Close as Completed
+
+The current iteration of the uploader does actually take into account the user that is uploading the data, granting them ownership of the uploaded data. By default, the data is not public, and is only accessible to the user who uploaded it.
+
+The user who uploads the data (and therefore own it) can later grant access to other users of the system.
diff --git a/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi b/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi
index 1841d36..af3b274 100644
--- a/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi
+++ b/issues/gn-uploader/probeset-not-applicable-to-all-data.gmi
@@ -4,7 +4,7 @@
 * type: bug
 * assigned: fredm
-* status: open
+* status: closed
 * priority: high
 * keywords: gn-uploader, uploader, ProbeSet
@@ -20,3 +20,10 @@ applicable to our data, I don't think.
 ```
 It seems like some of the data does not require a ProbeSet, and in that case, it should be possible to add it without one.
+
+
+## Notes
+
+This "bug" is obsoleted by the fact that the implementation leading to it was entirely wrong.
+
+The feature that was leading to this bug no longer exists, and will have to be re-implemented from scratch with the involvement of @acenteno.
diff --git a/issues/gn-uploader/provide-page-for-uploaded-data.gmi b/issues/gn-uploader/provide-page-for-uploaded-data.gmi
index 60b154b..5ab7f80 100644
--- a/issues/gn-uploader/provide-page-for-uploaded-data.gmi
+++ b/issues/gn-uploader/provide-page-for-uploaded-data.gmi
@@ -2,7 +2,7 @@
 ## Tags
-* status: open
+* status: closed, completed
 * assigned: fredm
 * priority: medium
 * type: feature, feature request, feature-request
@@ -20,3 +20,8 @@ Once a user has uploaded their data, provide them with a landing page/dashboard
 Depends on
 => /issues/gn-uploader/link-authentication-authorisation
+
+
+## Close as complete
+
+Current uploader directs user to a view of the data they uploader on GN2. This is complete.
diff --git a/issues/gn-uploader/replace-redis-with-sqlite3.gmi b/issues/gn-uploader/replace-redis-with-sqlite3.gmi
index 3e5020a..d3f94f0 100644
--- a/issues/gn-uploader/replace-redis-with-sqlite3.gmi
+++ b/issues/gn-uploader/replace-redis-with-sqlite3.gmi
@@ -15,3 +15,15 @@ We currently (as of 2024-06-27) use Redis for tracking any asynchronous jobs (e.
 A lot of what we use redis for, we can do in one of the many SQL databases (we'll probably use SQLite3 anyway), which are more standardised, and easier to migrate data from and to. It has the added advantage that we can open multiple connections to the database, enabling the different processes to update the status and metadata of the same job consistently.
 Changes done here can then be migrated to the other systems, i.e. GN2, GN3, and gn-auth, as necessary.
+
+### 2025-12-31: Progress Update
+
+Initial basic implementation can be found in:
+
+=> https://git.genenetwork.org/gn-libs/tree/gn_libs/jobs
+=> https://git.genenetwork.org/gn-uploader/commit/?id=774a0af9db439f50421a47249c57e5a0a6932301
+=> https://git.genenetwork.org/gn-uploader/commit/?id=589ab74731aed62b1e1b3901d25a95fc73614f57
+
+and others.
+
+More work needs to be done to clean-up some minor annoyances.
diff --git a/issues/gn-uploader/samplelist-details.gmi b/issues/gn-uploader/samplelist-details.gmi
deleted file mode 100644
index 2e64d8a..0000000
--- a/issues/gn-uploader/samplelist-details.gmi
+++ /dev/null
@@ -1,17 +0,0 @@
-# Explanation of how Sample Lists are handled in GN2 (and may be handled moving forward)
-
-## Tags
-
-* status: open
-* assigned: fredm, zsloan
-* priority: medium
-* type: documentation
-* keywords: strains, gn-uploader
-
-## Description
-
-Regarding the order of samples/strains, it can basically be whatever we decide it is. It just needs to stay consistent (like if there are multiple genotype files). It only really affects how the strains are displayed, and any other genotype files we use for mapping needs to share the same order.
-
-I think this is the case regardless of whether it's strains or individuals (and both the code and files make no distinction). Sometimes it just logically makes sense to sort them in a particular way for display purposes (like BXD1, BXD2, etc), but technically everything would still work the same if you swapped those columns across all genotype files. Users would be confused about why BXD2 is before BXD1, but everything would still work and all calculations would give the same results.
-
-zsloan's proposal for handling sample lists in the future is to just store them in a JSON file in the genotype_files/genotype directory.
