From 35c4cec2c3c1593b59bc29fa5a738f857ecc270f Mon Sep 17 00:00:00 2001
From: Arun Isaac
Date: Tue, 19 Jul 2022 15:02:48 +0530
Subject: Rescue quality control issues from topics.

---
 issues/quality-control/qc-checks.gmi | 55 ++++++++++++++++++++++++++++++++++++
 issues/quality-control/qc.gmi        | 41 +++++++++++++++++++++++++++
 issues/quality-control/ui-design.gmi | 49 ++++++++++++++++++++++++++++++++
 topics/quality-control/qc-checks.gmi | 55 ------------------------------------
 topics/quality-control/qc.gmi        | 41 ---------------------------
 topics/quality-control/ui-design.gmi | 49 --------------------------------
 6 files changed, 145 insertions(+), 145 deletions(-)
 create mode 100644 issues/quality-control/qc-checks.gmi
 create mode 100644 issues/quality-control/qc.gmi
 create mode 100644 issues/quality-control/ui-design.gmi
 delete mode 100644 topics/quality-control/qc-checks.gmi
 delete mode 100644 topics/quality-control/qc.gmi
 delete mode 100644 topics/quality-control/ui-design.gmi

diff --git a/issues/quality-control/qc-checks.gmi b/issues/quality-control/qc-checks.gmi
new file mode 100644
index 0000000..dc18f94
--- /dev/null
+++ b/issues/quality-control/qc-checks.gmi
@@ -0,0 +1,55 @@
+# Quality Control Checks
+
+1. ProbeSetId (Affymetrix format):
+
+We favour using Illumina, Affimetrix, and other platform formats.
+
+Custom formats require a new annotation file to be created.
+
+We usually use Ensemble ID or Gene IDs.
+
+1.1 Ensemble transcript IDs usually have duplicates that need to be pruned.
+
+ENSMBL1234
+
+## Example Gene Symbol to ProbeSetId
+
+AFFX-BkGr-GC03_st -> TCO500002136.mm.2
+
+2. Inbred Strain names should prefer long form:
+
+B6 -> C57BL/6
+D2 -> DBA/2
+
+3. Probeset IDs that don't have any values should be pruned:
+
+For example an Affymetrix data set might have ~28,000 entries and the data set that
+is allowed into the GeneNetwork will be 22,000 entries.
+
+4. The standard error between male and female mice has to be computed.
+
+5. SE values have to be computed to 6 or greater decimal places.
+
+6. The average between male and female mice has to be computed to 3 decimal places.
+
+7. Datasets/studies having the same ProbeSetID should be grouped together.
+
+8. There should be no trailing spaces in data cells.
+
+9. Entries should have the same capitalization style.
+
+10. Assesing Phenotypes for normality with Shapiro-Wilk Test.
+
+11. Check for annotations file.
+
+12. Check for CRLF.
+
+13. Check for UTF-8 encoding.
+
+## Tags
+
+* assigned: jgart
+* type: feature-request
+* priority: high
+* status: unclear
+* keywords: quality control
diff --git a/issues/quality-control/qc.gmi b/issues/quality-control/qc.gmi
new file mode 100644
index 0000000..7b5d1e4
--- /dev/null
+++ b/issues/quality-control/qc.gmi
@@ -0,0 +1,41 @@
+# Quality Control Project
+
+Develop an app with a web interface to automate the job of cleaning tsv data
+files for entry. The app would be used by a group of users on a network to
+upload data.
+
+QC should be embedded functionality of the data uploader that Bonface has written.
+
+* Upload data through REST API - it goes into a temp dir for a user (data is in
+  escrow) - Bonface wrote this already
+* Run QC - what Arthur proposes (start here)
+* Show results - run tools (hard part!)
+* User can say - please accept data (Bonface wrote this)
+* Curator accepts data (different person!) (Bonface wrote this)
+* Data gets piped into GN proper
+
+The QC step consists of
+
+* Standard checks - some GN tools, such as outliers
+* Run mapping
+
+So, even though the data is in 'escrow' we should be able to use it as
+something that is in the database. GN1 does some of that. This is
+where Arun comes in - we need to have a common handler for data that
+is in the database and data that is in escrow. My idea is that this
+will all be text files (truth files). A simple first QC step is to
+check that all fields in the table are numbers where should be. Not
+text.
+
+Note we could run QC through the REST API too. That would allow it to
+be run from R and Python and Jupyter notebooks. Make it part of GN3.
+
+The tricky part is still how the data is handled in escrow.
+
+## Tags
+
+* assigned: jgart
+* priority: high
+* type: feature-request
+* status: in progress, beta
+* keywords: quality control
diff --git a/issues/quality-control/ui-design.gmi b/issues/quality-control/ui-design.gmi
new file mode 100644
index 0000000..029a2b8
--- /dev/null
+++ b/issues/quality-control/ui-design.gmi
@@ -0,0 +1,49 @@
+# UI Design
+
+1. Input/Receive Data in UI (drag and drop/upload submit form)
+
+2. Select Mouse
+
+"What type of Group are you using?"
+
+> (AKXD, BXH, Mouse Diversity Panel, BXD)
+
+3. "What is your platform?"
+
+> (Aff, Ilumina, ...)
+
+If Affymetrix (Aff) is selected then there should be various options
+like Clarion S.
+
+If the platform you chose is not available:
+
+  Tell PI that they should solicit for their platform to be added to the list.
+
+  They can contact us via email.
+
+4. Allow excel file upload?
+
+## More Example UI Interactions and Checks
+
+"If your dataset does not comply with GN then you can try uploading your
+dataset so that we can inspect it."
+
+"Your dataset has two erroneous entries: Gene Accession Gene."
+
+"The last two columns have the wrong format for the strain name."
+
+"Here's our format of how your dataset should look like."
+
+> ProbeSetID Strains ...
+
+"Inbred Set ID 1 is the same as BXD"
+
+> These are the strains: ...
+
+## Tags
+
+* assigned: jgart
+* type: feature-request
+* status: unclear
+* priority: medium
+* keywords: UI, quality control
diff --git a/topics/quality-control/qc-checks.gmi b/topics/quality-control/qc-checks.gmi
deleted file mode 100644
index dc18f94..0000000
--- a/topics/quality-control/qc-checks.gmi
+++ /dev/null
@@ -1,55 +0,0 @@
-# Quality Control Checks
-
-1. ProbeSetId (Affymetrix format):
-
-We favour using Illumina, Affimetrix, and other platform formats.
-
-Custom formats require a new annotation file to be created.
-
-We usually use Ensemble ID or Gene IDs.
-
-1.1 Ensemble transcript IDs usually have duplicates that need to be pruned.
-
-ENSMBL1234
-
-## Example Gene Symbol to ProbeSetId
-
-AFFX-BkGr-GC03_st -> TCO500002136.mm.2
-
-2. Inbred Strain names should prefer long form:
-
-B6 -> C57BL/6
-D2 -> DBA/2
-
-3. Probeset IDs that don't have any values should be pruned:
-
-For example an Affymetrix data set might have ~28,000 entries and the data set that
-is allowed into the GeneNetwork will be 22,000 entries.
-
-4. The standard error between male and female mice has to be computed.
-
-5. SE values have to be computed to 6 or greater decimal places.
-
-6. The average between male and female mice has to be computed to 3 decimal places.
-
-7. Datasets/studies having the same ProbeSetID should be grouped together.
-
-8. There should be no trailing spaces in data cells.
-
-9. Entries should have the same capitalization style.
-
-10. Assesing Phenotypes for normality with Shapiro-Wilk Test.
-
-11. Check for annotations file.
-
-12. Check for CRLF.
-
-13. Check for UTF-8 encoding.
-
-## Tags
-
-* assigned: jgart
-* type: feature-request
-* priority: high
-* status: unclear
-* keywords: quality control
diff --git a/topics/quality-control/qc.gmi b/topics/quality-control/qc.gmi
deleted file mode 100644
index 7b5d1e4..0000000
--- a/topics/quality-control/qc.gmi
+++ /dev/null
@@ -1,41 +0,0 @@
-# Quality Control Project
-
-Develop an app with a web interface to automate the job of cleaning tsv data
-files for entry. The app would be used by a group of users on a network to
-upload data.
-
-QC should be embedded functionality of the data uploader that Bonface has written.
-
-* Upload data through REST API - it goes into a temp dir for a user (data is in
-  escrow) - Bonface wrote this already
-* Run QC - what Arthur proposes (start here)
-* Show results - run tools (hard part!)
-* User can say - please accept data (Bonface wrote this)
-* Curator accepts data (different person!) (Bonface wrote this)
-* Data gets piped into GN proper
-
-The QC step consists of
-
-* Standard checks - some GN tools, such as outliers
-* Run mapping
-
-So, even though the data is in 'escrow' we should be able to use it as
-something that is in the database. GN1 does some of that. This is
-where Arun comes in - we need to have a common handler for data that
-is in the database and data that is in escrow. My idea is that this
-will all be text files (truth files). A simple first QC step is to
-check that all fields in the table are numbers where should be. Not
-text.
-
-Note we could run QC through the REST API too. That would allow it to
-be run from R and Python and Jupyter notebooks. Make it part of GN3.
-
-The tricky part is still how the data is handled in escrow.
-
-## Tags
-
-* assigned: jgart
-* priority: high
-* type: feature-request
-* status: in progress, beta
-* keywords: quality control
diff --git a/topics/quality-control/ui-design.gmi b/topics/quality-control/ui-design.gmi
deleted file mode 100644
index 029a2b8..0000000
--- a/topics/quality-control/ui-design.gmi
+++ /dev/null
@@ -1,49 +0,0 @@
-# UI Design
-
-1. Input/Receive Data in UI (drag and drop/upload submit form)
-
-2. Select Mouse
-
-"What type of Group are you using?"
-
-> (AKXD, BXH, Mouse Diversity Panel, BXD)
-
-3. "What is your platform?"
-
-> (Aff, Ilumina, ...)
-
-If Affymetrix (Aff) is selected then there should be various options
-like Clarion S.
-
-If the platform you chose is not available:
-
-  Tell PI that they should solicit for their platform to be added to the list.
-
-  They can contact us via email.
-
-4. Allow excel file upload?
-
-## More Example UI Interactions and Checks
-
-"If your dataset does not comply with GN then you can try uploading your
-dataset so that we can inspect it."
-
-"Your dataset has two erroneous entries: Gene Accession Gene."
-
-"The last two columns have the wrong format for the strain name."
-
-"Here's our format of how your dataset should look like."
-
-> ProbeSetID Strains ...
-
-"Inbred Set ID 1 is the same as BXD"
-
-> These are the strains: ...
-
-## Tags
-
-* assigned: jgart
-* type: feature-request
-* status: unclear
-* priority: medium
-* keywords: UI, quality control
-- 
cgit v1.2.3
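
Note on the checks moved by this patch: a few of the items in qc-checks.gmi and qc.gmi (no trailing spaces in data cells, check for CRLF, check for UTF-8 encoding, and "all fields in the table are numbers where should be") are mechanical enough to sketch. The following is only an illustration and not code from GeneNetwork. The function name qc_check_tsv, the message wording, and the assumed layout (a header row, an ID in the first column, numeric values in every other column, empty cells treated as errors) are all assumptions made for the example.

```python
def _is_number(cell: str) -> bool:
    """Return True if the cell parses as a float (illustrative rule only)."""
    try:
        float(cell)
    except ValueError:
        return False
    return True


def qc_check_tsv(raw: bytes) -> list[str]:
    """Return a list of human-readable QC problems found in a TSV payload.

    Hypothetical helper: the name, signature, and messages are assumptions,
    not part of any GeneNetwork API.
    """
    # Check 13: the payload must decode as UTF-8.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc.reason} at byte {exc.start}"]

    problems = []

    # Check 12: flag CRLF line endings.
    if "\r\n" in text:
        problems.append("file uses CRLF line endings")

    # Normalise line endings and drop empty lines before inspecting cells.
    lines = [ln for ln in text.replace("\r\n", "\n").split("\n") if ln != ""]
    if not lines:
        return problems + ["file is empty"]

    # Assumption: row 1 is a header and is not value-checked.
    for row_no, line in enumerate(lines[1:], start=2):
        for col_no, cell in enumerate(line.split("\t"), start=1):
            # Check 8: no trailing whitespace in data cells.
            if cell != cell.rstrip():
                problems.append(f"row {row_no}, column {col_no}: trailing whitespace")
            # qc.gmi: value columns must hold numbers, not text.
            # Assumption: column 1 is the ID column; empty cells are flagged.
            if col_no > 1 and not _is_number(cell.strip()):
                problems.append(
                    f"row {row_no}, column {col_no}: expected a number, got {cell!r}"
                )
    return problems
```

Calling qc_check_tsv on the raw bytes of an uploaded file returns an empty list when none of these checks fail, so the same function could sit behind a web form or a REST endpoint. Which columns must be numeric, and how missing values are encoded, would have to come from the real dataset format.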