rewrite-qc-and-qc-uploads-in-python.gmi « quality-control « issues - gn-gemtext - GeneNetwork gemtext knowledge base and issue tracker (mirror right now)

blob: a5e28b779f1d0846895246be525b1cb982404f25 (plain)
# Rewrite qc and qc-uploads in Python3

## Tags

* type: rewrite
* priority: high
* assigned: fredm
* status: in progress
* keywords: quality control


## Description

Since the quality control application will mostly be maintained outside active GeneNetwork development, and might actually be handed off to other maintainers, there is a need for it to be in an "accessible" language, so that it is easy to hand it off. This rewrite was therefore found to be necessary.

The original QC app(s) were developed by

* jgart

and were written in Common-Lisp. The two applications are:

=> https://git.genenetwork.org/jgart/qc QC library

=> https://git.genenetwork.org/jgart/qc-uploads QC App Front-end

In this document, the discussions of what is necessary to get the application in an acceptable state will be detailed and discussions to get there will also be included.

Some other related issues for consideration, and superseded by this issue are:

=> /issues/quality-control/enumerate-all-qc-checks Enumeration of QC checks for all platforms to test against
=> /issues/quality-control/gene-symbols Mappings between representations of symbol names
=> /issues/quality-control/qc-checks Some checks to run against files
=> /issues/quality-control/qc Notes on original QC App ideas
=> /issues/quality-control/ui-design Some UI design notes on the QC app


### Requirements

* The first row contains the headings, and determines the number of columns
* Each heading in the first row MUST appear in the first row ONLY ONE time
* no empty data cells
* no data cells with spurious characters like `eeeee`, `5.555iloveguix`, etc...
* decimal numbers must conform to the following criteria:
* * when checking an average file decimal numbers must contain exactly three places to the right side of the dot.
* * when checking a standard error file decimal numbers must contain six or greater places to the right side of the dot.
* * there must be a number to the left side of the dot (e.g. 0.55555 is allowed but .55555 is not).
* check line endings to make sure they are Unix and not DOS
* check strain headers against a source of truth (see strains.csv): get the values from 'Name' and 'Name2' fields

## Questions Awaiting Feedback

* Arthur
* jgart

The following questions require some feedback on your part for further clarity on the requirements.

Please just add the answer below the question.

#### Question 01

In the requirement

* no data cells with spurious characters like `eeeee`, `5.555iloveguix`, etc...

I see us encountering an issue with that requirement, if the first field is ever anything other than a number. For now, the first field is a *ProbeSet ID* which is numerical. If a field is ever, say, something like *Publish ID*, which can take a form like `ILM304582` then this assumption that all fields are numerical would break, and the application would be doing the wrong thing.
Is there a possibility for the first field ever changing?

#### Question 02

The requirement

* check line endings to make sure they are Unix and not DOS

seems a little unnecessary if the files are not used for anything else. Most programming languages these days have facilities for translating the line endings appropriately, and so, we really should not add the manual cognitive overhead to the users, unless it is an absolute necessity, and even then, we will probably be doing something wrong. Is this requirement absolutely necessary?


#### Question 03

Can there be zero values, i.e.

* "0.000", "00.000", ... etc. for average files
* "0.000000, "00.00000000", ... etc. for standard error files

in the files?


## Update 2022-06-23

The rewrite of the data verification part is mostly done.

@Arthur hints that the next step is the quality-control proper, where, if the file(s) pass the verification stage, the user can attach some extra information to the data, such as:
* Species
* Group
* Platform
* Study ID
etc. that will be used when adding the data to the database, to link it properly.

### Answers to Questions

#### Question 01

The first field will be treated as text, and will not undergo any verification

#### Question 02

The line-endings will be handled when the data is being entered into the database. This means there is no need to handle it in this application.

#### Question 03

There can be zero values, and they have been handled in the code.