1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
|
# Rewrite qc and qc-uploads in Python3
## Tags
* type: rewrite
* priority: high
* assigned: fredm
* status: in progress
* keywords: quality control
## Description
Since the quality control application will mostly be maintained outside active GeneNetwork development, and might actually be handed off to other maintainers, there is a need for it to be in an "accessible" language, so that it is easy to hand it off. This rewrite was therefore found to be necessary.
The original QC app(s) were developed by
* jgart
and were written in Common-Lisp. The two applications are:
=> https://git.genenetwork.org/jgart/qc QC library
=> https://git.genenetwork.org/jgart/qc-uploads QC App Front-end
In this document, the discussions of what is necessary to get the application in an acceptable state will be detailed and discussions to get there will also be included.
Some other related issues for consideration, and superseded by this issue are:
=> /issues/quality-control/enumerate-all-qc-checks Enumeration of QC checks for all platforms to test against
=> /issues/quality-control/gene-symbols Mappings between representations of symbol names
=> /issues/quality-control/qc-checks Some checks to run against files
=> /issues/quality-control/qc Notes on original QC App ideas
=> /issues/quality-control/ui-design Some UI design notes on the QC app
### Requirements
* The first row contains the headings, and determines the number of columns
* Each heading in the first row MUST appear in the first row ONLY ONE time
* no empty data cells
* no data cells with spurious characters like `eeeee`, `5.555iloveguix`, etc...
* decimal numbers must conform to the following criteria:
* * when checking an average file decimal numbers must contain exactly three places to the right side of the dot.
* * when checking a standard error file decimal numbers must contain six or greater places to the right side of the dot.
* * there must be a number to the left side of the dot (e.g. 0.55555 is allowed but .55555 is not).
* check line endings to make sure they are Unix and not DOS
* check strain headers against a source of truth (see strains.csv): get the values from 'Name' and 'Name2' fields
## Questions Awaiting Feedback
* Arthur
* jgart
The following questions require some feedback on your part for further clarity on the requirements.
Please just add the answer below the question.
#### Question 01
In the requirement
* no data cells with spurious characters like `eeeee`, `5.555iloveguix`, etc...
I see us encountering an issue with that requirement, if the first field is ever anything other than a number. For now, the first field is a *ProbeSet ID* which is numerical. If a field is ever, say, something like *Publish ID*, which can take a form like `ILM304582` then this assumption that all fields are numerical would break, and the application would be doing the wrong thing.
Is there a possibility for the first field ever changing?
#### Question 02
The requirement
* check line endings to make sure they are Unix and not DOS
seems a little unnecessary if the files are not used for anything else. Most programming languages these days have facilities for translating the line endings appropriately, and so, we really should not add the manual cognitive overhead to the users, unless it is an absolute necessity, and even then, we will probably be doing something wrong. Is this requirement absolutely necessary?
#### Question 03
Can there be zero values, i.e.
* "0.000", "00.000", ... etc. for average files
* "0.000000, "00.00000000", ... etc. for standard error files
in the files?
## Update 2022-06-23
The rewrite of the data verification part is mostly done.
@Arthur hints that the next step is the quality-control proper, where, if the file(s) pass the verification stage, the user can attach some extra information to the data, such as:
* Species
* Group
* Platform
* Study ID
etc. that will be used when adding the data to the database, to link it properly.
### Answers to Questions
#### Question 01
The first field will be treated as text, and will not undergo any verification
#### Question 02
The line-endings will be handled when the data is being entered into the database. This means there is no need to handle it in this application.
#### Question 03
There can be zero values, and they have been handled in the code.
|