Document emergent general QC structure for CSV files.

author: Frederick Muriuki Muriithi 2024-10-22 16:26:09 -0500
committer: Frederick Muriuki Muriithi 2024-10-22 16:26:09 -0500
commit: b09ec88789c7e54c8ca75c0d68c1adf54870aafc (patch)
tree: da701971c7445aa8a1e2640014f5d05eb07eaa0c /docs/dev
parent: 01dadbc45d3bc4ae184d8b4a5f64c1cc6538b2e9 (diff)
download: gn-uploader-b09ec88789c7e54c8ca75c0d68c1adf54870aafc.tar.gz
1 files changed, 52 insertions, 0 deletions
diff --git a/docs/dev/quality_assurance_on_csv_files.md b/docs/dev/quality_assurance_on_csv_files.md
new file mode 100644
index 0000000..02d63c9
--- /dev/null
+++ b/docs/dev/quality_assurance_on_csv_files.md
@@ -0,0 +1,52 @@
+# Quality Assurance/Control on CSV Files
+
+## Abbreviations
+
+- CSV files: Character-separated-values files. These are data files structured in a table-like format, with a specific character chosen as the column/field separator. The comma (,) is the most common field separator, but files that use other characters, such as tabs, also occur.
+
+## General Pattern
+
+A general pattern has emerged for performing quality assurance on the data in
+CSV files. The pseudocode below shows it:
+
+```python
+def qc_function(filepath, …):
+    open(filepath, …)
+
+    headers = read_first_line(…)
+    perform_qc_on_headings(headers, …)
+
+    for each subsequent line in file:
+        perform_qc_on_first_column(line, …)
+
+        for each subsequent field in line:
+            perform_qc_on_field(field, …)
+```
+
+We want to list the errors found in each file, so it makes sense for the `perform_qc_on*` functions in the pseudocode above to return the errors they find, which `qc_function` can then collect into a single list for the file.
+
+The actual quality assurance done on the headers, first column of data rows, and the fields can differ from one type of file to the next, but the structure remains relatively unchanged.
+
+This suggests a higher-order function that holds the general structure, with the actual QC steps passed in as functions that it calls. This gives something like:
+
+```python
+def qc_function(filepath, headers_qc, first_column_qc, data_qc, …):
+    for line in file:
+        if line is a comment line:
+            skip line and continue iteration
+        if line is first non-comment line:
+            line is the header line
+            call headers_qc on fields in this line
+        if line is not first non-comment line:
+            line is data line
+            call first_column_qc on first field of line
+            call data_qc on each of the subsequent fields of the line
+
+    collect and return errors
+```
+
+## Improvements
+
+- Read the file in a separate generator function
+- Parallelize QC if many files are present
+- Add logging/output to keep the user updated (how do we do this correctly?)
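
The higher-order structure and the generator-based file reading proposed in the document above can be sketched in runnable Python. This is a minimal sketch under stated assumptions, not the project's implementation. The names `read_rows`, `check_headers`, `check_identifier` and `check_number`, the `separator` and `comment_char` parameters, the `#` comment marker, and the convention that every QC function returns a list of error strings are all hypothetical. Quoted fields are not handled.

```python
def read_rows(filepath, separator=",", comment_char="#"):
    """Yield (line_number, fields) for every non-comment, non-blank line.

    Assumption: fields are never quoted, so a plain split is enough.
    """
    with open(filepath, encoding="utf-8") as infile:
        for lineno, line in enumerate(infile, start=1):
            if line.startswith(comment_char) or not line.strip():
                continue
            yield lineno, line.rstrip("\r\n").split(separator)


def qc_function(filepath, headers_qc, first_column_qc, data_qc,
                separator=",", comment_char="#"):
    """Run the supplied QC steps over the file and return all errors found."""
    errors = []
    header_seen = False
    for lineno, fields in read_rows(filepath, separator, comment_char):
        if not header_seen:
            # The first non-comment line is the header line.
            header_seen = True
            errors.extend(headers_qc(lineno, fields))
            continue
        # Every later line is a data line.
        errors.extend(first_column_qc(lineno, fields[0]))
        for colno, field in enumerate(fields[1:], start=2):
            errors.extend(data_qc(lineno, colno, field))
    return errors


# Hypothetical QC steps for one kind of file: unique headers,
# a non-empty identifier in column 1, numeric values elsewhere.
def check_headers(lineno, headers):
    dupes = sorted({h for h in headers if headers.count(h) > 1})
    return [f"line {lineno}: duplicate header '{h}'" for h in dupes]


def check_identifier(lineno, value):
    return [] if value.strip() else [f"line {lineno}: empty identifier"]


def check_number(lineno, colno, value):
    try:
        float(value)
        return []
    except ValueError:
        return [f"line {lineno}, column {colno}: '{value}' is not a number"]
```

A call such as `qc_function(path, check_headers, check_identifier, check_number)` returns one flat list of error strings for the file. A different file type would reuse `qc_function` unchanged and pass in its own three checking functions.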