diff options
Diffstat (limited to 'docs/dev/quality_assurance_on_csv_files.md')
-rw-r--r-- | docs/dev/quality_assurance_on_csv_files.md | 52 |
1 files changed, 52 insertions, 0 deletions
diff --git a/docs/dev/quality_assurance_on_csv_files.md b/docs/dev/quality_assurance_on_csv_files.md new file mode 100644 index 0000000..02d63c9 --- /dev/null +++ b/docs/dev/quality_assurance_on_csv_files.md @@ -0,0 +1,52 @@ +# Quality Assurance/Control on CSV Files + +## Abbreviations + +- CSV files: Character-separated-values files — these are data files structured in a table-like format, with a specific character chosen as the column/field separator. The comma (,) is the most common field separator used by most csv files. It is, however, possible to encounter files with other characters separating the values. + +## General Pattern + +A general pattern has emerged when performing quality assurance on the data in +CSV files — the pseudocode below shows the general pattern: + +```python +def qc_function(filepath, …): + open(filepath, …) + + headers = read_first_line(…) + perform_qc_on_headings(headers, …) + + for each subsequent line in file: + perform_qc_on_first_column(line, …) + + for each subsequent field in line: + perform_qc_on_field(field, …) +``` + +We want to list the errors found in each file, so it makes sense for the `perform_qc_on*` functions in the pseudocode above to return the list of errors found for each file. + +The actual quality assurance done on the headers, first column of data rows, and the fields can differ from one type of file to the next, but the structure remains relatively unchanged. + +This implies we could make use of a higher-order function that contains the general structure with the actual qc steps passed in as functions that are called in the higher-order structuring function. This gives something like: + +```python +def qc_function(filepath, headers_qc, first_column_qc, data_qc, …): + for line in file: + if line is a comment line: + skip line and continue iteration + if line is first non-comment line: + line is the header line + call headers_qc on fields in this line + if line is not first non-comment line: + line is data line + call first_column_qc on first field of line + call data_qc on each of the subsequent fields of the line + + collect and return errors +``` + +## Improvements + +- Read the file in a separate generator function +- Parallelize QC if many files are present +- Add logging/output for user update (how do we do this correctly?) |