# Modelling Phenotype Data

* assigned: robw, bonfacem
* tags: critical
* contact: pjotrp

## Introduction

Consider the following columns from our Phenotype table:

* Pre_publication_description
* Post_publication_description
* Original_description
* Pre_publication_abbreviation
* Post_publication_abbreviation

Ideally, all traits in GeneNetwork would have pre- and post-publication descriptions and abbreviations upon initial data entry.  This, however, is not the case.

Also, the pre- and post-publication descriptions are not always the same, as evidenced by:

```
MariaDB [db_webqtl]> SELECT COUNT(*) FROM Phenotype where Pre_publication_description != Post_publication_description AND Post_publication_description IS NOT NULL AND  Pre_publication_description IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
|     4684 |
+----------+
1 row in set (0.03 sec)
```
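
The query above only counts rows where both descriptions are present and differ. A breakdown that also counts missing values may help size the problem. The following is a minimal sketch against a toy in-memory SQLite table, not the real `db_webqtl` MariaDB database; the helper names `make_toy_db` and `count_description_states` and the sample rows are assumptions made for illustration.

```python
import sqlite3


def make_toy_db():
    """Build an in-memory table with made-up rows (not real GeneNetwork data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Phenotype ("
        "Pre_publication_description TEXT, "
        "Post_publication_description TEXT)"
    )
    conn.executemany(
        "INSERT INTO Phenotype VALUES (?, ?)",
        [
            ("a", "a"),    # both present, identical
            ("a", "b"),    # both present, different
            (None, "b"),   # pre missing
            ("a", None),   # post missing
            (None, None),  # both missing
        ],
    )
    return conn


def count_description_states(conn):
    """Count missing and differing descriptions in one pass."""
    # Comparing with NULL yields NULL, which SUM ignores, so the last
    # column only counts rows where both values are present and differ.
    row = conn.execute(
        "SELECT "
        "SUM(Pre_publication_description IS NULL), "
        "SUM(Post_publication_description IS NULL), "
        "SUM(Pre_publication_description != Post_publication_description) "
        "FROM Phenotype"
    ).fetchone()
    keys = ("missing_pre", "missing_post", "differing")
    # SUM returns NULL on an empty table; report 0 in that case.
    return {key: (value or 0) for key, value in zip(keys, row)}


if __name__ == "__main__":
    print(count_description_states(make_toy_db()))
```

Because a comparison involving NULL is never true, the explicit `IS NOT NULL` checks in the original query do not change its result; they only make the intent explicit.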

Pre-publication descriptions/abbreviations are shown until a PMID is attached.  However, many users forget to attach the PMID after the paper has been published.  Regardless, many traits in GN are never published, and their value depends on the full "post" description.
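
The display rule described above can be sketched as follows. This is an illustration only, not GeneNetwork's actual code: the function name `displayed_description` is hypothetical, and falling back to the pre-publication text when a PMID exists but the post-publication text is empty is an assumption.

```python
from typing import Optional


def displayed_description(
    pre: Optional[str], post: Optional[str], pmid: Optional[int]
) -> Optional[str]:
    """Pick the description to show for a trait (hypothetical helper)."""
    if pmid is None:
        # No PMID attached yet: show the pre-publication text.
        return pre
    # PMID attached: prefer the post-publication text.
    # Assumption: fall back to the pre-publication text if post is empty.
    return post or pre


if __name__ == "__main__":
    print(displayed_description("short blurb", "full description", None))
    print(displayed_description("short blurb", "full description", 12345))
```

Under this rule, a trait whose PMID is never attached keeps showing the pre-publication text indefinitely, which is the failure mode noted above.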

After the RDF work, we should explore linking pre-prints with their canonical publications to avoid duplication.

## Meeting Agenda

Date: TBA

* How do we handle private/public data and metadata?  Data are the vectors of numbers; metadata include the pre- and post-publication descriptions and abbreviations.

* Given the above problem, what's the FAIR way to go about it?  How do we enable data sharing in a way that encourages even the paranoid?