---
title: "using_tidyverse"
author: "Pjotr"
date: "03/03/2020"
output: html_document
---

```{r setup, include=FALSE}
# The following sets up the working directory for
# the data files. Make sure to amend it to your
# setup
data_dir <- "/home/wrk/iwrk/closed/kemri/Francis_Final_TregData_Jan2020/Data/"
setwd(data_dir)
# echo is a chunk option, root.dir is a knit (package) option
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = data_dir)
```

## Using R Tidyverse

The new (and hot) way of analysing data with R is the [tidyverse](https://www.tidyverse.org/), whose website also points to the online book 'R for Data Science'. Please check it out!

Instead of data frames we now use tibbles. First import the data using the File menu in RStudio, and make sure FACTORS is off and that tables and columns show. This is the old way (remember) of plotting the individual attributes data frame:

```{r}
ind_attr=read.csv("data-202002/Individual_attributes.csv")
plot(ind_attr$ELISA ~ ind_attr$Time_to_diagnosis)
```
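
The same import setting can also be written in code. A minimal sketch, assuming a small made-up CSV (the `id` and `site` columns and all values are hypothetical, not taken from the study files), showing how `stringsAsFactors` decides whether text columns become factors:

```{r}
# Made-up CSV text for illustration only; not the study data
csv_text <- "id,site,ELISA\n1,north,0.5\n2,south,1.2\n3,north,0.9"

# With stringsAsFactors = TRUE text columns are converted to factors
with_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
# With stringsAsFactors = FALSE they stay character vectors
without_factors <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(with_factors$site)
class(without_factors$site)
```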

We want to turn `ind_attr` into a tibble with:

```{r}
library(tidyverse)
tb = as_tibble(ind_attr)
tb
```
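
To see what changes with a tibble, here is a small sketch on a made-up data frame (the `toy_df` and `toy_tb` names and their values are hypothetical). It illustrates one documented difference: selecting a single column with `[` drops a data frame to a plain vector, while a tibble stays a tibble. The column contents themselves are not altered by the conversion.

```{r}
library(tibble)

# Hypothetical values for illustration only
toy_df <- data.frame(ELISA = c(0.5, 1.2, 0.9), Age = c(21, 34, 28))
toy_tb <- as_tibble(toy_df)

# A data frame drops to a plain vector when one column is selected
class(toy_df[, "ELISA"])
# A tibble stays a tibble
class(toy_tb[, "ELISA"])
# The values are unchanged by the conversion
identical(toy_tb$ELISA, toy_df$ELISA)
```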

Now we try the new way of plotting, using ggplot and the tibble:

```{r}
ggplot(data = tb) + geom_point(mapping = aes(y=ELISA, x = Time_to_diagnosis))
```

This shows that all high [ELISA](https://en.wikipedia.org/wiki/ELISA) values occur only with late diagnosis. ELISA uses a solid-phase enzyme immunoassay (EIA) to detect the presence of a ligand (commonly a protein) in a liquid sample, using antibodies directed against the protein to be measured.

## Correlation

Let's try a simple correlation. This site has some interesting [ideas](https://paulvanderlaken.com/2018/09/10/simpler-correlation-analysis-in-r-using-tidyverse-priciples/) which we may visit later. Let's correlate using the pipes from dplyr:

```{r}
library(dplyr)
library(Hmisc)

# Select the three columns with a dplyr pipe; rcorr expects a numeric matrix
cs = tb %>%
  dplyr::select(Time_to_diagnosis, ELISA, Age) %>%
  as.matrix()
rcorr(cs)
```
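
`rcorr` returns a list rather than a single table. As a sketch on made-up numbers (the `toy` matrix below is hypothetical and unrelated to the study data), the coefficient matrix `r`, the p-value matrix `P` and the pair counts `n` can be pulled out by name:

```{r}
library(Hmisc)

# Hypothetical values for illustration only
toy <- cbind(
  days  = c(10, 12, 15, 18, 21, 22, 25),
  elisa = c(0.4, 0.9, 0.3, 1.1, 0.8, 1.6, 1.2),
  age   = c(23, 31, 27, 45, 38, 29, 33)
)

res <- rcorr(toy, type = "pearson")
round(res$r, 2)   # correlation coefficients
round(res$P, 3)   # p-values (NA on the diagonal)
res$n             # number of observations used per pair
```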

For the study data, the `rcorr` output shows that the correlations between time to diagnosis, age and ELISA are low.

## Schizont

The schizont column appears comparable to a log-transformed ELISA (?). Let's check that:

```{r}
data <- read_csv("data-202003/final_chmi_covariates.csv")
```

Now it should be a native tibble. Let's plot:

```{r}
ggplot(data = data) + geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR))
```

which does look similar to

```{r}
ggplot(data = tb) + geom_point(mapping = aes(y=log(ELISA), x = Time_to_diagnosis))
```

Let's do some colouring. There are three locations.

```{r}
ggplot(data = data) + geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR, color=location))
```

You can tell the schizont level is higher for Kilifi South, though there does not appear to be a clear relationship with time to diagnosis (other than that all high values are beyond 22 days). Let's try a model:

```{r}
ggplot(data = data) +
  geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR, colour=location)) +
  geom_smooth(mapping = aes(time_to_diagnosis_last_PCR,schizont), se=FALSE)
```

Now there is a trend line against the number of days. Very nice. Let's see if we can split by location:

```{r}
ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=location)) + geom_point() + geom_smooth(method=lm,formula = y ~ splines::bs(x,3),se=FALSE)
```
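
In the chunk above, `splines::bs(x, 3)` passes 3 as `df`, the second argument of `bs`. With the default cubic degree that gives a three-column basis with no interior knots, so each location gets a single cubic curve fitted by `lm`. A minimal sketch on made-up numbers (the `x` and `y` vectors are hypothetical) that fits the same kind of model outside ggplot:

```{r}
library(splines)

# Hypothetical data for illustration only
x <- c(8, 10, 12, 14, 16, 18, 20, 22, 24, 26)
y <- c(1.1, 1.4, 1.2, 1.9, 2.3, 2.2, 2.9, 3.4, 3.1, 3.8)

basis <- bs(x, 3)
dim(basis)               # one row per observation, three basis columns
attr(basis, "knots")     # empty: no interior knots when df equals the degree

fit <- lm(y ~ bs(x, 3))
length(coef(fit))        # intercept plus three spline coefficients
predict(fit, newdata = data.frame(x = c(15, 21)))
```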

The anti-schizont antibody is measured before infection. Diagnosis is based on PCR. Based on the figure split by location, the schizont antibody is typically lower in Kilifi North, which points at less malaria activity/infections. When schizont antibody is higher, time to diagnosis is later. This is true for all areas.

Gender shows no real effect:

```{r}
ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=gender)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
```

Nor does the genotype:

```{r}
ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=thal_genotype)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
```

Going by phenotype:

```{r}
ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=phenotypes_all)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
```
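
A numeric summary per group can back up what a coloured plot shows. A sketch using dplyr on a made-up tibble (the `toy` and `by_pheno` objects and all values are hypothetical; only the column names mirror the ones used above, and the group labels are placeholders):

```{r}
library(dplyr)

# Hypothetical values for illustration only
toy <- tibble(
  phenotypes_all = c("suscept", "suscept", "febrile", "febrile", "chronic", "chronic"),
  schizont = c(0.8, 1.0, 1.6, 1.8, 2.4, 2.6),
  time_to_diagnosis_last_PCR = c(9, 11, 13, 15, 22, 22)
)

# dplyr::summarise is spelled out because Hmisc masks summarize
by_pheno <- toy %>%
  group_by(phenotypes_all) %>%
  dplyr::summarise(
    n = n(),
    mean_schizont = mean(schizont),
    median_days = median(time_to_diagnosis_last_PCR)
  )
by_pheno
```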

The phenotype plot suggests a difference in starting schizont values between suscept and febrile. Chronic and PCR-ve are solidly in the late-diagnosis range.