---
title: "using_tidyverse"
author: "Pjotr"
date: "03/03/2020"
output: html_document
---
```{r setup, include=FALSE}
# The following sets up the working directory for
# the data files. Make sure to amend it to your
# setup
data_dir <- "/home/wrk/iwrk/closed/kemri/Francis_Final_TregData_Jan2020/Data/"
setwd(data_dir)
# echo is a chunk option, root.dir is a knit option
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = data_dir)
```
## Using R Tidyverse
The new (and hot) way of analysing data with R is the [tidyverse](https://www.tidyverse.org/), which comes with the online book 'R for Data Science'. Please check it out!
Instead of data frames we now use tibbles. First import the data using the File menu in RStudio, and make sure FACTORS is off and that tables and columns show. This is the old way (remember) of plotting the Individual_attributes data frame:
```{r}
ind_attr=read.csv("data-202002/Individual_attributes.csv")
plot(ind_attr$ELISA ~ ind_attr$Time_to_diagnosis)
```
We want to turn `ind_attr` into a tibble with
```{r}
library(tidyverse)
tb = as_tibble(ind_attr)
tb
```
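To see what actually changes when a data frame becomes a tibble, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical and are not taken from the study files). It shows one practical difference: selecting a single column with `[ , ]` drops a data frame to a plain vector, while a tibble stays a tibble.

```r
library(tibble)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  Time_to_diagnosis = c(10, 15, 22),
  ELISA = c(0.2, 0.5, 3.1)
)
toy_tb <- as_tibble(toy)

# A data frame drops to a numeric vector when one column is selected
class(toy[, "ELISA"])

# A tibble keeps its tibble class for the same selection
class(toy_tb[, "ELISA"])
```

This consistency is one reason tibbles tend to behave more predictably inside pipelines and functions.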
Now we try the new way of plotting, using ggplot and the tibble:
```{r}
ggplot(data = tb) + geom_point(mapping = aes(y=ELISA, x = Time_to_diagnosis))
```
which shows that all high [ELISA](https://en.wikipedia.org/wiki/ELISA) values occur only for late diagnosis. ELISA is a solid-phase enzyme immunoassay (EIA) that detects the presence of a ligand (commonly a protein) in a liquid sample using antibodies directed against the protein to be measured.
## Correlation
Let's try a simple correlation. This site has some
interesting [ideas](https://paulvanderlaken.com/2018/09/10/simpler-correlation-analysis-in-r-using-tidyverse-priciples/) which we may visit later. Let's correlate
using the pipes from dplyr:
```{r}
library(dplyr)
library(Hmisc)
# Select the three numeric columns and pass them to rcorr as a matrix
tb %>%
  dplyr::select(Time_to_diagnosis, ELISA, Age) %>%
  as.matrix() %>%
  rcorr()
```
From this it is clear that correlations between time to diagnosis, age and ELISA are low.
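To look at a single pair of variables in more detail, base R's `cor.test` also reports a confidence interval. The sketch below runs on simulated data (the `sim` data frame is made up; swap in `tb` to use the real columns). It also tries Spearman's rank correlation, which may be the safer choice if ELISA turns out to be strongly skewed.

```r
# Simulated stand-in for tb; values are random, not study data
set.seed(1)
sim <- data.frame(
  Time_to_diagnosis = sample(8:22, 40, replace = TRUE),
  ELISA = rexp(40),
  Age = sample(18:45, 40, replace = TRUE)
)

# Pearson correlation with its 95% confidence interval
pearson <- cor.test(sim$Time_to_diagnosis, sim$ELISA)
pearson$estimate
pearson$conf.int

# Rank-based alternative; exact = FALSE avoids the warning about ties
spearman <- cor.test(sim$Time_to_diagnosis, sim$ELISA,
                     method = "spearman", exact = FALSE)
spearman$estimate
```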
## Schizont
The schizont column appears to correspond to a log-transformed ELISA(?). Let's check that:
```{r}
data <- read_csv("data-202003/final_chmi_covariates.csv")
```
Now it should be a native tibble. Let's plot:
```{r}
ggplot(data = data) + geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR))
```
which does look similar to
```{r}
ggplot(data = tb) + geom_point(mapping = aes(y=log(ELISA), x = Time_to_diagnosis))
```
Let's do some colouring. There are three locations.
```{r}
ggplot(data = data) + geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR, color=location))
```
You can tell the schizont load is higher for Kilifi South, though there does not appear to be a clear relationship with time to diagnosis (other than that all high values are beyond 22 days). Let's try a model:
```{r}
ggplot(data = data) +
  geom_point(mapping = aes(y = schizont, x = time_to_diagnosis_last_PCR, colour = location)) +
  geom_smooth(mapping = aes(time_to_diagnosis_last_PCR, schizont), se = FALSE)
```
Now there is a trend line against the number of days. Very nice. Let's see if we can split by location:
```{r}
ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=location)) + geom_point() + geom_smooth(method=lm,formula = y ~ splines::bs(x,3),se=FALSE)
```
The anti-schizont antibody is measured before infection, and diagnosis is based on PCR. Based on this figure, the schizont antibody is typically lower in Kilifi North, which points to less malaria activity/infections. When schizont antibody is higher, time to diagnosis is later. This is true for all areas.
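One way to put numbers on "true for all areas" is a linear model with a separate intercept and slope per location. The following is only a sketch on simulated data: the `sim` values are random and the location labels `A`, `B` and `C` are placeholders, not the real site names. With the real table, `data` would take the place of `sim`.

```r
# Simulated stand-in for the covariates table; not study data
set.seed(2)
sim <- data.frame(
  location = rep(c("A", "B", "C"), each = 20),
  time_to_diagnosis_last_PCR = runif(60, 8, 22)
)
sim$schizont <- 0.1 * sim$time_to_diagnosis_last_PCR + rnorm(60)

# "0 +" removes the shared intercept, so each location gets
# its own intercept and its own slope against time to diagnosis
fit <- lm(schizont ~ 0 + location + location:time_to_diagnosis_last_PCR,
          data = sim)
coef(fit)
```

Slopes with the same sign across locations would support the reading of the plot, though with few points per group the estimates can be quite uncertain.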
Gender shows no real effect
```{r}
ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=gender)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
```
Nor does the genotype
```{r}
ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=thal_genotype)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
```
Going by phenotype:
```{r}
ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=phenotypes_all)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
```
suggests a difference in starting schizont count between suscept and febrile. Chronic and PCR-ve are solidly in late diagnosis.
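A difference between phenotype groups can be checked more formally with group medians and a Kruskal-Wallis test. This is a sketch on simulated data only: the `sim` values are random, and the four phenotype labels are assumed to match the levels of `phenotypes_all`, which may be coded differently in the real file.

```r
# Simulated stand-in; group labels are assumed, values are random
set.seed(3)
sim <- data.frame(
  phenotypes_all = rep(c("suscept", "febrile", "chronic", "PCR-ve"), each = 15),
  schizont = rnorm(60)
)

# Median schizont value per phenotype group
aggregate(schizont ~ phenotypes_all, data = sim, FUN = median)

# Rank-based test for any difference between the groups
kt <- kruskal.test(schizont ~ phenotypes_all, data = sim)
kt$p.value
```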