|
- ---
- title: "using_tidyverse"
- author: "Pjotr"
- date: "03/03/2020"
- output: html_document
- ---
-
- ```{r setup, include=FALSE}
- # The following sets up the working directory for
- # the data files. Make sure to amend it to your
- # setup
- data_dir <- "/home/wrk/iwrk/closed/kemri/Francis_Final_TregData_Jan2020/Data/"
- setwd(data_dir)
- # `echo` is a chunk option; `root.dir` is a knit option
- knitr::opts_chunk$set(echo = TRUE)
- knitr::opts_knit$set(root.dir = data_dir)
- ```
-
- ## Using R Tidyverse
-
- The new (and hot) way of analysing data with R is the [tidyverse](https://www.tidyverse.org/), whose website also points to the online book 'R for Data Science'. Please check it out!
-
- Instead of data frames we now use tibbles. First import the data using the File menu in RStudio, making sure that conversion of strings to factors is switched off and that the table and its columns show up. For comparison, this is the old way (remember) of plotting the Individual_attributes data frame:
-
- ```{r}
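- # Read the CSV with base R, keeping strings as character (no factors)
- ind_attr <- read.csv("data-202002/Individual_attributes.csv", stringsAsFactors = FALSE)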
- plot(ind_attr$ELISA ~ ind_attr$Time_to_diagnosis)
- ```
-
- We want to turn `ind_attr` into a tibble with:
-
- ```{r}
- library(tidyverse)
- tb = as_tibble(ind_attr)
- tb
- ```
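-
- As a small aside on why tibbles are convenient, here is a minimal sketch using made-up toy values (not the study data; `df_toy` and `tb_toy` are hypothetical names). It shows one practical difference: selecting a single column with `[` keeps a tibble a tibble, whereas a base data frame drops to a plain vector.
-
- ```{r}
- library(tibble)
-
- # Toy values invented for illustration only, not taken from the study data
- df_toy <- data.frame(ELISA = c(0.2, 1.1, 2.3), Age = c(21, 45, 30))
- tb_toy <- as_tibble(df_toy)
-
- # Base data frame: single-column `[` drops to a numeric vector
- class(df_toy[, "ELISA"])
- # Tibble: the same subsetting still returns a tibble
- class(tb_toy[, "ELISA"])
- ```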
-
- Now we try the new way of plotting, using ggplot and the tibble:
-
- ```{r}
- ggplot(data = tb) + geom_point(mapping = aes(y=ELISA, x = Time_to_diagnosis))
- ```
-
- which shows that all high [ELISA](https://en.wikipedia.org/wiki/ELISA) values occur only with late diagnosis. ELISA is a solid-phase enzyme immunoassay (EIA) that detects the presence of a ligand (commonly a protein) in a liquid sample, using antibodies directed against the protein to be measured.
-
- ## Correlation
-
- Let's try a simple correlation. This site has some
- interesting [ideas](https://paulvanderlaken.com/2018/09/10/simpler-correlation-analysis-in-r-using-tidyverse-priciples/) on tidyverse-style correlation analysis, which we may visit later. For now, let's bind
- the columns into a matrix and correlate them with `rcorr` from Hmisc:
-
- ```{r}
- library(dplyr)
- library(Hmisc)
-
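- # Bind the three numeric columns into a named matrix, as rcorr() expects
- cs <- cbind(Time_to_diagnosis = tb$Time_to_diagnosis, ELISA = tb$ELISA, Age = tb$Age)
- # Pearson correlations, with pairwise sample sizes and p-values
- rcorr(cs)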
- ```
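-
- A correlation matrix of this kind can also be written with the dplyr pipe. The following is a minimal sketch on made-up toy values (the numbers and the name `toy` are invented for illustration; with the real data one would start from `tb`). It uses base R `cor()`, which returns the coefficients only, not the p-values that `rcorr` reports.
-
- ```{r}
- library(dplyr)
- library(tibble)
-
- # Toy values invented for illustration only, not taken from the study data
- toy <- tibble(
-   Time_to_diagnosis = c(8, 10, 12, 15, 21, 22),
-   ELISA             = c(0.2, 0.5, 0.4, 1.1, 2.3, 1.9),
-   Age               = c(21, 34, 28, 45, 30, 39)
- )
-
- # Select the numeric columns and pipe them into cor() for pairwise Pearson correlations
- cor_mat <- toy %>%
-   select(Time_to_diagnosis, ELISA, Age) %>%
-   cor(use = "pairwise.complete.obs")
-
- cor_mat
- ```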
-
- From this it is clear that correlations between time to diagnosis, age and ELISA are low.
-
- ## Schizont
-
- The schizont column appears to correspond to a log-transformed ELISA value (still to be confirmed). Let's check that:
-
- ```{r}
- data <- read_csv("data-202003/final_chmi_covariates.csv")
- ```
-
- Because `read_csv` comes from readr, `data` should now be a native tibble. Let's plot:
-
-
- ```{r}
- ggplot(data = data) + geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR))
- ```
-
- which does look similar to
-
- ```{r}
- ggplot(data = tb) + geom_point(mapping = aes(y=log(ELISA), x = Time_to_diagnosis))
- ```
-
- Let's do some colouring. There are three locations.
-
- ```{r}
- ggplot(data = data) + geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR, color=location))
- ```
-
- You can tell the schizont load is higher for Kilifi South, though there does not appear to be a clear relationship with time to diagnosis (other than that all high values are beyond 22 days). Let's try fitting a model:
-
-
- ```{r}
- ggplot(data = data) +
-   geom_point(mapping = aes(x = time_to_diagnosis_last_PCR, y = schizont, colour = location)) +
-   geom_smooth(mapping = aes(x = time_to_diagnosis_last_PCR, y = schizont), se = FALSE)
- ```
-
- Now there is a trend line against the number of days. Very nice. Let's see if we can split it by location:
-
- ```{r}
- ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=location)) + geom_point() + geom_smooth(method=lm,formula = y ~ splines::bs(x,3),se=FALSE)
- ```
-
- The anti-schizont antibody is measured before infection. Diagnosis is based on PCR. Based on this
- figure the schizont antibody is typically lower in Kilifi North, which points to less malaria activity/infections there. When the schizont antibody is higher, time to diagnosis is later. This is true for all areas.
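-
- One way to check the "true for all areas" impression numerically is a rank correlation per location. This is a minimal sketch on made-up toy values (the site labels, the numbers and the name `toy` are invented; only the column names follow the real data). Spearman's rho is an assumption chosen here because the smoothed curves are not straight lines. With the real data one would start from `data` instead of `toy`.
-
- ```{r}
- library(dplyr)
- library(tibble)
-
- # Toy values invented for illustration only, not taken from the study data
- toy <- tibble(
-   location = rep(c("site_A", "site_B"), each = 4),
-   schizont = c(1, 2, 3, 4, 2, 4, 6, 9),
-   time_to_diagnosis_last_PCR = c(9, 11, 14, 22, 10, 12, 15, 22)
- )
-
- # One row per location: group size and Spearman rank correlation
- by_location <- toy %>%
-   group_by(location) %>%
-   summarise(
-     n = n(),
-     spearman_rho = cor(schizont, time_to_diagnosis_last_PCR,
-                        method = "spearman", use = "complete.obs"),
-     .groups = "drop"
-   )
-
- by_location
- ```
-
- A positive rho in every row would agree with the reading of the figure; values near zero would not.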
-
-
- Gender shows no real effect:
-
- ```{r}
- ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=gender)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
- ```
-
- Nor does the genotype:
-
- ```{r}
- ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=thal_genotype)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
- ```
-
- Going by phenotype:
-
- ```{r}
- ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=phenotypes_all)) + geom_point() + geom_smooth(method=lm,formula = y ~ x,se=FALSE)
- ```
-
- suggests a difference in starting schizont count between the suscept and febrile phenotypes. The chronic and PCR-ve phenotypes sit solidly in the late-diagnosis range.
-
|