---
title: "using_tidyverse"
author: "Pjotr"
date: "03/03/2020"
output: html_document
---

```{r setup, include=FALSE}
# The following sets up the working directory for
# the data files. Make sure to amend it to your
# setup
data_dir <- "/home/wrk/iwrk/closed/kemri/Francis_Final_TregData_Jan2020/Data/"
setwd(data_dir)
# echo is a chunk option, root.dir is a knit option, so set them separately
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = data_dir)
```

## Using R Tidyverse

The new (and hot) way of analysing data with R is the tidyverse (https://www.tidyverse.org/), which includes the online book 'R for Data Science'. Please check it out!

Instead of data frames we now use tibbles. First import the data using the File menu in RStudio, making sure that FACTORS is off and that tables and columns show. This is the old way (remember?) of plotting the individual attributes data frame:

```{r}
ind_attr = read.csv("data-202002/Individual_attributes.csv")
plot(ind_attr$ELISA ~ ind_attr$Time_to_diagnosis)
```

We want to turn `ind_attr` into a tibble with

```{r}
library(tidyverse)
tb = as_tibble(ind_attr)
tb
```

Now we try the new way of plotting, using ggplot and the tibble:

```{r}
ggplot(data = tb) + geom_point(mapping = aes(y = ELISA, x = Time_to_diagnosis))
```

which shows that all high [ELISA](https://en.wikipedia.org/wiki/ELISA) values occur only with late diagnosis. ELISA uses a solid-phase enzyme immunoassay (EIA) to detect the presence of a ligand (commonly a protein) in a liquid sample, using antibodies directed against the protein to be measured.

## Correlation

Let's try a simple correlation. This site has some
interesting [ideas](https://paulvanderlaken.com/2018/09/10/simpler-correlation-analysis-in-r-using-tidyverse-priciples/) which we may visit later.
Let's correlate using the pipes from dplyr:

```{r}
library(dplyr)
library(Hmisc)

# rcorr() expects a numeric matrix, so pick the three columns and convert
tb %>%
  select(Time_to_diagnosis, ELISA, Age) %>%
  as.matrix() %>%
  rcorr()
```

From this it is clear that correlations between time to diagnosis, age and ELISA are low.

## Schizont

The schizont column appears to be comparable to a log-transformed ELISA(?). Let's check that:

```{r}
data <- read_csv("data-202003/final_chmi_covariates.csv")
```

Now it should be a native tibble. Let's plot:

```{r}
ggplot(data = data) + geom_point(mapping = aes(y = schizont, x = time_to_diagnosis_last_PCR))
```

which does look similar to

```{r}
ggplot(data = tb) + geom_point(mapping = aes(y = log(ELISA), x = Time_to_diagnosis))
```

Let's do some colouring. There are three locations.

```{r}
ggplot(data = data) + geom_point(mapping = aes(y = schizont, x = time_to_diagnosis_last_PCR, color = location))
```

You can tell the schizont load is higher for Kilifi South, though there does not appear to be a clear relationship between it and time to diagnosis (other than that all high values are beyond 22 days). Let's try a model:

```{r}
ggplot(data = data) +
  geom_point(mapping = aes(y = schizont, x = time_to_diagnosis_last_PCR, colour = location)) +
  geom_smooth(mapping = aes(time_to_diagnosis_last_PCR, schizont), se = FALSE)
```

Now there is a trend line against the number of days. Very nice. Let's see if we can split by location:

```{r}
ggplot(data = data, aes(time_to_diagnosis_last_PCR, schizont, colour = location)) + geom_point() + geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), se = FALSE)
```

The anti-schizont antibody is measured before infection. Diagnosis is based on PCR. Based on this
figure the schizont antibody is typically lower in Kilifi North, which points at less malaria activity/infections.
When schizont antibody is higher, time to diagnosis is later. This is true for all areas.

Gender shows no real effect:

```{r}
# the smoothing formula must relate y to x (it was y ~ y)
ggplot(data = data, aes(time_to_diagnosis_last_PCR, schizont, colour = gender)) + geom_point() + geom_smooth(method = lm, formula = y ~ x, se = FALSE)
```

Nor does the genotype:

```{r}
ggplot(data = data, aes(time_to_diagnosis_last_PCR, schizont, colour = thal_genotype)) + geom_point() + geom_smooth(method = lm, formula = y ~ x, se = FALSE)
```

Going by phenotype:

```{r}
ggplot(data = data, aes(time_to_diagnosis_last_PCR, schizont, colour = phenotypes_all)) + geom_point() + geom_smooth(method = lm, formula = y ~ x, se = FALSE)
```

suggests a difference in starting schizont count between suscept and febrile. Chronic and PCR-ve sit solidly in late diagnosis.
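
The plots above suggest differences between groups, and it can help to put some numbers next to them. Below is a minimal sketch of how that might look with dplyr. Note that the data here is synthetic (made up, so the block runs without the CSV) and the phenotype labels are a guess at how they are spelled in the file. Only the column names `phenotypes_all`, `schizont` and `time_to_diagnosis_last_PCR` come from the analysis above; `toy` and `by_phenotype` are names introduced for illustration.

```r
library(dplyr)
library(tibble)

# Synthetic stand-in for final_chmi_covariates.csv: all values are invented
# and the group labels are assumptions based on the text above.
set.seed(1)
toy <- tibble(
  phenotypes_all = rep(c("suscept", "febrile", "chronic", "PCR-ve"), each = 10),
  schizont = rnorm(40, mean = rep(c(1, 2, 3, 3.5), each = 10)),
  time_to_diagnosis_last_PCR = rep(c(9, 11, 18, 22), each = 10) +
    sample(0:3, 40, replace = TRUE)
)

# One row per phenotype: group size, medians, and the within-group
# Spearman correlation between schizont and time to diagnosis.
by_phenotype <- toy %>%
  group_by(phenotypes_all) %>%
  summarise(
    n = n(),
    median_schizont = median(schizont),
    median_days = median(time_to_diagnosis_last_PCR),
    spearman = cor(schizont, time_to_diagnosis_last_PCR, method = "spearman"),
    .groups = "drop"
  )

by_phenotype
```

To try this on the real data, replace `toy` with `data` and check the actual labels first, for example with `count(data, phenotypes_all)`. Swapping `phenotypes_all` for `location`, `gender` or `thal_genotype` gives the same summary for the other groupings plotted above.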