---
title: "using_tidyverse"
author: "Pjotr"
date: "03/03/2020"
output: html_document
---
```{r setup, include=FALSE}
# The following sets up the working directory for
# the data files. Make sure to amend it to your
# setup
data_dir <- "/home/wrk/iwrk/closed/kemri/Francis_Final_TregData_Jan2020/Data/"
setwd(data_dir)
# echo is a chunk option; root.dir is a knit (package) option
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = data_dir)
```

## Using R Tidyverse

The new (and hot) way of analysing data with R is the tidyverse, https://www.tidyverse.org/, which also points to the online book 'R for Data Science'. Please check it out!

Instead of data frames we now use tibbles. First import the data using the File menu in RStudio, and make sure that the conversion of strings to factors is off and that the table and its columns show. For reference, this is the old way of plotting the `ind_attr` data frame read from Individual_attributes.csv:

```{r}
# stringsAsFactors = FALSE keeps text columns as character (the default since R 4.0)
ind_attr <- read.csv("data-202002/Individual_attributes.csv", stringsAsFactors = FALSE)
plot(ind_attr$ELISA ~ ind_attr$Time_to_diagnosis)
```
We want to turn `ind_attr` into a tibble with:

```{r}
library(tidyverse)
tb <- as_tibble(ind_attr)
tb
```
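
To see what the conversion changes, here is a minimal sketch on a made-up toy data frame (the values and the `id` column are illustrative, not taken from the study files). One practical difference is that single-column subsetting of a tibble keeps the tibble, whereas a data frame drops to a plain vector:

```{r}
library(tibble)

# Toy values for illustration only, not study data
df_toy <- data.frame(id = 1:3, ELISA = c(120.5, 30.2, 980.1))
tb_toy <- as_tibble(df_toy)

class(df_toy[, "ELISA"])  # data frame: drops to a plain numeric vector
class(tb_toy[, "ELISA"])  # tibble: stays a tibble with one column
```
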
Now we try the new way of plotting, using ggplot and the tibble:

```{r}
ggplot(data = tb) + geom_point(mapping = aes(y = ELISA, x = Time_to_diagnosis))
```

which shows that all high [ELISA](https://en.wikipedia.org/wiki/ELISA) values occur only with late diagnosis. ELISA is a solid-phase enzyme immunoassay (EIA) that detects the presence of a ligand (commonly a protein) in a liquid sample, using antibodies directed against the protein to be measured.

## Correlation

Let's try a simple correlation. This site has some interesting [ideas](https://paulvanderlaken.com/2018/09/10/simpler-correlation-analysis-in-r-using-tidyverse-priciples/) which we may visit later. Let's correlate using the pipes from dplyr:

```{r}
library(dplyr)
library(Hmisc)
# Pipe the three numeric columns into rcorr(), which expects a matrix
tb %>%
  dplyr::select(Time_to_diagnosis, ELISA, Age) %>%
  as.matrix() %>%
  rcorr()
```

From this it is clear that the correlations between time to diagnosis, age and ELISA are low.

## Schizont

The schizont column appears to correspond to a log-transformed ELISA (?). Let's check that:

```{r}
data <- read_csv("data-202003/final_chmi_covariates.csv")
```

Now it should be a native tibble. Let's plot:

```{r}
ggplot(data = data) + geom_point(mapping = aes(y = schizont, x = time_to_diagnosis_last_PCR))
```
which does look similar to

```{r}
ggplot(data = tb) + geom_point(mapping = aes(y = log(ELISA), x = Time_to_diagnosis))
```
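
The visual comparison could also be checked numerically, assuming both tables share a participant identifier. The sketch below uses made-up toy values and a hypothetical key column `id`; the real key column in the two CSV files is not known here, so treat this as an outline rather than the actual check:

```{r}
library(dplyr)
library(tibble)

# Toy stand-ins for the two tables; `id` is a hypothetical key column
elisa_toy <- tibble(id = 1:5, ELISA = c(10, 50, 200, 800, 3000))
schizont_toy <- tibble(
  id = 1:5,
  schizont = log(c(10, 50, 200, 800, 3000)) + c(0.05, -0.05, 0.02, -0.02, 0)
)

# Match rows by participant, then compare schizont with raw and log ELISA
joined <- inner_join(elisa_toy, schizont_toy, by = "id")
r_log <- cor(joined$schizont, log(joined$ELISA))
r_raw <- cor(joined$schizont, joined$ELISA)
c(log = r_log, raw = r_raw)
```

If schizont really is a log-transformed ELISA, the correlation with `log(ELISA)` should be close to 1 and clearly higher than the correlation with the raw values.
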
Let's do some colouring. There are three locations.

```{r}
ggplot(data = data) + geom_point(mapping = aes(y = schizont, x = time_to_diagnosis_last_PCR, color = location))
```

You can tell the schizont values are higher for Kilifi South, though there does not appear to be a clear relationship with time to diagnosis (other than that all high values are beyond 22 days). Let's try a model:
```{r}
ggplot(data = data) +
  geom_point(mapping = aes(y = schizont, x = time_to_diagnosis_last_PCR, colour = location)) +
  geom_smooth(mapping = aes(time_to_diagnosis_last_PCR, schizont), se = FALSE)
```
Now there is a trend line against the number of days. Very nice. Let's see if we can split by location:

```{r}
ggplot(data = data, aes(time_to_diagnosis_last_PCR, schizont, colour = location)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), se = FALSE)
```
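
In the call above, `geom_smooth(method = lm, formula = y ~ splines::bs(x, 3))` fits an ordinary linear model on a B-spline basis, separately for each colour group. The second positional argument of `bs()` is `df`, so with the default cubic degree `bs(x, 3)` has no interior knots and spans the same curves as a cubic polynomial. A minimal sketch on made-up toy data (not the study data) illustrates this:

```{r}
library(splines)

# Toy data, for illustration only
toy_curve <- data.frame(x = 1:20)
toy_curve$y <- sin(toy_curve$x / 3) + 0.1 * toy_curve$x

fit_bs   <- lm(y ~ bs(x, 3), data = toy_curve)    # the kind of model geom_smooth fits per group
fit_poly <- lm(y ~ poly(x, 3), data = toy_curve)  # plain cubic polynomial

# Both bases span the same cubic curves, so the fitted values agree
same_fit <- all.equal(unname(fitted(fit_bs)), unname(fitted(fit_poly)))
same_fit
```

A larger `df`, for example `bs(x, df = 5)`, would add interior knots and allow a more flexible curve.
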
The anti-schizont antibody is measured before infection. Diagnosis is based on PCR. Based on the figure split by location, the schizont antibody is typically lower in Kilifi North, which points to less malaria activity/infections. When the schizont antibody is higher, time to diagnosis is later. This is true for all areas.
Gender shows no real effect:

```{r}
ggplot(data = data, aes(time_to_diagnosis_last_PCR, schizont, colour = gender)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE)
```
Nor does the genotype:

```{r}
ggplot(data = data, aes(time_to_diagnosis_last_PCR, schizont, colour = thal_genotype)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE)
```
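
Reading "no real effect" off a plot could be backed up by comparing nested linear models. The sketch below is one possible approach, not something run on the study data: it uses made-up toy values and the hypothetical column names `days`, `level` and `group`, and tests whether adding the group term improves the fit.

```{r}
# Toy data: `days`, `level` and `group` are hypothetical names, values are made up
toy_groups <- data.frame(
  days  = rep(1:10, times = 2),
  group = rep(c("a", "b"), each = 10)
)
wiggle_a <- rep(c(0.3, -0.2, 0.1, -0.4, 0.2), times = 2)
wiggle_b <- rep(c(-0.3, 0.2, -0.1, 0.4, -0.1), times = 2)
toy_groups$level <- 2 + 0.5 * toy_groups$days + c(wiggle_a, wiggle_b)

fit_common <- lm(level ~ days, data = toy_groups)          # one line for everyone
fit_group  <- lm(level ~ days + group, data = toy_groups)  # separate intercept per group

# F-test for the extra group term; a large p-value is consistent with "no real effect"
cmp <- anova(fit_common, fit_group)
p_value <- cmp[["Pr(>F)"]][2]
p_value
```

Using `days * group` instead of `days + group` would also let the slope differ per group, which is closer to the separate lines that `geom_smooth()` draws.
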
Going by phenotype:

```{r}
ggplot(data = data, aes(time_to_diagnosis_last_PCR, schizont, colour = phenotypes_all)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE)
```

This suggests a difference in starting schizont level between the suscept and febrile groups. The chronic and PCR-ve groups sit firmly in late diagnosis.