
Add dissertation proposal and its short summary

main
Bonface Munyoki 2 weeks ago
parent
commit
2481cbd8c8
Signed by: bonfacemunyoki GPG Key ID: F5BBAE1E0392253F
  1. 57
      grant-proposal/short-summary.org
  2. 23
      proposal/config.org
  3. 6
      proposal/problem-statement.org
  4. 266
      proposal/proposal.bib
  5. 91
      proposal/proposal.org

57
grant-proposal/short-summary.org

@ -0,0 +1,57 @@
* Short Summary
GeneNetwork (GN) is one of the oldest omics databases in the world. It has been collecting data since 1994, and data collection is ongoing. This project aims to make this data easier to use, particularly for AI/ML algorithms. Naturally, this implies having a well-annotated data set that is machine-readable and discoverable. The project will transform specific GN data types (files on disk and tables in a SQL server) into an RDF graph-database and expose it through SPARQL endpoints. To ensure data integrity, the project will take existing studies based on GN, try to replicate them with the new graph-database, and make measurable comparisons between SQL and RDF+SPARQL.
The final step would be to replicate this GN infrastructure here in Kenya for use by local facilities (KEMRI/Wellcome Trust) for further research.
The estimated total contribution for this proposal is USD 5000. The grant will help me support my studies so I can focus fully on this data science project.
* Motivational Statement
Some time back, I was diagnosed with prediabetes. Troubling as this news was, and being fortunate enough to work with bio-informaticians and geneticists as part of the GeneNetwork development efforts, I learnt how little we know about diseases and, most importantly, how hard it is to share data about diseases safely. It is even more concerning to read about how biased data can be. Efforts to work around such bias are hampered by well-meaning privacy laws and, in some cases, by a lack of infrastructure to acquire the data. The ongoing COVID-19 pandemic clearly highlights this. This research is important to me because it is a small step towards enabling researchers here in Kenya to share their data in a FAIR way and, most importantly, in a practical and convenient way, thereby enabling research on conditions and diseases in the Kenyan context.
* Activities and Methodology
Activities:
- Take existing studies as an example and convert a specific set of GN tables to an RDF graph-database
- Expose GN data through graph-database SPARQL endpoints
- Host GN at KEMRI/Wellcome Trust to allow researchers in Africa to upload and analyse data
- Write a peer-reviewed publication on this work
Methodology:
The activities of this project will proceed incrementally. If advancements in one activity are beneficial to another aim, such changes will be made.
In this project, I will use the GeneNetwork (GN) database as my primary data source. Because I actively work on the GN project, getting this data is not a hindrance. After transforming specific tables in GN from SQL to RDF, I intend to replicate previous GN studies so that I can compare SQL with RDF+SPARQL. This comparison will involve: measuring and contrasting the speed of complex queries; assessing the complexity of queries; and assessing the difficulty of common data operations, particularly insertions and deletions. As in any complex data science project, should new relevant measurements come into focus, they will be made. Finally, using GNU Guix as a strong infrastructure backbone, I will replicate the GN environment here in Kenya.
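The query-speed measurement described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the =trait= table and its columns are hypothetical stand-ins rather than the real GN schema, an in-memory SQLite database replaces MariaDB so the sketch runs anywhere, and =median_runtime= is a helper name invented here. The same helper could wrap a function that sends a SPARQL query to an endpoint, so that both back-ends are timed in the same way.

```python
import sqlite3
import statistics
import time
from typing import Callable, List


def median_runtime(run_query: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time in seconds over `repeats` calls."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Hypothetical stand-in for a GN table; the real schema is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trait (id INTEGER PRIMARY KEY, name TEXT, species TEXT)")
conn.executemany(
    "INSERT INTO trait (name, species) VALUES (?, ?)",
    [(f"trait_{i}", "mouse" if i % 2 else "rat") for i in range(1000)],
)


def sql_query() -> list:
    """Run the SQL side of the comparison and return all matching rows."""
    return conn.execute("SELECT name FROM trait WHERE species = 'mouse'").fetchall()


# Median of five runs; a SPARQL-backed callable would be timed the same way.
sql_seconds = median_runtime(sql_query)
```

Taking the median rather than the mean keeps a single slow run (for example a cold cache) from dominating the comparison.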
* Expected Outputs
- A subset of GN tables converted to an RDF graph-database
- Exposed GN data through graph-database SPARQL endpoints
- Hosted GN at KEMRI/Wellcome Trust to allow researchers in Africa to upload and analyse data
- An assessment on ethical aspects of access to human data in the Kenyan context
- A peer-reviewed publication on this work
* Ethical and Environmental implications
This research primarily involves database systems for major model organisms supplied by GeneNetwork (GN), and will grow to accommodate human data.
One point of ethical concern for working with such a data set is privacy and regulations around the same. A major perceived risk when working with data, particularly with human data, is that it is prone to misuse. I will take the opportunity to discuss and research access to human data and ethics. We can ask: Is there a way to anonymize data in a way that you could still perform useful research on it? There has been active research on this, and in fact, GN has demonstrated one such technique---/homomorphic encryption/---in anonymizing genotype and phenotype data. Homomorphic encryption and other techniques such as differential privacy will be explored. The most practical and feasible technique for data anonymization will be applied so that we can guarantee that human data used is safe. The alternative is to make FAIR metadata available that can tell researchers how to reach that data.
With regard to the environment, the main impact of this project will be the compute power required, particularly for running GN.
* Risks and Mitigations
One risk is that this project won't get hardware support. My supervisor at KEMRI/Wellcome runs a high-performance compute cluster with large storage, and we want to host GeneNetwork there. If that turns out to be a problem, I may use a cluster at Pwani University. Another fallback is the University of Tennessee GeneNetwork servers, to which I have access [https://genenetwork.org/facilities/] and which are always available for GeneNetwork-related work.
Another risk may be that I get no collaborators in Kenya to share their scientific requirements and studies. My supervisors are involved with a great number of them and are positive that scientists are eager to collaborate with me on data-science targets. We need more data scientists in the biomedical sciences in Kenya.
* Timeline/Work schedule
May -- Sep 2022:
- Take existing studies as an example and convert a specific set of GN tables to an RDF graph-database
- Expose GN data through graph-database SPARQL endpoints
Oct -- Dec 2022:
- Host GN at KEMRI/Wellcome Trust to allow researchers in Africa to upload and analyse data
- Provide assistance in adding Kenyan data to our local instance
Jan -- Apr 2023:
- Assess ethical aspects of access to human data in the Kenyan context
- Write a peer-reviewed publication on this work

23
proposal/config.org

@ -0,0 +1,23 @@
#+LaTeX_HEADER: \usepackage[utf8]{inputenc}
#+LaTeX_HEADER: \usepackage[T1]{fontenc}
#+LaTeX_HEADER: \usepackage{palatino}
#+LaTeX_HEADER: \usepackage{enumitem}
@@latex:\linespread{1.05}@@
@@latex:\renewcommand*\familydefault{\sfdefault}@@
#+LaTeX_HEADER: \usepackage{amsfonts, amsmath, amsthm, amssymb}
#+LaTeX_HEADER: \usepackage{graphicx}
#+LaTeX_HEADER: \usepackage{booktabs}
#+LaTeX_HEADER: \usepackage{wrapfig}
#+LaTeX_HEADER: \usepackage[labelfont=bf]{caption}
#+latex_header: \usepackage{svg}
#+latex_header: \setlength{\parindent}{0cm}
#+latex_header: \setlength{\parskip}{5pt}
#+latex_header: \setlist[itemize]{noitemsep}
#+latex_header: \usepackage{titlesec}
#+latex_header: \usepackage{enumitem}
#+latex_header: \setlist{nolistsep}
#+latex_header: \usepackage{caption}
#+latex_header: \captionsetup[figure]{font=small,labelfont=small}
@@latex:\pagestyle{empty}@@
@@latex:\hyphenation{ionto-pho-re-tic iso-tro-pic fortran}@@
#+latex_header: \usepackage[sorting=none]{biblatex}

6
proposal/problem-statement.org

@ -0,0 +1,6 @@
Problem Statement:
With technological improvements, both in terms of computational and storage power, there is a proliferation of data in biomedical AI/ML. Current efforts in this space are hampered by data that is difficult to access and often siloed. Even public data is hard to process because relationships between data elements have to be guessed from limited information, such as column headers and row names. For complex data to be truly useful, we need to enable machines to not only access the data, but also interpret and incorporate this data into their algorithms. This will enable automatic inferencing beyond one-to-one equivalence matches to more complicated relationships across datasets.
This dissertation aims to provide new opportunities for the African scientific community; in particular, African health research on diabetes, hypertension and longevity. It aims to do so by outlining and demonstrating, using GeneNetwork as an example, how to reshape existing data and model it in a more semantic, machine-ingestible and extensible way. It will be encouraging to give scientists easier access to well-annotated localised Kenyan data to answer context-specific questions such as: "What traits are risk factors in conjunction with diabetes in Kenya?"

266
proposal/proposal.bib

@ -0,0 +1,266 @@
@article{wilkinson2016fair,
title={The FAIR Guiding Principles for scientific data management and stewardship},
author={Wilkinson, Mark D and Dumontier, Michel and Aalbersberg, IJsbrand Jan and Appleton, Gabrielle and Axton, Myles and Baak, Arie and Blomberg, Niklas and Boiten, Jan-Willem and da Silva Santos, Luiz Bonino and Bourne, Philip E and others},
journal={Scientific data},
volume={3},
number={1},
pages={1--9},
year={2016},
publisher={Nature Publishing Group}
}
@inproceedings{dwork2008differential,
title={Differential privacy: A survey of results},
author={Dwork, Cynthia},
booktitle={International conference on theory and applications of models of computation},
pages={1--19},
year={2008},
organization={Springer}
}
@article{mott2020private,
title={Private Genomes and Public SNPs: Homomorphic encryption of genotypes and phenotypes for shared quantitative genetics},
author={Mott, Richard and Fischer, Christian and Prins, Pjotr and Davies, Robert William},
journal={Genetics},
volume={215},
number={2},
pages={359--372},
year={2020},
publisher={Oxford University Press}
}
@article{sloan2016genenetwork,
title={GeneNetwork: framework for web-based genetics},
author={Sloan, Zachary and Arends, Danny and Broman, Karl W and Centeno, Arthur and Furlotte, Nicholas and Nijveen, Harm and Yan, Lei and Zhou, Xiang and Williams, Robert W and Prins, Pjotr},
journal={Journal of Open Source Software},
volume={1},
number={2},
pages={25},
year={2016}
}
@article{wang2003webqtl,
title={WebQTL},
author={Wang, Jintao and Williams, Robert W and Manly, Kenneth F},
journal={Neuroinformatics},
volume={1},
number={4},
pages={299--308},
year={2003},
publisher={Springer}
}
@incollection{mulligan2017genenetwork,
title={GeneNetwork: a toolbox for systems genetics},
author={Mulligan, Megan K and Mozhui, Khyobeni and Prins, Pjotr and Williams, Robert W},
booktitle={Systems Genetics},
pages={75--120},
year={2017},
publisher={Springer}
}
@article{apweiler2004uniprot,
title={UniProt: the universal protein knowledgebase},
author={Apweiler, Rolf and Bairoch, Amos and Wu, Cathy H and Barker, Winona C and Boeckmann, Brigitte and Ferro, Serenella and Gasteiger, Elisabeth and Huang, Hongzhan and Lopez, Rodrigo and Magrane, Michele and others},
journal={Nucleic acids research},
volume={32},
number={suppl\_1},
pages={D115--D119},
year={2004},
publisher={Oxford University Press}
}
@article{gymrek2013identifying,
title={Identifying personal genomes by surname inference},
author={Gymrek, Melissa and McGuire, Amy L and Golan, David and Halperin, Eran and Erlich, Yaniv},
journal={Science},
volume={339},
number={6117},
pages={321--324},
year={2013},
publisher={American Association for the Advancement of Science}
}
@article{buil2013federating,
title={Federating queries in SPARQL 1.1: Syntax, semantics and evaluation},
author={Buil-Aranda, Carlos and Arenas, Marcelo and Corcho, Oscar and Polleres, Axel},
journal={Journal of Web Semantics},
volume={18},
number={1},
pages={1--17},
year={2013},
publisher={Elsevier}
}
@inproceedings{wang2009learning,
title={Learning your identity and disease from research papers: information leaks in genome wide association study},
author={Wang, Rui and Li, Yong Fuga and Wang, XiaoFeng and Tang, Haixu and Zhou, Xiaoyong},
booktitle={Proceedings of the 16th ACM conference on Computer and communications security},
pages={534--544},
year={2009}
}
@article{homer2008resolving,
title={Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays},
author={Homer, Nils and Szelinger, Szabolcs and Redman, Margot and Duggan, David and Tembe, Waibhav and Muehling, Jill and Pearson, John V and Stephan, Dietrich A and Nelson, Stanley F and Craig, David W},
journal={PLoS genetics},
volume={4},
number={8},
pages={e1000167},
year={2008},
publisher={Public Library of Science San Francisco, USA}
}
@article{erlich2014routes,
title={Routes for breaching and protecting genetic privacy},
author={Erlich, Yaniv and Narayanan, Arvind},
journal={Nature Reviews Genetics},
volume={15},
number={6},
pages={409--421},
year={2014},
publisher={Nature Publishing Group}
}
@article{de2013users,
title={Users’ attitudes, perception, and concerns in the era of whole genome sequencing},
author={De Cristofaro, Emiliano},
journal={CoRR},
year={2013}
}
@inproceedings{johnson2013privacy,
title={Privacy-preserving data exploration in genome-wide association studies},
author={Johnson, Aaron and Shmatikov, Vitaly},
booktitle={Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining},
pages={1079--1087},
year={2013}
}
@article{martens2018importance,
title={The importance of data access regimes for artificial intelligence and machine learning},
author={Martens, Bertin},
year={2018},
publisher={JRC Digital Economy Working Paper 2018-09}
}
@misc{stall2019make,
title={Make scientific data FAIR},
author={Stall, Shelley and Yarmey, Lynn and Cutcher-Gershenfeld, Joel and Hanson, Brooks and Lehnert, Kerstin and Nosek, Brian and Parsons, Mark and Robinson, Erin and Wyborn, Lesley},
year={2019},
publisher={Nature Publishing Group}
}
@article{crocker2011addressing,
title={Addressing scientific fraud},
author={Crocker, Jennifer and Cooper, M Lynne},
journal={Science},
volume={334},
number={6060},
pages={1182--1182},
year={2011},
publisher={American Association for the Advancement of Science}
}
@article{sandve2013ten,
title={Ten simple rules for reproducible computational research},
author={Sandve, Geir Kjetil and Nekrutenko, Anton and Taylor, James and Hovig, Eivind},
journal={PLoS computational biology},
volume={9},
number={10},
pages={e1003285},
year={2013},
publisher={Public Library of Science San Francisco, USA}
}
@inproceedings{bajpai2017challenges,
title={Challenges with reproducibility},
author={Bajpai, Vaibhav and K{\"u}hlewind, Mirja and Ott, J{\"o}rg and Sch{\"o}nw{\"a}lder, J{\"u}rgen and Sperotto, Anna and Trammell, Brian},
booktitle={Proceedings of the Reproducibility Workshop},
pages={1--4},
year={2017}
}
@online{theGuixEnvironment2022Gnu,
title={Invoking guix container (GNU Guix Reference Manual)},
year={2022},
url={https://guix.gnu.org/manual/en/html_node/Invoking-guix-container.html}
}
@online{nlnetVariationGraph,
title={variation graph (vgteam)},
year={2022},
url={https://nlnet.nl/project/VariationGraph/}
}
@online{hegp22,
title={HEGP Challenge},
year={2022},
url={https://hegp.genenetwork.org/}
}
@inproceedings{erxleben2014introducing,
title={Introducing Wikidata to the linked data web},
author={Erxleben, Fredo and G{\"u}nther, Michael and Kr{\"o}tzsch, Markus and Mendez, Julian and Vrande{\v{c}}i{\'c}, Denny},
booktitle={International semantic web conference},
pages={50--65},
year={2014},
organization={Springer}
}
@inproceedings{malyshev2018getting,
title={Getting the most out of Wikidata: semantic technology usage in Wikipedia’s knowledge graph},
author={Malyshev, Stanislav and Kr{\"o}tzsch, Markus and Gonz{\'a}lez, Larry and Gonsior, Julius and Bielefeldt, Adrian},
booktitle={International Semantic Web Conference},
pages={376--394},
year={2018},
organization={Springer}
}
@article{redaschi2009uniprot,
title={UniProt in RDF: tackling data integration and distributed annotation with the semantic web},
author={Redaschi, Nicole and {UniProt Consortium} and others},
journal={Nature precedings},
pages={1--1},
year={2009},
publisher={Nature Publishing Group}
}
@misc{gnGitStats,
title = {Genenetwork},
year = {2022},
howpublished = {\url{https://github.com/genenetwork/genenetwork2/pulse}},
}
@inproceedings{munoz2014using,
title={Using linked data to mine RDF from wikipedia's tables},
author={Mu{\~n}oz, Emir and Hogan, Aidan and Mileo, Alessandra},
booktitle={Proceedings of the 7th ACM international conference on Web search and data mining},
pages={533--542},
year={2014}
}
@article{anderson2021highlights,
title={Highlights from the Era of open source web-based tools},
author={Anderson, Kristin R and Harris, Julie A and Ng, Lydia and Prins, Pjotr and Memar, Sara and Ljungquist, Bengt and F{\"u}rth, Daniel and Williams, Robert W and Ascoli, Giorgio A and Dumitriu, Dani},
journal={Journal of Neuroscience},
volume={41},
number={5},
pages={927--936},
year={2021},
publisher={Soc Neuroscience}
}
@article{gunturkun2022genecup,
title={GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships},
author={Gunturkun, Mustafa Hakan and Flashner, Efraim and Wang, Tengfei and Mulligan, Megan K and Williams, Robert W and Prins, Pjotr and Chen, Hao},
journal={G3 Genes|Genomes|Genetics},
year={2022}
}
@misc{2022gnRefs,
title = {Papers and References to GeneNetwork},
year = {2022},
howpublished = {\url{http://genenetwork.org/references/}},
}

91
proposal/proposal.org

@ -0,0 +1,91 @@
#+TITLE: Unlocking Biomedical Data for AI Health Research in Africa Using GeneNetwork as an Example
#+OPTIONS: toc:nil title:t num:nil author:nil date:nil
#+latex_class: article
#+latex_class_options: [notitlepage,11pt]
#+CITE_EXPORT: biblatex
#+INCLUDE: config.org
#+bibliography: proposal.bib
@@latex:\vspace{-5em}@@
* Introduction
Current efforts in biomedical AI/ML are hampered by data that is difficult to access and siloed [cite:@martens2018importance;@stall2019make;@wilkinson2016fair]. Even public data is hard to process because relationships between data elements have to be guessed from limited information, such as column headers and row names. For complex data to be truly useful, we need to enable machines to not only access the data, but also interpret and incorporate this data into their algorithms. This will enable automatic inferencing beyond one-to-one equivalence matches to more complicated relationships across datasets.
This dissertation aims to provide new opportunities for the African scientific community; in particular, African health research on diabetes, hypertension and longevity. At a personal level, having been diagnosed as prediabetic, I believe it will be encouraging to give scientists easier access to well-annotated localised Kenyan data to answer context-specific questions such as: "What traits are risk factors in conjunction with diabetes in Kenya?" An example of a service in this spirit is GeneCup[fn:genecupweb], a tool that efficiently and comprehensively answers the question: "What do we know about these genes and the topic I study?" [cite:@gunturkun2022genecup]. As an example, Figure [[fig:genecup]] shows a GeneCup graph for the keyword /diabetes/. By text mining PubMed abstracts and applying deep learning, GeneCup finds relations between diabetes and gene symbols; that is, GeneCup focuses on the relationship between genes and a set of keywords organized as an ontology [cite:@gunturkun2022genecup].
The long-term vision of this data science dissertation is to unlock Kenyan biological data, so that it can be made available for machine analysis in services such as GeneCup. I will start with the public GeneNetwork (GN) resource, which contains over 20 years of heterogeneous experimental data that has resulted in thousands of publications [cite:@2022gnRefs]. GN is a long-running biomedical web service that has been in existence since 1994 [cite:@anderson2021highlights;@mulligan2017genenetwork;@sloan2016genenetwork;@wang2003webqtl]. GN enables researchers with little programming experience to access large omics data sets [fn:omics-data] through a web interface and to run complex statistical models.
Data is also presented through API endpoints---accessible in programming environments like R, Python and live Jupyter notebooks. GN is uniquely centered on hosting genetics data, phenotyping, Quantitative Trait Loci (QTL) and Genome Wide Association (GWA) studies in human and model species [fn:model-species], such as mouse and rat. Having all this data from different experiments spanning over 20 years in one database with a simple web UI allows any researcher to uniquely run analyses across many studies.
During my master's studies at Strathmore University, I have written functionality for the web front-end of GN. In the course of this work, I have come to realize that the underlying data structures---some 80 cross-referenced SQL tables and many different additional file types---make it hard to access this data for more general data mining.
A REST API was created to access data. A REST API, however, is rigid and presupposes how people want to access data.
REST APIs are not very useful for automated data-science analysis because machines cannot reason about the relationships between the provided resources unless an ontology is provided. It is clear that more flexible approaches are needed to increase the value of the service. Unlocking the data means reorganizing the data and providing a way of automated data discovery. Additionally, I aim to make a copy of the current widely used service that is physically hosted in Memphis TN, USA, and host it in Kenya at KEMRI/Wellcome Trust Kilifi [fn:howKemri]. This will allow us to add African studies and data to GN using the latest analysis and storage methods, including homomorphic encryption [cite:@mott2020private] and differential privacy [cite:@dwork2008differential;@nlnetVariationGraph] for human data.
Underpinning the GN service is a mountain of data that is growing steadily (currently about 0.5 TB on the server and 30 TB from wider resources, such as UniProt, Wikidata and PubMed), and with it come new challenges, in particular: expressing clear relations between the data given a history of ad-hoc storage; fast data retrieval; non-obvious complex queries for seemingly straightforward tasks; connecting with other open databases (e.g. UniProt) in a coherent way; and unclear means of adding new data that does not align with existing schemas. To address these challenges, I will introduce a graph-based data model, built on the web standards RDF and SPARQL, to replace the existing database.
In the GN software model, I found 163 specialised SQL queries for search, data fetching and analysis. Providing a new graph-based RDF data model with built-in relationships and machine-readable annotations will make machine-based data mining possible [cite:@munoz2014using]. The approach is not new. In fact, two successful, very large and high-performing RDF+SPARQL graph databases are Wikidata, the knowledge graph behind Wikipedia, and UniProt [cite:@erxleben2014introducing;@redaschi2009uniprot].
To get a good understanding of data-science requirements, my approach will be to take three or more examples of published GeneNetwork (GN) studies [cite:@2022gnRefs], e.g., on diabetes, obesity and longevity, and replicate the analyses of those studies while building up the new graph-based database structure. I will apply ML/AI techniques and link the data out to external resources, similar to what GeneCup achieves with deep learning on PubMed abstracts (Figure [[fig:genecup]]). I will show that the data from these examples can be discovered and reasoned on by machines. Next, through KEMRI/Wellcome and Strathmore University, I will work with biomedical researchers to see how we can facilitate their research using the GN infrastructure and hosting that in Kenya.
#+CAPTION: Example using GeneCup to find relationships between the keyword /diabetes/ and ontological categories and traits. GeneCup mines open data abstracts of the NIH PubMed server programmatically and uses deep learning to automatically distinguish sentences describing disease-related terms [cite:@gunturkun2022genecup].
#+LABEL: fig:genecup
#+NAME: fig:genecup
#+ATTR_ORG: :width 550px
#+ATTR_LATEX: :width 550px :center t
[[../grant-proposal/genecup-graph.png]]
*Research Hypothesis*: Graph-databases with semantic ontologies will unlock biomedical data and make it available for ML/AI analysis.
*Expected Outcomes*:
- A subset of GN tables converted to an RDF graph-database
- Exposed GN data through graph-database SPARQL endpoints
- Hosted GN at KEMRI/Wellcome Trust to allow researchers in Africa to upload and analyse data
- An assessment on ethical aspects of access to human data in the Kenyan context
- A peer-reviewed publication on this work
Hosting the computational resources and biomedical data in Kenya and giving access to African researchers is aligned with the Sustainable Development Goals as outlined by the United Nations (\url{https://www.undp.org/sustainable-development-goals}).
The two deliverables for this research dissertation are:
*** 1: Convert existing data types to RDF graph-databases and make them available as machine discoverable resources with SPARQL and ontologies
Currently, most GN data is stored as SQL tables in MariaDB, with some data stored as files on disk. Forming meaningful queries on such a structure involves formulating complex joins and requires prior understanding of data relationships. Additionally, data entry tasks may sometimes require adding new tables or files to accommodate new studies. While working with GN, I have observed that most of the data is graph-like in nature, despite being forced into a 2D structure (tables). This modus operandi of forcing graph-like data into 2D structures is limiting when adding new data with unforeseen dimensions that our rigid existing 2D schema cannot accommodate. I aim to assess graph-like entries in the GN database, transform them to RDF, and annotate the generated graph. By so doing, I want to demonstrate improved semantic quality of GN data, which can be extended to accommodate new data sets, particularly for the Kenyan context.
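As a rough illustration of the table-to-RDF step, the sketch below turns each cell of a SQL row into one triple in N-Triples syntax. Everything specific in it is an assumption made for the example: the =http://example.org/gn/= namespace, the =strain= table and its columns, and the function names are hypothetical, SQLite stands in for MariaDB, and every value is emitted as a plain string literal. A real conversion would map columns to terms from existing ontologies rather than to generated predicate names.

```python
import sqlite3
from typing import Iterator

BASE = "http://example.org/gn/"  # hypothetical namespace, not a real GN URI


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(conn: sqlite3.Connection, table: str, key: str) -> Iterator[str]:
    """Yield one N-Triples line per non-NULL, non-key cell of `table`.

    The row's key value forms the subject IRI and each remaining column
    becomes a predicate. `table` is interpolated into the SQL text, so it
    must come from trusted code, never from user input.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key]}>"
        for column, value in record.items():
            if column == key or value is None:
                continue
            yield f'{subject} <{BASE}{table}#{column}> "{escape_literal(str(value))}" .'


# Toy table used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strain (id INTEGER PRIMARY KEY, name TEXT, species TEXT)")
conn.execute("INSERT INTO strain VALUES (1, 'BXD1', 'mouse')")
triples = list(rows_to_ntriples(conn, "strain", "id"))
```

Because a new attribute becomes just another triple on the same subject, adding an unforeseen dimension does not require altering a table definition.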
*** 2: Move GeneNetwork and the results of deliverable 1 to hosted facilities in Kenya
Sharing data for ML/AI purposes requires data access, and ideally the data should be FAIR [cite:@wilkinson2016fair]. Stringent laws around data privacy in different countries make research difficult for an African researcher who requires access to that data. As an example, accessing GN data in its entirety outside the US requires working through bureaucratic processes. Similarly, here in Kenya, access to health-care-related data is restricted to within our /[Kenya's]/ borders. For the purpose of my research this presents an opportunity: hosting our own GN instance here in Kenya with data that is stored in a more semantic way. To work around sharing sensitive data, well-annotated and useful metadata can be released to the public; this will point curious and potential end-users to the right places to seek data access, even if the data resides behind closed walls. SPARQL allows for federated queries [cite:@buil2013federating], in which multiple database instances at different locations on the internet act as a single instance. Such linked data allows exploration and evaluation by removing the tedium of designing an integrative schema or working out ways to normalise and/or warehouse a data subset in order to ask /domain-spanning/ questions.
One risk is that we don't get hardware support. My supervisor at KEMRI/Wellcome runs a high-performance compute cluster with large storage, and we want to host GeneNetwork there. If that turns out to be a problem, I may use a cluster at Pwani University. Another fallback is the University of Tennessee GeneNetwork servers, to which I have access [https://genenetwork.org/facilities/] and which are always available for GeneNetwork-related work. Another risk may be that I get no collaborators in Kenya to share their scientific requirements and studies. My supervisors are involved with a great number of them and are positive that scientists are eager to collaborate with me on data-science targets. We need more data scientists in the biomedical sciences in Kenya.
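To make the idea of a federated query concrete, the sketch below assembles a SPARQL 1.1 query whose =SERVICE= clause delegates part of the pattern to a second endpoint. Only the =SERVICE= keyword comes from the SPARQL 1.1 Federated Query standard; the endpoint URL, the predicates and the helper name =federated_query= are hypothetical placeholders, and the query is only built as text, not sent anywhere.

```python
from typing import List


def federated_query(
    local_pattern: str,
    remote_endpoint: str,
    remote_pattern: str,
    variables: List[str],
) -> str:
    """Build a SELECT query that joins a local graph pattern with a
    pattern evaluated at `remote_endpoint` through a SERVICE clause."""
    projection = " ".join(f"?{name}" for name in variables)
    return (
        f"SELECT {projection} WHERE {{\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        f"  }}\n"
        f"}}"
    )


# All IRIs below are placeholders chosen for illustration only.
query = federated_query(
    local_pattern="?gene <http://example.org/gn/gene#symbol> ?symbol .",
    remote_endpoint="https://sparql.example.org/sparql",
    remote_pattern="?protein <http://example.org/remote#encodedBy> ?symbol .",
    variables=["gene", "protein"],
)
```

The shared variable =?symbol= is what joins the local and remote results, which is how two physically separate stores can be queried as if they were one.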
* Ethical Review
This research primarily involves database systems for major model organisms supplied by GN, and will grow to accommodate human data.
One point of ethical concern for working with such a data set is privacy and regulations around the same. A major perceived risk when working with data, particularly with human data, is that it is prone to misuse [cite:@gymrek2013identifying;@wang2009learning;@erlich2014routes]. I will take the opportunity to discuss and research access to human data and ethics. We can ask: /Is there a way to anonymize data in a way that you could still perform useful research on it?/ There has been active research on this, and in fact, GN has demonstrated one such technique---/homomorphic encryption/---in anonymizing genotype and phenotype data (\url{https://hegp.genenetwork.org/}) [cite:@mott2020private]. Homomorphic encryption and other techniques such as differential privacy [cite:@dwork2008differential] will be explored. The most practical and feasible technique for data anonymization will be applied so that we can guarantee that human data used is safe. The alternative is to make FAIR metadata available that can tell researchers how to reach that data:
- Allowing machines to easily "discover" and infer (/access/) the data they need;
- Having a shared vocabulary in a machine accessible format that guarantees interoperability;
- Having contextual information that allows proper and correct interpretation thereby making the data more reusable; and
- Allowing the attachment of rich provenance information to a graph-like structure which allows for accurate citation.
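For the differential privacy technique mentioned above, a minimal textbook sketch is the Laplace mechanism applied to a counting query. It is not the method used by GN or by the HEGP work; the function names, the example count of 100 and the choice of epsilon are assumptions for illustration, and a production system would need a vetted library because naive floating-point noise has known weaknesses.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-centred Laplace distribution.

    The difference of two independent exponential variables with rate
    1/scale follows Laplace(0, scale).
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1/epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)  # fixed seed only to make the demo repeatable
noisy = private_count(true_count=100, epsilon=1.0, rng=rng)
```

A smaller epsilon gives stronger privacy but noisier answers, which is the trade-off that would have to be weighed against research usefulness.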
The effect of this work on people and society is improving the underlying health care infrastructure, particularly when it comes to storing and accessing local omics data. This would have a cascading effect of improving research that needs local data and, similar to shining a beacon in the dark, lead to newer perspectives of data sets we already have.
* Supervisors
My /[external]/ supervisors for this dissertation are: Prof. Shelby Solomon, who is an expert in data science, AI and ML; Prof. Pjotr Prins, University of Tennessee USA, who contributes to GeneNetwork and co-authored the GeneCup and homomorphic encryption papers; and Dr George Githinji, KEMRI/Wellcome Trust, Kenya, who is the bio-informatics and data science lead and Adjunct Professor at Pwani University. Computational facilities and storage will be made available by KEMRI/Wellcome Trust, Strathmore University, and the University of Tennessee. With the help of Strathmore University, I am in the process of looking for an internal supervisor.
* References :ignore:
#+print_bibliography:
* Footnotes :ignore:
[fn:guix-containers] https://guix.gnu.org/manual/en/html_node/Invoking-guix-container.html
[fn:storing-sql] Storing data in SQL is the current standard in biology
[fn:omics-data] https://en.wikipedia.org/wiki/Omics
[fn:model-species] https://en.wikipedia.org/wiki/Model_organism
[fn:ratspub] https://rats.pub/
[fn:genecupweb] [[https://genecup.org/]]
[fn:howKemri] My supervisors will assist me to co-ordinate the actual details of getting a server to use from KEMRI/Wellcome Trust Kilifi.
# local variables:
# eval: (progn (require 'ox-extra) (ox-extras-activate '(ignore-headlines)))
# end: