# My Software Development Journey So Far


* This is a brief account of my progress so far in learning software development as part of the GeneNetwork team.
* I am a bioinformatician by profession. My responsibility is to use computational tools, together with statistics and mathematics, to answer biological questions and problems.
  This is done by analyzing biological data generated from sets of experiments.
* I developed a keen interest in software development after realizing how much power software tools give scientists for data analysis. Many scientists and bioinformaticians can
  analyze data, but very few take the time to learn to write their own software tools to support that analysis. This is how my interest in
  developing software for bioinformatics started to grow.
* Being part of the GeneNetwork team has given me my best experience so far in growing as a software developer and as a data engineer. Here I share my progress, the ongoing work,
  the lessons I have learned, the challenges I encountered and how I solved them, and the overall working environment with the team.

## Early Tasks
* One of the first tasks I was assigned was to understand APIs (Application Programming Interfaces) in general: how they work, the different types (in this case, REST APIs), and how to use and build them. I worked
  on several tasks in this area. For more information, see the following link:
=> https://github.com/fetche-lab/GeneNetwork_23FL/blob/main/API/python_REST-API_code.md
* Another task involved experimenting with SQLite to learn how to use an SQL database management system. The link to this task is:
=> https://github.com/fetche-lab/GeneNetwork_23FL/tree/main/python_sql
* Meanwhile, I am also learning Python programming and getting familiar with contributing to the GeneNetwork web service.
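
As a small illustration of the kind of SQLite experiment mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and values are made up for this example. They are not taken from the GeneNetwork database or from the linked repository.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database with a hypothetical phenotype table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE phenotype ("
        "strain TEXT NOT NULL, trait TEXT NOT NULL, value REAL)"
    )
    # Made-up rows, used only to have something to query.
    rows = [
        ("StrainA", "root_length", 4.2),
        ("StrainB", "root_length", 3.8),
        ("StrainA", "seed_weight", 0.02),
    ]
    # Placeholders (?) keep the values separate from the SQL text.
    conn.executemany("INSERT INTO phenotype VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mean_by_trait(conn):
    """Return a list of (trait, mean value) tuples, sorted by trait name."""
    cursor = conn.execute(
        "SELECT trait, AVG(value) FROM phenotype GROUP BY trait ORDER BY trait"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for trait, mean in mean_by_trait(connection):
        print(trait, round(mean, 3))
    connection.close()
```

Using an in-memory database keeps the experiment disposable: nothing is written to disk, so the script can be rerun freely while learning.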

## Current and Ongoing Tasks
* The current and ongoing tasks mainly revolve around curating data and uploading it to the GeneNetwork web service. The primary focus has been uploading test data as a demonstration, as well as uploading
  Arabidopsis and C. elegans phenotype datasets from known public sources (mostly NCBI and AraQTL).
* Data curation before uploading to the GeneNetwork database involved several data transformation steps. It was important to ensure that no invalid values remained in a dataset, since these would prevent the file from being
  uploaded successfully.
* The biggest challenge so far has been validating the strain names (represented as column headers in each uploaded dataset). The team responsible is currently working on this bug.
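
To make the idea of checking for invalid values more concrete, here is a minimal sketch in Python. It assumes a tab-separated file whose first column holds a row identifier and whose remaining columns hold one value per strain. That layout, the function name, and the list of accepted missing-value markers are assumptions for illustration, not the uploader's actual rules.

```python
import csv
import io


def find_invalid_values(tsv_text, missing_markers=("", "NA")):
    """Return (row_id, column_name, raw_value) for each cell that is not numeric.

    Assumed layout: the header row names the columns, the first column is a
    row identifier, and every other cell should be a number or a
    missing-value marker.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    problems = []
    for row in reader:
        row_id = row[0]
        for column, raw in zip(header[1:], row[1:]):
            value = raw.strip()
            if value in missing_markers:
                continue  # explicitly allowed placeholder for missing data
            try:
                float(value)
            except ValueError:
                problems.append((row_id, column, raw))
    return problems


if __name__ == "__main__":
    sample = "ID\tStrain1\tStrain2\ntrait1\t1.5\tNA\ntrait2\tabc\t2\n"
    for row_id, column, raw in find_invalid_values(sample):
        print(f"{row_id}: column {column} has non-numeric value {raw!r}")
```

Reporting every problem cell, rather than stopping at the first one, makes it easier to fix a dataset in a single pass before trying the upload again.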

### Examples of datasets and scripts used for data preprocessing and transformation before a successful upload
The following link leads to a markdown document on GitHub containing the Python scripts written to wrangle and transform the previously mentioned datasets.
It also shows how each dataset was explored in its own way to reach the desired result.

=> https://github.com/fetche-lab/GeneNetwork_23FL/blob/main/GeneNetwork_QC/test_uploads/testdata_upload.md
* You will observe that the only factor blocking a successful upload is that the strain names of most public datasets are absent from the GeneNetwork database. This is an opportunity
to improve the uploader so that users can add strain names directly when they discover that the names are missing from the database.
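
The check behind this blocking factor can be sketched in a few lines of Python. This is only an illustration: the function name and the example strain names are hypothetical, and the list of known strains would in practice have to come from the GeneNetwork database rather than from a hard-coded list.

```python
def missing_strain_names(dataset_headers, known_strains):
    """Return the dataset column headers that are not among the known strains.

    Surrounding whitespace is ignored and the order of the dataset headers
    is preserved, so the report matches the column order of the file.
    """
    known = {name.strip() for name in known_strains}
    return [
        header.strip()
        for header in dataset_headers
        if header.strip() not in known
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    headers = ["StrainA", " StrainB", "StrainX"]
    known = ["StrainA", "StrainB", "StrainC"]
    print(missing_strain_names(headers, known))
```

A list like this is what a user would need to see in order to decide which strain names to add before retrying the upload.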

### Why choose the above datasets?
Arabidopsis dataset from NCBI (GSE247158)
* In summary, this experiment aimed to explore the genetic interactions of three MADS-box genes (XAL2/AGL14, SOC1, and AGL24), which are crucial in Arabidopsis primary root development. The findings revealed that XAL2, SOC1, and AGL24 are differentially expressed under osmotic stress conditions, and that XAL2 also regulates several key genes involved in cell differentiation and in osmotic stress responses. In addition, AGL24, SOC1, and XAL2 participate in primary root growth under all the osmotic stress conditions used.

Arabidopsis dataset from AraQTL (shared by Harm) 
* In this experiment, the goal was to understand how gene expression is regulated during seed germination in Arabidopsis thaliana. eQTL (expression Quantitative Trait Loci) mapping was performed at four important seed germination stages (primary dormant, after-ripened, six hours after imbibition, and radicle protrusion), using Arabidopsis thaliana Bay x Sha recombinant inbred lines (RILs). Each stage had a distinct eQTL landscape. An eQTL hotspot on chromosome five co-locates with hotspots for phenotypic and metabolic QTL in the same population. The study revealed that the genetic regulation of gene expression over the course of seed germination is dynamic, and a network analysis identified the transcription factors DEWAX and ICE1 as the most likely regulatory genes for the hotspot.

Mouse experiment dataset from NCBI (GSE241528)
* This experiment focused on how the prevalence of post-stroke epilepsy (PSE), a condition likely to occur after an individual suffers a stroke, relates to age. The idea was to monitor GABAA receptor-mediated seizure susceptibility (GABAA is a receptor for the neurotransmitter GABA, which regulates neuronal excitability) in individuals that have suffered a stroke, in two groups: the elderly and the young. The study revealed that GABAA receptor-mediated seizure susceptibility increased more in elderly individuals than in young ones.


### Then, what is next after a successful upload of the dataset?
* Verify that the data is indeed in the database.
* Take a further step and perform analyses with the available datasets.