# Paper 2 Evaluation
This directory contains the code created to evaluate questions submitted to GNQA.
Unlike the evaluation in paper 1, this work uses different LLMs and a different RAG engine.
RAGAS is still used to evaluate the query responses.
The RAG engine used here is [R2R](https://github.com/SciPhi-AI/R2R); it is open source and performs comparably to the engine used in our first GNQA paper.
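For context, a query against R2R might look roughly like the sketch below. This assumes the v2-era R2R Python SDK (`pip install r2r`) and a local server; the endpoint URL and method names vary between R2R releases, so check the docs for the client version you have installed.

```python
# Minimal sketch of querying R2R; the port and the `rag` method follow the
# v2-era Python SDK and are assumptions -- newer SDKs organize the client
# differently, so treat this as illustrative only.
from r2r import R2RClient

client = R2RClient("http://localhost:7272")  # default local endpoint (assumption)
result = client.rag(query="What genes are associated with aging in mice?")
print(result)
```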
The evaluation workflow is organized around reading questions that are labeled along two category axes: category 1 is who asked the question, and category 2 is the field to which the question belongs.
In our initial work, category 1 consists of citizen scientists and domain experts, while category 2 consists of three fields or specializations: GeneNetwork.org systems genetics, the genetics of diabetes, and the genetics of aging.
We will make the code more configurable by pulling the categories out of the source code and keeping them strictly in settings files, as sketched below.
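A minimal sketch of what such a settings file could look like; the file name, key names, and category labels are hypothetical:

```python
# Hypothetical settings loader: the categories live in a JSON file
# (e.g. settings/categories.json) instead of being hard-coded.
import json

with open("settings/categories.json") as fh:
    categories = json.load(fh)

# Expected shape (illustrative):
# {
#   "category_1": ["citizen_scientist", "domain_expert"],
#   "category_2": ["gn_systems_genetics", "diabetes", "aging"]
# }
for who in categories["category_1"]:
    for field in categories["category_2"]:
        print(f"list_{who}_{field}.json")  # one question list per catA/catB pair
```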
It is best to define a structure for each of the different types of data the pipeline handles: question lists, datasets, responses, and scores; one possible layout is sketched below.
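As a sketch only; the field names here are assumptions rather than the scripts' exact schema:

```python
# Illustrative record types for each stage of the pipeline; field names
# are assumptions, not the exact schema used by the scripts.
from dataclasses import dataclass, field

@dataclass
class QuestionList:          # ../data/lists/ -- questions to submit to GNQA
    category_1: str          # who asked the question
    category_2: str          # field the question belongs to
    questions: list[str] = field(default_factory=list)

@dataclass
class Response:              # ../data/responses/ -- raw R2R output
    question: str
    answer: str
    contexts: list[str] = field(default_factory=list)

@dataclass
class DatasetEntry:          # ../data/dataset/ -- RAGAS-ready record
    question: str
    answer: str
    contexts: list[str]
    ground_truth: str

@dataclass
class ScoreRecord:           # ../data/scores/ -- one metric result
    metric: str
    value: float
```

The pipeline steps that move data between these stages are summarized in the table below.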
| File operator | From directory | To directory | Command |
|:---:|:---:|:---:|:--|
| create_dataset | list | dataset | `python create_dataset.py ../data/lists/list_catA_catB.json ../data/dataset/catA_catB.json` |
| run_questions | list | responses | `python run_questions.py ../data/list/catA_question_list.json ../data/responses/resp_catA_catB.json` |
| parse_r2r_result | responses | dataset | `python parse_r2r_result.py ../data/responses/resp_catA_catB.json ../data/dataset/intermediate_files/catA_catB_.json` |
| ragas_eval | dataset | scores | `python3 ragas_eval.py ../data/datasets/catA/catB_1.json ../data/scores/catA/catB_1.json 3` (the trailing `3` runs the evaluation 3 times) |
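As a sketch of what the ragas_eval step might look like internally, assuming the RAGAS 0.1-style API (the metric names and `ground_truth` column follow that release) and an `OPENAI_API_KEY` in the environment for the default judge LLM:

```python
# Minimal sketch of the ragas_eval step, assuming the RAGAS 0.1-style API;
# the column names ("question", "answer", "contexts", "ground_truth") and the
# dict-like Result object follow that release and may differ in newer RAGAS.
import json
import sys

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

def score_file(dataset_path: str, scores_path: str, n_runs: int = 1) -> None:
    with open(dataset_path) as fh:
        records = json.load(fh)  # list of dicts, one per question
    ds = Dataset.from_dict({
        key: [r[key] for r in records]
        for key in ("question", "answer", "contexts", "ground_truth")
    })
    # Re-run the (non-deterministic) LLM-judged evaluation n_runs times.
    runs = [evaluate(ds, metrics=[faithfulness, answer_relevancy,
                                  context_precision, context_recall])
            for _ in range(n_runs)]
    with open(scores_path, "w") as fh:
        json.dump([dict(run) for run in runs], fh, indent=2)

if __name__ == "__main__":
    score_file(sys.argv[1], sys.argv[2],
               int(sys.argv[3]) if len(sys.argv) > 3 else 1)
```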