|
1 year ago | |
---|---|---|
summary-stats | 1 year ago | |
test | 1 year ago | |
.gitignore | 1 year ago | |
LICENSE | 1 year ago | |
README.org | 1 year ago |
EBI is providing an update of their GWAS resource in the form of an API that can get summary statistics of SNP candidates (JSON output only). They also provide the classic downloadable tabular data resources which we do not want to use for two reasons:
We want to use the latest information automatically
For EBI it is easier to track use and fund future initiatives
The latter reason is more important than meets the eye. Despite appearances these resources are not free.
This code is written in Racket. Why Racket? To me Racket is everything Python and Ruby have to offer and more.
Enjoy,
Pjotr Prins (2019)
Let's start with mouse gene Shh (human alias SHH) and rat Brca2 (human alias BRCA2). For the human variant the wikidata page for BRCA2 points to the NCBI gene description (which is usefully elaborate). That page in turn points to the Uniprot resource.
Wikidata also has the start and en position of the gene for two reference genomes. hg38 is known as GRCh38.p2 (2014) which is now at p13 (2019). So, which one does EBI use? GRCh38.p12 according to the FAQ:
curl 'https://www.ebi.ac.uk/gwas/rest/api/metadata' -i -H 'Accept: application/json'
(Note that the API returns JSON even if you ask for XML. For this query requesting XML gives a correct notification: the resource identified by this request is only capable of generating responses with characteristics not acceptable according to the request "accept" headers).
{ "_embedded" : { "mappingMetadatas" : [ { "ensemblReleaseNumber" : 97, "genomeBuildVersion" : "GRCh38.p12", "dbsnpVersion" : 151, "usageStartDate" : "2019-07-03T13:00:03.625+0000" } ] } }
Note: this metadata tells us the GWAS REST API uses GRCh38.p12. Summary statistics API, however, may be different (but probably not far off).
If we look at BRCA2 on version p13 the location is
chr13:32315480..32399672
while wikidata has it at p2
chr13:32315086..32973805
. In other words we'll need to fetch the
latest somehow (though the difference may not be too critical for our
purposes).
The FAQ suggests the API supports searching by gene which should make a lookup easier. In the full API docs this query is shown
curl 'https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene?geneName=BRCA2' -i -H 'Accept: application/json' > BRCA2.json
results in a list of SNPs with top/curated associations. For summary statistics there is no lookup by gene yet - I have put in a request. So, for now we need to figure out the gene position first using the ENSEMBL API.
Wikidata includes an ENSEMBL identifier which makes this relatively straightforward. In fact, that can even be skipped because ENSEMBL takes gene names. E.g.
curl 'https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2?expand=0&format=full' -H 'Content-type:application/json'
{"db_type":"core", "end":32400266, "biotype":"protein_coding", "assembly_name":"GRCh38", "start":32315086,"object_type": "Gene","strand":1, "description":"BRCA2 DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1101]", "seq_region_name":"13", "source":"ensembl_havana", "version":15, "display_name":"BRCA2", "logic_name":"ensembl_havana_gene_homo_sapiens", "id":"ENSG00000139618", "species":"homo_sapiens"}
Note that this uses another reference genome (again!) so positions will be off.
Coming to think of it, for summary statistics it should not matter too much because most associated SNPs will be outside the coding region anyway. The simple solution is to take anything that is (say) 100Kbp around the gene under study. In fact that is what the old findByGene variant API does above. In gene dense regions this may lead to problems, but hey, this is a discovery tool. We will use above coordinates to signify whether a SNP is (likely) outside the gene region. So we can start with these positions and query the recently added summary statistics API with this for your query:
curl "https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/13/associations?bp_lower=32315086&bp_upper=32400266"
If you want to look at JSON in a nice way pipe the output into the excellent jq tool.
This will return all associations within that region, so variants will be duplicated. If you want to filter by p-value that is also possible (see below).
In the result list there is a URL for example for SNP
curl https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/13/associations/rs9534262
To view/filter JSON the command line tool jq comes in handy
jq < summary.json
"associations": { "0": { "base_pair_location": 32315226, "chromosome": 13, "beta": null, "effect_allele_frequency": null, "ci_lower": null, "ci_upper": null, "other_allele": null, "odds_ratio": null, "p_value": 0.826716297590478, "variant_id": "rs3092989", "code": 14, "effect_allele": null, "study_accession": "GCST000392", "trait": [ "EFO_0001359" ], "_links": { "variant": { "href": "https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/13/associations/rs3092989" }, "trait": [ { "href": "https://www.ebi.ac.uk/gwas/summary-statistics/api/traits/EFO_0001359" } ], "self": { "href": "https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/13/associations/rs3092989?study_accession=GCST000392 " }, "study": { "href": "https://www.ebi.ac.uk/gwas/summary-statistics/api/studies/GCST000392" } } }
Which lists the European ancestry study and a trait EFO_0001359 which is type I diabetes mellitus: chronic condition characterized by minimal or absent production of insulin by the pancreas and is part of BRCA2.
The population size under study here is about 16,500, only a few people will have had type I diabetes, the p-value is a high 0.82, but for some reason this candidate was included in these summary statistics.
The API allows setting the p-value which for human GWAS should be in the order of 10e-8. Let's try that with
curl "https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/13/associations?start=20&bp_lower=32315086&size=100&bp_upper=32400266&p_upper=0.0000001&p_lower=-0.0"
(which can also be written as p_upper=10E-8) and lists 100 candidates starting from 20:
"associations": { "0": { "effect_allele_frequency": null, "variant_id": "rs4942505", "beta": -0.0288, "base_pair_location": 32389570, "ci_lower": null, "other_allele": "C", "ci_upper": null, "p_value": 2.204e-08, "chromosome": 13, "odds_ratio": null, "code": 11, "effect_allele": "T", "study_accession": "GCST002222", "trait": [ "EFO_0004611" ], "_links": { "variant": { "href": "https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/13/associations/rs4942505" }, "trait": [ { "href": "https://www.ebi.ac.uk/gwas/summary-statistics/api/traits/EFO_0004611" } ], "self": { "href": "https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/13/associations/rs4942505?study_accession=GCST002222 " }, "study": { "href": "https://www.ebi.ac.uk/gwas/summary-statistics/api/studies/GCST002222" } } },
The trait EFO_0004611 is low density lipoprotein cholesterol
measurement with a p-value of 2.204e-08
and the European ancestry
study contains about 100K individuals.
To get at the description of the study we can use the included API call. For the trait description, however, we can use the somewhat oddly formed
curl "https://www.ebi.ac.uk/ols/api/ontologies/efo/terms?iri=http://www.ebi.ac.uk/efo/EFO_0004611"
"terms": [ { "iri": "http://www.ebi.ac.uk/efo/EFO_0004611", "label": "low density lipoprotein cholesterol measurement", "description": [ "The measurement of LDL cholesterol in blood used as a risk indicator for heart disease." ], "annotation": { "database_cross_reference": [ "SNOMEDCT:113079009", "NCIt:C105588" ], "gwas_trait": [ "true" ], "term editor": [ "Helen Parkinson" ] }, "synonyms": [ "LDL measurement" ], "ontology_name": "efo", "ontology_prefix": "EFO", "ontology_iri": "http://www.ebi.ac.uk/efo/efo.owl", "is_obsolete": false, "term_replaced_by": null, "is_defining_ontology": true, "has_children": true, "is_root": false, "short_form": "EFO_0004611", "obo_id": "EFO:0004611", (...)
At this point all the pieces are together here. We can get a gene region. We can find the SNPs associated with a gene region. We can get info on the SNP and traits. The only thing to note is the SNP information is paged - so to get all SNPs we have to query page by page. This can take a while so if you have a (web) UI you may want to fetch an process the results as they come in.
To do a full search on significant SNPs takes about 30 seconds
and renders 30 SNP candidates with a reasonable p-value. Setting
the p-value to 10e-3
renders 1,325 candidates in 60
seconds. That is after removing p-values set to -99. When you
follow the link for a SNP at p-value -99 the EBI server gives an
error.
To zoom in on results with jq use
cat test.json |jq '._embedded.associations."90"'
to get the 90th record and to show all p-values
cat test.json |jq '._embedded.associations | map(.p_value)'
and to filter on p_values not -99
cat test.json |jq '._embedded.associations | map(select(.p_value != -99))'
To do a query in Racket Scheme you can paste something like this in the REPL (DrRacket or Emacs)
(require json) (require net/url) (require net/uri-codec) (define header '("Accept: application/json")) (define (ebi-gwas-json query) (call/input-url (string->url (string-append "https://www.ebi.ac.uk/gwas/rest/api/" (uri-encode query))) get-pure-port (lambda (port) (string->jsexpr (port->string port)) ) header ))
The traditional way of unpacking
(define ht (hash "apple" (hash "berry" 'red) "banana" 'yellow)) (hash-ref (hash-ref ht "apple") "berry") 'red
can be avoided with
(require nested-hash) (nested-hash-ref ht "apple" "berry") 'red
which is good enough for unpacking most JSON results
The match operator may also be used. This returns
(match ht [(hash-table ("apple" b)) (match b [(hash-table (_ c)) c])]) 'red
I have not found how to nest hash-table in a match pattern.
Anyway, to get at the versions because they are in a list
(define vers (nested-hash-ref meta '_embedded 'mappingMetadatas)) (hash-ref (first vers) 'genomeBuildVersion) "GRCh38.p12"
(ebi-sumstat-genome-build)
does exactly that. See /pjotrp/racket-summary-stats/src/branch/master/test/ebi.rkt.
In the next step we fetch SNPs for BRCA2. The summary statistics web interface BRCA2 search returns 23 candidates and 20 studies. In fact you can download the file and there are only 4 SNPs of interest
cat ~/Downloads/gwas-association-downloaded_2019-11-01-ensemblMappedGenes_BRCA2.tsv |cut -f22|sort|uniq rs11571818 rs11571833 rs1799955 rs4942486
The phenotypes show
cat ~/Downloads/gwas-association-downloaded_2019-11-01-ensemblMappedGenes_BRCA2.tsv |cut -f8|sort|uniq Breast cancer Breast cancer (estrogen-receptor negative) Cancer Cancer (pleiotropy) LDL cholesterol LDL cholesterol levels Low density lipoprotein cholesterol levels Lung cancer Lung cancer in ever smokers Small cell lung carcinoma Squamous cell lung carcinoma
This is useful information for testing our API. The first version misses one and adds a few. There are 567 results with 7 that are not -99.
"variant_id": "rs9534262", "variant_id": "rs4942486", "variant_id": "rs7330025", "variant_id": "rs190434310", "variant_id": "rs11571815", "variant_id": "rs11571818", "variant_id": "rs11571833",
Remember the API URL was
curl "https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/13/associations?start=20&bp_lower=32315086&size=100&bp_upper=32400266&p_upper=1e-8&p_lower=-0.0"
so we create (ebi-sumstat-chr-pos-json chr start stop)
that returns the full result and handles the paging. So we can do
(ebi-sumstat-chr-pos-json "13" 32315086 32400266)
The result is the full set. This looks good, but why return a
JSON structure (even if it is an jsexp
)? Also we probably don't
need all data, so a transform and filter makes sense that get
executed when data flows in.
In the first step we create a generic struct related to a SNP record.
(struct snp-record (snp-name snp-variant chr pos p-value gene snp-uri trait trait-uri))
This is the information that gets passed back to GeneNetwork.
Next steps:
x Add unit test infra that can read data from file
Use jsexp to filter out p-value is -99
return snp-record: snp, variant, chr, pos, p-value, trait
add description for trait
we can use a link like https://www.ebi.ac.uk/ols/api/ontologies/efo/terms?iri=http://www.ebi.ac.uk/efo/EFO_0000305
cache result
add to GN3 and provide to GN2 web interface
API https://www.ebi.ac.uk/gwas/summary-statistics/docs/
To run the tests
racket test/ebi-test.rkt
and with internet access
racket test/ebi-web-test.rkt
Copyright (c) 2019 Pjotr Prins. This code is published under the GPL3, see /pjotrp/racket-summary-stats/src/branch/master/LICENSE for details.