You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Pjotr Prins 2e1004f964 Prepare transforming to snp-record 2 years ago
summary-stats Prepare transforming to snp-record 2 years ago
test Prepare transforming to snp-record 2 years ago
.gitignore Initial commit 2 years ago
LICENSE README 2 years ago Prepare transforming to snp-record 2 years ago

Mining summary statistics with Racket

EBI Summary Statistics

EBI is providing an update of their GWAS resource in the form of an API that can get summary statistics of SNP candidates (JSON output only). They also provide the classic downloadable tabular data resources which we do not want to use for two reasons:

  1. We want to use the latest information automatically

  2. For EBI it is easier to track use and fund future initiatives

The latter reason is more important than meets the eye. Despite appearances these resources are not free.

This code is written in Racket. Why Racket? To me Racket is everything Python and Ruby have to offer and more.


Pjotr Prins (2019)

Fetching summary stats (outline)

Start with a gene

Let's start with mouse gene Shh (human alias SHH) and rat Brca2 (human alias BRCA2). For the human variant the wikidata page for BRCA2 points to the NCBI gene description (which is usefully elaborate). That page in turn points to the Uniprot resource.

Wikidata also has the start and en position of the gene for two reference genomes. hg38 is known as GRCh38.p2 (2014) which is now at p13 (2019). So, which one does EBI use? GRCh38.p12 according to the FAQ:

 curl '' -i -H 'Accept: application/json'

(Note that the API returns JSON even if you ask for XML. For this query requesting XML gives a correct notification: the resource identified by this request is only capable of generating responses with characteristics not acceptable according to the request "accept" headers).

  "_embedded" : {
    "mappingMetadatas" : [ {
      "ensemblReleaseNumber" : 97,
      "genomeBuildVersion" : "GRCh38.p12",
      "dbsnpVersion" : 151,
      "usageStartDate" : "2019-07-03T13:00:03.625+0000"
    } ]

Note: this metadata tells us the GWAS REST API uses GRCh38.p12. Summary statistics API, however, may be different (but probably not far off).

If we look at BRCA2 on version p13 the location is chr13:32315480..32399672 while wikidata has it at p2 chr13:32315086..32973805. In other words we'll need to fetch the latest somehow (though the difference may not be too critical for our purposes).

The FAQ suggests the API supports searching by gene which should make a lookup easier. In the full API docs this query is shown

curl '' -i -H 'Accept: application/json' > BRCA2.json

results in a list of SNPs with top/curated associations. For summary statistics there is no lookup by gene yet - I have put in a request. So, for now we need to figure out the gene position first using the ENSEMBL API.

Wikidata includes an ENSEMBL identifier which makes this relatively straightforward. In fact, that can even be skipped because ENSEMBL takes gene names. E.g.

curl '' -H 'Content-type:application/json'
   "description":"BRCA2 DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1101]",

Note that this uses another reference genome (again!) so positions will be off.

Coming to think of it, for summary statistics it should not matter too much because most associated SNPs will be outside the coding region anyway. The simple solution is to take anything that is (say) 100Kbp around the gene under study. In fact that is what the old findByGene variant API does above. In gene dense regions this may lead to problems, but hey, this is a discovery tool. We will use above coordinates to signify whether a SNP is (likely) outside the gene region. So we can start with these positions and query the recently added summary statistics API with this for your query:

curl ""

If you want to look at JSON in a nice way pipe the output into the excellent jq tool.

This will return all associations within that region, so variants will be duplicated. If you want to filter by p-value that is also possible (see below).

Get associated SNP

In the result list there is a URL for example for SNP


To view/filter JSON the command line tool jq comes in handy

jq < summary.json
    "associations": {
      "0": {
        "base_pair_location": 32315226,
        "chromosome": 13,
        "beta": null,
        "effect_allele_frequency": null,
        "ci_lower": null,
        "ci_upper": null,
        "other_allele": null,
        "odds_ratio": null,
        "p_value": 0.826716297590478,
        "variant_id": "rs3092989",
        "code": 14,
        "effect_allele": null,
        "study_accession": "GCST000392",
        "trait": [
        "_links": {
          "variant": {
            "href": ""
          "trait": [
              "href": ""
          "self": {
            "href": "
          "study": {
            "href": ""

Which lists the European ancestry study and a trait EFO_0001359 which is type I diabetes mellitus: chronic condition characterized by minimal or absent production of insulin by the pancreas and is part of BRCA2.

The population size under study here is about 16,500, only a few people will have had type I diabetes, the p-value is a high 0.82, but for some reason this candidate was included in these summary statistics.

Search by p-value

The API allows setting the p-value which for human GWAS should be in the order of 10e-8. Let's try that with

curl ""

(which can also be written as p_upper=10E-8) and lists 100 candidates starting from 20:

    "associations": {
      "0": {
        "effect_allele_frequency": null,
        "variant_id": "rs4942505",
        "beta": -0.0288,
        "base_pair_location": 32389570,
        "ci_lower": null,
        "other_allele": "C",
        "ci_upper": null,
        "p_value": 2.204e-08,
        "chromosome": 13,
        "odds_ratio": null,
        "code": 11,
        "effect_allele": "T",
        "study_accession": "GCST002222",
        "trait": [
        "_links": {
          "variant": {
            "href": ""
          "trait": [
              "href": ""
          "self": {
            "href": "
          "study": {
            "href": ""

Identify trait & study

The trait EFO_0004611 is low density lipoprotein cholesterol measurement with a p-value of 2.204e-08 and the European ancestry study contains about 100K individuals.

To get at the description of the study we can use the included API call. For the trait description, however, we can use the somewhat oddly formed

curl ""
    "terms": [
        "iri": "",
        "label": "low density lipoprotein cholesterol measurement",
        "description": [
          "The measurement of LDL cholesterol in blood used as a risk indicator for heart disease."
        "annotation": {
          "database_cross_reference": [
          "gwas_trait": [
          "term editor": [
            "Helen Parkinson"
        "synonyms": [
          "LDL measurement"
        "ontology_name": "efo",
        "ontology_prefix": "EFO",
        "ontology_iri": "",
        "is_obsolete": false,
        "term_replaced_by": null,
        "is_defining_ontology": true,
        "has_children": true,
        "is_root": false,
        "short_form": "EFO_0004611",
        "obo_id": "EFO:0004611",

At this point all the pieces are together here. We can get a gene region. We can find the SNPs associated with a gene region. We can get info on the SNP and traits. The only thing to note is the SNP information is paged - so to get all SNPs we have to query page by page. This can take a while so if you have a (web) UI you may want to fetch an process the results as they come in.

To do a full search on significant SNPs takes about 30 seconds and renders 30 SNP candidates with a reasonable p-value. Setting the p-value to 10e-3 renders 1,325 candidates in 60 seconds. That is after removing p-values set to -99. When you follow the link for a SNP at p-value -99 the EBI server gives an error.

To zoom in on results with jq use

cat test.json |jq '._embedded.associations."90"'

to get the 90th record and to show all p-values

cat test.json |jq '._embedded.associations | map(.p_value)'

and to filter on p_values not -99

cat test.json |jq '._embedded.associations | map(select(.p_value != -99))'

Racket API

Fetching JSON

To do a query in Racket Scheme you can paste something like this in the REPL (DrRacket or Emacs)

(require json)
(require net/url)
(require net/uri-codec)

(define header
  '("Accept: application/json"))

(define (ebi-gwas-json query)
  (call/input-url (string->url (string-append "" (uri-encode query)))
                  (lambda (port)
                    (string->jsexpr (port->string port))

The traditional way of unpacking

(define ht (hash "apple" (hash "berry" 'red) "banana" 'yellow))
(hash-ref (hash-ref ht "apple") "berry")

can be avoided with

(require nested-hash)
(nested-hash-ref ht "apple" "berry")

which is good enough for unpacking most JSON results

The match operator may also be used. This returns

(match ht [(hash-table ("apple" b)) (match b [(hash-table (_ c)) c])])

I have not found how to nest hash-table in a match pattern.

Anyway, to get at the versions because they are in a list

(define vers (nested-hash-ref meta '_embedded 'mappingMetadatas))
(hash-ref (first vers) 'genomeBuildVersion)

(ebi-sumstat-genome-build) does exactly that. See /pjotrp/racket-summary-stats/src/branch/master/test/ebi.rkt.

Fetch SNPs

In the next step we fetch SNPs for BRCA2. The summary statistics web interface BRCA2 search returns 23 candidates and 20 studies. In fact you can download the file and there are only 4 SNPs of interest

cat ~/Downloads/gwas-association-downloaded_2019-11-01-ensemblMappedGenes_BRCA2.tsv |cut -f22|sort|uniq

The phenotypes show

cat ~/Downloads/gwas-association-downloaded_2019-11-01-ensemblMappedGenes_BRCA2.tsv |cut -f8|sort|uniq
Breast cancer
Breast cancer (estrogen-receptor negative)
Cancer (pleiotropy)
LDL cholesterol
LDL cholesterol levels
Low density lipoprotein cholesterol levels
Lung cancer
Lung cancer in ever smokers
Small cell lung carcinoma
Squamous cell lung carcinoma

This is useful information for testing our API. The first version misses one and adds a few. There are 567 results with 7 that are not -99.

    "variant_id": "rs9534262",
    "variant_id": "rs4942486",
    "variant_id": "rs7330025",
    "variant_id": "rs190434310",
    "variant_id": "rs11571815",
    "variant_id": "rs11571818",
    "variant_id": "rs11571833",

Remember the API URL was

curl ""

so we create (ebi-sumstat-chr-pos-json chr start stop) that returns the full result and handles the paging. So we can do

(ebi-sumstat-chr-pos-json "13" 32315086 32400266)

The result is the full set. This looks good, but why return a JSON structure (even if it is an jsexp)? Also we probably don't need all data, so a transform and filter makes sense that get executed when data flows in.

In the first step we create a generic struct related to a SNP record.

(struct snp-record
  (snp-name snp-variant chr pos p-value gene snp-uri trait trait-uri))

This is the information that gets passed back to GeneNetwork.

Next steps:

x Add unit test infra that can read data from file



Run tests

To run the tests

racket test/ebi-test.rkt

and with internet access

racket test/ebi-web-test.rkt


Copyright (c) 2019 Pjotr Prins. This code is published under the GPL3, see /pjotrp/racket-summary-stats/src/branch/master/LICENSE for details.