Integration of bioinformatic and epidemiological data

To facilitate a flexible and dynamic development environment, we developed a data integration and graph construction method composed of multiple integration methods wrapped in a Snakemake pipeline (Köster and Rahmann, 2018). For large data sets, e.g. millions of rows, we used the Neo4j import tool, which can ingest hundreds of millions of rows in minutes. Data ingested in this way need to be in simple pre-processed text files, free of duplicates and conflicting entries. After this bulk ingest, the remaining data are added using the LOAD CSV method. This is slower than the import tool, but it provides flexibility during the import, e.g. checking for existing entries or modifying entries on insert. Whilst collating and preparing the data for insertion takes many hours, building the graph takes ~15 minutes, which makes it quick to modify and rebuild. The following gives a brief description of each data set, the source of the raw data and the order in which they were inserted into the graph.
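As an illustration of the pre-processing requirement described above (no duplicates, no conflicting entries), the following is a minimal sketch and not the pipeline's actual code. The function name, the column names and the identifiers are hypothetical, and setting conflicting keys aside for review is an assumed policy.

```python
import csv
import io


def clean_rows(rows, key_field):
    """Drop exact duplicates and set aside keys that map to conflicting rows."""
    seen = {}           # key -> first row seen for that key
    conflicted = set()  # keys seen more than once with differing content
    for row in rows:
        key = row[key_field]
        if key not in seen:
            seen[key] = row
        elif seen[key] != row:
            conflicted.add(key)
    kept = [row for key, row in seen.items() if key not in conflicted]
    return kept, sorted(conflicted)


# Toy input with hypothetical identifiers.
raw = io.StringIO(
    "id,name\n"
    "ENSG1,GENE_A\n"
    "ENSG1,GENE_A\n"  # exact duplicate: one copy is kept
    "ENSG2,GENE_B\n"
    "ENSG2,GENE_C\n"  # conflicting entry: the key is set aside
    "ENSG3,GENE_D\n"
)
rows = list(csv.DictReader(raw))
kept, conflicted = clean_rows(rows, "id")
```

Rows that pass this step would be written to the text files given to the bulk import tool, while anything needing a lookup or modification at insert time is better suited to the slower LOAD CSV stage.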

Genetic variants

Source: OpenGWAS (Ben Elsworth et al., 2020), MR-EvE (Hemani et al., 2017) and xQTL (Zheng et al., 2019)

Multiple sources of genetic variant data were combined into a single set of unique variants. We chose not to add a complete variant set, e.g. dbSNP; instead, to limit the size of the graph, we included only variants with links to other data. This included top hits for each GWAS (variants with LD R² < 0.001 within a 10 Mb window and with P-value < 5 × 10⁻⁸), variants derived from MR-EvE and the xQTL data. This unique list of variants was annotated with potential gene effects using a local instance of Ensembl Variant Effect Predictor (VEP) v37 (McLaren et al., 2016).
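The top-hit criteria above can be read as a standard greedy clumping procedure. The sketch below assumes that reading; the source does not describe the algorithm used. The variant records, the `r2` lookup and the interpretation of the window as ±10 Mb are all hypothetical, and in practice LD would come from a reference panel.

```python
P_THRESHOLD = 5e-8       # genome-wide significance threshold from the text
WINDOW_BP = 10_000_000   # 10 Mb, assumed here to mean +/- 10 Mb around a kept variant
R2_THRESHOLD = 0.001     # LD threshold from the text


def top_hits(variants, r2):
    """Greedy clumping: keep the most significant variant, drop correlated neighbours."""
    significant = sorted(
        (v for v in variants if v["p"] < P_THRESHOLD), key=lambda v: v["p"]
    )
    kept = []
    for v in significant:
        in_ld = any(
            k["chrom"] == v["chrom"]
            and abs(k["pos"] - v["pos"]) <= WINDOW_BP
            and r2(k["id"], v["id"]) >= R2_THRESHOLD
            for k in kept
        )
        if not in_ld:
            kept.append(v)
    return kept


# Toy data with made-up variants and LD values.
variants = [
    {"id": "rs1", "chrom": "1", "pos": 1_000_000, "p": 1e-20},
    {"id": "rs2", "chrom": "1", "pos": 1_500_000, "p": 1e-10},   # in LD with rs1
    {"id": "rs3", "chrom": "1", "pos": 50_000_000, "p": 1e-9},   # outside the window
    {"id": "rs4", "chrom": "2", "pos": 1_000_000, "p": 1e-3},    # not significant
]
toy_ld = {frozenset(("rs1", "rs2")): 0.8}


def r2(a, b):
    """Hypothetical LD lookup; unknown pairs are treated as uncorrelated."""
    return toy_ld.get(frozenset((a, b)), 0.0)


hits = top_hits(variants, r2)
```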

Genome wide association studies

Source: OpenGWAS (Ben Elsworth et al., 2020)

The IEU OpenGWAS platform contains GWAS summary data for over 11,000 full and 20,000 partial data sets. Each full data set represents a measured variable (phenotype) relating to a specific area of human health and each partial data set represents the set of quantitative trait loci (QTL) for a particular molecular trait (e.g. gene expression, methylation or metabolite levels). Molecular QTL data are typically not genome-wide, focusing instead on cis genomic regions and strongly associated trans-QTL. For each GWAS we incorporated into EpiGraphDB the metadata and the subset of genetic variants most strongly associated with that trait. These data were extracted via the OpenGWAS API.

Genes and Proteins

Source: BioMart (Smedley et al., 2009)

We used Biomart build 37 via the biomart python client to create our set of genes and proteins. For simplicity and to avoid issues with conflicting IDs, we restricted our gene search to gene names and Ensembl IDs, and for proteins we used UniProt IDs only.

Disease

Source: Mondo Disease Ontology (Mondo) (Mungall et al., 2017)

Representing disease knowledge is problematic as there are numerous ontologies and sources of information. We chose the Mondo Disease Ontology (Mondo) as it has the explicit aim to harmonize disease definitions across different ontologies, and appears to be the most complete such resource, providing links to many other ontologies.

Observational Correlation

Source: UK Biobank (UK Biobank, 2014)

We used PHESANT (Millard et al., 2018) to create a modified version of variables from the UK Biobank. Using a basic Spearman rank correlation, we produced pairwise correlation coefficients for all UK Biobank phenotypes for which we also have GWAS results from OpenGWAS. These pairwise correlation coefficients create observational correlation relationships between the Gwas nodes.
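For readers unfamiliar with the measure, the following is a minimal standard-library sketch of pairwise Spearman rank correlation (Pearson correlation computed on average ranks). The phenotype names and values are made up, and handling of missing data and constant variables, which a real UK Biobank analysis would need, is omitted.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Toy phenotypes measured on the same five individuals.
phenotypes = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 65, 60, 80, 90],
    "score": [5, 4, 3, 2, 1],
}
correlations = {
    (a, b): spearman(phenotypes[a], phenotypes[b])
    for a, b in combinations(phenotypes, 2)
}
```

Each entry of `correlations` corresponds to one candidate observational correlation relationship between a pair of Gwas nodes.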

Mendelian Randomization

Source: MR-EvE (Hemani et al., 2017)

Using a mixture of experts model, MR-EvE estimated pairwise causal relationships using Mendelian randomization (MR) between many of the traits for which we have GWAS results integrated from OpenGWAS. These data are used to create MR relationships between the Gwas nodes.

Literature

Source: SemMedDB (Kilicoglu et al., 2012)

SemMedDB is a collection of semantic relations, created from the titles and abstracts of PubMed. Each record is a triple (subject-predicate-object) with the subject and object mapping to either the UMLS Metathesaurus (National Library of Medicine, 2009) or Entrez gene ID (National Center for Biotechnology Information, 2020). We included unique semantic triple data from the PREDICATION table of SemMedDB (version semmedVER40_R) with terms and predicates matching specific criteria (Elsworth, 2020).

For more information on the construction of SemMedDB and descriptions of the predication data used here please refer to these publications (Kilicoglu et al., 2011; Kilicoglu et al., 2020).

GWAS Literature

Source: OpenGWAS (Ben Elsworth et al., 2020), SemMedDB (Kilicoglu et al., 2012)

MELODI (Elsworth et al., 2018) uses literature annotations, e.g. SemMedDB, to derive connections between two search terms, e.g. an exposure and an outcome. Using a modified version of this approach (Elsworth, 2020), which restricts the search space to particular subjects, objects and predicates, we created enriched literature objects for each GWAS trait, using the GWAS trait text. These were then used to identify overlapping terms between two traits. The derivation of these literature connections for a pair of traits is as follows:

  • For each trait a PubMed search was performed (using the trait name).
  • The PubMed IDs returned by this search were used to retrieve matching triples of data (subject-predicate-object) from a local instance of SemMedDB (v40).
  • Each triple was counted and compared to the background count to produce an enrichment P-value.
  • Enriched, overlapping triples (object from exposure, subject from outcome) were returned and form the literature relationship between pairs of traits.
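The last two steps can be sketched as follows. The source does not name the statistical test, so the one-sided hypergeometric (Fisher-style) tail probability used here is an assumption, as are the function names and the toy triples.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    k: times the triple occurs in the trait's set, n: size of the trait's set,
    K: times the triple occurs in the background, N: size of the background.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def overlapping(exposure_triples, outcome_triples):
    """Pair triples where the exposure triple's object is the outcome triple's subject."""
    by_subject = {}
    for triple in outcome_triples:
        by_subject.setdefault(triple[0], []).append(triple)
    return [
        (e, o) for e in exposure_triples for o in by_subject.get(e[2], [])
    ]


# Toy enriched triples (subject, predicate, object); contents are made up.
exposure = [("Obesity", "CAUSES", "Inflammation")]
outcome = [
    ("Inflammation", "AFFECTS", "Heart disease"),
    ("Aspirin", "TREATS", "Heart disease"),
]
links = overlapping(exposure, outcome)
p_value = enrichment_p(3, 4, 3, 10)
```

In this toy case the shared term "Inflammation" links the exposure triple to the first outcome triple, which is the form of intermediate that a literature relationship between two traits records.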

GWAS semantic similarity

Source: OpenGWAS (Ben Elsworth et al., 2020)

Many of the GWAS in OpenGWAS are of a similar nature. For example, a search for “weight” returns 24 records, and there are many similar trait names that don’t contain the word “weight”. We developed a platform that uses sentence embedding to help address the problem of measuring the semantic similarity between biomedical variables (Benjamin Elsworth et al., 2020). This can be used to search a set of variables using a model of knowledge extracted from the biomedical literature in place of simple text matching, and can also be used to create distance/similarity scores between two pieces of text. In this case, we compared the sentence embedding vectors derived from the GWAS traits to create similarity scores.
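As an illustration of how two embedding vectors yield a similarity score, the sketch below uses cosine similarity. The source does not state which measure was used, so cosine is an assumption, and the three-dimensional vectors are made up (real sentence embeddings have hundreds of dimensions).

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for three GWAS trait names.
trait_vectors = {
    "Body weight": [0.9, 0.1, 0.2],
    "Body mass index": [0.8, 0.2, 0.3],
    "Asthma": [0.1, 0.9, 0.1],
}
weight_vs_bmi = cosine_similarity(
    trait_vectors["Body weight"], trait_vectors["Body mass index"]
)
weight_vs_asthma = cosine_similarity(
    trait_vectors["Body weight"], trait_vectors["Asthma"]
)
```

With these toy vectors the two anthropometric traits score higher than the unrelated pair, even though their names share only the word "Body".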

Pathways

Source: Reactome (Jassal et al., 2019)

For biological pathway data we utilised a subset of data available from Reactome. To avoid overloading our graph we extracted a reduced set of information from the Reactome graph, focusing on the connections between pathways and events, literature, disease and proteins.

Drugs

Source: Open Targets (Carvalho-Silva et al., 2019)

Open Targets is a platform focused on drug targets and discovery. It contains data from many sources and provides straightforward programmatic access to specific components. Using their API, we extracted data related to drug, trial phase and target gene.

Drug efficacy

Source: Clinical Pharmacogenetics Implementation Consortium (CPIC) (Relling and Klein, 2011)

CPIC is an international consortium interested in facilitating the clinical implementation of pharmacogenetic tests. They provide a freely available database of peer-reviewed, evidence-based gene/drug clinical practice guidelines, including systematic grading of evidence. We retrieved drug efficacy data from this resource.

Tissue specific gene expression

Source: Genotype Tissue Expression (GTEx) project (The GTEx Consortium et al., 2015)

Tissue-specific gene expression levels were obtained from GTEx. This public resource contains samples from 54 non-diseased tissue sites across nearly 1,000 individuals.

Protein-protein interactions (1)

Source: IntAct (Orchard et al., 2014)

IntAct is a freely available, open source database of molecular interaction data derived from literature curation and direct user submissions. We extracted the subset of protein-protein interactions (PPI) for incorporation into our database.

Protein-protein interactions (2)

Source: StringDB (Szklarczyk et al., 2019)

StringDB is another database that contains known and predicted protein-protein interactions, either direct or functional (indirect). Data come from computational predictions of PPI in different organisms. We selected the direct PPIs for Homo sapiens with a confidence probability greater than 0.7.
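The confidence filter can be sketched as below. The input layout is an assumption: a whitespace-separated file with `protein1`, `protein2` and `combined_score` columns, scores stored as integers on a 0-1000 scale, and human protein identifiers prefixed with the NCBI taxon `9606.`. The identifiers are made up, and the selection of direct rather than functional interactions is not shown.

```python
import io

MIN_SCORE = 700         # 0.7 on the assumed 0-1000 integer scale
HUMAN_PREFIX = "9606."  # NCBI taxonomy ID for Homo sapiens


def high_confidence_links(handle):
    """Yield (protein1, protein2, probability) for human pairs scoring above 0.7."""
    header = handle.readline().split()
    idx = {name: i for i, name in enumerate(header)}
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        a = fields[idx["protein1"]]
        b = fields[idx["protein2"]]
        score = int(fields[idx["combined_score"]])
        if (
            a.startswith(HUMAN_PREFIX)
            and b.startswith(HUMAN_PREFIX)
            and score > MIN_SCORE
        ):
            yield a, b, score / 1000


# Toy input with hypothetical protein identifiers.
toy_file = io.StringIO(
    "protein1 protein2 combined_score\n"
    "9606.ENSP0001 9606.ENSP0002 900\n"
    "9606.ENSP0001 9606.ENSP0003 400\n"
    "10090.ENSMUSP0001 10090.ENSMUSP0002 950\n"
)
links = list(high_confidence_links(toy_file))
```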

Druggable genes

Source: The druggable genome study (Finan et al., 2017)

Finan et al. (2017) presented an approach to validate drug targets using genomic information. Data from genome-wide association studies were used to connect complex disease- and biomarker-associated loci, on the premise that associations of variants in genes that encode a target mimic the effect of pharmacologically modifying that target. From this study we extracted a set of genes encoding druggable human proteins.

Genetic correlations

Source: Neale Lab UK Biobank genetic correlation study (Abbot et al., 2020)

In October 2019 the Ben Neale Lab released genetic correlation data for over 4,000 GWAS. We incorporated the correlation data that mapped to the GWAS trait names in EpiGraphDB.

GWAS trait to UMLS

Source: OpenGWAS (Ben Elsworth et al., 2020)

MetaMap Lite (Demner-Fushman et al., 2017) was used to create a relationship between the GWAS trait names and SemMedDB terms (based on the UMLS Metathesaurus).

Polygenic Risk Scores

Source: PRS Atlas (Richardson et al., 2019)

Richardson et al. (2019) conducted a systematic analysis of the association between 162 polygenic risk scores for different phenotypes (PRS; derived from published GWAS in OpenGWAS) and 551 traits from UK Biobank. We integrated the associations provided by this study by mapping the PRS and UK Biobank traits to the EpiGraphDB GWAS traits, creating new relationships representing the published PRS associations.

pQTL and eQTL MR

Source: xQTL study (Zheng et al., 2019)

Zheng et al. (2019) conducted systematic MR and colocalization analyses of 1,740 plasma proteins (pQTLs) and 16,058 blood transcripts (eQTLs) against 576 phenotypes in Europeans, with the aim of prioritizing drug targets for complex diseases. We integrated these results in EpiGraphDB, covering both single-SNP MR results and multi-SNP MR results (IVW and Egger MR methods).
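For orientation, the two simplest estimators behind such results can be sketched as follows: the Wald ratio for a single SNP and a fixed-effect inverse-variance weighted (IVW) estimate across several SNPs. These are textbook forms, not code from the xQTL study; the source does not state which IVW variant was used, MR-Egger is not shown, and the effect sizes are made up.

```python
from math import sqrt


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-SNP MR estimate with a first-order standard error."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


def ivw(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate across SNPs, weighting by 1 / se_outcome^2."""
    weights = [1 / se ** 2 for se in se_outcome]
    numerator = sum(
        w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome)
    )
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    return numerator / denominator, sqrt(1 / denominator)


# Toy SNP effects on an exposure (e.g. a protein level) and on an outcome.
bx = [0.1, 0.2]
by = [0.05, 0.1]
se_y = [0.01, 0.01]

single_estimate, single_se = wald_ratio(bx[0], by[0], se_y[0])
ivw_estimate, ivw_se = ivw(bx, by, se_y)
```

Here every SNP implies the same ratio of 0.5, so the IVW estimate matches the single-SNP estimate but with a smaller standard error.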

Experimental Factor Ontology

Source: EFO (Malone et al., 2010)

In addition to a disease ontology we required an established ontology that covered a broad range of biomedical variables. We selected EFO as a well-established and comprehensive ontology. Each class and parent-child relationship was downloaded via the EBI SPARQL endpoint on January 28th 2020.

To create links between GWAS and EFO, we utilised the Vectology platform (Benjamin Elsworth et al., 2020). This used sentence embedding methods to create vectors of each GWAS trait and EFO term, then performed a simple distance measure to identify closest matches. This was chosen in preference to alternative methods, such as Zooma and OnToma, as we obtained better results using embedding methods.
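The matching step can be pictured as a nearest-neighbour lookup over term vectors. The sketch below assumes cosine distance as the "simple distance measure", which the source does not specify, and uses made-up three-dimensional vectors in place of real sentence embeddings.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 minus cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def closest_term(trait_vector, term_vectors):
    """Return the label of the term whose vector is nearest to the trait vector."""
    return min(
        term_vectors,
        key=lambda term: cosine_distance(trait_vector, term_vectors[term]),
    )


# Hypothetical embeddings for two EFO terms and one GWAS trait.
efo_vectors = {
    "body mass index": [0.9, 0.1, 0.0],
    "asthma": [0.0, 0.2, 0.9],
}
gwas_vector = [0.8, 0.2, 0.1]
best_match = closest_term(gwas_vector, efo_vectors)
```

A distance cut-off could be added so that traits with no sufficiently close EFO term remain unlinked; whether the original mapping applied one is not stated.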

References

  • Abbot,L. et al. (2020) Genetic correlation between traits and disorders in the UK Biobank.
  • Carvalho-Silva,D. et al. (2019) Open Targets Platform: new developments and updates two years on. Nucleic Acids Res, 47, D1056–D1065.
  • Demner-Fushman,D. et al. (2017) MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc, ocw177.
  • Elsworth,B. et al. (2018) MELODI: Mining Enriched Literature Objects to Derive Intermediates. International Journal of Epidemiology, 47, 369–379.
  • Elsworth,B. (2020) MRCIEU/MELODI-Presto
  • Elsworth,Ben et al. (2020) The MRC IEU OpenGWAS data infrastructure. bioRxiv.
  • Elsworth,Benjamin et al. (2020) Vectology – exploring biomedical variable relationships using sentence embedding and vectors. Proceedings DSRS-Turing’19. London, 21-22nd Nov, 2019.
  • Finan,C. et al. (2017) The druggable genome and support for target identification and validation in drug development. Science translational medicine, 9, eaag1166.
  • Hemani,G. et al. (2017) Automating Mendelian randomization through machine learning to construct a putative causal map of the human phenome. bioRxiv.
  • Jassal,B. et al. (2019) The reactome pathway knowledgebase. Nucleic Acids Research, gkz1031.
  • Kilicoglu,H. et al. (2011) Constructing a semantic predication gold standard from the biomedical literature. BMC Bioinformatics, 12: 486.
  • Kilicoglu,H. et al. (2012) SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics, 28, 3158–3160.
  • Kilicoglu,H. et al. (2020) Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics, 21: 188.
  • Köster,J. and Rahmann,S. (2018) Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 34, 3600–3600.
  • Malone,J. et al. (2010) Modeling sample variables with an Experimental Factor Ontology. Bioinformatics, 26, 1112–1118.
  • McLaren,W. et al. (2016) The Ensembl Variant Effect Predictor. Genome Biol, 17, 122.
  • Millard,L.A. et al. (2018) Software Application Profile: PHESANT: a tool for performing automated phenome scans in UK Biobank. International Journal of Epidemiology, 47, 29–35.
  • Mungall,C.J. et al. (2017) The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res, 45, D712–D722.
  • National Center for Biotechnology Information (2020) Home - Gene.
  • National Library of Medicine (2009) Metathesaurus. UMLS Reference Manual.
  • Orchard,S. et al. (2014) The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucl. Acids Res., 42, D358–D363.
  • Relling,M.V. and Klein,T.E. (2011) CPIC: Clinical Pharmacogenetics Implementation Consortium of the Pharmacogenomics Research Network. Clin Pharmacol Ther, 89, 464–467.
  • Richardson,T.G. et al. (2019) An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. eLife, 8, e43657.
  • Smedley,D. et al. (2009) BioMart – biological queries made easy. BMC Genomics, 10, 22.
  • Szklarczyk,D. et al. (2019) STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res, 47, D607–D613.
  • The GTEx Consortium et al. (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348, 648–660.
  • UK Biobank (2014) About UK Biobank.
  • Zheng,J. et al. (2019) Systematic Mendelian randomization and colocalization analyses of the plasma proteome and blood transcriptome to prioritize drug targets for complex disease.