Deriving GWAS phenotype communities using distance data and graph algorithms

Benjamin Elsworth, Valeriia Haberland, Yi Liu, Tom R Gaunt

The IEU GWAS database now contains over 22,000 GWAS, representing a broad spectrum of human traits. Recent additions based on our systematic GWAS of the UKBiobank have greatly increased the size and complexity of this resource, creating challenges in identifying the most appropriate GWAS to use for post-GWAS analyses of a particular trait.

Understanding how GWAS traits are related (conceptually, phenotypically and genetically) has many important benefits. For example, it helps with selecting a representative trait or group of traits for Mendelian randomization; accounting appropriately for multiple testing (and/or reducing the number of redundant tests); identifying distinct risk factor areas within a set of identified traits; and simply increasing the usability and accessibility of the data.

Manually annotating these GWAS using a pre-defined ontology is not feasible at this scale, especially as the IEU GWAS database continues to grow rapidly. In addition, many traits (in particular questionnaire data) don’t map well to ontologies, and no single ontology covers all trait domains effectively. However, by combining multiple distance measures and correlation coefficients, we can begin to construct data-driven communities, and identify representative traits for each community.
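As a minimal sketch of how several measures might be combined into a single distance for one pair of traits: the function names, the equal weighting, the `1 - |r|` transform and the numbers below are all illustrative assumptions, not the procedure used for the IEU GWAS database.

```python
def correlation_to_distance(r):
    """Map a correlation coefficient in [-1, 1] to a distance in [0, 1].

    Assumption: strongly negative correlations count as 'close',
    so the absolute value is used.
    """
    return 1.0 - abs(r)


def combine_distances(distances, weights=None):
    """Weighted mean of the available measures; missing ones (None) are skipped."""
    if weights is None:
        weights = {name: 1.0 for name in distances}  # equal weights by default
    total = 0.0
    weight_sum = 0.0
    for name, d in distances.items():
        if d is None:
            continue  # e.g. a trait with no ontology mapping
        total += weights[name] * d
        weight_sum += weights[name]
    if weight_sum == 0:
        raise ValueError("no distance measures available for this pair")
    return total / weight_sum


# Hypothetical values for one pair of traits.
pair = {
    "name_embedding": 0.20,
    "phenotype_correlation": correlation_to_distance(0.70),
    "snp_correlation": correlation_to_distance(-0.50),
    "ontology": None,  # not available for this pair
}
combined = combine_distances(pair)
print(round(combined, 3))
```

Skipping missing measures rather than imputing them means that traits which do not map to an ontology can still be compared on the remaining evidence.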

We have applied a multi-modal approach to phenotype clustering and mapping, using natural language processing, genetic and non-genetic data. We have calculated distances between traits using four measures: sentence embedding vectors of the trait names, UKBiobank phenotype correlation coefficients, significant SNP correlation coefficients and manually assigned ontology hierarchy distances. These distances have been incorporated into EpigraphDB to enable us to run community detection algorithms and identify robust communities of GWAS.
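The general idea can be sketched on toy data. The example below assumes made-up three-dimensional "embeddings" for five trait names. It links traits whose cosine distance falls below an arbitrary threshold and takes the connected components of that graph as communities, a deliberately simple stand-in for a proper community detection algorithm. Each community's representative is the trait with the smallest total distance to the others (its medoid). None of the names, vectors or thresholds come from EpigraphDB.

```python
import math
from itertools import combinations

# Toy vectors standing in for sentence embeddings of trait names (hypothetical).
EMBEDDINGS = {
    "body mass index": [0.90, 0.10, 0.00],
    "weight": [0.80, 0.20, 0.10],
    "waist circumference": [0.85, 0.15, 0.05],
    "years of schooling": [0.10, 0.90, 0.20],
    "college degree": [0.00, 0.80, 0.30],
}


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def trait_distance(a, b):
    return cosine_distance(EMBEDDINGS[a], EMBEDDINGS[b])


def find_communities(traits, dist, threshold):
    """Connected components of the graph linking traits closer than `threshold`."""
    neighbours = {t: set() for t in traits}
    for a, b in combinations(traits, 2):
        if dist(a, b) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, groups = set(), []
    for start in traits:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


def representative(group, dist):
    """Medoid: the trait with the smallest summed distance to its community."""
    return min(group, key=lambda t: sum(dist(t, other) for other in group))


communities = find_communities(list(EMBEDDINGS), trait_distance, threshold=0.1)
for group in communities:
    print(representative(group, trait_distance), "<-", group)
```

In practice the single embedding distance would be replaced by a combination of the available measures, and connected components by a modularity-based or similar algorithm run on the graph database.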