The Emedgene AI knowledge graph currently holds 28M+ genomics-related publications. No, that’s not a typo: over 28 million.

Our goal is to hold a current representation of the entire body of genomics knowledge in our AI knowledge graph: a complete and continuously updated web of variants, genes, pathways, phenotypes, and diseases.

Information in publications is a crucial piece of this knowledge since much of genomics knowledge is still in development. In fact, we’re adding thousands of new data points every week.

Publications are used to:

  • Find gene-disease connections that are not reported in medical sources (OMIM, CGD, etc.).
  • Enrich case data with additional data layers like pathways, interactions, expressions and more (for GUS analysis).
  • Find additional variants that have been reported in the literature, but are not available in mutation databases.
  • Find phenotypes for diseases that were not reported in public databases.
  • Research and obtain metadata for diseases and genes (inheritance modes, treatments, effect of drugs, disease mechanisms).

In order to uncover all of this information, we need to extract both unstructured and structured data from publications – everything a geneticist would understand when reading. We do this using NLU (Natural Language Understanding), a branch of NLP that uses AI to process unstructured information that is governed by loose and flexible rules.

We’ve taught our crawler to identify entities, disambiguate them, and codify the relationships between them. Here’s an example of our AI engine, Ada, analyzing the publication ‘Insights from retinitis pigmentosa into the roles of isocitrate dehydrogenases in the Krebs cycle’ by Hartong DT et al.

Article snapshot

Ada found the causative connection between the IDH3B gene and the disease Retinitis Pigmentosa. This alone is hard to do.

But Ada also understands many complex layers hidden in the article, like the relevant body parts (retina, eye), the pathway (Krebs Cycle), the enzyme in play (NAD-IDH), properties of the variants that cause the disease (Loss of Function, Homozygous), and disease properties (Autosomal Recessive, High Penetrance).

Beyond these, Ada also notes that this happened in 8 different cases described in the article, and is not confused by the appearance of the IDH2 enzyme, which catalyzes the same reaction but doesn’t cause Retinitis Pigmentosa.
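To make the idea of "codified relationships" more concrete, here is a minimal sketch of how the facts from this one article could be stored as typed entities and relations. This is only an illustration: the class names, predicate names, and property keys are assumptions made up for this example and do not describe Emedgene's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str  # e.g. "gene", "disease", "pathway" (hypothetical labels)
    name: str


@dataclass
class Relation:
    subject: Entity
    predicate: str  # hypothetical predicate names, chosen for illustration
    obj: Entity
    properties: dict = field(default_factory=dict)


idh3b = Entity("gene", "IDH3B")
rp = Entity("disease", "Retinitis Pigmentosa")

# Facts taken from the example above, encoded as relations.
relations = [
    Relation(idh3b, "causes", rp, {
        "variant_effect": "Loss of Function",
        "zygosity": "Homozygous",
        "inheritance": "Autosomal Recessive",
        "penetrance": "High",
        "cases_reported": 8,
    }),
    Relation(rp, "affects", Entity("body_part", "retina")),
    Relation(rp, "affects", Entity("body_part", "eye")),
    Relation(idh3b, "associated_with", Entity("pathway", "Krebs Cycle")),
    Relation(idh3b, "associated_with", Entity("enzyme", "NAD-IDH")),
    # The negative finding is stored too, so IDH2 is not mistaken for a cause.
    Relation(Entity("enzyme", "IDH2"), "does_not_cause", rp),
]


def causal_genes(rels, disease_name):
    """Return names of genes with a 'causes' relation to the given disease."""
    return [
        r.subject.name
        for r in rels
        if r.predicate == "causes"
        and r.subject.kind == "gene"
        and r.obj.name == disease_name
    ]


print(causal_genes(relations, "Retinitis Pigmentosa"))
```

Storing the negative IDH2 finding as its own relation is one simple way to keep a similar-looking entity from being reported as a cause.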

Imagine what all this data can be used for, in addition to replacing hours spent finding and reading publications relevant to your case (while remaining vulnerable to human limitations).

  • Flexible phenotype definition – Ability to understand phenotypes and all of their possible description forms for maximum accuracy in finding relevant publications.
  • Phenotype variability for diseases – Different phenotypes appear in different patients with the same disease; the ability to recognize that different phenotypic presentations across patients are actually the same disease.
  • Dynamic panels – The ability to deduce a set of genes related to a given set of phenotypes, which could lead to automatic panel design and reduce the cost of sequencing while increasing its availability.
  • New gene-disease discoveries – Discover unpublished or indirect associations between genes and diseases either by triangulating information from several publications or by inferring connections through additional pieces of information.
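As a rough illustration of the dynamic panel idea from the list above, the sketch below ranks genes by how many of a patient's phenotypes they are linked to. The function name `dynamic_panel`, the `min_overlap` parameter, and the toy gene and phenotype labels are all hypothetical placeholders; a real system would draw these links from the knowledge graph and would likely use a more sophisticated score than a simple overlap count.

```python
def dynamic_panel(gene_phenotypes, patient_phenotypes, min_overlap=1):
    """Rank genes by the number of patient phenotypes they are linked to.

    gene_phenotypes: dict mapping gene name -> iterable of phenotype labels.
    patient_phenotypes: iterable of phenotype labels observed in the patient.
    """
    query = set(patient_phenotypes)
    scored = []
    for gene, phenotypes in gene_phenotypes.items():
        overlap = len(query & set(phenotypes))
        if overlap >= min_overlap:
            scored.append((gene, overlap))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in scored]


# Toy placeholder data, not real gene-phenotype associations.
toy_graph = {
    "GENE_A": {"phenotype_1", "phenotype_2"},
    "GENE_B": {"phenotype_2"},
    "GENE_C": {"phenotype_3"},
}

panel = dynamic_panel(toy_graph, ["phenotype_1", "phenotype_2"])
print(panel)
```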

All of this new knowledge will benefit, among others:

  • Research – Support for new cohort analysis research using broader available knowledge.
  • Pharma – Improvement of drug response analysis (efficacy) and target discovery through enriched knowledge.

Since we don’t believe AI in medicine should be a black box, we’ll expand a little more on the NLU challenges we’ve solved in our next post. Stay tuned to learn more about NLU of publications.

Learn more about Emedgene’s AI