Even in 2020, doing bioinformatics means ID mapping. Genes are probably the most important concept in biology but identifying and naming them is difficult.

In this post, I show how to model ID mapping in Neo4j with the example of microRNA functions. This is a simplified version of the database behind my microRNA tool miTALOS. Images are taken from a presentation and show examples, not real data.

Why do we need ID mapping?

There are several genome databases that index genomes, define genes and assign IDs to those genes. The most important ones are probably ENSEMBL and Entrez Gene from NCBI. Next to these big databases there are several more specialized ones, such as the mouse-specific MGI. The mapping between these databases is usually many-to-many, meaning an ID in ENSEMBL can map to multiple IDs in Entrez Gene and vice versa. Consequently, working with data that uses different IDs is difficult.

Next to the IDs from genome databases, genes are commonly identified by a name and a short code called the gene symbol. My favorite human gene has the name ‘forkhead box A2’ and the official gene symbol ‘FOXA2’. For most organisms, there is a central naming authority that defines an official name and symbol, such as HUGO for human. Gene names and symbols are even more difficult to handle than gene IDs. They have a long history because many genes were identified and named before molecular cloning was invented. Some genes were named by several independent groups and the entries were merged only later. Names and symbols also change over time, for example when new functions of a gene are characterized. Because of this, every official gene symbol has a long list of synonyms. The main issue is that two distinct official gene symbols can share the same synonym, so a synonym alone does not identify a gene.

Genes in Neo4j

First iteration

When you start building a Neo4j-based application, you will most probably include genes identified by the ID used in your data or context. Here is an example of how to model the interplay of microRNAs, genes and pathways.

I started by adding nodes for microRNAs, genes (from Entrez Gene) and pathways (from KEGG). I then added miRNA target information and pathway membership for genes. In summary: a miRNA REGULATES a Gene and the Gene is a MEMBER of a Pathway.

image
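The model described above could be created with a few Cypher statements. This is only a minimal sketch: the labels, relationship types and the `name` and `source` properties on miRNA and Pathway nodes appear in the queries of this post, but the `id` and `source` properties on Gene nodes are assumed names, and the gene ID and pathway name are placeholders, not real data.

```cypher
// Illustrative only: the gene ID and pathway name are placeholders, not real data.
// `id` and `source` on Gene nodes are assumed property names for this sketch.
MERGE (m:miRNA {name: 'miR-221'})
MERGE (g:Gene {id: 'entrez-0001', source: 'Entrez Gene'})
MERGE (p:Pathway {name: 'Example pathway', source: 'KEGG'})
// A miRNA REGULATES a Gene, and the Gene is a MEMBER of a Pathway.
MERGE (m)-[:REGULATES]->(g)
MERGE (g)-[:MEMBER]->(p)
```

MERGE is used instead of CREATE so that running the statement twice does not duplicate nodes or relationships.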

The query to get the pathways regulated by a microRNA is straightforward:

MATCH (m:miRNA)-[:REGULATES]->(g:Gene)-[:MEMBER]->(p:Pathway)
WHERE m.name = 'miR-221'
RETURN p.name

Add ID mapping to include more data

Later I wanted to add another pathway database that used ENSEMBL gene IDs. I added another set of Gene nodes (with a different ID), another set of Pathway nodes and pathway memberships for those genes.

In order to allow queries from existing data, I used gene ID mapping data provided by ENSEMBL to create MAPS relationships between Gene nodes.

image2
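Creating the MAPS relationships might look roughly like the following sketch. It assumes that both sets of Gene nodes already exist and carry `id` and `source` properties (assumed names, as the post does not specify them). The ID pairs are made up for illustration; real pairs would come from the mapping data provided by ENSEMBL.

```cypher
// Illustrative only: the ID pairs below are made up, not real mapping data.
// `id` and `source` are assumed property names on Gene nodes.
// The first two rows show one Entrez Gene ID mapping to two ENSEMBL IDs.
UNWIND [
  {entrez: 'entrez-0001', ensembl: 'ensembl-0001'},
  {entrez: 'entrez-0001', ensembl: 'ensembl-0002'},
  {entrez: 'entrez-0002', ensembl: 'ensembl-0002'}
] AS row
MATCH (a:Gene {id: row.entrez, source: 'Entrez Gene'})
MATCH (b:Gene {id: row.ensembl, source: 'ENSEMBL'})
// An undirected MERGE reuses an existing MAPS relationship in either direction.
MERGE (a)-[:MAPS]-(b)
```

Because the mapping has no natural direction, the relationship is merged and later queried without a direction.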

One of the main reasons to use Neo4j is the flexible data model. The beauty of including ID mapping is that your existing queries do not change much when you add the mapping. To get all pathways from both databases you simply add a mapping step:

MATCH (m:miRNA)-[:REGULATES]->(:Gene)-[:MAPS*0..1]-(g:Gene)-[:MEMBER]->(p:Pathway)
WHERE m.name = 'miR-221'
RETURN p.source, p.name

By including a variable-length pattern for the MAPS relationship, you collect all mapped genes and continue to the Pathway nodes as before. You can define the minimum and maximum length of a variable-length pattern. Here, the minimum is set to 0 (in MAPS*0..1), so paths without any MAPS relationship are included as well. As a result, both the directly regulated Gene nodes and their mapped counterparts are bound to the variable g.
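To see what the zero-length case does, it can help to return the number of MAPS hops for each gene that is reached. The following is a sketch for inspection only; `id` and `source` are assumed property names on Gene nodes.

```cypher
// `id` and `source` are assumed property names on Gene nodes.
MATCH (m:miRNA)-[:REGULATES]->(start:Gene)
WHERE m.name = 'miR-221'
MATCH path = (start)-[:MAPS*0..1]-(g:Gene)
// length(path) counts relationships:
// 0 means g is the regulated gene itself, 1 means g was reached through one mapping.
RETURN start.id AS regulated_gene,
       g.id AS reached_gene,
       g.source AS reached_source,
       length(path) AS mapping_steps
```

Rows with `mapping_steps = 0` correspond to the original query without ID mapping, which is why existing results are preserved when the mapping step is added.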

Continue to add mappings

For another project, I wanted to add more gene-function associations beyond pathways. I chose gene-disease associations to further understand the functional context of a microRNA. First, I added Disease nodes. This data used yet another type of gene ID, so I added more Gene nodes and ASSOCIATED relationships between Gene and Disease nodes. I also added the new gene ID mappings.

image3

The query has to include another target node type and more MAPS relationships:

MATCH (m:miRNA)-[:REGULATES]->(:Gene)-[:MAPS*0..2]-(g:Gene)-[:MEMBER|ASSOCIATED]->(target)
WHERE m.name = 'miR-221' AND ('Pathway' IN labels(target) OR 'Disease' IN labels(target))
RETURN labels(target), target.source, target.name

You can query for multiple relationship types by separating them with | (as in (g:Gene)-[:MEMBER|ASSOCIATED]->(target)). The second condition ('Pathway' IN labels(target) OR 'Disease' IN labels(target)) checks whether the target node carries the label Pathway or Disease. A better way to model this would be to add a shared label Annotation to both Pathway and Disease nodes. Finally, the maximum number of MAPS relationships is raised from 1 to 2 so that genes can also be reached through the additional ID mappings.
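The Annotation label suggested above could be added and used roughly as follows. This is a sketch under the assumption that all Pathway and Disease nodes should receive the label; the two statements are meant to be run separately.

```cypher
// Step 1 (run once): add the shared label to existing Pathway and Disease nodes.
MATCH (n)
WHERE n:Pathway OR n:Disease
SET n:Annotation;

// Step 2: the label check moves from the WHERE clause into the pattern.
MATCH (m:miRNA)-[:REGULATES]->(:Gene)-[:MAPS*0..2]-(g:Gene)-[:MEMBER|ASSOCIATED]->(target:Annotation)
WHERE m.name = 'miR-221'
// DISTINCT is optional; it removes duplicates when several mapping routes reach the same target.
RETURN DISTINCT labels(target), target.source, target.name
```

With this change, adding a further annotation type later only requires labeling the new nodes and listing the new relationship type; the WHERE clause stays the same.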

Summary

One of the great features of Neo4j is the flexible and extensible data model. It lets you start simple and add data along the way. Gene ID mapping is unavoidable if you work with data from biology. You can include it step by step in a graph data model with only small changes to your first queries.