Even in 2020, doing bioinformatics means ID mapping. Genes are probably the most important concept in biology but identifying and naming them is difficult.
In this post, I show how to model ID mapping in Neo4j with the example of microRNA functions. This is a simplified version of the database behind my microRNA tool miTALOS. Images are taken from a presentation and show examples, not real data.
Why do we need ID mapping?
There are several genome databases that index genomes, define genes and assign IDs to those genes. The most important ones are probably ENSEMBL and Entrez Gene from NCBI. Next to the big databases there are several more specific ones, such as the mouse-specific MGI. The mapping between these databases is usually many-to-many: an ID in ENSEMBL can map to multiple IDs in Entrez Gene and vice versa. Consequently, working with data that uses different IDs is difficult.
Next to the IDs from genome databases, genes are commonly identified by a name and a short code called the gene symbol. My favorite human gene has the name ‘forkhead box A2’ and the official gene symbol ‘FOXA2’. For most organisms, there is a central naming authority that defines an official name and symbol, such as the HUGO Gene Nomenclature Committee (HGNC) for human. Gene names and symbols are even more difficult to handle than gene IDs. They have a long history because many genes were identified and named before molecular cloning was invented. Some genes were named by several independent groups and the names were merged only later. Names and symbols also change over time, for example when new functions of a gene are characterized. Because of those ambiguities, every official gene symbol has a long list of synonyms. The main issue here is that two distinct official gene symbols can have the same synonym.
Genes in Neo4j
When you start building a Neo4j-based application, you will most probably include genes identified by the ID used in your data or context. Here is an example of how to model the interplay of microRNAs, genes and pathways.
I started by adding nodes for microRNAs, genes (from Entrez Gene) and pathways (from KEGG). I added miRNA target information and pathway membership for genes. In summary: a microRNA targets a Gene and the Gene is a MEMBER of a Pathway.
The query to get the pathways regulated by a microRNA is straightforward:
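A minimal sketch of such a query. Gene, MEMBER and Pathway follow the model described above; the label MicroRNA, the relationship type REGULATES, the name property and the $mirna parameter are assumptions made for illustration and may differ from the actual database.

```cypher
// Assumed names: label MicroRNA, relationship REGULATES, property name.
// $mirna is a query parameter holding the microRNA of interest.
MATCH (m:MicroRNA {name: $mirna})-[:REGULATES]->(g:Gene)-[:MEMBER]->(p:Pathway)
// Count how many distinct target genes fall into each pathway.
RETURN p.name AS pathway, count(DISTINCT g) AS targetGenes
ORDER BY targetGenes DESC
```

Counting the distinct target genes per pathway is optional; returning only the distinct pathways works the same way.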
Add ID mapping to include more data
Later I wanted to add another pathway database that used ENSEMBL gene IDs. I added another set of Gene nodes (with a different ID), another set of Pathway nodes and pathway memberships for those genes.
In order to allow queries from existing data, I used gene ID mapping data provided by ENSEMBL to create MAPS relationships between Gene nodes.
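One possible way to load such a mapping is sketched here. It assumes the mapping was exported to a CSV file with the columns ensembl_id and entrez_id and that every Gene node stores its identifier as a string in an id property; the file name, column names and property name are all hypothetical.

```cypher
// Assumed input: gene_id_mapping.csv with header columns ensembl_id, entrez_id.
// LOAD CSV reads every value as a string, so the id property must be a string too.
LOAD CSV WITH HEADERS FROM 'file:///gene_id_mapping.csv' AS row
MATCH (e:Gene {id: row.ensembl_id})
MATCH (n:Gene {id: row.entrez_id})
// MERGE avoids duplicate relationships if the file is loaded twice.
MERGE (e)-[:MAPS]->(n)
```

Each row of the file is one pair of IDs, so a many-to-many mapping simply becomes several MAPS relationships per Gene node. The relationship is stored with a direction, but queries can ignore it.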
One of the main reasons to use Neo4j is the flexible data model. The beauty of including ID mapping is that your existing queries do not change much when you add the mapping. To get all pathways from both databases you simply add a mapping step:
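A sketch of the extended query, under the same assumptions as before (MicroRNA, REGULATES, name and $mirna are placeholder names; Gene, MAPS, MEMBER and Pathway come from the model):

```cypher
// The only change to the basic query is the optional MAPS hop:
// *0..1 matches either no MAPS relationship or exactly one, in any direction.
MATCH (m:MicroRNA {name: $mirna})-[:REGULATES]->(:Gene)-[:MAPS*0..1]-(g:Gene)-[:MEMBER]->(p:Pathway)
RETURN DISTINCT p.name AS pathway
```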
By including a variable-length pattern for the MAPS relationship, you collect all mapped genes and continue to Pathways as before. You can define the minimum and maximum length of the variable-length pattern. Here, the minimum is set to 0 (MAPS*0..1). This also includes paths without a MAPS relationship and collapses all Gene nodes into a single variable.
Continue to add mappings
For another project, I wanted to add more gene-function associations beyond pathways. I chose gene-disease associations to further understand the functional context of a microRNA. First, I added Disease nodes. Another type of gene ID was used and I added more Gene nodes and ASSOCIATED relationships between Gene and Disease. I also added the new gene ID mappings.
The query has to include another target node type and more MAPS relationships.
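A sketch of what this query could look like. MicroRNA, REGULATES, name and $mirna are again placeholder names, and the upper bound of 2 for MAPS is an assumed value. The second relationship type is spelled ASSOCIATES as in the pattern discussed next; use whichever name your import actually created.

```cypher
// Two kinds of target nodes are reached through two relationship types.
// The upper bound for MAPS is raised because more ID types are now linked.
MATCH (m:MicroRNA {name: $mirna})-[:REGULATES]->(:Gene)-[:MAPS*0..2]-(g:Gene)-[:MEMBER|ASSOCIATES]->(target)
// Keep only targets that are pathways or diseases.
WHERE 'Pathway' IN labels(target) OR 'Disease' IN labels(target)
RETURN DISTINCT labels(target) AS type, target.name AS annotation
```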
You can query for multiple types of relationships by chaining them with | (as in (g:Gene)-[:MEMBER|:ASSOCIATES]->(target)). The second condition ('Pathway' in labels(target) OR 'Disease' in labels(target)) asks if either Pathway or Disease is among the labels of the target node. A better way to model that would be to add the label Annotation to both Pathway and Disease nodes. Finally, the maximum number of MAPS relationships is increased.
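The Annotation idea could be sketched as follows. The first statement adds the shared label once; the second uses it in place of the labels() check. As before, MicroRNA, REGULATES, name, $mirna and the upper bound of 2 are assumptions for illustration.

```cypher
// Statement 1: give every Pathway and Disease node the additional label Annotation.
MATCH (n)
WHERE n:Pathway OR n:Disease
SET n:Annotation;

// Statement 2 (run separately): match on the shared label instead of inspecting labels(target).
MATCH (m:MicroRNA {name: $mirna})-[:REGULATES]->(:Gene)-[:MAPS*0..2]-(g:Gene)-[:MEMBER|ASSOCIATES]->(target:Annotation)
RETURN DISTINCT target.name AS annotation;
```

With a shared label, adding a further annotation type later only requires labeling the new nodes and extending the list of relationship types.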
One of the great features of Neo4j is the flexible and extensible data model. It allows you to start simple and add data along the way. Gene ID mapping is unavoidable if you work with data from biology. You can include it step by step in a graph data model with only small changes to your first queries.