There are quite a few projects using Neo4j. The number of publications is growing on PubMed and bioRxiv. Luckily, the database behind a computational biology application is an implementation detail. Even better, implementation details are not considered to be real science and there is no need to publish and document them. I mean, why would anyone want to reproduce your research?

However, there are some projects that make the full Neo4j database available. This series of posts introduces some of those projects and the underlying data model. It should help you to explore the dataset and understand the pros and cons of different data models.

Hetionet

Overview

Hetionet was built by Daniel Himmelstein. In the publication that introduced the Neo4j implementation, he used the database to prioritize drugs for repurposing. An older publication focuses on edge prediction in the network, but Neo4j is not mentioned there.

The documentation around hetionet is fantastic. There is even a public Neo4j instance. When you open the Neo4j browser there is a description of the graph and a tutorial on how to query it. You can replay the tutorial with :play https://neo4j.het.io/guides/hetionet.html.

The data in hetionet is centred around genes, drugs and diseases. The database contains ~47 thousand nodes and ~2.25 million relationships. When you try to understand a Neo4j database, you usually start by looking at the output of CALL db.schema() (on recent Neo4j versions the equivalent procedure is CALL db.schema.visualization()). This procedure outputs a metagraph of all node labels and relationship types. You can see the core labels Gene, Compound and Disease in the centre of the schema. The orange nodes to the left are GO terms associated with genes; they act as grouping nodes and assign a functional context to genes. The dataset is extended with gene expression information: Gene nodes are connected to Anatomy nodes by relationships describing whether the Gene is regulated in a specific tissue or cell type.
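To make the idea of a metagraph concrete, here is a minimal Python sketch that collapses a tiny, made-up graph into its (source label, relationship type, target label) triples. This is only an illustration of the concept, not how Neo4j implements the procedure. The node ids are invented, and TREATS_CtD is included purely as an illustrative third relationship type.

```python
# Toy graph: node id -> label, plus (source, type, target) relationships.
# All node ids are invented; TREATS_CtD is only here for illustration.
nodes = {
    "g1": "Gene",
    "g2": "Gene",
    "c1": "Compound",
    "d1": "Disease",
}
relationships = [
    ("c1", "BINDS_CbG", "g1"),
    ("c1", "BINDS_CbG", "g2"),
    ("d1", "ASSOCIATES_DaG", "g1"),
    ("c1", "TREATS_CtD", "d1"),
]


def metagraph(nodes, relationships):
    """Collapse a graph into its distinct (source label, type, target label) triples."""
    return sorted({(nodes[s], t, nodes[e]) for s, t, e in relationships})


schema = metagraph(nodes, relationships)
for source_label, rel_type, target_label in schema:
    print(f"(:{source_label})-[:{rel_type}]->(:{target_label})")
```

Four relationships between four nodes collapse into three schema edges, which is why the metagraph of a clean data model stays readable even when the graph itself has millions of relationships.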

Hetionet DB Schema

Queries

The data model is easy to understand and follows a kind of whiteboard logic. Any biologist can understand that genes are associated with diseases and annotated with GO terms. Consequently, the queries are easy to read. This will return the genes associated with a particular disease:

MATCH (d:Disease)-[:ASSOCIATES_DaG]->(g:Gene)
WHERE d.name = 'spinal cancer'
RETURN g.name

You can get the compounds targeting genes which are involved in the biological process of myelination (and thus relevant for multiple sclerosis):

MATCH (bp:BiologicalProcess)-[:PARTICIPATES_GpBP]-(g:Gene)-[:BINDS_CbG]-(c:Compound)
WHERE bp.name = 'myelination'
RETURN DISTINCT g.name, c.name

Daniel published a set of queries that perform more advanced analyses. Some of them calculate a GO term enrichment in the graph based on a degree-weighted path count (DWPC). This is an interesting way to run the ubiquitous enrichment analyses directly in Neo4j.
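For readers who have not met the degree-weighted path count before, here is a rough Python sketch on a toy graph. It follows my understanding of the definition in the hetionet publications: every path along a metapath contributes the product of the metaedge-specific degrees of its nodes, each raised to the power -w, and the DWPC is the sum over all paths. The edge lists, the node ids and the default damping exponent w = 0.4 are assumptions made for this example, not values taken from the published queries.

```python
from collections import defaultdict
from math import prod

# Toy edges keyed by metaedge abbreviation (made-up data).
edges = {
    "CbG": [("c1", "g1"), ("c1", "g2"), ("c2", "g2")],  # Compound binds Gene
    "GaD": [("g1", "d1"), ("g2", "d1"), ("g2", "d2")],  # Gene associates with Disease
}


def build_index(pairs):
    """Return adjacency (source -> targets) and per-node degree for one metaedge."""
    forward = defaultdict(list)
    degree = defaultdict(int)
    for source, target in pairs:
        forward[source].append(target)
        degree[source] += 1
        degree[target] += 1
    return forward, degree


def dwpc(source, target, metapath, edges, w=0.4):
    """Sum the path degree products of all paths from source to target."""
    indexes = [build_index(edges[metaedge]) for metaedge in metapath]

    def walk(node, step, visited, degrees):
        if step == len(metapath):
            # Each edge on the path contributed two degrees; damp and multiply them.
            return prod(d ** -w for d in degrees) if node == target else 0.0
        forward, degree = indexes[step]
        total = 0.0
        for nxt in forward.get(node, ()):
            if nxt in visited:  # paths may not revisit a node
                continue
            total += walk(nxt, step + 1, visited | {nxt},
                          degrees + [degree[node], degree[nxt]])
        return total

    return walk(source, 0, {source}, [])


score = dwpc("c1", "d1", ["CbG", "GaD"], edges)
path_count = dwpc("c1", "d1", ["CbG", "GaD"], edges, w=0)
print(score, path_count)
```

With w = 0 the function degenerates to a plain path count. Larger values of w penalize paths through highly connected nodes, which is the point of the weighting.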

Data analysis

The data from Daniel’s hetionet was used for a couple of interesting analyses. In the main paper he builds a logistic regression model to predict a probability of treatment for a drug-disease pair. What I find most interesting about this approach is how he used graph properties as features for the model. From the paper: “The features consisted of a prior probability of treatment, node degrees for 14 metaedges, and DWPCs for 123 metapaths that were well suited for modeling.”

The graph properties are not just the input for his model. He uses an elastic net for feature selection and identifies specific metapaths that are predictive of the drug-disease interaction. By identifying paths with high predictive value he can nicely visualize putative mechanisms and functional relationships behind the prediction. There is a nice case study that describes the logic behind the metapaths found in his model.

There is a great blog post on different ways Neo4j can be used in machine learning methods which also has a section on graph based feature engineering: https://neo4j.com/blog/how-graphs-enhance-artificial-intelligence/.

Reactome

Overview

The biggest public Neo4j dataset I know of is probably the Reactome pathway database. It is one of the best resources for detailed pathway networks. Reactome started adopting Neo4j a few years ago and they described the transition from a relational database to Neo4j in a paper. The data behind Reactome is amazing: everything is accessible and the documentation is fantastic.

The database contains ~2 million nodes and ~9 million relationships. There is a developer guide available that explains how to download the database and load it in Neo4j: https://reactome.org/dev/graph-database.

CALL db.schema() is supposed to show an overview of the data model in a Neo4j database. However, Reactome has a pretty complex data model and calling db.schema() produces this hairy ball:

Reactome DB Schema

Every node label is shown as a distinct node in db.schema(). As mentioned before, Reactome has a pretty complex data model. They use an ontology-like, hierarchical class system to identify entities. Every entity is an instance of the root class. From the root class grows a tree of children defining sub-classes. A Gene is not just a Gene; it is an entity with the following class hierarchy:

DatabaseObject 
  -> PhysicalEntity
    -> GenomeEncodedEntity
      -> EntityWithAccessionedSequence

The data model is explained in the documentation. The class hierarchy of each object in Reactome is represented with multiple labels per node. This leads to a large number of node labels and the messy db.schema() output. One of the reasons behind this structure is that features such as localization, modifications and isoforms are considered to define different physical entities. That means a specific molecule (e.g. glucose) is considered a different physical entity depending on its localization in the cell. In the same manner, proteins with different post-translational modifications are considered different entities. However, all the specific physical entities derived from a protein are connected to a reference entity that holds the invariant protein sequence.
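The label stacking can be sketched in a few lines of Python. The parent table below only covers the four classes from the hierarchy shown above, and the helper name labels_for is made up for this example. It simply walks from a class up to the root and returns every class on the way, which corresponds to the set of labels such a node would carry.

```python
# Parent of each class; the root has no parent.
# Only the four classes from the hierarchy above are included.
PARENT = {
    "DatabaseObject": None,
    "PhysicalEntity": "DatabaseObject",
    "GenomeEncodedEntity": "PhysicalEntity",
    "EntityWithAccessionedSequence": "GenomeEncodedEntity",
}


def labels_for(cls):
    """Return the class and all of its ancestors, root first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain[::-1]


labels = labels_for("EntityWithAccessionedSequence")
# Print the labels in Cypher node-label notation.
print(":" + ":".join(labels))
```

Every additional level in the class tree adds one more label to each node below it, so a deep hierarchy quickly multiplies the number of labels that show up in the schema view.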

Queries

The data modeling approach described above makes the database complex and difficult to understand. If you want to look into the data you should start at the tutorial on extracting pathway data. A complex data model leads to complex queries. Here is an example query from the tutorial to get all participating molecules from a pathway:

// All participating molecules for pathway R-HSA-983169
MATCH (p:Pathway{stId:"R-HSA-983169"})-[:hasEvent*]->(rle:ReactionLikeEvent),
      (rle)-[:input|output|catalystActivity|physicalEntity|regulatedBy|regulator|hasComponent|hasMember|hasCandidate*]->(pe:PhysicalEntity),
      (pe)-[:referenceEntity]->(re:ReferenceEntity)-[:referenceDatabase]->(rd:ReferenceDatabase)
RETURN DISTINCT re.identifier AS Identifier, rd.displayName AS Database

The entry point for the queries is not obvious. If you look for a specific gene, where are you supposed to start? The ReferenceEntity node has a property identifier that could be an Entrez Gene ID.

Data model considerations

Reactome was around a long time before they switched to Neo4j. The data is incredibly dense and I assume they had to transfer the hierarchical entity schema. There are a few data modeling decisions that I find confusing:

  • A node should generally be a concrete thing, but the ReferenceEntity contains the type of the entity in a property. It would be easier to give it a specific label such as Gene or Protein.
  • Every node is a DatabaseObject. I think that is redundant.

A ‘flatter’ data model would be easier to use and I suppose you could capture all information from the entity class hierarchy.

Reactome simple schema