I recently published the graphio Python package. The goal was to make it easy to load data sets from existing files, Excel sheets etc. into Neo4j. Documentation is here.


When you evaluate Neo4j for a new project or build a Neo4j prototype application, the first step is to define a graph data model. Once you feel comfortable with the whiteboard model, you should build a Neo4j database with real-world data to test Cypher queries and to get a feeling for the general graphieness of your data.

graphio is meant to make the process of prototyping and bootstrapping easier. The general workflow is to define NodeSets and RelationshipSets, iterate over the files that contain your data to parse nodes and relationships, and finally load the data to Neo4j.

I already wrote about gene ID mapping in Neo4j, and this unavoidable task is a nice example for a little tutorial.

Use graphio to load gene ID mapping to Neo4j

Different genome databases have different definitions of what a gene is. Consequently, there are different ID spaces with a many-to-many mapping.

To build a gene ID mapping tool with Neo4j, we load data on gene IDs from the NCBI Gene database. We will create mappings to the ENSEMBL Gene database. Thus, we will create (Gene) nodes and connect them with (Gene)-[:MAPS]->(Gene) relationships. The complete script we build in this tutorial is available in the graphio GitHub repository. You only have to define a few settings at the beginning of the file to run it on your machine.

Data files

All data is accessible via the NCBI FTP server. The main data file gene_info.gz contains a list of all genes for all organisms. There are also files for groups of organisms (e.g. mammalia) and individual organisms. We will use the file for human.

This is a nice example of a tabular data file that contains a lot of condensed information. Each line is a unique gene ID as defined by NCBI Gene. Next to information about the gene, there is mapping data to other databases.

First things first

We begin with importing all the stuff we need for downloading, parsing and loading:

import gzip
import os
import shutil
from urllib.request import urlopen

from graphio import NodeSet, RelationshipSet
import py2neo

Note that graphio uses py2neo to run operations. In future this will also work with the official Neo4j Python driver to minimize dependencies. But for now we have to install and import py2neo as well.

Define a directory where we can download the data file and set Neo4j host/credentials:

DOWNLOAD_DIR = "/set/your/path/here"
DOWNLOAD_FILE_PATH = os.path.join(DOWNLOAD_DIR, 'Homo_sapiens.gene_info.gz')
NEO4J_HOST = 'localhost'
NEO4J_PORT = 7687
NEO4J_USER = 'neo4j'
NEO4J_PASSWORD = 'test'

graph = py2neo.Graph(host=NEO4J_HOST, port=NEO4J_PORT, user=NEO4J_USER, password=NEO4J_PASSWORD)

Here we also create the py2neo.Graph instance which we use later to create data.

Download data file

The data file with only human genes is called Homo_sapiens.gene_info.gz and located in a sub-directory (ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia). With Python 3 you can use the builtin urllib to download files from an FTP server:

with urlopen('ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz') as r:
    with open(DOWNLOAD_FILE_PATH, 'wb') as f:
        shutil.copyfileobj(r, f)

If you still use Python 2 you should migrate to Python 3 instead of googling how to download files with Python 2.
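
When you re-run the script repeatedly during prototyping, you may want to skip the download if the file already exists. The following is a minimal sketch of that idea; download_if_missing is a hypothetical helper written for this example, not part of graphio or urllib:

```python
import os
import shutil
from urllib.request import urlopen


def download_if_missing(url, target_path):
    """Download url to target_path unless the file already exists.

    Returns True if a download happened, False if the existing file was kept.
    """
    if os.path.exists(target_path):
        return False
    # write to a temporary name first so that an interrupted download
    # does not leave a truncated file under the final name
    partial_path = target_path + '.part'
    with urlopen(url) as response, open(partial_path, 'wb') as f:
        shutil.copyfileobj(response, f)
    os.replace(partial_path, target_path)
    return True
```

You would call it with the same URL and DOWNLOAD_FILE_PATH as in the snippet above. Delete the local file whenever you want a fresh copy.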

Define NodeSets and RelationshipSets

One of the ideas behind graphio is to write prototyping code that is easy to understand. That is why you define the NodeSet and RelationshipSet objects before you parse files and add data. A NodeSet is a container for nodes with the same labels (but different properties). A RelationshipSet is a container for relationships with the same type and the same start and end node labels.

ncbi_gene_nodes = NodeSet(['Gene'], ['gene_id'])
ensembl_gene_nodes = NodeSet(['Gene'], ['gene_id'])
gene_mapping_rels = RelationshipSet('MAPS', ['Gene'], ['Gene'], ['gene_id'], ['gene_id'])

We have two NodeSets: one for the genes from NCBI Gene and another for the genes from ENSEMBL. The arguments for a NodeSet are a list of labels and a list of properties that confer uniqueness and are used for MERGE operations (called merge_keys in graphio). We give both NodeSets the label Gene and define gene_id as the unique property. The merge_keys argument is optional; you can skip it if you only want to CREATE data.

After the NodeSets we create a RelationshipSet for the mapping. The arguments for the RelationshipSet are:

  • relationship type: 'MAPS'
  • labels of start node: ['Gene']
  • labels of end node: ['Gene']
  • properties to identify specific start nodes: ['gene_id']
  • properties to identify specific end nodes: ['gene_id']

This has to correspond to the NodeSets created before. As of now, graphio has not implemented any functions to cross-check the data sets. This is planned for a future release.
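
Until such a check exists, a small hand-written test can catch typos early. The sketch below works on plain Python lists rather than on graphio objects, because the attributes of NodeSet and RelationshipSet are not covered in this tutorial. The function check_endpoint and all of its parameters are hypothetical and only illustrate the idea:

```python
def check_endpoint(node_labels, node_merge_keys, rel_labels, rel_properties):
    """Return a list of problems found for one end of a relationship definition.

    An empty list means the relationship end matches the node definition.
    """
    problems = []
    # every label the relationship expects must be defined on the nodes
    missing_labels = set(rel_labels) - set(node_labels)
    if missing_labels:
        problems.append('labels not in node set: {}'.format(sorted(missing_labels)))
    # every identifying property must be one of the merge keys
    missing_props = set(rel_properties) - set(node_merge_keys)
    if missing_props:
        problems.append('properties not in merge keys: {}'.format(sorted(missing_props)))
    return problems


# same values as in the definitions above
gene_labels, gene_keys = ['Gene'], ['gene_id']

print(check_endpoint(gene_labels, gene_keys, ['Gene'], ['gene_id']))  # matching definition
print(check_endpoint(gene_labels, gene_keys, ['Gene'], ['geneid']))   # typo in the property name
```

Run the check once for the start node and once for the end node of each RelationshipSet.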

Iterate data file and load data

The Homo_sapiens.gene_info.gz data file contains one NCBI Gene ID per line. There is more information such as the official gene symbol, long name of the gene, position in the genome etc.

9606	1	A1BG	-	A1B|ABG|GAB|HYST2477	MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410	19	19q13.43	alpha-1-B glycoprotein	protein-coding	A1BG	alpha-1-B glycoprotein	O	alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA	20191220	-
9606	2	A2M	-	A2MD|CPAMD5|FWP007|S863-7	MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899	12	12p13.31	alpha-2-macroglobulin	protein-coding	A2M	alpha-2-macroglobulin	O	alpha-2-macroglobulin|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|alpha-2-M	20200113	-
9606	3	A2MP1	-	A2MP	HGNC:HGNC:8|Ensembl:ENSG00000256069	12	12p13.31	alpha-2-macroglobulin pseudogene 1	pseudo	A2MP1	alpha-2-macroglobulin pseudogene 1	O	pregnancy-zone protein pseudogene	20191221	-

This data file is a nice example of the limitations of tabular data. Each line contains a unique NCBI Gene ID because a table needs a certain level of uniqueness to be machine readable. Each line also contains mappings to other data sources. There is one official gene symbol for each gene ID, which is easy to model in a table. However, there is a varying number of synonyms for each official gene symbol (column 5, separated by |). Also, there are several mappings to other databases (column 6). These mappings are also separated by | and each entry has a database identifier (such as Ensembl, HGNC). The entities in this data file are obviously very graphy because we have defined IDs with various mappings between them.
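
The two pipe-separated columns can be unpacked with a few lines of plain Python. This is a minimal sketch and not part of graphio: the function name parse_db_xrefs is made up for illustration, and the sample line is the first six columns of the A1BG line shown above. Each cross-reference is split at the first colon only, because some identifiers (such as HGNC:HGNC:5) contain a colon themselves. The sketch also assumes that a lone '-' marks an empty column, as it appears to in column 4 of the sample lines.

```python
from collections import defaultdict


def parse_db_xrefs(db_xrefs):
    """Group cross-references like 'MIM:138670|HGNC:HGNC:5' by database name."""
    # assumption: '-' stands for "no value" in this file
    if db_xrefs == '-':
        return {}
    mapping = defaultdict(list)
    for entry in db_xrefs.split('|'):
        # split at the first colon only, the identifier may contain more colons
        database, _, identifier = entry.partition(':')
        mapping[database].append(identifier)
    return dict(mapping)


# first six columns of the A1BG sample line
line = "9606\t1\tA1BG\t-\tA1B|ABG|GAB|HYST2477\tMIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410"
fields = line.split('\t')

synonyms = fields[4].split('|')
xrefs = parse_db_xrefs(fields[5])

print(synonyms)
print(xrefs)
```

Grouping by database name keeps all mappings of a line available, which is handy if you later want to add nodes for other databases such as HGNC.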

For this ID mapping tutorial, we extract the NCBI Gene ID from column 2 and the mapped ENSEMBL Gene IDs from column 6. There could be more than one ENSEMBL Gene ID for an NCBI Gene ID and we will take that into account. We then add a node for each NCBI Gene ID and each ENSEMBL Gene ID, as well as a relationship for each mapping.

ensembl_gene_ids_added = set()

with gzip.open(DOWNLOAD_FILE_PATH, 'rt') as file:
    # skip header line
    next(file)
    # iterate file
    for line in file:
        fields = line.strip().split('\t')
        ncbi_gene_id = fields[1]

        # get mapping to ENSEMBL Gene IDs
        mapped_ensembl_gene_ids = []
        # get dbXrefs
        db_xrefs = fields[5]
        for mapped_element in db_xrefs.split('|'):
            if 'Ensembl' in mapped_element:
                ensembl_gene_id = mapped_element.split(':')[1]
                mapped_ensembl_gene_ids.append(ensembl_gene_id)

        # create nodes and relationships
        # add NCBI gene node
        ncbi_gene_nodes.add_node({'gene_id': ncbi_gene_id, 'db': 'ncbi'})
        # add ENSEMBL gene nodes if they do not exist already
        for ensembl_gene_id in mapped_ensembl_gene_ids:
            if ensembl_gene_id not in ensembl_gene_ids_added:
                ensembl_gene_nodes.add_node({'gene_id': ensembl_gene_id, 'db': 'ensembl'})
                ensembl_gene_ids_added.add(ensembl_gene_id)

        # add (:Gene)-[:MAPS]->(:Gene) relationship
        for ensembl_gene_id in mapped_ensembl_gene_ids:
            gene_mapping_rels.add_relationship(
                {'gene_id': ncbi_gene_id}, {'gene_id': ensembl_gene_id}, {'db': 'ncbi'}
            )

This is a simple parser that iterates over each line of the file, gets the NCBI Gene ID from column 2 (fields[1], since Python lists are zero-indexed) and collects the mapped ENSEMBL Gene IDs from column 6 (fields[5]). We then create nodes for the genes and relationships for the mappings.

Note that we want to create unique (Gene) nodes. Each line contains a unique NCBI Gene ID but there could be duplicates in the mapped ENSEMBL Gene ID. We simply collect all ENSEMBL Gene IDs that are already created in a set and check before adding a new node. In the future this functionality could be included in the NodeSet so that you can simply add nodes without duplications. For now you have to check yourself.

When you add a node to a NodeSet you only need the properties of the node. The label is stored in the NodeSet already. For adding relationships, you need the identifying property of the start and end node as well as the relationship properties. Here this means the NCBI Gene ID of the start node and the ENSEMBL Gene ID of the end node:

{'gene_id': ncbi_gene_id}, {'gene_id': ensembl_gene_id}, {'db': 'ncbi'}

In this example we only create nodes for ENSEMBL Gene IDs that are mapped by NCBI Gene. We can then map data from NCBI to ENSEMBL. For a complete bi-directional mapping we would need to download data from ENSEMBL and create the complete set of ENSEMBL genes with all mappings as defined by ENSEMBL.

Load data to Neo4j

Before adding data we create an index on the gene_id property of (Gene) nodes. This drastically increases the performance when creating relationships. Here we use the py2neo.Graph instance created above:

try:
    graph.schema.create_index('Gene', 'gene_id')
except py2neo.database.ClientError:
    pass

Note: With Neo4j 3.5, creating an index that already exists does not raise an error, but Neo4j 4 throws one. Catching the error keeps the script compatible with Neo4j 4.

Now we can simply use the create() functions of our NodeSets and RelationshipSets:

ncbi_gene_nodes.create(graph)
ensembl_gene_nodes.create(graph)

gene_mapping_rels.create(graph)

And that’s it! You now have a Neo4j database with gene ID mapping data.
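
To try the mapping, you can look up the ENSEMBL Gene IDs for an NCBI Gene ID with a short Cypher query. The sketch below assumes the graph instance created earlier and relies on py2neo's Graph.run() accepting query parameters as keyword arguments. The function name map_ncbi_to_ensembl and the constant MAPPING_QUERY are made up for this example:

```python
MAPPING_QUERY = """
MATCH (ncbi:Gene {gene_id: $gene_id})-[:MAPS]->(ensembl:Gene)
RETURN ensembl.gene_id AS ensembl_gene_id
"""


def map_ncbi_to_ensembl(graph, ncbi_gene_id):
    """Return all ENSEMBL Gene IDs mapped from one NCBI Gene ID.

    `graph` is expected to behave like py2neo.Graph: run(query, **parameters)
    returns an iterable of records that support access by column name.
    """
    result = graph.run(MAPPING_QUERY, gene_id=ncbi_gene_id)
    return [record['ensembl_gene_id'] for record in result]
```

Call it as map_ncbi_to_ensembl(graph, '1'). Note that the gene IDs are stored as strings because the parser above does not convert them. Going by the sample line shown earlier, NCBI Gene ID 1 should map to ENSG00000121410.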

Summary

This tutorial should give you an idea of the scope of graphio: Build a Neo4j database without too much overhead. It is not a replacement for e.g. py2neo but an additional tool for the Python ecosystem.

I will continue working on graphio and extend the functionality. Any contribution is highly appreciated!

Here is a list of things I plan to implement:

  • statistics of data insertions/updates
  • cross-check NodeSet and RelationshipSet to see if the properties match
  • use NodeSet and RelationshipSet to create indexes
  • adapt NodeSet to only contain unique nodes (i.e. do not add duplicates as defined by merge_keys)
  • extend interoperability with py2neo OGM (allow to add GraphObjects)