Developing a Neo4j User-Defined procedure

Implementing and debugging Breadth-First Search with a depth limit in a real Neo4j database

Aug 15, 2020

Database and Data

Neo4j is a popular graph database. It provides a query language called Cypher, and Cypher queries look very similar to SQL.

// SQL query which simply returns 10 users from "User" table
SELECT * FROM User LIMIT 10
// Cypher query which returns 10 nodes which are labeled as "User"
MATCH (n:User) RETURN n LIMIT 10

Although Cypher is powerful and expressive, you might want to run your own algorithms and traversals. For this, Neo4j lets you extend Cypher with user-defined procedures and functions.

I will show you everything you need in this post, starting from zero. First, you need a Neo4j database. You can use Neo4j Desktop to install and start the database. You can also visualize your database through Neo4j Browser.

Figure 1: Neo4j Desktop on a Windows 10 machine

When you start Neo4j Desktop, you will see a list of databases.

You can add or remove a database. For this post, I created a new database called “movie”.

You can inspect your database from a web browser. By default, the database serves its browser interface on port 7474, so open your web browser and go to http://localhost:7474.

Figure 3: viewing database from an Internet Browser

Figure 4: an active neo4j database shown in Neo4j Desktop

You can also open Neo4j Browser from Neo4j Desktop; it shows the same screen. The version number of the database matters here. You should probably use a 3.5 release of Neo4j to follow along.

Figure 5: Bring a subgraph using Neo4j Browser

Right now the database does not contain any data. Type :play movie-graph and follow the instructions, and you should end up with a small database that contains 171 nodes and 253 edges (relationships). You can then visualize and query your data. For example, the Cypher query match p=(n)-[e]->(n2) return p limit 10 returns 10 arbitrary paths from the database.

Dependencies for development

Now we have a database and data, so we can start implementing our Neo4j procedure. There is a template project for jumpstarting Neo4j procedures; you can check its GitHub repository. I wanted to choose my own repository name, so instead of forking the template I downloaded the code as a ZIP file and started my own git repository from it.

Figure 6: setting path to JDK inside IntelliJ

Neo4j runs on Java. To develop a Neo4j procedure, I use IntelliJ IDEA Community Edition. I first tried Visual Studio Code with the Java extension, but I could not get it to work with Java 8, so I switched to IntelliJ. I installed Java 8 from AdoptOpenJDK. Inside IntelliJ, go to File > Project Structure and add the path to your JDK folder.

Implementation

Now we are ready to implement the algorithm. Breadth-First Search (BFS) is a well-known algorithm for traversing graph data structures. The pseudocode for BFS is shown below. BFS uses a queue to decide which node to visit next.

// src is the starting node for traversal
procedure BFS(src) is
    Q := empty queue
    visitedNodeSet := {src} // a set of nodes
    Q.enqueue(src)
    while Q is not empty do
        v := Q.dequeue()
        for all edges from node v to another node w do
            if w not in visitedNodeSet then
                add w to visitedNodeSet
                Q.enqueue(w)
    return visitedNodeSet
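To make the pseudocode concrete, here is a minimal Java sketch of the same traversal. It does not use the Neo4j API: the graph is assumed to be a plain adjacency map from a node id to the ids of its neighbors, and the class and method names (SimpleBfs, bfs) are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical helper class for illustration; not part of any Neo4j API.
class SimpleBfs {
    // Returns every node reachable from src, in the order BFS discovers them.
    // adjacency maps a node id to the ids of the nodes its edges point to.
    static Set<Long> bfs(Map<Long, List<Long>> adjacency, long src) {
        Set<Long> visited = new LinkedHashSet<>();
        Queue<Long> queue = new ArrayDeque<>();
        visited.add(src);
        queue.add(src);
        while (!queue.isEmpty()) {
            long v = queue.remove();
            for (long w : adjacency.getOrDefault(v, Collections.emptyList())) {
                // Set.add returns false when w was already visited.
                if (visited.add(w)) {
                    queue.add(w);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> graph = new HashMap<>();
        graph.put(1L, Arrays.asList(2L, 3L));
        graph.put(2L, Arrays.asList(4L));
        graph.put(3L, Arrays.asList(4L));
        graph.put(5L, Arrays.asList(1L)); // node 5 points to 1 but is not reachable from 1
        System.out.println(bfs(graph, 1L)); // prints [1, 2, 3, 4]
    }
}
```

A LinkedHashSet is used for the visited set only so that the result keeps the discovery order, which makes the output easy to read.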

This is a well-known, standard algorithm. Doesn’t it look boring? ;) So let’s customize it a bit.

Plain BFS finds all the nodes connected to a starting node. My aim here is different: instead of finding a node, I want to extract a subgraph. For this reason, I keep track of the depth of the traversal and stop at a depth limit. Also, instead of starting from a single node, the traversal can start from a set of nodes.

procedure BFS(ids, depthLimit, isDirected) is
    // always return the source nodes
    o = {nodes: {ids}, edges: {}}
    direction = BOTH
    if isDirected then
        direction = OUTGOING
    queue = {ids} // the traversal starts from all source nodes
    visitedNodes = {ids}
    currDepth = 0
    cntElementsInQueueFromUpperLevel = 0
    isFirstElemInLevel = true
    while queue is not empty do
        if cntElementsInQueueFromUpperLevel == 0 then
            currDepth++
            isFirstElemInLevel = true
        if currDepth >= depthLimit + 1 then
            break
        n1 = queue.pop()
        cntElementsInQueueFromUpperLevel--
        for all edges e from n1 to a neighbor n2 with respect to direction do
            if visitedNodes contains n2 then
                continue
            if isFirstElemInLevel then
                cntElementsInQueueFromUpperLevel = queue.size()
                isFirstElemInLevel = false
            visitedNodes.add(n2) // mark the neighbor as visited
            queue.add(n2)
            o.edges.add(e)
            o.nodes.add(n2)
    return o

You can find the actual Java code in my repository.

Testing

Testing your code is always essential. It reveals many small bugs that would otherwise stay hidden. Since you need data for testing, you can either generate mock data or use an already existing database.

Figure 7: Debugging the procedure in IntelliJ

Both IntelliJ and Neo4j are very helpful for testing and debugging. In a test, you can run a Neo4j database instance, generate some mock data with Cypher, and call your own procedure on that instance. You can also debug while testing and step into your implementation.

Mock data might not be enough. You might need a large dataset to see the realistic performance of your procedure. To run the procedure on an existing, running Neo4j database instance, you need to do a little more work.

Go to the database in Neo4j Desktop, click “Open Folder”, and go to the “plugins” folder. Put the JAR file generated by building your project here, then restart the database so that your procedure becomes active.

Build the project with the “package” command from the “Maven” view of IntelliJ.

Figure 9: Running the procedure in a real neo4j database

Figure 9 shows that the node with id 0 is the movie called “The Matrix”. We then call our own Neo4j procedure, which returns a subgraph.

Debugging

I can debug using mock data. But what if I want to debug against a big Neo4j database? The database calls my procedure from the JAR file, so somehow I have to debug that JAR file. To do that, I need to do two things.

Add

dbms.jvm.additional=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

to your neo4j.conf file. This file is inside the conf folder of your main database folder (shown in Figure 8). Restart your database to activate the configuration.

Figure 10: A new configuration for remote debugging in IntelliJ

Then add a new Run/Debug Configuration for remote debugging inside IntelliJ. Just use the defaults. It will connect to port 5005, the port we set in the neo4j.conf file.
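If IntelliJ cannot attach, it may help to first check whether anything is listening on the debug port at all. The sketch below is one assumed way to do that from plain Java; the class name DebugPortCheck and the one-second timeout are made up for illustration. Note that this probe only opens and closes a TCP connection and does not speak the debugger protocol, so the database JVM may log a failed debugger handshake when you run it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration; not part of Neo4j or IntelliJ.
class DebugPortCheck {
    // Returns true if something accepts TCP connections on host:port within timeoutMillis.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing reachable on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        // 5005 is the address value from the neo4j.conf line above.
        System.out.println("debug port open: " + isListening("localhost", 5005, 1000));
    }
}
```

If this reports false after a restart, the configuration line was probably not picked up, for example because it was added to the neo4j.conf of a different database.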

Figure 11: Yes we can debug a real neo4j database

I searched for a way to debug a procedure inside a real Neo4j database, but I couldn’t find an answer on the Internet. I hope this post is helpful.

Bonus Question

Let’s say there is a bug on a production server, and you cannot run IntelliJ/Java on that server. You check all your tests and they all pass, but somehow there is a problem with this particular database server and this particular case. How do you debug?

Answer: You can export only the problematic portion of the production database, that is, only the interesting part of the graph. To do this I used APOC. APOC contains lots of useful functions and procedures. It is also installed as a plugin, just like the procedure we built.

CALL apoc.export.cypher.query("match (n)-[r]->(n2) return * limit 100", "subset.cypher",
 {format:'plain',separateFiles:false, cypherFormat: 'create', useOptimizations:{type: "NONE", unwindBatchSize: 20}})
 YIELD file, batches, source, format, nodes, relationships, time, rows, batchSize
 RETURN file, batches, source, format, nodes, relationships, time, rows, batchSize;

When you execute match (n)-[r]->(n2) return * limit 100 you will see a subgraph. The script above exports that subgraph by generating a Cypher script that recreates the data.

So you should first write a Cypher query that returns the interesting/problematic subgraph. Then, using APOC, you can generate a Cypher script that recreates that subgraph elsewhere. That’s my solution.

All my code is available on GitHub: https://github.com/canbax/neo4j-procedure-BFS