Interface for the 'Neo4j Bolt' Protocol.
Bolt4jr
bolt4jr is an R package designed to efficiently query, extract, and process large-scale network data from Neo4j databases using the Bolt protocol, with built-in support for batch processing and data frame conversion.
Overview
bolt4jr is an R package that facilitates interaction with Neo4j databases using the Bolt protocol. It allows users to efficiently query nodes and edges in a Neo4j graph database, convert results into data frames, and process large datasets in batches. The package is especially useful for extracting large-scale network data for bioinformatics, computational biology, and other applications.
This README provides a comprehensive guide to installing and using the bolt4jr package for extracting network data from Neo4j.
Installation
To install the bolt4jr package directly from its GitHub repository, use the remotes package:
# Install remotes if not already installed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
# Install bolt4jr from GitHub
remotes::install_github("Broccolito/bolt4jr")
Setting Up Environment Variables
To securely store your Neo4j connection details (URI, username, and password), you can use environment variables. This ensures that sensitive information is not hard-coded in your scripts.
Open your .Renviron file:
usethis::edit_r_environ()
Add the following lines to the file, replacing placeholders with your connection details:
NEO4J_URI=bolt://<YOUR_NEO4J_URI>
NEO4J_USER=<YOUR_USERNAME>
NEO4J_PASSWORD=<YOUR_PASSWORD>
Save the file and restart your R session to load the environment variables.
Access the stored variables in R:
uri = Sys.getenv("NEO4J_URI")
username = Sys.getenv("NEO4J_USER")
password = Sys.getenv("NEO4J_PASSWORD")
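Sys.getenv() returns an empty string for a variable that is not set, so a missing entry in .Renviron can go unnoticed until the connection fails. The sketch below shows one way to check for that up front. The helper missing_env_vars is hypothetical and not part of bolt4jr; it only assumes the three variable names used above.

```r
# Hypothetical helper (not part of bolt4jr): returns the names of the
# expected connection variables that are unset or empty.
missing_env_vars = function(vars = c("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")) {
  values = Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

# Report anything that still needs to be added to .Renviron
unset_vars = missing_env_vars()
if (length(unset_vars) > 0) {
  message("Not set: ", paste(unset_vars, collapse = ", "))
}
```

If the message lists any names, re-check the .Renviron file and restart the R session before querying.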
Setting Up the Conda Environment
setup_bolt4jr()
This function initializes the Conda environment required for the bolt4jr package. If no Conda binary is found, it installs Miniconda. If the required Conda environment (bolt4jr) is not found, it creates the environment and installs the necessary dependencies.
Querying Nodes and Edges from Neo4j
Querying Nodes
To query nodes from a Neo4j database, use the run_query function. Here's an example:
library(bolt4jr)
# Query nodes
nodes = run_query(
  uri = uri,
  user = username,
  password = password,
  query = "
    MATCH (n)-[r]-(m)
    WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
    RETURN DISTINCT elementId(n) AS node_id, n
    LIMIT 1000"
)
# Examine the structure of the result
unlist(nodes[[1]])
Example Output (Unlisted Structure):
$node_id
[1] "4:c77f6410-bc08-43ba-a172-0503ab1c93db:0"
$n.identifier
[1] "UBERON:0003233"
$n.name
[1] "epithelium of shoulder"
$n.mesh_id
[1] ""
$n.source
[1] "Uberon"
Extract Specific Fields and Convert to a Data Frame:
nodes = convert_df(
  nodes,
  field_names = c("node_id", "n.identifier", "n.name", "n.source")
)
# View the resulting data frame
head(nodes)
Example Output (Nodes Data Frame):
node_id | n.identifier | n.name | n.source |
---|---|---|---|
4:c77f6410-bc08-43ba-a172-0503ab1c93db:0 | UBERON:0003233 | epithelium of shoulder | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:1 | UBERON:2001901 | ceratobranchial 3 element | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:2 | UBERON:0004321 | middle phalanx of manual digit 3 | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:3 | UBERON:0002414 | lumbar vertebra | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:4 | UBERON:2005118 | middle lateral line primordium | Uberon |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:5 | UBERON:0034769 | lymphomyeloid tissue | Uberon |
The extraction works here because node_id is aliased explicitly in the query and the other field names (n.identifier, n.name, n.source) are known common attributes of the returned node n. If a requested field name does not match a field in the results, the function may fail.
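Before calling convert_df, it can help to compare the requested field names against the names actually present in one record. The sketch below uses a mock record shaped like the example output above. The nested-list shape is an assumption about what run_query returns, and the misspelled n.sourse is deliberate.

```r
# Mock record mirroring the example output above (assumed shape: one list
# per row, with the node's attributes nested under "n").
record = list(
  node_id = "4:c77f6410-bc08-43ba-a172-0503ab1c93db:0",
  n = list(
    identifier = "UBERON:0003233",
    name = "epithelium of shoulder",
    mesh_id = "",
    source = "Uberon"
  )
)

# unlist() flattens nested names with a dot, e.g. "n.identifier"
available = names(unlist(record))

# Requested names, with a deliberate typo in the last one
requested = c("node_id", "n.identifier", "n.name", "n.sourse")

# Names that would not be found in the record
unmatched = setdiff(requested, available)
print(unmatched)
```

With real results, replace the mock record with nodes[[1]] and correct any names reported as unmatched before converting.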
Querying Edges
Similarly, you can query edges:
# Query edges
edges = run_query(
  uri = uri,
  user = username,
  password = password,
  query = "
    MATCH (n)-[r]-(m)
    WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
    RETURN DISTINCT
      elementId(r) AS edge_id,
      elementId(startNode(r)) AS start_node_id,
      elementId(endNode(r)) AS end_node_id,
      r
    LIMIT 1000"
)
# Examine the structure of the result
unlist(edges[[1]])
# Extract specific fields and convert to a data frame
edges = convert_df(
  edges,
  field_names = c("edge_id", "start_node_id", "end_node_id")
)
# View the resulting data frame
head(edges)
Example Output (Edges Data Frame):
edge_id | start_node_id | end_node_id |
---|---|---|
4:c77f6410-bc08-43ba-a172-0503ab1c93db:10 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:0 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:1 |
4:c77f6410-bc08-43ba-a172-0503ab1c93db:11 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:2 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:3 |
Because all three field names (edge_id, start_node_id, and end_node_id) are aliased explicitly in the query, the extraction works correctly. If mismatched field names are provided, the function may fail.
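Because the node and edge queries above each apply their own LIMIT, the two result sets are not guaranteed to line up: an edge may refer to a node that was cut from the node results. The sketch below shows one way to check for this in base R. The two data frames are small mock stand-ins with shortened placeholder IDs, shaped like the example outputs above.

```r
# Mock data frames shaped like the example outputs (placeholder IDs)
nodes = data.frame(
  node_id = c("a:0", "a:1", "a:2"),
  n.name = c("epithelium of shoulder", "ceratobranchial 3 element", "lumbar vertebra"),
  stringsAsFactors = FALSE
)
edges = data.frame(
  edge_id = c("a:10", "a:11"),
  start_node_id = c("a:0", "a:2"),
  end_node_id = c("a:1", "a:3"),
  stringsAsFactors = FALSE
)

# Every node ID referenced by an edge, in either direction
endpoints = unique(c(edges$start_node_id, edges$end_node_id))

# Endpoints that have no matching row in the nodes data frame
missing_nodes = setdiff(endpoints, nodes$node_id)
print(missing_nodes)
```

An empty result means every edge endpoint is present in the node table. Otherwise, the listed IDs need to be fetched separately or the affected edges dropped.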
Extracting Large Datasets in Batches
For large networks, you can use the run_batch_query function to process data in chunks. This function appends results to a file incrementally, minimizing memory usage.
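To make the idea of incremental appending concrete, here is a minimal base R sketch of the pattern. It is not the implementation of run_batch_query. The helper write_in_batches is hypothetical, and the edges data frame is mock data standing in for query results.

```r
# Hypothetical helper: writes a data frame to a TSV file in chunks.
# The header is written with the first chunk only; later chunks are appended.
write_in_batches = function(df, filename, batch_size = 1000) {
  if (nrow(df) == 0) {
    return(invisible(0L))
  }
  starts = seq(1, nrow(df), by = batch_size)
  for (start in starts) {
    end = min(start + batch_size - 1, nrow(df))
    first = start == 1
    write.table(
      df[start:end, , drop = FALSE],
      file = filename,
      sep = "\t",
      quote = FALSE,
      row.names = FALSE,
      col.names = first,
      append = !first
    )
  }
  invisible(length(starts))
}

# Mock edge table standing in for query results
edges = data.frame(
  edge_id = paste0("e", 1:25),
  start_node_id = paste0("n", 1:25),
  end_node_id = paste0("n", 2:26)
)

# Write 25 rows in batches of 10 to a temporary file
out = tempfile(fileext = ".tsv")
n_batches = write_in_batches(edges, out, batch_size = 10)
```

In this sketch each write.table() call handles a single chunk, which is the same reason appending batch by batch can keep memory usage bounded.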
Extracting Edges in Batches
run_batch_query(
  uri = uri,
  user = username,
  password = password,
  query = "
    MATCH (n)-[r]-(m)
    WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
    RETURN DISTINCT
      elementId(r) AS edge_id,
      elementId(startNode(r)) AS start_node_id,
      elementId(endNode(r)) AS end_node_id,
      r",
  field_names = c("edge_id", "start_node_id", "end_node_id"),
  filename = "edges.tsv",
  batch_size = 1000
)
Extracting Nodes in Batches
run_batch_query(
  uri = uri,
  user = username,
  password = password,
  query = "
    MATCH (n)-[r]-(m)
    WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
    RETURN DISTINCT elementId(n) AS node_id, n",
  field_names = c("node_id", "n.identifier", "n.name", "n.source"),
  filename = "nodes.tsv",
  batch_size = 1000
)
Additional Features
Convert Query Results to Data Frames
The convert_df function simplifies converting Neo4j query results into R data frames.
# Convert query results to a data frame
nodes = convert_df(
  nodes,
  field_names = c("node_id", "n.identifier", "n.name", "n.source")
)
# View the data frame
head(nodes)
As with the non-batched queries above, make sure that every field name either appears in the Neo4j query or is a common attribute. If mismatched field names are provided, the function may fail.
Troubleshooting
- Connection Issues: Ensure that your Neo4j database is running and the URI, username, and password are correct.
- Environment Variables Not Loaded: Verify that the .Renviron file is saved correctly and restart your R session.
- Large Query Limits: Use run_batch_query for datasets exceeding memory limits.
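For intermittent connection problems, one option is to retry a failing call a few times before giving up. The sketch below is a generic base R pattern, not a bolt4jr feature. The helper with_retry is hypothetical, and flaky is a mock function that fails twice and then succeeds, standing in for a database call.

```r
# Hypothetical helper: calls fn() up to `attempts` times, pausing between
# failures, and re-raises the last error if every attempt fails.
with_retry = function(fn, attempts = 3, wait_seconds = 1) {
  for (i in seq_len(attempts)) {
    result = tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message(sprintf("Attempt %d of %d failed: %s", i, attempts, conditionMessage(result)))
    if (i < attempts) {
      Sys.sleep(wait_seconds)
    }
  }
  stop(result)
}

# Mock stand-in for a database call: fails on the first two calls
calls = 0
flaky = function() {
  calls <<- calls + 1
  if (calls < 3) {
    stop("connection refused")
  }
  "ok"
}

value = with_retry(flaky, attempts = 3, wait_seconds = 0)
```

With the package loaded, the same wrapper could be applied to a real call by passing a function that invokes run_query with the uri, user, password, and query arguments shown earlier.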
Contributing
Contributions to bolt4jr are welcome! Submit issues or pull requests on the GitHub repository. Alternatively, please contact Wanjun Gu for questions and clarifications.