Merged
21 commits
95ca948
Add Neo4j GraphRAG example with TMDB movies dataset
kartikeyamandhar Mar 30, 2026
4721cd4
Fix README image URLs with correct GitHub username
kartikeyamandhar Mar 30, 2026
80cb377
Fix .gitignore patterns and add DAG images
kartikeyamandhar Mar 30, 2026
7bb66cd
Fix `@resolve` decorator not calling `validate()` on returned decorat…
jmarshrossney Apr 4, 2026
050050d
Various build & release fixes (#1529)
skrawcz Apr 4, 2026
e59de91
Remove license comments from databackend.py (#1535)
pjfanning Apr 4, 2026
4a8c744
Bump lodash from 4.17.23 to 4.18.1 in /contrib/docs (#1538)
dependabot[bot] Apr 4, 2026
af49f79
Bump brace-expansion from 1.1.12 to 1.1.13 in /contrib/docs (#1537)
dependabot[bot] Apr 4, 2026
6d94279
Bump aiohttp from 3.13.3 to 3.13.4 in /ui/backend (#1536)
dependabot[bot] Apr 4, 2026
a404380
Bump pygments from 2.19.2 to 2.20.0 in /ui/backend (#1534)
dependabot[bot] Apr 4, 2026
81edbe7
Bump path-to-regexp from 0.1.12 to 0.1.13 in /contrib/docs (#1533)
dependabot[bot] Apr 4, 2026
1358b68
Bump brace-expansion in /dev_tools/vscode_extension (#1530)
dependabot[bot] Apr 4, 2026
86dfc8d
Bump requests from 2.32.5 to 2.33.0 in /ui/backend (#1528)
dependabot[bot] Apr 4, 2026
0441b6e
Bump picomatch in /ui/frontend (#1526)
dependabot[bot] Apr 4, 2026
e7b9132
Bump yaml in /ui/frontend (#1525)
dependabot[bot] Apr 4, 2026
f15b89a
Bump braces from 3.0.2 to 3.0.3 in /dev_tools/vscode_extension (#1522)
dependabot[bot] Apr 4, 2026
14eba28
Address PR review comments: add Apache 2 headers, remove unused deps,…
kartikeyamandhar Apr 4, 2026
7f9236a
Merge branch 'main' into examples/neo4j-graph-rag
kartikeyamandhar Apr 4, 2026
deb190d
Various build & release fixes (#1529)
skrawcz Apr 4, 2026
894ba2e
Remove license comments from databackend.py (#1535)
pjfanning Apr 4, 2026
ca523a1
Add Neo4j GraphRAG example to ecosystem page
kartikeyamandhar Apr 5, 2026
1 change: 1 addition & 0 deletions docs/ecosystem/index.md
@@ -148,6 +148,7 @@ Persist and cache your data:
| <img src="../_static/logos/slack.svg" width="20" height="20" style="vertical-align: middle;"> **Slack** | Notifications and integrations | [Examples](https://github.com/apache/hamilton/tree/main/examples/slack) \| [Lifecycle Hook](../reference/lifecycle-hooks/SlackNotifierHook.rst) |
| <img src="../_static/logos/geopandas.png" width="20" height="20" style="vertical-align: middle;"> **GeoPandas** | Geospatial data analysis | [Type extension](https://github.com/apache/hamilton/blob/main/hamilton/plugins/geopandas_extensions.py) for GeoDataFrame support |
| <img src="../_static/logos/yaml.svg" width="20" height="20" style="vertical-align: middle;"> **YAML** | Configuration management | [IO Adapters](../reference/io/available-data-adapters.rst) |
| **Neo4j** | Knowledge graph RAG | [Examples](https://github.com/apache/hamilton/tree/main/examples/LLM_Workflows/neo4j_graph_rag) |

---

25 changes: 25 additions & 0 deletions examples/LLM_Workflows/neo4j_graph_rag/.env.example
@@ -0,0 +1,25 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# OpenAI
OPENAI_API_KEY=your-openai-api-key-here

# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
NEO4J_DATABASE=neo4j
44 changes: 44 additions & 0 deletions examples/LLM_Workflows/neo4j_graph_rag/.gitignore
@@ -0,0 +1,44 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Environment
.env

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
venv/
.venv/

# Data files (download separately per data/README.md)
data/*.json
data/*.csv

# DAG visualisations are committed — ignore regenerated copies at root
/ingest_dag.png
/embed_dag.png
/rag_dag.png

# Neo4j
*.dump

# OS
.DS_Store
Thumbs.db
205 changes: 205 additions & 0 deletions examples/LLM_Workflows/neo4j_graph_rag/README.md
@@ -0,0 +1,205 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Neo4j GraphRAG — TMDB Movies

A full GraphRAG pipeline over a movie knowledge graph stored in Neo4j,
built entirely with Apache Hamilton. Ingestion, embedding, and retrieval
are each expressed as first-class Hamilton DAGs — dependencies declared
through function signatures, execution graph built automatically.

## Hamilton DAG visualisations

Pass `--visualise` with any mode to regenerate these diagrams from source
without executing the pipeline.

### Ingestion DAG

```bash
python run.py --mode ingest --visualise
```

![Ingestion DAG](https://raw.githubusercontent.com/apache/hamilton/examples/neo4j-graph-rag/examples/LLM_Workflows/neo4j_graph_rag/docs/images/ingest_dag.png)

Raw TMDB JSON flows through parsing nodes into batched Neo4j writes.
Hamilton automatically parallelises the four independent branches
(movies, genres, companies, person edges) from the shared `raw_movies`
and `raw_credits` inputs.
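Because each branch depends only on the shared raw inputs, that parallelism falls out of the function signatures themselves. A minimal sketch of the shape (node names and fields here are illustrative stand-ins, not the example's actual code):

```python
def movie_rows(raw_movies: list) -> list:
    """Movie nodes: one row per film, drawn from the raw TMDB records."""
    return [{"id": m["id"], "title": m["title"]} for m in raw_movies]


def genre_names(raw_movies: list) -> list:
    """Genre nodes: the distinct genre names across all films."""
    return sorted({g["name"] for m in raw_movies for g in m.get("genres", [])})
```

Neither function references the other, so once `raw_movies` is available Hamilton is free to schedule them independently.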

---

### Embedding DAG

```bash
python run.py --mode embed --visualise
```

![Embedding DAG](https://raw.githubusercontent.com/apache/hamilton/examples/neo4j-graph-rag/examples/LLM_Workflows/neo4j_graph_rag/docs/images/embed_dag.png)

Movie texts are fetched from Neo4j, batched through the OpenAI embeddings
API, written back to Movie nodes, and a cosine vector index is created.

---

### Retrieval + Generation DAG

```bash
python run.py --mode query --visualise
```

![RAG DAG](https://raw.githubusercontent.com/apache/hamilton/examples/neo4j-graph-rag/examples/LLM_Workflows/neo4j_graph_rag/docs/images/rag_dag.png)

The full 13-node RAG pipeline. Hamilton wires all dependencies from
function signatures — no manual orchestration:

```
user_query + openai_api_key + neo4j_driver
-> query_intent classify into VECTOR / CYPHER / AGGREGATE / HYBRID
-> entity_extraction extract persons, movies, genres, companies, filters
-> entity_resolution fuzzy-match each entity against the live graph
-> query_embedding embed query (VECTOR / HYBRID only)
-> vector_results cosine similarity search (VECTOR / HYBRID only)
-> cypher_query LLM generates Cypher from resolved entities
-> cypher_results execute Cypher against Neo4j
-> merged_results combine both retrieval paths
-> retrieved_context format as numbered plain-text records
-> system_prompt inject context into LLM system prompt
-> prompt_messages assemble message list
-> answer gpt-4o generates final answer
```
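Each of those nodes is a plain Python function whose parameter names name its upstream nodes. A minimal sketch of the pattern (the function bodies are illustrative stand-ins, e.g. a keyword heuristic where the real pipeline asks an LLM):

```python
def query_intent(user_query: str) -> str:
    """Classify the query into a retrieval strategy (sketch: keyword heuristic)."""
    q = user_query.lower()
    if any(w in q for w in ("how many", "most", "count")):
        return "AGGREGATE"
    if any(w in q for w in ("similar to", "like", "recommend")):
        return "VECTOR"
    return "CYPHER"


def retrieved_context(merged_results: list) -> str:
    """Format retrieval hits as numbered plain-text records."""
    return "\n".join(f"{i}. {r['title']}" for i, r in enumerate(merged_results, 1))


def system_prompt(retrieved_context: str) -> str:
    """Inject the retrieved context into the grounded system prompt."""
    return "Answer using only these records:\n" + retrieved_context
```

Hamilton wires `retrieved_context`'s output into `system_prompt` because the parameter name matches the node name; `run.py` only has to request the final `answer` node.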

## What it demonstrates

**Ingestion DAG** (`ingest_module.py`)
Loads TMDB JSON, parses entities and relationships, writes to Neo4j via
batched Cypher `MERGE`.
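The batching itself is plain chunking; each chunk then travels as one parameterized statement. A sketch of the idea (the Cypher in the comment is illustrative, assuming the schema below):

```python
def batched(rows: list, size: int = 500):
    """Yield fixed-size chunks so each Neo4j round-trip carries one batch."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


# Illustrative per-batch statement:
# UNWIND $rows AS row
# MERGE (m:Movie {id: row.id}) SET m.title = row.title
```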

**Embedding DAG** (`embed_module.py`)
Computes OpenAI `text-embedding-3-small` embeddings over title + overview,
writes vectors to Movie nodes, creates a Neo4j cosine vector index.
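For intuition: cosine similarity, the metric the index is configured with, is a normalized dot product. A pure-Python sketch (in practice Neo4j computes this inside the index):

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```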

**Retrieval DAG** (`retrieval_module.py`)
Classifies each query into one of four strategies, resolves named entities
against the graph to get canonical names, then executes retrieval:

| Strategy | When used | How it retrieves |
|-------------|----------------------------------|-----------------------------------------------|
| `VECTOR` | Thematic / semantic queries | Cosine vector search over Movie embeddings |
| `CYPHER` | Relational / factual queries | LLM-generated Cypher with resolved entities |
| `AGGREGATE` | Counting / ranking queries | Aggregation Cypher with popularity guard |
| `HYBRID` | Filtered + semantic queries | CYPHER + VECTOR, results merged |

The semantic entity resolution layer looks up every extracted entity in
Neo4j before generating Cypher, so "Warner Bros movies" always resolves
to the canonical `"Warner Bros."` name in the graph.
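A rough stand-in for that lookup using stdlib fuzzy matching (the example resolves against the live graph; `difflib` and the helper name here are assumptions for illustration):

```python
import difflib


def resolve_entity(extracted: str, canonical_names: list, cutoff: float = 0.6):
    """Fuzzy-match an extracted entity to its canonical graph name, or None."""
    hits = difflib.get_close_matches(extracted, canonical_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```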

**Generation DAG** (`generation_module.py`)
Formats retrieved records into a grounded system prompt and calls gpt-4o.

## Knowledge graph schema

```
(:Movie {id, title, release_date, overview, popularity, vote_average})
(:Person {id, name})
(:Genre {name})
(:ProductionCompany {id, name})

(:Person)-[:ACTED_IN {order, character}]->(:Movie)
(:Person)-[:DIRECTED]->(:Movie)
(:Movie)-[:IN_GENRE]->(:Genre)
(:Movie)-[:PRODUCED_BY]->(:ProductionCompany)
```

Dataset: 4,803 movies · 56,603 persons · 106,257 ACTED_IN · 5,166 DIRECTED · 20 genres · 5,047 companies

## Prerequisites

- Docker
- Python 3.10+
- OpenAI API key (`gpt-4o` access)
- TMDB dataset (see `data/README.md`)

## Setup

### 1. Start Neo4j

```bash
docker compose up -d
```

Neo4j browser: http://localhost:7474 (user: `neo4j`, password: `password`)

### 2. Install dependencies

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### 3. Configure environment

```bash
cp .env.example .env
# edit .env — add your OPENAI_API_KEY
```

### 4. Download the dataset

Follow `data/README.md` to download and convert the TMDB dataset.

## Running

```bash
# Step 1 — load graph (takes ~5 seconds)
python run.py --mode ingest

# Step 2 — compute and store embeddings (takes ~2 minutes)
python run.py --mode embed

# Step 3 — query
python run.py --mode query --question "Who directed Inception?"
python run.py --mode query --question "Which movies did Tom Hanks and Robin Wright appear in together?"
python run.py --mode query --question "Which production company made the most action movies?"
python run.py --mode query --question "Recommend movies similar to Inception"
python run.py --mode query --question "Find me war films rated above 7.5"
python run.py --mode query --question "Which actors appeared in both a Christopher Nolan and a Steven Spielberg film?"
```

## Project structure

```
neo4j_graph_rag/
├── docker-compose.yml Neo4j 5 + APOC
├── requirements.txt
├── .env.example
├── graph_schema.py Node/relationship definitions and Cypher constraints
├── ingest_module.py Hamilton DAG: JSON -> Neo4j
├── embed_module.py Hamilton DAG: Movie nodes -> embeddings -> vector index
├── retrieval_module.py Hamilton DAG: query -> entity resolution -> retrieval
├── generation_module.py Hamilton DAG: context + query -> gpt-4o -> answer
├── run.py Entry point wiring all three pipelines
├── docs/
│ └── images/
│ ├── ingest_dag.png
│ ├── embed_dag.png
│ └── rag_dag.png
└── data/
└── README.md Dataset download and conversion instructions
```
57 changes: 57 additions & 0 deletions examples/LLM_Workflows/neo4j_graph_rag/data/README.md
@@ -0,0 +1,57 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Data

This example uses the [TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata) from Kaggle.

## Download

1. Create a free Kaggle account at https://www.kaggle.com
2. Go to https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
3. Click **Download** and unzip the archive
4. Place the following two files in this `data/` folder:

```
data/
├── tmdb_5000_movies.json
└── tmdb_5000_credits.json
```

## Note on file format

The Kaggle archive ships the files as CSV (`tmdb_5000_movies.csv`, `tmdb_5000_credits.csv`).
Several columns contain JSON strings (genres, cast, crew, production_companies).

Convert them to JSON before running ingestion:

```python
import json

import pandas as pd

movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

# Write each table as a JSON array of row records, as ingestion expects.
with open("tmdb_5000_movies.json", "w") as f:
    json.dump(movies.to_dict(orient="records"), f)

with open("tmdb_5000_credits.json", "w") as f:
    json.dump(credits.to_dict(orient="records"), f)
```

Run this script once from inside the `data/` folder, then proceed with `python run.py --mode ingest`.
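The converted records still carry those nested columns as JSON strings. If you want to inspect them before ingestion, they decode with `json.loads` (a sketch; the `parse_nested` helper is illustrative, not part of the example):

```python
import json


def parse_nested(cell) -> list:
    """Decode a JSON-string column such as genres, cast, or crew."""
    return json.loads(cell) if isinstance(cell, str) else []
```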
28 changes: 28 additions & 0 deletions examples/LLM_Workflows/neo4j_graph_rag/data/data_refine.py
@@ -0,0 +1,28 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.


import json

import pandas as pd

# Paths are relative to the repository root.
movies = pd.read_csv("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_movies.csv")
credits = pd.read_csv("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_credits.csv")

with open("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_movies.json", "w") as f:
    json.dump(movies.to_dict(orient="records"), f)

with open("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_credits.json", "w") as f:
    json.dump(credits.to_dict(orient="records"), f)