Add Neo4j GraphRAG example with TMDB movies dataset #1532
Merged: skrawcz merged 21 commits into apache:main from kartikeyamandhar:examples/neo4j-graph-rag · Apr 6, 2026
Commits (21):
- 95ca948 Add Neo4j GraphRAG example with TMDB movies dataset (kartikeyamandhar)
- 4721cd4 Fix README image URLs with correct GitHub username (kartikeyamandhar)
- 80cb377 Fix .gitignore patterns and add DAG images (kartikeyamandhar)
- 7bb66cd Fix `@resolve` decorator not calling `validate()` on returned decorat… (jmarshrossney)
- 050050d Various build & release fixes (#1529) (skrawcz)
- e59de91 Remove license comments from databackend.py (#1535) (pjfanning)
- 4a8c744 Bump lodash from 4.17.23 to 4.18.1 in /contrib/docs (#1538) (dependabot[bot])
- af49f79 Bump brace-expansion from 1.1.12 to 1.1.13 in /contrib/docs (#1537) (dependabot[bot])
- 6d94279 Bump aiohttp from 3.13.3 to 3.13.4 in /ui/backend (#1536) (dependabot[bot])
- a404380 Bump pygments from 2.19.2 to 2.20.0 in /ui/backend (#1534) (dependabot[bot])
- 81edbe7 Bump path-to-regexp from 0.1.12 to 0.1.13 in /contrib/docs (#1533) (dependabot[bot])
- 1358b68 Bump brace-expansion in /dev_tools/vscode_extension (#1530) (dependabot[bot])
- 86dfc8d Bump requests from 2.32.5 to 2.33.0 in /ui/backend (#1528) (dependabot[bot])
- 0441b6e Bump picomatch in /ui/frontend (#1526) (dependabot[bot])
- e7b9132 Bump yaml in /ui/frontend (#1525) (dependabot[bot])
- f15b89a Bump braces from 3.0.2 to 3.0.3 in /dev_tools/vscode_extension (#1522) (dependabot[bot])
- 14eba28 Address PR review comments: add Apache 2 headers, remove unused deps,… (kartikeyamandhar)
- 7f9236a Merge branch 'main' into examples/neo4j-graph-rag (kartikeyamandhar)
- deb190d Various build & release fixes (#1529) (skrawcz)
- 894ba2e Remove license comments from databackend.py (#1535) (pjfanning)
- ca523a1 Add Neo4j GraphRAG example to ecosystem page (kartikeyamandhar)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
|
|
||
| # OpenAI | ||
| OPENAI_API_KEY=your-openai-api-key-here | ||
|
|
||
| # Neo4j | ||
| NEO4J_URI=bolt://localhost:7687 | ||
| NEO4J_USERNAME=neo4j | ||
| NEO4J_PASSWORD=password | ||
| NEO4J_DATABASE=neo4j |
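
A minimal sketch of how a script might read the settings from the environment template above, assuming the variables have already been exported into the process environment (for example by a dotenv loader). `Neo4jSettings`, `load_settings`, and `require_openai_key` are hypothetical names for illustration, not taken from the PR. The defaults simply mirror the placeholder values in the template.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Neo4jSettings:
    """Connection settings named after the template's NEO4J_* variables (hypothetical container)."""

    uri: str
    username: str
    password: str
    database: str


def load_settings(env=None) -> Neo4jSettings:
    """Read the NEO4J_* variables, falling back to the template's placeholder values."""
    env = os.environ if env is None else env
    return Neo4jSettings(
        uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        username=env.get("NEO4J_USERNAME", "neo4j"),
        password=env.get("NEO4J_PASSWORD", "password"),
        database=env.get("NEO4J_DATABASE", "neo4j"),
    )


def require_openai_key(env=None) -> str:
    """Fail early if OPENAI_API_KEY is missing or still set to the template placeholder."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "")
    if not key or key == "your-openai-api-key-here":
        raise RuntimeError("Set OPENAI_API_KEY before running the embed or query modes")
    return key
```

Checking for the unedited placeholder is a design choice of this sketch: it surfaces a forgotten `.env` edit as a clear error instead of a failed API call later on.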
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
|
|
||
| # Environment | ||
| .env | ||
|
|
||
| # Python | ||
| __pycache__/ | ||
| *.pyc | ||
| *.pyo | ||
| *.pyd | ||
| .Python | ||
| venv/ | ||
| .venv/ | ||
|
|
||
| # Data files (download separately per data/README.md) | ||
| data/*.json | ||
| data/*.csv | ||
|
|
||
| # DAG visualisations are committed — ignore regenerated copies at root | ||
| /ingest_dag.png | ||
| /embed_dag.png | ||
| /rag_dag.png | ||
|
|
||
| # Neo4j | ||
| *.dump | ||
|
|
||
| # OS | ||
| .DS_Store | ||
| Thumbs.db | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,205 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one | ||
| or more contributor license agreements. See the NOTICE file | ||
| distributed with this work for additional information | ||
| regarding copyright ownership. The ASF licenses this file | ||
| to you under the Apache License, Version 2.0 (the | ||
| "License"); you may not use this file except in compliance | ||
| with the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, | ||
| software distributed under the License is distributed on an | ||
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations | ||
| under the License. | ||
| --> | ||
|
|
||
| # Neo4j GraphRAG — TMDB Movies | ||
|
|
||
| A full GraphRAG pipeline over a movie knowledge graph stored in Neo4j, | ||
| built entirely with Apache Hamilton. Ingestion, embedding, and retrieval | ||
| are each expressed as first-class Hamilton DAGs — dependencies declared | ||
| through function signatures, execution graph built automatically. | ||
|
|
||
| ## Hamilton DAG visualisations | ||
|
|
||
| Pass `--visualise` with any `--mode` to regenerate these images from source without | ||
| executing the pipeline. | ||
|
|
||
| ### Ingestion DAG | ||
|
|
||
| ```bash | ||
| python run.py --mode ingest --visualise | ||
| ``` | ||
|
|
||
|  | ||
|
|
||
| Raw TMDB JSON flows through parsing nodes into batched Neo4j writes. | ||
| Hamilton automatically parallelises the four independent branches | ||
| (movies, genres, companies, person edges) from the shared `raw_movies` | ||
| and `raw_credits` inputs. | ||
|
|
||
| --- | ||
|
|
||
| ### Embedding DAG | ||
|
|
||
| ```bash | ||
| python run.py --mode embed --visualise | ||
| ``` | ||
|
|
||
|  | ||
|
|
||
| Movie texts are fetched from Neo4j, batched through the OpenAI embeddings | ||
| API, written back to Movie nodes, and a cosine vector index is created. | ||
|
|
||
| --- | ||
|
|
||
| ### Retrieval + Generation DAG | ||
|
|
||
| ```bash | ||
| python run.py --mode query --visualise | ||
| ``` | ||
|
|
||
|  | ||
|
|
||
| The full 13-node RAG pipeline. Hamilton wires all dependencies from | ||
| function signatures — no manual orchestration: | ||
|
|
||
| ``` | ||
| user_query + openai_api_key + neo4j_driver | ||
| -> query_intent classify into VECTOR / CYPHER / AGGREGATE / HYBRID | ||
| -> entity_extraction extract persons, movies, genres, companies, filters | ||
| -> entity_resolution fuzzy-match each entity against the live graph | ||
| -> query_embedding embed query (VECTOR / HYBRID only) | ||
| -> vector_results cosine similarity search (VECTOR / HYBRID only) | ||
| -> cypher_query LLM generates Cypher from resolved entities | ||
| -> cypher_results execute Cypher against Neo4j | ||
| -> merged_results combine both retrieval paths | ||
| -> retrieved_context format as numbered plain-text records | ||
| -> system_prompt inject context into LLM system prompt | ||
| -> prompt_messages assemble message list | ||
| -> answer gpt-4o generates final answer | ||
| ``` | ||
|
|
||
| ## What it demonstrates | ||
|
|
||
| **Ingestion DAG** (`ingest_module.py`) | ||
| Loads TMDB JSON, parses entities and relationships, writes to Neo4j via | ||
| batched Cypher `MERGE`. | ||
|
|
||
| **Embedding DAG** (`embed_module.py`) | ||
| Computes OpenAI `text-embedding-3-small` embeddings over title + overview, | ||
| writes vectors to Movie nodes, creates a Neo4j cosine vector index. | ||
|
|
||
| **Retrieval DAG** (`retrieval_module.py`) | ||
| Classifies each query into one of four strategies, resolves named entities | ||
| against the graph to get canonical names, then executes retrieval: | ||
|
|
||
| | Strategy | When used | How it retrieves | | ||
| |-------------|----------------------------------|-----------------------------------------------| | ||
| | `VECTOR` | Thematic / semantic queries | Cosine vector search over Movie embeddings | | ||
| | `CYPHER` | Relational / factual queries | LLM-generated Cypher with resolved entities | | ||
| | `AGGREGATE` | Counting / ranking queries | Aggregation Cypher with popularity guard | | ||
| | `HYBRID` | Filtered + semantic queries | CYPHER + VECTOR, results merged | | ||
|
|
||
| The semantic entity resolution layer looks up every extracted entity in | ||
| Neo4j before generating Cypher, so "Warner Bros movies" always resolves | ||
| to the canonical `"Warner Bros."` name in the graph. | ||
|
|
||
| **Generation DAG** (`generation_module.py`) | ||
| Formats retrieved records into a grounded system prompt and calls gpt-4o. | ||
|
|
||
| ## Knowledge graph schema | ||
|
|
||
| ``` | ||
| (:Movie {id, title, release_date, overview, popularity, vote_average}) | ||
| (:Person {id, name}) | ||
| (:Genre {name}) | ||
| (:ProductionCompany {id, name}) | ||
|
|
||
| (:Person)-[:ACTED_IN {order, character}]->(:Movie) | ||
| (:Person)-[:DIRECTED]->(:Movie) | ||
| (:Movie)-[:IN_GENRE]->(:Genre) | ||
| (:Movie)-[:PRODUCED_BY]->(:ProductionCompany) | ||
| ``` | ||
|
|
||
| Dataset: 4,803 movies · 56,603 persons · 106,257 ACTED_IN · 5,166 DIRECTED · 20 genres · 5,047 companies | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Docker | ||
| - Python 3.10+ | ||
| - OpenAI API key (`gpt-4o` access) | ||
| - TMDB dataset (see `data/README.md`) | ||
|
|
||
| ## Setup | ||
|
|
||
| ### 1. Start Neo4j | ||
|
|
||
| ```bash | ||
| docker compose up -d | ||
| ``` | ||
|
|
||
| Neo4j browser: http://localhost:7474 (user: `neo4j`, password: `password`) | ||
|
|
||
| ### 2. Install dependencies | ||
|
|
||
| ```bash | ||
| python -m venv venv | ||
| source venv/bin/activate | ||
| pip install -r requirements.txt | ||
| ``` | ||
|
|
||
| ### 3. Configure environment | ||
|
|
||
| ```bash | ||
| cp .env.example .env | ||
| # edit .env — add your OPENAI_API_KEY | ||
| ``` | ||
|
|
||
| ### 4. Download the dataset | ||
|
|
||
| Follow `data/README.md` to download and convert the TMDB dataset. | ||
|
|
||
| ## Running | ||
|
|
||
| ```bash | ||
| # Step 1 — load graph (takes ~5 seconds) | ||
| python run.py --mode ingest | ||
|
|
||
| # Step 2 — compute and store embeddings (takes ~2 minutes) | ||
| python run.py --mode embed | ||
|
|
||
| # Step 3 — query | ||
| python run.py --mode query --question "Who directed Inception?" | ||
| python run.py --mode query --question "Which movies did Tom Hanks and Robin Wright appear in together?" | ||
| python run.py --mode query --question "Which production company made the most action movies?" | ||
| python run.py --mode query --question "Recommend movies similar to Inception" | ||
| python run.py --mode query --question "Find me war films rated above 7.5" | ||
| python run.py --mode query --question "Which actors appeared in both a Christopher Nolan and a Steven Spielberg film?" | ||
| ``` | ||
|
|
||
| ## Project structure | ||
|
|
||
| ``` | ||
| neo4j_graph_rag/ | ||
| ├── docker-compose.yml Neo4j 5 + APOC | ||
| ├── requirements.txt | ||
| ├── .env.example | ||
|
|
||
| ├── graph_schema.py Node/relationship definitions and Cypher constraints | ||
| ├── ingest_module.py Hamilton DAG: JSON -> Neo4j | ||
| ├── embed_module.py Hamilton DAG: Movie nodes -> embeddings -> vector index | ||
| ├── retrieval_module.py Hamilton DAG: query -> entity resolution -> retrieval | ||
| ├── generation_module.py Hamilton DAG: context + query -> gpt-4o -> answer | ||
| ├── run.py Entry point wiring all three pipelines | ||
| ├── docs/ | ||
| │ └── images/ | ||
| │ ├── ingest_dag.png | ||
| │ ├── embed_dag.png | ||
| │ └── rag_dag.png | ||
| └── data/ | ||
| └── README.md Dataset download and conversion instructions | ||
| ``` | ||
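
The README above describes dependencies being declared through function signatures and a query being routed to one of four retrieval strategies. The following is a toy illustration of both ideas in plain Python. The `execute` resolver is a stand-in written for this note and is not Hamilton's implementation. The function names mirror node names from the README's pipeline listing, but every body is a stub: the keyword classifier and the gating rules are assumptions, whereas the real example is described as using an LLM and Neo4j.

```python
import inspect


def query_intent(user_query: str) -> str:
    """Stub classifier: keywords stand in for the LLM call described in the README."""
    q = user_query.lower()
    if "how many" in q or "most" in q:
        return "AGGREGATE"
    if "similar" in q or "recommend" in q:
        return "VECTOR"
    return "CYPHER"


def vector_results(user_query: str, query_intent: str) -> list:
    """Assumed gating: only the VECTOR and HYBRID strategies use the vector path."""
    if query_intent not in ("VECTOR", "HYBRID"):
        return []
    return [{"title": "stub vector hit", "source": "vector"}]


def cypher_results(user_query: str, query_intent: str) -> list:
    """Assumed gating: every strategy except pure VECTOR runs a Cypher query."""
    if query_intent == "VECTOR":
        return []
    return [{"title": "stub cypher hit", "source": "cypher"}]


def merged_results(vector_results: list, cypher_results: list) -> list:
    """Combine both retrieval paths into one list."""
    return cypher_results + vector_results


def execute(target: str, functions: dict, inputs: dict) -> object:
    """Compute `target` by matching each parameter name to an input or another function."""
    cache = dict(inputs)

    def resolve(name: str):
        if name not in cache:
            fn = functions[name]
            kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
            cache[name] = fn(**kwargs)  # each node is computed at most once
        return cache[name]

    return resolve(target)


NODES = {f.__name__: f for f in (query_intent, vector_results, cypher_results, merged_results)}
print(execute("merged_results", NODES, {"user_query": "Recommend movies similar to Inception"}))
```

Because `merged_results` names `vector_results` and `cypher_results` as parameters, requesting it is enough to pull in the classifier and both retrieval stubs. No call order is written out anywhere, which is the property the README attributes to Hamilton.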
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one | ||
| or more contributor license agreements. See the NOTICE file | ||
| distributed with this work for additional information | ||
| regarding copyright ownership. The ASF licenses this file | ||
| to you under the Apache License, Version 2.0 (the | ||
| "License"); you may not use this file except in compliance | ||
| with the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, | ||
| software distributed under the License is distributed on an | ||
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations | ||
| under the License. | ||
| --> | ||
|
|
||
| # Data | ||
|
|
||
| This example uses the [TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata) from Kaggle. | ||
|
|
||
| ## Download | ||
|
|
||
| 1. Create a free Kaggle account at https://www.kaggle.com | ||
| 2. Go to https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata | ||
| 3. Click **Download** and unzip the archive | ||
| 4. Place the following two files in this `data/` folder: | ||
|
|
||
| ``` | ||
| data/ | ||
| ├── tmdb_5000_movies.json | ||
| └── tmdb_5000_credits.json | ||
| ``` | ||
|
|
||
| ## Note on file format | ||
|
|
||
| The Kaggle archive ships the files as CSV (`tmdb_5000_movies.csv`, `tmdb_5000_credits.csv`). | ||
| Several columns contain JSON strings (genres, cast, crew, production_companies). | ||
|
|
||
| Convert them to JSON before running ingestion: | ||
|
|
||
| ```python | ||
| import pandas as pd, json | ||
|
|
||
| movies = pd.read_csv("tmdb_5000_movies.csv") | ||
| credits = pd.read_csv("tmdb_5000_credits.csv") | ||
|
|
||
| with open("tmdb_5000_movies.json", "w") as f: | ||
| json.dump(movies.to_dict(orient="records"), f) | ||
|
|
||
| with open("tmdb_5000_credits.json", "w") as f: | ||
| json.dump(credits.to_dict(orient="records"), f) | ||
| ``` | ||
|
|
||
| Run this script once from inside the `data/` folder, then proceed with `python run.py --mode ingest`. |
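
The note above says several CSV columns hold JSON strings. A plain `to_dict(orient="records")` dump keeps those values as strings, so whichever step consumes them has to decode them. The sketch below shows that decoding with the standard-library `csv` and `json` modules on a tiny in-memory sample. The sample row and the `decode_nested` helper are hypothetical, and where the PR's ingestion code performs this step is not visible here.

```python
import csv
import io
import json

# Columns the note above lists as holding JSON strings.
JSON_COLUMNS = ("genres", "cast", "crew", "production_companies")

# Made-up sample in the same general shape as the Kaggle CSV (not a real dataset row).
SAMPLE_CSV = '''id,title,genres
1,Example Movie,"[{""id"": 18, ""name"": ""Drama""}]"
'''


def decode_nested(row: dict) -> dict:
    """Return a copy of `row` with non-empty JSON-string columns parsed into Python objects."""
    decoded = dict(row)
    for column in JSON_COLUMNS:
        value = decoded.get(column)
        if isinstance(value, str) and value:
            decoded[column] = json.loads(value)
    return decoded


rows = [decode_nested(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["genres"][0]["name"])  # prints: Drama
```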
examples/LLM_Workflows/neo4j_graph_rag/data/data_refine.py (28 changes: 28 additions & 0 deletions)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, | ||
| # software distributed under the License is distributed on an | ||
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| # KIND, either express or implied. See the License for the | ||
| # specific language governing permissions and limitations | ||
| # under the License. | ||
|
|
||
|
|
||
| import pandas as pd, json | ||
|
|
||
| movies = pd.read_csv("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_movies.csv") | ||
| credits = pd.read_csv("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_credits.csv") | ||
|
|
||
| with open("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_movies.json", "w") as f: | ||
| json.dump(movies.to_dict(orient="records"), f) | ||
|
|
||
| with open("examples/LLM_Workflows/neo4j_graph_rag/data/tmdb_5000_credits.json", "w") as f: | ||
| json.dump(credits.to_dict(orient="records"), f) |
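
The paths in this script are relative to the repository root, so it only finds the CSV files when run from there, whereas the inline snippet in the data README assumes `data/` as the working directory. One possible way to make such a script independent of the working directory is to resolve paths against the script's own location. The sketch below does this with `pathlib`. `csv_and_json_paths` and `DATA_DIR` are hypothetical names and not part of the PR.

```python
from pathlib import Path


def csv_and_json_paths(data_dir: Path, stem: str) -> tuple:
    """Return the (CSV input, JSON output) paths for one dataset file inside `data_dir`."""
    return data_dir / f"{stem}.csv", data_dir / f"{stem}.json"


# In a script saved inside data/, Path(__file__).resolve().parent is that folder.
# Path.cwd() is the fallback only so the sketch also runs when pasted into an interpreter.
DATA_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()

for stem in ("tmdb_5000_movies", "tmdb_5000_credits"):
    src, dst = csv_and_json_paths(DATA_DIR, stem)
    print(f"{src.name} -> {dst.name}")
```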