SpecterOps · d3vzer0 · Apr 16, 2026 · Apr 20, 2026
diff --git a/.gitignore b/.gitignore
@@ -5,14 +5,16 @@ site/
 addons/
 collectors/
 
+# Notebooks/output
 notebooks
-
 output
 graph
 logs
 dbt_packages
-.vscode
 
+# Ignore editor settings
+.vscode
+.idea
 
 # Byte-compiled / optimized / DLL files
 __pycache__/

diff --git a/descriptions/faker/edges/RAND_ExampleReference.md b/descriptions/faker/edges/RAND_ExampleReference.md
@@ -0,0 +1,4 @@
+# Description
+
+This is an example description for the RAND_ExampleReference edge. This page will be embedded/combined with the
+edge description when the pipeline generates automated documentation.
diff --git a/descriptions/faker/nodes/RAND_Computer.md b/descriptions/faker/nodes/RAND_Computer.md
@@ -0,0 +1,4 @@
+# Description
+
+This is an example description for the RAND_Computer node. This page will be embedded/combined with the node
+description when the pipeline generates automated documentation.
diff --git a/docs/.nav.yml b/docs/.nav.yml
@@ -0,0 +1,14 @@
+nav:
+  - index.md
+  - getting-started.md
+  - cli.md
+  - logging.md
+  - Development:
+    - "development/creating-collector.md"
+    - "development/collection.md"
+    - "development/modelling.md"
+    - "development/graph.md"
+  - API:
+    - "api/*"
+  - Sources:
+    - "sources/*"
diff --git a/docs/development/collection.md b/docs/development/collection.md
@@ -0,0 +1,90 @@
+# Collecting resources
+This page explains how OpenHound uses `dlt` sources and resources to collect and transform data. A DLT source groups resources and declares configuration items (such as secrets), while a resource yields assets (e.g. users, computers, roles) that are stored in JSONL/Parquet format as part of the pipeline. If you followed the [Creating a new collector](creating-collector.md) process, the cookiecutter template has already generated an example source and resource that you can modify.
+
+## DLT source
+The `@app.source` function is the starting point of a collector. It receives and parses configuration, builds shared clients or context, and returns DLT resources or DLT transformers. The source decides which resources are part of the collection. By default, running the pipeline (i.e. using the `collect` CLI command) collects all resources returned by this function, though a subset of resources can be selected via the CLI options.
+
+The source is defined in `source.py` (SuperIAM is the sample name for a custom service):
+
+```py
+@app.source(name="SuperIAM", max_table_nesting=0)
+def source(token=dlt.secrets.value, host=dlt.secrets.value):
+    ctx = SourceContext(
+        client=RESTClient(
+            base_url=host,
+            headers={"accept": "application/json"},
+            auth=BearerTokenAuth(token=token),
+            paginator=SinglePagePaginator(),
+        )
+    )
+
+    return (users(ctx),)
+```
+
+**Key points:**
+
+- `token` and `host` are read from `dlt` secrets so they are not hard-coded. These can either be read from your environment or stored inside a `.dlt/secrets.toml` and/or `.dlt/config.toml` config file.
+- A shared `SourceContext` containing a global REST client is created and passed to all resources.
+- The source returns a tuple of resources (in this case only `users`).
+
+
+## DLT resource
+The `@app.resource` function is a wrapper for a DLT resource (`@dlt.resource`) and yields assets (e.g. users, computers, roles) to store on disk. In this case, we use DLT's included RESTClient, which automatically handles pagination, retries and rate-limiting. Compared to the plain `dlt` resource, the wrapper lets you add exception handling strategies so that the pipeline continues when a single resource fails.
+
+The template includes a simple resource:
+
+```py
+@app.resource(name="users", parallelized=True, columns=User)
+def users(ctx: SourceContext):
+    response = ctx.client.get("/users").json()
+    # Option A:
+    # Yield individual users by iterating over the response
+    for user in response["users"]:
+        yield user
+
+    # Option B (alternative to Option A; use one or the other):
+    # Yield the whole list as is if modifications are
+    # not needed
+    # yield response["users"]
+```
+
+**Key points:**
+
+- The resource uses the RESTClient from the shared context `ctx` to fetch users via the `/users` endpoint.
+- `yield` returns one "row" (i.e. one user) at a time.
+- `name="users"` becomes the "table" name by default, in this case the directory where the files are stored on disk, e.g. `SuperIAM/users/data.jsonl.gz`.
+- `parallelized=True` allows the resource to run concurrently. The concurrency limits can be set via the `.dlt/config.toml` configuration file.
+
+
+## Extending the source
+This structure keeps your collectors clean and easy to extend. To collect another asset, create a new `@app.resource` function and add it to the returned resources as part of your source.
+
+### Example: adding a second resource
+
+```py
+@app.resource(name="computers", columns=Computer)
+def computers(ctx: SourceContext):
+    response = ctx.client.get("/computers").json()
+    yield response["computers"]
+
+
+@app.resource(name="users", parallelized=True, columns=User)
+def users(ctx: SourceContext):
+    response = ctx.client.get("/users").json()
+    yield response["users"]
+
+@app.source(name="SuperIAM", max_table_nesting=0)
+def source(token=dlt.secrets.value, host=dlt.secrets.value):
+    ctx = SourceContext(
+        client=RESTClient(
+            base_url=host,
+            headers={"accept": "application/json"},
+            auth=BearerTokenAuth(token=token),
+            paginator=SinglePagePaginator(),
+        )
+    )
+
+    return (users(ctx), computers(ctx))
+```
+
+That covers the basics of collection. You may have noticed that each resource has a `columns=` argument pointing to a particular class. These are custom classes that validate the data returned by the resource and provide a schema for the OpenGraph node/edge representation. Next up is defining these [resource models](modelling.md).
diff --git a/docs/development/creating-collector.md b/docs/development/creating-collector.md
@@ -0,0 +1,41 @@
+# Creating a new collector
+
+This page walks you through generating a new collector using Cookiecutter, creating a dedicated virtual environment and
+running your new collector. The template creates a minimal project and initialises the new directory as a git repository.
+
+The collector is automatically registered as a new CLI command after installing the project dependencies. Running
+`python src/main.py collect --help` confirms the project is configured correctly before you start adding logic.
+
+## 1. Prerequisites
+
+- Python 3.12+
+- [uv](https://docs.astral.sh/uv/)
+
+## 2. Install the OpenHound CLI
+
+```console
+uv tool install openhound
+```
+
+## 3. Create a new collector
+
+Create a new collector using the "create collector" command, which will prompt you for the required details.
+
+```console
+openhound create collector
+```
+
+## 4. Done
+
+You should now be able to see the name of your new collector registered using the following command:
+
+```console
+cd <your_collector_directory>
+python src/main.py collect --help
+```
+
+Running the newly created collector will generate 100 dummy assets. Run
+`python src/main.py collect <your_service_name> ./output` to test the asset generation process. Next up is writing the
+actual [collection](collection.md) logic.
+
+PS. Executing the command for the first time may take a few seconds.
diff --git a/docs/development/decorators.md b/docs/development/decorators.md
@@ -0,0 +1,113 @@
+# Decorators
+
+OpenHound provides several decorators to simplify the development of collection pipelines. Each decorator dynamically
+registers a CLI command and defines specific stages as part of the collection workflow. The collection decorators wrap
+functions that interact with [DLT](https://dlthub.com/docs/intro) (Data Load Tool) to extract and transform data into
+the OpenGraph format.
+
+## Initialization
+
+A collector is initialized with the `OpenHound` class, which provides decorators for each pipeline stage. The first
+argument is the name of your collector, accompanied by a `help` description. The name provided also registers a CLI
+command with the same name, i.e. `OpenHound("aws")` will expose an `openhound collect aws` command.
+
+```python
+from openhound import OpenHound
+
+app = OpenHound("aws", help="OpenHound collector for AWS")
+```
+
+## Decorators
+
+### @app.collect()
+
+Registers a CLI command that collects resources from the source system and stores them in the original (optionally
+filtered) format on disk. This function should return your DLT [source](collection.md).
+
+```python
+@app.collect()
+def collect(ctx: CollectContext):
+    from openhound_aws.source import source as aws_source
+    return aws_source()
+```
+
+**Parameters:**
+
+- `ctx` (CollectContext): Pipeline context for the resource collection stage.
+
+### @app.preproc()
+
+Registers a CLI command that (optionally) preprocesses collected resources and builds lookup data for the OpenGraph
+conversion stage. This function should return a dictionary mapping resource names to table names.
+
+Optionally, you can provide a `transformer` function that applies SQL-based transformations to the loaded data in
+DuckDB. The transformer function receives a DuckDB connection and can create new tables derived from the loaded
+resources.
+
+In the example below, the AWS `users` resource will be stored in the `users` table, `groups` in the `groups` table. The
+optional transformer function is imported from a separate `transforms` module and applies custom SQL/Ibis
+transformations.
+
+```python
+from openhound_aws.transforms import transforms
+
+
+@app.preproc(transformer=transforms)
+def preproc(ctx: PreProcContext):
+    resources = {
+        "resources": "resources",
+        "users": "users",
+        "groups": "groups",
+        "roles": "roles",
+        "policies": "policies",
+        "policy_attachments": "policy_attachments",
+    }
+    return resources
+```
+
+**Parameters:**
+
+- `ctx` (PreProcContext): Pipeline context for the preprocessing stage.
+- `transformer` (Callable, optional): Optional function that takes a DuckDB connection and applies transformations to
+  create custom tables.
+
+### @app.convert()
+
+Registers a CLI command that converts collected resources into OpenGraph nodes and edges. This function should return a
+tuple containing the DLT source and a dictionary of extra context to be added to each asset.
+
+```python
+@app.convert(lookup=AWSLookup)
+def convert(ctx: ConvertContext) -> Tuple[DltSource, dict]:
+    from openhound_aws.source import source as aws_source
+    extras = {}
+    return aws_source(), extras
+```
+
+**Parameters:**
+
+- `lookup`: An optional lookup class for resolving resources during OpenGraph conversion.
+- `ctx` (ConvertContext): Pipeline context for the OpenGraph conversion stage.
+
+## Pipeline Flow
+
+```mermaid
+flowchart LR
+    Raw[(Local Storage)]:::storage
+    subgraph collect["@app.collect()"]
+        Collect[Source API]
+    end
+
+    subgraph preproc["@app.preproc()"]
+        Preproc[Lookup Tables]
+    end
+
+    subgraph convert["@app.convert()"]
+        Convert[OpenGraph]
+    end
+
+    Collect --> |Raw data| Raw[(Local Storage)]
+    Raw --> |Raw data| Preproc
+    Raw --> |Raw data| Convert
+
+```