22 commits
2c19849
Add skaffold config
d3vzer0 Apr 16, 2026
7d61897
Add static node properties and embed node descriptions
d3vzer0 Apr 20, 2026
e5b9018
Embed edge descriptions
d3vzer0 Apr 20, 2026
40980b2
Embed node properties and resource properties
d3vzer0 Apr 20, 2026
4ddc4a2
Added developer documentation back to repo
d3vzer0 Apr 30, 2026
40deefc
Updated openhound install docs
d3vzer0 Apr 30, 2026
ae5bc2e
Updated openhound install docs
d3vzer0 Apr 30, 2026
86169f4
Icons are no longer part of the OpenHound CLI but configured via the …
d3vzer0 Apr 30, 2026
df31147
Updated list of sources
d3vzer0 Apr 30, 2026
d927148
Clarify that Ibis is no longer used in favor of duckdb SQL
d3vzer0 Apr 30, 2026
cb9d35d
Updated example to use dataclass (vs. Pydantic)
d3vzer0 Apr 30, 2026
5af9a39
Updated graph model examples
d3vzer0 Apr 30, 2026
c9014d7
Updated asset model example
d3vzer0 Apr 30, 2026
eb18605
Added zensical conf
d3vzer0 Apr 30, 2026
06399cc
Merge branch 'main' into fix/devdocs
d3vzer0 Apr 30, 2026
b1b02a5
Add multiple template types for mkdocs and mintlify. Added CLI option…
d3vzer0 Apr 30, 2026
8f1b39c
Added latest version of collector docs
d3vzer0 Apr 30, 2026
ca0ca92
Bump zensical version and updated links to nodes, edges and assets i…
d3vzer0 Apr 30, 2026
ff58046
Added sample descriptions for automated docs
d3vzer0 Apr 30, 2026
bc30368
Remove collector-specific docs and only keep the random-asset generat…
d3vzer0 Apr 30, 2026
bd0526a
Bump package dependencies
d3vzer0 Apr 30, 2026
1ea6d7b
Add .idea to gitignore (PyCharm)
d3vzer0 Apr 30, 2026
6 changes: 4 additions & 2 deletions .gitignore
@@ -5,14 +5,16 @@ site/
addons/
collectors/

# Notebooks/output
notebooks

output
graph
logs
dbt_packages
.vscode

# Ignore editor settings
.vscode
.idea

# Byte-compiled / optimized / DLL files
__pycache__/
4 changes: 4 additions & 0 deletions descriptions/faker/edges/RAND_ExampleReference.md
@@ -0,0 +1,4 @@
# Description

This is an example description for the RAND_ExampleReference edge. This page will be embedded/combined with the
edge description when the pipeline generates automated documentation.
4 changes: 4 additions & 0 deletions descriptions/faker/nodes/RAND_Computer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Description

This is an example description for the RAND_Computer node. This page will be embedded/combined with the node
description when the pipeline generates automated documentation.
14 changes: 14 additions & 0 deletions docs/.nav.yml
@@ -0,0 +1,14 @@
nav:
  - index.md
  - getting-started.md
  - cli.md
  - logging.md
  - Development:
    - "development/creating-collector.md"
    - "development/collection.md"
    - "development/modelling.md"
    - "development/graph.md"
  - API:
    - "api/*"
  - Sources:
    - "sources/*"
90 changes: 90 additions & 0 deletions docs/development/collection.md
@@ -0,0 +1,90 @@
# Collecting resources
This page explains how OpenHound uses `dlt` sources and resources to collect and transform data. A DLT source groups resources and declares configuration items (like secrets), while a resource yields assets (e.g. users, computers, roles) that are stored in JSONL/Parquet format as part of the pipeline. If you followed the [Creating a new collector](creating-collector.md) process, the cookiecutter template has already generated an example source and resource that you can modify.

## DLT source
The `@app.source` function is the starting point of a collector. It receives and parses configuration, builds shared clients or context, and returns DLT resources or DLT transformers. The source decides which resources are part of the collection. By default, running the pipeline (i.e. using the `collect` CLI command) collects all resources returned by this function, though a subset of resources can be selected through the CLI options.

The source is defined in `source.py` (SuperIAM is the sample name for a custom service):

```py
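import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import SinglePagePaginator

# `app`, `SourceContext` and `users` are defined elsewhere in the collector project.


@app.source(name="SuperIAM", max_table_nesting=0)
def source(token=dlt.secrets.value, host=dlt.secrets.value):
    # Build one REST client and share it with every resource through the context.
    ctx = SourceContext(
        client=RESTClient(
            base_url=host,
            headers={"accept": "application/json"},
            auth=BearerTokenAuth(token=token),
            paginator=SinglePagePaginator(),
        )
    )

    return (users(ctx),)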
```

**Key points:**

- `token` and `host` are read from `dlt` secrets so they are not hard-coded. They can be read from your environment or stored in a `.dlt/secrets.toml` and/or `.dlt/config.toml` config file.
- A shared `SourceContext` containing a global REST client is created and passed to all resources.
- The source returns a tuple of resources (in this case, only `users`).
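
The shared-context pattern described above can be sketched without any dependencies. This is a minimal illustration, not the template's code: `FakeClient` and this `SourceContext` are hypothetical stand-ins (the real `SourceContext` is not shown on this page), and the canned response is made up.

```python
from dataclasses import dataclass
from typing import Any, Iterator


class FakeClient:
    """Hypothetical stand-in for a REST client; returns canned responses."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def get(self, path: str) -> dict[str, Any]:
        self.calls.append(path)  # record the request so sharing is visible
        canned = {"/users": {"users": [{"id": 1}, {"id": 2}]}}
        return canned.get(path, {})


@dataclass(frozen=True)
class SourceContext:
    """Stand-in for the context object: here it only holds the client."""

    client: FakeClient


def users(ctx: SourceContext) -> Iterator[dict[str, Any]]:
    # Resources receive the context instead of building their own client.
    yield from ctx.client.get("/users")["users"]


def source(client: FakeClient) -> tuple[Iterator[dict[str, Any]], ...]:
    ctx = SourceContext(client=client)  # built once per source
    return (users(ctx),)
```

Because the resource is a generator, no request is made until the returned resource is iterated, which is also why the client can be created up front at low cost.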


## DLT resource
The `@app.resource` function is a wrapper for a DLT resource (`@dlt.resource`) and yields assets (e.g. users, computers, roles) to store on disk. In this case, we use DLT's included RESTClient, which automatically handles pagination, retries and rate-limiting. Unlike a plain `@dlt.resource`, the wrapper allows additional exception-handling strategies to be added so the pipeline can continue when a single resource fails.

The template includes a simple resource:

```py
@app.resource(name="users", parallelized=True, columns=User)
def users(ctx: SourceContext):
    response = ctx.client.get("/users").json()

    # Option A: yield individual users by iterating over the response.
    for user in response["users"]:
        yield user

    # Option B: yield the list as is when no modifications are needed.
    # Use one option or the other; enabling both emits every user twice.
    # yield response["users"]
```

**Key points:**

- The resource uses the RESTClient from the shared context `ctx` to fetch users via the `/users` endpoint.
- `yield` emits one "row" (i.e. one user) at a time.
- `name="users"` becomes the "table" name by default, in this case the directory name where the files will be stored on disk, e.g. `SuperIAM/users/data.jsonl.gz`.
- `parallelized=True` allows the resource to run concurrently. The concurrency limits can be set via the `.dlt/config.toml` configuration file.
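
To make the difference between yielding row by row and yielding a whole list concrete, here is a dependency-free sketch. It assumes a yielded list is treated as a batch of rows; the `rows` helper is a simplified, hypothetical stand-in for that behaviour and not dlt's implementation, and the response data is made up.

```python
from typing import Any, Iterable, Iterator

RESPONSE = {"users": [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]}


def users_one_by_one() -> Iterator[dict[str, Any]]:
    # One row per yield, leaving room to modify each user first.
    for user in RESPONSE["users"]:
        yield user


def users_as_batch() -> Iterator[list[dict[str, Any]]]:
    # A single yield carrying the whole page.
    yield RESPONSE["users"]


def rows(items: Iterable[Any]) -> list[dict[str, Any]]:
    """Flatten yielded items into rows, treating a yielded list as a batch."""
    out: list[dict[str, Any]] = []
    for item in items:
        if isinstance(item, list):
            out.extend(item)
        else:
            out.append(item)
    return out
```

Under that assumption both styles produce the same rows, so the choice mostly depends on whether individual items need to be changed before they are stored.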


## Extending the source
This structure keeps your collectors clean and easy to extend. To collect another asset, create a new `@app.resource` function and add it to the returned resources as part of your source.

### Example: adding a second resource

```py
@app.resource(name="computers", columns=Computer)
def computers(ctx: SourceContext):
    response = ctx.client.get("/computers").json()
    yield response["computers"]


@app.resource(name="users", parallelized=True, columns=User)
def users(ctx: SourceContext):
    response = ctx.client.get("/users").json()
    yield response["users"]


@app.source(name="SuperIAM", max_table_nesting=0)
def source(token=dlt.secrets.value, host=dlt.secrets.value):
    ctx = SourceContext(
        client=RESTClient(
            base_url=host,
            headers={"accept": "application/json"},
            auth=BearerTokenAuth(token=token),
            paginator=SinglePagePaginator(),
        )
    )

    return (users(ctx), computers(ctx))
```

That covers the basics of collection. You may have noticed that each resource has a `columns=` argument pointing to a particular class. These are custom classes that validate the data returned by the resource and provide a schema for the OpenGraph node/edge representation. Next up is defining these [resource models](modelling.md).
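
As a rough idea of what such a class could look like, the sketch below uses a plain dataclass with a small validation step. The field names, the `parse_user` helper and the validation rule are all hypothetical; the actual models are described on the modelling page.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class User:
    """Hypothetical model; the field names are made up for illustration."""

    id: str
    name: str
    enabled: bool = True

    def __post_init__(self) -> None:
        # Minimal validation: reject rows without a usable identifier.
        if not isinstance(self.id, str) or not self.id:
            raise ValueError("id must be a non-empty string")


def parse_user(row: dict[str, Any]) -> User:
    """Build a User from a raw row, ignoring keys the model does not declare."""
    known = {f.name for f in fields(User)}
    return User(**{k: v for k, v in row.items() if k in known})
```

A row such as `{"id": "u-1", "name": "alice", "extra": 1}` would parse into a `User`, while a row with an empty `id` would raise a `ValueError`.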
41 changes: 41 additions & 0 deletions docs/development/creating-collector.md
@@ -0,0 +1,41 @@
# Creating a new collector

This page walks you through generating a new collector using Cookiecutter, creating a dedicated virtual environment and
running your new collector. The template creates a minimal project and initialises the new directory as a git repository.

The collector is automatically registered as a new CLI command after installing the project dependencies. Running
`python src/main.py collect --help` confirms the project is configured correctly before you start adding logic.

## 1. Prerequisites

- Python 3.12+
- [uv](https://docs.astral.sh/uv/)

## 2. Install the OpenHound CLI

```console
uv tool install openhound
```

## 3. Create a new collector

Create a new collector using the "create collector" command, which will prompt you for the required details.

```console
openhound create collector
```

## 4. Done

You should now be able to see the name of your new collector registered using the following command:

```console
cd <your_collector_directory>
python src/main.py collect --help
```

Running the newly created collector will generate 100 dummy assets. Run
`python src/main.py collect <your_service_name> ./output` to test the asset generation process. Next up is writing the
actual [collection](collection.md) logic.

PS. Executing the command for the first time may take a few seconds.
113 changes: 113 additions & 0 deletions docs/development/decorators.md
@@ -0,0 +1,113 @@
# Decorators

OpenHound provides several decorators to simplify the development of collection pipelines. Each decorator dynamically
registers a CLI command and defines a specific stage of the collection workflow. The collection decorators wrap
functions that interact with [DLT](https://dlthub.com/docs/intro) (Data Load Tool) to extract and transform data into
the OpenGraph format.

## Initialization

A collector is initialized with the `OpenHound` class, which provides decorators for each pipeline stage. The first
argument is the name of your collector; the `help` argument adds a description. The name also registers a CLI
command with the same name, i.e. `OpenHound("aws")` will expose an `openhound collect aws` command.

```python
from openhound import OpenHound

app = OpenHound("aws", help="OpenHound collector for AWS")
```
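
The general idea of a decorator that registers a function under a command name can be sketched in a few lines. `MiniApp` below is a toy stand-in written for this illustration; it is not the `OpenHound` class, and how OpenHound registers its CLI commands internally is not shown here.

```python
from typing import Callable, Dict

Handler = Callable[..., object]


class MiniApp:
    """Toy stand-in for a decorator-based app; not the OpenHound class."""

    def __init__(self, name: str, help: str = "") -> None:
        self.name = name
        self.help = help
        self.commands: Dict[str, Handler] = {}

    def _register(self, stage: str) -> Callable[[Handler], Handler]:
        def decorator(func: Handler) -> Handler:
            # Key the command by stage and collector name, e.g. "collect aws".
            self.commands[f"{stage} {self.name}"] = func
            return func  # the function itself is left unchanged

        return decorator

    def collect(self) -> Callable[[Handler], Handler]:
        return self._register("collect")

    def convert(self) -> Callable[[Handler], Handler]:
        return self._register("convert")


app = MiniApp("aws", help="Toy collector")


@app.collect()
def collect() -> str:
    return "source"
```

After the module is imported, `app.commands` holds a `"collect aws"` entry pointing at the decorated function, which is the kind of lookup a CLI layer could dispatch on.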

## Decorators

### @app.collect()

Registers a CLI command that collects resources from the source system and stores them in the original (optionally
filtered) format on disk. This function should return your DLT [source](collection.md).

```python
@app.collect()
def collect(ctx: CollectContext):
    from openhound_aws.source import source as aws_source
    return aws_source()
```

**Parameters:**

- `ctx` (CollectContext): Pipeline context for the resource collection stage.

### @app.preproc()

Registers a CLI command that (optionally) preprocesses collected resources and builds lookup data for the OpenGraph
conversion stage. This function should return a dictionary mapping resource names to table names.

Optionally, you can provide a `transformer` function that applies SQL-based transformations to the loaded data in
DuckDB. The transformer function receives a DuckDB connection and can create new tables derived from the loaded
resources.

In the example below, the AWS `users` resource will be stored in the `users` table and `groups` in the `groups` table. The
optional transformer function is imported from a separate `transforms` module and applies custom DuckDB SQL
transformations.

```python
from openhound_aws.transforms import transforms


@app.preproc(transformer=transforms)
def preproc(ctx: PreProcContext):
    resources = {
        "resources": "resources",
        "users": "users",
        "groups": "groups",
        "roles": "roles",
        "policies": "policies",
        "policy_attachments": "policy_attachments",
    }
    return resources
```

**Parameters:**

- `ctx` (PreProcContext): Pipeline context for the preprocessing stage.
- `transformer` (Callable, optional): Optional function that takes a DuckDB connection and applies transformations to
create custom tables.
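
The shape of a transformer, a function that takes a database connection and derives new tables with SQL, can be sketched as follows. To keep the example dependency-free it uses Python's built-in `sqlite3` as a stand-in for DuckDB; the table names, columns and the `demo` helper are hypothetical.

```python
import sqlite3


def transforms(con: sqlite3.Connection) -> None:
    """Hypothetical transformer: derive a table from already loaded resources."""
    con.execute(
        """
        CREATE TABLE user_roles AS
        SELECT r.name AS role_name, u.name AS user_name
        FROM users AS u
        JOIN roles AS r ON r.id = u.role_id
        """
    )


def demo() -> list[tuple[str, str]]:
    con = sqlite3.connect(":memory:")
    # Stand-ins for the tables the preprocessing stage would have loaded.
    con.execute("CREATE TABLE users (name TEXT, role_id INTEGER)")
    con.execute("CREATE TABLE roles (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 1), ("bob", 2)])
    con.executemany("INSERT INTO roles VALUES (?, ?)", [(1, "admins"), (2, "devs")])
    transforms(con)
    return con.execute(
        "SELECT role_name, user_name FROM user_roles ORDER BY user_name"
    ).fetchall()
```

The transformer only sees the connection, so it can be tested on its own against a small in-memory database before being wired into the preprocessing stage.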

### @app.convert()

Registers a CLI command that converts collected resources into OpenGraph nodes and edges. This function should return a
tuple containing the DLT source and a dictionary of extra context to be added to each asset.

```python
@app.convert(lookup=AWSLookup)
def convert(ctx: ConvertContext) -> Tuple[DltSource, dict]:
    from openhound_aws.source import source as aws_source
    extras = {}
    return aws_source(), extras
```

**Parameters:**

- `lookup`: An optional lookup class for resolving resources during OpenGraph conversion.
- `ctx` (ConvertContext): Pipeline context for the OpenGraph conversion stage.
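
One way to picture "extra context added to each asset" is a merge of the extras dictionary into every asset. The sketch below is an assumption made for illustration: the `with_extras` helper is hypothetical, and both the merge itself and the rule that asset keys win on conflict are guesses rather than documented OpenHound behaviour.

```python
from typing import Any, Iterable, Iterator


def with_extras(
    assets: Iterable[dict[str, Any]], extras: dict[str, Any]
) -> Iterator[dict[str, Any]]:
    """Yield a copy of each asset with the extra context merged in."""
    for asset in assets:
        # Asset keys win on conflict so collected data is never overwritten.
        yield {**extras, **asset}
```

For example, merging `{"tenant": "a"}` into `{"id": 1}` would give an asset carrying both keys, while an asset that already has its own `tenant` value keeps it.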

## Pipeline Flow

```mermaid
flowchart LR
Raw[(Local Storage)]:::storage
subgraph collect["@app.collect()"]
Collect[Source API]
end

subgraph preproc["@app.preproc()"]
Preproc[Lookup Tables]
end

subgraph convert["@app.convert()"]
Convert[OpenGraph]
end

Collect --> |Raw data| Raw[(Local Storage)]
Raw --> |Raw data| Preproc
Raw --> |Raw data| Convert

```