Gateway SUP integration#180

Open
julienduquesnay-se wants to merge 5 commits into pre-draft from gateway-sup

Conversation

@julienduquesnay-se

Description

Specification update for the Gateway SUP. Adds support for the gateway service.

Issues Addressed

#137

Change Type

Please select the relevant options:

  • [] Fix (change that resolves an issue)
  • New enhancement (change that adds specification content)
  • Content edits (change that edits existing content)

Checklist

  • I have read the CONTRIBUTING document.
  • My changes adhere to the established patterns, and best practices.

@julienduquesnay-se requested a review from a team as a code owner May 14, 2026 14:54
@phil-abb self-requested a review May 15, 2026 10:16
@phil-abb
Contributor

@julienduquesnay-se - I won't be able to review this until next week.

@phil-abb
Contributor

@julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch

@julienduquesnay-se
Author

> @julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch

@phil-abb, I was trying to merge the latest changes from the "pre-draft" branch, and it looks like it grabbed your latest commit. I might need help to clean that up.

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
@phil-abb
Contributor

> > @julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch
>
> @phil-abb, I was trying to merge the latest changes from the "pre-draft" branch, and it looks like it grabbed your latest commit. I might need help to clean that up.

I rebased via the GitHub PR UI, and it looks ok now.

Contributor

This file should not be added to this repo. It should be added to the General Website Content repo

The specification repo should only contain the files related directly to the specification. There is an open PR to delete the other files from this repo, but it's still in draft state (@ajcraig)

Author

I have removed the file from this repo and moved it to the correct repo.

Contributor

Same with this diagram.

Author

I have removed the file from this repo and moved it to the correct repo.

| message | string | Y | Associated error message that provides further details to the WFM about the error that was encountered. |

> Note: Error codes and messages are implementation-specific.
When the error is generated by a see-thru gateway, the source attribute of the error structure MUST be set to the gateway device id, with its full hierarchy if applicable.
Contributor

It might be good to link to the location where the types of gateways are explained.

Comment on lines +75 to +77
> Note: Most error codes and messages are implementation-specific.

> Note: the purpose of the `source` attribute is to avoid collision between reserved error codes and implementation-specific error codes.
Contributor

Can these two notes be combined?

Comment thread system-design/specification/margo-management-interface/deployment-status.md Outdated
DeviceId:
# format: "{id}[/{id}[/{id}...]]"
# Top-level id is required and must include only Unreserved Characters as specified in RFC3986.
# Subsequent ids are optional, but if present must include only Unreserved Characters as specified in RFC3986.
Contributor

Is there a better way to phrase this? They are not really optional because you have to use them to provide the information for child devices.

DeviceId_with_asterisk:
# format: "{id}[/{id}[/{id}...]/*]"
# Top-level id is required and must include only Unreserved Characters as specified in RFC3986.
# Subsequent ids are optional, but if present must include only Unreserved Characters as specified in RFC3986.
Contributor

same here

Comment thread mkdocs.yml Outdated
- concepts/workload-fleet-managers/workload-deployment.md
- Edge Compute Devices:
- concepts/edge-compute-devices/devices.md
- Gateways:
Contributor

Need to check on this with the new combined website layout. I'm not sure how we add to the navigation now. I'm not sure if this file is used anymore.

@ajcraig do you know?

Contributor

The current place to add another section to the navigation is the following file:
https://github.com/margo/general_website_content/blob/pre-draft/system-design/concepts/meta.json

@julienduquesnay-se when you move this concept page to the general_website_content repo, make sure you also update this file to add your new navigation.

Author

I have reverted the changes to the file.
@ajcraig How do I generate and test the website with the content of the other repo?

Contributor

Instructions for how to render the full website can be found HERE.

Let us know if you run into any issues. Phil was able to render it himself when he tested it recently.

julienduquesnay-se and others added 2 commits May 20, 2026 19:24
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Co-authored-by: Philip Presson <philip.presson@us.abb.com>
Signed-off-by: Julien Duquesnay <156128585+julienduquesnay-se@users.noreply.github.com>
Contributor

@matlec left a comment

This PR implements an approved SUP, so I'm not asking to block or restructure it here. However, I'd like to flag a broader question this PR surfaces that's worth picking up separately.

@ajcraig @phil-abb

This is a bigger thread, and it ties into a comment I left on the "Device roles to capabilities" SUP. Working through this gateway PR, I keep bumping into the same question:

When the WFM sends a deployment, what is it targeting?

Today the implicit answer is "a WFM client". Routing happens via clientId in the URL, and the ApplicationDeployment YAML stays target-agnostic. This PR shifts that implicit answer by putting deviceId inside the ApplicationDeployment YAML's metadata. That works neatly for the see-thru gateway case, but it doesn't generalize. Multi-node Kubernetes clusters are the clearest example: kube-scheduler picks where pods run, so the WFM really has no business addressing nodes directly.

That points at something deeper. Standalone Device, Standalone Cluster, and Gateway look like three different things in the spec - but to me they read as the same shape, and the only real difference between them is: how many deployment targets the WFM client speaks for. A Standalone Device speaks for one. A Standalone Cluster speaks for one (the cluster as a whole, single-node or multi-node). A Gateway speaks for several. The PR introduces Gateway as a new category to handle the "several" case; I'd suggest framing it instead as extending the existing pattern to support more than one deployment target per client.

The most concrete future case this opens up is the DFM/WFM split for multi-node clusters. For a multi-node cluster, devices (the nodes) aren't really a WFM concern - the WFM just needs to know there's a deployment target (the cluster) and what it can run. Node-level identity, vendor, lifecycle - that's all DFM territory (post-GA). The PR currently surfaces device identity through WFM artifacts: deviceId in deployment metadata, the capabilities endpoint keyed on deviceId, the Gateway role naming child devices. When the DFM lands and we want to add device-level visibility, we'd ideally do that without having to pull device identity through every WFM interaction. One framing that would help: separate "what the WFM addresses" from "what counts as a device." The WFM-side concept becomes a "deployment target," orthogonal to whatever the DFM tracks. The WFM addresses targets; the DFM (when it arrives) tracks devices; the two surfaces stay independent.

Sketch of the alternative framing

  • WFM Client = protocol participant. Has clientId, X.509 identity. Speaks for one or more deployment targets.
  • Deployment target = an execution surface the WFM can address. Has a targetId unique per client. Has its own capabilities document.
  • Target capabilities are scoped to deployment concerns: supported runtimes, deployment types, resources. Device-shaped fields (vendor, modelNumber, serialNumber) move to the DFM; they describe what the device is, not what it can host.
  • What "resources" means depends on what schedules at this target. A standalone device or a gateway-managed sub-device reports concrete hardware (camera, GPU, ...) because Margo schedules at that level. For a cluster target, it probably makes less sense to report hardware at the cluster level. In the cluster case, kube-scheduler can handle node-specific / hardware-aware placement via the chart's own affinity rules. Margo doesn't need to know which cluster node has the camera - that's Kubernetes' job.
  • Target identity inherits from the client. No separate certs; the client is the trust boundary.
  • Deployment routing = (clientId, targetId) pair. clientId stays where it is today. targetId lives in a deployment entry of the bundle manifest - not in the ApplicationDeployment YAML, which stays target-agnostic. One bundle manifest carries routing for all of a client's deployments; a multi-target client fetches once rather than per target.
  • Delegation ("autonomous placement") = a client-level capability flag, orthogonal to target count. To request delegation for a deployment, the WFM omits targetId from the bundle manifest entry; the client picks among its eligible targets. Single-target clients trivially satisfy this. A multi-target client that hasn't reported delegation capability must receive an explicit targetId.
  • No hierarchy in targetId. Flat IDs unique per client. "Parent/child" is the client's internal concern, not protocol-visible.
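
One way to read the routing and delegation bullets above, as a rough Python sketch. The names `WfmClient`, `supports_delegation`, and `resolve_target` are hypothetical and not part of any Margo artifact, and the "take the first target" placement policy is only a placeholder assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WfmClient:
    """Hypothetical model of a WFM client that speaks for one or more targets."""
    client_id: str
    target_ids: List[str]               # flat IDs, unique per client
    supports_delegation: bool = False   # client-level "autonomous placement" flag


def resolve_target(client: WfmClient, target_id: Optional[str]) -> str:
    """Pick the deployment target for one bundle manifest entry."""
    if target_id is not None:
        if target_id not in client.target_ids:
            raise ValueError(f"unknown targetId {target_id!r} for client {client.client_id!r}")
        return target_id
    # targetId omitted: the WFM is requesting delegation.
    if len(client.target_ids) == 1:
        return client.target_ids[0]     # single-target clients trivially satisfy this
    if not client.supports_delegation:
        raise ValueError("multi-target client without delegation needs an explicit targetId")
    # Placeholder placement policy (assumption): take the first target.
    return client.target_ids[0]
```

With this shape, a standalone device or cluster is simply a client whose `target_ids` has one entry, and a see-thru gateway is a client whose list has several.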

How this covers the cases

| Case | clientId | Targets (WFM) | Devices (DFM, post-GA) |
|------|----------|---------------|------------------------|
| Standalone Device | 1 | 1 (the device) | 1 (the device) |
| Standalone Cluster | 1 | 1 (the cluster) | 1 (the node) |
| Multi-node cluster (when it arrives) | 1 | 1 (the cluster) | N (one per node) |
| See-thru gateway | 1 | N+1 or N (gw + children) | 1 (just the gateway) |
| Opaque gateway | 1 | 1 (the gateway) | 1 (just the gateway) |

Clusters don't need special-casing - a cluster is just "1 target." Gateways become "clients with >1 targets." The opaque/see-thru distinction collapses into "how many targets do you expose."

If there's interest, I'm happy to draft this as a SUP (post-PlugFest 2 ;)).

Comment on lines 1 to +184
@@ -9,24 +9,28 @@ To ensure the WFM is kept up to date, the device's client MUST send updated capa
## Route and HTTP Methods

```https
POST /api/v1/clients/{clientId}/capabilities
PUT /api/v1/clients/{clientId}/capabilities
POST /api/v1/clients/{clientId}/capabilities/{deviceId}
PUT /api/v1/clients/{clientId}/capabilities/{deviceId}
DELETE /api/v1/clients/{clientId}/capabilities/{deviceId}
```

### Route Parameters

|Parameter | Type | Required? | Description|
|----------|------|-----------|------------|
| {clientId} | string | Y | The unique identifier of the (device) client registered with the WFM during onboarding. |
| {deviceId} | string | Y | The unique identifier of the device reporting the capabilities. <br/>It must have the following format: "{id}[/{id}[/{id}...]]". The top-level `id` is required and must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3). If reporting capabilities for a child device, the subsequent `id`s are required and must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3). <br/>Using multiple ids in the endpoint does not register multiple devices in a single request, but indicates a hierarchy of devices, with a parent/child relationship. |
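
As an illustration of the `{deviceId}` format only (not part of the specification), the rule could be checked along these lines. The helper names `is_valid_device_id` and `parent_of` are hypothetical; the character set is the RFC 3986 section 2.3 unreserved set.

```python
import re
from typing import Optional

# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
_SEGMENT = r"[A-Za-z0-9\-._~]+"
_DEVICE_ID = re.compile(rf"{_SEGMENT}(?:/{_SEGMENT})*")


def is_valid_device_id(device_id: str) -> bool:
    """Return True if device_id matches "{id}[/{id}[/{id}...]]"."""
    return _DEVICE_ID.fullmatch(device_id) is not None


def parent_of(device_id: str) -> Optional[str]:
    """Return the parent deviceId in the hierarchy, or None for a top-level id."""
    head, sep, _ = device_id.rpartition("/")
    return head if sep else None
```

For example, `is_valid_device_id("gateway1/deviceA")` accepts a child device id, while an empty segment such as `"gateway1//deviceA"` is rejected.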

### Response Codes

| Code | Description |
|------|-------------|
| 201 Created | The device capabilities document was added, or updated, successfully. |
| 204 No Content | The device capabilities document was deleted successfully. |
| 400 Bad Request | Missing or invalid content-digest header. Ensure the SHA256 hash of the base64-encoded payload is included. |
| 401 Unauthorized | Signature verification failed. Ensure you are signing with the correct X.509 private key. |
| 403 Forbidden | Client certificate is not trusted or has been revoked. |
| 404 Not Found | POST, PUT: No client with the given `clientId` was found. <br/> DELETE: No client with the given `clientId` was found or no device with the given `deviceId` was found for the client. |
| 422 Unprocessable Content | Request body includes a semantic error. |

## Request Body Attributes
@@ -41,12 +45,12 @@ PUT /api/v1/clients/{clientId}/capabilities

| Field | Type | Required? | Description |
|-----------------|-----------------|-----------------|-----------------|
| id | string | Y | Unique deviceID assigned to the device via the Device Owner.|
| id | string | Y | Unique deviceID assigned to the device via the Device Owner. It must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3) plus the path separator (i.e. '/'). In case of a device behind a gateway, the id field takes the form of a path with the id of the parent gateway, the id of the child device, and the ids of any intermediate devices, i.e., "{gatewayId}/[{intermediateDeviceId}/.../]{deviceId}". |
| vendor | string | Y | Defines the device vendor.|
| modelNumber | string | Y | Defines the model number of the device.|
| serialNumber | string | Y | Defines the serial number of the device.|
| roles | []string | Y | Element that defines the device role it can provide to the Margo environment. MUST be one of the following: Standalone Cluster, Cluster Leader, or Standalone Device |
| resources | []Resource | Y | Element that defines the device's resources available to the application deployed on the device. See the [Resource Fields](#resources-attributes) section below. |
| roles | []string | Y | Element that defines the device role it can provide to the Margo environment. MUST be one of the following: Standalone Cluster, Cluster Leader, Standalone Device, or Gateway |
| resources | []Resource | * | Element that defines the device's resources available to the application deployed on the device. See the [Resource Fields](#resources-attributes) section below. <br/> * The element is required if the device has any of the following roles: Standalone Cluster, Cluster Leader, Standalone Device. |
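
A minimal sketch of how the conditional `resources` rule in the table above might be checked. It covers only `roles` and `resources`, treats the capabilities document as a plain dict, and the function name `validate_roles_and_resources` is hypothetical.

```python
from typing import List

ALLOWED_ROLES = {"Standalone Cluster", "Cluster Leader", "Standalone Device", "Gateway"}
# Roles for which the table marks `resources` as required.
HOSTING_ROLES = ALLOWED_ROLES - {"Gateway"}


def validate_roles_and_resources(doc: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem found."""
    errors: List[str] = []
    if "roles" not in doc:
        return ["roles is required"]
    roles = doc["roles"]
    for role in roles:
        if role not in ALLOWED_ROLES:
            errors.append(f"unknown role: {role}")
    if HOSTING_ROLES.intersection(roles) and "resources" not in doc:
        errors.append("resources is required for hosting roles")
    return errors
```

Under this reading, a document with only the Gateway role may omit `resources`, while a gateway that also reports "Standalone Device" may not.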

### Resources Attributes
Resources of the specific device being reported to the WFM. Used to match against the required resources defined in the application description.
@@ -161,4 +165,20 @@ These enumerations are used as vocabularies for attribute values of the `DeviceC
}
}
}
``` No newline at end of file
```

## Gateway considerations

### Opaque gateways

Opaque gateways MUST report the combined capabilities of all the devices they connect to the WFM.
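
To make the aggregation rule concrete, here is a small sketch for two identical children (one arm64 CPU with 2 cores, 5 GB of memory, 32 GB of storage, 1 ethernet interface each). The `ChildCapabilities` type and the output keys are hypothetical and do not follow the actual `Resource` schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ChildCapabilities:
    """Hypothetical, simplified view of one child device's resources."""
    cpu_arch: str
    cpu_cores: int
    memory_gb: int
    storage_gb: int
    ethernet_interfaces: int


def aggregate(children: List[ChildCapabilities]) -> Dict[str, object]:
    """Combine child resources into the single view an opaque gateway reports."""
    return {
        # One CPU entry per child; cores are not merged across devices.
        "cpus": [{"architecture": c.cpu_arch, "cores": c.cpu_cores} for c in children],
        "memory_gb": sum(c.memory_gb for c in children),
        "storage_gb": sum(c.storage_gb for c in children),
        "ethernet_interfaces": sum(c.ethernet_interfaces for c in children),
    }


child = ChildCapabilities("arm64", 2, 5, 32, 1)
combined = aggregate([child, child])
```

Here `combined` holds two arm64 CPU entries with 2 cores each, 10 GB of memory, 64 GB of storage, and 2 ethernet interfaces.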

> Example: An opaque gateway has two child-devices. Each child-device has an ARM64 processor with 2 cores, 5 GB of memory, 32 GB of storage, and 1 ethernet interface. The gateway will report capabilities of 2 CPUs (arm64) with 2 cores each, 10 GB of memory, 64 GB of storage, and 2 ethernet interfaces. In addition since the gateway can deploy compose applications on its child-devices it will report the role of "standalone device".

### See-thru gateways

See-thru gateways MUST report their capabilities and the capabilities of each device they connect to the WFM. This is done by calling the `device capabilities` endpoint for the gateway itself and for each device behind the gateway. The `deviceId` in the endpoint is used to indicate the hierarchy of devices, with a parent/child relationship. For example, if a see-thru gateway with `deviceId` "gateway1" connects two devices with `deviceId` "deviceA" and "deviceB", the gateway would call the `device capabilities` endpoint three times with the following `deviceId`s: "gateway1", "gateway1/deviceA", and "gateway1/deviceB".
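
The three calls in the example above can be sketched as follows. The function name `capability_routes` and the client id `client-1` are hypothetical, and the sketch assumes the hierarchical `deviceId` is placed in the path as-is.

```python
from typing import List


def capability_routes(client_id: str, gateway_id: str, child_ids: List[str]) -> List[str]:
    """List the capabilities routes a see-thru gateway would call, gateway first."""
    device_ids = [gateway_id] + [f"{gateway_id}/{child}" for child in child_ids]
    return [f"/api/v1/clients/{client_id}/capabilities/{d}" for d in device_ids]


routes = capability_routes("client-1", "gateway1", ["deviceA", "deviceB"])
```

This yields one route for `gateway1` itself and one each for `gateway1/deviceA` and `gateway1/deviceB`.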

When reporting its own capabilities, a see-thru gateway MUST report the role "Gateway".

If a see-thru gateway is capable of hosting edge applications, it MUST report the corresponding role(s) (i.e., "Standalone Device", "Standalone Cluster", and/or "Cluster Leader") and the resources available for these deployments.
Contributor

The PR carries forward the SUP's distinction between "opaque gateway" and "see-thru gateway" into the MUST rules. Reading through, I think only see-thru really needs to live there - and even then, mostly as a name for an underlying mechanism.

From the WFM's side, an opaque gateway looks identical to a regular device: no Gateway role, no hierarchical deviceId, no special error codes, ... The "Opaque gateways MUST report the combined capabilities..." boils down to "report what you can actually offer", which any device does anyway. And since the WFM can't tell an opaque gateway from a regular device, there's no way the spec could enforce that rule even if it wanted to. It's good guidance for someone building an aggregation device, but it doesn't really have a job in the MUST rules. (Side note: the example uses Standalone Device, but the rules don't seem to stop an opaque gateway from reporting Gateway instead - so even the boundary of the category isn't clear to me from the text.)

See-thru is different. It does point at something real. But that "something real" is the combination of mechanisms already in this PR: the Gateway role plus hierarchical deviceId. The name is a convenient handle for that combination, but it's the mechanisms themselves doing the work.

What I'd suggest:

  • Drop opaque from the MUST rules, maybe just provide an informative note such as: "A device may aggregate several sub-devices behind it and report itself as a single Margo device."
  • Rewrite the see-thru MUSTs to point at the mechanism directly: "A WFM client reporting the Gateway role MUST..." You could keep "see-thru gateway" as an informal name in the spec (as a useful shorthand) but don't pin protocol rules to that term.

Copy link
Copy Markdown

I think the key is addressability. In the case of single node, cluster, and opaque I am addressing the target directly. In the case of transparent I am targeting the leaf device via the gateway. That means I have to consider the leaf device the same way I would any other target, but then have to add in the targeting parameters associated with the gateway it must communicate through.

For example, I can target a leaf device of a gateway just like any other device, but what happens if the gateway does not support the particular constructs that must flow through it for "starting the camera"? If we assume it is "just flow through", that would devalue the gateway, making it not much more than a proxy server that security teams may not be happy with.

5 participants