Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
190 changes: 190 additions & 0 deletions SWIPs/SWIP-draft_Data_Provenance.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,190 @@
---
SWIP: <to be assigned>
title: Swarm-Based Data Provenance Framework
author: Črt Ahlin (@crtahlin)
discussions-to: https://discord.com/channels/799027393297514537/1239813439136993280
status: WIP
type: Informational
<!-- category (*only required for Standard Track): <Core | Networking | Interface | ERC> -->
created: 2025-03-13
requires (*optional): <SWIP number(s)>
replaces (*optional): <SWIP number(s)>
---

<!--You can leave these HTML comments in your merged SWIP and delete the visible duplicate text guides, they will not appear and may be helpful to refer to if you edit it again. This is the suggested template for new SWIPs. Note that a SWIP number will be assigned by an editor. When opening a pull request to submit your SWIP, please use an abbreviated title in the filename, `SWIP-draft_title_abbrev.md`. The title should be 44 characters or less.-->

## Simple Summary
<!--"If you can't explain it simply, you don't understand it well enough." Provide a simplified and layman-accessible explanation of the SWIP.-->

This SWIP proposes Swarm as a decentralized storage layer for provenance metadata and data. Provenance, the documented history of a dataset's origin and transformations, is increasingly important for regulatory compliance, ethical AI, and data accountability. A command-line toolkit will provide utilities to:
- Upload/Download: Store and retrieve provenance files (in any format) with Swarm reference hashes.
- Metadata Management: Track storage validity (TTL) and extend it via stamp top-ups.

The framework does not enforce specific provenance standards but ensures compatibility by decoupling metadata (structured JSON) from the actual provenance data (stored as arbitrary files). Developers and enterprises retain full control over their data format and privacy measures.

## Abstract
<!--A short (~200 word) description of the technical issue being addressed.-->

Provenance systems require immutable, scalable storage to track data lineage effectively. This SWIP leverages Swarm’s decentralized network to:
- Store Provenance Data: Users upload files in any format (e.g., W3C PROV-JSONLD, DaTA spec, or custom schemas).
- Manage Metadata: A JSON wrapper includes:
- `content_hash` (SHA-256 for matching provenance data across different systems)
- `provenance_standard` (optional field for self-declared standards)
- `data` (Base64-encoded provenance data)
- `stamp_id` (for TTL tracking and renewal)
- Ensure Flexibility: No Swarm-level validation of provenance formats—compatibility is achieved by design.

A prototype toolkit (developed under the DataFund Fellowship) will provide command-line access to Swarm, enabling integration into existing workflows. Privacy and encryption remain optional, allowing users to comply with regulations like GDPR independently.

## Motivation
<!--The motivation is critical for SWIPs that want to change the Swarm protocol. It should clearly explain why the existing protocol specification is inadequate to address the problem that the SWIP solves. SWIP submissions without sufficient motivation may be rejected outright.-->

Data provenance is critical for ethical AI development, regulatory compliance, and ensuring trust in data-driven systems. Existing provenance solutions often face challenges in terms of vendor lock-in, scalability, and privacy. Centralized systems create dependencies and potential single points of failure. Public blockchains, while immutable, can be costly for large-scale data storage and may not adequately address privacy concerns.

Swarm offers a compelling alternative by acting as a trustless, decentralized storage network:

- **Trusted 3rd Party**: Swarm's decentralized architecture serves as a neutral platform for recording provenance, eliminating single points of control.
- **Cost Considerations**: While centralized cloud storage may offer lower costs for simple storage, Swarm provides a more cost-competitive option compared to blockchain-based solutions.
- **Interoperability**: The toolkit is designed to accommodate various provenance standards (e.g., DaTA, W3C PROV) without enforcing a specific format, allowing users to adopt the standard that best suits their needs.

This proposal aims to align with the Data Spaces Support Centre blueprint for a Technical Building Block covering Provenance & Traceability. By enabling easy uploading, downloading, and management of provenance files, the toolkit empowers users to meet emerging regulatory requirements, such as those outlined in the EU AI Act, and to establish trust and accountability in data-driven systems.

## Specification
<!--The technical specification should describe the syntax and semantics of any new feature. The specification should be detailed enough to allow competing, interoperable implementations for the current Swarm platform and future client implementations.-->

### 1. Provenance Record Structure
The provenance record will be stored as a single JSON file containing both metadata and the actual provenance data. The structure is:

```json
{
"content_hash": "sha256:9f86d...a9e",
"provenance_standard": "DaTA v1.0.0",
"encryption": "none", // Optional field (e.g., "aes-256-gcm" if encrypted)
"data": "",
"stamp_id": "0xfe2f...c3a1"
}
```

**Key Fields**:
- `content_hash`: SHA-256 hash of the raw provenance data (before Base64 encoding) for integrity verification.
- `provenance_standard`: Declares the standard used (e.g., DaTA, W3C PROV, or custom).
- `encryption`: Optional field to indicate encryption method (default: `"none"`).
- `data`: Base64-encoded provenance data (actual content in any format).
- `stamp_id`: Swarm stamp ID used for TTL management, purchased based on the size of the JSON file and desired storage duration.

*This structure ensures self-contained provenance records while maintaining compatibility with any standard. The `data` field can store provenance information in formats like JSON, XML, or binary.*

### 2. Toolkit Features
The toolkit interacts with Swarm to manage provenance records via a single JSON file:

- **Upload**:
- Action: Uploads the JSON file to Swarm.
- Workflow:
1. User prepares provenance data in any format (e.g., DaTA spec, W3C PROV).
2. Toolkit generates SHA-256 hash of raw data, encodes it to Base64, and wraps it into the JSON structure.
3. The toolkit calculates the size of the JSON file to determine the required stamp size.
4. The user specifies a desired TTL for storage, or a default value is used.
5. The toolkit purchases a stamp from the Bee node or gateway using available funds.
6. The returned stamp ID is used to upload the JSON file to Swarm.
7. JSON file is uploaded to Swarm via Bee node or gateway.
- Returns: Swarm reference hash for the JSON file.
- *Note: Acquiring funds (e.g., xBZZ for stamp purchase) is out of scope for this toolkit, it is assumed the node has the funds available.*

- **Download**:
- Action: Retrieves the JSON file using its Swarm reference hash.
- Workflow: Toolkit fetches the JSON, decodes the Base64 `data` field, and verifies integrity via `content_hash`.

- **Check TTL**:
- Action: Queries remaining storage validity for the JSON file.
- Workflow: Toolkit uses the `stamp_id` to check TTL via Swarm’s stamp management system.

- **Top-Up**:
- Action: Extends storage validity for the JSON file.
- Workflow: Toolkit tops up the existing `stamp_id` (assumes user has pre-funded their Bee node).
- *Note: Acquiring funds (e.g., xBZZ) is out of scope for this toolkit.*

- **Existence Check**:
- Action: Verifies if the JSON file exists on Swarm.
- Workflow: Toolkit checks Swarm for the reference hash.

### 3. Privacy Controls
- **Optional Encryption**: Users may encrypt the raw provenance data before Base64 encoding. The `encryption` field can declare the method (e.g., `aes-256-gcm`), but key management is left to the user.
- **No AI Screening**: Privacy checks (e.g., PII detection) are deferred to future enhancements or third-party services.

## Rationale
<!--The rationale fleshes out the specification by describing what motivated the design and why particular design decisions were made. It should describe alternate designs that were considered and related work, e.g. how the feature is supported in other languages. The rationale may also provide evidence of consensus within the community, and should discuss important objections or concerns raised during discussion.-->

The design of this SWIP centers on providing a flexible and future-proof solution for storing provenance data on Swarm. The decision to use a single JSON file structure simplifies both the management and retrieval processes, enabling easy integration with existing provenance standards without enforcing a specific one. This approach acknowledges that while standards like the Data & Trust Alliance (DaTA) specification and the W3C PROV standard exist, the market is still evolving, and imposing a single standard could limit adoption.

By storing the actual provenance data as a Base64-encoded string within the JSON structure, we ensure that any file format can be accommodated. This maintains data integrity across different systems and transfer methods, providing users with the freedom to choose the most appropriate format for their specific use case. Optional encryption addresses potential GDPR and privacy concerns, giving users control over their data security while maintaining ease of use.

To enhance the utility of the provenance data and provide Swarm-specific functionality, additional metadata fields are included in the JSON structure. The inclusion of the `stamp_id` enables users to easily check the storage duration of their provenance data and potentially extend it, aligning with Swarm's storage management mechanisms. Additionally, the `content_hash` provides a means to verify data integrity, particularly when the provenance data is stored in systems other than Swarm, allowing users to match the data across different storage locations.

Swarm's stamp-based TTL management system aligns with the network's existing storage incentive mechanism, offering users a familiar way to control storage duration and cost. The toolkit approach simplifies the integration of Swarm storage for provenance use cases and allows for future extensions, such as AI agents for data validation, without modifying core functionality.

Finally, leveraging Swarm's decentralized network as a trusted third party aligns with Data Spaces Support Centre specifications and offers a more scalable and potentially cost-effective solution compared to blockchain-based alternatives. This positions Swarm as a flexible and standards-compatible storage layer for provenance data, catering to emerging market needs while leveraging its unique features.

![Image](assets/swip-x-provenance/Provenance-diagram1.svg "Swarm as trusted third party storage - diagram.")

**Diagram Description:**

The diagram illustrates Swarm acting as a trusted third party for storing Provenance and Traceability (P&T) data. It consists of three main entities:

1. **Consumer:** A data consumer, labeled "Consumer".
2. **Provider:** A data provider, labeled "Provider".
3. **VAS Provider:** Represented as a centralized storage, labeled as "Swarm."

Both the Consumer and the Provider connect to the Swarm Network. The Swarm Network stores P&T data and takes the role of "Value Added Service (VAS) Provider" to allow other services or tooling on top of it. This VAS Provider contains "P&T", to indicate that they are providing Provenance data.

This positions Swarm as a flexible and standards-compatible storage layer for provenance data, catering to emerging market needs while leveraging its unique features.

## Backwards Compatibility
<!--All SWIPs that introduce backwards incompatibilities must include a section describing these incompatibilities and their severity. The SWIP must explain how the author proposes to deal with these incompatibilities. SWIP submissions without a sufficient backwards compatibility treatise may be rejected outright.-->

This proposal does not introduce changes to Swarm’s core functionality or protocols. It leverages existing capabilities such as immutable storage and reference hashes, ensuring full compatibility with current implementations.
It operates on top of the existing Swarm infrastructure and adheres to established file storage and retrieval methods. All existing Swarm tools (like the Bee CLI and Dashboard) remain fully compatible with this additional use case.

## Test Cases
<!--Test cases for an implementation are mandatory for SWIPs that are affecting changes to data and message formats. Other SWIPs can choose to include links to test cases if applicable.-->

Given that this is an informational SWIP, the test cases provided here are conceptual and aim to illustrate how the proposed system would work.

1. **Provenance File Upload and Retrieval**:
- Scenario: A user uploads a JSON file containing provenance metadata and data to Swarm.
- Expected Result: The file is successfully stored on Swarm, and the Swarm reference hash is returned to the user. The user can then retrieve the file using the hash and verify that the content matches the original file.

2. **TTL Check and Storage Extension**:
- Scenario: A user checks the remaining TTL for a provenance file stored on Swarm.
- Expected Result: The toolkit queries the Swarm network and returns the remaining TTL for the associated stamp. The user can then extend the storage duration by topping up the stamp.

These test cases provide a high-level overview of the key functionalities of the proposed system and demonstrate its ability to store, retrieve, and manage provenance data on Swarm.

## Implementation
<!--The implementations must be completed before any SWIP is given status "Final", but it need not be completed before the SWIP is accepted. While there is merit to the approach of reaching consensus on the specification and rationale before writing code, the principle of "rough consensus and running code" is still useful when it comes to resolving many discussions of API details.-->

A prototype toolkit is being developed under the DataFund Fellowship with the following components:

- **Core Functionality**:
- The toolkit will be implemented in Python.
- It will provide command-line access to Swarm for uploading, downloading, and managing provenance files.
- It will support calculation of JSON size and purchase stamps based on user-specified or default TTL values.
- It will support the JSON-based provenance record structure as outlined in the Specification section.
- It will use the Bee client to interact with the Swarm network.

- **Key Components**:
- **Upload Module**: Handles the upload of provenance data to Swarm, including the preparation of the JSON metadata file and the selection of appropriate storage options (e.g., appropriate stamp).
- **Download Module**: Retrieves provenance data from Swarm, verifies data integrity using the content hash, and presents the data to the user.
- **TTL Management Module**: Provides functionality to check the remaining storage duration (TTL) for a provenance file and extend the storage by topping up the associated stamp.

- **Integration Points**:
- The toolkit will interact with a Swarm gateway or a local Bee node, as specified by the user.

- **Future Considerations**:
- The toolkit may incorporate support for additional provenance standards and encryption methods.
- It may also include features for managing complex provenance chains, as outlined in the research document.
- Integration with external services, such as AI agents for data validation and interpretation, and services for attestation and notarization, may be explored in future versions of the toolkit.

This implementation aims to provide a practical and user-friendly solution for storing and managing provenance data on Swarm, while also laying the groundwork for future extensions and integrations.

## Copyright
Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
Loading