@allisonwang-db commented Dec 8, 2025

Implement a PySpark Custom Data Source to write event data to the Meta Conversions API (CAPI). This enables users to send server-side events directly to Meta for ad optimization and measurement.

Architecture

The implementation uses the Python Data Source API to define a write-only data source.

Components

  1. MetaCapiDataSource: The entry point, responsible for defining the name (meta_capi) and creating the writer.
  2. MetaCapiWriter: Handles the execution of write operations.
    • Validates configuration (Access Token, Pixel ID).
    • Batches records (Meta CAPI supports up to 1000 events per request).
    • Transforms Spark Rows into CAPI-compliant JSON payloads.
    • Sends POST requests to the Graph API.
    • Handles responses and errors.
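The batching step in MetaCapiWriter could be sketched as below. This is a minimal illustration, not the code in this PR: the helper name iter_batches and the constant MAX_EVENTS_PER_REQUEST are hypothetical, and the function works on any iterable so it could be fed the row iterator a writer receives for a partition.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Limit taken from the component list above (up to 1000 events per request).
MAX_EVENTS_PER_REQUEST = 1000


def iter_batches(
    rows: Iterable[T], batch_size: int = MAX_EVENTS_PER_REQUEST
) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items (hypothetical helper)."""
    if not 1 <= batch_size <= MAX_EVENTS_PER_REQUEST:
        raise ValueError(
            f"batch_size must be between 1 and {MAX_EVENTS_PER_REQUEST}, got {batch_size}"
        )
    iterator = iter(rows)
    while True:
        # islice consumes lazily, so a large partition is never fully materialized.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because the helper is lazy, only one batch is held in memory at a time, which matters when a partition contains many more than 1000 rows.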

Configuration Options

The data source will support the following options via .option():

  • access_token (Required): Meta System User Access Token.
  • pixel_id (Required): The Meta Pixel ID (Dataset ID).
  • api_version (Optional): Graph API version (default: v19.0).
  • batch_size (Optional): Number of events per API request (default: 1000, max is 1000).
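One way to validate these options is sketched below. The CapiOptions dataclass and parse_options function are hypothetical names used only for illustration; the sketch assumes option values may arrive as strings and applies the defaults and limits listed above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CapiOptions:
    """Validated writer configuration (hypothetical container)."""
    access_token: str
    pixel_id: str
    api_version: str = "v19.0"
    batch_size: int = 1000


def parse_options(options: Mapping[str, str]) -> CapiOptions:
    """Check required options and apply the documented defaults."""
    missing = [key for key in ("access_token", "pixel_id") if not options.get(key)]
    if missing:
        raise ValueError(f"Missing required option(s): {', '.join(missing)}")

    # Assumption: values may be strings, so batch_size is converted explicitly.
    batch_size = int(options.get("batch_size", 1000))
    if not 1 <= batch_size <= 1000:
        raise ValueError(f"batch_size must be between 1 and 1000, got {batch_size}")

    return CapiOptions(
        access_token=options["access_token"],
        pixel_id=options["pixel_id"],
        api_version=options.get("api_version", "v19.0"),
        batch_size=batch_size,
    )
```

Failing fast here surfaces a missing token or an oversized batch before any request is sent.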

Schema & Data Mapping

The data source expects the input DataFrame to contain columns that map to the Meta CAPI Event structure.

To improve usability, the writer will support two modes:

  1. Structured Mode: Users provide columns matching the API structure (e.g., a user_data struct column, custom_data struct column).
  2. Flat Mode (optional, auto-detected): If the user_data struct column is missing, the writer looks for flat columns with specific names and constructs the nested structure.
    • email -> user_data.em (nice to have: apply SHA-256 if the value is not already hashed)
    • phone -> user_data.ph
    • client_ip_address -> user_data.client_ip_address
    • event_name -> event_name
    • event_time -> event_time (converts timestamp to Unix integer)
    • value -> custom_data.value
    • currency -> custom_data.currency

Decision: For the initial implementation, we will prioritize Structured Mode correctness but add basic Flat Mode mapping for common fields (email, event_name, event_time, value, currency) to simplify the user experience.
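A rough sketch of the basic Flat Mode mapping for the five common fields follows. All function names are hypothetical, the input is assumed to be a plain dict (for example the result of Row.asDict()), and two behaviors are assumptions rather than decisions made in this PR: a value is treated as already hashed when it is 64 lowercase hex characters, and naive datetimes are interpreted as UTC.

```python
import hashlib
import re
from datetime import datetime, timezone
from typing import Any, Dict, Mapping

# Assumption: a 64-character lowercase hex string is an existing SHA-256 digest.
_SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def hash_email(email: str) -> str:
    """Trim and lowercase the email, then SHA-256 it unless already hashed."""
    normalized = email.strip().lower()
    if _SHA256_HEX.fullmatch(normalized):
        return normalized
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def to_unix_seconds(value: Any) -> int:
    """Convert a datetime (or numeric timestamp) to an integer Unix time."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive timestamps are treated as UTC.
            value = value.replace(tzinfo=timezone.utc)
        return int(value.timestamp())
    return int(value)


def flat_row_to_event(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Build a nested CAPI event from flat columns (hypothetical helper)."""
    event: Dict[str, Any] = {
        "event_name": row["event_name"],
        "event_time": to_unix_seconds(row["event_time"]),
    }
    if row.get("email"):
        event["user_data"] = {"em": [hash_email(row["email"])]}
    custom_data = {
        key: row[key] for key in ("value", "currency") if row.get(key) is not None
    }
    if custom_data:
        event["custom_data"] = custom_data
    return event
```

Optional groups are omitted when their source columns are null, so the payload does not carry empty user_data or custom_data objects.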

API Details

  • Endpoint: https://graph.facebook.com/{api_version}/{pixel_id}/events
  • Method: POST
  • Headers: Content-Type: application/json
  • Payload:
    {
      "access_token": "...",
      "data": [
        {
          "event_name": "Purchase",
          "event_time": 1698765432,
          "action_source": "website",
          "user_data": {
            "em": ["7b..."],
            "ph": ["..."]
          },
          "custom_data": {
            "currency": "USD",
            "value": 100.0
          }
        }
      ]
    }
    Note: access_token can be passed either as a query parameter or in the request body. The payload above passes it in the body.
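The request described above could be assembled with the standard library as in the sketch below. The function name build_events_request is hypothetical, the access token is placed in the body to match the payload example, and the sketch only constructs the request without sending it; the actual writer may use a different HTTP client.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_events_request(
    access_token: str,
    pixel_id: str,
    events: List[Dict[str, Any]],
    api_version: str = "v19.0",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for one batch of events."""
    url = f"https://graph.facebook.com/{api_version}/{pixel_id}/events"
    # The token travels in the JSON body, mirroring the payload example.
    body = json.dumps({"access_token": access_token, "data": events}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen, with urllib.error.HTTPError caught to inspect error responses from the API.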
