
RFC: Structured Audit Log System #505

@lakhansamani


Phase: 1 — Security Hardening & Enterprise Foundation
Priority: P0 — Critical
Estimated Effort: Medium


Problem Statement

Authorizer only has webhook-based event delivery (7 event types: user.login, user.created, user.signup, user.access_revoked, user.access_enabled, user.deleted, user.deactivated). There is no queryable audit trail. Webhooks are fire-and-forget: if the endpoint is down, events are lost.

Audit trails are a required control under HIPAA and a standard expectation in SOC 2 audits and GDPR accountability. Comparable identity providers (WorkOS, Keycloak, Clerk) already offer structured audit logs.


Current Architecture Context

  • Webhook events defined in internal/constants/webhook_event.go (7 event types)
  • Event dispatch in internal/events/events.go — sends HTTP POST to webhook endpoints with 30s timeout
  • Webhook logs (schemas.WebhookLog) store HTTP status + request/response bodies
  • Events are triggered from GraphQL mutation handlers (internal/graphql/)
  • Storage provider interface in internal/storage/provider.go — all 13+ DB providers implement it
  • No structured audit log schema or query API exists

Proposed Solution

1. AuditLog Schema

New schema: internal/storage/schemas/audit_log.go

type AuditLog struct {
    ID             string `json:"id" gorm:"primaryKey;type:char(36)"`
    // Unix seconds; also the second column of the (organization_id, timestamp) index
    Timestamp      int64  `json:"timestamp" gorm:"autoCreateTime;index:idx_audit_timestamp;index:idx_audit_org,priority:2"`

    // Who performed the action
    ActorID        string `json:"actor_id" gorm:"type:char(36);index:idx_audit_actor"`
    ActorType      string `json:"actor_type" gorm:"type:varchar(30)"`    // user | admin | system | service_account
    ActorEmail     string `json:"actor_email" gorm:"type:varchar(256)"`  // denormalized for query convenience

    // What happened
    Action         string `json:"action" gorm:"type:varchar(100);index:idx_audit_action"`

    // What was affected: (resource_type, resource_id) share one composite index
    ResourceType   string `json:"resource_type" gorm:"type:varchar(50);index:idx_audit_resource,priority:1"`  // user | session | token | webhook | config | role | permission
    ResourceID     string `json:"resource_id" gorm:"type:char(36);index:idx_audit_resource,priority:2"`

    // Request context
    IPAddress      string `json:"ip_address" gorm:"type:varchar(45)"`
    UserAgent      string `json:"user_agent" gorm:"type:text"`

    // Additional context (JSON)
    Metadata       string `json:"metadata" gorm:"type:text"`  // JSON string: auth method, changed fields, etc.

    // Multi-tenancy: leading column of the (organization_id, timestamp) index
    OrganizationID string `json:"organization_id" gorm:"type:char(36);index:idx_audit_org,priority:1"`
}

Indexes for query performance:

  • (timestamp) — time-range queries, retention cleanup
  • (actor_id) — "what did this user do?"
  • (action) — "show all login failures"
  • (resource_type, resource_id) — "what happened to this resource?"
  • (organization_id, timestamp) — org-scoped audit views

2. Comprehensive Event Types

Expanding from 7 webhook events to 25+ audit event types:

Authentication events:

| Action | Actor Type | Resource Type | When |
|---|---|---|---|
| user.login_success | user | session | Successful login (any method) |
| user.login_failed | system | user | Failed login attempt |
| user.signup | user | user | New user registration |
| user.logout | user | session | User logout |
| user.password_changed | user | user | Password change |
| user.password_reset | user | user | Password reset via token |
| user.email_verified | user | user | Email verification completed |
| user.phone_verified | user | user | Phone verification completed |
| user.mfa_enabled | user | user | MFA/TOTP enabled |
| user.mfa_disabled | user | user | MFA/TOTP disabled |
| user.deactivated | user | user | Self-deactivation |

Admin events:

| Action | Actor Type | Resource Type | When |
|---|---|---|---|
| admin.user_created | admin | user | Admin creates user |
| admin.user_updated | admin | user | Admin updates user |
| admin.user_deleted | admin | user | Admin deletes user |
| admin.access_revoked | admin | user | Admin revokes access |
| admin.access_enabled | admin | user | Admin enables access |
| admin.user_unlocked | admin | user | Admin unlocks locked account |
| admin.role_assigned | admin | user | Admin assigns role |
| admin.role_removed | admin | user | Admin removes role |
| admin.config_changed | admin | config | Admin updates env/config |
| admin.webhook_created | admin | webhook | Webhook created |
| admin.webhook_updated | admin | webhook | Webhook modified |
| admin.webhook_deleted | admin | webhook | Webhook removed |

Token events:

| Action | Actor Type | Resource Type | When |
|---|---|---|---|
| token.issued | system | token | Access/refresh token issued |
| token.refreshed | system | token | Token refreshed |
| token.revoked | user/admin | token | Token explicitly revoked |

Session events:

| Action | Actor Type | Resource Type | When |
|---|---|---|---|
| session.created | system | session | New session created |
| session.terminated | user/admin | session | Session ended |
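To keep handlers and queries from drifting apart on action strings, the tables above could be captured in a single registry. A minimal sketch, assuming a hypothetical auditActionResourceTypes map and ResourceTypeForAction helper (neither exists in Authorizer today) that let a handler reject an unknown action before it is logged:

```go
package main

// auditActionResourceTypes maps every audit action from the tables above to
// the resource type it is recorded against. Hypothetical registry; it could
// live next to internal/constants/webhook_event.go.
var auditActionResourceTypes = map[string]string{
	// Authentication events
	"user.login_success":    "session",
	"user.login_failed":     "user",
	"user.signup":           "user",
	"user.logout":           "session",
	"user.password_changed": "user",
	"user.password_reset":   "user",
	"user.email_verified":   "user",
	"user.phone_verified":   "user",
	"user.mfa_enabled":      "user",
	"user.mfa_disabled":     "user",
	"user.deactivated":      "user",
	// Admin events
	"admin.user_created":    "user",
	"admin.user_updated":    "user",
	"admin.user_deleted":    "user",
	"admin.access_revoked":  "user",
	"admin.access_enabled":  "user",
	"admin.user_unlocked":   "user",
	"admin.role_assigned":   "user",
	"admin.role_removed":    "user",
	"admin.config_changed":  "config",
	"admin.webhook_created": "webhook",
	"admin.webhook_updated": "webhook",
	"admin.webhook_deleted": "webhook",
	// Token events
	"token.issued":    "token",
	"token.refreshed": "token",
	"token.revoked":   "token",
	// Session events
	"session.created":    "session",
	"session.terminated": "session",
}

// ResourceTypeForAction returns the expected resource type for an action and
// whether the action is known, so a typo fails fast instead of being stored.
func ResourceTypeForAction(action string) (string, bool) {
	rt, ok := auditActionResourceTypes[action]
	return rt, ok
}
```

Keeping the resource type next to the action also gives the per-event integration tests in the testing plan a single list to iterate over.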

3. Audit Logger Service

New package: internal/audit/

type Dependencies struct {
    Log     *zerolog.Logger
    Store   storage.Provider
    Config  *config.Config
}

type Provider interface {
    // Log records an audit event
    Log(ctx context.Context, event AuditEvent) error
    // Query retrieves audit logs with filters
    Query(ctx context.Context, filter AuditFilter, pagination *model.Pagination) ([]*schemas.AuditLog, *model.Pagination, error)
}
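The Provider interface takes an AuditFilter that is not defined above. One possible shape, assuming it simply mirrors the AuditLogFilter GraphQL input from section 6, treats empty fields as "no constraint", treats the timestamp bounds as inclusive Unix seconds, and converts to the map form taken by the proposed ListAuditLogs storage method (Validate and ToMap are hypothetical names):

```go
package main

import "errors"

// AuditFilter is a hypothetical Go counterpart of the AuditLogFilter GraphQL
// input; empty strings and zero timestamps mean "no constraint".
type AuditFilter struct {
	ActorID        string
	ActorType      string
	Action         string
	ResourceType   string
	ResourceID     string
	IPAddress      string
	OrganizationID string
	FromTimestamp  int64 // inclusive, unix seconds
	ToTimestamp    int64 // inclusive, unix seconds
}

// Validate rejects an inverted time range before any query is issued.
func (f AuditFilter) Validate() error {
	if f.FromTimestamp != 0 && f.ToTimestamp != 0 && f.FromTimestamp > f.ToTimestamp {
		return errors.New("from_timestamp must not be after to_timestamp")
	}
	return nil
}

// ToMap converts the filter into the map accepted by ListAuditLogs,
// including only the fields that were actually set.
func (f AuditFilter) ToMap() map[string]interface{} {
	m := map[string]interface{}{}
	add := func(key, val string) {
		if val != "" {
			m[key] = val
		}
	}
	add("actor_id", f.ActorID)
	add("actor_type", f.ActorType)
	add("action", f.Action)
	add("resource_type", f.ResourceType)
	add("resource_id", f.ResourceID)
	add("ip_address", f.IPAddress)
	add("organization_id", f.OrganizationID)
	if f.FromTimestamp != 0 {
		m["from_timestamp"] = f.FromTimestamp
	}
	if f.ToTimestamp != 0 {
		m["to_timestamp"] = f.ToTimestamp
	}
	return m
}
```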

type AuditEvent struct {
    ActorID        string
    ActorType      string // user | admin | system | service_account
    ActorEmail     string
    Action         string
    ResourceType   string
    ResourceID     string
    IPAddress      string
    UserAgent      string
    Metadata       map[string]interface{} // serialized to JSON
    OrganizationID string
}

Non-blocking write: Audit logging must not block the request path. Use a buffered channel with a background goroutine flushing to the database:

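type auditProvider struct {
    log       *zerolog.Logger
    store     storage.Provider
    eventChan chan AuditEvent // buffered; capacity from --audit-log-buffer-size (default 1000)
    done      chan struct{}   // closed on shutdown to stop flushLoop
}

func (a *auditProvider) Log(ctx context.Context, event AuditEvent) error {
    select {
    case a.eventChan <- event:
        return nil
    default:
        // Channel full: log a warning and drop the event rather than block the request
        a.log.Warn().Str("action", event.Action).Msg("audit log buffer full, event dropped")
        return fmt.Errorf("audit buffer full")
    }
}

// Background goroutine batches writes: flush at 100 events or every second,
// whichever comes first, and drain the buffer on shutdown.
func (a *auditProvider) flushLoop() {
    const maxBatch = 100
    batch := make([]AuditEvent, 0, maxBatch)
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()

    flush := func() {
        if len(batch) > 0 {
            a.writeBatch(batch) // must not retain batch; the slice is reused
            batch = batch[:0]
        }
    }

    for {
        select {
        case event := <-a.eventChan:
            batch = append(batch, event)
            if len(batch) >= maxBatch {
                flush()
            }
        case <-ticker.C:
            flush()
        case <-a.done:
            // Drain what is still buffered so a clean shutdown loses no events
            for {
                select {
                case event := <-a.eventChan:
                    batch = append(batch, event)
                    if len(batch) >= maxBatch {
                        flush()
                    }
                default:
                    flush()
                    return
                }
            }
        }
    }
}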

4. Integration Points

Wrap existing event dispatch: internal/events/events.go currently fires webhooks. Extend it to also write audit logs:

// In each GraphQL handler, after the action:
auditProvider.Log(ctx, audit.AuditEvent{
    ActorID:      user.ID,
    ActorType:    "user",
    ActorEmail:   user.Email,
    Action:       "user.login_success",
    ResourceType: "session",
    ResourceID:   sessionID,
    IPAddress:    ginCtx.ClientIP(),
    UserAgent:    ginCtx.Request.UserAgent(),
    Metadata:     map[string]interface{}{"method": "password", "mfa": false},
})

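Helper to extract request context and reduce boilerplate:

func EventFromContext(ctx context.Context, action string, resourceType string, resourceID string) AuditEvent {
    // TODO: extract IP, user agent and actor from the Gin context stored in ctx
    return AuditEvent{Action: action, ResourceType: resourceType, ResourceID: resourceID}
}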

5. Storage Interface Methods

// AddAuditLog writes an audit log entry
AddAuditLog(ctx context.Context, log *schemas.AuditLog) error
// AddAuditLogs batch-writes audit log entries
AddAuditLogs(ctx context.Context, logs []*schemas.AuditLog) error
// ListAuditLogs queries audit logs with filters and pagination
ListAuditLogs(ctx context.Context, filter map[string]interface{}, pagination *model.Pagination) ([]*schemas.AuditLog, *model.Pagination, error)
// DeleteAuditLogsBefore removes logs older than a timestamp (retention)
DeleteAuditLogsBefore(ctx context.Context, before int64) error
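Because ListAuditLogs takes a free-form map, SQL-backed providers should accept only known keys rather than interpolating arbitrary ones into a query. A sketch of a hypothetical buildAuditLogQuery helper for those providers, assuming inclusive from_timestamp/to_timestamp bounds and the ? placeholder style of MySQL and SQLite drivers (PostgreSQL drivers expect $1, $2, ...):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// auditFilterColumns whitelists equality filters; only these names reach the SQL text.
var auditFilterColumns = map[string]bool{
	"actor_id": true, "actor_type": true, "action": true, "resource_type": true,
	"resource_id": true, "ip_address": true, "organization_id": true,
}

// buildAuditLogQuery turns a filter map into a parameterized query, newest first.
func buildAuditLogQuery(filter map[string]interface{}, limit, offset int64) (string, []interface{}, error) {
	keys := make([]string, 0, len(filter))
	for k := range filter {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic SQL text for the same filter

	var conds []string
	var args []interface{}
	for _, k := range keys {
		switch {
		case k == "from_timestamp":
			conds = append(conds, "timestamp >= ?")
		case k == "to_timestamp":
			conds = append(conds, "timestamp <= ?")
		case auditFilterColumns[k]:
			conds = append(conds, k+" = ?")
		default:
			return "", nil, fmt.Errorf("unsupported audit log filter %q", k)
		}
		args = append(args, filter[k]) // values are always bound, never spliced
	}

	q := "SELECT * FROM audit_logs"
	if len(conds) > 0 {
		q += " WHERE " + strings.Join(conds, " AND ")
	}
	q += " ORDER BY timestamp DESC LIMIT ? OFFSET ?"
	return q, append(args, limit, offset), nil
}
```

Newest-first ordering on timestamp lines up with the timestamp indexes listed in section 1; for org-scoped views, the (organization_id, timestamp) index can serve both the filter and the sort.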

6. GraphQL Query API

type AuditLog {
    id: ID!
    timestamp: Int64!
    actor_id: String
    actor_type: String!
    actor_email: String
    action: String!
    resource_type: String
    resource_id: String
    ip_address: String
    user_agent: String
    metadata: Map
    organization_id: String
}

input AuditLogFilter {
    actor_id: String
    actor_type: String
    action: String
    resource_type: String
    resource_id: String
    ip_address: String
    organization_id: String
    from_timestamp: Int64
    to_timestamp: Int64
}

type AuditLogs {
    audit_logs: [AuditLog!]!
    pagination: Pagination!
}

type Query {
    # Admin-only query
    _audit_logs(params: AuditLogFilter, pagination: PaginatedInput): AuditLogs!
}

7. Immutability & Retention

  • Immutable: No update/delete mutations exposed via API. Logs can only be removed by the retention policy.
  • Retention: --audit-log-retention-days=90 (default 90 days)
  • Background goroutine runs daily: DELETE FROM audit_logs WHERE timestamp < (now - retention_days * 86400), with timestamp and now in Unix seconds; skipped when retention is 0
  • Same cleanup pattern as LoginAttempt retention from RFC: Rate Limiting & Brute Force Protection #501

CLI Configuration Flags

--enable-audit-log=true                    # Enable/disable audit logging
--audit-log-retention-days=90              # Days to retain audit logs (0 = forever)
--audit-log-buffer-size=1000               # Internal buffer size for async writes
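A sketch of how the three flags could be parsed and validated, using Go's standard flag package for illustration only (the real wiring in cmd/root.go may use a different CLI library). The defaults are taken from the values shown above, and parseAuditFlags and auditConfig are hypothetical names:

```go
package main

import (
	"errors"
	"flag"
	"io"
)

// auditConfig holds the three proposed audit-log settings.
type auditConfig struct {
	Enabled       bool
	RetentionDays int
	BufferSize    int
}

// parseAuditFlags parses the flags and rejects values the logger cannot use.
func parseAuditFlags(args []string) (auditConfig, error) {
	var c auditConfig
	fs := flag.NewFlagSet("authorizer", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // report errors via the return value, not stderr
	fs.BoolVar(&c.Enabled, "enable-audit-log", true, "enable/disable audit logging")
	fs.IntVar(&c.RetentionDays, "audit-log-retention-days", 90, "days to retain audit logs (0 = forever)")
	fs.IntVar(&c.BufferSize, "audit-log-buffer-size", 1000, "internal buffer size for async writes")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.RetentionDays < 0 {
		return c, errors.New("--audit-log-retention-days must be >= 0")
	}
	if c.BufferSize <= 0 {
		return c, errors.New("--audit-log-buffer-size must be > 0")
	}
	return c, nil
}
```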

Migration Strategy

  1. Create audit_logs table/collection across all 13+ DB providers
  2. Add composite indexes for all query patterns
  3. Wire audit provider into cmd/root.go initialization (after storage, before HTTP handlers)
  4. Inject into all GraphQL mutation handlers and HTTP handlers
  5. Existing webhook system continues unchanged — audit logs are additive

Testing Plan

  • Unit tests for audit logger buffering and batch flush
  • Integration tests: verify audit log written for each event type
  • Integration tests: query API with various filters
  • Test retention cleanup correctly removes old entries
  • Test immutability — verify no update/delete API exists
  • Load test: verify non-blocking behavior under high event volume
  • Test buffer-full scenario — verify graceful degradation (warning log, not crash)
