Copilot AI commented Sep 6, 2025

This PR adds support for configuring Pandera schemas on Dataset Questions for data validation. A Pandera schema lets users define validation rules per question type (TextQuestion, LabelQuestion, MultiLabelQuestion, RankingQuestion) to enforce data quality in their datasets.

Overview

The implementation supports both DataFrameSchema and SeriesSchema configurations, stored as JSON in the metadata field of the questions table. Users can configure schemas through a visual interface or by editing the JSON directly, with real-time validation feedback.

Backend Changes

Database Schema:

  • Added metadata JSON column to questions table via Alembic migration
  • Schemas stored under pandera_schema key in question metadata
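
As a rough illustration of the storage layout described above, the following TypeScript sketch models a question's metadata as a plain JSON object with the schema under the `pandera_schema` key. Only that key name comes from this PR description; the `QuestionMetadata` type and both helper functions are hypothetical names chosen for the example, and the actual backend code may be organized differently.

```typescript
// Hypothetical model of the JSON stored in the questions.metadata column.
// Only the "pandera_schema" key is named in the PR description.
type QuestionMetadata = Record<string, unknown>;

const PANDERA_SCHEMA_KEY = "pandera_schema";

// Return a copy of the metadata with the schema stored under the key.
// Other metadata keys are preserved; a null metadata value is treated as empty.
function withPanderaSchema(
  metadata: QuestionMetadata | null,
  schema: Record<string, unknown>
): QuestionMetadata {
  return { ...(metadata ?? {}), [PANDERA_SCHEMA_KEY]: schema };
}

// Return a copy of the metadata with the schema removed.
function withoutPanderaSchema(metadata: QuestionMetadata | null): QuestionMetadata {
  const copy: QuestionMetadata = { ...(metadata ?? {}) };
  delete copy[PANDERA_SCHEMA_KEY];
  return copy;
}
```

Keeping the schema under a single key means other metadata entries are left untouched when a schema is added or removed.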

API Extensions:

  • Extended QuestionCreate and QuestionUpdate schemas with optional pandera_schema field
  • Added utilities to serialize and deserialize Pandera schemas
  • Implemented validation logic to ensure only valid Pandera schemas are accepted

Context Layer:

  • Updated question creation/update contexts to handle metadata storage
  • Added pandera_schema property to Question model for easy access
  • Maintained full backward compatibility with existing questions
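
The backward-compatibility point can be sketched as a read accessor that returns `null` whenever a question has no metadata or no schema. This is a TypeScript sketch only; `QuestionRecord` and `getPanderaSchema` are hypothetical names, and the `pandera_schema` property in the PR lives on the backend Question model, whose code is not shown here.

```typescript
// Hypothetical minimal question record; legacy rows may have no metadata at all.
interface QuestionRecord {
  name: string;
  metadata?: Record<string, unknown> | null;
}

// Read the schema if present, otherwise return null so that questions
// created before this feature keep working unchanged.
function getPanderaSchema(question: QuestionRecord): Record<string, unknown> | null {
  const value = question.metadata?.["pandera_schema"];
  if (typeof value === "object" && value !== null && !Array.isArray(value)) {
    return value as Record<string, unknown>;
  }
  return null;
}
```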

Frontend Changes

Question Entity:

  • Extended Question domain entity with Pandera schema support
  • Added schema management methods (setPanderaSchema, hasPanderaSchema, etc.)
  • Integrated schema state into question modification tracking
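
One way the schema methods and modification tracking could fit together is sketched below. The method names `setPanderaSchema` and `hasPanderaSchema` come from this description; the `QuestionSketch` class, `isModified`, and `markSaved` are assumptions made for illustration and are not taken from the PR's code.

```typescript
type PanderaSchema = Record<string, unknown>;

// Hypothetical stand-in for the Question domain entity.
class QuestionSketch {
  private current: PanderaSchema | null;
  private savedSnapshot: string;

  constructor(initial: PanderaSchema | null = null) {
    this.current = initial;
    this.savedSnapshot = JSON.stringify(initial);
  }

  setPanderaSchema(schema: PanderaSchema | null): void {
    this.current = schema;
  }

  hasPanderaSchema(): boolean {
    return this.current !== null;
  }

  // Compare against the last saved snapshot. JSON.stringify is sensitive to
  // key order, so reordered keys count as a modification in this sketch.
  isModified(): boolean {
    return JSON.stringify(this.current) !== this.savedSnapshot;
  }

  // Record the current schema as the saved state.
  markSaved(): void {
    this.savedSnapshot = JSON.stringify(this.current);
  }
}
```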

UI Components:

  • DatasetConfigurationPandera: Main configuration toggle with schema type selection
  • DatasetConfigurationPanderaDataFrame: Visual column configuration with data types, nullable/unique options
  • DatasetConfigurationPanderaSeries: Series validation configuration interface
  • JSON editor with real-time validation and error feedback
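
The JSON editor's real-time feedback presumably starts by parsing the editor text on each change. A minimal sketch of that first step, assuming a hypothetical `parseSchemaText` helper that turns parse failures into a message the UI can display:

```typescript
type ParseResult =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; error: string };

// Parse editor text and report a readable error instead of throwing.
function parseSchemaText(text: string): ParseResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch (e) {
    const detail = e instanceof Error ? e.message : String(e);
    return { ok: false, error: `Invalid JSON: ${detail}` };
  }
  // A schema must be a JSON object, not an array, string, number, or null.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "Schema must be a JSON object" };
  }
  return { ok: true, value: parsed as Record<string, unknown> };
}
```

A component could call this on every edit and show `error` inline while `ok` is false.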

Type Safety:

  • Comprehensive TypeScript interfaces for Pandera schemas
  • Client-side validation utilities with proper error handling
  • Full internationalization support
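
The interfaces and client-side validation could look roughly like the sketch below. The DataFrameSchema fields (`schema_type`, `columns`, `dtype`, `nullable`, `unique`, `strict`) follow the usage example in this description; the SeriesSchema fields and all type and function names are assumptions, since the PR's actual interfaces are not shown here.

```typescript
interface ColumnSpec {
  dtype: string;
  nullable?: boolean;
  unique?: boolean;
}

interface DataFrameSchemaConfig {
  schema_type: "DataFrameSchema";
  columns: Record<string, ColumnSpec>;
  strict?: boolean;
}

// Assumed shape: the PR description does not list SeriesSchema fields.
interface SeriesSchemaConfig {
  schema_type: "SeriesSchema";
  dtype: string;
  nullable?: boolean;
  unique?: boolean;
}

type PanderaSchemaConfig = DataFrameSchemaConfig | SeriesSchemaConfig;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Collect format errors rather than throwing, so a UI can list them all.
function validateSchemaConfig(input: unknown): string[] {
  if (!isRecord(input)) return ["schema must be an object"];
  const errors: string[] = [];
  const schemaType = input["schema_type"];
  if (schemaType === "DataFrameSchema") {
    const columns = input["columns"];
    if (!isRecord(columns)) {
      errors.push("columns must be an object");
    } else {
      for (const name of Object.keys(columns)) {
        const spec = columns[name];
        if (!isRecord(spec) || typeof spec["dtype"] !== "string") {
          errors.push(`column "${name}" needs a string dtype`);
        }
      }
    }
  } else if (schemaType === "SeriesSchema") {
    if (typeof input["dtype"] !== "string") errors.push("dtype must be a string");
  } else {
    errors.push('schema_type must be "DataFrameSchema" or "SeriesSchema"');
  }
  return errors;
}

// Type guard built on the validator for use at call sites.
function isPanderaSchemaConfig(input: unknown): input is PanderaSchemaConfig {
  return validateSchemaConfig(input).length === 0;
}
```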

Usage Example

// Create a DataFrame schema for table validation
const schema = {
  schema_type: "DataFrameSchema",
  columns: {
    "name": { dtype: "str", nullable: false },
    "score": { dtype: "int64", unique: false }
  },
  strict: true
};

// Schema gets stored in question metadata and used for validation
question.setPanderaSchema(schema);
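
To make the meaning of `nullable`, `unique`, and `strict` in the example concrete, here is a small TypeScript sketch that applies those three rules to in-memory rows. It is an illustration of the rule semantics only, not Pandera itself and not code from this PR: `checkRows` is a hypothetical helper, dtype checking is omitted, and real validation would be performed by Pandera on the Python side.

```typescript
interface ColumnSpec {
  dtype: string;
  nullable?: boolean;
  unique?: boolean;
}

interface DataFrameSchemaConfig {
  schema_type: "DataFrameSchema";
  columns: Record<string, ColumnSpec>;
  strict?: boolean;
}

type Row = Record<string, unknown>;

// Apply nullable, unique, and strict rules; dtype is not checked in this sketch.
function checkRows(schema: DataFrameSchemaConfig, rows: Row[]): string[] {
  const errors: string[] = [];
  for (const name of Object.keys(schema.columns)) {
    const spec = schema.columns[name];
    const seen = new Set<unknown>();
    rows.forEach((row, i) => {
      const value = row[name];
      if (value === null || value === undefined) {
        // Columns are treated as non-nullable unless nullable is true.
        if (!spec.nullable) errors.push(`row ${i}: "${name}" must not be null`);
        return;
      }
      if (spec.unique && seen.has(value)) {
        errors.push(`row ${i}: "${name}" must be unique`);
      }
      seen.add(value);
    });
  }
  if (schema.strict) {
    // strict: columns not declared in the schema are rejected.
    rows.forEach((row, i) => {
      for (const key of Object.keys(row)) {
        if (!Object.prototype.hasOwnProperty.call(schema.columns, key)) {
          errors.push(`row ${i}: unexpected column "${key}"`);
        }
      }
    });
  }
  return errors;
}
```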

Key Features

  • Dual Interface: Both visual configuration and JSON editing capabilities
  • Schema Types: Support for DataFrameSchema (tabular data) and SeriesSchema (single column)
  • Validation: Real-time schema format validation with user-friendly error messages
  • Persistence: Schemas automatically save to database and restore on page reload
  • Backward Compatibility: Existing questions continue to work without modification
  • Internationalization: Full i18n support for all configuration labels

Testing

  • Comprehensive unit tests for Pandera utilities (serialization, validation)
  • Integration tests for question context operations
  • Frontend component validation testing
  • All existing tests continue to pass

This feature enables data scientists and researchers to enforce data quality standards directly within their annotation workflows, ensuring consistent and reliable dataset collection.


Copilot AI and others added 4 commits September 6, 2025 09:02
Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>
…ucture

Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>
Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>
…ity support

Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>
Copilot AI changed the title from "[WIP] add a feature to allow configuring Pandera schema (json) on Dataset's Question. A DataFrameSchema or a Series Schema is associated with a Question entity in the database to specify the validation rules depending on TextQuestion, LabelQuestion, MultiLab..." to "Add Pandera schema configuration support for Dataset Questions" Sep 6, 2025
Copilot AI requested a review from dawn-tran September 6, 2025 09:29