Skip to content

Add MS-SQL (SQL Server) support #109

@myyong

Description

@myyong

This issue tracks the work done on the mssql branch to make Datafaker work end-to-end against Microsoft SQL Server databases.

Background

Datafaker was originally written targeting PostgreSQL and DuckDB. MS-SQL differs in several ways that required systematic fixes across the codebase: different SQL dialect (no RANDOM()/LIMIT, no EXTRACT, no STDDEV), mandatory schema qualification when a non-default schema is in use, different type names, identity column behaviour, and stricter rules around foreign-key cascade paths.


Changes by area

Foundation and initial driver support

  • b2f14ab — Add initial MS-SQL support: pyodbc/aioodbc dependencies, async DSN rewriting, schema routing via SQLAlchemy schema_translate_map
  • d872fd9 — Document macOS ODBC driver setup and add .env.example for MS-SQL connection strings
  • fc97b2c — Add TrustServerCertificate=yes to MS-SQL DSN examples

Type system

Schema and foreign keys

Primary keys

Dialect-correct SQL in generators and interactive shell

MS-SQL does not support RANDOM() / LIMIT n or EXTRACT(… FROM …) / STDDEV. Every code path that emits these had to be updated:

Schema qualification for SQL stored in src-stats

Queries written by configure-generators into config.yaml's src-stats section are later executed by make-stats via raw text() — SQLAlchemy's schema_translate_map does not apply. Each code path that writes these strings needed to embed the schema-qualified table name explicitly:

  • 91578da — Fix missing schema qualification in interactive shell SELECT statements (do_peek, do_counts, print_column_data, _get_column_data)
  • 4d03aa3 — Fix unqualified table names in raw SQL during configure-generators (ContinuousLogDistributionGeneratorFactory, MultivariateNormalGeneratorFactory)
  • 173f6da — Fix unqualified table names in src-stats queries written by configure-generators (_get_aggregate_query, PredefinedGenerator.SELECT_AGGREGATE_RE)
  • 1132dc6 — Fix schema-qualified table name in ChoiceGenerator stored queries (closes ChoiceGenerator stores unqualified table name in custom_queries SQL #108): use text(schema_qualified_name(...)) so the compiled SQL string is not bracket-quoted as [mimic100.person]

Tests

Dialect-correctness tests were added alongside each fix. A dedicated test module tests/test_generators_dialect.py covers:

  • DATEPART vs EXTRACT in MimesisDateTimeGenerator
  • STDEV vs STDDEV in Buckets
  • NEWID()/TOP vs RANDOM()/LIMIT in ChoiceGeneratorFactory, CovariateQuery, MissingnessType
  • Schema qualification in Buckets, ContinuousLogDistributionGeneratorFactory, ChoiceGeneratorFactory, _get_aggregate_query, and PredefinedGenerator

tests/test_interactive_dialect.py covers the interactive shell commands (do_peek, do_counts, _get_column_data, print_column_data) for both MS-SQL and PostgreSQL dialects.

Examples

  • 40fc121 — Rename mimic_omop example to omop-mssql
  • 1857e76 — Add omop-postgresql example

Known limitations / deferred work


Summary

The mssql branch adds end-to-end MS-SQL support to Datafaker. The main categories of change were: (1) driver and connection plumbing, (2) SQL dialect translation (RANDOMNEWID, LIMITTOP, EXTRACTDATEPART, STDDEVSTDEV), (3) schema qualification in stored SQL strings that bypass schema_translate_map, and (4) MS-SQL-specific schema/type/FK constraints. All changes are covered by unit tests using mocked engines and dialect instances.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions