You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This issue tracks the work done on the mssql branch to make Datafaker work end-to-end against Microsoft SQL Server databases.
Background
Datafaker was originally written targeting PostgreSQL and DuckDB. MS-SQL differs in several ways that required systematic fixes across the codebase: different SQL dialect (no RANDOM()/LIMIT, no EXTRACT, no STDDEV), mandatory schema qualification when a non-default schema is in use, different type names, identity column behaviour, and stricter rules around foreign-key cascade paths.
Queries written by configure-generators into config.yaml's src-stats section are later executed by make-stats via raw text() — SQLAlchemy's schema_translate_map does not apply. Each code path that writes these strings needed to embed the schema-qualified table name explicitly:
4d03aa3 — Fix unqualified table names in raw SQL during configure-generators (ContinuousLogDistributionGeneratorFactory, MultivariateNormalGeneratorFactory)
173f6da — Fix unqualified table names in src-stats queries written by configure-generators (_get_aggregate_query, PredefinedGenerator.SELECT_AGGREGATE_RE)
Dialect-correctness tests were added alongside each fix. A dedicated test module tests/test_generators_dialect.py covers:
DATEPART vs EXTRACT in MimesisDateTimeGenerator
STDEV vs STDDEV in Buckets
NEWID()/TOP vs RANDOM()/LIMIT in ChoiceGeneratorFactory, CovariateQuery, MissingnessType
Schema qualification in Buckets, ContinuousLogDistributionGeneratorFactory, ChoiceGeneratorFactory, _get_aggregate_query, and PredefinedGenerator
tests/test_interactive_dialect.py covers the interactive shell commands (do_peek, do_counts, _get_column_data, print_column_data) for both MS-SQL and PostgreSQL dialects.
The mssql branch adds end-to-end MS-SQL support to Datafaker. The main categories of change were: (1) driver and connection plumbing, (2) SQL dialect translation (RANDOM→NEWID, LIMIT→TOP, EXTRACT→DATEPART, STDDEV→STDEV), (3) schema qualification in stored SQL strings that bypass schema_translate_map, and (4) MS-SQL-specific schema/type/FK constraints. All changes are covered by unit tests using mocked engines and dialect instances.
This issue tracks the work done on the
mssqlbranch to make Datafaker work end-to-end against Microsoft SQL Server databases.Background
Datafaker was originally written targeting PostgreSQL and DuckDB. MS-SQL differs in several ways that required systematic fixes across the codebase: different SQL dialect (no
RANDOM()/LIMIT, noEXTRACT, noSTDDEV), mandatory schema qualification when a non-default schema is in use, different type names, identity column behaviour, and stricter rules around foreign-key cascade paths.Changes by area
Foundation and initial driver support
b2f14ab— Add initial MS-SQL support:pyodbc/aioodbcdependencies, async DSN rewriting, schema routing via SQLAlchemyschema_translate_mapd872fd9— Document macOS ODBC driver setup and add.env.examplefor MS-SQL connection stringsfc97b2c— AddTrustServerCertificate=yesto MS-SQL DSN examplesType system
76fec75— Extend type parser for MS-SQL column types (issue Extend type parser to support MS-SQL column types #96):BIGINT,NVARCHAR,DATETIME2, etc.da4a69b— StripSERIAL/IDENTITYfor MS-SQL target databases (issue Handle SERIAL autoincrement columns when targeting MS-SQL #97)Schema and foreign keys
81a65eb— Fix schema-qualified FK resolution and MS-SQL multiple cascade paths (closes Bug: MS-SQL create-tables fails with schema-qualified FK targets and multiple cascade paths #101): resolveschema.table.columnreferences inorm.yamland avoidFOREIGN KEY … CASCADEconflicts that MS-SQL rejectsPrimary keys
84a209f— Let the database generate integer primary keys on MS-SQL (closes MS-SQL identity insert error (8106): SET IDENTITY_INSERT fails because remove_mssql_identity strips IDENTITY from DDL #104): suppress Datafaker-generated PKs when the column hasIDENTITYDialect-correct SQL in generators and interactive shell
MS-SQL does not support
RANDOM()/LIMIT norEXTRACT(… FROM …)/STDDEV. Every code path that emits these had to be updated:3637258— Fix dialect-specific SQL in generator commands (closes configure-generators and make-stats fail on MS-SQL: PostgreSQL-specific SQL functions used in source queries #105):RANDOM()→NEWID(),LIMIT→TOP,EXTRACT→DATEPART,STDDEV→STDEVin the interactive shell7c0add6— FixRANDOM()/LIMITinChoiceGeneratorFactory(closes configure-generators fails on MS-SQL: RANDOM() and LIMIT not supported in ChoiceGeneratorFactory #106)2bcea2b— Fix schema-qualified table in live queries ofChoiceGeneratorFactory41b96f6— Fix schema-missing FROM clause inBucketsqueries29e7889— Fix remainingRANDOM()/LIMITincompatibilities acrossCovariateQuery,MissingnessType, anddo_peek/print_column_data(closes configure-generators fails on MS-SQL: remaining RANDOM() and LIMIT occurrences in interactive shell and generators #107)Schema qualification for SQL stored in
src-statsQueries written by
configure-generatorsintoconfig.yaml'ssrc-statssection are later executed bymake-statsvia rawtext()— SQLAlchemy'sschema_translate_mapdoes not apply. Each code path that writes these strings needed to embed the schema-qualified table name explicitly:91578da— Fix missing schema qualification in interactive shellSELECTstatements (do_peek,do_counts,print_column_data,_get_column_data)4d03aa3— Fix unqualified table names in raw SQL duringconfigure-generators(ContinuousLogDistributionGeneratorFactory,MultivariateNormalGeneratorFactory)173f6da— Fix unqualified table names insrc-statsqueries written byconfigure-generators(_get_aggregate_query,PredefinedGenerator.SELECT_AGGREGATE_RE)1132dc6— Fix schema-qualified table name inChoiceGeneratorstored queries (closes ChoiceGenerator stores unqualified table name in custom_queries SQL #108): usetext(schema_qualified_name(...))so the compiled SQL string is not bracket-quoted as[mimic100.person]Tests
Dialect-correctness tests were added alongside each fix. A dedicated test module
tests/test_generators_dialect.pycovers:DATEPARTvsEXTRACTinMimesisDateTimeGeneratorSTDEVvsSTDDEVinBucketsNEWID()/TOPvsRANDOM()/LIMITinChoiceGeneratorFactory,CovariateQuery,MissingnessTypeBuckets,ContinuousLogDistributionGeneratorFactory,ChoiceGeneratorFactory,_get_aggregate_query, andPredefinedGeneratortests/test_interactive_dialect.pycovers the interactive shell commands (do_peek,do_counts,_get_column_data,print_column_data) for both MS-SQL and PostgreSQL dialects.Examples
40fc121— Renamemimic_omopexample toomop-mssql1857e76— Addomop-postgresqlexampleKnown limitations / deferred work
Summary
The
mssqlbranch adds end-to-end MS-SQL support to Datafaker. The main categories of change were: (1) driver and connection plumbing, (2) SQL dialect translation (RANDOM→NEWID,LIMIT→TOP,EXTRACT→DATEPART,STDDEV→STDEV), (3) schema qualification in stored SQL strings that bypassschema_translate_map, and (4) MS-SQL-specific schema/type/FK constraints. All changes are covered by unit tests using mocked engines and dialect instances.