Bug
Running datafaker configure-generators against an MS-SQL source raises:
('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
'EXTRACT' is not a recognized built-in function name. (195)")
[SQL: SELECT MIN(CAST(EXTRACT(YEAR FROM birth_datetime) AS INT)) AS start,
MAX(CAST(EXTRACT(YEAR FROM birth_datetime) AS INT)) AS end FROM person]
A full audit of raw SQL in the codebase found three MS-SQL incompatibilities in the commands that query the source database.
Affected locations
1. EXTRACT(YEAR FROM ...) — datafaker/generators/mimesis.py:189–195
Triggered by configure-generators.
extract_year = f"CAST(EXTRACT(YEAR FROM {column.name}) AS INT)"
max_year = f"MAX({extract_year})"
min_year = f"MIN({extract_year})"
# ...
result = connection.execute(text(
f"SELECT {min_year} AS start, {max_year} AS end FROM {column.table.name}"
))
MS-SQL does not support EXTRACT(); it uses YEAR(col) or DATEPART(year, col).
2. EXTRACT(YEAR FROM ...) — datafaker/make.py:318–321
Triggered by create-generators and make-stats for Date/DateTime columns.
_YEAR_SUMMARY_QUERY = (
"SELECT MIN(y) AS start, MAX(y) AS end FROM "
"(SELECT EXTRACT(YEAR FROM {column}) AS y FROM {table}) AS years"
)
3. STDDEV() — datafaker/generators/base.py:325–327
Triggered by configure-generators for numeric columns used in bucket generation.
text(
f"SELECT AVG({column_name}) AS mean,"
f" STDDEV({column_name}) AS stddev,"
f" COUNT({column_name}) AS count FROM {table_name}"
)
MS-SQL spells this STDEV() (no trailing D), not STDDEV().
Already fixed
RANDOM() → NEWID() was fixed in providers.py (commit 84a209f) using a connection.dialect.name == "mssql" check — the same pattern applies here.
Options for fixes
Option A — Dialect check with raw SQL (follows existing pattern)
Mirror the NEWID() fix: check connection.dialect.name or engine.dialect.name at the call site and substitute the MS-SQL equivalent.
# EXTRACT fix
if dialect_name == "mssql":
year_expr = f"YEAR({col})"
else:
year_expr = f"EXTRACT(YEAR FROM {col})"
# STDDEV fix
if dialect_name == "mssql":
stddev_fn = "STDEV"
else:
stddev_fn = "STDDEV"
Pros: Minimal change; consistent with NEWID() pattern already in the codebase.
Cons: Must enumerate dialects; each new dialect may need adding.
Option B — SQLAlchemy expression API (most portable)
Replace raw SQL strings with SQLAlchemy expressions that compile correctly per dialect:
sqlalchemy.extract('year', col) → EXTRACT(YEAR FROM col) on PostgreSQL/DuckDB, DATEPART(year, col) on MS-SQL
sqlalchemy.func.stddev(col) → STDDEV(col) on PostgreSQL, STDEV(col) on MS-SQL (SQLAlchemy maps this automatically)
Requires changing text() calls to expression-based queries. Both locations already have access to a Column object.
Pros: Automatically correct for all current and future SQLAlchemy-supported dialects.
Cons: More refactoring; _YEAR_SUMMARY_QUERY can no longer be a plain string constant.
Affected files summary
| File |
Line |
Issue |
datafaker/generators/mimesis.py |
189–195 |
EXTRACT(YEAR FROM ...) |
datafaker/make.py |
318–321 |
EXTRACT(YEAR FROM ...) |
datafaker/generators/base.py |
325–327 |
STDDEV() vs STDEV() |
Bug
Running
datafaker configure-generatorsagainst an MS-SQL source raises:A full audit of raw SQL in the codebase found three MS-SQL incompatibilities in the commands that query the source database.
Affected locations
1.
EXTRACT(YEAR FROM ...)—datafaker/generators/mimesis.py:189–195Triggered by
configure-generators.MS-SQL does not support
EXTRACT(); it usesYEAR(col)orDATEPART(year, col).2.
EXTRACT(YEAR FROM ...)—datafaker/make.py:318–321Triggered by
create-generatorsandmake-statsforDate/DateTimecolumns.3.
STDDEV()—datafaker/generators/base.py:325–327Triggered by
configure-generatorsfor numeric columns used in bucket generation.MS-SQL spells this
STDEV()(no trailing D), notSTDDEV().Already fixed
RANDOM()→NEWID()was fixed inproviders.py(commit 84a209f) using aconnection.dialect.name == "mssql"check — the same pattern applies here.Options for fixes
Option A — Dialect check with raw SQL (follows existing pattern)
Mirror the
NEWID()fix: checkconnection.dialect.nameorengine.dialect.nameat the call site and substitute the MS-SQL equivalent.Pros: Minimal change; consistent with
NEWID()pattern already in the codebase.Cons: Must enumerate dialects; each new dialect may need adding.
Option B — SQLAlchemy expression API (most portable)
Replace raw SQL strings with SQLAlchemy expressions that compile correctly per dialect:
sqlalchemy.extract('year', col)→EXTRACT(YEAR FROM col)on PostgreSQL/DuckDB,DATEPART(year, col)on MS-SQLsqlalchemy.func.stddev(col)→STDDEV(col)on PostgreSQL,STDEV(col)on MS-SQL (SQLAlchemy maps this automatically)Requires changing
text()calls to expression-based queries. Both locations already have access to aColumnobject.Pros: Automatically correct for all current and future SQLAlchemy-supported dialects.
Cons: More refactoring;
_YEAR_SUMMARY_QUERYcan no longer be a plain string constant.Affected files summary
datafaker/generators/mimesis.pyEXTRACT(YEAR FROM ...)datafaker/make.pyEXTRACT(YEAR FROM ...)datafaker/generators/base.pySTDDEV()vsSTDEV()