Skip to content

MS-SQL identity insert error (8106): SET IDENTITY_INSERT fails because remove_mssql_identity strips IDENTITY from DDL #104

@myyong

Description

@myyong

Bug

When running datafaker create-data against an MS-SQL destination, the following error is raised:

('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
Table 'mimic100_synthetic.person' does not have the identity property.
Cannot perform SET operation. (8106) (SQLExecDirectW)")
[SQL: SET IDENTITY_INSERT mimic100_synthetic.person ON]

Datafaker halts before inserting any rows.

Root cause

There are two systems in conflict:

  1. remove_mssql_identity DDL hook (create.py:46–57) strips IDENTITY from every CREATE TABLE statement on MS-SQL, so the physical column is a plain INTEGER with no identity property.

  2. SQLAlchemy's insert path sees Column(Integer, primary_key=True) with autoincrement='auto' (the SQLAlchemy default). When an explicit PK value is supplied, SQLAlchemy wraps the INSERT with SET IDENTITY_INSERT <table> ON. SQL Server rejects this with error 8106 because the column has no identity property — it was stripped in step 1.

The result is that the table and the metadata disagree about whether the column is an identity column, so either the table creation or the insert path will always be wrong.

Options considered

Option A — Let the database generate PKs ✅ (chosen)

  • Remove remove_mssql_identity so tables are created with IDENTITY(1,1).
  • Stop supplying explicit PK values in generated df.py files. The DB generates them, and SQLAlchemy's return_defaults() captures the generated value for story generators.

Pros:

  • PKs are guaranteed unique even across concurrent datafaker runs.
  • No SELECT MAX() query on first insert.
  • Simpler generated df.py — no increment() boilerplate per identity column.
  • Works correctly when datafaker is run multiple times against a non-empty destination.

Cons:

  • Generated PK values are not deterministic — cannot reproduce the exact same synthetic dataset.
  • Existing hand-edited df.py files must be updated to remove increment() calls.

Option B — Suppress autoincrement in MetaData, keep increment()

  • Add autoincrement=False to Column(...) in dict_to_column (serialize_metadata.py). SQLAlchemy stops emitting SET IDENTITY_INSERT; increment() in df.py supplies explicit values as before.

Pros:

  • Smallest possible code change (one keyword argument).
  • Deterministic/reproducible PK values.
  • No need to regenerate or edit df.py files.

Cons:

  • Does not give automatic generation — every df.py must still wire up increment() per identity PK.
  • In-memory accumulator is not concurrency-safe: two parallel datafaker processes both query SELECT MAX() and start from the same value → PK collision.

Option C — Keep IDENTITY on tables, keep increment() (minimal surgical)

  • Remove remove_mssql_identity so tables have IDENTITY. Keep increment() in df.py to supply explicit values. SET IDENTITY_INSERT ON/OFF now works because the column actually has IDENTITY.

Pros:

  • Smallest change to create.py (delete ~10 lines). No changes to df.py or serialize_metadata.py.

Cons:

  • Two systems manage the same counter. The DB's IDENTITY sequence falls behind explicit inserts; if anything ever inserts without IDENTITY_INSERT ON, the DB-generated value collides with a previously-inserted one.
  • SET IDENTITY_INSERT is one-table-at-a-time per session; story generators touching multiple identity-PK tables in one pass hit this limit.

Fix (Option A)

The following files are changed in the associated PR:

  • datafaker/create.py: Remove remove_mssql_identity hook — MS-SQL tables now get IDENTITY(1,1).
  • datafaker/make.py: _get_default_generator returns None for single-column integer PKs with no FK; _get_generator_for_table excludes them from both nonnull_columns and row_gens. The existing comment ("we presume that primary keys are populated automatically") already documented this intent.
  • examples/omop-mssql/df.py: Remove manual increment() call for person_id.
  • tests/test_create_mssql.py: Invert TestMSSQLRemoveIdentityTestMSSQLIdentityPresent; add insert-without-PK test.
  • tests/test_make.py: Add TestGetDefaultGenerator unit tests.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions