Skills Data Science Database Table Reconciliation Tool

Database Table Reconciliation Tool

v20260519
sql-server-table-reconciliation
This powerful tool compares identical tables across two SQL Server instances (e.g., Production vs. Staging). It efficiently detects missing rows, extra columns, schema drifts, and data mismatches. Leveraging Python and Apache Arrow for fast columnar data transfer, it provides a comprehensive reconciliation report, making it essential for data migration validation and ETL verification.
Get Skill
347 downloads
Overview

SQL Server Table Reconciliation

Compare identical tables across two SQL Server instances using Python with mssql-python driver and Apache Arrow. Detect missing rows, column mismatches, schema drift, and produce a reconciliation report.

Workflow

  1. Collect connection details for source and target
  2. Identify primary key / composite key
  3. Detect schema differences
  4. Extract data via Arrow for efficient columnar transfer
  5. Compare rows and columns
  6. Generate reconciliation report

Collect Inputs

Parameter Required Description
Source server Yes Source SQL Server (e.g. prod-server.database.windows.net)
Source database Yes Source database name
Target server Yes Target SQL Server (e.g. staging-server.database.windows.net)
Target database Yes Target database name
Tables Yes Comma-separated schema.table names, or schema.* wildcard (e.g. dbo.Orders,dbo.Items or dbo.*)
Auth mode Yes sql (user/password) or entra (Azure AD/token)
Primary key Auto-detect Column(s) forming the row identity. Auto-detect from metadata if not provided.
Columns to compare All Subset of columns, or all non-PK columns
Chunk size 100000 Rows per batch for large tables
Output format console console, csv, parquet, or json

Bundled Script

The reconciliation logic is provided as a standalone script at scripts/reconcile.py. Invoke it with the appropriate arguments based on user inputs:

python scripts/reconcile.py \
    --source-server <source_server> \
    --source-database <source_database> \
    --target-server <target_server> \
    --target-database <target_database> \
    --tables "<table_spec>" \
    --auth <sql|entra> \
    --chunk-size <chunk_size> \
    --output <console|csv|json>

Optional arguments

Argument Description
--primary-key Comma-separated PK column(s). Omit to auto-detect.
--columns Comma-separated columns to compare. Omit to compare all non-PK columns.

Example invocations

Single table with SQL auth:

python scripts/reconcile.py \
    --source-server prod-server.database.windows.net \
    --source-database ProdDB \
    --target-server staging-server.database.windows.net \
    --target-database StagingDB \
    --tables "dbo.Orders" \
    --auth sql \
    --output console

Wildcard with Entra auth and CSV output:

python scripts/reconcile.py \
    --source-server prod-server.database.windows.net \
    --source-database ProdDB \
    --target-server staging-server.database.windows.net \
    --target-database StagingDB \
    --tables "dbo.*" \
    --auth entra \
    --output csv

Prerequisites

Install required packages before running:

pip install mssql-python pyarrow pandas

Comparison Rules

  • Normalize types before comparing: cast decimals to same precision, trim strings, normalize datetime to UTC
  • NULL handling: NULL == NULL is considered a match (both sides missing = no diff)
  • Ignore row order: always compare by PK join, never positional
  • Large tables: chunk extraction with OFFSET/FETCH or ROW_NUMBER() partitioning

Hash-Based Optimization (for large tables)

When table has >1M rows, generate a hash pre-check:

SELECT {pk_cols},
       HASHBYTES('SHA2_256', CONCAT_WS('|', col1, col2, ...)) AS row_hash
FROM {table}

Compare hashes first; only fetch full rows for mismatched hashes. This reduces data transfer significantly.

Report Format

Reconciling dbo.EMPLOYEES...
Reconciling dbo.DEPARTMENTS...
Reconciling dbo.JOBS...

--- dbo.EMPLOYEES ---
  Source: 107  Target: 107
  Missing: 0  Extra: 0  Mismatches: 0
  Result: ✓ IDENTICAL

--- dbo.DEPARTMENTS ---
  Source: 27  Target: 27
  Missing: 0  Extra: 0  Mismatches: 3
  Result: ✗ DIFFERENCES FOUND

--- dbo.JOBS ---
  Source: 19  Target: 19
  Missing: 0  Extra: 0  Mismatches: 0
  Result: ✓ IDENTICAL

=== Summary: 2 passed, 1 failed, 0 skipped / 3 tables ===

When a single table is provided, include full detail (schema drift, sample rows, mismatches). When multiple tables, use the compact per-table format above with full detail only for tables with FAIL status.

Performance Considerations

Scenario Strategy
< 100K rows Single Arrow fetch, in-memory pandas compare
100K–1M rows Chunked extraction (100K batches), streaming comparison
> 1M rows Hash pre-check → only fetch mismatched rows
Wide tables (100+ cols) Compare PK + hash first, drill into specific columns on mismatch
Network-constrained Use Arrow columnar format (10-50x smaller than row-by-row)

Constraints

  • Always use mssql-python driver (not pyodbc, pymssql)
  • Always use Apache Arrow via cursor (cursor.arrow()) for data extraction
  • Connection MUST use connection string format, not keyword arguments (kwargs like encrypt=True throw errors)
  • Never compare without identifying PK first — ask user if auto-detect fails
  • Handle connection failures gracefully with retry logic
  • Never hardcode credentials in generated scripts — use os.environ / getpass (env vars: MSSQL_USER, MSSQL_PASSWORD)
  • Do not print credentials in output or logs
  • Use parameterized queries (? placeholders) for metadata lookups — never f-string interpolate user input into SQL
Info
Category Data Science
Name sql-server-table-reconciliation
Version v20260519
Size 6.6KB
Updated At 2026-05-20
Language