Compare identical tables across two SQL Server instances using Python with mssql-python driver and Apache Arrow. Detect missing rows, column mismatches, schema drift, and produce a reconciliation report.
| Parameter | Required | Description |
|---|---|---|
| Source server | Yes | Source SQL Server (e.g. prod-server.database.windows.net) |
| Source database | Yes | Source database name |
| Target server | Yes | Target SQL Server (e.g. staging-server.database.windows.net) |
| Target database | Yes | Target database name |
| Tables | Yes | Comma-separated schema.table names, or schema.* wildcard (e.g. dbo.Orders,dbo.Items or dbo.*) |
| Auth mode | Yes | sql (user/password) or entra (Azure AD/token) |
| Primary key | Auto-detect | Column(s) forming the row identity. Auto-detect from metadata if not provided. |
| Columns to compare | All | Subset of columns, or all non-PK columns |
| Chunk size | 100000 |
Rows per batch for large tables |
| Output format | console |
console, csv, parquet, or json |
The reconciliation logic is provided as a standalone script at scripts/reconcile.py. Invoke it with the appropriate arguments based on user inputs:
python scripts/reconcile.py \
--source-server <source_server> \
--source-database <source_database> \
--target-server <target_server> \
--target-database <target_database> \
--tables "<table_spec>" \
--auth <sql|entra> \
--chunk-size <chunk_size> \
--output <console|csv|json>
| Argument | Description |
|---|---|
--primary-key |
Comma-separated PK column(s). Omit to auto-detect. |
--columns |
Comma-separated columns to compare. Omit to compare all non-PK columns. |
Single table with SQL auth:
python scripts/reconcile.py \
--source-server prod-server.database.windows.net \
--source-database ProdDB \
--target-server staging-server.database.windows.net \
--target-database StagingDB \
--tables "dbo.Orders" \
--auth sql \
--output console
Wildcard with Entra auth and CSV output:
python scripts/reconcile.py \
--source-server prod-server.database.windows.net \
--source-database ProdDB \
--target-server staging-server.database.windows.net \
--target-database StagingDB \
--tables "dbo.*" \
--auth entra \
--output csv
Install required packages before running:
pip install mssql-python pyarrow pandas
NULL == NULL is considered a match (both sides missing = no diff)OFFSET/FETCH or ROW_NUMBER() partitioningWhen table has >1M rows, generate a hash pre-check:
SELECT {pk_cols},
HASHBYTES('SHA2_256', CONCAT_WS('|', col1, col2, ...)) AS row_hash
FROM {table}
Compare hashes first; only fetch full rows for mismatched hashes. This reduces data transfer significantly.
Reconciling dbo.EMPLOYEES...
Reconciling dbo.DEPARTMENTS...
Reconciling dbo.JOBS...
--- dbo.EMPLOYEES ---
Source: 107 Target: 107
Missing: 0 Extra: 0 Mismatches: 0
Result: ✓ IDENTICAL
--- dbo.DEPARTMENTS ---
Source: 27 Target: 27
Missing: 0 Extra: 0 Mismatches: 3
Result: ✗ DIFFERENCES FOUND
--- dbo.JOBS ---
Source: 19 Target: 19
Missing: 0 Extra: 0 Mismatches: 0
Result: ✓ IDENTICAL
=== Summary: 2 passed, 1 failed, 0 skipped / 3 tables ===
When a single table is provided, include full detail (schema drift, sample rows, mismatches). When multiple tables, use the compact per-table format above with full detail only for tables with FAIL status.
| Scenario | Strategy |
|---|---|
| < 100K rows | Single Arrow fetch, in-memory pandas compare |
| 100K–1M rows | Chunked extraction (100K batches), streaming comparison |
| > 1M rows | Hash pre-check → only fetch mismatched rows |
| Wide tables (100+ cols) | Compare PK + hash first, drill into specific columns on mismatch |
| Network-constrained | Use Arrow columnar format (10-50x smaller than row-by-row) |
mssql-python driver (not pyodbc, pymssql)cursor.arrow()) for data extractionencrypt=True throw errors)os.environ / getpass (env vars: MSSQL_USER, MSSQL_PASSWORD)? placeholders) for metadata lookups — never f-string interpolate user input into SQL