Data Cleaning and Variable Screening
Quick Start
# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
Complete Process Description
The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:
-
Get Data - Load and format raw data
-
Organization Sample Analysis - Statistics of sample count and bad sample rate for each organization
-
Separate OOS Data - Separate out-of-sample (OOS) samples from modeling samples
-
Filter Abnormal Months - Remove months with insufficient bad sample count or total sample count
-
Calculate Missing Rate - Calculate overall and organization-level missing rates for each feature
-
Drop High Missing Rate Features - Remove features with overall missing rate exceeding threshold
-
Drop Low IV Features - Remove features with overall IV too low or IV too low in too many organizations
-
Drop High PSI Features - Remove features with unstable PSI
-
Null Importance Denoising - Remove noise features using label permutation method
-
Drop High Correlation Features - Remove high correlation features based on original gain
-
Export Report - Generate Excel report containing details and statistics of all steps
Core Functions
| Function |
Purpose |
Module |
get_dataset() |
Load and format data |
references.func |
org_analysis() |
Organization sample analysis |
references.func |
missing_check() |
Calculate missing rate |
references.func |
drop_abnormal_ym() |
Filter abnormal months |
references.analysis |
drop_highmiss_features() |
Drop high missing rate features |
references.analysis |
drop_lowiv_features() |
Drop low IV features |
references.analysis |
drop_highpsi_features() |
Drop high PSI features |
references.analysis |
drop_highnoise_features() |
Null Importance denoising |
references.analysis |
drop_highcorr_features() |
Drop high correlation features |
references.analysis |
iv_distribution_by_org() |
IV distribution statistics |
references.analysis |
psi_distribution_by_org() |
PSI distribution statistics |
references.analysis |
value_ratio_distribution_by_org() |
Value ratio distribution statistics |
references.analysis |
export_cleaning_report() |
Export cleaning report |
references.analysis |
Parameter Description
Data Loading Parameters
-
DATA_PATH: Data file path (best are parquet format)
-
DATE_COL: Date column name
-
Y_COL: Label column name
-
ORG_COL: Organization column name
-
KEY_COLS: Primary key column name list
OOS Organization Configuration
-
OOS_ORGS: Out-of-sample organization list
Abnormal Month Filtering Parameters
-
min_ym_bad_sample: Minimum bad sample count per month (default 10)
-
min_ym_sample: Minimum total sample count per month (default 500)
Missing Rate Parameters
-
missing_ratio: Overall missing rate threshold (default 0.6)
IV Parameters
-
overall_iv_threshold: Overall IV threshold (default 0.1)
-
org_iv_threshold: Single organization IV threshold (default 0.1)
-
max_org_threshold: Maximum tolerated low IV organization count (default 2)
PSI Parameters
-
psi_threshold: PSI threshold (default 0.1)
-
max_months_ratio: Maximum unstable month ratio (default 1/3)
-
max_orgs: Maximum unstable organization count (default 6)
Null Importance Parameters
-
n_estimators: Number of trees (default 100)
-
max_depth: Maximum tree depth (default 5)
-
gain_threshold: Gain difference threshold (default 50)
High Correlation Parameters
-
max_corr: Correlation threshold (default 0.9)
-
top_n_keep: Keep top N features by original gain ranking (default 20)
Output Report
The generated Excel report contains the following sheets:
-
汇总 - Summary information of all steps, including operation results and conditions
-
机构样本统计 - Sample count and bad sample rate for each organization
-
分离OOS数据 - OOS sample and modeling sample counts
-
Step4-异常月份处理 - Abnormal months that were removed
-
缺失率明细 - Overall and organization-level missing rates for each feature
-
Step5-有值率分布统计 - Distribution of features in different value ratio ranges
-
Step6-高缺失率处理 - High missing rate features that were removed
-
Step7-IV明细 - IV values of each feature in each organization and overall
-
Step7-IV处理 - Features that do not meet IV conditions and low IV organizations
-
Step7-IV分布统计 - Distribution of features in different IV ranges
-
Step8-PSI明细 - PSI values of each feature in each organization each month
-
Step8-PSI处理 - Features that do not meet PSI conditions and unstable organizations
-
Step8-PSI分布统计 - Distribution of features in different PSI ranges
-
Step9-null importance处理 - Noise features that were removed
-
Step10-高相关性剔除 - High correlation features that were removed
Features
-
Interactive Input: Parameters can be input before each step execution, with default values supported
-
Independent Execution: Each step is executed independently without deleting original data, facilitating comparative analysis
-
Complete Report: Generate complete Excel report containing details, statistics, and distributions
-
Multi-process Support: IV and PSI calculations support multi-process acceleration
-
Organization-level Analysis: Support organization-level statistics and modeling/OOS distinction