Credit Risk Data Cleaning Pipeline

v20260415

datanalysis-credit-risk

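A credit risk data cleaning workflow for pre-loan modeling. It processes raw data, filters abnormal periods, removes features with high missing rates, low IV, high PSI, or high correlation, denoises with Null Importance, and exports an Excel cleaning report to support quality assessment and variable screening.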

Credit risk, data cleaning, variable screening, modeling preprocessing, feature engineering, reports, Python, data processing

278 downloads

Overview

Data Cleaning and Variable Screening

# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:

1. Get Data - Load and format raw data
2. Organization Sample Analysis - Statistics of sample count and bad sample rate for each organization
3. Separate OOS Data - Separate out-of-sample (OOS) samples from modeling samples
4. Filter Abnormal Months - Remove months with insufficient bad sample count or total sample count
5. Calculate Missing Rate - Calculate overall and organization-level missing rates for each feature
6. Drop High Missing Rate Features - Remove features with overall missing rate exceeding threshold
7. Drop Low IV Features - Remove features with overall IV too low or IV too low in too many organizations
8. Drop High PSI Features - Remove features with unstable PSI
9. Null Importance Denoising - Remove noise features using label permutation method
10. Drop High Correlation Features - Remove high correlation features based on original gain
11. Export Report - Generate Excel report containing details and statistics of all steps
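
To illustrate the quantity behind the Drop Low IV step, here is a minimal sketch of Information Value for one pre-binned feature. The function name `information_value`, the smoothing constant `eps`, and the sample data are assumptions made for this example; the skill's own binning method and IV thresholds are not shown in this listing.

```python
import math
from collections import Counter

def information_value(bins, labels, eps=0.5):
    """IV of one binned feature against a 0/1 label (1 = bad sample).

    eps is an additive smoothing count (an assumption of this sketch)
    that keeps the log finite when a bin has no good or no bad samples.
    """
    bad = Counter(b for b, y in zip(bins, labels) if y == 1)
    good = Counter(b for b, y in zip(bins, labels) if y == 0)
    keys = set(bad) | set(good)
    total_bad = sum(bad.values()) + eps * len(keys)
    total_good = sum(good.values()) + eps * len(keys)
    iv = 0.0
    for k in keys:
        p_bad = (bad[k] + eps) / total_bad
        p_good = (good[k] + eps) / total_good
        # Each term is non-negative: the difference and the log share a sign.
        iv += (p_bad - p_good) * math.log(p_bad / p_good)
    return iv

# Hypothetical data: bin "A" is mostly good, bin "B" is mostly bad.
bins = ["A"] * 50 + ["B"] * 50
labels = [0] * 45 + [1] * 5 + [0] * 15 + [1] * 35
iv_informative = information_value(bins, labels)

# When both bins carry the same bad rate, the feature separates nothing.
flat_labels = ([0] * 40 + [1] * 10) * 2
iv_flat = information_value(bins, flat_labels)
```

A feature would then be flagged when its IV falls below a chosen cutoff, either overall or in too many organizations, as the step description states.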

Function	Purpose	Module
`get_dataset()`	Load and format data	references.func
`org_analysis()`	Organization sample analysis	references.func
`missing_check()`	Calculate missing rate	references.func
`drop_abnormal_ym()`	Filter abnormal months	references.analysis
`drop_highmiss_features()`	Drop high missing rate features	references.analysis
`drop_lowiv_features()`	Drop low IV features	references.analysis
`drop_highpsi_features()`	Drop high PSI features	references.analysis
`drop_highnoise_features()`	Null Importance denoising	references.analysis
`drop_highcorr_features()`	Drop high correlation features	references.analysis
`iv_distribution_by_org()`	IV distribution statistics	references.analysis
`psi_distribution_by_org()`	PSI distribution statistics	references.analysis
`value_ratio_distribution_by_org()`	Value ratio distribution statistics	references.analysis
`export_cleaning_report()`	Export cleaning report	references.analysis
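
The signatures of the functions above are not shown in this listing, so the following does not call them. It is a minimal sketch of the label permutation idea behind Null Importance denoising: a feature's importance on the true labels is compared with the importances it obtains when labels are shuffled. As a loudly stated simplification, the sketch uses absolute Pearson correlation as a stand-in for a tree model's gain; the function names, `n_rounds`, and `quantile` are hypothetical.

```python
import random
import statistics

def abs_corr(xs, ys):
    """Stand-in importance: |Pearson correlation| between feature and label."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return abs(sxy) / (sxx * syy) ** 0.5

def null_importance_noise(features, labels, n_rounds=50, quantile=0.95, seed=0):
    """Return features whose actual importance does not exceed the chosen
    quantile of the importances obtained with permuted labels."""
    rng = random.Random(seed)
    null_scores = {name: [] for name in features}
    for _ in range(n_rounds):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # break any real feature/label relationship
        for name, values in features.items():
            null_scores[name].append(abs_corr(values, shuffled))
    noisy = []
    for name, values in features.items():
        actual = abs_corr(values, labels)
        cutoff = sorted(null_scores[name])[int(quantile * (n_rounds - 1))]
        if actual <= cutoff:
            noisy.append(name)
    return noisy

# Hypothetical data: one informative feature, one random, one constant.
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(400)]
features = {
    "signal": [y + rng.gauss(0, 0.5) for y in labels],
    "random": [rng.gauss(0, 1) for _ in labels],
    "constant": [1.0] * len(labels),
}
noisy = null_importance_noise(features, labels)
```

In a real run the importance would come from a fitted model, which makes each permutation round more expensive but leaves the comparison logic unchanged.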

The generated Excel report contains the following sheets:

汇总 - Summary information of all steps, including operation results and conditions
机构样本统计 - Sample count and bad sample rate for each organization
分离OOS数据 - OOS sample and modeling sample counts
Step4-异常月份处理 - Abnormal months that were removed
缺失率明细 - Overall and organization-level missing rates for each feature
Step5-有值率分布统计 - Distribution of features in different value ratio ranges
Step6-高缺失率处理 - High missing rate features that were removed
Step7-IV明细 - IV values of each feature in each organization and overall
Step7-IV处理 - Features that do not meet IV conditions and low IV organizations
Step7-IV分布统计 - Distribution of features in different IV ranges
Step8-PSI明细 - PSI values of each feature in each organization each month
Step8-PSI处理 - Features that do not meet PSI conditions and unstable organizations
Step8-PSI分布统计 - Distribution of features in different PSI ranges
Step9-null importance处理 - Noise features that were removed
Step10-高相关性剔除 - High correlation features that were removed
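
The Step8 sheets report PSI per feature, organization, and month. As a hedged illustration of what such a value measures, the sketch below computes the Population Stability Index between a baseline sample and a comparison sample over shared bins. The function name `psi`, the `eps` floor, and the sample data are assumptions; which period the skill uses as its baseline is not stated in this listing.

```python
import math
from collections import Counter

def psi(expected_bins, actual_bins, eps=1e-6):
    """Population Stability Index between two samples of bin labels."""
    exp_counts = Counter(expected_bins)
    act_counts = Counter(actual_bins)
    n_exp, n_act = len(expected_bins), len(actual_bins)
    total = 0.0
    for key in set(exp_counts) | set(act_counts):
        # Floor each share at eps (assumption) so empty bins do not break the log.
        p_exp = max(exp_counts[key] / n_exp, eps)
        p_act = max(act_counts[key] / n_act, eps)
        total += (p_act - p_exp) * math.log(p_act / p_exp)
    return total

# Hypothetical bin labels for one feature in one organization.
baseline = ["low"] * 50 + ["mid"] * 30 + ["high"] * 20
same_mix = ["low"] * 25 + ["mid"] * 15 + ["high"] * 10
shifted = ["low"] * 20 + ["mid"] * 30 + ["high"] * 50

psi_stable = psi(baseline, same_mix)   # identical bin shares
psi_shifted = psi(baseline, shifted)   # mass moved from "low" to "high"
```

Identical bin shares give a PSI of zero regardless of sample size, while a shifted distribution produces a clearly positive value that a threshold can catch.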

Interactive Input: Parameters can be entered before each step runs, and default values are supported
Independent Execution: Each step runs independently without deleting the original data, which makes comparative analysis easier
Complete Report: Generates a complete Excel report containing details, statistics, and distributions
Multi-process Support: IV and PSI calculations support multi-process acceleration
Organization-level Analysis: Supports organization-level statistics and distinguishes modeling samples from OOS samples
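
One way to picture independent execution together with organization-level analysis is a step that only reports which features to drop and leaves the input untouched. The sketch below does this for missing rates. The names `missing_rates` and `high_missing_features`, the `org` key, and the 0.6 threshold are hypothetical and are not taken from the skill's code.

```python
from collections import defaultdict

def missing_rates(rows, features, org_key="org"):
    """Return (overall, by_org) missing rates; None counts as missing."""
    orgs = defaultdict(list)
    for row in rows:
        orgs[row[org_key]].append(row)
    overall = {}
    by_org = defaultdict(dict)
    for f in features:
        overall[f] = sum(r.get(f) is None for r in rows) / len(rows)
        for org, org_rows in orgs.items():
            by_org[org][f] = sum(r.get(f) is None for r in org_rows) / len(org_rows)
    return overall, dict(by_org)

def high_missing_features(overall, threshold):
    """Report features over the threshold; the input rows are not modified."""
    return sorted(f for f, rate in overall.items() if rate > threshold)

# Hypothetical raw records from two organizations.
rows = [
    {"org": "A", "income": 5000, "age": 31},
    {"org": "A", "income": None, "age": 45},
    {"org": "B", "income": None, "age": 28},
    {"org": "B", "income": None, "age": None},
]
all_features = ["income", "age"]
overall, by_org = missing_rates(rows, all_features)
to_drop = high_missing_features(overall, threshold=0.6)

# Later steps can work from the kept list while `rows` stays intact.
kept = [f for f in all_features if f not in to_drop]
```

Because each step returns a decision rather than a modified dataset, the results of different thresholds can be compared side by side against the same original rows.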

Information

Category Data Science

Name datanalysis-credit-risk

Version v20260415

Size 21.43KB

Source github/awesome-copilot

Updated 2026-04-17