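I recently ran into a few problems while building an application scorecard and would appreciate some guidance from the experts here. Thanks in advance! All of my variables are continuous numeric variables.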
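1. Feature selection:
(1) Variable correlation:
library(corrplot)
# Pairwise correlations of the candidate variables (columns 3 to 19)
cor1 <- cor(oball2[, 3:19])
corrplot(cor1, method = "number")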
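The correlation matrix can also be used to prune variables programmatically instead of reading the plot by eye. The following is only a sketch of one common approach, not what the code in this post does: `drop_correlated` is a hypothetical helper, the 0.7 cutoff is an arbitrary assumption, and the data is simulated rather than taken from `oball2`.

```r
# Hypothetical helper: while any pair of variables has |correlation| above the
# cutoff, drop the member of the most correlated pair that has the larger
# mean absolute correlation with the remaining variables.
drop_correlated <- function(df, cutoff = 0.7) {
  cm <- abs(cor(df, use = "pairwise.complete.obs"))
  diag(cm) <- 0
  keep <- colnames(cm)
  while (length(keep) > 1) {
    sub <- cm[keep, keep, drop = FALSE]
    if (max(sub) <= cutoff) break
    # Names of the two variables forming the most correlated pair
    pair <- keep[which(sub == max(sub), arr.ind = TRUE)[1, ]]
    mean_cor <- rowMeans(sub[pair, , drop = FALSE])
    keep <- setdiff(keep, pair[which.max(mean_cor)])
  }
  keep
}

# Simulated data (an assumption): x2 is nearly a copy of x1, x3 is independent
set.seed(1)
x1 <- rnorm(200)
demo <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.1), x3 = rnorm(200))
kept <- drop_correlated(demo)
print(kept)
```

On real data the call would look like drop_correlated(oball2[, 3:19]). Which member of a correlated pair to keep is a judgment call; business meaning or IV may be a better tie-breaker than mean correlation.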
names(oball2)
oball2cor <- oball2[, c("state", "intime", "blacklist", "silence",
                        "calls", "overdue", "fraudscore", "TDscore", "connectbook", "name")]
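(2) Select variables using two methods: logistic regression and random forest
## Feature selection
(1) Random forest variable importance
library(randomForest)
# Use the bare column name as the response so that "." does not also pull
# state in as a predictor
train.forest <- randomForest(state ~ ., data = train.fangkuan1)
str(train.forest)
importance(train.forest)[order(-importance(train.forest)), ]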
    setlend  fraudscore connectbook       calls
 36.5126773  28.2508826  20.6774881  18.5744489
     intime     TDscore        name     silence
 17.4203810  15.6021926  11.9796313   8.3684015
  blacklist     overdue
  7.9312677   0.5399154
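# Top-ranked variables from the importance output (overdue dropped).
# The original list repeated "connectbook" and omitted the top-ranked
# "setlend", which looks like a typo.
vars.tr <- c("setlend", "fraudscore", "connectbook", "intime", "calls",
             "TDscore", "name", "silence", "blacklist")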
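(2) Logistic regression variable importance
# family = "binomial" fits a logistic regression, not a linear one
train.glm <- glm(state ~ ., data = train.fangkuan1, family = "binomial")
summary(train.glm)
# Stepwise selection by AIC (backward by default when no scope is given)
train.step <- step(train.glm)
summary(train.step)
names(coef(train.step))
vars.glm <- c("blacklist", "setlend", "fraudscore", "connectbook")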
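I combined the variables selected by the two methods. But when I binned the data and computed IV, the variable "calls", which had ranked as highly important earlier, had an IV of only about 0.03.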
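One simple way to combine the two selections is with set operations. This is a sketch, assuming the random-forest list is the nine top-ranked variables from the importance output (everything except overdue) and the stepwise list is the one given for vars.glm; the names `agreed`, `rf_only` and `candidates` are introduced here for illustration only.

```r
# Variable lists taken from this post (the random-forest list is an assumption:
# the nine highest-importance variables)
vars.tr  <- c("setlend", "fraudscore", "connectbook", "calls", "intime",
              "TDscore", "name", "silence", "blacklist")
vars.glm <- c("blacklist", "setlend", "fraudscore", "connectbook")

agreed     <- intersect(vars.tr, vars.glm)  # chosen by both methods
rf_only    <- setdiff(vars.tr, vars.glm)    # chosen by random forest only
candidates <- union(vars.tr, vars.glm)      # chosen by at least one method

print(agreed)
print(rf_only)
```

Here "calls" ends up in `rf_only`: the random forest ranks it highly, but stepwise logistic regression does not retain it.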
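# (3) calls
library(smbinning)
# Bin calls at hand-chosen cut points and compute WoE/IV against state
result <- smbinning.custom(df = train.fangkuan1, y = "state", x = "calls",
                           cuts = c(1424, 2034))
result$ivtable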
  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1  <= 1424    396     229    167       396        229       167   0.50   0.5783  0.4217 1.3713 0.3157 -0.1151 0.0067
2  <= 2034    198     122     76       594        351       243   0.25   0.6162  0.3838 1.6053 0.4733  0.0425 0.0004
3   > 2034    198     129     69       792        480       312   0.25   0.6515  0.3485 1.8696 0.6257  0.1949 0.0093
4  Missing      0       0      0       792        480       312   0.00      NaN     NaN    NaN    NaN     NaN    NaN
5    Total    792     480    312        NA         NA        NA   1.00   0.6061  0.3939 1.5385 0.4308  0.0000 0.0164
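The WoE and IV columns can be checked by hand from the good/bad counts in the table. The sketch below uses only base R and the counts shown above; it assumes the usual definitions WoE = ln(share of goods in the bin / share of bads in the bin) and IV = sum over bins of (share of goods - share of bads) * WoE, which reproduce the printed figures.

```r
# Good/bad counts per bin, copied from the ivtable above
good <- c(229, 122, 129)
bad  <- c(167,  76,  69)

dist_good <- good / sum(good)   # share of all goods falling in each bin
dist_bad  <- bad  / sum(bad)    # share of all bads falling in each bin

woe <- log(dist_good / dist_bad)
iv  <- (dist_good - dist_bad) * woe

print(round(woe, 4))
print(round(iv, 4))
print(round(sum(iv), 4))
```

The total for this binning comes out at about 0.0164. A commonly quoted rule of thumb treats an IV below roughly 0.02 as having almost no predictive power. The bad rate falls only from 0.42 to 0.35 across the three bins, so a low IV is expected here; that is a statement about this univariate binning, and a random forest can still rank the variable highly through interactions or finer splits.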
2. My questions: how should variable selection be handled in a situation like this? And is the approach I used reasonable?