1. Decision Tree Basics

1.1 The Structure of a Decision Tree

(1) A decision tree is made up of nodes, which fall into two kinds: decision nodes (internal nodes) and leaf nodes.

(2) The topmost node is also called the root node. The longest path from the root node to a leaf node is called the depth of the tree.

(3) At each decision node, the algorithm selects the best predictor variable from the sample set and determines the best threshold, producing a binary split; a leaf node then predicts the majority class of its samples for classification (or their mean for regression).

[Figure: decision tree structure (source: Javatpoint, "Machine Learning Decision Tree Classification Algorithm")]

  • Two key questions arise when building a decision tree:

    (1) How do we choose the best predictor variable, and how do we determine its threshold?

    (2) How do we prevent the tree from overfitting?

1.2 The Gini Index

(1) To reduce impurity, a measure of how mixed the data are, a decision tree selects the best cut-off of a predictor variable to split on. When most of the samples in each resulting group share the same label, impurity decreases.

  • The impurity of each group can be measured by the Gini index; the smaller its value, the purer the data.
  • The change in impurity from before to after a split is measured by the Gini gain; the larger its value, the purer the data become.

(2) The Gini index and Gini gain are computed as follows (the original formula figure did not survive):
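A standard reconstruction of the lost figure, for a node $D$ with class proportions $p_k$ that is split into children $D_L$ and $D_R$:

$$\mathrm{Gini}(D) \;=\; 1 - \sum_{k} p_k^{2}$$

$$\Delta\mathrm{Gini} \;=\; \mathrm{Gini}(D) \;-\; \frac{|D_L|}{|D|}\,\mathrm{Gini}(D_L) \;-\; \frac{|D_R|}{|D|}\,\mathrm{Gini}(D_R)$$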

As the formula shows, for a binary classification task the Gini index ranges from 0 to 0.5. A value of 0 means the node consists of samples of a single label; a value of 0.5 means the two labels each make up half (with $p_1 = p_2 = 0.5$, $\mathrm{Gini} = 1 - 0.25 - 0.25 = 0.5$).


(3) Comparing the Gini gain of different predictor variables really means comparing the Gini gain of each predictor at its own best split cut-off.

Depending on the type of predictor variable, the candidate values are generated differently, and the best candidate is chosen to represent that predictor; a small sketch for the continuous case follows.
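As a minimal illustration (my own sketch, with made-up values): for a continuous predictor, the usual candidates are the midpoints between consecutive sorted unique values, each of which would then be scored by Gini gain.

x <- c(4, 4, 0, 4, 2, 6)                    # a continuous predictor, e.g. legs
vals <- sort(unique(x))                     # 0 2 4 6
cutoffs <- head(vals, -1) + diff(vals) / 2  # candidate cut-offs: 1 3 5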

1.3 Preventing Decision Tree Overfitting

  • Left unconstrained, a decision tree keeps splitting nodes until every leaf is a pure node containing a single class. This is the main cause of decision tree overfitting.

  • There are two strategies for preventing overfitting:

    • ① grow a full, overly long tree first, then go back and prune it;
    • ② define stopping rules in advance so the tree never overfits in the first place.
  • The second strategy offers four angles (hyperparameters), listed below; a matching rpart sketch follows the list.

(1) Directly cap the maximum depth;

(2) Require a minimum performance improvement for each split (the cp value, similar in spirit to Gini gain);

(3) Leaf node size: if a split would leave fewer samples in a leaf than the specified value, the split is not made;

(4) Decision node size: if a node contains fewer samples than the specified value, it is not split (and is treated as a leaf node).
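In rpart these four hyperparameters map directly onto control options. A minimal sketch, assuming the Zoo2 data frame built in section 2.1 below (the values are illustrative placeholders, not tuned):

library(rpart)
ctrl <- rpart.control(maxdepth  = 5,     # (1) cap on tree depth
                      cp        = 0.01,  # (2) minimum improvement required per split
                      minbucket = 3,     # (3) minimum samples in a leaf
                      minsplit  = 10)    # (4) minimum samples in a node before splitting
fit <- rpart(type ~ ., data = Zoo2, control = ctrl)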

2. Decision Tree Modeling with mlr3 and mlr

2.1 The Animal Features Dataset

library(mlr3verse)
library(tidyverse)

data(Zoo, package = "mlbench")
# Convert the logical (TRUE/FALSE) columns to factor variables
Zoo2 = Zoo %>% 
        dplyr::mutate(across(where(is.logical), as.factor))
str(Zoo2)
# 'data.frame':	101 obs. of  17 variables:
#  $ hair    : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 2 1 1 2 ...
#  $ feathers: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
#  $ eggs    : Factor w/ 2 levels "FALSE","TRUE": 1 1 2 1 1 1 1 2 2 1 ...
#  $ milk    : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 2 1 1 2 ...
#  $ airborne: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
#  $ aquatic : Factor w/ 2 levels "FALSE","TRUE": 1 1 2 1 1 1 1 2 2 1 ...
#  $ predator: Factor w/ 2 levels "FALSE","TRUE": 2 1 2 2 2 1 1 1 2 1 ...
#  $ toothed : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
#  $ backbone: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
#  $ breathes: Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 2 1 1 2 ...
#  $ venomous: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
#  $ fins    : Factor w/ 2 levels "FALSE","TRUE": 1 1 2 1 1 1 1 2 2 1 ...
#  $ legs    : int  4 4 0 4 4 4 4 0 0 4 ...
#  $ tail    : Factor w/ 2 levels "FALSE","TRUE": 1 2 2 1 2 2 2 2 2 1 ...
#  $ domestic: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 2 2 1 2 ...
#  $ catsize : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 2 1 1 1 ...
#  $ type    : Factor w/ 7 levels "mammal","bird",..: 1 1 4 1 1 1 1 4 4 1 ...

## Columns 1-16 are the animal features
## Column 17 (type) is the animal class label

2.2 Defining the Prediction Target and Training Method

27
#(1) Predict the animal type from the features in columns 1-16
task_classif = as_task_classif(Zoo2, target = "type")
task_classif$col_roles$stratum = "type"   # stratify resampling by the class label

# Use the rpart decision tree algorithm, predicting class probabilities
learner = lrn("classif.rpart", predict_type="prob")
learner$param_set

##(2) Define the hyperparameter search space and how to traverse it
search_space = ps(
  minsplit = p_int(lower=1,upper=20),
  minbucket = p_int(lower=1,upper=20),
  cp = p_dbl(lower=0, upper=1),
  maxdepth = p_int(lower=1,upper=20)
)

design = expand.grid(minsplit=c(1, 5, 10),
                     minbucket=c(1, 10, 20),
                     cp=c(0.1, 0.3, 0.5, 0.9),
                     maxdepth = c(5, 10, 15)) %>% as.data.table()

## 5-fold cross-validation
resampling = rsmp("cv")
resampling$param_set$values$folds=5

## Accuracy as the model evaluation metric
measure = msr("classif.acc")
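A note on the grid above: expand.grid produces 3 × 3 × 4 × 3 = 108 hyperparameter combinations, each of which the design_points tuner will evaluate by 5-fold CV.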

2.3 Training the Model

## Create the tuning instance
instance = TuningInstanceSingleCrit$new(
  task = task_classif,
  learner = learner,
  resampling = resampling,
  measure = measure,
  terminator = trm("none"),
  search_space = search_space
)
tuner = tnr("design_points", design = design)
future::plan("multisession")  # evaluate resampling iterations in parallel
## Optimize the hyperparameters
tuner$optimize(instance)
as.data.table(instance$archive)[,1:5] %>% head
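Note that the code below switches from mlr3 to the legacy mlr package and references objects (tree, zooTask, treeParamSpace, cvForTuning, randSearch) that are never defined in this post. A plausible reconstruction, with assumed parameter ranges, would be:

library(mlr)
zooTask <- makeClassifTask(data = Zoo2, target = "type")   # classification task
tree    <- makeLearner("classif.rpart")                    # rpart learner
treeParamSpace <- makeParamSet(                            # assumed tuning ranges
  makeIntegerParam("minsplit",  lower = 5,    upper = 20),
  makeIntegerParam("minbucket", lower = 3,    upper = 10),
  makeNumericParam("cp",        lower = 0.01, upper = 0.1),
  makeIntegerParam("maxdepth",  lower = 3,    upper = 10))
randSearch  <- makeTuneControlRandom(maxit = 200)          # random search
cvForTuning <- makeResampleDesc("CV", iters = 5)           # 5-fold CV for tuning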


##(1) Find the best hyperparameter combination (legacy mlr workflow; uses the objects sketched above)
library(parallel)
library(parallelMap)
# Enable multithreading
parallelStartSocket(cpus = detectCores())
tunedTreePars <- tuneParams(tree, task = zooTask, # ~30 sec
                           resampling = cvForTuning, 
                           par.set = treeParamSpace, 
                           control = randSearch)
# Stop multithreading
parallelStop()

tunedTreePars
# Tune result:
# Op. pars: minsplit=6; minbucket=3; cp=0.0345; maxdepth=9
# mmce.test.mean=0.1200000

##(2) Train the model with this combination
tunedTree <- setHyperPars(tree, par.vals = tunedTreePars$x)
tunedTreeModel <- train(tunedTree, zooTask)

##(3) Visualize the decision tree
treeModelData <- getLearnerModel(tunedTreeModel)
library(rpart.plot)
rpart.plot(treeModelData, roundint = FALSE, type = 5)

[Figure: the tuned decision tree drawn with rpart.plot]

2.4 Nested Cross-Validation

# Inner loop: 5-fold CV for hyperparameter tuning
inner <- makeResampleDesc("CV", iters = 5)
# Outer loop: 3-fold CV for performance estimation
outer <- makeResampleDesc("CV", iters = 3)

treeWrapper <- makeTuneWrapper("classif.rpart", resampling = inner, 
                              par.set = treeParamSpace, 
                              control = randSearch) 

parallelStartSocket(cpus = 8)
cvWithTuning <- resample(treeWrapper, zooTask, resampling = outer)
parallelStop()

cvWithTuning
# Resample Result
# Task: zooTib
# Learner: classif.rpart.tuned
# Aggr perf: mmce.test.mean=0.1295306
# Runtime: 13.4196
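For reference, the same nested CV can be expressed with the mlr3 objects defined in sections 2.2 and 2.3. This is a sketch of my own arrangement (an AutoTuner wrapping the inner tuning loop), not code from the original post:

at = AutoTuner$new(
  learner      = learner,
  resampling   = rsmp("cv", folds = 5),    # inner loop: tuning
  measure      = msr("classif.acc"),
  search_space = search_space,
  terminator   = trm("none"),
  tuner        = tnr("design_points", design = design)
)
rr = resample(task_classif, at, rsmp("cv", folds = 3))  # outer loop: evaluation
rr$aggregate(msr("classif.acc"))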

3. Bagging and Random Forests

Ensemble learning means training multiple models and combining all of their predictions at prediction time, which reduces the variance of any single model. It comes in three flavors: bagging, boosting, and stacking. Although ensembling can be applied to many machine learning algorithms, it is most often used with tree-based learners; the random forest covered next is the classic result of applying bagging to decision trees.

  • Bagging (bootstrap aggregating):

    (1) Decide how many sub-models to train;

    (2) For each sub-model, draw a bootstrap sample, i.e. sample from the training set with replacement;

    (3) Train each sub-model on its bootstrap sample;

    (4) Feed new data into every sub-model to obtain its prediction;

    (5) Aggregate the predictions of all sub-models into the final prediction.

Since the sub-models are trained independently of one another, training many of them parallelizes naturally across threads. A from-scratch sketch of the five steps follows.
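A minimal sketch of steps (1)-(5), assuming the Zoo2 data frame from section 2.1 (this illustrates the idea only; it is not the ranger implementation used later):

library(rpart)
n_models <- 25                                      # (1) number of sub-models
models <- lapply(seq_len(n_models), function(i) {
  idx <- sample(nrow(Zoo2), replace = TRUE)         # (2) bootstrap sample
  rpart(type ~ ., data = Zoo2[idx, ])               # (3) train one sub-model
})
votes <- sapply(models, function(m)                 # (4) predict with each sub-model
  as.character(predict(m, Zoo2, type = "class")))
bagged <- apply(votes, 1, function(v)               # (5) aggregate by majority vote
  names(which.max(table(v))))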

[Figure: the bagging workflow]

  • A random forest is bagging applied to decision trees.

    (1) Feature sampling: as the figure above illustrates, a random forest adds one step to the bagging procedure. After step (2), when training each tree in step (3), a random subset of the predictor variables is drawn at every node of that tree; the splitting predictor is then chosen from this subset by Gini gain.

    (2) Out-of-bag (OOB) data: because bagging samples with replacement, each tree's bootstrap sample always contains some duplicates, while a fraction of samples (commonly quoted as about 1/3) is never drawn. Viewed from a sample's perspective: among the k trees of a random forest, roughly 1/3 were trained without that sample, so those trees can be used to evaluate and validate the model on it (see the note after this list for where the 1/3 comes from).

    (3) Hyperparameters: ① the number of trees (ignoring compute cost, more is better); ② the number of features sampled at each node, i.e. the feature sampling in point (1); ③ tree-level hyperparameters such as the minimum leaf size, as for a single decision tree.

    (4) Variance reduction: bagging suits decision trees because it shrinks the variance of unstable base learners, and unconstrained decision trees typically have high prediction variance.
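Where the ~1/3 figure comes from: the probability that a given sample is never drawn in $n$ draws with replacement is

$$\left(1 - \frac{1}{n}\right)^{n} \;\xrightarrow{\;n \to \infty\;}\; e^{-1} \approx 0.368.$$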

4. Random Forest Modeling with mlr3

We continue with the same dataset used to train the decision tree above.

4.1 The Animal Features Training Dataset

library(mlr3verse)
library(tidyverse)

data(Zoo, package = "mlbench")
Zoo2 = Zoo %>% 
        dplyr::mutate(across(where(is.logical), as.factor))
str(Zoo2)
# 'data.frame':	101 obs. of  17 variables:
# $ hair    : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 2 1 1 2 ...
# $ feathers: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
# $ eggs    : Factor w/ 2 levels "FALSE","TRUE": 1 1 2 1 1 1 1 2 2 1 ...
# $ milk    : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 2 1 1 2 ...
# $ airborne: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
# $ aquatic : Factor w/ 2 levels "FALSE","TRUE": 1 1 2 1 1 1 1 2 2 1 ...
# $ predator: Factor w/ 2 levels "FALSE","TRUE": 2 1 2 2 2 1 1 1 2 1 ...
# $ toothed : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
# $ backbone: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
# $ breathes: Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 2 1 1 2 ...
# $ venomous: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
# $ fins    : Factor w/ 2 levels "FALSE","TRUE": 1 1 2 1 1 1 1 2 2 1 ...
# $ legs    : int  4 4 0 4 4 4 4 0 0 4 ...
# $ tail    : Factor w/ 2 levels "FALSE","TRUE": 1 2 2 1 2 2 2 2 2 1 ...
# $ domestic: Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 2 2 1 2 ...
# $ catsize : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 2 1 1 1 ...
# $ type    : Factor w/ 7 levels "mammal","bird",..: 1 1 4 1 1 1 1 4 4 1 ...

## Columns 1-16 are the animal features
## Column 17 is the animal class label

4.2 Defining the Prediction Target and Training Method

#(1) Define the target: predict the animal type from columns 1-16
task_classif = as_task_classif(Zoo2, target = "type")
task_classif$col_roles$stratum = "type"

#(2) Specify the random forest classification algorithm (ranger)
learner = lrn("classif.ranger", predict_type="prob")
learner$param_set

#(3) Hyperparameter search space and traversal method
search_space = ps(
  num.trees = p_int(lower=1,upper=1000),
  mtry = p_int(lower=6,upper=12),
  min.node.size = p_int(lower=1, upper=5),
  max.depth = p_int(lower=1,upper=20)
)

design = expand.grid(num.trees=c(300, 500, 1000),
                     mtry=c(6, 9, 12),
                     min.node.size=c(1,2,3,4,5),
                     max.depth = c(3, 6, 9)) %>% as.data.table()

#(4) Cross-validation scheme (10-fold by default) and evaluation metric
resampling = rsmp("cv")
measure = msr("classif.acc")
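Two sanity checks on the grid above (my own notes): mtry may not exceed the 16 predictors, which the upper value of 12 respects, and expand.grid yields 3 × 3 × 5 × 3 = 135 combinations, matching the 135 archive rows shown below.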

4.3 Training the Model and Optimizing the Hyperparameters

## Create the tuning instance
instance = TuningInstanceSingleCrit$new(
  task = task_classif,
  learner = learner,
  resampling = resampling,
  measure = measure,
  terminator = trm("none"),
  search_space = search_space
)
tuner = tnr("design_points", design = design)  
## Optimize the hyperparameters
tuner$optimize(instance)
as.data.table(instance$archive)[,1:5]
#      num.trees mtry min.node.size max.depth classif.acc
#   1:       300    6             1         3   0.9311764
#   2:       500    6             1         3   0.9200653
#   3:      1000    6             1         3   0.9200653
#   4:       300    9             1         3   0.8814005
#   5:       500    9             1         3   0.9004481
#  ---                                                   
# 131:       500    9             5         9   0.9718615
# 132:      1000    9             5         9   0.9718615
# 133:       300   12             5         9   0.9623377
# 134:       500   12             5         9   0.9623377
# 135:      1000   12             5         9   0.9718615

instance$result_learner_param_vals
# $num.threads
# [1] 1

# $num.trees
# [1] 300

# $mtry
# [1] 12

# $min.node.size
# [1] 4

# $max.depth
# [1] 6
instance$result_y
#classif.acc 
#0.9813853

4.4 Modeling with the Best Hyperparameters

## Assign the tuned hyperparameters and retrain on the full task
learner$param_set$values = instance$result_learner_param_vals
learner$train(task_classif)
learner$model
# Ranger result

# Call:
#  ranger::ranger(dependent.variable.name = task$target_names, data = task$data(),      probability = self$predict_type == "prob", case.weights = task$weights$weight,      num.threads = 1L, num.trees = 300L, mtry = 12L, min.node.size = 4L,      max.depth = 6L) 

# Type:                             Probability estimation 
# Number of trees:                  300 
# Sample size:                      101 
# Number of independent variables:  16 
# Mtry:                             12 
# Target node size:                 4 
# Variable importance mode:         none 
# Splitrule:                        gini 
# OOB prediction error (Brier s.):  0.04122035 


# Predict on the training task via the mlr3 interface
prediction = learner$predict(task_classif)
prediction$confusion
#                truth
# response        mammal bird reptile fish amphibian insect mollusc.et.al
#   mammal            41    0       0    0         0      0             0
#   bird               0   20       0    0         0      0             0
#   reptile            0    0       5    0         0      0             0
#   fish               0    0       0   13         0      0             0
#   amphibian          0    0       0    0         4      0             0
#   insect             0    0       0    0         0      8             0
#   mollusc.et.al      0    0       0    0         0      0            10

prediction$score(msrs(c("classif.acc","classif.ce")))
# classif.acc  classif.ce 
# 1           0

Note that these are in-sample (training-set) predictions, so perfect accuracy here is expected and optimistic; the OOB prediction error reported above is the more honest performance estimate.