1. Introduction to Naive Bayes

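  • Naive Bayes: estimate the probability that a sample belongs to each class, then assign the class with the highest probability. Four quantities are involved: the posterior probability, the likelihood, the prior probability, and the total probability (the evidence). The figure below shows an example.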

image-20220406104834011

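Example (1): A person tests positive for a disease. What is the probability that they actually have it?

This can be viewed as a problem with one predictor variable (the test result) and one binary class label (diseased or not).

The explanation of the likelihood in the figure below should be corrected to: the probability of testing positive given that the person actually has the disease.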

image-20220406111726127
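
Example (1) can be worked through numerically. The sketch below uses made-up values for the prevalence, the sensitivity, and the false-positive rate (the figures' actual numbers are not reproduced here), so only the structure of the calculation should be taken from it.

```r
# Hypothetical numbers, chosen only for illustration (not taken from the figure).
prior_disease  <- 0.01  # P(disease): the prior
sensitivity    <- 0.95  # P(positive | disease): the likelihood
false_pos_rate <- 0.05  # P(positive | no disease)

# Total probability of a positive result (the evidence)
evidence <- sensitivity * prior_disease +
  false_pos_rate * (1 - prior_disease)

# Posterior: P(disease | positive) = likelihood * prior / evidence
posterior <- sensitivity * prior_disease / evidence
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low under these assumed numbers because the prior is small.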

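  • (1) With several predictor variables, estimate the likelihood of each predictor separately and multiply the results. This is valid only if the predictors are independent of one another within each class (conditional independence), which is the "naive" assumption.
  • (2) For categorical predictors, the probabilities can be computed directly from class-wise frequencies. For continuous predictors, assume the predictor is normally distributed within each class, compute the probability density, and use it in place of a probability.
  • (3) The total probability is hard to obtain, and it is the same constant for every class when the posteriors are compared. It is therefore enough to compute the numerator (prior times the product of the likelihoods) for each class and classify by comparing those values.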
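
The three points above can be sketched in a few lines of base R. Everything in this block is hypothetical: the class names, the priors, the categorical likelihoods, and the per-class means and standard deviations are invented for illustration and do not come from the examples in this post.

```r
# All numbers below are hypothetical, for illustration only.
prior <- c(award = 0.10, no_award = 0.90)

# Categorical predictor: P(is class officer | class), estimated from frequencies
lik_officer_yes <- c(award = 0.60, no_award = 0.20)

# Continuous predictor: exam score, assumed normal within each class
score_mean <- c(award = 90, no_award = 75)
score_sd   <- c(award = 5,  no_award = 10)

# New student: is a class officer and scored 88
# Densities stand in for probabilities (point 2)
lik_score <- dnorm(88, mean = score_mean, sd = score_sd)

# Numerator of Bayes' rule: prior * product of per-predictor likelihoods (point 1)
numerator <- prior * lik_officer_yes * lik_score

# The evidence is the same for both classes, so comparing numerators
# is enough for classification (point 3); normalising is optional.
posterior <- numerator / sum(numerator)
predicted <- names(which.max(numerator))

print(round(posterior, 3))
print(predicted)
```

Normalising by `sum(numerator)` only rescales the values so that they add up to 1; it never changes which class wins.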

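Example (2): A class selects "merit students" with a quota of 10%, based on three criteria. Judge whether a given student has a chance of winning.

This can be viewed as a binary classification problem with three predictor variables.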

image-20220406150739710

2. Modeling with mlr3

library(mlr3verse)
library(tidyverse)

2.1 Example data: House of Representatives voting records

data(HouseVotes84, package = "mlbench")
HouseVotes84_sub = na.omit(HouseVotes84)
str(HouseVotes84_sub)
# 'data.frame':	232 obs. of  17 variables:
#  $ Class: Factor w/ 2 levels "democrat","republican": 1 2 1 1 1 1 1 2 1 2 ...
#  $ V1   : Factor w/ 2 levels "n","y": 1 1 2 2 2 2 2 2 2 1 ...
#  $ V2   : Factor w/ 2 levels "n","y": 2 2 2 2 1 1 2 1 2 2 ...
#  $ V3   : Factor w/ 2 levels "n","y": 2 1 2 2 2 2 2 1 2 1 ...
#  $ V4   : Factor w/ 2 levels "n","y": 1 2 1 1 1 1 1 2 1 2 ...
#  $ V5   : Factor w/ 2 levels "n","y": 2 2 1 1 1 1 1 2 1 2 ...
#  $ V6   : Factor w/ 2 levels "n","y": 2 2 1 1 1 1 1 1 1 2 ...
#  $ V7   : Factor w/ 2 levels "n","y": 1 1 2 2 2 2 2 2 2 1 ...
#  $ V8   : Factor w/ 2 levels "n","y": 1 1 2 2 2 2 2 2 2 1 ...
#  $ V9   : Factor w/ 2 levels "n","y": 1 1 2 2 2 2 2 2 2 1 ...
#  $ V10  : Factor w/ 2 levels "n","y": 1 1 1 1 2 1 1 1 1 1 ...
#  $ V11  : Factor w/ 2 levels "n","y": 1 1 2 1 1 2 2 1 2 1 ...
#  $ V12  : Factor w/ 2 levels "n","y": 1 2 1 1 1 1 1 2 1 2 ...
#  $ V13  : Factor w/ 2 levels "n","y": 2 2 1 1 1 1 1 2 1 2 ...
#  $ V14  : Factor w/ 2 levels "n","y": 2 2 1 1 1 1 1 2 1 2 ...
#  $ V15  : Factor w/ 2 levels "n","y": 2 1 2 2 2 2 2 1 2 1 ...
#  $ V16  : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 1 ...
#  - attr(*, "na.action")= 'omit' Named int [1:203] 1 2 3 4 5 7 8 10 11 12 ...
#   ..- attr(*, "names")= chr [1:203] "1" "2" "3" "4" ...

# Column 1: party affiliation of the representative
# Columns 2 to 17: the representative's vote (y/n) on each of 16 votes

2.2 Defining the prediction target and the learner

  • Goal: predict from the 16 votes whether a representative is a Republican or a Democrat
task_classif = as_task_classif(HouseVotes84_sub, target = "Class")
task_classif$col_roles$stratum = "Class"
  • Use the naive Bayes classification learner
learner = lrn("classif.naive_bayes", predict_type="prob")
learner$param_set  # only a few hyperparameters (such as laplace smoothing); none are tuned here
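
Although no tuning is done in this post, the learner's parameter set can still be changed by hand. As far as I know, `classif.naive_bayes` wraps `e1071::naiveBayes()` and passes through its `laplace` argument. The sketch below relies on that assumption, and the name `learner_smoothed` is introduced here only for illustration.

```r
library(mlr3verse)

# Assumption: the learner accepts a `laplace` hyperparameter
# (add-one smoothing for categorical predictors, default 0).
learner_smoothed = lrn("classif.naive_bayes",
                       predict_type = "prob",
                       laplace = 1)

# Show the values that differ from the defaults
print(learner_smoothed$param_set$values)
```

Laplace smoothing matters mainly when some vote/class combination never occurs in the training data, which would otherwise force a likelihood of exactly zero.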

2.3 Model training and prediction

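## Train on 60% of the data and hold out the remaining 40% for testing
## (some newer mlr3 releases may no longer accept `stratify`; the stratum role set on the task is then used)
split = partition(task_classif, ratio = 0.6, stratify = TRUE)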
learner$train(task_classif, row_ids = split$train)
prediction = learner$predict(task_classif, row_ids = split$test)
prediction$confusion
#             truth
# response     democrat republican
#   democrat         44          2
#   republican        6         41

as.data.table(prediction) %>% head
#    row_ids    truth response prob.democrat prob.republican
# 1:       3 democrat democrat             1    2.762994e-11
# 2:       6 democrat democrat             1    2.849338e-11
# 3:       9 democrat democrat             1    2.762994e-11

prediction$score(msr("classif.auc"))
#classif.auc 
#  0.9683721
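
The confusion matrix above also yields accuracy and per-class recall directly. A minimal base-R sketch, with the four counts copied from the printed output (the object name `conf` is introduced here for illustration):

```r
# Counts copied from prediction$confusion above
# (rows: predicted response, columns: true class)
conf <- matrix(c(44, 6, 2, 41), nrow = 2,
               dimnames = list(response = c("democrat", "republican"),
                               truth    = c("democrat", "republican")))

# Accuracy: correctly classified / all test observations
accuracy <- sum(diag(conf)) / sum(conf)

# Recall per class: correct predictions / true members of that class
recall <- diag(conf) / colSums(conf)

print(round(accuracy, 3))  # 0.914
print(round(recall, 3))
```

Within mlr3 the same accuracy should be available through `prediction$score(msr("classif.acc"))`.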

2.4 Cross-validating the model

# 10-fold cross-validation (10 folds is also the default of rsmp("cv"))
resampling = rsmp("cv", folds = 10)
rr = resample(task_classif, learner, resampling)
rr$prediction()
# <PredictionClassif> for 232 observations:
#     row_ids      truth   response prob.democrat prob.republican
#          48   democrat   democrat  1.000000e+00    7.489670e-12
#          50   democrat   democrat  9.985252e-01    1.474753e-03
#          55   democrat   democrat  1.000000e+00    6.878633e-11
# ---                                                            
#         144 republican republican  9.822952e-08    9.999999e-01
#         190 republican   democrat  9.999878e-01    1.215106e-05
#         200 republican republican  8.595926e-08    9.999999e-01

rr$aggregate(msr("classif.auc"))
# classif.auc 
# 0.9709499

autoplot(rr, type = "roc")
image-20220701211000852
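
The aggregated AUC hides how much the folds differ from one another. The sketch below rebuilds the task and resampling from this post so that it runs on its own, then lists the AUC of each fold. The seed and the names `task`, `rr`, and `fold_auc` are choices made here for illustration, so the numbers will not match the output shown earlier.

```r
library(mlr3verse)

# Rebuild the task as in section 2.2
data(HouseVotes84, package = "mlbench")
task = as_task_classif(na.omit(HouseVotes84), target = "Class")
task$col_roles$stratum = "Class"

learner = lrn("classif.naive_bayes", predict_type = "prob")

set.seed(1)  # arbitrary seed, only to make the folds reproducible
rr = resample(task, learner, rsmp("cv", folds = 10))

# One row per fold, with the AUC of that fold
fold_auc = rr$score(msr("classif.auc"))
print(fold_auc[, c("iteration", "classif.auc")])

# With the default (macro) aggregation, the mean of the fold scores
# is what rr$aggregate() reports
print(mean(fold_auc$classif.auc))
```

A large spread between folds would suggest that the single hold-out estimate in section 2.3 depends heavily on the particular split.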