📖 Bioinformatics data analysis--workflows, toolkits, etc.

Machine learning in R (0)--mlr3 basic workflow V2

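https://mlr3book.mlr-org.com/

```r
library(mlr3verse)
library(tidyverse)

tsks()   # built-in tasks
lrns()   # machine learning algorithms (learners)
msrs()   # performance measures
# as.data.table() turns these dictionaries into tables, e.g. as.data.table(tsk())
```

1. Task

https://mlr3book.mlr-org.com/chapters/chapter2/data_and_basic_modeling.html

```r
tsk()                  # dictionary of built-in tasks
as.data.table(tsk())
tsk("mtcars")

# Custom task:
# target names the label column; id (optional) sets the task name
tsk_mtcars = as_task_regr(mtcars, target = "mpg", id = "cars")
# as_task_classif() is the counterpart for classification

# Task objects support inspecting and modifying the data; see the link above for details.
# Two points deserve emphasis:

# row_ids are not ordinary row positions. They are fixed once the task is defined,
# much like row names, which makes later data splitting easier.
tsk_mtcars$row_ids

# Use clone() to obtain an independent copy of a task
tsk_mtcars_another = tsk_mtcars$clone()
```

Classification tasks work much the same way. Note that for binary classification the positive class must additionally be specified ...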
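As a small sketch of the note on binary classification: `as_task_classif()` accepts a `positive` argument that names the positive class. The data frame `toy` and its columns are made up purely for illustration.

```r
library(mlr3)

# Hypothetical toy data: a two-level factor target plus one numeric feature
toy = data.frame(
  status = factor(c("sick", "healthy", "sick", "healthy", "healthy", "sick")),
  marker = c(2.1, 0.4, 1.8, 0.6, 0.3, 2.5)
)

# Name the positive class explicitly instead of relying on the default
task_toy = as_task_classif(toy, target = "status", positive = "sick", id = "toy")

task_toy$positive      # class treated as positive
task_toy$negative      # the other class
task_toy$class_names   # positive class is listed first
```

Without `positive`, the first factor level is used, which may not be the class of interest.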

Machine learning in R (0)--mlr3 basic workflow

```r
library(mlr3verse)
library(tidyverse)
```

1. Task: training data and objective

```r
## Classification task
# depending on the target, either twoclass (binary) or multiclass
task_classif = as_task_classif(data, target = "col_target")

## Regression task
task_regr = as_task_regr(data, target = "col_target")

# Fields available on either kind of task
task = task_classif
task$ncol
task$nrow
task$feature_names
task$feature_types
task$target_names
task$task_type
task$data()
task$col_roles
```

2. Learner: machine learning algorithms

The mlr3learners package provides the basic machine learning algorithms (see the figure below) https://github.com/mlr-org/mlr3learners ...
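To show how a task and a learner fit together, here is a minimal end-to-end sketch. The built-in `iris` task and the `classif.rpart` learner are picked only for illustration (assuming the rpart package is installed); any task and learner pair follows the same pattern.

```r
library(mlr3)

task = tsk("iris")                     # built-in multiclass task
learner = lrn("classif.rpart")         # decision tree learner

set.seed(1)
split = partition(task, ratio = 0.7)   # train and test row ids

learner$train(task, row_ids = split$train)
prediction = learner$predict(task, row_ids = split$test)

acc = prediction$score(msr("classif.acc"))   # accuracy on the held-out rows
acc
```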

Machine learning with the R package mlr3 (1)--classification--KNN

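KNN: k-nearest neighbors

1. Steps of KNN

(1) Compute the distance between the input point and every training point (usually the Euclidean distance);
(2) Select the k training points closest to the input point;
(3) For classification (the common case), take the mode of the classes of those k points; for regression, take the mean of their values.

Characteristics

(1) As the steps show, KNN has no model-training phase. To predict, the distances to the training set are computed directly.
(2) The most important hyperparameter in KNN is the choice of k, which is covered in the hands-on part below.
(3) Because distances are computed, numeric variables need to be standardized and categorical variables, if any, need to be converted to numeric form.
(4) KNN performs well on small or low-dimensional data but is not suited to large datasets because of the computational cost.

On distance: Euclidean distance, normalization (centering) ...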
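The three KNN steps can be sketched in base R as follows. This is an illustrative, unoptimized version, not the implementation used by mlr3; the function name `knn_predict` and the choice of the `iris` data are assumptions made for the example.

```r
# Minimal k-nearest-neighbours classifier in base R (illustrative only)
knn_predict = function(train_x, train_y, new_x, k = 5) {
  # Standardize with the training means and standard deviations
  mu = colMeans(train_x)
  sdev = apply(train_x, 2, sd)
  train_s = scale(train_x, center = mu, scale = sdev)
  new_s = scale(new_x, center = mu, scale = sdev)

  apply(new_s, 1, function(p) {
    # Step 1: Euclidean distance from the new point to every training point
    d = sqrt(rowSums(sweep(train_s, 2, p)^2))
    # Step 2: labels of the k closest training points
    nearest = train_y[order(d)[seq_len(k)]]
    # Step 3: majority vote (ties go to the first most frequent class)
    names(which.max(table(nearest)))
  })
}

set.seed(42)
test_idx = sample(nrow(iris), 30)
train_x = as.matrix(iris[-test_idx, 1:4])
test_x = as.matrix(iris[test_idx, 1:4])

pred = knn_predict(train_x, iris$Species[-test_idx], test_x, k = 5)
accuracy = mean(pred == iris$Species[test_idx])   # share of correct predictions
accuracy
```

Standardizing with the training statistics only keeps information from the test rows out of the distance calculation.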

Machine learning with the R package mlr3 (2)--classification--logistic regression

1. Understanding the logistic regression algorithm

Logistic regression = linear regression + sigmoid function. Like linear regression, it learns the weights (coefficients) of the variables and a bias (intercept). Unlike linear regression, its output is constrained to lie between 0 and 1 and is interpreted as a probability (binary classification). In general: P > 0.5 is classified as 1, P < 0.5 as 0.

2. Modeling with mlr3

```r
library(mlr3verse)
library(tidyverse)
```

2.1 Titanic example data

```r
data(titanic_train, package = "titanic")
titanicSub = titanic_train[, c("Survived", "Sex", "Pclass",
                               "Age", "Fare", "SibSp", "Parch")]
summary(titanicSub)
#     Survived          Sex               Pclass           Age             Fare            SibSp           Parch
#  Min.   :0.0000   Length:891         Min.   :1.000   Min.   : 0.42   Min.   :  0.00   Min.   :0.000   Min.   :0.0000
#  1st Qu.:0.0000   Class :character   1st Qu.:2.000   1st Qu.:20.12   1st Qu.:  7.91   1st Qu.:0.000   1st Qu.:0.0000
#  Median :0.0000   Mode  :character   Median :3.000   Median :28.00   Median : 14.45   Median :0.000   Median :0.0000
#  Mean   :0.3838                      Mean   :2.309   Mean   :29.70   Mean   : 32.20   Mean   :0.523   Mean   :0.3816
#  3rd Qu.:1.0000                      3rd Qu.:3.000   3rd Qu.:38.00   3rd Qu.: 31.00   3rd Qu.:1.000   3rd Qu.:0.0000
#  Max.   :1.0000                      Max.   :3.000   Max.   :80.00   Max.   :512.33   Max.   :8.000   Max.   :6.0000

# Column 1: survived or not, 0/1
# Column 2: sex
# Column 3: passenger class (first, second, third), 1/2/3
# Column 4: age
# Column 5: fare
# Column 6: number of siblings and spouses aboard
# Column 7: number of parents and children aboard
```

Data preprocessing

```r
# Drop rows with missing values
titanicSub = na.omit(titanicSub)

# Convert the categorical variables to factors
titanicSub$Survived = factor(titanicSub$Survived)
titanicSub$Sex = factor(titanicSub$Sex)
titanicSub$Pclass = factor(titanicSub$Pclass)
head(titanicSub)
#   Survived    Sex Pclass Age    Fare SibSp Parch
# 1        0   male      3  22  7.2500     1     0
# 2        1 female      1  38 71.2833     1     0
# 3        1 female      3  26  7.9250     0     0
# 4        1 female      1  35 53.1000     1     0
# 5        0   male      3  35  8.0500     0     0
# 7        0   male      1  54 51.8625     0     0
```

2.2 Defining the prediction target and the training method

(1) Prediction target: predict survival from the six variables Pclass, Sex, Age, Fare, SibSp, and Parch

```r
task_classif = as_task_classif(titanicSub, target = "Survived")
# Stratify on the target when splitting and resampling
task_classif$col_roles$stratum = "Survived"
task_classif$col_roles
```

(2) Prediction method: logistic regression, which has no hyperparameters that need tuning

```r
# With predict_type = "prob" the prediction contains class probabilities
# in addition to the predicted class
learner = lrn("classif.log_reg", predict_type = "prob")
learner$param_set
```

2.3 Model training and prediction

```r
# Train the model on the training set
# (the split is stratified through the stratum role set on the task)
split = partition(task_classif, ratio = 0.6)
learner$train(task_classif, row_ids = split$train)

# Predict on the test set
prediction = learner$predict(task_classif, row_ids = split$test)
prediction$confusion
#         truth
# response   0   1
#        0 138  36
#        1  32  80

as.data.table(prediction) %>% head
#    row_ids truth response    prob.0     prob.1
# 1:       1     0        0 0.9240149 0.07598510
# 2:      17     0        1 0.4895318 0.51046818
# 3:      28     0        0 0.7318201 0.26817990

## For binary classification, AUC and similar measures are available
prediction$score(msrs(c("classif.acc", "classif.auc")))
# classif.acc classif.auc
#   0.7622378   0.8245436

autoplot(prediction, type = "roc")
```

Understanding the model coefficients

```r
learner$model$coefficients
#  (Intercept)          Age         Fare        Parch      Pclass2      Pclass3      Sexmale        SibSp
# -4.776866553  0.055414781 -0.001065547 -0.024478041  1.081818050  2.554926703  3.027348240  0.481382618

# Exponentiate to obtain odds ratios
exp(cbind(Odds_Ratio = learner$model$coefficients))
#               Odds_Ratio
# (Intercept)  0.008422349
# Age          1.056978939
# Fare         0.998935021
# Parch        0.975819117
# Pclass2      2.950037995
# Pclass3     12.870356259
# Sexmale     20.642421208
# SibSp        1.618310361

# No positive class was set, so mlr3 uses the first level "0" (did not survive)
# as the positive class. The odds ratios therefore describe the odds of NOT
# surviving, which matches prob.0 in the prediction table above.
# Continuous variable, e.g. Age: holding the other variables fixed, each extra
# year of age multiplies the odds of not surviving by about 1.057 (+5.7%).
# Categorical variable (needs a reference level), e.g. Sexmale: the odds of not
# surviving for males are about 20.6 times those for females.
```

Predicting new data

```r
data(titanic_test, package = "titanic")  # no survival information
titanicNewClean = titanic_test[, c("Sex", "Pclass", "Age", "Fare", "SibSp", "Parch")]
titanicNewClean = na.omit(titanicNewClean)

titanicNewClean$Sex = factor(titanicNewClean$Sex)
titanicNewClean$Pclass = factor(titanicNewClean$Pclass)

learner$predict_newdata(titanicNewClean)
# <PredictionClassif> for 331 observations:
#  row_ids truth response     prob.0     prob.1
#        1  <NA>        0 0.92308299 0.07691701
#        2  <NA>        0 0.61325375 0.38674625
#        3  <NA>        0 0.92323646 0.07676354
#      ---
#      329  <NA>        1 0.35533044 0.64466956
#      330  <NA>        1 0.04779831 0.95220169
#      331  <NA>        0 0.93389621 0.06610379
```

2.4 Cross-validating the model

5-fold cross-validation repeated 5 times (25 iterations)

```r
resampling = rsmp("repeated_cv")
resampling$param_set$values$repeats = 5
resampling$param_set$values$folds = 5

# Run the resampling
rr = resample(task_classif, learner, resampling)
rr$prediction()
rr$score(msr("classif.auc"))[, c(-1, -3, -5, -8)]
#        task_id      learner_id resampling_id iteration classif.auc
#  1: titanicSub classif.log_reg   repeated_cv         1   0.8539554
#  2: titanicSub classif.log_reg   repeated_cv         2   0.8385396
#  3: titanicSub classif.log_reg   repeated_cv         3   0.8392495
#  4: titanicSub classif.log_reg   repeated_cv         4   0.8678499
#  5: titanicSub classif.log_reg   repeated_cv         5   0.8663793
#  6: titanicSub classif.log_reg   repeated_cv         6   0.8709939
#  7: titanicSub classif.log_reg   repeated_cv         7   0.8847870
#  8: titanicSub classif.log_reg   repeated_cv         8   0.8330629
#  9: titanicSub classif.log_reg   repeated_cv         9   0.8314402
# 10: titanicSub classif.log_reg   repeated_cv        10   0.8467775
# 11: titanicSub classif.log_reg   repeated_cv        11   0.7981744
# 12: titanicSub classif.log_reg   repeated_cv        12   0.8308316
# 13: titanicSub classif.log_reg   repeated_cv        13   0.8947262
# 14: titanicSub classif.log_reg   repeated_cv        14   0.8658215
# 15: titanicSub classif.log_reg   repeated_cv        15   0.8723317
# 16: titanicSub classif.log_reg   repeated_cv        16   0.8669371
# 17: titanicSub classif.log_reg   repeated_cv        17   0.8277890
# 18: titanicSub classif.log_reg   repeated_cv        18   0.8336714
# 19: titanicSub classif.log_reg   repeated_cv        19   0.8377282
# 20: titanicSub classif.log_reg   repeated_cv        20   0.8955255
# 21: titanicSub classif.log_reg   repeated_cv        21   0.8566937
# 22: titanicSub classif.log_reg   repeated_cv        22   0.8292089
# 23: titanicSub classif.log_reg   repeated_cv        23   0.9101420
# 24: titanicSub classif.log_reg   repeated_cv        24   0.7997972
# 25: titanicSub classif.log_reg   repeated_cv        25   0.8844417

rr$aggregate(msrs(c("classif.auc", "classif.prauc")))
#   classif.auc classif.prauc
#     0.8534742     0.8735251

autoplot(rr, type = "prc")
```
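The "linear regression + sigmoid" view from section 1 can be checked by hand. The sketch below uses base R's `glm()` on the built-in `mtcars` data rather than the Titanic task, purely so that it runs without extra packages; the variable choice (`am` predicted from `wt`) is an assumption made for illustration.

```r
# Logistic regression with base R on mtcars
# (am: 0 = automatic, 1 = manual; wt: weight in 1000 lbs)
fit = glm(am ~ wt, data = mtcars, family = binomial())

# Linear part: intercept + coefficient * feature
linear_part = coef(fit)[["(Intercept)"]] + coef(fit)[["wt"]] * mtcars$wt

# The sigmoid squeezes the linear part into the interval (0, 1)
sigmoid = function(z) 1 / (1 + exp(-z))
p_manual = sigmoid(linear_part)

# Same values as the fitted probabilities reported by glm
same = all.equal(unname(fitted(fit)), p_manual)
same

# Threshold at 0.5 to obtain class labels
pred_class = as.integer(p_manual > 0.5)

# Odds ratio for a one-unit increase in wt
odds_ratio = exp(coef(fit)[["wt"]])
odds_ratio
```

An odds ratio below 1 means the odds of the modeled class shrink as the feature grows; it is a statement about odds, not directly about probabilities.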

Machine learning with the R package mlr3 (3)--classification--LDA and QDA

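1. Overview

LDA and QDA can be loosely understood as supervised dimensionality reduction: the information in many predictors is compressed into a few new predictors (number of classes minus 1). Each new predictor is called a discriminant function and is a linear combination of all the original variables. ...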
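The "number of classes minus 1" claim can be seen in a quick sketch with `MASS::lda()` (MASS is a recommended package distributed with R). The `iris` data is chosen here only as a convenient three-class example.

```r
library(MASS)

fit = lda(Species ~ ., data = iris)   # 3 classes, 4 predictors

# Each column holds the weights of one discriminant function,
# i.e. one linear combination of the four original variables
fit$scaling
n_functions = ncol(fit$scaling)
n_functions

# Project every observation onto the discriminant functions
scores = predict(fit)$x
head(scores)
```

Four predictors are reduced to two new variables (LD1 and LD2), which can then be plotted or used for classification.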

Machine learning with the R package mlr3 (4)--classification--naive Bayes

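1. Introduction to naive Bayes

Naive Bayes predicts the probability that a sample belongs to each class and picks the class with the highest probability. It involves four concepts: posterior probability, likelihood, prior probability, and total probability (the evidence), as shown in the figure below.

Example (1): a person tests positive for a disease; what is the probability that they actually have it? ...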
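A worked version of the diagnosis question, with made-up numbers: the prevalence, sensitivity, and false-positive rate below are hypothetical values chosen only to show how the four quantities combine.

```r
# Hypothetical numbers, for illustration only
prior = 0.01            # P(disease): prevalence
likelihood = 0.95       # P(positive | disease): sensitivity
false_positive = 0.05   # P(positive | no disease)

# Total probability of a positive test (the evidence)
evidence = likelihood * prior + false_positive * (1 - prior)

# Posterior: P(disease | positive) = likelihood * prior / evidence
posterior = likelihood * prior / evidence
round(posterior, 3)   # 0.161
```

Even with a fairly accurate test, the low prior keeps the posterior far below the sensitivity.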

Machine learning with the R package mlr3 (5)--classification--SVM

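1. Basic SVM concepts

Hyperplane: a surface with one dimension fewer than the number of variables in the dataset, also called the decision boundary.

Margin: (for a hard margin) the distance between the decision boundary and the training samples closest to it.

Support vectors: (for a hard margin) the samples that touch the margin boundary; they hold the hyperplane in position. (For a soft margin) samples inside the margin are also support vectors, because moving them changes the position of the hyperplane as well.

As shown in the figure below, the SVM algorithm searches for an optimal linear hyperplane to separate the classes. ...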
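To make the hyperplane, margin, and support-vector definitions concrete, the sketch below computes them by hand for a fixed hyperplane. Both the six points and the hyperplane (`w`, `b`) are hypothetical; nothing here is fitted by an SVM solver.

```r
# Hypothetical 2-D points with labels -1 / +1
x = rbind(c(1, 1), c(2, 1), c(1, 2), c(3, 3), c(4, 3), c(3, 4))
y = c(-1, -1, -1, 1, 1, 1)

# An assumed separating hyperplane w . x + b = 0 (a line in 2-D)
w = c(1, 1)
b = -4.5

# Signed score and distance of each point to the hyperplane
score = as.vector(x %*% w + b)
d = abs(score) / sqrt(sum(w^2))

# Every point lies on the correct side when y * score > 0
separated = all(y * score > 0)

# Margin: distance of the closest points, which are the support-vector candidates
margin = min(d)
support_idx = which(abs(d - margin) < 1e-12)

separated
margin
support_idx
```

An actual SVM would search over `w` and `b` to make this margin as large as possible.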

Machine learning with the R package mlr3 (6)--classification--decision trees and random forests

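1. Decision tree basics

1.1 Structure of a decision tree

(1) A decision tree is made up of nodes, which are either decision nodes or leaf nodes.
(2) The first node from the top is also called the root node. The longest distance from the root node to a leaf node is called the depth of the tree. ...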
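The node types and the tree depth can be read off a fitted tree. The sketch below uses `rpart` (a recommended package distributed with R) on `iris`, both chosen only for illustration; it relies on rpart's node numbering, where the root is 1 and the children of node n are 2n and 2n + 1.

```r
library(rpart)

fit = rpart(Species ~ ., data = iris)

# fit$frame has one row per node; leaf nodes are marked "<leaf>"
is_leaf = fit$frame$var == "<leaf>"
n_leaf = sum(is_leaf)
n_decision = sum(!is_leaf)   # decision nodes, including the root

# With heap-style numbering, a node's depth is floor(log2(node number))
node_id = as.integer(rownames(fit$frame))
depth = max(floor(log2(node_id)))

c(leaves = n_leaf, decision_nodes = n_decision, depth = depth)
```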

Machine learning with the R package mlr3 (8)--regression--linear regression

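1. About linear regression

1.1 Understanding the formula

Since real problems rarely involve single-variable linear regression, the more common form is the general linear model:

$$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_kx_k + \epsilon $$

(1) β0 is the intercept, i.e. the value of y when all predictors are 0; ...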
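The role of each term in the formula can be checked with base R's `lm()`. The model below (`mpg` from `wt` and `hp` in the built-in `mtcars` data) and the car described by `new_car` are assumptions made only for the example.

```r
fit = lm(mpg ~ wt + hp, data = mtcars)
beta = coef(fit)   # beta_0 (intercept), beta_1 (wt), beta_2 (hp)
beta

# The intercept equals the prediction when every predictor is 0
at_zero = predict(fit, newdata = data.frame(wt = 0, hp = 0))
intercept_ok = all.equal(unname(at_zero), beta[["(Intercept)"]])

# A prediction is the intercept plus each coefficient times its predictor
new_car = data.frame(wt = 3, hp = 110)   # hypothetical car
by_hand = sum(beta * c(1, new_car$wt, new_car$hp))
formula_ok = all.equal(by_hand, unname(predict(fit, newdata = new_car)))

c(intercept_ok, formula_ok)
```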

Machine learning with the R package mlr3 (9)--regression--GAM nonlinear regression

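1. About GAM nonlinear regression

(1) Polynomial of degree n

As noted earlier, linear regression assumes that each predictor is linearly related to the outcome, i.e. something like y = ax + b. When the relationship between a predictor and the outcome is nonlinear, i.e. curved, a higher-order polynomial fit can be tried.

$$ y = \beta_0 + \beta_1x + \beta_2x^2 + \dots + \beta_nx^n + \varepsilon $$ ...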
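A small sketch of the polynomial idea, on simulated data: the true curve (`1 + 2x - 1.5x^2`) and the noise level are assumptions chosen so that the curvature is easy to see.

```r
set.seed(1)
x = seq(-3, 3, length.out = 100)
y = 1 + 2 * x - 1.5 * x^2 + rnorm(100)   # simulated curved relationship

fit_linear = lm(y ~ x)                      # y = b0 + b1 * x
fit_quad = lm(y ~ poly(x, 2, raw = TRUE))   # adds the x^2 term

r2_linear = summary(fit_linear)$r.squared
r2_quad = summary(fit_quad)$r.squared
c(linear = r2_linear, quadratic = r2_quad)

coef(fit_quad)   # estimates of beta_0, beta_1, beta_2
```

With `raw = TRUE` the coefficients map directly onto the β terms in the formula; the default orthogonal polynomials give the same fit but coefficients that are harder to read.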