1. Loading libraries

```python
import pandas as pd
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils import data
import itertools
```

2. Example data

The Kaggle house-price data can be downloaded from:

- http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_train.csv
- http://d2l-data.s3-accelerate.amazonaws.com/kaggle_house_pred_test.csv

```python
train_data = pd.read_csv("../data/kaggle_house_pred_train.csv")
test_data = pd.read_csv("../data/kaggle_house_pred_test.csv")
train_data.shape, test_data.shape

# Drop the Id column (and SalePrice from the training set), then pool the
# features of both sets so they receive identical preprocessing.
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

# Standardize the numeric columns, then fill missing values with 0
# (the mean after standardization).
num_features = all_features.dtypes[all_features.dtypes != "object"].index
all_features[num_features] = all_features[num_features].apply(
    lambda x: (x - x.mean()) / x.std()
)
all_features[num_features] = all_features[num_features].fillna(0)

# One-hot encode the categorical columns; dtype=float keeps the dummy
# columns numeric so torch.tensor below does not receive bool/object data.
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)
all_features.shape

n_train = train_data.shape[0]
train_feats = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_feats = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(train_data.SalePrice.values.reshape((-1, 1)),
                            dtype=torch.float32)
```

3. Defining the model

```python
class MLP(nn.Module):
    def __init__(self, in_feats, hidden_feats, dropout):
        super().__init__()
        self.hidden = nn.Linear(in_feats, hidden_feats)
        self.out = nn.Linear(hidden_feats, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, X):
        hiddens = F.relu(self.hidden(X))
        output = self.out(self.dropout(hiddens))
        return output
```

Torch model basics

```python
model = MLP(10, 6, 0.1)
model

# Inspect the parameters of each layer as initialized by torch's defaults
model.state_dict()
model.state_dict().keys()
model.state_dict()['hidden.bias']
model.hidden.bias.data
model.out.weight.grad is None  # no gradients before the first backward pass

# Custom parameter initialization
def init_normal(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, mean=0, std=0.01)
        nn.init.zeros_(m.bias)

model.apply(init_normal)
model.state_dict()

def init_xavier(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)

model.apply(init_xavier)
model.state_dict()

# Saving and loading model parameters
torch.save(model.state_dict(), "mlp.params")
new_model = MLP(10, 6, 0.1)
new_model.load_state_dict(torch.load("mlp.params"))
```

GPU acceleration. The first two commands are run in a shell, not in Python:

```bash
nvidia-smi                    # show the GPUs on the current system
watch -n 0.1 -d nvidia-smi    # refresh the view every 0.1 s
```

```python
torch.cuda.is_available()   # is a GPU available?
torch.cuda.device_count()   # number of visible GPUs

# Move the data and the model onto the same GPU
def try_gpu(i=0):
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device("cpu")

X = torch.ones(2, 3, device=try_gpu(0))
model.to("cuda:0")  # or model.to(try_gpu()) to fall back to CPU safely
```

4. Loss function and evaluation metric

```python
loss = nn.MSELoss()

# Relative error on the log scale: clamp predictions to [1, inf) so that
# taking the logarithm is always well defined.
def log_rmse(model, feature, labels):
    clipped_preds = torch.clamp(model(feature), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds), torch.log(labels)))
    return rmse.item()
```
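To see why the clamp matters, here is a quick smoke test (a minimal sketch; the toy model and random tensors are made up for illustration and are not part of the original post). An untrained network can output non-positive values, and without the clamp `torch.log` would produce NaN or -inf:

```python
# Sanity check of log_rmse with illustrative values only.
torch.manual_seed(0)
toy_model = MLP(in_feats=5, hidden_feats=4, dropout=0.0)
toy_X = torch.randn(8, 5)
toy_y = torch.rand(8, 1) * 100 + 1   # fake positive "prices"
# Raw predictions of the untrained net may be <= 0; clamping to [1, inf)
# keeps torch.log finite, so this prints a finite number.
print(log_rmse(toy_model, toy_X, toy_y))
```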
5. Mini-batch training loop

```python
def load_array(data_arrays, batch_size, is_train=True):
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

def train(model, train_feats, train_labels, test_feats, test_labels,
          num_epochs, lr, weight_decay, batch_size):
    train_ls, test_ls = [], []  # per-epoch training/test performance
    train_iter = load_array((train_feats, train_labels), batch_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss(model(X), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(model, train_feats, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(model, test_feats, test_labels))
    return train_ls, test_ls
```

6. K-fold cross-validation

```python
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part  # fold i is the validation set
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size, in_feats, hidden_feats, dropout):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        # named fold_data so the torch.utils.data import is not shadowed
        fold_data = get_k_fold_data(k, i, X_train, y_train)
        model = MLP(in_feats, hidden_feats, dropout)
        train_ls, valid_ls = train(model, *fold_data, num_epochs,
                                   learning_rate, weight_decay, batch_size)
        # use the last epoch's performance as the model's final performance
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        # print(f'Fold-{i+1}, train log rmse {float(train_ls[-1]):f}, '
        #       f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k

# k, num_epochs, learning_rate, weight_decay, batch_size = 10, 100, 5, 0, 64
# in_feats, hidden_feats, dropout = train_feats.shape[1], 64, 0.5
# train_l, valid_l = k_fold(k, train_feats, train_labels,
#                           num_epochs, learning_rate, weight_decay, batch_size,
#                           in_feats, hidden_feats, dropout)
```

7. Hyperparameter grid search

```python
k, num_epochs = 5, 100
in_feats = [train_feats.shape[1]]
learning_rate = [0.1, 1, 3, 5]
weight_decay = [0, 0.001]
batch_size = [32, 64]
hidden_feats = [16, 64, 128]
dropout = [0, 0.1]

# Materialize the grid once: an itertools.product iterator would be
# exhausted after len(list(...)) and could not be looped over again.
grids = list(itertools.product(learning_rate, weight_decay, batch_size,
                               in_feats, hidden_feats, dropout))
len_grids = len(grids)

grid_train_l, grid_valid_l = [], []
for j, args in enumerate(grids):
    print(f'{j+1}/{len_grids}: {args}')
    train_l, valid_l = k_fold(k, train_feats, train_labels, num_epochs, *args)
    grid_train_l.append(train_l)
    grid_valid_l.append(valid_l)
    print(f'---- valid rmse {valid_l:.2f}')
```
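The post stops at the grid search. A natural follow-up, sketched below and not part of the original code, is to pick the configuration with the lowest validation log RMSE, retrain it on the full training set, and write a Kaggle submission file (names such as final_model and submission.csv are made up for illustration):

```python
# Sketch: retrain the best configuration on all training data.
best_j = grid_valid_l.index(min(grid_valid_l))
lr_b, wd_b, bs_b, in_b, hid_b, dp_b = grids[best_j]

final_model = MLP(in_b, hid_b, dp_b)
# Passing None for the test split: train() only records test metrics
# when test_labels is not None.
train(final_model, train_feats, train_labels, None, None,
      num_epochs, lr_b, wd_b, bs_b)

final_model.eval()  # disable dropout for prediction
preds = final_model(test_feats).detach().numpy().squeeze()
submission = pd.DataFrame({'Id': test_data.Id, 'SalePrice': preds})
submission.to_csv('submission.csv', index=False)
```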