Deep Learning | Li's Bioinfo-Blog

Building a Vocab with Torchtext

torchtext.vocab

```python
from torchtext.vocab import vocab
```

1. Defining the vocabulary

From a token frequency table (an OrderedDict object)

```python
vocab(ordered_dict,          # An OrderedDict mapping tokens to their frequencies.
      min_freq=1,            # Minimum frequency a token needs to be included.
      specials=None,         # A list of special tokens (e.g. <unk>, <pad>, <bos>, <eos>).
      special_first=True)    # Boolean: whether the special tokens go at the start of the vocabulary.
```

```python
from collections import Counter, OrderedDict

counter = Counter(["a", "a", "b", "b", "b"])
counter
# Counter({'b': 3, 'a': 2})

sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
sorted_by_freq_tuples
# [('b', 3), ('a', 2)]

ordered_dict = OrderedDict(sorted_by_freq_tuples)
# OrderedDict([('b', 3), ('a', 2)])

v1 = vocab(ordered_dict)
# Vocab()
```

Directly from an iterable

```python
build_vocab_from_iterator(
    iterator,            # Iterator used to build Vocab. Must yield list or iterator of tokens.
    min_freq=1,
    specials=None,
    special_first=True,
    max_tokens=None,     # Maximum number of tokens to include
)
```

```python
from torchtext.vocab import build_vocab_from_iterator

# Define a simple iterator.
# The yield keyword defines a generator function. Values are produced lazily,
# on demand rather than all at once, which suits large datasets.
def yield_tokens(data_iter):
    for text in data_iter:
        yield text.split()

# Example data
data = ["this is a sentence", "this is another sentence"]

# Build the vocabulary
v2 = build_vocab_from_iterator(yield_tokens(data), min_freq=1, specials=['<unk>', '<pad>'], special_first=True)

### Use token lists directly as input
token_lists = [
    ["this", "is", "a", "sentence"],
    ["this", "is", "another", "sentence"]
]
# Build the vocabulary
v3 = build_vocab_from_iterator(token_lists)
```

2. Querying the vocabulary

```python
# 1. Look up the index of a token
v1['a'], v1['b']

# 2. Set the index returned when a queried token is not in the vocabulary
unk_token = '<unk>'
v2.set_default_index(v2[unk_token])
v2['sss']

# 3. Look up the tokens that correspond to indices
v2.get_itos()[:3]

# v2['out of vocab'] is v2[unk_token]
v2.get_stoi()
# {'a': 5,
#  'this': 4,
#  'another': 6,
#  'sentence': 3,
#  'is': 2,
#  '<pad>': 1,
#  '<unk>': 0}
```
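To make the roles of `min_freq`, `specials`, and the default index concrete, here is a minimal pure-Python sketch of the same idea. It is not torchtext's implementation: `build_simple_vocab` and `lookup` are hypothetical helper names, and breaking frequency ties alphabetically is an assumption chosen so that the result matches the `get_stoi()` output shown above.

```python
from collections import Counter

def build_simple_vocab(token_lists, min_freq=1, specials=()):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (descending); ties are broken alphabetically (an assumption).
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [
        tok for tok, freq in ordered if freq >= min_freq and tok not in specials
    ]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def lookup(stoi, token, default_index=None):
    """Return the index of a token, falling back to default_index if it is unknown."""
    if token in stoi:
        return stoi[token]
    if default_index is None:
        raise KeyError(token)
    return default_index

token_lists = [
    ["this", "is", "a", "sentence"],
    ["this", "is", "another", "sentence"],
]
itos, stoi = build_simple_vocab(token_lists, specials=["<unk>", "<pad>"])
unk_index = stoi["<unk>"]
```

With this sketch, `lookup(stoi, "sss", unk_index)` falls back to the `<unk>` index, which mirrors what `set_default_index` does for a real `Vocab`.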

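PyTorch: Using Dataset and DataLoader

In PyTorch, Dataset, DataLoader, and Sampler are the core components for loading and processing data. They work together to make data loading and batching more efficient and flexible.

Dataset is an abstract class that represents a dataset.
DataLoader is an iterable that splits a dataset into mini-batches.
Sampler lets you define more complex custom sampling strategies.

1. Dataset

Wrap the training data (such as features and labels) in an iterable PyTorch Dataset class. There are two ways to do this. ...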
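A minimal sketch of how the three components fit together, assuming a toy in-memory dataset. `ToyDataset` is a hypothetical name, and subclassing `Dataset` is only one common way to wrap the data.

```python
import torch
from torch.utils.data import Dataset, DataLoader, SequentialSampler

class ToyDataset(Dataset):
    """Map-style dataset wrapping a feature tensor and a label tensor."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

features = torch.arange(20, dtype=torch.float32).reshape(10, 2)  # 10 samples, 2 features
labels = torch.arange(10)

dataset = ToyDataset(features, labels)
# The sampler decides the order of indices; the DataLoader groups them into batches.
loader = DataLoader(dataset, batch_size=4, sampler=SequentialSampler(dataset))

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
```

Ten samples with `batch_size=4` give three batches; the last one holds the remaining two samples unless `drop_last=True` is passed.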

PyTorch: Common neural network layers in torch.nn

```python
import torch
import torch.nn as nn
```

1. Linear layer

```python
input = torch.randn(3, 4)  # 3 samples, 4 features each
linear = nn.Linear(in_features=4, out_features=2)

linear(input).shape
# torch.Size([3, 2])
```

2. Regularization layers

2.1 dropout

```python
dropout = nn.Dropout(p=0.5)

input = torch.randn(3, 4)
dropout(input)
# tensor([[ 0.0000, -0.0000, -0.5670, -0.0000],
#         [ 4.7224, -4.0010,  0.0000, -0.0000],
#         [-0.9960,  0.0000,  0.6658, -0.0000]])
```

2.2 Batch normalization

Normalizes each feature over its distribution across samples.

```python
batchnorm = nn.BatchNorm1d(num_features=4)
batchnorm(input).shape  # (batch_size, emb_len)
# torch.Size([3, 4])

input2 = torch.randn(2, 4, 3)  # (batch_size, emb_len, seq_len)
batchnorm(input2).shape
# torch.Size([2, 4, 3])
```

2.3 Layer normalization

Normalizes the distribution of all features within each sample.

```python
layer_norm = nn.LayerNorm(normalized_shape=4)

layer_norm(input).shape
# torch.Size([3, 4])

input2 = torch.randn(2, 3, 4)  # (batch_size, seq_len, emb_len)
layer_norm(input2).shape
# torch.Size([2, 3, 4])
```

3. Activation functions

My brief D2L study notes record the functions torch provides for some classic activation functions. They can usually be called directly.

```python
torch.relu(torch.tensor(0.5))     # tensor(0.5000)
torch.sigmoid(torch.tensor(0.5))  # tensor(0.6225)
```

The activation functions implemented in torch.nn are mostly module classes.

```python
relu = nn.ReLU()
relu(torch.tensor(0.5))

sigmoid = nn.Sigmoid()
softmax = nn.Softmax(dim=1)  # apply softmax along dim 1 so that the values along it sum to 1
```

Common ReLU variants

```python
# Negative inputs are not set to 0 but multiplied by a small coefficient
self_activation = nn.LeakyReLU(negative_slope=0.01)

# Negative inputs are not set to 0 but multiplied by a learnable parameter
self_activation = nn.PReLU(num_parameters=1)

# Weights the input by the Gaussian CDF; performs well in neural networks, especially Transformer models
self_activation = nn.GELU()
```

4. Embedding layer

Returns a fixed-size vector for each discrete integer index; commonly used for word embeddings in natural language processing. ...
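A minimal sketch of an embedding lookup, assuming a vocabulary of 10 indices and 4-dimensional vectors (both sizes are illustrative choices, not values from the post):

```python
import torch
import torch.nn as nn

# 10 possible indices (e.g. a vocabulary of 10 tokens), each mapped to a 4-dim vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

token_ids = torch.tensor([[1, 2, 3], [4, 5, 1]])  # (batch_size, seq_len)
vectors = embedding(token_ids)                     # (batch_size, seq_len, emb_len)

vectors.shape
# torch.Size([2, 3, 4])

# The same index always maps to the same row of the weight matrix
same_vector = torch.equal(vectors[0, 0], vectors[1, 2])
```

The layer is essentially a learnable lookup table: `embedding.weight` has shape `(10, 4)`, and index `i` selects row `i`.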

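The relationship between GPU, CUDA, and PyTorch

https://pytorch.org/get-started/locally/

1. Concepts

A GPU (graphics processing unit) is hardware that performs parallel computation. GPUs come in different models, such as the GeForce RTX 3080 and the Tesla V100: https://www.topcpu.net/gpu-r/fp32-float

CUDA is the parallel computing software platform provided by NVIDIA; it lets developers harness the computing power of GPUs ...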
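A short sketch of how these three layers show up from inside PyTorch. It only assumes that PyTorch is installed, and it falls back to the CPU when no usable GPU is found:

```python
import torch

# CUDA version this PyTorch build was compiled against (None for CPU-only builds)
cuda_build_version = torch.version.cuda

# True only if a CUDA-capable GPU and a compatible driver are visible to PyTorch
has_gpu = torch.cuda.is_available()
device = torch.device("cuda" if has_gpu else "cpu")

if has_gpu:
    print(torch.cuda.get_device_name(0))  # model name of the first GPU

# Tensors are created on whichever device was selected
x = torch.ones(2, 3, device=device)
```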

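Common training Config settings for deep learning

1. Optimizer

Adam: an optimizer with adaptive learning rates.

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

AdamW: a variant of Adam that decouples weight decay from the gradient update to improve regularization; it performs well on Transformer-style models.

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```

LAMB is an optimizer designed for large-batch training and suits large models (e.g. BERT). ...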
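For context, a minimal sketch of where the optimizer sits in a single training step. The tiny linear model, the random data, and the MSE loss are illustrative assumptions only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.MSELoss()

x = torch.randn(8, 4)   # 8 samples, 4 features
y = torch.randn(8, 2)   # 8 targets, 2 outputs

weight_before = model.weight.detach().clone()

optimizer.zero_grad()          # clear gradients left over from the previous step
loss = loss_fn(model(x), y)    # forward pass
loss.backward()                # compute gradients
optimizer.step()               # update the parameters

weight_changed = not torch.equal(weight_before, model.weight.detach())
```

Swapping `AdamW` for `Adam` changes only the line that constructs the optimizer; the rest of the step stays the same.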

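Flash Attention: optimizing attention

Attention computation

The three ingredients of attention are Query, Key, and Value. In self-attention, all three are derived from the same input sequence. Consider this example: a sequence has 2 tokens and each token has 3 features, so the input has shape (2, 3).

Each Query token computes a "similarity" with the Key of every token (including itself); after a softmax (each row sums to 1) this gives a 2 × 2 weight matrix.
The weight matrix is then multiplied with the Value matrix, (2, 2) × (2, 3), giving a new (2, 3) output.
Intuitively, output feature 1 of token A is the weighted sum of feature 1 of the input tokens A and B.

Multi-head attention: in essence, the feature dimension is split into several parts, each called a "head". Every head performs attention independently, and the outputs of all heads are then concatenated, in the hope of learning different relationships and patterns. ...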
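The shapes above can be traced in a few lines. This is a minimal sketch of plain self-attention, not of Flash Attention itself. It assumes the input is used directly as Query, Key, and Value (no learned projections) and applies the usual 1/sqrt(d) scaling, which the text does not mention.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 3)          # 2 tokens, 3 features each

# Simplification: use the input directly as Query, Key and Value
q, k, v = x, x, x

scores = q @ k.T / (q.shape[-1] ** 0.5)   # (2, 2): similarity of each query with each key
weights = torch.softmax(scores, dim=-1)   # each row sums to 1
output = weights @ v                      # (2, 2) x (2, 3) -> (2, 3)

# Multi-head view: split the feature dimension into heads and attend within each head
x4 = torch.randn(2, 4)                                  # 2 tokens, 4 features
heads = x4.reshape(2, 2, 2).transpose(0, 1)             # (num_heads=2, tokens=2, head_dim=2)
head_scores = heads @ heads.transpose(1, 2) / 2 ** 0.5  # (num_heads, tokens, tokens)
head_weights = torch.softmax(head_scores, dim=-1)
# Merge the heads back into one feature dimension: (tokens, num_heads * head_dim)
multi_head_output = (head_weights @ heads).transpose(0, 1).reshape(2, 4)
```

Row 0 of `output` equals `weights[0, 0] * x[0] + weights[0, 1] * x[1]`, which is the weighted sum described above.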

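Logging metrics in machine learning

When training deep learning or machine learning models, it is useful to display or record the losses and accuracy metrics for every batch and epoch. Beyond plain print statements, several libraries now offer higher-level APIs for this. Below, three approaches are collected from studying the scGPT project. ...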

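Dimension operations on torch tensors

In the forward pass of a deep learning model, the most important thing is to understand the shape of the input and output of every computation step. This in turn requires familiarity with some common dimension operations; the notes below summarize those met while studying projects: ...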
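As an illustration of what tracking shapes looks like in practice, here is a small sketch of a few widely used dimension operations. The selection is an assumption about what "common" covers, not a list taken from the post.

```python
import torch

x = torch.randn(2, 3, 4)  # (batch_size, seq_len, emb_len)

shapes = {
    "unsqueeze": tuple(x.unsqueeze(1).shape),           # insert a size-1 dim
    "squeeze": tuple(x.unsqueeze(1).squeeze(1).shape),  # remove a size-1 dim
    "permute": tuple(x.permute(0, 2, 1).shape),         # reorder all dims
    "transpose": tuple(x.transpose(1, 2).shape),        # swap two dims
    "merge": tuple(x.reshape(2, 12).shape),             # merge the last two dims
    "split": tuple(x.reshape(2, 3, 2, 2).shape),        # split the last dim in two
    "cat": tuple(torch.cat([x, x], dim=0).shape),       # join along an existing dim
    "stack": tuple(torch.stack([x, x], dim=0).shape),   # join along a new dim
}

for name, shape in shapes.items():
    print(f"{name}: {shape}")
```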

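Single-node multi-GPU distributed training with torchrun

1. Background

DDP distributed training vs. DP parallel training

When I first looked into multi-GPU training, I learned about one form of data parallelism, DataParallel (DP). Its core idea is to copy the model to every GPU and give each GPU a small share of the data so that the computation runs in parallel. Finally, the main GPU gathers the gradients from all GPUs and updates the model parameters. It is also very easy to use: just call nn.DataParallel(). ...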
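A minimal sketch of the DP approach described above, assuming a toy linear model. It wraps the model only when more than one GPU is visible, so the same code also runs on a single GPU or on the CPU:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # with DP, the model must live on the main GPU first

# Wrap only when several GPUs are visible; otherwise the plain model is used
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model and splits each batch across GPUs

x = torch.randn(8, 4, device=device)
out = model(x)  # with DP, the per-GPU outputs are gathered back on the main GPU
```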

Inspecting, managing, and saving the modules and parameters of a torch model

1. Example model

A two-layer MLP neural network

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Example usage
input_size = 256   # Number of input features
hidden_size = 128  # Number of neurons in the hidden layer
output_size = 2    # Number of output classes

model = SimpleMLP(input_size, hidden_size, output_size)
```

2. Querying the component modules

Recursively traverses all layers of the model, including submodules nested inside other layers.

```python
# Instances of torch.nn.Module
model.modules
for module in model.modules():
    print(f"Module: {module}")

for name, module in model.named_modules():
    print(f"{name}: {module}")
# (the first entry is the root module itself, with an empty name)
# fc1: Linear(in_features=256, out_features=128, bias=True)
# fc2: Linear(in_features=128, out_features=2, bias=True)
```

3. Querying the model parameters

```python
# Instances of torch.nn.Parameter
for param in model.parameters():
    print(f"{param.shape}")

for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
# fc1.weight: torch.Size([128, 256])
# fc1.bias: torch.Size([128])
# fc2.weight: torch.Size([2, 128])
# fc2.bias: torch.Size([2])

for name, param in model.fc1.named_parameters():
    print(f"{name}: {param.shape}")
# weight: torch.Size([128, 256])
# bias: torch.Size([128])

# Total number of parameters in the model
total_parameters = sum(p.numel() for p in model.parameters())

# Inspect the parameters of one specific layer
param = next(iter(model.fc1.parameters()))
type(param)
# torch.nn.parameter.Parameter

param.shape
# torch.Size([128, 256])

param.numel()
# 32768

param.requires_grad
# True

# Freeze the parameter, i.e. it is no longer updated during training
param.requires_grad = False
```

4. Saving and loading the model (parameters)

```python
type(model.state_dict())

# save
torch.save(model.state_dict(), 'model.pth')  # the .pt suffix works too
# load
model.load_state_dict(torch.load('model.pth'))

# model_pt: path to a saved checkpoint; map_location picks the target device
pretrained_params = torch.load(model_pt, map_location='cuda:2')
```

An example loading function from practice

```python
def load_pretrained(
    model: torch.nn.Module,
    pretrained_params: dict = None,
    strict: bool = False,
    prefix: list = None,
    use_flash_attn = True,
    verbose: bool = True,
) -> torch.nn.Module:
    # Rename the keys of specific parameters
    if not use_flash_attn:
        pretrained_params = {
            k.replace("Wqkv.", "in_proj_"): v for k, v in pretrained_params.items()
        }

    # Load only the parameters whose keys start with one of the given prefixes
    if prefix is not None and len(prefix) > 0:
        if isinstance(prefix, str):
            prefix = [prefix]
        pretrained_params = {
            k: v
            for k, v in pretrained_params.items()
            if any(k.startswith(p) for p in prefix)
        }

    model_dict = model.state_dict()
    # Strict loading: every pretrained key and shape must match the model,
    # otherwise load_state_dict raises an error
    if strict:
        if verbose:
            for k, v in pretrained_params.items():
                print(f"Loading parameter {k} with shape {v.shape}")
        model_dict.update(pretrained_params)
        model.load_state_dict(model_dict)
    # Partial loading: load only the parameters that match (both key name and value shape)
    else:
        if verbose:
            for k, v in pretrained_params.items():
                if k in model_dict and v.shape == model_dict[k].shape:
                    print(f"Loading parameter {k} with shape {v.shape}")
        pretrained_params = {
            k: v
            for k, v in pretrained_params.items()
            if k in model_dict and v.shape == model_dict[k].shape
        }
        model_dict.update(pretrained_params)
        model.load_state_dict(model_dict)

    return model
```
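The partial-loading branch can be tried on a toy case. In this sketch, `make_mlp` is a hypothetical helper and the layer sizes are arbitrary. The "pretrained" network and the target network share the first layer but differ in the output layer, so only the first layer's parameters are transferred.

```python
import torch
import torch.nn as nn

def make_mlp(output_size):
    # Hypothetical helper: a two-layer MLP with a configurable output layer
    return nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, output_size))

torch.manual_seed(0)
pretrained = make_mlp(output_size=2)
model = make_mlp(output_size=3)  # same first layer, different output layer

pretrained_params = pretrained.state_dict()
model_dict = model.state_dict()

# Keep only parameters whose key name and shape both match the target model
matched = {
    k: v
    for k, v in pretrained_params.items()
    if k in model_dict and v.shape == model_dict[k].shape
}
model_dict.update(matched)
model.load_state_dict(model_dict)

loaded_keys = sorted(matched)
# ['0.bias', '0.weight']
```

The output layer keeps its own randomly initialized weights because its shape, (3, 4), does not match the pretrained (2, 4).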