import torch
import torch.nn as nn

1. Linear layer

input = torch.randn(3, 4)  # 3 samples, 4 features each

linear = nn.Linear(in_features=4, out_features=2)
linear(input).shape
# torch.Size([3, 2])
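
As a quick check of what the layer computes, here is a minimal sketch that reproduces the output by hand as `x @ W.T + b`. The variable names (`x`, `manual`, `same`) are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                       # fixed seed so the sketch is repeatable
x = torch.randn(3, 4)                      # 3 samples, 4 features each
linear = nn.Linear(in_features=4, out_features=2)

# weight has shape (out_features, in_features); bias has shape (out_features,)
manual = x @ linear.weight.T + linear.bias
same = torch.allclose(linear(x), manual, atol=1e-6)
print(linear.weight.shape, linear.bias.shape, same)
```

Because the weight is stored as (out_features, in_features), it has to be transposed before the matrix product.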

2. Regularization layers

2.1 dropout

dropout = nn.Dropout(p=0.5)

input = torch.randn(3, 4)
dropout(input)
# tensor([[ 0.0000, -0.0000, -0.5670, -0.0000],
#         [ 4.7224, -4.0010,  0.0000, -0.0000],
#         [-0.9960,  0.0000,  0.6658, -0.0000]])
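
In training mode, nn.Dropout zeroes each element with probability p and scales the surviving elements by 1 / (1 - p); in evaluation mode it does nothing. A small sketch on an all-ones input (chosen here only to make the scaling visible) shows both behaviors:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(3, 4)

dropout.train()                 # training mode: elements are randomly zeroed
y_train = dropout(x)
# every entry is either 0 (dropped) or 1 / (1 - 0.5) = 2 (kept and rescaled)
only_zero_or_two = bool(((y_train == 0) | (y_train == 2)).all())

dropout.eval()                  # evaluation mode: dropout is the identity
y_eval = dropout(x)
unchanged = torch.equal(y_eval, x)
print(only_zero_or_two, unchanged)
```

This is why calling `model.eval()` before inference matters for any model that contains dropout.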

2.2 Batch normalization

Normalizes each feature over its distribution across the samples in the batch.

batchnorm = nn.BatchNorm1d(num_features=4)
batchnorm(input).shape       # (batch_size, emb_len)
# torch.Size([3, 4])

input2 = torch.randn(2, 4, 3) # (batch_size, emb_len, seq_len)
batchnorm(input2).shape
# torch.Size([2, 4, 3])
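
To make "per feature, across samples" concrete, the following sketch recomputes the training-mode output by hand. It assumes a freshly created layer, whose affine weight is 1 and bias is 0, so only the normalization itself is visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(3, 4)                 # (batch_size, num_features)
batchnorm = nn.BatchNorm1d(num_features=4)
batchnorm.train()                     # use the statistics of the current batch

mean = x.mean(dim=0, keepdim=True)                 # per-feature mean over the batch
var = x.var(dim=0, unbiased=False, keepdim=True)   # biased variance is used for normalization
manual = (x - mean) / torch.sqrt(var + batchnorm.eps)
same_bn = torch.allclose(batchnorm(x), manual, atol=1e-5)

# for a 3-D input (batch_size, num_features, seq_len),
# the statistics are taken over dims 0 and 2 for each feature
x3 = torch.randn(2, 4, 3)
out3 = nn.BatchNorm1d(num_features=4)(x3)
per_feature_mean = out3.mean(dim=(0, 2))
print(same_bn, per_feature_mean)
```

In evaluation mode the layer uses its running statistics instead of the batch statistics, so the hand computation above only matches in training mode.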

2.3 Layer normalization

Normalizes across all features of each individual sample.

layer_norm = nn.LayerNorm(normalized_shape=4)

layer_norm(input).shape
# torch.Size([3, 4])

input2 = torch.randn(2, 3, 4) # (batch_size, seq_len, emb_len)
layer_norm(input2).shape
# torch.Size([2, 3, 4])
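
For comparison with batch normalization, this sketch recomputes layer normalization by hand over the last dimension. As before, it assumes a fresh layer with affine weight 1 and bias 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 4)              # (batch_size, seq_len, emb_len)
layer_norm = nn.LayerNorm(normalized_shape=4)

# statistics are computed over the last dim, separately for every (sample, position)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)
same_ln = torch.allclose(layer_norm(x), manual, atol=1e-5)
print(same_ln)
```

Since nothing is shared across samples, the result does not depend on the batch size or on train/eval mode.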

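3. Activation functions

The brief D2L study notes record the functional forms of some classic torch activation functions. These can usually be called directly on a tensor.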

torch.relu(torch.tensor(0.5))     # tensor(0.5000)
torch.sigmoid(torch.tensor(0.5))  # tensor(0.6225)

Activation functions implemented in torch.nn are mostly module classes.

relu = nn.ReLU()
relu(torch.tensor(0.5))

sigmoid = nn.Sigmoid()
softmax = nn.Softmax(dim=1)  # apply softmax along dim 1 so that the values along it sum to 1
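
A short sketch of how the module form relates to the function form, and of what `dim=1` means for softmax. The input values are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0],
                  [0.5, -0.5, 0.0]])

relu = nn.ReLU()
same_relu = torch.equal(relu(x), torch.relu(x))   # module and function give the same result

softmax = nn.Softmax(dim=1)
probs = softmax(x)
row_sums = probs.sum(dim=1)    # each row (dim 1) sums to 1
print(same_relu, row_sums)
```

The module form is convenient inside `nn.Sequential`, while the function form is handy inside a custom `forward`.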

Common ReLU variants

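# Negative inputs are not set to 0; they are multiplied by a small fixed slope
self_activation = nn.LeakyReLU(negative_slope=0.01)

# Negative inputs are not set to 0; they are multiplied by a learnable parameter
self_activation = nn.PReLU(num_parameters=1)

# Weights the input by the Gaussian CDF; performs well in neural networks, especially in Transformer models
self_activation = nn.GELU()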
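
The three variants differ only in how they treat negative inputs. The sketch below evaluates each on the same small tensor; the GELU check assumes the default exact form, x * Phi(x), where Phi is the standard normal CDF.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 3.0])

leaky = nn.LeakyReLU(negative_slope=0.01)
leaky_out = leaky(x)                 # -2.0 is scaled by 0.01 and becomes -0.02

prelu = nn.PReLU(num_parameters=1)   # the slope is a learnable parameter, initialized to 0.25
prelu_out = prelu(x).detach()        # -2.0 is scaled by 0.25 and becomes -0.5

gelu = nn.GELU()
normal = torch.distributions.Normal(0.0, 1.0)
manual_gelu = x * normal.cdf(x)      # x * Phi(x)
same_gelu = torch.allclose(gelu(x), manual_gelu, atol=1e-6)
print(leaky_out, prelu_out, same_gelu)
```

Positive inputs pass through LeakyReLU and PReLU unchanged, so only the first element differs between them.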

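4. Embedding layer

Returns a fixed-size vector for each discrete integer index; commonly used for word embeddings in natural language processing.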

embedding = nn.Embedding(num_embeddings=1000, embedding_dim=64)
# num_embeddings is the vocabulary size, i.e. how many distinct tokens there are
# embedding_dim is the length of each embedding vector

input_indices = torch.tensor([1, 2, 3, 4])
embedding(input_indices).shape
# torch.Size([4, 64])
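
An embedding layer is essentially a lookup table: the output for index i is row i of the weight matrix. A minimal sketch, with index values chosen arbitrarily:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=64)

idx = torch.tensor([1, 2, 3, 4])
# the forward pass returns the same rows as indexing the weight matrix directly
same_lookup = torch.equal(embedding(idx), embedding.weight[idx])

# a batch of index sequences (batch_size, seq_len) maps to (batch_size, seq_len, embedding_dim)
batch_idx = torch.tensor([[1, 2, 3], [4, 5, 6]])
batch_shape = tuple(embedding(batch_idx).shape)
print(same_lookup, batch_shape)
```

Indices must be integers in the range [0, num_embeddings); anything outside that range raises an error.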

5. Transformer

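Define a Transformer block, consisting of a standard attention layer and a feed-forward network layer.
- Another note covers the attention layer implemented in torch (nn.MultiheadAttention).
- Note that in torch v2, nn.MultiheadAttention uses a flash-accelerated implementation by default. The side effect is that it cannot output the attn weight information. https://github.com/pytorch/pytorch/issues/99304
- One strategy I am considering: for the first n-1 epochs of pretraining, use the default computation to keep training fast; then, in the final epoch, switch to a custom nn.TransformerEncoderLayer subclass that supports outputting the attn weights.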
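
The sketch below does not implement the subclass strategy. It only illustrates the underlying trade-off on a standalone nn.MultiheadAttention, using its `need_weights` flag: weights are returned when requested, and None is returned otherwise. The sizes are small arbitrary values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 5, 16)   # (batch, seq, feature)

# need_weights=True: attention weights are returned, averaged over heads by default
out, weights = mha(x, x, x, need_weights=True)

# need_weights=False: no weights are returned, which allows the optimized attention path
out_fast, no_weights = mha(x, x, x, need_weights=False)

weights_shape = tuple(weights.shape)   # (batch, target_len, source_len)
print(weights_shape, no_weights)
```

A custom encoder layer would presumably need to call its self-attention with `need_weights=True` and keep the second return value; how best to do that depends on the torch version in use.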

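d_model = 512  # total input feature dimension
nhead = 8      # number of attention heads

# d_model: feature dimension of the input and output
# nhead: number of attention heads
# dim_feedforward: hidden size of the FFN (MLP)
# dropout: dropout rate, default 0.1
# batch_first: defaults to False, i.e. input is (seq, batch, feature); set to True for (batch, seq, feature)
# norm_first: defaults to False, i.e. norm is applied after Attention and FFN (post-norm); set to True to apply it before them (pre-norm)
# activation: activation function used in the feed-forward network; defaults to relu, other choices such as gelu are available

# Create a Transformer encoder layer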
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, 
                                           nhead=nhead, 
                                           dim_feedforward=d_model*2, 
                                           dropout=0.1, 
                                           batch_first=True)
 
input = torch.randn(2, 10, 512) # (batch size, sequence length, feature dimension)
encoder_layer(input).shape
# torch.Size([2, 10, 512])

Stack multiple Transformer blocks to form an encoder

num_layers = 12
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
transformer_encoder(input).shape
# torch.Size([2, 10, 512])
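
nn.TransformerEncoder clones the layer it is given, so each stacked block has its own weights rather than sharing one set. A small sketch with deliberately tiny, arbitrary dimensions so it runs quickly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder_layer = nn.TransformerEncoderLayer(d_model=32,
                                           nhead=4,
                                           dim_feedforward=64,
                                           dropout=0.1,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

num_stacked = len(encoder.layers)    # the stacked blocks live in a ModuleList
# each block is an independent copy: the parameter tensors are different objects
distinct = encoder.layers[0].linear1.weight is not encoder.layers[1].linear1.weight

x = torch.randn(2, 10, 32)           # (batch, seq, feature)
out_shape = tuple(encoder(x).shape)
print(num_stacked, distinct, out_shape)
```

The copies start from identical values and diverge once training updates them.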

Count the number of parameters in a model:
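model = transformer_encoder  # any nn.Module works; here, the encoder built above
num_parameters = sum(p.numel() for p in model.parameters())
print(f"Number of model parameters: {num_parameters}")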
https://blog.csdn.net/weixin_43135178/article/details/140313635
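
A worked example on a layer small enough to count by hand. The helper name `count_parameters` and its `trainable_only` flag are hypothetical, introduced only for this sketch.

```python
import torch.nn as nn

def count_parameters(model: nn.Module, trainable_only: bool = False) -> int:
    """Sum the element counts of all parameters (hypothetical helper)."""
    return sum(p.numel() for p in model.parameters()
               if p.requires_grad or not trainable_only)

linear = nn.Linear(in_features=4, out_features=2)
total = count_parameters(linear)           # 4 * 2 weights + 2 biases = 10

linear.bias.requires_grad_(False)          # freeze the bias
trainable = count_parameters(linear, trainable_only=True)   # only the 8 weights remain
print(total, trainable)
```

Filtering on `requires_grad` is useful when part of a model is frozen, for example during fine-tuning.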
