1. Collator Data Processing

  • Purpose: normalize the raw dataset samples into batched inputs for the subsequent forward pass, as in the example below.
# Start from a dataset (sequences may have different lengths)
Dataset({
    features: ['input_ids'],
    num_rows: 5
})

# End with the encoded batch input (BatchEncoding format)
{'input_ids': tensor([[350, 241, 345, 705, 695,   1, 427, 645,  99, 943,   0,   0,   0,   0],
        [196, 464, 546, 626, 413,   1, 973,  98, 824,   1, 410,   0,   0,   0],
        [475, 665,   1, 164, 306, 788,  53, 562, 232, 216, 252, 990,   0,   0],
        [  1, 966, 734, 897, 171, 357, 217, 850, 529, 895, 728, 234, 799,   0],
        [713,  76,   1, 428, 913, 890, 143, 992, 832, 963, 555,  18, 354, 455]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 
 'labels': tensor([[-100, -100, -100, -100, -100,  716, -100, -100, -100, -100, -100, -100, -100, -100],
        [-100, -100, -100, -100, -100,  665, -100, -100, -100,  686, -100, -100, -100, -100],
        [-100, -100,   56, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
        [ 619,  966, -100, -100, -100,  357, -100, -100, -100, -100, -100, -100, -100, -100],
        [-100, -100,  218, -100, -100, -100, -100, -100, -100,  963, -100, -100, -100, -100]])}

Common key fields include:

  • input_ids: the encoded token ids of each sequence
  • attention_mask: the attention mask; padding tokens, for example, are marked with 0 so that they do not take part in attention
  • labels: labels for supervised tasks; -100 marks masked/padding tokens that are ignored when computing the loss (see the small example below)
  • token_type_ids: used to distinguish sentence pairs
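
The -100 convention works because PyTorch's cross-entropy loss skips targets equal to its ignore_index, which defaults to -100. A minimal sketch with toy values:

import torch
import torch.nn.functional as F

# toy logits for 4 token positions over a vocabulary of 5
logits = torch.randn(4, 5)
labels = torch.tensor([2, -100, -100, 4])   # only positions 0 and 3 contribute

# cross_entropy ignores targets equal to ignore_index (default -100)
loss = F.cross_entropy(logits, labels, ignore_index=-100)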

tokenizer

For a batch of sequence data, the first and most important preprocessing step is to pad sequences of different lengths to a common length. Depending on the task, truncation may also be required.

Hugging Face Transformers provides standard API utilities for this.

Three parameters come up repeatedly in these API calls: padding, max_length, and truncation (a combined example follows the two lists below).

The padding parameter:

  • True or the string "longest": pad to the length of the longest sequence in the batch (a single sequence is not padded)
  • The string "max_length": pad to a preset maximum length (the max_length parameter must then be set)
  • False: no padding

The "longest" padding strategy is essentially the "max_length" strategy with max_length set to the length of the longest sequence in the batch.

The truncation parameter:

  • True or the string "longest_first": truncate to the preset maximum length (the max_length parameter).
  • False or the string "do_not_truncate": no truncation
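
How the two parameters combine, shown with an ordinary text tokenizer (bert-base-uncased is used here purely as an illustrative checkpoint, not something from the single-cell workflow):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["a short sentence", "a noticeably longer sentence that will end up being truncated"]

# pad to the longest sequence in the batch, truncate anything beyond 8 tokens
batch = tokenizer(
    texts,
    padding=True,        # same as padding="longest"
    truncation=True,     # same as truncation="longest_first"
    max_length=8,
    return_tensors="pt",
)
batch["input_ids"].shape
# torch.Size([2, 8])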

collator

  • Collators are the final interface; they usually take a tokenizer as one of their arguments and carry out the full preprocessing pipeline. Both the collator and the tokenizer can be customized for different task and input types.

  • Although Hugging Face provides mature tokenizers (e.g. PreTrainedTokenizer), after some experimentation they did not fit single-cell omics models well and needed customization. Earlier attempts at a custom modification ran into various problems.

  • The code here follows Geneformer. In its first step, Geneformer already sorts each cell's non-zero gene expression values from high to low, keeps the top-ranked genes (e.g. 2048), converts the gene names to token ids, and saves the result as a dataset. So in principle only padding is needed (no truncation), plus task-specific modifications. The relevant code has been organized into gene_pad_tokenizer.py and uploaded for later use. A usage demo follows.

# Load the functions and example data

import pickle

import numpy as np
import torch
from datasets import Dataset
import geneformer
from gene_pad_tokenizer import GeneformerPreCollator as GeneTokenizer

# Load the vocabulary (a dict mapping tokens to ids)
token_dictionary = geneformer.TOKEN_DICTIONARY_FILE
with open(token_dictionary, "rb") as f:
    token_dictionary = pickle.load(f)
token_dictionary["<mask>"]
# 1
token_dictionary["<cls>"]
# 2
token_dictionary["<pad>"]
# 0

# Example data: 5 cells with 10-14 randomly chosen gene token ids each
demo_dat = {"input_ids": []}
for i in range(10, 15):
    demo_dat["input_ids"].append(np.random.permutation(range(4, 1000))[:i].tolist())

demo_dataset = Dataset.from_dict(demo_dat)
demo_dataset = demo_dataset.with_format("torch")
# Dataset({
#     features: ['input_ids'],
#     num_rows: 5
# })

Method 1: step by step

  • tokenizer
gene_tokenizer = GeneTokenizer(token_dictionary=token_dictionary, 
                               padding_side="right", 
                               model_input_names=["input_ids"])

gene_tokenizer.convert_ids_to_tokens(0)
# '<pad>'
gene_tokenizer.convert_tokens_to_ids("<cls>")
# 2

# Simulate the batch format yielded by a DataLoader: a list of dicts (samples), one dict per sample
encoded_inputs = [{"input_ids": v} for v in demo_dataset["input_ids"]]
# [{'input_ids': tensor([350, 241, 345, 705, 695, 716, 427, 645,  99, 943])},
#  {'input_ids': tensor([196, 464, 546, 626, 413, 665, 973,  98, 824, 686, 410])},
#  {'input_ids': tensor([475, 665,  56, 164, 306, 788,  53, 562, 232, 216, 252, 990])},
#  {'input_ids': tensor([619, 966, 734, 897, 171, 357, 217, 850, 529, 895, 728, 234, 799])},
#  {'input_ids': tensor([713,  76, 218, 428, 913, 890, 143, 992, 832, 963, 555,  18, 354, 455])}]

padded_inputs = gene_tokenizer.pad(
    encoded_inputs,
    padding=True,
    max_length=None,
    return_attention_mask=True,
    return_tensors="pt"
)
type(padded_inputs)
# the output is in BatchEncoding format
# transformers.tokenization_utils_base.BatchEncoding

# {'input_ids': tensor([[350, 241, 345, 705, 695, 716, 427, 645,  99, 943,   0,   0,   0,   0],
#         [196, 464, 546, 626, 413, 665, 973,  98, 824, 686, 410,   0,   0,   0],
#         [475, 665,  56, 164, 306, 788,  53, 562, 232, 216, 252, 990,   0,   0],
#         [619, 966, 734, 897, 171, 357, 217, 850, 529, 895, 728, 234, 799,   0],
#         [713,  76, 218, 428, 913, 890, 143, 992, 832, 963, 555,  18, 354, 455]]), 
#  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
  • collator: here we directly use the masking collator that Hugging Face provides for the MLM task (a custom collator also works, e.g. Geneformer's collator_for_classification.py)
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=gene_tokenizer, mlm=True, mlm_probability=0.15
)

# First convert to a list of dicts
padded_inputs_reshape = [{"input_ids": padded_inputs["input_ids"][i], 
                          "attention_mask": padded_inputs["attention_mask"][i]} 
                         for i in range(len(padded_inputs["input_ids"]))]
# [{'input_ids': tensor([350, 241, 345, 705, 695, 716, 427, 645,  99, 943,   0,   0,   0,   0]),
#   'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0])},
#  {'input_ids': tensor([196, 464, 546, 626, 413, 665, 973,  98, 824, 686, 410,   0,   0,   0]),
#   'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0])},
#  {'input_ids': tensor([475, 665,  56, 164, 306, 788,  53, 562, 232, 216, 252, 990,   0,   0]),
#   'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])},
#  {'input_ids': tensor([619, 966, 734, 897, 171, 357, 217, 850, 529, 895, 728, 234, 799,   0]),
#   'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])},
#  {'input_ids': tensor([713,  76, 218, 428, 913, 890, 143, 992, 832, 963, 555,  18, 354, 455]),
#   'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}]

# Apply the collator
batch_input = data_collator(padded_inputs_reshape)
type(batch_input)
# transformers.tokenization_utils_base.BatchEncoding
# {'input_ids': tensor([[350, 241, 345, 705, 695,   1, 427, 645,  99, 943,   0,   0,   0,   0],
#         [196, 464, 546, 626, 413,   1, 973,  98, 824,   1, 410,   0,   0,   0],
#         [475, 665,   1, 164, 306, 788,  53, 562, 232, 216, 252, 990,   0,   0],
#         [  1, 966, 734, 897, 171, 357, 217, 850, 529, 895, 728, 234, 799,   0],
#         [713,  76,   1, 428, 913, 890, 143, 992, 832, 963, 555,  18, 354, 455]]), 
#  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 
#  'labels': tensor([[-100, -100, -100, -100, -100,  716, -100, -100, -100, -100, -100, -100, -100, -100],
#         [-100, -100, -100, -100, -100,  665, -100, -100, -100,  686, -100, -100, -100, -100],
#         [-100, -100,   56, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
#         [ 619,  966, -100, -100, -100,  357, -100, -100, -100, -100, -100, -100, -100, -100],
#         [-100, -100,  218, -100, -100, -100, -100, -100, -100,  963, -100, -100, -100, -100]])}

Method 2: all in one step

batch_input_v2 = data_collator(encoded_inputs)

Method 3: DataLoader

from torch.utils.data import DataLoader
demo_loader = DataLoader(demo_dataset, 
                         batch_size=2, 
                         collate_fn = data_collator)
batch_input_v3 = next(iter(demo_loader))
# {'input_ids': tensor([[350, 241, 345, 705, 695, 716, 427, 645,  99, 943,   0],
#                       [196, 464, 546,   1, 413, 665, 973,  98, 824, 686, 410]]), 
# 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
#                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 
# 'labels': tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
#                   [-100, -100, -100,  626, -100, -100, -100, -100, -100, -100, -100]])}

In a customized downstream task, the collator and tokenizer above can be modified as needed. The experience so far: if an operation is best applied to the whole batch, modifying the collator is enough; if each sample within the batch needs to be changed, modify the tokenizer's pad method (the core is overriding _pad()). A rough sketch of the batch-level route follows.
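
As a minimal sketch of the batch-level route (the class name MyMaskingCollator and the mask_prob argument are made up for illustration; they are not part of Geneformer or Transformers), a custom collator can wrap the tokenizer's pad step and then edit the padded tensors batch-wise:

import torch

class MyMaskingCollator:
    """Hypothetical collator: pads with the tokenizer, then masks the whole batch at once."""

    def __init__(self, tokenizer, mask_prob=0.15):
        self.tokenizer = tokenizer
        self.mask_prob = mask_prob
        self.mask_id = tokenizer.convert_tokens_to_ids("<mask>")

    def __call__(self, samples):
        # samples: list of dicts, each holding an "input_ids" tensor
        batch = self.tokenizer.pad(samples, padding=True,
                                   return_attention_mask=True, return_tensors="pt")
        labels = batch["input_ids"].clone()
        # choose positions to mask, but never the padding positions
        probs = torch.full(labels.shape, self.mask_prob)
        masked = torch.bernoulli(probs).bool() & batch["attention_mask"].bool()
        batch["input_ids"][masked] = self.mask_id
        labels[~masked] = -100          # unmasked positions are ignored in the loss
        batch["labels"] = labels
        return batch

# usage (same interface as data_collator above):
# loader = DataLoader(demo_dataset, batch_size=2, collate_fn=MyMaskingCollator(gene_tokenizer))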

2. Model

# Note: SDPA (scaled dot-product attention) is used by default for torch>=2.1.1 when an implementation is available.

BertConfig

  • BERT model configuration - BertConfig [defaults in brackets]; a construction sketch follows the list

    • vocab_size: vocabulary size [30522]
    • hidden_size: hidden/embedding dimension [768]
    • num_hidden_layers: number of Transformer encoder layers [12]
    • num_attention_heads: number of attention heads [12]
    • intermediate_size: dimension of the feed-forward (FFN) layer [3072]
    • hidden_act: activation function ["gelu"]
    • hidden_dropout_prob: dropout probability for the hidden layers [0.1]
    • attention_probs_dropout_prob: dropout probability for the attention weights [0.1]
    • max_position_embeddings: maximum sequence length [512]
    • num_labels: number of classes for classification tasks [2]
    • output_hidden_states: whether to output all hidden states [False]
    • output_attentions: whether to output the attention maps [False]
    • **kwargs: custom configuration entries can be added and set
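
A minimal construction sketch (the values are arbitrary and only illustrate the arguments; token_dictionary is the gene vocabulary loaded in Section 1):

from transformers import BertConfig

config = BertConfig(
    vocab_size=len(token_dictionary),
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=512,
    max_position_embeddings=2048,
    pad_token_id=token_dictionary["<pad>"],
)
# custom entries are also possible via **kwargs, e.g.
# config = BertConfig(..., my_custom_flag=True)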

BertModel

  • Base model - BertModel: the encoder part of BERT, without any task head. A forward-pass sketch follows the module printout below.
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
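
A minimal forward-pass sketch, reusing the collated batch_input from Section 1 and the hypothetical config above (the model is randomly initialized; no pretrained weights are loaded):

import torch
from transformers import BertModel

model = BertModel(config)      # encoder only, no task head
model.eval()
with torch.no_grad():
    out = model(
        input_ids=batch_input["input_ids"],
        attention_mask=batch_input["attention_mask"],
    )
out.last_hidden_state.shape    # (batch_size, seq_len, hidden_size)
out.pooler_output.shape        # (batch_size, hidden_size)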

BertForPreTraining

  • Standard pretraining model - BertForPreTraining
    • Includes both the MLM and NSP heads
BertForPreTraining(
  (bert): BertModel(
	......
  )
  (cls): BertPreTrainingHeads(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (transform_act_fn): GELUActivation()
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=30522, bias=True)
    )
    (seq_relationship): Linear(in_features=768, out_features=2, bias=True)
  )
)

BertForMaskedLM

  • MLM pretraining model - BertForMaskedLM; a loss-computation sketch follows the module printout
BertForMaskedLM(
  (bert): BertModel(
	......
  )
  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (transform_act_fn): GELUActivation()
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=30522, bias=True)
    )
  )
)
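
A minimal sketch of feeding the collated batch_input from Section 1 into an MLM model built from the hypothetical config above (label positions equal to -100 are ignored by the model's internal loss):

from transformers import BertForMaskedLM

mlm_model = BertForMaskedLM(config)
outputs = mlm_model(
    input_ids=batch_input["input_ids"],
    attention_mask=batch_input["attention_mask"],
    labels=batch_input["labels"],   # -100 positions are ignored
)
outputs.loss           # scalar MLM loss over the masked positions
outputs.logits.shape   # (batch_size, seq_len, vocab_size)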

BertForNextSentencePrediction

  • NSP pretraining model - BertForNextSentencePrediction
The (predictions) module under (cls) is replaced by (seq_relationship).

BertForSequenceClassification

  • Sentence classification fine-tuning model - BertForSequenceClassification
    • A classifier on top of the cls token (see the sketch after the printout)
BertForSequenceClassification(
  (bert): BertModel(
	......
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True) 
  # out_features can be set via the config, e.g. num_labels=3
)
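
A minimal sketch of changing the number of classes (num_labels=3 is an arbitrary example):

from transformers import BertForSequenceClassification

config.num_labels = 3
cls_model = BertForSequenceClassification(config)
cls_model.classifier
# Linear(in_features=hidden_size, out_features=3, bias=True)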

BertForTokenClassification

  • Token classification fine-tuning model - BertForTokenClassification
    • A classifier applied to every token
BertForTokenClassification(
  (bert): BertModel(
	......
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=10, bias=True)
)

3. Trainer

The Trainer class in Hugging Face Transformers is one of the library's core components for simplifying model training and evaluation.

A simple example is given below; the important arguments include:

  • model: the model to train, e.g. one of the model classes discussed above

  • args: training hyperparameters such as learning rate, batch size, and number of epochs. [Distinct from the model hyperparameters above.]

  • data_collator: mini-batch preprocessing. [Padding plus task-specific operations; see Section 1.]

  • train_dataset: the training set, in Hugging Face datasets format

  • eval_dataset: the validation set

  • compute_metrics: computes evaluation metrics beyond the loss (a generic sketch follows the example code)

    See compute_metrics from geneformer/classifier_utils.py

from transformers import Trainer, TrainingArguments

training_args = {
    "output_dir": "./result",   # required in most Transformers versions
    "learning_rate": max_lr,
    "do_train": True
}
training_args = TrainingArguments(**training_args)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=eval_data, 
    compute_metrics=compute_metrics
)
# train
trainer.train()
# evaluate
metrics = trainer.evaluate()
# predict
predictions = trainer.predict(test_dataset)
# save model
trainer.save_model(model_output_dir)
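
A generic sketch of a compute_metrics function for a classification task (this is not the Geneformer implementation referenced above, just the usual accuracy pattern):

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred carries .predictions (logits) and .label_ids (true labels)
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}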

If you want to use the model directly for prediction:

padded_batch.set_format(type="torch")
input_data_batch = padded_batch["input_ids"]
attn_msk_batch = padded_batch["attention_mask"]
label_batch = padded_batch[label_name]
with torch.no_grad():
    outputs = model(
        input_ids=input_data_batch.to("cuda"),
        attention_mask=attn_msk_batch.to("cuda"),
        labels=label_batch.to("cuda"),
    )
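
The returned outputs then expose the loss and logits; for a classification head, predicted classes can be read off directly:

outputs.loss                                  # supervised loss for the batch
preds = outputs.logits.argmax(dim=-1).cpu()   # predicted class per sample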

Common TrainingArguments hyperparameters (a construction sketch follows the list):

  • output_dir="./result": directory where the model and checkpoints are saved
  • num_train_epochs = 3: number of training epochs (fractional values allowed)
  • per_device_train_batch_size = 8: training batch size per device
  • per_device_eval_batch_size = 64: evaluation batch size per device
  • gradient_accumulation_steps = 1: number of gradient accumulation steps
  • learning_rate = 5e-5: (maximum) learning rate
  • lr_scheduler_type = "cosine": learning-rate schedule
  • warmup_steps = 10000: number of learning-rate warmup steps
  • weight_decay = 0.01: weight decay
  • logging_steps = 500: logging frequency (in steps)
  • save_steps = 500: checkpoint-saving frequency (in steps)
  • evaluation_strategy = "epoch"/"steps"/"no": evaluation strategy
  • eval_steps = 500: evaluation frequency in steps (if "steps")
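
A minimal sketch putting several of these together (values copied from the list above, purely for illustration; note that in recent Transformers versions evaluation_strategy has been renamed to eval_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./result",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=1,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_steps=10000,
    weight_decay=0.01,
    logging_steps=500,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
)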