
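Hugging Face (HF for short below) is a company focused on natural language processing (NLP) and artificial intelligence (AI). It can be thought of as the GitHub of the AI world.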

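  • On one hand, it collects and curates the datasets and pretrained models commonly used for NLP and other AI tasks
  • On the other hand, it provides a family of libraries for training AI models efficiently. The core libraries are: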
    • Transformers
    • Datasets
    • Tokenizers

1. Getting models and datasets

HF hosts a large number of public AI datasets and pretrained models, and provides ways to download them.

  • Downloading a dataset
    • The code below automatically downloads the dataset at https://huggingface.co/datasets/lhoestq/demo1
from datasets import load_dataset
dataset = load_dataset("lhoestq/demo1")
  • Loading a model, which covers two use cases
    • Using the model's tokenizer
    • Using the pretrained model itself (model checkpoint)
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

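With the calls above, downloads are very slow, at least from within mainland China. A workaround is to download the dataset/model to local disk through a mirror site, then pass the local path to the same loading calls.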

# Install the command-line tool
pip install -U huggingface_hub
# Point the client at the mirror through an environment variable
export HF_ENDPOINT=https://hf-mirror.com
# Download the model
huggingface-cli download \
--resume-download bert-base-uncased \
--local-dir path/to/bert-base-uncased
# Download the dataset
huggingface-cli download --repo-type dataset \
--resume-download \
lhoestq/demo1 --local-dir path/to/lhoestq_demo1
dataset = load_dataset("path/to/lhoestq_demo1")

tokenizer = AutoTokenizer.from_pretrained("path/to/bert-base-uncased")

Note: in the usage examples below, each dataset/model is first downloaded into a local datasets folder and then loaded from there.
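The same mirror download can also be scripted from Python. The sketch below is a rough equivalent of the shell commands above, not a tested recipe: it assumes huggingface_hub is installed and that it reads HF_ENDPOINT when it is first imported, which is why the variable is set before the import. The helper name download_to_local is made up for illustration.

```python
import os

# Point the Hub client at the mirror *before* huggingface_hub is imported.
# setdefault keeps any endpoint that was already exported in the shell.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")


def download_to_local(repo_id, local_dir, repo_type="model"):
    """Hypothetical helper: fetch a whole repository into local_dir and return its path."""
    # Imported lazily so the environment variable above is already in place.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=local_dir)


# Usage (needs network access), mirroring the two CLI calls above:
# download_to_local("bert-base-uncased", "path/to/bert-base-uncased")
# download_to_local("lhoestq/demo1", "path/to/lhoestq_demo1", repo_type="dataset")
```

The returned path can then be passed to load_dataset or from_pretrained exactly like the path/to/... strings above.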

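2. Loading models

2.1 Complete models

  • HF hosts a large number of models that are already fine-tuned for different NLP tasks, organized by task. See: https://huggingface.co/docs/transformers/v4.45.2/en/main_classes/pipelines#transformers.pipeline

  • The models available for each task can be browsed at https://huggingface.co/models?pipeline_tag=token-classification&sort=trending (this link lists the token-classification models)

  • They are mainly called through the pipeline function, which runs inference for a specific task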

from transformers import pipeline
# Example 1
classifier = pipeline(task="sentiment-analysis", model="./datasets/sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
# [{'label': 'POSITIVE', 'score': 0.9598047137260437},
#  {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

# Example 2
pipe = pipeline(task="token-classification", model="datasets/Medical-NER")
result = pipe('45 year old woman diagnosed with CAD')
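The score field in the output of Example 1 is a probability. As a rough sketch of the last step of a two-class text classification pipeline, the model's raw logits go through softmax and the largest entry is reported with its label. The logits below are hypothetical values chosen for illustration, not real model output, and id2label stands in for the mapping that would normally come from the model configuration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Turn one row of logits into a {'label', 'score'} dict shaped like the pipeline output."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping
hypothetical_logits = [-1.5, 1.5]          # made-up values, not real model output
result = postprocess(hypothetical_logits, id2label)
print(result["label"], round(result["score"], 4))
```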

2.2 Pretrained models

  • First, a pretrained model expects the tokenized representation of the raw input, which is produced by a tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./datasets/sentiment-analysis")
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
# {'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
#           2607,  2026,  2878,  2166,  1012,   102],
#         [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
#              0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

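When tokenizing input sequences with the tokenizer, two parameters matter most:

  • padding pads shorter sequences up to a common length; attention_mask is set to 0 at the padded positions so that the model ignores them;
  • truncation cuts sequences that exceed the maximum length; set max_length to choose that limit, otherwise the model's own maximum input length is used.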
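To make the two parameters concrete, here is a toy re-implementation on plain lists of token ids. It is only a sketch: the function name pad_and_truncate is invented, and unlike a real tokenizer it cuts sequences with a plain slice instead of keeping the closing [SEP] token.

```python
def pad_and_truncate(batch, max_length=None, pad_id=0):
    """Toy version of padding=True / truncation=True on lists of token ids."""
    if max_length is not None:
        # Truncation: drop everything past max_length (real tokenizers keep [SEP]).
        batch = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in batch)
    # Padding: extend every sequence to the longest one with pad_id.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 1045, 5223, 2023, 102], [101, 999, 102]]
padded = pad_and_truncate(batch)
print(padded["input_ids"])       # second row is padded with two zeros
print(padded["attention_mask"])  # the mask marks those two positions with 0

truncated = pad_and_truncate(batch, max_length=4)
print(truncated["input_ids"])    # first row is cut to four ids
```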
# Show the individual steps behind tokenizer()
sequence = "Using a Transformer network is simple"

tokenizer.tokenize(sequence)
# ['using', 'a', 'transform', '##er', 'network', 'is', 'simple']
tokenizer.encode(sequence)
# [101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 102]
tokenizer.decode([101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 102])
# '[CLS] using a transformer network is simple [SEP]'
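The ##er piece above comes from WordPiece, the subword scheme used by BERT-style tokenizers. Below is a minimal sketch of its greedy longest-match-first lookup. It assumes a tiny hand-written vocabulary (the real one is far larger) and uses plain lowercasing and whitespace splitting in place of the tokenizer's full normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hand-written toy vocabulary, an assumption for this example only.
vocab = {"using", "a", "transform", "##er", "network", "is", "simple"}
sequence = "Using a Transformer network is simple"
tokens = [p for w in sequence.lower().split() for p in wordpiece(w, vocab)]
print(tokens)
```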

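Note: the model expects batched input, i.e. a 2-D tensor built from a list of lists of token ids (one inner list per sequence), which is what tokenizer() returned above.

  • Loading the pretrained part of the model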
from transformers import AutoModel

model = AutoModel.from_pretrained("./datasets/sentiment-analysis")
model

# Use the pretrained model to compute a contextual embedding for every token
outputs = model(**inputs)  # unpack the dict into keyword arguments, i.e. input_ids and attention_mask above
print(outputs.last_hidden_state.shape)
# torch.Size([2, 16, 768])

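2.3 Classic models

  • The HF Transformers library provides a large number of classic large AI models covering text, vision, audio and other domains. Both the architecture and the pretrained weights are easy to load
  • For example, BERT: https://huggingface.co/docs/transformers/model_doc/bert (in the docs navigation: Transformers / API / Models / Text models / BERT)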
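from transformers import BertConfig, BertModel

# Build the architecture from a default configuration; the weights are randomly initialized
config = BertConfig()
config

model = BertModel(config)

# Or load the architecture together with its pretrained weights
model = BertModel.from_pretrained("./datasets/bert-base-cased")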
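As a back-of-the-envelope check on what a default configuration implies, the sketch below counts the encoder's weights by hand. The default values are an assumption taken from the documented BertConfig defaults (vocabulary 30522, hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types); adjust them if your version differs. The function name bert_param_count is made up.

```python
def bert_param_count(vocab_size=30522, hidden=768, layers=12,
                     intermediate=3072, max_positions=512, type_vocab=2):
    """Count the trainable parameters of a BERT encoder with pooler from its config values."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and token-type embeddings followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count()
print(f"{total:,} parameters")
```

With these assumed defaults the count comes out at roughly 109 million, in line with the size usually quoted for BERT-base. The number of attention heads does not appear because splitting the hidden size across heads does not change the number of weights.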

3. Fine-tuning a model

The following walks through a fine-tuning example based on BERT.

  1. Load the model architecture
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification

checkpoint = "datasets/bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

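AutoModelForSequenceClassification is the API for fine-tuning a pretrained model on a sequence classification task; it puts a classification head on top of the pretrained model. Similar APIs exist for other fine-tuning tasks, see: https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#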

  2. Fine-tuning dataset
  • (1) Loading
raw_datasets = load_dataset("./datasets/glue", "mrpc")
# Before tokenization the raw dataset only holds the original columns:
# DatasetDict({
#     train: Dataset({
#         features: ['sentence1', 'sentence2', 'label', 'idx'],
#         num_rows: 3668
#     })
#     validation: Dataset({
#         features: ['sentence1', 'sentence2', 'label', 'idx'],
#         num_rows: 408
#     })
#     test: Dataset({
#         features: ['sentence1', 'sentence2', 'label', 'idx'],
#         num_rows: 1725
#     })
# })
  • (2) Tokenization
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The Datasets library stores data in Apache Arrow, so batched map() calls are efficient
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)  # no padding yet
tokenized_datasets["train"].column_names
# ['sentence1',
#  'sentence2',
#  'label',
#  'idx',
#  'input_ids',
#  'token_type_ids',
#  'attention_mask']

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads each mini-batch dynamically
  • (3) Data cleaning
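# Remove the columns the model does not accept
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
# Rename the label column to the name the model expects
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# Return PyTorch tensors instead of Python lists
tokenized_datasets.set_format("torch")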
tokenized_datasets["train"].column_names
# ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
  • (4) Mini-batch iterators: each iteration yields a dictionary
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}
# {'labels': torch.Size([8]),
#  'input_ids': torch.Size([8, 81]),
#  'token_type_ids': torch.Size([8, 81]),
#  'attention_mask': torch.Size([8, 81])}
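The batch above is padded to 81 tokens, the longest sequence it happens to contain, rather than to a global maximum. The sketch below illustrates why this per-batch (dynamic) padding saves work. It is a simplified stand-in on plain Python lists, not the implementation of DataCollatorWithPadding; the name collate_with_padding and the sample lengths are made up, and only input_ids and labels are handled.

```python
def collate_with_padding(features, pad_id=0):
    """Pad the input_ids of one mini-batch to the longest sequence in that batch."""
    longest = max(len(f["input_ids"]) for f in features)
    return {
        "input_ids": [f["input_ids"] + [pad_id] * (longest - len(f["input_ids"]))
                      for f in features],
        "labels": [f["labels"] for f in features],
    }


# Made-up sequence lengths; token id 1 is just a placeholder.
lengths = [5, 7, 30, 6]
dataset = [{"input_ids": [1] * n, "labels": 0} for n in lengths]

batch_size = 2
batches = [collate_with_padding(dataset[i:i + batch_size])
           for i in range(0, len(dataset), batch_size)]

# Token slots the model has to process under each padding strategy.
dynamic_slots = sum(len(row) for b in batches for row in b["input_ids"])
global_slots = len(dataset) * max(lengths)
print(dynamic_slots, "slots with per-batch padding vs", global_slots, "with global padding")
```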
  3. Training parameters
  • (1) Optimizer
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of this one
# Optimizer over all model parameters
optimizer = AdamW(model.parameters(), lr=5e-5)
  • (2) Learning-rate scheduler
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
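lr_scheduler = get_scheduler(
    "linear",     # linear schedule
    optimizer=optimizer,
    num_warmup_steps=0,  # zero warmup steps, i.e. no learning-rate warmup
    num_training_steps=num_training_steps,
)
# With these settings the learning rate decays linearly from its initial value to 0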
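For reference, the linear schedule can be written out as a small formula. The sketch below reproduces it in plain Python, assuming the multiplier rises linearly during warmup and then falls linearly to zero; it is an illustration, not the library's code, and linear_lr is a made-up name. The step count uses the MRPC training split shown earlier (3668 rows, batch size 8, 3 epochs).

```python
import math


def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        factor = max(0.0, (num_training_steps - step)
                     / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


steps_per_epoch = math.ceil(3668 / 8)     # 459 batches per epoch
num_training_steps = 3 * steps_per_epoch  # 1377 optimizer steps in total

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, 0, num_training_steps))
```

With zero warmup steps the rate starts at the full 5e-5 and reaches 0 exactly at the last step.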
  4. Training
  • Training step
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
#  1376/1377 [01:12<00:00, 17.79it/s]
  • Evaluation step
import evaluate
# https://github.com/huggingface/evaluate/issues/472
metric = evaluate.load("./evaluate/metrics/glue/glue.py","mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
# {'accuracy': 0.8480392156862745, 'f1': 0.8934707903780069}
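The MRPC metric reports accuracy and the F1 score of the positive class. As a sanity check on what those two numbers mean, here is a plain-Python sketch on made-up predictions. It is not the evaluate implementation, and the sample labels are invented.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy over all examples and F1 of the positive class for binary labels."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    correct = sum(p == r for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Invented labels for illustration only.
references = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = [1, 1, 0, 0, 1, 1, 0, 1]
scores = accuracy_and_f1(predictions, references)
print(scores)
```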

In addition, HF provides the accelerate library for convenient distributed training. In practice, however, I ran into the following error: NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" or use accelerate launch which will do this automatically. The error can be avoided as follows:

import os
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_P2P_DISABLE"] = "1"