HuggingFace | Basic Components: Pipeline

Published 2023-07-31 11:53:54 · Author: 张Zong在修行

What Is a Pipeline

  • Pipeline
    • A workflow that chains three stages together: data preprocessing, model invocation, and result post-processing
    • Lets us feed in raw text and get the final answer back directly
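The three stages can be sketched as plain functions composed into a single call. This is an illustrative mock, not the transformers implementation; every function name here is invented for the sketch:

```python
def preprocess(text):
    # Stand-in for tokenization: turn raw text into model inputs.
    return {"input_ids": [ord(ch) % 100 for ch in text]}

def forward(inputs):
    # Stand-in for the model call: produce raw class scores (logits).
    total = sum(inputs["input_ids"])
    return [float(total % 7), float(total % 5)]

def postprocess(logits):
    # Stand-in for label mapping: pick the higher-scoring class.
    return "LABEL_0" if logits[0] >= logits[1] else "LABEL_1"

def mini_pipeline(text):
    # preprocessing -> model call -> post-processing, end to end
    return postprocess(forward(preprocess(text)))

print(mini_pipeline("hello"))
```

A real Pipeline does exactly this composition, with a tokenizer as the preprocessor and a neural network as the forward step.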

Task Types Supported by Pipeline

List the task types Pipeline supports:

from transformers.pipelines import SUPPORTED_TASKS

# Print each supported task name and its configuration
for k, v in SUPPORTED_TASKS.items():
    print(k, v)

SUPPORTED_TASKS lists every task the Pipeline API supports, together with each task's implementation class and the model classes used by the TensorFlow (tf) and PyTorch (pt) backends.
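Each SUPPORTED_TASKS entry is a dict describing one task. As a rough sketch of its shape (field names based on my reading of the transformers source and may differ across versions; the real values are classes and tuples, stubbed here as strings):

```python
# An illustrative mock of one registry entry, not the real SUPPORTED_TASKS.
supported_tasks_sketch = {
    "text-classification": {
        "impl": "TextClassificationPipeline",              # pipeline class implementing the task
        "tf": ("TFAutoModelForSequenceClassification",),   # TensorFlow auto-model class(es)
        "pt": ("AutoModelForSequenceClassification",),     # PyTorch auto-model class(es)
        "default": {"model": {"pt": "some-default-checkpoint"}},  # hypothetical default
        "type": "text",                                    # modality of the task
    },
}

for task, info in supported_tasks_sketch.items():
    print(task, info["impl"], info["type"])
```

Iterating over the real SUPPORTED_TASKS, as in the snippet above, prints the same kind of information for every task.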

How to Create and Use a Pipeline

  • Create a Pipeline directly from the task type
from transformers import pipeline

pipe = pipeline("text-classification")
  • Specify the task type and a model, creating a Pipeline backed by that model
pipe = pipeline("text-classification", model="uer/roberta-base-finetuned-dianping-chinese")
  • Load the model first, then create the Pipeline from it
# With this approach, model and tokenizer must both be passed explicitly
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("uer/roberta-base-finetuned-dianping-chinese")
tokenizer = AutoTokenizer.from_pretrained("uer/roberta-base-finetuned-dianping-chinese")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
  • Use a GPU to speed up inference
pipe = pipeline("text-classification", model="uer/roberta-base-finetuned-dianping-chinese", device=0)

Timing the same call on CPU and GPU:

pipe.model.device
# device(type='cuda', index=0)

import torch
import time

times = []
for i in range(100):
    torch.cuda.synchronize()  # wait for any pending GPU work before starting the clock
    start = time.time()
    pipe("我觉得不太行!")
    torch.cuda.synchronize()  # wait for this call's GPU work to finish before stopping the clock
    end = time.time()
    times.append(end - start)
print(sum(times) / len(times))
# CPU time: ~2.1 s per call
# GPU time: ~0.6 s per call
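The timing loop above can be factored into a small reusable helper. A minimal sketch using only the standard library (`benchmark` is a name invented here; for GPU models you would still wrap the timed call with `torch.cuda.synchronize()` as shown above):

```python
import time

def benchmark(fn, n=100):
    """Average wall-clock latency of fn() over n calls, in seconds."""
    times = []
    for _ in range(n):
        start = time.perf_counter()  # monotonic and higher resolution than time.time()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / n

# Placeholder workload standing in for pipe("我觉得不太行!")
avg = benchmark(lambda: sum(range(10_000)), n=10)
print(f"{avg:.6f} s per call")
```

With a real pipeline you would call `benchmark(lambda: pipe("我觉得不太行!"))` once with the CPU pipeline and once with the GPU one.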

What Happens Behind a Pipeline

  • Step 1: Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("uer/roberta-base-finetuned-dianping-chinese")
  • Step 2: Initialize the model
model = AutoModelForSequenceClassification.from_pretrained("uer/roberta-base-finetuned-dianping-chinese")
  • Step 3: Preprocess the input
input_text = "我觉得不太行!"
inputs = tokenizer(input_text, return_tensors="pt")
  • Step 4: Run the model
res = model(**inputs)
logits = res.logits
  • Step 5: Post-process the result
logits = torch.softmax(logits, dim=-1)
pred = torch.argmax(logits).item()
result = model.config.id2label.get(pred)
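Step 5 is just a softmax over the logits followed by an argmax. The same arithmetic in plain Python, with hypothetical logits standing in for the model's real output:

```python
import math

def softmax(xs):
    # Numerically stable softmax, matching torch.softmax(..., dim=-1)
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0]  # hypothetical raw scores for the two classes
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax over probabilities
id2label = {0: "negative (stars 1, 2 and 3)", 1: "positive (stars 4 and 5)"}
result = id2label[pred]
print(probs, result)
```

The softmax turns raw scores into probabilities that sum to 1, and argmax picks the most probable class, which id2label maps back to a human-readable label.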

Putting it all together:

# Imports
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("uer/roberta-base-finetuned-dianping-chinese")
model = AutoModelForSequenceClassification.from_pretrained("uer/roberta-base-finetuned-dianping-chinese")

# Preprocess the input
input_text = "我觉得不太行!"
inputs = tokenizer(input_text, return_tensors="pt")

# Run the model
res = model(**inputs)
logits = res.logits

# Inspect the label mapping
model.config.id2label  # {0: 'negative (stars 1, 2 and 3)', 1: 'positive (stars 4 and 5)'}

# Post-process the result
logits = torch.softmax(logits, dim=-1)
pred = torch.argmax(logits).item()
result = model.config.id2label.get(pred)  # 'negative (stars 1, 2 and 3)'