HuggingFace

Published 2023-04-16 01:03:16 · Author: LinXiaoshu

Using HuggingFace

The site's documentation can be found here; it is very thorough.

Install the transformers library: pip install transformers
Import the model, tokenizer, and configuration classes you need: from transformers import AutoModel, AutoTokenizer, AutoConfig
Load a pretrained model: model = AutoModel.from_pretrained(model_name_or_path)
Load the tokenizer: tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
Load the configuration: config = AutoConfig.from_pretrained(model_name_or_path)
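As a quick sanity check of what a tokenizer call actually returns, here is a minimal sketch (the checkpoint name bert-base-uncased is just a small example for illustration, not from the steps above; any checkpoint behaves the same way):

```python
from transformers import AutoTokenizer

# Example checkpoint only (an assumption for illustration)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("Hello world!", return_tensors="pt")
# The tokenizer returns a dict-like BatchEncoding holding the tensors the model expects
print(list(enc.keys()))        # e.g. input_ids, attention_mask, ...
print(enc["input_ids"].shape)  # (batch, sequence_length)
```

The same dict can be unpacked straight into a model call with `model(**enc)`.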

Two reference links: link1 & link2.

Example: the translation model Helsinki-NLP/opus-mt-en-zh

I started with the NLP documentation and found an English-to-Chinese model: Helsinki-NLP/opus-mt-en-zh. Following the site's documentation, it is very easy to call the pipeline module, which wraps the three NLP stages of preprocessing, model inference, and postprocessing, so it behaves end to end. For example:

from transformers import pipeline
translator = pipeline(model="Helsinki-NLP/opus-mt-en-zh")
translator("The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use.")
# -------------------------output--------------------
# [{'translation_text': '这些管道是使用推断模型的极好和容易的方式。这些管道是图书馆中大部分复杂代码抽象的物体,提供了用于几项任务的简单API,包括名称实体识别、隐蔽语言建模、感官分析、地物提取和问题回答。使用实例见任务摘要。'}]
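pipeline also accepts a list of sentences for batch inference, and the task name can be given explicitly instead of being inferred from the checkpoint. A short sketch (same checkpoint as above, so the first call downloads the model):

```python
from transformers import pipeline

# Explicit task name; pipeline would also infer "translation" from the checkpoint
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

results = translator(["Hello world!", "How are you?"])
for r in results:
    print(r["translation_text"])
```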

Hmm... how to put it. Passable, I suppose. Of course, you may want to run these stages yourself. My first attempt was:

from transformers import AutoModel, AutoTokenizer, AutoConfig

model_name = "Helsinki-NLP/opus-mt-en-zh"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Everything is fine up to this point, but calling generation raises an error:

src_sentence = [
    "Tom isn't going to give me that.",
    "I have just arrived by train. she said."
]

translated = model.generate(**tokenizer(src_sentence, return_tensors="pt", padding=True))
# output:
# TypeError: The current model class (MarianModel) is not compatible with `.generate()`, as it doesn't have a language model head. Please use one of the following classes instead: {'MarianMTModel', 'MarianForCausalLM'}
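One way to see in advance which head class a checkpoint was saved with is to look at its config: the architectures field records the class name the checkpoint was exported with, which is the class AutoModelForSeq2SeqLM will resolve to. A small sketch:

```python
from transformers import AutoConfig

# The checkpoint's config.json records the model class it was saved with
config = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
print(config.architectures)
```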

This is actually a bit odd. Is AutoModel not good enough? Later, in the model's discussion board, I saw the author reply that the intended usage is:

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

Used this way, it works. And indeed, the model really is a MarianMT model, so you can also call it like this:

from transformers import AutoTokenizer, MarianMTModel

model_name = "Helsinki-NLP/opus-mt-en-zh"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

translated = model.generate(**tokenizer(src_sentence, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

This part comes from the official MarianMT documentation.
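Incidentally, the list comprehension at the end can be replaced with tokenizer.batch_decode, which decodes a whole batch of id sequences at once. A minimal round-trip sketch (bert-base-uncased is just an example checkpoint here, since decoding does not need the translation model itself):

```python
from transformers import AutoTokenizer

# Example checkpoint only; every tokenizer exposes decode/batch_decode the same way
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(["Hello world!", "Goodbye!"], padding=True, return_tensors="pt")
texts = tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True)
print(texts)  # roughly round-trips the input strings
```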

An aside: if you want to know more about Marian and MarianMT, see the answers here. In short, Marian is a training toolkit for neural machine translation, named in memory of the Polish cryptologist Marian Rejewski, who helped break the Enigma cipher around World War II.