Huggingface Whitespace tokenizer is not "fast"

Posted on 2025-01-16 12:13:46

I want to run NER on pre-tokenized text, and have the following code:

from tokenizers.pre_tokenizers import Whitespace
#from transformers import convert_slow_tokenizer
from transformers import AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
# Whitespace is a pre-tokenizer component, not a full tokenizer
wstok = Whitespace()
#wstok = convert_slow_tokenizer.convert_slow_tokenizer(wstok)
# Passing the pre-tokenizer as the tokenizer is what raises the error below
ner_pipe = pipeline("ner", model=model, tokenizer=wstok)
tokens = ['Some', 'example', 'tokens', 'here', '.']
entities = ner_pipe(' '.join(tokens))

This gives me the following error:

AttributeError: 'tokenizers.pre_tokenizers.Whitespace' object has no attribute 'is_fast'

Seems to me that plain and simple whitespace tokenization should be pretty "fast", but that's probably not what they mean here :).
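
For context, "fast" in transformers is not a claim about algorithmic speed. Full tokenizer objects expose an is_fast attribute, and it is True for the ones backed by the Rust tokenizers library. Whitespace is only a pre-tokenizer component, so it has none of that interface. A minimal sketch of the difference, assuming only that the tokenizers package is installed:

```python
from tokenizers.pre_tokenizers import Whitespace

wstok = Whitespace()

# A pre-tokenizer only splits a string into (piece, (start, end)) pairs.
pieces = wstok.pre_tokenize_str("Some example tokens here .")
words = [piece for piece, _span in pieces]
print(words)

# It has no vocabulary, no encoding step, and no `is_fast` attribute,
# which is the attribute the pipeline tried to read.
print(hasattr(wstok, "is_fast"))
```

So the error is about the type of object passed as tokenizer, not about how quickly whitespace splitting runs.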

I've seen this post (hence the commented-out lines in the code snippet), but it tells me that the Whitespace class is not among the ones that can be converted.

Does anyone have any ideas on how I can get a fast Whitespace tokenizer in Hugging Face?
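
One possible direction, offered as a sketch rather than a confirmed answer: a Whitespace pre-tokenizer can be attached to a tokenizers.Tokenizer and wrapped in PreTrainedTokenizerFast, which does report is_fast. The vocabulary below is a toy one invented for illustration, and the names backend and fast_tok are hypothetical. Its ids do not match the vocabulary of dslim/bert-base-NER, so this tokenizer would not give meaningful predictions with that checkpoint. For a pretrained model, the usual route is to load the model's own tokenizer and pass the word list with is_split_into_words=True.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy vocabulary (assumption for illustration only): these ids do NOT
# correspond to any pretrained model's embedding table.
vocab = {"[UNK]": 0, "[PAD]": 1, "Some": 2, "example": 3,
         "tokens": 4, "here": 5, ".": 6}

# A word-level model plus a whitespace pre-tokenizer forms a full tokenizer.
backend = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# Wrapping the backend gives an object with the transformers tokenizer API.
fast_tok = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

words = ['Some', 'example', 'tokens', 'here', '.']
print(fast_tok.is_fast)
print(fast_tok.tokenize(' '.join(words)))
```

An object like fast_tok is the kind of thing pipeline accepts for its tokenizer argument, but it is only useful together with a model trained on the same word-level vocabulary.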
