Huggingface Whitespace tokenizer is not "fast"
I want to run NER on pre-tokenized text, and have the following code:
from tokenizers.pre_tokenizers import Whitespace
#from transformers import convert_slow_tokenizer
from transformers import AutoModelForTokenClassification, pipeline
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
wstok = Whitespace()
#wstok = convert_slow_tokenizer.convert_slow_tokenizer(wstok)
ner_pipe = pipeline("ner", model=model, tokenizer=wstok)
tokens = ['Some', 'example', 'tokens', 'here', '.']
entities = ner_pipe(' '.join(tokens))
Which gives me the following error:
AttributeError: 'tokenizers.pre_tokenizers.Whitespace' object has no attribute 'is_fast'
Seems to me that plain and simple whitespace tokenization should be pretty "fast", but that's probably not what they mean here :).
I've seen this post (hence the commented out lines in the code snippet), but that tells me that the Whitespace class is not among the ones that can be converted.
Does anyone have any ideas on how I can get a fast Whitespace tokenizer in Huggingface?
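For context, what the pipeline seems to want is a tokenizer exposing the `PreTrainedTokenizerFast` interface (which has `is_fast`), not a bare `tokenizers.pre_tokenizers.Whitespace` object, which is only a pre-tokenization step. A minimal sketch of wrapping a whitespace-splitting `tokenizers.Tokenizer` so it passes that check — the toy `WordLevel` vocabulary below is purely illustrative and would not match the WordPiece vocabulary that `dslim/bert-base-NER` was trained with:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy vocabulary for illustration only; a real setup would need the
# model's own vocabulary for the embeddings to make sense.
vocab = {"[UNK]": 0, "Some": 1, "example": 2, "tokens": 3, "here": 4, ".": 5}

# Build a full Tokenizer: a WordLevel model with Whitespace pre-tokenization.
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Wrap it so it exposes the fast-tokenizer interface (is_fast, offsets, ...).
fast_tok = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]")
print(fast_tok.is_fast)                          # True
print(fast_tok.tokenize("Some example tokens"))  # ['Some', 'example', 'tokens']
```

A wrapped tokenizer like this can be passed as `tokenizer=` to `pipeline(...)` without the `is_fast` AttributeError, though for `dslim/bert-base-NER` specifically the model's own WordPiece vocabulary would still be needed for sensible predictions.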