Why won't NLTK tokenize text in languages other than English?

Posted 2025-01-30 06:29:45


I'm trying to tokenise strings in different languages using word_tokenize from nltk.tokenize. What I'm finding is that, no matter what language I select, and no matter what language the string I try to tokenise is in, the tokeniser defaults to English.

For example, when I try to tokenise some German text and specify that the language is German:

from nltk.tokenize import word_tokenize

test_de = "Das lange Zeit verarmte und daher von Auswanderung betroffene Irland " \
          "hat sich inzwischen zu einer hochmodernen, in manchen Gegenden multikulturellen " \
          "Industrie- und Dienstleistungsgesellschaft gewandelt."

print(word_tokenize(test_de, 'german'))

I get this output:

['Das', 'lange', 'Zeit', 'verarmte', 'und', 'daher', 'von', 'Auswanderung', 'betroffene', 'Irland', 'hat', 'sich', 'inzwischen', 'zu', 'einer', 'hochmodernen', ',', 'in', 'manchen', 'Gegenden', 'multikulturellen', 'Industrie-', 'und', 'Dienstleistungsgesellschaft', 'gewandelt', '.']

You can see that German compound words like 'Dienstleistungsgesellschaft' aren't split into their components, 'Dienstleistungs' and 'gesellschaft'.

When I try to tokenise English text, but specify that the language is German:

from nltk.tokenize import word_tokenize

test_en = "This is some test text. It's short. It doesn't say very much."

print(word_tokenize(test_en, 'german'))

I get this output:

['This', 'is', 'some', 'test', 'text', '.', 'It', "'s", 'short', '.', 'It', 'does', "n't", 'say', 'very', 'much', '.']

It's still clearly being tokenised like English text, even though I specified German. You can see it's splitting off English compound tokens like "n't" and "'s".

Am I doing something wrong? How can I tokenise other languages than English?


仙女山的月亮 2025-02-06 06:29:45


NLTK can tokenize several languages including German (see a previous SO question). However, compound splitting is traditionally not part of tokenization. Although it is rather simple in most cases, it can sometimes be ambiguous and you need context to resolve the split correctly. E.g., the word "Waldecke" has two possible segmentations, "Wald⋅ecke" and "Wal⋅decke", but most of the time only the first one makes sense.
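For context, here is a minimal sketch of how the language argument is used, based on my reading of NLTK's implementation (treat the details as an assumption): it selects the pretrained Punkt model that splits the text into sentences, while the word-level step is the same Treebank-style tokenizer for every language, which is why the output in the question is simply correct German tokenization rather than an English fallback.

from nltk.tokenize import sent_tokenize, word_tokenize

test_de = ("Das lange Zeit verarmte und daher von Auswanderung betroffene Irland "
           "hat sich inzwischen zu einer hochmodernen, in manchen Gegenden multikulturellen "
           "Industrie- und Dienstleistungsgesellschaft gewandelt.")

# 'german' picks the German Punkt sentence-splitting model; the per-sentence
# word tokenization is the same rule-based (Treebank-style) step as for English.
for sentence in sent_tokenize(test_de, language='german'):
    print(word_tokenize(sentence, language='german'))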

What you probably want is to apply a compound splitter to the tokenized text. There are several options, including both rule-based and machine-learned tools.
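As a purely illustrative sketch of the rule-based idea (the tiny VOCAB, the split_compound helper and the handling of the linking -s- are invented for this example, not the API of any real splitter):

from nltk.tokenize import word_tokenize

# Toy lexicon; a real tool would use a large dictionary plus frequency or
# context information to rank candidate splits (e.g. Wald⋅ecke vs. Wal⋅decke).
VOCAB = {"dienstleistung", "gesellschaft", "industrie", "wald", "ecke"}

def split_compound(token, vocab=VOCAB):
    """Try to split a token into two lexicon words, allowing a linking -s- (Fugen-s)."""
    low = token.lower()
    for i in range(3, len(low) - 2):
        head, tail = low[:i], low[i:]
        if (head in vocab or head.rstrip("s") in vocab) and tail in vocab:
            return [token[:i], token[i:]]
    return [token]

tokens = word_tokenize("Industrie- und Dienstleistungsgesellschaft", language='german')
print([part for token in tokens for part in split_compound(token)])
# e.g. ['Industrie-', 'und', 'Dienstleistungs', 'gesellschaft']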

Note that most current neural-network-based NLP uses statistical subword segmentation (such as Byte-Pair Encoding or SentencePiece), so it avoids the need for linguistically motivated segmentation altogether.
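To make that concrete, here is a small, self-contained sketch of the byte-pair-encoding idea (a toy re-implementation for illustration, not the actual BPE or SentencePiece code): repeatedly merge the most frequent adjacent pair of symbols seen in training words, then apply the learned merges to new words, so frequent character sequences become subword units without any linguistic rules.

from collections import Counter

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair in symbols with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges=30):
    """Learn merge operations from a toy word list; characters are the initial symbols.
    (Real BPE also tracks word boundaries / end-of-word markers, omitted here.)"""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["dienstleistungsgesellschaft", "gesellschaft", "industriegesellschaft",
          "gesellschaften", "dienstleistung"]
merges = learn_bpe(corpus)
# Frequent pieces such as 'gesellschaft' tend to surface as single subword units.
print(segment("mediengesellschaft", merges))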
