NLTK and language detection

Posted 2024-09-08 10:19:04

How do I detect what language a text is written in using NLTK?

The examples I've seen use nltk.detect, but when I installed it on my Mac, I couldn't find this package.

Comments (5)

芯好空 2024-09-15 10:19:04

Have you come across the following code snippet?

import nltk  # needs the "words" corpus: nltk.download('words')

# `text` is assumed to be an iterable of word tokens
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)

from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active

Or the following demo file?

https://web.archive.org/web/20120202055535/http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/misc/langid.py
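
In the same spirit, here is a minimal sketch (not from the original thread) that guesses the language by ranking NLTK's built-in stopword lists by their overlap with the text. It requires nltk.download('stopwords') and returns language names such as 'english' or 'german' rather than ISO codes:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

def guess_language_by_stopwords(text):
    # Count how many of each language's stopwords appear in the text
    # and return the language with the highest overlap.
    tokens = set(w.lower() for w in wordpunct_tokenize(text))
    scores = {lang: len(tokens & set(stopwords.words(lang)))
              for lang in stopwords.fileids()}
    return max(scores, key=scores.get)

print(guess_language_by_stopwords("Ich gehe heute mit meinen Freunden ins Kino."))
# expected: 'german'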

没︽人懂的悲伤 2024-09-15 10:19:04

This library is not from NLTK either, but it certainly helps.

$ sudo pip install langdetect

Supported Python versions: 2.6, 2.7, and 3.x.

>>> from langdetect import detect

>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'

https://pypi.python.org/pypi/langdetect?

P.S.: Don't expect this to always work correctly:

>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
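
One note not in the original answer: langdetect is non-deterministic on short or ambiguous input, so its documentation recommends seeding the factory to get reproducible results:

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make detection deterministic across runs
print(detect("today is a good day"))
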
贱贱哒 2024-09-15 10:19:04

Although this is not in NLTK, I have had great results with another Python-based library:

https://github.com/saffsd/langid.py

This is very simple to import and includes a large number of languages in its model.
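
For reference, a minimal usage sketch (langid.py's classify() returns a tuple of language code and score):

import langid  # pip install langid

lang, score = langid.classify("This is a test sentence.")
print(lang, score)  # prints an ISO 639-1 code such as 'en' and a confidence score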

少跟Wǒ拽 2024-09-15 10:19:04

Super late, but you could use the TextCat classifier in NLTK (nltk.classify.textcat), which implements the Cavnar and Trenkle n-gram text-categorization algorithm.

It returns an ISO 639-3 language code, so I would use pycountry to get the full name.

For example, load the libraries

import nltk
import pycountry
from nltk.stem import SnowballStemmer
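
(Setup note, not in the original answer: recent NLTK versions ship the TextCat class but not its data, so the Crubadan corpus, and depending on the version the Punkt tokenizer, may need to be downloaded first.)

import nltk
nltk.download('crubadan')  # n-gram language profiles used by TextCat
nltk.download('punkt')     # tokenizer data, needed by some NLTK versions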

Now let's look at two phrases, and guess their language:

phrase_one = "good morning"
phrase_two = "goeie more"

tc = nltk.classify.textcat.TextCat() 
guess_one = tc.guess_language(phrase_one)
guess_two = tc.guess_language(phrase_two)

guess_one_name = pycountry.languages.get(alpha_3=guess_one).name
guess_two_name = pycountry.languages.get(alpha_3=guess_two).name
print(guess_one_name)
print(guess_two_name)

English
Afrikaans

You could then pass them into other nltk functions, for example:

stemmer = SnowballStemmer(guess_one_name.lower())
s1 = "walking"
print(stemmer.stem(s1))
walk

Disclaimer: obviously this will not always work, especially for sparse data.

Extreme example:

guess_example = tc.guess_language("hello")
print(pycountry.languages.get(alpha_3=guess_example).name)
Konkani (individual language)

拿命拼未来 2024-09-15 10:19:04

polyglot.detect can detect the language:

from polyglot.detect import Detector

foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
print(Detector(foreign).language)

name: Spanish     code: es       confidence:  98.0 read bytes:   865
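
As a supplementary sketch, not part of the original answer and based on the attributes shown in the printed output plus polyglot's documented Detector API, the result can also be inspected programmatically, and Detector can report several candidate languages:

from polyglot.detect import Detector

detector = Detector('Este libro ha sido uno de los mejores libros que he leido.')
print(detector.language.code)        # expected: 'es'
print(detector.language.confidence)  # a percentage, e.g. about 98
for candidate in detector.languages: # up to three candidate languages
    print(candidate)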