NLTK and language detection
How do I detect what language a text is written in using NLTK?
The examples I've seen use nltk.detect, but when I installed it on my Mac, I could not find this package.
Comments (5)
Have you come across the following code snippet?
from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active
Or the following demo file?
https://web.archive.org/web/20120202055535/http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/misc/langid.py
This library is not part of NLTK either, but it certainly helps.
It supports Python 2.6, 2.7, and 3.x.
https://pypi.python.org/pypi/langdetect?
P.S.: Don't expect it to always work correctly.
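A minimal sketch of how langdetect might be called, assuming the package is installed (`pip install langdetect`). The function name `guess_language` and its `min_prob` cut-off are hypothetical, introduced only for this illustration. It uses `detect_langs`, which returns candidates with probabilities, and declines to answer when the top candidate looks weak, which is one way to live with the caveat above.

```python
def guess_language(text, min_prob=0.9):
    """Return langdetect's top language code, or None when it is not confident.

    min_prob is an arbitrary cut-off chosen for this sketch,
    not a langdetect default.
    """
    # Imported lazily so this module still loads where langdetect is missing.
    from langdetect import DetectorFactory, detect_langs

    # langdetect is non-deterministic by default; fixing the seed makes
    # repeated calls on the same text give the same answer.
    DetectorFactory.seed = 0

    if not text.strip():
        return None  # nothing to analyse

    # Candidates come back most probable first. Text without any letters
    # may make langdetect raise an exception instead; this sketch does
    # not handle that case.
    candidates = detect_langs(text)
    if not candidates:
        return None

    best = candidates[0]
    return best.lang if best.prob >= min_prob else None
```

Calling `guess_language("Ein, zwei, drei, vier")` returns the top code only if its probability clears the cut-off, and `None` otherwise, so a caller can fall back to a default language instead of trusting a weak guess.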
Although it is not part of NLTK, I have had great results with another Python-based library:
https://github.com/saffsd/langid.py
This is very simple to import and includes a large number of languages in its model.
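As a rough illustration of how simple the import is, here is a sketch assuming langid.py is installed (`pip install langid`). The wrapper name `identify_language` and its `candidates` parameter are hypothetical; `langid.classify` and `langid.set_languages` are the library calls it relies on.

```python
def identify_language(text, candidates=None):
    """Return (language_code, score) from langid.py for the given text."""
    # Imported lazily so this module still loads where langid is missing.
    import langid

    if candidates is not None:
        # Optionally restrict the model to a known set of languages.
        # Note: this changes module-level state, so it also affects
        # later calls that do not pass candidates.
        langid.set_languages(list(candidates))

    # By default the second element is an unnormalized score rather than
    # a probability, so it is mainly useful for comparing candidates.
    return langid.classify(text)
```

For example, `identify_language("This is a test", candidates=("en", "de"))` limits the decision to English and German, which may help when the input is short and the plausible languages are known in advance.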
Super late, but you could use the textcat classifier in nltk, here. This paper discusses the algorithm.
It returns a language code in ISO 639-3, so I would use pycountry to get the full name.
For example, load the libraries, then take two phrases and guess their language. You could then pass the resulting language names into other nltk functions.
Disclaimer: obviously this will not always work, especially for sparse data.
polyglot.detect can detect the language:
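A minimal sketch of what that might look like, assuming polyglot and its native dependencies (PyICU and pycld2, as far as I know) are installed. The wrapper name `detect_with_polyglot` is hypothetical; `Detector` and its `language` attribute come from `polyglot.detect`.

```python
def detect_with_polyglot(text):
    """Return (code, name, confidence) for the most likely language of text."""
    # Imported lazily so this module still loads where polyglot is missing.
    from polyglot.detect import Detector

    # Detector analyses the text on construction; .language is its top guess.
    language = Detector(text).language
    return language.code, language.name, language.confidence
```

On very short or ambiguous input the detector may refuse to give a reliable answer, so callers should be prepared for an exception or a low confidence value.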