How to detect language

Posted on 2024-09-08 06:41:50

Are there any good, open source engines out there for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and doesn't query Google or Bing? I'd like to detect language for each page in about 15 million pages of OCR'ed text.

Not all documents will contain languages which use the Latin alphabet.

Comments (7)

半暖夏伤 2024-09-15 06:41:50

Depending on what you're doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms.

In general, letter and word frequencies would probably be the fastest evaluation, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will probably also be useful if you discover that the first two methods have too high an error rate.
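To make the Bayesian suggestion concrete, here is a minimal pure-Python sketch of a naive Bayes classifier over character bigrams with add-one smoothing and a uniform prior. It does not use NLTK's API; the class name, method names, and the two tiny training sentences are all made up for illustration, and real use would need far more training text per language.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of the lowercased, padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageGuesser:
    """Hypothetical toy classifier: one n-gram count table per language."""

    def __init__(self, n=2):
        self.n = n
        self.counts = {}    # language -> Counter of n-grams
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any language

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def log_scores(self, text):
        """Log-likelihood of the text under each language (uniform prior)."""
        grams = char_ngrams(text, self.n)
        v = len(self.vocab) + 1  # one extra bucket for unseen n-grams
        return {
            lang: sum(math.log((counter[g] + 1) / (self.totals[lang] + v))
                      for g in grams)
            for lang, counter in self.counts.items()
        }

    def probabilities(self, text):
        """Normalise the log scores into values that sum to 1."""
        scores = self.log_scores(text)
        top = max(scores.values())
        weights = {lang: math.exp(s - top) for lang, s in scores.items()}
        norm = sum(weights.values())
        return {lang: w / norm for lang, w in weights.items()}

    def guess(self, text):
        probs = self.probabilities(text)
        return max(probs, key=probs.get)


# Toy training data (illustrative only).
model = NaiveBayesLanguageGuesser()
model.train("en", "the quick brown fox jumps over the lazy dog and the cat "
                  "is in the house with this that")
model.train("de", "der schnelle braune fuchs springt über den faulen hund "
                  "und die katze ist in dem haus mit diesem")

print(model.guess("the lazy dog"))
print(model.probabilities("der faule hund"))
```

The normalised values give a rough probability-style metric per page, but with independence assumptions this strong they tend to be overconfident, so they are better used for ranking than as calibrated probabilities.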

飘过的浮云 2024-09-15 06:41:50

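You can surely build your own, given some statistics about the letter frequencies (http://en.wikipedia.org/wiki/Letter_frequency#Relative_frequency_of_letters_in_other_languages), digraph frequencies, etc., of your target languages.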

Then release it as open source. And voila, you have an open source engine for detecting the language of text!
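As a rough sketch of what "build your own" could look like, the snippet below builds a letter-frequency profile per language and ranks languages by cosine similarity to the page's profile. The function names and the three one-sentence reference samples are placeholders chosen for illustration; real profiles would come from published frequency tables or a large corpus.

```python
import math
from collections import Counter


def letter_profile(text):
    """Relative frequency of each alphabetic character in the text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def rank_languages(text, profiles):
    """Return (similarity, language) pairs, best match first."""
    sample = letter_profile(text)
    scored = [(cosine_similarity(sample, profile), lang)
              for lang, profile in profiles.items()]
    return sorted(scored, reverse=True)


# Placeholder reference samples; far too short for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog",
    "ru": "съешь же ещё этих мягких французских булок да выпей чаю",
    "el": "Ξεσκεπάζω την ψυχοφθόρα βδελυγμία",
}
PROFILES = {lang: letter_profile(text) for lang, text in SAMPLES.items()}

for page in ("Hello, how are you?", "Привет, как дела?", "Καλημέρα, τι κάνεις;"):
    print(page, rank_languages(page, PROFILES))
```

Single-letter frequencies separate scripts easily but are weak at telling apart languages that share an alphabet; adding digraph (two-letter) frequencies to the profile, as the answer suggests, is the usual next step.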

勿忘心安 2024-09-15 06:41:50

For future reference, the engine I ended up using is libtextcat, which is under a BSD license but seems not to have been maintained since 2003. Still, it does a good job and integrates easily into my toolchain.

吃不饱 2024-09-15 06:41:50

Try CLD2:

Installation

export CPPFLAGS="-std=c++98"  # https://github.com/CLD2Owners/cld2/issues/47
pip install cld2-cffi --user

Run

import cld2

res = cld2.detect("This is a sample text.")
print(res)
res = cld2.detect("Dies ist ein Beispieltext.")
print(res)
res = cld2.detect("Je ne peut pas parler cette language.")
print(res)
res = cld2.detect(" هذه هي بعض النصوص العربية")
print(res)
res = cld2.detect("这是一些阿拉伯文字")  # Chinese?
print(res)
res = cld2.detect("これは、いくつかのアラビア語のテキストです")
print(res)
print("Supports {} languages.".format(len(cld2.LANGUAGES)))

Gives

Detections(is_reliable=True, bytes_found=23, details=(Detection(language_name=u'ENGLISH', language_code=u'en', percent=95, score=1675.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=27, details=(Detection(language_name=u'GERMAN', language_code=u'de', percent=96, score=1496.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=38, details=(Detection(language_name=u'FRENCH', language_code=u'fr', percent=97, score=1134.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=48, details=(Detection(language_name=u'ARABIC', language_code=u'ar', percent=97, score=1263.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=False, bytes_found=29, details=(Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=63, details=(Detection(language_name=u'Japanese', language_code=u'ja', percent=98, score=3848.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Supports 282 languages.
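
For a batch job over millions of OCR pages, the result usually has to be turned into a single label or a "don't know". The sketch below does that using only the fields visible in the printed output (is_reliable, details, language_code, percent). The namedtuples are stand-ins so the example runs without cld2 installed, and the helper name, the 90 percent threshold, and the assumption that the first entry in details is the best guess are illustrative choices rather than anything the library prescribes.

```python
from collections import namedtuple

# Stand-ins mirroring the fields shown in the printed output; with the real
# library, pass the object returned by cld2.detect(...) instead.
Detection = namedtuple("Detection", "language_name language_code percent score")
Detections = namedtuple("Detections", "is_reliable bytes_found details")

UNKNOWN = Detection("Unknown", "un", 0, 0.0)


def pick_language(result, min_percent=90):
    """Return the top language code, or None if the guess looks untrustworthy."""
    if not result.is_reliable or not result.details:
        return None
    top = result.details[0]  # assumed to be the strongest candidate
    if top.language_code == "un" or top.percent < min_percent:
        return None
    return top.language_code


# Two results copied from the output above.
english = Detections(True, 23,
                     (Detection("ENGLISH", "en", 95, 1675.0), UNKNOWN, UNKNOWN))
undetected = Detections(False, 29, (UNKNOWN, UNKNOWN, UNKNOWN))

print(pick_language(english))
print(pick_language(undetected))
```

Pages that come back as None can be routed to a second detector or a manual queue instead of being mislabelled.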

Others

[旋木] 2024-09-15 06:41:50

I don't think you need anything very sophisticated. For example, to detect whether a document is in English with a fairly high level of certainty, simply test whether it contains the N most common English words, something like:

"the a an is to are in on it"

If it contains all of those, I would say it is almost definitely English.
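
A small sketch of this check follows. It uses the word list quoted above, but reports the fraction of common words found instead of demanding all of them, since a short or badly OCR'ed page may miss a few. The function names and the 0.5 threshold are arbitrary illustrations.

```python
import re

# The word list quoted above (duplicates removed).
COMMON_ENGLISH = {"the", "a", "an", "is", "to", "are", "in", "on", "it"}


def tokenize(text):
    """Set of lowercase alphabetic words found in the text."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def common_word_coverage(text, common_words=COMMON_ENGLISH):
    """Fraction of the common words that occur at least once in the text."""
    return len(tokenize(text) & common_words) / len(common_words)


def looks_english(text, threshold=0.5):
    """True if at least `threshold` of the common words are present."""
    return common_word_coverage(text) >= threshold


english_page = ("The cat is on the mat, and it is going to a shop "
                "in an hour; we are late.")
german_page = "Der Hund ist in dem Haus an der Ecke."

print(common_word_coverage(english_page), looks_english(english_page))
print(common_word_coverage(german_page), looks_english(german_page))
```

The German sentence still matches "in" and "an", which shows why a single shared word proves little and why the same kind of list would be needed for every other target language.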

鸠书 2024-09-15 06:41:50

Alternatively, you could try Ruby's WhatLanguage gem. It's nice and simple, and I've used it for Twitter data analysis. Check out http://www.youtube.com/watch?v=lNqZ2cqOReo&list=UUJ_3fstMOH-g4yBxtvgAWkw&index=0&feature=plcp for a quick demo.

恋竹姑娘 2024-09-15 06:41:50

Check out Franc on GitHub. It's written in JavaScript, so you could use it in a browser and maybe in Node too.

  • franc supports more languages than any other library, or Google;
  • franc is easily forked to support 335 languages;
  • franc is just as fast as the competition.