Automatically determine the natural language of a website page from its URL

Published 2024-07-27 22:09:33


I'm looking for a way to automatically determine the natural language used by a website page, given its URL.

In Python, a function like:

def LanguageUsed(url):
    # stuff

which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, and so on).

Summary of Results:
I have a reasonable solution working in Python, using the oice.langdet package from PyPI.
It does a decent job of discriminating English vs. non-English, which is all I require at the moment. Note that you have to fetch the HTML yourself, e.g. with Python's urllib. Also, oice.langdet is GPL-licensed.

For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.

The Google Natural Language Detection API works very well (if not the best I've seen). However, it is JavaScript, and its TOS forbids automating its use.


7 Answers

梦幻的味道 2024-08-03 22:09:33


There is nothing about the URL itself that will indicate language.

One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you can get the NLP part of it working, it'll be pretty slow. Also, it may not be reliable. Remember, most user agents pass something like

Accept-Language: en-US

with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.

You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.
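Before reaching for NLP or GeoIP, two cheap signals are worth checking once the page is fetched: the Content-Language response header and the lang attribute on the <html> tag. Neither is guaranteed to be present or truthful, but both are nearly free to read. A minimal sketch (the function name and regex are illustrative, not from any library; real HTTP header lookup should also be case-insensitive):

```python
import re

def lang_hint(headers, html):
    """Return a language hint from the response headers or the
    <html lang> attribute, or None if neither is present."""
    # Prefer an explicit Content-Language response header.
    cl = headers.get("Content-Language")
    if cl:
        # The header may list several languages; take the first.
        return cl.split(",")[0].strip().lower()
    # Fall back to the lang attribute on the <html> tag.
    m = re.search(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', html, re.IGNORECASE)
    if m:
        return m.group(1).lower()
    return None
```

When both signals are missing, you would fall back to content-based detection as described above.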

暮凉 2024-08-03 22:09:33


You might want to try ngram based detection.

TextCat DEMO (LGPL) seems to work pretty well (it recognizes almost 70 languages). There is a Python port by Thomas Mangin here, using the same corpus.

Edit: the TextCat competitors page provides some interesting links too.

Edit2: I wonder whether a Python wrapper for http://www.mnogosearch.org/guesser/ would be difficult to write...
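The idea behind TextCat is rank-order statistics on character n-grams: build a frequency-ranked trigram profile per language, then score a document by how far its trigram ranks are "out of place" relative to each profile. A toy sketch of that comparison (function names and the 300-trigram cutoff are illustrative; a real system trains profiles on large corpora):

```python
from collections import Counter

def trigram_profile(text, size=300):
    """Rank the most frequent character trigrams, TextCat-style."""
    # Fold case and treat non-letters as word boundaries.
    text = " " + "".join(c.lower() if c.isalpha() else " " for c in text) + " "
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(size))}

def out_of_place(profile, doc_profile, penalty=300):
    """Sum of rank differences; smaller means more similar."""
    return sum(abs(rank - profile.get(g, penalty))
               for g, rank in doc_profile.items())

def guess(text, profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))
```

With profiles trained on even a sentence or two per language, this separates clearly different languages; real accuracy depends on profile size and training data.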

演出会有结束 2024-08-03 22:09:33


nltk might help (if you have to get down to dealing with the page's text, i.e. if the headers and the URL itself don't determine the language sufficiently well for your purposes). I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it may in fact have one), but you can try parsing the given text according to various possible natural languages and checking which ones give the most sensible parse, word set, etc., according to the rules for each language.

楠木可依 2024-08-03 22:09:33


This is usually accomplished by using character n-gram models. You can find a state-of-the-art language identifier for Java here. If you need some help converting it to Python, just ask. Hope it helps.

就是爱搞怪 2024-08-03 22:09:33


Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page language, along with a probability index.

See http://code.google.com/apis/ajaxlanguage/documentation/

避讳 2024-08-03 22:09:33


In Python, the langdetect package (found here) can do this.
It is based on Google's automatic language detection and supports 55 languages by default.

It is installed by using

pip install langdetect

And then for example running

from langdetect import detect

detect("War doesn't show who's right, just who's left.")
detect("Ein, zwei, drei, vier")

Will return 'en' and 'de' respectively.

友谊不毕业 2024-08-03 22:09:33


There's no general method that will work solely on URLs. You can check the top-level domain to get some idea, and look for portions of the URL that might be indicative of a language (like "en" or "es" between two slashes), and assume anything unknown is in English, but it isn't a perfect solution.

So far as I know, the only general way to determine the natural language used by a page is to grab the page's text and check for certain common words in each language. For example, if "a", "an", and "the" appear several times in the page, it's likely that it includes English text; "el" and "la" might suggest Spanish; and so on.
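The common-word heuristic above can be sketched in a few lines. The word lists below are tiny and purely illustrative (real systems use much larger stopword sets, and short or mixed-language texts will fool this):

```python
# Tiny function-word lists; a real detector would use far larger sets.
STOPWORDS = {
    "en": {"the", "a", "an", "of", "and", "is", "to", "in"},
    "es": {"el", "la", "los", "las", "de", "y", "es", "en"},
}

def guess_by_stopwords(text):
    """Return the language whose common words appear most often;
    fall back to 'en' when nothing matches, as the answer suggests."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words)
              for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "en"
```

It is crude, but for distinguishing a handful of known languages on pages with a reasonable amount of text, it goes a surprisingly long way.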
