Automatically determine the natural language of a website page from its URL

Published 2024-07-27 22:09:33


I'm looking for a way to automatically determine the natural language used by a website page, given its URL.

In Python, a function like:

def LanguageUsed(url):
    # stuff

which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, and so on).

Summary of Results:
I have a reasonable solution working in Python, using the oice.langdet package from PyPI.
It does a decent job of discriminating English vs. non-English, which is all I require at the moment. Note that you have to fetch the HTML yourself, e.g. with Python's urllib. Also, oice.langdet is GPL-licensed.

For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.

The Google Natural Language Detection API works very well (if not the best I've seen). However, it is JavaScript, and its TOS forbids automating its use.


7 Answers

梦幻的味道 2024-08-03 22:09:33


There is nothing about the URL itself that will indicate language.

One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you can get the NLP part of it working, it'll be pretty slow. Also, it may not be reliable. Remember, most user agents pass something like

Accept-Language: en-US

with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.

You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.
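Before reaching for NLP or GeoIP, two cheap signals are worth checking once the page is fetched: the Content-Language response header and the lang attribute on the <html> tag. Neither is guaranteed to be present or truthful, but both are nearly free to read. A minimal sketch (the function name and regex are illustrative, not from any library; real HTTP header lookup should also be case-insensitive):

```python
import re

def lang_hint(headers, html):
    """Return a language hint from the response headers or the
    <html lang> attribute, or None if neither is present."""
    # Prefer an explicit Content-Language response header.
    cl = headers.get("Content-Language")
    if cl:
        # The header may list several languages; take the first.
        return cl.split(",")[0].strip().lower()
    # Fall back to the lang attribute on the <html> tag.
    m = re.search(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', html, re.IGNORECASE)
    if m:
        return m.group(1).lower()
    return None
```

When both signals are missing, you would fall back to content-based detection as described above.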

暮凉 2024-08-03 22:09:33


You might want to try ngram based detection.

TextCat DEMO (LGPL) seems to work pretty well (it recognizes almost 70 languages). There is a Python port by Thomas Mangin here, using the same corpus.

Edit: the TextCat competitors page provides some interesting links too.

Edit2: I wonder whether a Python wrapper for http://www.mnogosearch.org/guesser/ would be difficult to write...
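The idea behind TextCat is rank-order statistics on character n-grams: build a frequency-ranked trigram profile per language, then score a document by how far its trigram ranks are "out of place" relative to each profile. A toy sketch of that comparison (function names and the 300-trigram cutoff are illustrative; a real system trains profiles on large corpora):

```python
from collections import Counter

def trigram_profile(text, size=300):
    """Rank the most frequent character trigrams, TextCat-style."""
    # Fold case and treat non-letters as word boundaries.
    text = " " + "".join(c.lower() if c.isalpha() else " " for c in text) + " "
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(size))}

def out_of_place(profile, doc_profile, penalty=300):
    """Sum of rank differences; smaller means more similar."""
    return sum(abs(rank - profile.get(g, penalty))
               for g, rank in doc_profile.items())

def guess(text, profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))
```

With profiles trained on even a sentence or two per language, this separates clearly different languages; real accuracy depends on profile size and training data.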

演出会有结束 2024-08-03 22:09:33


nltk might help (if you have to get down to dealing with the page's text, i.e. if the headers and the URL itself don't determine the language sufficiently well for your purposes). I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it may in fact have one), but you can try parsing the given text according to various possible natural languages and checking which ones give the most sensible parse, word set, etc., according to the rules for each language.

楠木可依 2024-08-03 22:09:33


This is usually accomplished by using character n-gram models. You can find a state-of-the-art language identifier for Java here. If you need some help converting it to Python, just ask. Hope it helps.

就是爱搞怪 2024-08-03 22:09:33


Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page language, along with a probability index.

See http://code.google.com/apis/ajaxlanguage/documentation/

避讳 2024-08-03 22:09:33


In Python, the langdetect package (found here) can do this.
It is based on Google's automatic language detection and supports 55 languages by default.

It is installed by using

pip install langdetect

And then for example running

from langdetect import detect

detect("War doesn't show who's right, just who's left.")
detect("Ein, zwei, drei, vier")

Will return 'en' and 'de' respectively.

友谊不毕业 2024-08-03 22:09:33


There's no general method that will work solely on URLs. You can check the top-level domain to get some idea, and look for portions of the URL that might be indicative of a language (like "en" or "es" between two slashes), and assume anything unknown is in English, but it isn't a perfect solution.

So far as I know, the only general way to determine the natural language used by a page is to grab the page's text and check for certain common words in each language. For example, if "a", "an", and "the" appear several times in the page, it's likely that it includes English text; "el" and "la" might suggest Spanish; and so on.
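The common-word heuristic above can be sketched in a few lines. The word lists below are tiny and purely illustrative (real systems use much larger stopword sets, and short or mixed-language texts will fool this):

```python
# Tiny function-word lists; a real detector would use far larger sets.
STOPWORDS = {
    "en": {"the", "a", "an", "of", "and", "is", "to", "in"},
    "es": {"el", "la", "los", "las", "de", "y", "es", "en"},
}

def guess_by_stopwords(text):
    """Return the language whose common words appear most often;
    fall back to 'en' when nothing matches, as the answer suggests."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words)
              for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "en"
```

It is crude, but for distinguishing a handful of known languages on pages with a reasonable amount of text, it goes a surprisingly long way.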
