How to programmatically determine what language a website's content is written in

Posted 2024-12-15 11:20:40

I would like to programmatically determine the language in which a website's content is written.

The only thing that comes to mind is to compare the content of the website with a set of words that are common in each candidate language, and pick the language with the highest match percentage.
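For reference, the word-list idea could be sketched roughly as follows. The word lists and the `guess_language` helper are made up for illustration and are far too small for real use. A practical version would likely need a few hundred frequent words per language.

```python
import re

# Tiny illustrative word lists (hypothetical, not taken from any library).
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu", "mit", "von"},
    "fr": {"le", "la", "et", "les", "des", "est", "un", "une", "pas", "que"},
}


def guess_language(text):
    """Return (language_code, match_ratio) for the best-matching word list.

    Returns (None, 0.0) when the text has no words or nothing matches.
    """
    # Split into lowercase runs of letters (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return None, 0.0
    best_lang, best_ratio = None, 0.0
    for lang, vocab in COMMON_WORDS.items():
        # Share of the words in the text that appear in this language's list.
        ratio = sum(1 for w in words if w in vocab) / len(words)
        if ratio > best_ratio:
            best_lang, best_ratio = lang, ratio
    return best_lang, best_ratio


print(guess_language("The cat is in the house and it is happy"))
```

The match ratio doubles as a rough confidence value. A low ratio for every language suggests the text is too short or in a language that is not covered.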

Are there any better, more robust ways to solve this problem?

Comments (3)

杯别 2024-12-22 11:20:40

If you can use an API (instead of having to write your own), have a look at this particular answer to this question: https://stackoverflow.com/questions/6151668/alternative-to-google-translate-api/8121813#8121813

Quote:

If you just need language detection, you can use a free web service:

http://detectlanguage.com

It is compatible with Google Translate API request/response formats.

缘字诀 2024-12-22 11:20:40

A neural network tutorial with a language-classification example based on the average frequencies of letters:
http://fann.sourceforge.net/fann_en.pdf
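To give a feel for the letter-frequency features, here is a minimal sketch. It is not the tutorial's neural network: it swaps in a much simpler nearest-profile comparison using cosine similarity. The sample sentences and the `classify` helper are invented for illustration, and real profiles would need to be built from much larger texts.

```python
import math
import string
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter a-z (26 values); other characters are ignored."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 26
    return [counts[c] / total for c in string.ascii_lowercase]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical training samples; far too short for reliable profiles.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog while the children "
          "are playing with their friends in the garden behind the house",
    "de": "der schnelle braune fuchs springt ueber den faulen hund waehrend "
          "die kinder mit ihren freunden im garten hinter dem haus spielen",
    "fr": "le renard brun rapide saute par dessus le chien paresseux pendant "
          "que les enfants jouent avec leurs amis dans le jardin de la maison",
}

# One frequency profile per language, computed from the samples above.
PROFILES = {lang: letter_frequencies(text) for lang, text in SAMPLES.items()}


def classify(text):
    """Return the language whose letter profile is most similar to the text's."""
    features = letter_frequencies(text)
    return max(PROFILES, key=lambda lang: cosine_similarity(features, PROFILES[lang]))


print(classify("where are the children playing this afternoon"))
```

In the tutorial's approach, a frequency vector like the one returned by `letter_frequencies` would be fed to a trained network instead of being compared against fixed profiles.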

﹎☆浅夏丿初晴 2024-12-22 11:20:40

I don't know if you have a preference for specific languages, but Python also has a package for language detection, called langdetect.

Compared to the other proposed methods, it has the advantage of being:

  • Offline, so no extra API calls required.
  • Ready out of the box. No custom training is needed.

It is based on Google's automatic language detection and supports 55 languages by default.

Installation

You can install it with

pip install langdetect

Then, for example, running

from langdetect import detect

detect("War doesn't show who's right, just who's left.")
detect("Ein, zwei, drei, vier")

will return 'en' and 'de' respectively.

This assumes that you already have the content of the site available as clean text. If you need to download the content, you can, for example, use the requests package.
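A rough sketch of that download step might look like this. The `TextExtractor`, `html_to_text` and `detect_page_language` names are made up for illustration. The tag stripping uses the standard library's `html.parser`, which is only one of several ways to get clean text, and the URL in the usage comment is a placeholder.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def detect_page_language(url):
    """Download a page and run langdetect on its visible text."""
    # Third-party imports are kept local so html_to_text works without them.
    import requests
    from langdetect import detect

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return detect(html_to_text(response.text))


# Example usage (placeholder URL, requires network access):
# print(detect_page_language("https://example.com"))
```

Stripping scripts and styles before detection matters because JavaScript and CSS keywords would otherwise skew the result towards English.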
