How to programmatically determine what language a website's content is written in
I would like to programmatically determine the language that a website's content is written in.
The only approach that comes to mind is to compare the site's content against a set of words that are common in each language, and to pick the language based on the match percentage.
Are there any better, more robust ways to solve this problem?
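For reference, the word-matching idea described in the question could be sketched roughly as follows. The word lists here are tiny, hand-picked assumptions for illustration only, and the names COMMON_WORDS, match_percentages and guess_language are hypothetical; a usable version would need much larger lists per language.

```python
import re

# Assumption: a handful of very frequent words per language, chosen by hand.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu", "mit", "den"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "des", "que", "dans"},
}


def match_percentages(text, common_words=COMMON_WORDS):
    """Return, per language, the percentage of words in `text` found in its list."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return {lang: 0.0 for lang in common_words}
    return {
        lang: 100.0 * sum(word in vocab for word in words) / len(words)
        for lang, vocab in common_words.items()
    }


def guess_language(text, common_words=COMMON_WORDS):
    """Return the language with the highest match percentage, or None if nothing matched."""
    scores = match_percentages(text, common_words)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A call such as guess_language("The cat is in the garden and it is happy") picks the language whose list covers the largest share of the words. Short texts and closely related languages are where this approach is likely to struggle.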
Comments (3)
If you can use an API (instead of having to write your own), have a look at this particular answer to this question: https://stackoverflow.com/questions/6151668/alternative-to-google-translate-api/8121813#8121813
Quote:
Here is a neural network tutorial with a language classification example, based on the average frequencies of the letters:
http://fann.sourceforge.net/fann_en.pdf
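To give a feel for the letter-frequency idea without a neural network, here is a simplified sketch that is not taken from the tutorial: it builds a letter-frequency profile per language and picks the profile closest to the input by cosine similarity. The sample sentences and the names SAMPLES, letter_frequencies, cosine_similarity and classify are assumptions made up for illustration; real profiles would be built from far more text.

```python
import math
from collections import Counter

# Assumption: one short sample per language, only to make the sketch runnable.
SAMPLES = {
    "en": "this is a short piece of english text that is used to build a letter profile",
    "de": "dies ist ein kurzer deutscher text der zum erstellen eines buchstabenprofils dient",
    "ru": "это короткий русский текст который нужен для построения профиля букв",
}


def letter_frequencies(text):
    """Return the relative frequency of each letter in `text` (case-insensitive)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Return the language whose letter profile is most similar to `text`."""
    freqs = letter_frequencies(text)
    return max(profiles, key=lambda lang: cosine_similarity(freqs, profiles[lang]))


# Build one profile per language from the sample text.
PROFILES = {lang: letter_frequencies(sample) for lang, sample in SAMPLES.items()}
```

Calling classify(some_text, PROFILES) returns one of the profile keys. With such small samples the result for unseen text in languages that share an alphabet should not be trusted; the tutorial's approach of training a network on the frequencies is one way to get a better decision boundary.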
I don't know if you have a preference for specific languages, but Python also has a package for language detection, called langdetect.
Compared to the other proposed methods, it has the advantage of being:
It is based on Google's automatic language detection and supports 55 languages by default.
Installation
You can install it with pip:
pip install langdetect
Then, for example, calling its detect function on an English string and on a German string will return 'en' and 'de' respectively.
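The code from the original answer is not preserved here, so the following is only a minimal sketch of how langdetect is typically called. The wrapper name detect_language and the sample sentences are my own assumptions; detect and DetectorFactory.seed come from the langdetect package.

```python
def detect_language(text, seed=0):
    """Return langdetect's best guess for `text` as a language code such as 'en'."""
    # Imported here so langdetect is only required when the function is called.
    from langdetect import DetectorFactory, detect

    # langdetect is non-deterministic by default; a fixed seed makes results repeatable.
    DetectorFactory.seed = seed
    return detect(text)


if __name__ == "__main__":
    samples = (
        "This is a reasonably long English sentence about detecting languages.",
        "Dies ist ein ziemlich langer deutscher Satz über die Erkennung von Sprachen.",
    )
    for sample in samples:
        print(detect_language(sample))
```

Longer inputs tend to give more reliable results than one or two words, so it is probably worth passing in a full paragraph of page text where possible.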
This assumes that you already have the content of the site available as clean text. If you need to download the content, you can for example use the requests package.
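As a rough sketch of getting from a URL to clean text, the page can be downloaded with requests and the markup stripped with the standard library's HTMLParser. The helper names html_to_text and fetch_text are hypothetical, and real pages may need something more thorough (navigation menus and footers will still end up in the output).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script and style elements."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_text(url, timeout=10):
    """Download `url` with requests (third-party) and return its visible text."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return html_to_text(response.text)
```

The string returned by fetch_text can then be handed to whichever detection method is used.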