Getting plain-text content from non-English websites

Posted on 2024-12-12 12:13:41

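I am trying to get the plain-text content of non-English websites. For example, I want to get the Hindi content of http://www.bbc.co.uk/hindi/

For text dumps of English websites, I use wget to fetch the content, then run an HTML parser to strip the HTML tags and leave me with clean text.

What is the equivalent tool that works on non-English websites?

This is just a pet project I'm exploring, so speed isn't too important. I'll be coding in a Linux environment, preferably in Python, Java, or C/C++ (in that order).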

つ可否回来 2024-12-19 12:13:42

It sounds like the method you're using to parse HTML falls down when it encounters Unicode. There's a module called BeautifulSoup that's great for parsing all manner of websites, and it handles Unicode just fine. Try it interactively:

>>> import urllib, BeautifulSoup
>>> html = urllib.urlopen( 'http://www.bbc.co.uk/hindi/' ).read()
>>> soup = BeautifulSoup.BeautifulSoup( html )
>>> print soup.find( 'title' ).contents
[u'BBC Hindi - \u092a\u0939\u0932\u093e \u092a\u0928\u094d\u0928\u093e']

My terminal can't print these characters, but whatever method you normally use to display Hindi text should work here as well.
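
The session above uses the Python 2 `urllib` and the old `BeautifulSoup` module. If neither is available, roughly the same idea can be sketched with only the Python 3 standard library: decode the downloaded bytes to a `str` first, then strip the tags. This is a minimal sketch rather than a replacement for BeautifulSoup. The names `TextExtractor` and `html_to_text` are made up for illustration, the page is assumed to be UTF-8 unless you pass another encoding, and the sample markup is an inline stand-in, not real BBC HTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(raw, encoding="utf-8"):
    """Decode raw page bytes (assumed UTF-8 by default) and return the text."""
    parser = TextExtractor()
    # Undecodable bytes become U+FFFD instead of raising an exception.
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return parser.text()


# Inline sample standing in for a downloaded page (hypothetical markup).
sample = (
    "<html><head>"
    "<title>BBC Hindi - \u092a\u0939\u0932\u093e \u092a\u0928\u094d\u0928\u093e</title>"
    "<style>p { color: red; }</style></head>"
    "<body><p>\u0928\u092e\u0938\u094d\u0924\u0947 &amp; welcome</p>"
    "<script>var x = 1;</script></body></html>"
).encode("utf-8")

text = html_to_text(sample)
```

Here `text` is an ordinary Python 3 `str`, so printing it in a UTF-8 terminal or writing it to a file opened with `encoding="utf-8"` should show the Hindi characters directly. For a real page you would pass in the bytes you fetched (for example the file saved by wget) and, if the site is not UTF-8, the encoding it declares.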
