Scraping English Words with Python
I would like to scrape all English words from, say, the New York Times front page. I wrote something like this in Python:
import re
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # spoof a browser user-agent string so the site serves the normal page
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

opener = MyOpener()
url = "http://www.nytimes.com"
h = opener.open(url)
content = h.read()
tokens = re.findall(r"\s*(\w*)\s*", content, re.UNICODE)
print tokens
This works okay, but I get HTML keywords such as "img" and "src" as well as English words. Is there a simple way to get only English words from web scraping / HTML?
I saw this post; it only seems to talk about the mechanics of scraping, and none of the tools mentioned talk about how to filter out non-language elements. I am not interested in links, formatting, etc. Just plain words. Any help would be appreciated.
Comments (5)
Are you sure you want "English" words -- in the sense that they appear in some dictionary? For example, if you scraped an NYT article, would you want to include "Obama" (or "Palin" for you Blue-Staters out there), even though they probably don't appear in any dictionaries yet?
Better, in many cases, to parse the HTML (using BeautifulSoup as Bryan suggests) and include only the text-nodes (and maybe some aimed-at-humans attributes like "title" and "alt").
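For example, a rough sketch with BeautifulSoup 3 (the Python 2-era release; content is the HTML string from the question's code):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(content)
# text nodes only -- tag names and attribute markup are skipped
pieces = soup.findAll(text=True)
# plus aimed-at-humans attribute values such as title and alt
pieces += [tag['title'] for tag in soup.findAll(title=True)]
pieces += [tag['alt'] for tag in soup.findAll(alt=True)]
text = ' '.join(pieces)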
You would need some sort of English dictionary reference. A simple way of doing this would be to use a spellchecker. PyEnchant comes to mind.
From the PyEnchant website, the basic usage goes something like this:
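>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False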
In your case, perhaps something along the lines of:
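import re
import enchant

d = enchant.Dict("en_US")
# 'content' is the HTML string fetched in the question's code
tokens = re.findall(r"\w+", content, re.UNICODE)
# keep only purely alphabetic tokens that the spellchecker accepts
english_words = [t for t in tokens if t.isalpha() and d.check(t)]
print english_words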
If that's not enough, and you don't want "English words" that may appear in an HTML tag (such as in an attribute), you could probably use BeautifulSoup to parse out only the important text.
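For example (again a sketch, assuming BeautifulSoup 3 and the same content variable):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(content)
# script and style blocks contain code, not prose -- drop them first
for tag in soup.findAll(['script', 'style']):
    tag.extract()
text = ' '.join(soup.findAll(text=True))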
Html2Text can be a good option.
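For instance, a minimal sketch with the html2text package (assuming content holds the fetched HTML):

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True   # drop hyperlink targets
converter.ignore_images = True  # drop image references
text = converter.handle(content)  # Markdown-style plain text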
I love using the lxml library for this; something along these lines:
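import lxml.html

# parse the fetched page ('content' from the question) and keep
# only the human-readable text, with all markup stripped
doc = lxml.html.fromstring(content)
text = doc.text_content()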
Then, to ensure the scraped words are only English words, you could look them up in a dictionary loaded from a text file, or in NLTK, which comes with many cool corpora and language-processing tools.
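A sketch of the NLTK route (assumes the "words" corpus has been downloaded, and that tokens is the word list scraped earlier):

import nltk
nltk.download('words')  # one-time download of the word-list corpus
from nltk.corpus import words

english_vocab = set(w.lower() for w in words.words())
english_words = [t for t in tokens if t.lower() in english_vocab]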
You can replace all <.*> with nothing or a space. Use the re module, and make sure you understand greedy and non-greedy pattern matching; you need non-greedy for this.
Then, once you have stripped all the tags, apply the strategy you were using.
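A sketch of that approach (the (?s) flag lets tags span newlines; content is the fetched HTML):

import re

# non-greedy <.*?> matches each tag individually; a greedy <.*>
# would swallow everything from the first '<' to the last '>'
stripped = re.sub(r"(?s)<.*?>", " ", content)
tokens = re.findall(r"\w+", stripped, re.UNICODE)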