Scraping English Words with Python
I would like to scrape all English words from, say, the New York Times front page. I wrote something like this in Python:
import re
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # spoof a browser user-agent string so the site serves the normal page
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

opener = MyOpener()
url = "http://www.nytimes.com"
h = opener.open(url)
content = h.read()
tokens = re.findall(r"\s*(\w*)\s*", content, re.UNICODE)
print tokens
This works okay, but I get HTML keywords such as "img" and "src" as well as English words. Is there a simple way to get only English words from web scraping / HTML?
I saw this post; it only seems to talk about the mechanics of scraping, and none of the tools mentioned talk about how to filter out non-language elements. I am not interested in links, formatting, etc. Just plain words. Any help would be appreciated.
Comments (5)
Are you sure you want "English" words -- in the sense that they appear in some dictionary? For example, if you scraped an NYT article, would you want to include "Obama" (or "Palin" for you Blue-Staters out there), even though they probably don't appear in any dictionaries yet?
Better, in many cases, to parse the HTML (using BeautifulSoup as Bryan suggests) and include only the text-nodes (and maybe some aimed-at-humans attributes like "title" and "alt").
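For example, a rough sketch with BeautifulSoup 3 (the Python 2-era release; content is the HTML string from the question's code):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(content)
# text nodes only -- tag names and attribute markup are skipped
pieces = soup.findAll(text=True)
# plus aimed-at-humans attribute values such as title and alt
pieces += [tag['title'] for tag in soup.findAll(title=True)]
pieces += [tag['alt'] for tag in soup.findAll(alt=True)]
text = ' '.join(pieces)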
You would need some sort of English dictionary reference. A simple way of doing this would be to use a spellchecker. PyEnchant comes to mind.
From the PyEnchant website, the basic usage goes something like this:
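>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False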
In your case, perhaps something along the lines of:
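import re
import enchant

d = enchant.Dict("en_US")
# 'content' is the HTML string fetched in the question's code
tokens = re.findall(r"\w+", content, re.UNICODE)
# keep only purely alphabetic tokens that the spellchecker accepts
english_words = [t for t in tokens if t.isalpha() and d.check(t)]
print english_words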
If that's not enough, and you don't want "English words" that may appear in an HTML tag (such as in an attribute), you could probably use BeautifulSoup to parse out only the important text.
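For example (again a sketch, assuming BeautifulSoup 3 and the same content variable):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(content)
# script and style blocks contain code, not prose -- drop them first
for tag in soup.findAll(['script', 'style']):
    tag.extract()
text = ' '.join(soup.findAll(text=True))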
Html2Text can be a good option.
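For instance, a minimal sketch with the html2text package (assuming content holds the fetched HTML):

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True   # drop hyperlink targets
converter.ignore_images = True  # drop image references
text = converter.handle(content)  # Markdown-style plain text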
I love using the lxml library for this; something along these lines:
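import lxml.html

# parse the fetched page ('content' from the question) and keep
# only the human-readable text, with all markup stripped
doc = lxml.html.fromstring(content)
text = doc.text_content()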
Then, to ensure the scraped words are only English words, you could look them up in a dictionary loaded from a text file, or in NLTK, which comes with many cool corpora and language-processing tools.
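A sketch of the NLTK route (assumes the "words" corpus has been downloaded, and that tokens is the word list scraped earlier):

import nltk
nltk.download('words')  # one-time download of the word-list corpus
from nltk.corpus import words

english_vocab = set(w.lower() for w in words.words())
english_words = [t for t in tokens if t.lower() in english_vocab]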
You can replace all <.*> with nothing or a space. Use the re module, and make sure you understand greedy and non-greedy pattern matching; you need non-greedy for this.
Then, once you have stripped all the tags, apply the strategy you were using.
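A sketch of that approach (the (?s) flag lets tags span newlines; content is the fetched HTML):

import re

# non-greedy <.*?> matches each tag individually; a greedy <.*>
# would swallow everything from the first '<' to the last '>'
stripped = re.sub(r"(?s)<.*?>", " ", content)
tokens = re.findall(r"\w+", stripped, re.UNICODE)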