Scraping English words with Python

Published 2024-11-16 12:03:04

I would like to scrape all English words from, say, the New York Times front page. I wrote something like this in Python:

import re
import urllib.request

# urllib.FancyURLopener is gone in modern Python; a Request with a
# browser-like User-Agent header achieves the same effect.
url = "http://www.nytimes.com"
request = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; "
                           "rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"},
)
content = urllib.request.urlopen(request).read().decode("utf-8", errors="replace")

# find every run of word characters (a raw string avoids escape warnings)
tokens = re.findall(r"\w+", content)
print(tokens)

This works okay, but I get HTML keywords such as "img" and "src" as well as English words. Is there a simple way to get only the English words from web scraping / HTML?

I saw this post, but it only seems to talk about the mechanics of scraping; none of the tools mentioned address how to filter out non-language elements. I am not interested in links, formatting, etc. Just plain words. Any help would be appreciated.

Comments (5)

假扮的天使 2024-11-23 12:03:04

Are you sure you want "English" words -- in the sense that they appear in some dictionary? For example, if you scraped an NYT article, would you want to include "Obama" (or "Palin" for you Blue-Staters out there), even though they probably don't appear in any dictionaries yet?

Better, in many cases, to parse the HTML (using BeautifulSoup as Bryan suggests) and include only the text-nodes (and maybe some aimed-at-humans attributes like "title" and "alt").
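
A minimal sketch of that approach, assuming BeautifulSoup 4 (bs4) is installed; the function name visible_words is made up for illustration, and content is the HTML fetched in the question:

import re
from bs4 import BeautifulSoup

def visible_words(html):
    soup = BeautifulSoup(html, "html.parser")
    # script/style text is code, not prose, so drop those subtrees
    for tag in soup(["script", "style"]):
        tag.decompose()
    pieces = [soup.get_text(" ")]
    # also collect human-facing attribute values such as title and alt
    for tag in soup.find_all(True):
        for attr in ("title", "alt"):
            if tag.has_attr(attr):
                pieces.append(tag[attr])
    return re.findall(r"[A-Za-z]+", " ".join(pieces))

tokens = visible_words(content)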

断桥再见 2024-11-23 12:03:04

You would need some sort of English dictionary reference. A simple way of doing this would be to use a spellchecker. PyEnchant comes to mind.

From the PyEnchant website:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>>

In your case, perhaps something along the lines of:

d = enchant.Dict("en_US")
# skip empty tokens, which Dict.check() rejects
english_words = [tok for tok in tokens if tok and d.check(tok)]

If that's not enough, and you don't want "English words" that happen to appear in an HTML tag (such as an attribute), you could use BeautifulSoup to parse out only the important text.
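
For instance, a rough sketch combining the two (assuming bs4 and pyenchant are installed, and content is the HTML fetched in the question):

import enchant
from bs4 import BeautifulSoup

d = enchant.Dict("en_US")
# strip the markup first, then spellcheck the remaining visible text
text = BeautifulSoup(content, "html.parser").get_text(" ")
english_words = [w for w in text.split() if w.isalpha() and d.check(w)]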

绝情姑娘 2024-11-23 12:03:04

Html2Text can be a good option.

import html2text

print html2text.html2text(your_html_string)
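
Putting it together with the question's fetch, a sketch (html2text produces Markdown-flavoured text, so URL fragments can still slip through):

import re
import urllib.request
import html2text

html = urllib.request.urlopen("http://www.nytimes.com").read().decode("utf-8", "replace")
text = html2text.html2text(html)        # HTML -> Markdown-ish plain text
words = re.findall(r"[A-Za-z]+", text)  # keep only runs of letters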

等数载,海棠开 2024-11-23 12:03:04

I love using the lxml library for this:

# copypasta from http://lxml.de/lxmlhtml.html#examples
import urllib.request
from lxml.html import fromstring

url = 'http://microformats.org/'
content = urllib.request.urlopen(url).read()
doc = fromstring(content)

def first_text_by_class(doc, class_name):
    # return the text of the first element carrying the given CSS class
    els = doc.find_class(class_name)
    if els:
        return els[0].text_content()

Then, to ensure the scraped words are only English words, you could look them up in a dictionary loaded from a text file, or in NLTK, which comes with many cool corpora and language-processing tools.
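
A sketch of the NLTK route, using its "words" corpus as the dictionary (downloaded on first use); tokens is whatever word list your scraper produced:

import nltk
nltk.download("words")  # one-time download of the word-list corpus
from nltk.corpus import words

english_vocab = set(w.lower() for w in words.words())
english_words = [tok for tok in tokens if tok.lower() in english_vocab]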

咋地 2024-11-23 12:03:04

You can replace every <.*> match with nothing or a space. Use the re module, and make sure you understand greedy versus non-greedy pattern matching; you need non-greedy for this.

Then once you have stripped all the tags, apply the strategy you were using.
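
A sketch of the difference, assuming content holds the raw HTML:

import re

# non-greedy "<.*?>" matches each tag separately; greedy "<.*>" would
# consume everything from the first "<" to the last ">" on a line
text = re.sub(r"<.*?>", " ", content)
words = re.findall(r"\w+", text)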
