When going Python, you might be interested in mechanize and BeautifulSoup.
Mechanize sort of simulates a browser (including options for proxying, faking browser identification, page redirection, etc.) and allows easy fetching of forms, links, and so on. The documentation is a bit rough/sparse, though.
Some example code (from the mechanize website) to give you an idea:
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
html_response = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
print br.title()
print html_response
BeautifulSoup makes it pretty easy to parse HTML content (which you could have fetched with mechanize), and it supports regexes.
Some example code:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_response)
rows = soup.findAll('tr')
for r in rows[2:]: # ignore first two rows
    cols = r.findAll('td')
    print cols[0].renderContents().strip() # print content of first column
So, these 10 lines above are pretty much copy-paste ready to print the content of the first column of every table row on a website.
There really is no good solution here. You are right to suspect that Python is probably the best way to start, because of its incredibly strong support for regular expressions.
To implement something like this, strong knowledge of SEO (Search Engine Optimization) would help, since effectively optimizing a webpage for search engines tells you how search engines behave. I would start with a site like SEOMoz.
As far as identifying the "about us" page goes, you only have two options:
a) For each page, get the link to the about us page and feed it to your crawler.
b) Parse all the links on the page for certain keywords like "about us", "about", "learn more", or whatever.
When using option b, be careful, as you could get stuck in an infinite loop: a website will link to the same page many times, especially if the link is in the header or footer, and a page may even link back to itself. To avoid this, you'll need to keep a list of visited links and make sure not to revisit them.
Finally, I would recommend having your crawler respect the instructions in the robots.txt file, and it is probably a good idea not to follow links marked rel="nofollow", as these are mostly used on external links. Again, you can learn this and more by reading up on SEO.
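To make option b concrete, here is a minimal sketch in the same Python 2 / BeautifulSoup style used elsewhere in this thread. It keeps a set of visited links, checks robots.txt, and skips rel="nofollow" links; the start URL and keyword list are placeholders, not part of the original answer.

import urllib2
import urlparse
import robotparser
from BeautifulSoup import BeautifulSoup

START_URL = "http://www.example.com/"            # placeholder starting point
KEYWORDS = ("about us", "about", "learn more")   # link texts to look for

# Respect robots.txt before fetching anything.
# For simplicity, only the start host's robots.txt is checked here.
rp = robotparser.RobotFileParser()
rp.set_url(urlparse.urljoin(START_URL, "/robots.txt"))
rp.read()

visited = set()   # guards against the infinite-loop problem described above
queue = [START_URL]
while queue:
    url = queue.pop(0)
    if url in visited or not rp.can_fetch("*", url):
        continue
    visited.add(url)
    try:
        html = urllib2.urlopen(url, timeout=10).read()
    except Exception:
        continue   # skip pages that fail to load
    soup = BeautifulSoup(html)
    for a in soup.findAll('a', href=True):
        # skip rel="nofollow" links, which are mostly used on external links
        if 'nofollow' in (a.get('rel') or '').lower():
            continue
        text = a.renderContents().strip().lower()
        link = urlparse.urljoin(url, a['href'])
        if any(k in text for k in KEYWORDS) and link not in visited:
            print "Candidate about page:", link
            queue.append(link)

A real crawler would also need per-host rate limiting and better error handling, but the visited set and the robots.txt check are the parts that keep it from looping forever.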
Python plus cURL (via the PycURL bindings) is the best implementation for a crawler.
The following code can crawl 10,000 pages in 300 seconds on a decent server.
#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
# vi:ts=4:et
# $Id: retriever-multi.py,v 1.29 2005/07/28 11:04:13 mfx Exp $
#
# Usage: python retriever-multi.py <file with URLs to fetch> [<# of
# concurrent connections>]
#
import sys
import pycurl
# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass
# Get args
num_conn = 10
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit
# Make a queue with (url, filename) tuples
queue = []
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue) + 1)
    queue.append((url, filename))
# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
print "----- Getting", num_urls, "URLs using", num_conn, "connections -----"
# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)
# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is an url to process and a free curl object, add to multi stack
    while queue and freelist:
        url, filename = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Failed: ", c.filename, c.url, errno, errmsg
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.).
    # We just call select() to sleep until some more data is available.
    m.select(1.0)
# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()
If you are going to build a crawler you need to (Java specific):
Learn how to use the java.net.URL and java.net.URLConnection classes, or use the HttpClient library
Understand HTTP request/response headers
Understand redirects (HTTP, HTML and JavaScript)
Understand content encodings (charsets)
Use a good library for parsing badly formed HTML (e.g. CyberNeko, Jericho, JSoup)
Make concurrent HTTP requests to different hosts, but ensure you issue no more than one to the same host every ~5 seconds (a minimal rate-limiting sketch follows after this list)
Persist pages you have fetched, so you don't need to refetch them every day if they don't change that often (HBase can be useful)
Have a way of extracting links from the current page to crawl next
Obey robots.txt
A bunch of other stuff too.
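To illustrate the per-host politeness item above, here is a minimal rate-limiting sketch, written in Python to stay consistent with the rest of the thread rather than Java; the 5-second figure is the one from the list, and the function name is just illustrative.

import time
from urlparse import urlparse   # Python 2; urllib.parse.urlparse in Python 3

POLITENESS_DELAY = 5.0   # seconds between two requests to the same host
last_hit = {}            # host -> timestamp of the most recent request

def wait_for_host(url):
    # Block until it is polite to hit this URL's host again.
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)
    last_hit[host] = time.time()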
It's not that difficult, but there are lots of fiddly edge cases (e.g. redirects, detecting encoding; check out Tika).
For more basic requirements you could use wget.
Heritrix is another option, but yet another framework to learn.
Identifying About us pages can be done using various heuristics:
inbound link text
page title
content on page
URL
If you wanted to be more quantitative about it, you could use machine learning and a classifier (maybe Bayesian); a rough keyword-scoring sketch follows below.
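As a rough illustration of combining the link-text, title and URL heuristics before reaching for a real classifier, something like the following could work (Python; the keyword list and weights are made up purely for illustration, and the content-on-page signal would be handled the same way):

ABOUT_KEYWORDS = ("about us", "about", "who we are", "company")   # illustrative only

def about_page_score(link_text, page_title, url):
    # Crude weighted score over the heuristics listed above;
    # the weights are arbitrary and would need tuning (or a real classifier).
    score = 0.0
    link_text = (link_text or "").lower()
    page_title = (page_title or "").lower()
    url = (url or "").lower()
    if any(k in link_text for k in ABOUT_KEYWORDS):
        score += 2.0   # inbound link text is usually the strongest signal
    if any(k in page_title for k in ABOUT_KEYWORDS):
        score += 1.5   # page title
    if "about" in url:
        score += 1.0   # URL, e.g. /about or /about-us
    return score

# Example: about_page_score("About Us", "About Us - Example Corp", "http://example.com/about") == 4.5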
Saving the front page is obviously easier, but front-page redirects (sometimes to different domains, and often implemented in the HTML meta redirect tag or even JS) are very common, so you need to handle this.
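For the HTML meta-redirect case, one rough way to pull the target URL out of a fetched page is a regex like the one below (a heuristic sketch only; attribute order and quoting vary in the wild, and JS redirects need separate handling):

import re

# Matches e.g. <meta http-equiv="refresh" content="0; url=http://example.com/">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>]+)',
    re.IGNORECASE)

def meta_redirect_target(html):
    # Return the redirect target from a meta-refresh tag, or None if absent.
    m = META_REFRESH.search(html)
    if m:
        return m.group(1).strip()
    return None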
Heritrix has a bit of a steep learning curve, but can be configured in such a way that only the homepage, and a page that "looks like" (using a regex filter) an about page, will get crawled.
More open source Java (web) crawlers: http://java-source.net/open-source/crawlers
Try out Scrapy. It is a web scraping library for Python.
If a simple Python script is all you need, try urllib2 in Python.
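A minimal urllib2 fetch, for reference (Python 2 standard library; the URL is a placeholder):

import urllib2

response = urllib2.urlopen("http://www.example.com/")   # placeholder URL
html = response.read()
print html[:200]   # first 200 bytes of the page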