When going Python, you might be interested in mechanize and BeautifulSoup.
Mechanize sort of simulates a browser (including options for proxying, faking browser identification, page redirection, etc.) and allows easy fetching of forms, links, and so on. The documentation is a bit rough/sparse, though.
Some example code (from the mechanize website) to give you an idea:
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
html_response = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
print br.title()
print html_response
BeautifulSoup makes it pretty easy to parse HTML content (which you could have fetched with mechanize), and it supports regexes.
Some example code:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_response)
rows = soup.findAll('tr')
for r in rows[2:]: # ignore first two rows
    cols = r.findAll('td')
    print cols[0].renderContents().strip() # print content of first column
So, these 10 lines above are pretty much copy-paste ready to print the content of the first column of every table row on a website.
There really is no good solution here. You are right to suspect that Python is probably the best way to start, because of its incredibly strong support for regular expressions.
To implement something like this, strong knowledge of SEO (Search Engine Optimization) would help, since effectively optimizing a webpage for search engines tells you how search engines behave. I would start with a site like SEOMoz.
As far as identifying the "about us" page goes, you only have two options:
a) For each page, get the link to the about us page and feed it to your crawler.
b) Parse all the links on the page for certain keywords like "about us", "about", "learn more", or whatever.
When using option b, be careful, as you could get stuck in an infinite loop: a website will link to the same page many times, especially if the link is in the header or footer, and a page may even link back to itself. To avoid this, you'll need to keep a list of visited links and make sure not to revisit them.
Finally, I would recommend having your crawler respect the instructions in the robots.txt file, and it is probably a good idea not to follow links marked rel="nofollow", as these are mostly used on external links. Again, you can learn this and more by reading up on SEO.
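To make option b concrete, here is a minimal sketch in the same Python 2 / BeautifulSoup style used elsewhere in this thread. It keeps a set of visited links, checks robots.txt, and skips rel="nofollow" links; the start URL and keyword list are placeholders, not part of the original answer.

import urllib2
import urlparse
import robotparser
from BeautifulSoup import BeautifulSoup

START_URL = "http://www.example.com/"            # placeholder starting point
KEYWORDS = ("about us", "about", "learn more")   # link texts to look for

# Respect robots.txt before fetching anything.
# For simplicity, only the start host's robots.txt is checked here.
rp = robotparser.RobotFileParser()
rp.set_url(urlparse.urljoin(START_URL, "/robots.txt"))
rp.read()

visited = set()   # guards against the infinite-loop problem described above
queue = [START_URL]
while queue:
    url = queue.pop(0)
    if url in visited or not rp.can_fetch("*", url):
        continue
    visited.add(url)
    try:
        html = urllib2.urlopen(url, timeout=10).read()
    except Exception:
        continue   # skip pages that fail to load
    soup = BeautifulSoup(html)
    for a in soup.findAll('a', href=True):
        # skip rel="nofollow" links, which are mostly used on external links
        if 'nofollow' in (a.get('rel') or '').lower():
            continue
        text = a.renderContents().strip().lower()
        link = urlparse.urljoin(url, a['href'])
        if any(k in text for k in KEYWORDS) and link not in visited:
            print "Candidate about page:", link
            queue.append(link)

A real crawler would also need per-host rate limiting and better error handling, but the visited set and the robots.txt check are the parts that keep it from looping forever.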
Python plus cURL (via the PycURL bindings) is the best implementation for a crawler.
The following code can crawl 10,000 pages in 300 seconds on a decent server.
#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
# vi:ts=4:et
# $Id: retriever-multi.py,v 1.29 2005/07/28 11:04:13 mfx Exp $
#
# Usage: python retriever-multi.py <file with URLs to fetch> [<# of
# concurrent connections>]
#
import sys
import pycurl
# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass
# Get args
num_conn = 10
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit
# Make a queue with (url, filename) tuples
queue = []
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue) + 1)
    queue.append((url, filename))
# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
print "----- Getting", num_urls, "URLs using", num_conn, "connections -----"
# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)
# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is an url to process and a free curl object, add to multi stack
    while queue and freelist:
        url, filename = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Failed: ", c.filename, c.url, errno, errmsg
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.).
    # We just call select() to sleep until some more data is available.
    m.select(1.0)
# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()
If you are going to build a crawler you need to (Java specific):
Learn how to use the java.net.URL and java.net.URLConnection classes, or use the HttpClient library
Understand HTTP request/response headers
Understand redirects (HTTP, HTML and JavaScript)
Understand content encodings (charsets)
Use a good library for parsing badly formed HTML (e.g. CyberNeko, Jericho, JSoup)
Make concurrent HTTP requests to different hosts, but ensure you issue no more than one to the same host every ~5 seconds (a minimal rate-limiting sketch follows after this list)
Persist pages you have fetched, so you don't need to refetch them every day if they don't change that often (HBase can be useful)
Have a way of extracting links from the current page to crawl next
Obey robots.txt
A bunch of other stuff too.
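To illustrate the per-host politeness item above, here is a minimal rate-limiting sketch, written in Python to stay consistent with the rest of the thread rather than Java; the 5-second figure is the one from the list, and the function name is just illustrative.

import time
from urlparse import urlparse   # Python 2; urllib.parse.urlparse in Python 3

POLITENESS_DELAY = 5.0   # seconds between two requests to the same host
last_hit = {}            # host -> timestamp of the most recent request

def wait_for_host(url):
    # Block until it is polite to hit this URL's host again.
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)
    last_hit[host] = time.time()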
It's not that difficult, but there are lots of fiddly edge cases (e.g. redirects, detecting encoding; check out Tika).
For more basic requirements you could use wget.
Heritrix is another option, but yet another framework to learn.
Identifying About us pages can be done using various heuristics:
inbound link text
page title
content on page
URL
If you wanted to be more quantitative about it, you could use machine learning and a classifier (maybe Bayesian); a rough keyword-scoring sketch follows below.
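As a rough illustration of combining the link-text, title and URL heuristics before reaching for a real classifier, something like the following could work (Python; the keyword list and weights are made up purely for illustration, and the content-on-page signal would be handled the same way):

ABOUT_KEYWORDS = ("about us", "about", "who we are", "company")   # illustrative only

def about_page_score(link_text, page_title, url):
    # Crude weighted score over the heuristics listed above;
    # the weights are arbitrary and would need tuning (or a real classifier).
    score = 0.0
    link_text = (link_text or "").lower()
    page_title = (page_title or "").lower()
    url = (url or "").lower()
    if any(k in link_text for k in ABOUT_KEYWORDS):
        score += 2.0   # inbound link text is usually the strongest signal
    if any(k in page_title for k in ABOUT_KEYWORDS):
        score += 1.5   # page title
    if "about" in url:
        score += 1.0   # URL, e.g. /about or /about-us
    return score

# Example: about_page_score("About Us", "About Us - Example Corp", "http://example.com/about") == 4.5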
Saving the front page is obviously easier, but front-page redirects (sometimes to different domains, and often implemented in the HTML meta redirect tag or even JS) are very common, so you need to handle this.
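For the HTML meta-redirect case, one rough way to pull the target URL out of a fetched page is a regex like the one below (a heuristic sketch only; attribute order and quoting vary in the wild, and JS redirects need separate handling):

import re

# Matches e.g. <meta http-equiv="refresh" content="0; url=http://example.com/">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>]+)',
    re.IGNORECASE)

def meta_redirect_target(html):
    # Return the redirect target from a meta-refresh tag, or None if absent.
    m = META_REFRESH.search(html)
    if m:
        return m.group(1).strip()
    return None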
Heritrix has a bit of a steep learning curve, but can be configured in such a way that only the homepage, and a page that "looks like" (using a regex filter) an about page, will get crawled.
More open source Java (web) crawlers: http://java-source.net/open-source/crawlers
Try out Scrapy. It is a web scraping library for Python.
If a simple Python script is all you need, try urllib2 in Python.
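A minimal urllib2 fetch, for reference (Python 2 standard library; the URL is a placeholder):

import urllib2

response = urllib2.urlopen("http://www.example.com/")   # placeholder URL
html = response.read()
print html[:200]   # first 200 bytes of the page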