Multi-threaded web scraper using urlretrieve on a cookie-enabled site
I am trying to write my first Python script, and with lots of Googling, I think that I am just about done. However, I will need some help getting myself across the finish line.
I need to write a script that logs onto a cookie-enabled site, scrapes a bunch of links, and then spawns a few processes to download the files. I have the program running single-threaded, so I know the code works. But when I tried to create a pool of download workers, I ran into a wall.
#manager.py
import multiprocessing
import Fetch # the module name where the worker lives

def FetchReports(links, Username, Password, VendorID):
    # SiteBase and DataPath are assumed to be defined elsewhere in this module
    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart,
                                initargs=(SiteBase, DataPath, Username, Password, VendorID,))
    pool.map(Fetch.DownloadJob, links)
    pool.close()
    pool.join()
#worker.py
import atexit
import mechanize

# Login and Logout are helper functions defined elsewhere in this module

def _ProcessStart(_SiteBase, _DataPath, User, Password, VendorID):
    Login(User, Password)
    global SiteBase
    SiteBase = _SiteBase
    global DataPath
    DataPath = _DataPath
    atexit.register(Logout)

def DownloadJob(link):
    # filename and data are derived from the link; their definitions are elided here
    mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),
                          filename=DataPath + '\\' + filename, data=data)
    return True
In this revision, the code fails because the cookies have not been transferred to the worker for urlretrieve to use. No problem, I was able to use mechanize's .cookiejar class to save the cookies in the manager, and pass them to the worker.
#worker.py
import atexit
import mechanize
from multiprocessing import current_process

def _ProcessStart(_SiteBase, _DataPath, User, Password, VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
    Login(User, Password, opener)  # note I pass the opener to Login so it can catch the cookies
    global SiteBase
    SiteBase = _SiteBase
    global DataPath
    DataPath = _DataPath
    # give each worker process its own cookie file, keyed on the process name
    cookies.save(DataPath + '\\' + current_process().name + 'cookies.txt', True, True)
    atexit.register(Logout)

def DownloadJob(link):
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath + '\\' + current_process().name + 'cookies.txt',
              ignore_discard=True, ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    # filename is derived from the link; its definition is elided here
    file = open(DataPath + '\\' + filename, "wb")
    file.write(opener.open(mechanize.urljoin(SiteBase, link)).read())
    file.close()
But THAT fails because the opener (I think) wants to move the binary file back to the manager for processing, and I get an "unable to pickle object" error message referring to the webpage it is trying to read into the file.
The obvious solution is to read the cookies in from the cookie jar and manually add them to the header when making the urlretrieve request, but I am trying to avoid that, and that is why I am fishing for suggestions.
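For reference, the manual version I am trying to avoid would look roughly like this (a sketch only, untested, reusing the names from my worker code above):

    def DownloadJobManualHeader(link):
        # build a "Cookie:" header by hand from whatever the saved jar holds
        cj = mechanize.LWPCookieJar()
        cj.revert(filename=DataPath + '\\' + current_process().name + 'cookies.txt',
                  ignore_discard=True, ignore_expires=True)
        cookie_header = "; ".join("%s=%s" % (c.name, c.value) for c in cj)
        request = mechanize.Request(mechanize.urljoin(SiteBase, link))
        request.add_header("Cookie", cookie_header)
        file = open(DataPath + '\\' + filename, "wb")  # filename derived from link, as above
        file.write(mechanize.urlopen(request).read())
        file.close()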
Comments (3)
Creating a multi-threaded web scraper the right way is hard. I'm sure you could handle it, but why not use something that has already been done?
I really, really suggest you check out Scrapy: http://scrapy.org/
It is a very flexible open source web scraper framework that will handle most of the stuff you would need here as well. With Scrapy, running concurrent spiders is a configuration issue, not a programming issue (http://doc.scrapy.org/topics/settings.html#concurrent-requests-per-spider). You will also get support for cookies, proxies, HTTP Authentication and much more.
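As a rough sketch (check the exact setting name against your Scrapy version; I am quoting it from the docs page linked above), concurrency is just a couple of lines in settings.py:

    # settings.py -- sketch only; verify the names against your Scrapy release
    CONCURRENT_REQUESTS_PER_SPIDER = 4   # fetch up to four pages in parallel
    # cookie handling is done by a built-in downloader middleware, enabled by default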
For me, it took around 4 hours to rewrite my scraper in Scrapy. So please ask yourself: do you really want to solve the threading issue yourself, or would you rather stand on the shoulders of others and focus on the web-scraping problems, not the threading?
PS. Are you using mechanize now? Please note this from the mechanize FAQ http://wwwsearch.sourceforge.net/mechanize/faq.html:
"Is it threadsafe?
No. As far as I know, you can use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself."
If you really want to keep using mechanize, start reading through documentation on how to provide synchronization. (e.g. http://effbot.org/zone/thread-synchronization.htm, http://effbot.org/pyfaq/what-kinds-of-global-value-mutation-are-thread-safe.htm)
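For example, one minimal way to do that (a generic sketch, not something mechanize provides for you) is to guard a shared Browser with a lock:

    import threading
    import mechanize

    browser = mechanize.Browser()
    browser_lock = threading.Lock()

    def fetch(url):
        # serialize every use of the shared Browser, since mechanize does no locking itself
        with browser_lock:
            return browser.open(url).read()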
After working on this for most of the day, it turns out that Mechanize was not the problem; it looks more like a coding error. After extensive tweaking and cursing, I have gotten the code to work properly.
For future Googlers like myself, I am providing the updated code below:
Because I am just downloading links from a list, the non-threadsafe nature of mechanize doesn't seem to be a problem [full disclosure: I have run this process exactly three times, so a problem may appear under further testing]. The multiprocessing module and its worker pool do all the heavy lifting. Maintaining cookies in files was important for me because the webserver I am downloading from has to give each thread its own session ID, but other people implementing this code may not need it. I did notice that it seems to "forget" variables between the init call and the run call, so the cookiejar may not make the jump.
In order to enable a cookie session in the first code example, add the following code to the function DownloadJob:
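Roughly, reconstructed from the second worker.py above (the mechanize.install_opener call is an assumption on my part, mirroring urllib2's API; installing the opener globally is what should let urlretrieve pick up the session cookie):

    # at the top of DownloadJob, before calling urlretrieve
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath + '\\' + current_process().name + 'cookies.txt',
              ignore_discard=True, ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    mechanize.install_opener(opener)  # assumed: installed globally so urlretrieve uses it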
And then you can retrieve the URL just as before:
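That is, the same call as in the first worker.py, which should now go out with this worker's session cookie attached (filename and data are built from the link as before):

    mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),
                          filename=DataPath + '\\' + filename, data=data)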