Scraping parameterized URLs with Scrapy
I have a spider running using Python Scrapy, which is scraping all pages apart from pages with parameters (i.e. & symbols), such as http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294.
The error log says [scrapy] ERROR: xxx matching query does not exist.
I am using CrawlSpider with the following SgmlLinkExtractor rule:
rules = (
    Rule(SgmlLinkExtractor(allow='[a-zA-Z0-9.:\/=_?&-]+$'),
        'parse',
        follow=True,
    ),
)
I really appreciate your time, and thank you in advance.
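For context, here is a minimal sketch of how such a rule sits inside a CrawlSpider. The spider name, domain, and start URL below are hypothetical, and the callback is named parse_item rather than parse, since CrawlSpider reserves parse() for its own link-following logic. (Note that SgmlLinkExtractor was deprecated in later Scrapy releases in favour of LinkExtractor.)

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ProductSpider(CrawlSpider):
    # Hypothetical identifiers; substitute your own.
    name = 'products'
    allowed_domains = ['www.amazon.co.uk']
    start_urls = ['http://www.amazon.co.uk/']

    # Same allow pattern as above; the character class includes ? and &,
    # so parameterized URLs pass the filter.
    rules = (
        Rule(SgmlLinkExtractor(allow='[a-zA-Z0-9.:\/=_?&-]+$'),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        # Placeholder callback: just log the URL that was crawled.
        self.log('Crawled %s' % response.url)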
Comments (2)
To answer my own question: all my code was fine. The reason it was failing was the way I was calling Scrapy. The URL broke at the & because I was using single quotes; calling the spider with double quotes is the solution.
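A sketch of the quoting difference. The exact original command isn't shown, so the spider name and the -a argument here are hypothetical; the single-quote failure applies to shells such as Windows cmd.exe, where ' is not a quoting character.

# Fails on shells where ' is not a quoting character (e.g. Windows cmd.exe):
# the shell splits the command at each unprotected &.
scrapy crawl myspider -a start_url='http://example.com/p?a=1&b=2'

# Works: double quotes are honoured, so the full URL reaches Scrapy intact.
scrapy crawl myspider -a start_url="http://example.com/p?a=1&b=2"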
Your expression matches the URL as far as re.search() is concerned. Have you tried using r'regexpression' so Python treats the string as a raw string? It appears to match with both a raw and a processed string, but it is always best to have Python treat regular expressions as raw strings.
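A quick sketch of that check. The pattern is the one from the question; the URL is truncated here for readability.

import re

# The allow pattern from the question. The raw-string form keeps Python
# from interpreting the backslash before the regex engine sees it.
raw_pattern = r'[a-zA-Z0-9.:\/=_?&-]+$'
processed_pattern = '[a-zA-Z0-9.:\/=_?&-]+$'

url = ('http://www.amazon.co.uk/gp/product/B003ZDXHSG/'
       'ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2')

# Both forms match this URL, which is the point above: the expression
# itself accepts parameterized URLs containing ? and &.
print(re.search(raw_pattern, url) is not None)        # True
print(re.search(processed_pattern, url) is not None)  # True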