Scraping parameterized URLs with Scrapy

Posted 2024-10-28 08:13:49

I have a spider running with Python Scrapy, and it is scraping all pages apart from pages with parameters (i.e. & symbols), such as http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294.

The error log says: [scrapy] ERROR: xxx matching query does not exist.

I am using CrawlSpider with the following SgmlLinkExtractor rule:

rules = (
    Rule(SgmlLinkExtractor(allow='[a-zA-Z0-9.:\/=_?&-]+$'),
         'parse',
         follow=True,
    ),
)
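For context, a minimal sketch of how such a rule typically sits inside a CrawlSpider (the spider name, domains, and callback body below are illustrative, not the asker's actual code; the import paths are the scrapy.contrib ones used by the Scrapy versions that still shipped SgmlLinkExtractor, and the callback is named parse_item rather than parse because CrawlSpider reserves parse for its own link-following logic):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class AmazonSpider(CrawlSpider):
    # Hypothetical spider name and start page.
    name = 'amazon'
    allowed_domains = ['www.amazon.co.uk']
    start_urls = ['http://www.amazon.co.uk/']

    rules = (
        # Same allow pattern as in the question; matching links are
        # followed and handed to parse_item.
        Rule(SgmlLinkExtractor(allow='[a-zA-Z0-9.:\/=_?&-]+$'),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        # Hypothetical callback: just log the URL that was crawled.
        self.log('Crawled %s' % response.url)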

I really appreciate your time, and thank you in advance.



Comments (2)

演多会厌 2024-11-04 08:13:49

To answer my own question: all my code was fine. The reason it was failing was the way I was calling Scrapy. The command breaks at & because I was using single quotes.
Using double quotes to call the spider is the solution.
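A minimal sketch of the quoting issue, assuming the URL was passed on the command line (the spider name and the -a start_url argument are illustrative; Scrapy's -a flag passes named arguments to a spider). On Windows cmd.exe, single quotes are not quoting characters, so the first call below is split at the first &, while the second passes the full URL through:

scrapy crawl myspider -a start_url='http://example.com/p?pf_rd_m=A&pf_rd_s=B'
scrapy crawl myspider -a start_url="http://example.com/p?pf_rd_m=A&pf_rd_s=B"

An unquoted & acts as a command separator on cmd.exe and as a backgrounding operator on POSIX shells, so the URL needs quoting either way; on a POSIX shell both quote styles would protect it.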

暮年 2024-11-04 08:13:49


Your expression matches the url as far as re.search() is concerned. Have you tried using r'regexpression' so Python treats the string as a raw string? It appears to match both as a raw and as a processed string, but it is always best to have Python treat regexes as raw strings.

>>> import re
>>> url = "http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294"
>>> m = re.search(r'[a-zA-Z0-9.:\/=_?&-]+$', url)
>>> m.group()
'http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294'

>>> m = re.search('[a-zA-Z0-9.:\/=_?&-]+$', url)
>>> m.group()
'http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294'
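Applying that suggestion to the rule from the question is a one-character change, the r prefix on the allow pattern (a sketch; with this particular pattern the prefix makes no behavioural difference, since it contains no escape sequences Python would rewrite, but raw strings are the safer habit for regexes):

rules = (
    Rule(SgmlLinkExtractor(allow=r'[a-zA-Z0-9.:\/=_?&-]+$'),
         'parse',
         follow=True,
    ),
)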