Generating a Python regex at runtime to match numbers from "n" to infinity
I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to check whether a page has to be parsed or a link has to be followed.
I am implementing a resume feature for my spider, so it can continue crawling from the last visited page. For this, I get the last followed link from a database when the spider is launched.
My site URLs look like http://foobar.com/page1.html, so, usually, the rule's regex to follow every link like this would be something like /page\d+\.html.
But how can I write a regex so it would match, for example, page 15 and up? Also, as I don't know the starting point in advance, how can I generate this regex at runtime?
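For reference, a rough sketch of the kind of rule I mean — the class name, the callback, and the old-style SgmlLinkExtractor import paths are only placeholders, not actual code from my project:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class PageSpider(CrawlSpider):
        name = "foobar"
        allowed_domains = ["foobar.com"]
        start_urls = ["http://foobar.com/page1.html"]

        rules = (
            # follow and parse every /pageN.html link
            Rule(SgmlLinkExtractor(allow=(r"/page\d+\.html",)),
                 callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            # extract the data here
            pass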
Comments (4)
Why not group the page number, then check if it is qualified:
Or more specifically what you requested:
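Something along these lines — the helper name is made up and the threshold is hard-coded to 15 purely for illustration:

    import re

    last_page = 15  # e.g. the last visited page, loaded from the database

    # (1) group the page number, then check it numerically
    page_re = re.compile(r"/page(\d+)\.html")

    def is_qualified(url):
        m = page_re.search(url)
        return bool(m) and int(m.group(1)) > last_page

    print(is_qualified("http://foobar.com/page16.html"))  # True
    print(is_qualified("http://foobar.com/page3.html"))   # False

    # (2) or, hard-coded for last_page == 15, a single pattern that only
    #     matches page numbers 16 and up (16-19, 20-99, 100 and more)
    follow_re = re.compile(r"/page(?:1[6-9]|[2-9]\d|[1-9]\d{2,})\.html")
    print(bool(follow_re.search("http://foobar.com/page16.html")))  # True
    print(bool(follow_re.search("http://foobar.com/page15.html")))  # False

The next comment shows how to build a pattern like (2) for an arbitrary threshold instead of hard-coding it.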
Try this:
It turned out easier to make it match numbers greater than the parameter, so if you give it 15, it'll return a string for matching numbers 16 and greater, specifically...
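One way such a helper could look — the function name and the exact shape of the generated alternation are assumptions on my part, but the construction is the usual digit-by-digit one:

    def regex_for_greater_than(n):
        """Return a regex string matching integers strictly greater than n
        (no leading zeros).

        Built digit by digit: keep a prefix of n, allow a larger digit at
        the next position, pad with arbitrary digits, and finally allow
        any number that simply has more digits than n."""
        digits = str(n)
        length = len(digits)
        alternatives = []
        for i, d in enumerate(digits):
            d = int(d)
            if d == 9:
                continue  # no digit can be larger than 9 at this position
            rest = length - i - 1
            branch = digits[:i] + "[%d-9]" % (d + 1)
            if rest:
                branch += r"\d{%d}" % rest
            alternatives.append(branch)
        alternatives.append(r"[1-9]\d{%d,}" % length)  # more digits than n
        return "(?:%s)" % "|".join(alternatives)

    print(regex_for_greater_than(15))
    # (?:[2-9]\d{1}|1[6-9]|[1-9]\d{2,})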
You can then substitute this into your expression instead of
\d+
, like so:稍微扩展一下 Kabie 的答案:
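For example, reusing the hypothetical regex_for_greater_than() from above together with the old Scrapy import paths:

    from scrapy.contrib.spiders import Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    last_page = 15  # last followed page, loaded from the database at start-up

    # build the allow pattern at runtime instead of the static r"/page\d+\.html"
    allow_pattern = r"/page%s\.html" % regex_for_greater_than(last_page)
    # -> /page(?:[2-9]\d{1}|1[6-9]|[1-9]\d{2,})\.html

    rules = (
        Rule(SgmlLinkExtractor(allow=(allow_pattern,)),
             callback="parse_page", follow=True),
    )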
Extending Kabie's answer a little:
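One plausible shape for that extension — again built on the hypothetical regex_for_greater_than() sketched above, with a brute-force sanity check added for illustration:

    import re

    def follow_regex(last_page):
        # full link pattern for pages strictly after last_page
        return re.compile(r"/page%s\.html" % regex_for_greater_than(last_page))

    # quick sanity check against a plain numeric comparison
    pattern = follow_regex(15)
    for page in range(1, 300):
        url = "http://foobar.com/page%d.html" % page
        assert bool(pattern.search(url)) == (page > 15), page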
It's easy to modify to handle leading 0s if those occur on your website. But this seems like the wrong approach.
You have a few other options in Scrapy. You're probably using SgmlLinkExtractor, in which case the easiest thing is to pass your own function as the process_value keyword argument to do your custom filtering.
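A sketch of that route, assuming a CrawlSpider and the old SgmlLinkExtractor import paths; make_page_filter, parse_page and the last_page spider argument are placeholder names:

    import re

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    def make_page_filter(last_page):
        """Build a process_value callback: return the URL to keep the link,
        or None to drop it."""
        page_re = re.compile(r"/page(\d+)\.html")
        def process_value(value):
            m = page_re.search(value)
            if m and int(m.group(1)) > last_page:
                return value
            return None
        return process_value

    class FoobarSpider(CrawlSpider):
        name = "foobar"
        allowed_domains = ["foobar.com"]
        start_urls = ["http://foobar.com/page1.html"]

        def __init__(self, last_page=0, *args, **kwargs):
            # rules must exist before CrawlSpider.__init__ compiles them
            self.rules = (
                Rule(SgmlLinkExtractor(allow=(r"/page\d+\.html",),
                                       process_value=make_page_filter(int(last_page))),
                     callback="parse_page", follow=True),
            )
            super(FoobarSpider, self).__init__(*args, **kwargs)

        def parse_page(self, response):
            pass  # extract the data here

A nice side effect of comparing int(m.group(1)) is that page numbers with leading zeros are handled for free, and the rule's regex can stay as the simple \d+ version.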
You can customize CrawlSpider quite a lot, but if it doesn't fit your task, you should check out BaseSpider.
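For completeness, the BaseSpider route boils down to generating the next request yourself instead of relying on rules at all; everything below (the names, the stop condition) is only a sketch:

    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    class FoobarPagesSpider(BaseSpider):
        name = "foobar_pages"
        allowed_domains = ["foobar.com"]

        def __init__(self, last_page=0, *args, **kwargs):
            super(FoobarPagesSpider, self).__init__(*args, **kwargs)
            self.next_page = int(last_page) + 1

        def start_requests(self):
            yield Request("http://foobar.com/page%d.html" % self.next_page)

        def parse(self, response):
            # ... extract and yield items here ...
            # then queue the following page; when a page finally 404s, the
            # response is ignored by default, nothing new is queued, and
            # the crawl stops
            self.next_page += 1
            yield Request("http://foobar.com/page%d.html" % self.next_page)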