Generating a Python regex at runtime to match numbers from "n" up to infinity

Posted on 2024-10-20 05:18:15

I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page has to be parsed or a link has to be followed.

I am implementing a resume feature for my spider, so it could continue crawling from the last visited page. For this, I get the last followed link from a database when the spider is launched.

My site urls look like http://foobar.com/page1.html, so, usually, the rule's regex to follow every link like this would be something like /page\d+\.html.

But how can I write a regex so it would match, for example, page 15 and more? Also, as I don't know the starting point in advance, how could I generate this regex at runtime?
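
For concreteness, here is a minimal sketch of the resume setup described above, with the database lookup mocked out and all names hypothetical:

import re

# Hypothetical: the last followed link, as fetched from the database at startup.
last_link = "http://foobar.com/page15.html"
m = re.search(r"/page(\d+)\.html", last_link)
last_page = int(m.group(1)) if m else 0  # 15 in this example
# The question is how to build, at runtime, a rule regex that only matches
# pages after last_page.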

Answers (4)

青柠芒果 2024-10-27 05:18:15

Why not capture the page number in a group, then check whether it qualifies:

>>> import re
>>> m = re.match(r"/page(\d+)\.html", "/page18.html")
>>> if m:
...     ID = int(m.groups()[0])
...
>>> ID > 15
True

Or, more specifically, what you requested:

>>> def genRegex(n):
...     return ''.join('[' + "0123456789"[int(d):] + ']' for d in str(n))
...
>>> genRegex(123)
'[123456789][23456789][3456789]'
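
As a usage note, here is a minimal sketch of turning the capture-and-compare idea into a link filter; the helper name, the threshold, and the example URLs are illustrative assumptions, not part of the answer:

import re

# Pattern for the question's URL scheme; the group captures the page number.
PAGE_RE = re.compile(r"/page(\d+)\.html")

def is_past(url, last_page):
    # Hypothetical helper: True if the URL's page number is greater than last_page.
    m = PAGE_RE.search(url)
    return bool(m) and int(m.group(1)) > last_page

print(is_past("http://foobar.com/page18.html", 15))  # True
print(is_past("http://foobar.com/page12.html", 15))  # False
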
丢了幸福的猪 2024-10-27 05:18:15

Try this:

def digit_match_greater(n):
    digits = str(n)
    variations = []
    # Anything with more than len(digits) digits is a match:
    variations.append(r"\d{%d,}" % (len(digits)+1))
    # Now match numbers with len(digits) digits.
    # (Generate, e.g, for 15, "1[6-9]", "[2-9]\d")
    # 9s can be skipped -- e.g. for >19 we only need [2-9]\d.
    for i, d in enumerate(digits):
        if d != "9": 
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            for j in range(i+1, len(digits)):
                pattern[j] = r"\d"
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)

It turned out easier to make it match numbers greater than the parameter, so if you give it 15, it'll return a string for matching numbers 16 and greater, specifically...

(?:(?:\d{3,})|(?:[2-9]\d)|(?:1[6-9]))

You can then substitute this into your expression instead of \d+, like so:

exp = re.compile(r"page%s\.html" % digit_match_greater(last_page_visited))
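
A quick sanity check of the pattern this produces (assuming digit_match_greater from the snippet above is in scope; the test filenames are illustrative):

import re

pattern = re.compile(r"^page%s\.html$" % digit_match_greater(15))
print(bool(pattern.match("page16.html")))   # True:  matched by 1[6-9]
print(bool(pattern.match("page150.html")))  # True:  matched by \d{3,}
print(bool(pattern.match("page15.html")))   # False: not greater than 15
print(bool(pattern.match("page9.html")))    # False: too few digits
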
柒七 2024-10-27 05:18:15

Extending Kabie's answer a little:

def genregex(n):
    nstr = str(n)
    same_digit = ''.join('[' + "0123456789"[int(d):] + ']' for d in nstr)
    return "\d{%d,}|%s" % (len(nstr) + 1, same_digit)

It's easy to modify this to handle leading zeros if they occur in your website's URLs. But this seems like the wrong approach.

You have a few other options in scrapy. You're probably using SgmlLinkExtractor, in which case the easiest thing is to pass your own function as the process_value keyword argument to do your custom filtering.

You can customize CrawlSpider quite a lot, but if it doesn't fit your task, you should check out BaseSpider.
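
A minimal sketch of that process_value route, assuming a recent Scrapy where LinkExtractor accepts the same process_value argument as the older SgmlLinkExtractor; the spider name, start URL, and the way LAST_PAGE is loaded are assumptions for illustration:

import re

from scrapy.linkextractors import LinkExtractor  # SgmlLinkExtractor in older Scrapy
from scrapy.spiders import CrawlSpider, Rule

LAST_PAGE = 15  # hypothetical: loaded from the database when the spider starts

def skip_visited(value):
    # Keep a link only if its page number is past the resume point;
    # returning None tells the link extractor to drop it.
    m = re.search(r"/page(\d+)\.html", value)
    if m and int(m.group(1)) > LAST_PAGE:
        return value
    return None

class FoobarSpider(CrawlSpider):
    name = "foobar"
    start_urls = ["http://foobar.com/page%d.html" % (LAST_PAGE + 1)]
    rules = (
        Rule(LinkExtractor(allow=r"/page\d+\.html", process_value=skip_visited),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        pass  # extraction logic for each page goes here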

只为守护你 2024-10-27 05:18:15
>>> import re
>>> import random
>>> n = random.randint(100, 1000000)
>>> n
435220
>>> len(str(n))
6
>>> r'\d' * len(str(n))
'\\d\\d\\d\\d\\d\\d'
>>> reg = r'\d{%d}' % len(str(n))
>>> m = re.search(reg, str(n))
>>> m.group(0)
'435220'