How do I get an undetermined number of <p> tags using Scrapy?

Posted on 2025-02-10 10:17:36 · 686 characters · 1 view · 0 comments

How can I use Scrapy to get the text of an undetermined number of <p> tags? The two examples below show the kind of page I mean:

I want to get all the <p> elements that come after <h2>xxxx Characteristics</h2> or after an <h3> heading inside <div class="entry-content">, and then merge those blocks of <p> into separate fields. The number of <p> elements is not known in advance.

page1 page2


情绪失控 2025-02-17 10:17:36

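You can try extracting all the children of the div and testing whether each one is an h2 or h3. If it is, check whether its text contains "Diet" or "Characteristics"; if it does, collect all the <p> siblings that follow it.

def parse(self, response):
    collect = False  # True while we are inside a matching section
    contents = []
    # Walk the direct children of the content div in document order.
    for selector in response.xpath("//div[@class='entry-content']/*"):
        tag = selector.root.tag
        # string(.) joins all descendant text, including nested <a>, <em>, etc.
        text = selector.xpath("string(.)").get(default="").strip()
        if tag in ("h2", "h3"):
            # Every heading starts a new section; collect only the wanted ones.
            collect = "Characteristics" in text or "Diet" in text
        elif tag == "p":
            if collect and text:
                contents.append(text)
        else:
            # Any other element ends the current section.
            collect = False
    yield {"contents": contents}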
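
The question also asks for the paragraphs to end up in separate fields, which the answer's single `contents` list does not do. The following is a minimal sketch of that grouping step, not a definitive implementation. It assumes the page has already been reduced to `(tag, text)` pairs in document order; the function name `group_paragraphs` and the `KEYWORDS` tuple are hypothetical names introduced here for illustration, and the sample data is invented.

```python
# Keywords taken from the answer above; each one becomes an output field.
KEYWORDS = ("Characteristics", "Diet")


def group_paragraphs(elements, keywords=KEYWORDS):
    """Group <p> text under the keyword found in the preceding h2/h3.

    `elements` is an iterable of (tag, text) pairs in document order
    (an assumption of this sketch, not something Scrapy provides directly).
    """
    fields = {}
    current = None  # keyword of the section we are in, or None
    for tag, text in elements:
        if tag in ("h2", "h3"):
            # A heading starts a new section; keep it only if a keyword matches.
            current = next((k for k in keywords if k in text), None)
        elif tag == "p":
            if current is not None and text:
                fields.setdefault(current, []).append(text)
        else:
            # Any other element ends the current section.
            current = None
    # Merge each section's paragraphs into one string per field.
    return {key: "\n".join(parts) for key, parts in fields.items()}


# Invented sample data standing in for the children of div.entry-content.
sample = [
    ("h2", "Lion Characteristics"),
    ("p", "Big cat."),
    ("p", "Lives in prides."),
    ("h3", "Lion Diet"),
    ("p", "Eats meat."),
    ("h2", "Other"),
    ("p", "Ignored."),
]
result = group_paragraphs(sample)
```

Inside a Scrapy callback, the pairs could presumably be built from the same loop the answer uses, for example `(s.root.tag, s.xpath("string(.)").get(default="").strip())` for each child selector `s`, and the returned dict yielded as the item.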
