How to retrieve HTML tags based on a regular expression

Posted on 2025-01-04 02:51:43

I'm trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want every tag that includes the string "name", and I have an HTML document like this:

<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>

I could probably use a regular expression to catch every match between an opening "<" and a closing ">", but I'd also like to traverse the parsed tree starting from those matches, so I can get siblings, parents, or 'nextElements'. In the example above, that amounts to getting <head>*</head> or maybe <h2>*</h2> once I know they are parents or siblings of a tag containing the match.

I tried BeautifulSoup, but it seems to be useful mainly when you already know what kind of tag you're looking for, or when you search by tag contents. In this case, I want to get a match first, take that match as a starting point, and then navigate the tree the way BeautifulSoup and other HTML parsers allow.

Suggestions?

Comments (2)

千寻… 2025-01-11 02:51:43

Use lxml.html. It's a great parser, and it supports XPath, which can express queries like this concisely.

The example below uses this XPath expression:

//*[contains(text(),'name')]/parent::*/following-sibling::*[1]/*[@class='name']/text()

In English, that means:

Find any tag that contains the word 'name' in its text, then get
its parent, then the parent's next sibling, then find inside that sibling any tag with the class
'name', and finally return the text content of that tag.

The result of running the code is:

['This is also a tag to be retrieved']

Here's the full code:

text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

import lxml.html

doc = lxml.html.fromstring(text)
# $stuff is an XPath variable; lxml binds it from the keyword argument.
print(doc.xpath('//*[contains(text(), $stuff)]/parent::*/'
    'following-sibling::*[1]/*[@class=$stuff]/text()', stuff='name'))

Obligatory reading: the "please don't parse HTML with regex" answer is here:
https://stackoverflow.com/a/1732454/17160
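
The question asks for the reverse order: match first, then navigate. As a minimal sketch of that, assuming a tag "matches" when its own text or any attribute value matches the pattern, one can walk the lxml tree with Python's re module and then use getparent() and getnext() on each hit. The helper name find_matching_elements is made up for this illustration and is not part of lxml.

```python
import re

import lxml.html

HTML = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""


def find_matching_elements(root, pattern):
    """Hypothetical helper: elements whose own text or attributes match."""
    matches = []
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        # Text directly inside the element: el.text plus the tails of its children.
        own_text = [el.text or ""] + [child.tail or "" for child in el]
        values = own_text + list(el.attrib.values())
        if any(pattern.search(v) for v in values):
            matches.append(el)
    return matches


doc = lxml.html.fromstring(HTML)
hits = find_matching_elements(doc, re.compile(r"name"))

# Each hit is a normal lxml element, so the tree can be navigated from it.
for el in hits:
    parent = el.getparent()
    nxt = el.getnext()
    print(el.tag,
          "parent:", parent.tag if parent is not None else None,
          "next sibling:", nxt.tag if nxt is not None else None)
```

Because the regular expression is applied per element rather than to the raw markup, the parser still handles the HTML and the pattern only decides which nodes to start from.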

牛↙奶布丁 2025-01-11 02:51:43

Assuming a tag counts as a match when either of the following holds:

  • The match occurs in the value of an attribute on the tag
  • The match occurs in a text node that is a direct child of the tag

You can use Beautiful Soup:

from bs4 import BeautifulSoup
from bs4 import NavigableString
import re

html = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''

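# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
p = re.compile("name")

def match(patt):
    def closure(tag):
        # Text nodes that are direct children of the tag.
        for c in tag.contents:
            if isinstance(c, NavigableString) and patt.search(str(c)):
                return True
        # Attribute values; multi-valued attributes such as class are lists.
        for v in tag.attrs.values():
            values = v if isinstance(v, list) else [v]
            if any(patt.search(x) for x in values):
                return True
        return False
    return closure

for t in soup.find_all(match(p)):
    print(t)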

Output:

<title>This tag includes 'name', so it should be retrieved</title>
<h1 class="name">This is also a tag to be retrieved</h1>
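
Since find_all returns ordinary Tag objects, each match can serve as the starting point for navigation, which is what the question asks for. The following is a small sketch under the same matching assumptions as above; the function name matches is chosen here for illustration only.

```python
import re

from bs4 import BeautifulSoup, NavigableString

HTML = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''


def matches(pattern):
    """Build a find_all predicate (illustrative name, not a bs4 API)."""
    def predicate(tag):
        # Direct child text nodes of the tag.
        texts = [str(c) for c in tag.contents if isinstance(c, NavigableString)]
        # Attribute values; class and similar attributes come back as lists.
        for value in tag.attrs.values():
            texts.extend(value if isinstance(value, list) else [value])
        return any(pattern.search(t) for t in texts)
    return predicate


soup = BeautifulSoup(HTML, "html.parser")
found = soup.find_all(matches(re.compile("name")))

# Navigate from each match: its parent and the next sibling that is a tag.
for tag in found:
    sibling = tag.find_next_sibling()
    print(tag.name,
          "parent:", tag.parent.name,
          "next sibling:", sibling.name if sibling is not None else None)
```

find_next_sibling() skips the whitespace text nodes between tags, whereas the plain next_sibling attribute would return them.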