How to retrieve HTML tags based on a regular expression
I'm trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want to get every tag that includes the string "name", and I have an HTML document like this:
<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>
Probably I should try a regular expression to catch every match between an opening "<" and a closing ">". However, I'd like to be able to traverse the parsed tree based on those matches, so I can get the siblings, parents or 'nextElements'. In the example above, that amounts to getting <head>*</head> or maybe <h2>*</h2> once I know they're parents or siblings of a tag containing the match.
I tried BeautifulSoup, but it seems to me that it's mainly useful when you already know what kind of tag you're looking for, or when you search based on a tag's contents. In this case, I want to get a match first, take that match as a starting point, and then navigate the tree the way BeautifulSoup and other HTML parsers allow.
Suggestions?
Comments (2)
Use lxml.html. It's a great parser, and it supports XPath, which can easily express almost anything you'd want. The example below uses this XPath expression:
That means, in English:
The result of running the code is:
Here's the full code:
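A minimal sketch of this approach, not necessarily the expression the answer originally used. It assumes "contains a match" means that the element's own text or one of its attribute values contains the substring "name". The names HTML, XPATH and matches are made up for illustration.

```python
import lxml.html

# Sample document from the question.
HTML = """<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>"""

# Assumed expression: any element that has a direct text node containing
# "name", or any attribute whose value contains "name".
XPATH = '//*[text()[contains(., "name")] or @*[contains(., "name")]]'

doc = lxml.html.fromstring(HTML)
matches = doc.xpath(XPATH)

for el in matches:
    # Each match is an ordinary element, so the tree can be walked from it.
    parent = el.getparent()
    sibling = el.getnext()
    print(
        el.tag,
        "parent:", parent.tag if parent is not None else None,
        "next sibling:", sibling.tag if sibling is not None else None,
    )
```

For the sample document this should select the <title> and <h1> elements. Note that contains() is a plain substring test, not a regular expression; for real pattern matching you could filter the elements in Python with the re module instead.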
Obligatory read, the "please don't parse HTML with regex" answer is here:
https://stackoverflow.com/a/1732454/17160
Given the following conditions:
You can use Beautiful Soup:
Output:
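A minimal sketch of how this could look with Beautiful Soup, under the assumption that a tag "matches" when its own direct text or any of its attribute values matches the regular expression. PATTERN and tag_matches are hypothetical names chosen for this example, not part of the library.

```python
import re
from bs4 import BeautifulSoup

# Sample document from the question.
HTML = """<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>"""

PATTERN = re.compile(r"name")  # the regular expression to search for


def tag_matches(tag):
    """Return True if the tag's own text or any attribute value matches PATTERN."""
    # Only direct text children, so ancestors of a match are not selected too.
    own_text = "".join(tag.find_all(string=True, recursive=False))
    if PATTERN.search(own_text):
        return True
    for value in tag.attrs.values():
        # Multi-valued attributes such as class come back as lists.
        text = " ".join(value) if isinstance(value, list) else str(value)
        if PATTERN.search(text):
            return True
    return False


soup = BeautifulSoup(HTML, "html.parser")
matches = soup.find_all(tag_matches)  # find_all accepts a filter function

for tag in matches:
    # Every match is a regular Tag, so navigation works from it as usual.
    sibling = tag.find_next_sibling(True)
    print(tag.name, "parent:", tag.parent.name,
          "next sibling:", sibling.name if sibling else None)
```

Passing a function to find_all() lets the match come first: each returned Tag can then be used as the starting point for .parent, find_next_sibling() and the other navigation helpers.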