How to extract the parent HTML tag by matching a string in Python
I need to extract the parent tag in HTML by matching a string in the HTML.
That is, I have many raw HTML sources. Each source contains the text "VIN:" followed by some characters. This text (VIN:*) appears inside different elements in each source, such as <ul>, <div>, etc.
I then need to extract all the values alongside that "VIN:*" string, which means I need to get its parent tag.
For example,
<div class="class1">
Stock Number:
Z2079
<br>
**VIN:
2T2HK31UX9C110701**
<br>
Model Code:
9424
<img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>
This is how the VIN appears in one HTML source. The other HTML sources contain a VIN as well, but in different formats.
These values have to be extracted in Python.
Is there an efficient way to extract the parent tag by matching the string in Python?
I would strongly recommend going with BeautifulSoup on this; it provides some incredibly convenient functionality for parsing HTML. Here, for example, is how I would go about finding every text node that contains "VIN" in either case:
From there, you simply walk through that collection, grab each node's parent, grab said parent's contents, and parse them as you see fit:
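The two steps described above (find the text nodes, then walk up to each parent) might look roughly like the sketch below. It assumes BeautifulSoup 4 with the built-in "html.parser" backend, treats a VIN as 17 letters or digits, and matches "VIN" case-insensitively; the <ul> fragment and its VIN are made up for illustration, and the ** markers from the question's sample are left out.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: the <div> is taken from the question,
# the <ul> fragment (and its VIN) is invented for illustration.
html = """
<div class="class1">
Stock Number:
Z2079
<br>
VIN:
2T2HK31UX9C110701
<br>
Model Code:
9424
<img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>
<ul><li>vin: JTHBK1EG0A2345678</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: every text node that contains "VIN" (case-insensitive by assumption).
vin_nodes = soup.find_all(string=re.compile(r"VIN", re.IGNORECASE))

# Step 2: walk up to each node's parent tag and pull the value out of its text.
# Assumption: a VIN is 17 alphanumeric characters following "VIN:".
vin_value = re.compile(r"VIN:\s*([A-Z0-9]{17})", re.IGNORECASE)

found = []
for node in vin_nodes:
    parent = node.parent
    match = vin_value.search(parent.get_text())
    if match:
        found.append((parent.name, match.group(1)))

print(found)
```

Because <br> is a void element, the text after it stays a direct child of the <div>, so node.parent gives the enclosing <div> rather than the <br>.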
For such a simple task, which consists of ANALYZING the string rather than PARSING it (parsing = building a tree representation of the text), you can do:
the text
the code:
the result
re.DOTALL is necessary to give the dot symbol the ability to match newlines as well (by default, a dot in a regular expression pattern matches every character except a newline).
\\1 is a way to specify that at this place in the examined string there must be the same portion of string that was captured by the first group, that is to say the part ([^ >]+)
'(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)' is a part that says it is forbidden to find a tag other than <br> before the first <br> tag encountered between the opening tag and the closing tag of an HTML element. This part is necessary to catch the closest tag preceding VIN, apart from <br>.
If this part isn't present, the regex
catches the following result:
The difference is 'artifice' instead of 'baradino'
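The three ingredients discussed in this answer (re.DOTALL, a backreference to the captured tag name, and a negative lookahead that tolerates only <br>) can be illustrated with the sketch below. The pattern is a simplified stand-in written for this example, not the pattern from the answer, and the <section> wrapper around the question's <div> is invented to show what the lookahead prevents.

```python
import re

# Sample text: the question's <div>, wrapped in a made-up <section>.
text = """<section>
<div class="class1">
Stock Number:
Z2079
<br>
VIN:
2T2HK31UX9C110701
<br>
Model Code:
9424
<img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>
</section>"""

# With the lookahead: between the opening tag and "VIN:" the only tag
# allowed is <br>, so the match has to start at the closest enclosing tag.
strict = re.compile(
    r"<([^ >/]+)[^>]*>"        # opening tag; group 1 captures the tag name
    r"((?:(?!<(?!br>)).)*?"    # any characters, but no '<' unless it starts <br>
    r"VIN:.*?)"                # the VIN label and the rest of the content
    r"</\1>",                  # closing tag with the same name as group 1
    re.DOTALL,                 # let '.' match newlines too
)

# Without the lookahead: the first opening tag in the text wins.
loose = re.compile(r"<([^ >/]+)[^>]*>(.*?VIN:.*?)</\1>", re.DOTALL)

strict_tag = strict.search(text).group(1)
loose_tag = loose.search(text).group(1)
print(strict_tag, loose_tag)
```

Here the strict pattern reports div while the loose one reports section, which is the same kind of difference the answer describes between its two results.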
For a pure string version that does not use any XML/HTML parser, you might try regular expressions (re):
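One possible parser-free sketch, using only the standard library: it finds each "VIN:" value with one regex, then scans the text before it for the last opening tag that is not a void element such as <br>. This is a heuristic and rests on several assumptions: the helper name vin_with_parent is hypothetical, a VIN is taken to be 17 letters or digits, and an element that was opened and already closed before the VIN would be reported wrongly as the parent.

```python
import re

VIN_RE = re.compile(r"VIN:\s*([A-Z0-9]{17})", re.IGNORECASE)
OPEN_TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
# Tags that never enclose text, so they cannot be the parent.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


def vin_with_parent(html):
    """Return (parent_tag_name, vin) pairs; parent is None if no tag precedes."""
    results = []
    for vin in VIN_RE.finditer(html):
        # Opening tags that appear before this VIN, ignoring void tags.
        candidates = [
            tag.group(1).lower()
            for tag in OPEN_TAG_RE.finditer(html, 0, vin.start())
            if tag.group(1).lower() not in VOID_TAGS
        ]
        parent = candidates[-1] if candidates else None
        results.append((parent, vin.group(1)))
    return results


# The <div> from the question, without the ** markers.
sample = """<div class="class1">
Stock Number:
Z2079
<br>
VIN:
2T2HK31UX9C110701
<br>
Model Code:
9424
<img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>"""

pairs = vin_with_parent(sample)
print(pairs)
```

For markup with nested or sibling elements, a real HTML parser is the safer choice; this version only suits flat fragments like the one in the question.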