How to extract the parent HTML tag by matching a string in Python

Posted on 2024-12-23 14:12:05

I need to extract parent tags from HTML by matching a string inside the HTML.
I have many raw HTML sources. Each source contains the text "VIN:" followed by some characters. This text is placed inside a different tag in each source, such as <ul>, <div>, and so on.

I then need to extract all the values that appear alongside that "VIN:" string, which means I need to get its parent tag.

For example:

<div class="class1">

                            Stock Number:
                            Z2079
                            <br>
                            **VIN:
                            2T2HK31UX9C110701**
                            <br>
                            Model Code:
                            9424
                            <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>

Here the "VIN" sits inside a <div>. Other HTML sources contain the VIN in a similar way, but inside different tags and in different formats.

These values have to be extracted in Python.

Is there an efficient way in Python to extract the parent tag by matching the string?

Comments (3)

终弃我 2024-12-30 14:12:05

I would strongly recommend going with BeautifulSoup on this; it provides some very convenient functionality for parsing HTML. Here, for example, is how I would go about finding every text node that contains "VIN", regardless of letter case:

from bs4 import BeautifulSoup

soup = BeautifulSoup(your_html_here, "html.parser")  # your_html_here: the raw HTML string
vins = soup.find_all(string=lambda s: "vin" in s.lower())

From there, you simply walk through that collection, grab each node's parent, grab said parent's contents, and parse them as you see fit:

for v in vins:
    parent_html = v.parent.contents
    # more code here
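
To make the steps above concrete, here is a minimal sketch rather than a definitive implementation. It assumes the `bs4` package is installed and uses Python's built-in "html.parser" backend. The helper name `find_vin_parents`, the `VIN_RE` pattern and the second `<ul>` snippet (with a made-up VIN) are illustrative assumptions, not part of the original question, and the `**` markers from the question's sample are left out.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: the first block follows the question's example,
# the second block is an invented source that uses a different tag.
html = """
<div class="class1">
    Stock Number:
    Z2079
    <br>
    VIN:
    2T2HK31UX9C110701
    <br>
    Model Code:
    9424
    <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>
<ul><li>VIN: JH4KA7561PC008269</li></ul>
"""

# Assumed shape of the value: letters, digits and hyphens after "VIN:".
VIN_RE = re.compile(r"VIN:\s*([A-Za-z0-9-]+)", re.IGNORECASE)


def find_vin_parents(markup):
    """Return (parent tag name, VIN value) for each text node containing a VIN."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    # A compiled regex passed as string= selects the matching text nodes.
    for node in soup.find_all(string=VIN_RE):
        match = VIN_RE.search(node)
        found.append((node.parent.name, match.group(1)))
    return found


results = find_vin_parents(html)
for tag_name, vin in results:
    print(tag_name, vin)
```

Because `<br>` is a void element, the text node holding the VIN is a direct child of the `<div>`, so `node.parent` is the enclosing tag the question asks for. From `node.parent` you can also read attributes or call `get_text()` to pull the neighbouring values.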
街角卖回忆 2024-12-30 14:12:05

For such a simple task, which consists of analyzing the string rather than parsing it (parsing = building a tree representation of the text), you can do the following.

The text:

ss = '''
Humpty Dumpty sat on a wall
<div class="class1">
    Stock Number:
    Z2079
    <br>
        **VIN:
        2T2HK31UX9C110701**
    <br>
    Model Code:
    9424
    <img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>

Humpty Dumpty had a great fall
<ul cat="zoo">
    Stock Number:
    ARDEN3125
    <br>
        **VIN:
        SHAKAMOSK-230478-UBUN**
    </br>
    Model Code:
    101
    <img class="imgcert" src="/images/Magana_cpo.jpg">
</ul>

All the king's horses and all the king's men
<artifice>
    <baradino>
        Stock Number:
        DERT5178
        <br>
            **VIN:
            Pandaia-67-Moro**
        <br>
        Model Code:
        1234
        <img class="imgcert" src="/images/Pertuis_cpo.jpg">
    </baradino>
    what what what who what
    <somerset who="maugham">
        Nothing to declare
    </somerset>
</artifice>

Couldn't put Humpty Dumpty again
<ending rtf="simi">
    Stock Number:
    ZZZ789
    <br>
        **VIN:
        0000012554-ENDENDEND**
    <br>
    Model Code:
    QS78-9
    <img class="imgcert" src="/images/Sunny_cpo.jpg">
</ending>

qsdjgqsjkdhfqjkdhgfjkqshgdfkjqsdjfkh''' 

The code:

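import re

# Backslashes are doubled because the pattern is written as ordinary
# (non-raw) string literals.
regx = re.compile('<([^ >]+) ?([^>]*)>'
                  '(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)'
                  '.*?\\*\\*VIN:(.+?)\\*\\*.+?</\\1>',
                  re.DOTALL)

# One tuple per match: (tag name, tag attributes, VIN value)
li = [ (mat.group(1),mat.group(2),mat.group(3).strip(' \n\r\t'))
       for mat in regx.finditer(ss) ]

for el in li:
    print('(%-15r, %-25r, %-25r)' % el)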

The result:

('div'          , 'class="class1"'         , '2T2HK31UX9C110701'      )
('ul'           , 'cat="zoo"'              , 'SHAKAMOSK-230478-UBUN'  )
('baradino'     , ''                       , 'Pandaia-67-Moro'        )
('ending'       , 'rtf="simi"'             , '0000012554-ENDENDEND'   )

re.DOTALL is necessary so that the dot also matches the newline character (by default, a dot in a regular expression pattern matches every character except a newline).
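
A small sketch of that difference, using a made-up two-line string in which the VIN value sits on the line after the label:

```python
import re

# The label and the value are separated by a newline, as in the question.
text = "VIN:\n2T2HK31UX9C110701"

# Without re.DOTALL the dot cannot cross the newline, so nothing matches.
without_flag = re.search(r"VIN:.(\w+)", text)

# With re.DOTALL the dot matches the newline and the value is captured.
with_flag = re.search(r"VIN:.(\w+)", text, re.DOTALL)

print(without_flag)
print(with_flag.group(1))
```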

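\\1 is a backreference: it specifies that at this position in the examined string there must be the same text that was captured by the first group, that is to say the part ([^ >]+). It is written with two backslashes because the pattern is an ordinary (non-raw) string literal.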
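
As an illustration of the backreference on its own, here is a simplified pattern (not the one from this answer) applied to a made-up string. It is written as a raw string, so the backreference needs only a single backslash:

```python
import re

# \1 must repeat exactly what group 1 captured (the tag name).
PAIR = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

# "<ul>two</li>" has mismatched tags, so it is skipped.
text = "<div>one</div> <ul>two</li> <p>three</p>"

pairs = PAIR.findall(text)
print(pairs)
```

Only the elements whose closing tag repeats the opening tag name are reported.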

'(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)' is a negative lookahead. It forbids finding any tag other than <br> before the first <br> encountered between the opening tag and the closing tag of an HTML element.
This part is needed to capture the closest tag preceding the VIN, ignoring <br>.
If this part is missing, the regex

regx = re.compile('<([^ >]+) ?([^>]*)>'
                  '.*?\\*\\*VIN:(.+?)\\*\\*.+?</\\1>',
                  re.DOTALL)

catches the following result:

('div'          , 'class="class1"'         , '2T2HK31UX9C110701'      )
('ul'           , 'cat="zoo"'              , 'SHAKAMOSK-230478-UBUN'  )
('artifice'     , ''                       , 'Pandaia-67-Moro'        )
('ending'       , 'rtf="simi"'             , '0000012554-ENDENDEND'   )

The only difference is that 'artifice' is captured instead of 'baradino'.
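
The same effect can be reproduced with a much smaller example. The two patterns below are simplified variants written for illustration, not the regexes from this answer: `NAIVE` has no guard, while `GUARDED` uses a negative lookahead that refuses to step over any opening tag other than `<br>` on the way to the VIN. The sample is a shortened copy of the nested block above, without the `**` markers.

```python
import re

sample = """
<artifice>
    <baradino>
        Stock Number:
        DERT5178
        <br>
        VIN:
        Pandaia-67-Moro
        <br>
    </baradino>
</artifice>
"""

# No guard: the first opening tag found wins, even if it is an outer one.
NAIVE = re.compile(r"<(\w+)[^>]*>.*?VIN:\s*([\w-]+)", re.DOTALL)

# Guarded: between the captured tag and "VIN:", each character is consumed
# only if it does not start an opening tag other than <br>.
GUARDED = re.compile(
    r"<(\w+)[^>]*>(?:(?!<(?!br\b)\w).)*?VIN:\s*([\w-]+)",
    re.DOTALL,
)

naive_result = NAIVE.findall(sample)
guarded_result = GUARDED.findall(sample)
print(naive_result)
print(guarded_result)
```

The unguarded pattern reports the outer `artifice` element, while the guarded one reports `baradino`, the tag that directly encloses the VIN.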

一抹淡然 2024-12-30 14:12:05

For a pure string version that doesn't use any XML/HTML parser, you might try regular expressions (re):

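import re

html_doc = """ <div ...> ... VIN ... </div>"""

# Capture the tag name; \1 requires the matching closing tag after "VIN".
results = re.findall(r'<(\w+)[^>]*>.*?VIN.*?</\1>', html_doc, re.DOTALL)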