How can I use XPath and lxml to select these elements from the following horrible HTML?

Posted on 2024-10-03 13:40:19

I want to select the following strings from this HTML using just lxml and some clever XPath. The strings will change, but the surrounding HTML will not.

I need...

  • 19/11/2010
  • AAAAAA/01
  • Normal
  • United Kingdom
  • This description may contains <bold>html</bold> but i still need all of it!

from...

...
<p>
    <strong>Date:</strong> 19/11/2010<br>
    <strong>Ref:</strong> AAAAAA/01<br>
    <b>Type:</b> Normal<br>
    <b>Country:</b> United Kingdom<br>
</p>
<hr>
<p>
    <br>
    <b>1. Title:</b> The Title<br>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br>
    <b>3. Date:</b> 25th October<br>
...

</p>

...

So far I've only come up with using regular expressions and re:match to try to drag it out, but even that won't work without something that lets me get the innerHTML of the <p> nodes, for example.

Is there any way to do this without post-processing the string through regex?

Thanks :)


Comments (2)

眼角的笑意。 2024-10-10 13:40:19

Very ugly! With this well-formed input:

<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
<hr/>
<p>
    <br/>
    <b>1. Title:</b> The Title<br/>
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/>
    <b>3. Date:</b> 25th October<br/>
</p>
</html>

Simplest case:

/html/p/strong[.='Date:']/following-sibling::text()[1]

Evaluates to:

 19/11/2010
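
Note that the path starts at /html/p only because the input was rewritten as well-formed XML. As a rough sketch of running the same idea against the markup as originally posted, lxml's HTML parser can be left to repair the unclosed <br> tags. The names RAW_HTML and date_value, and the switch to a relative //p path, are assumptions of this sketch rather than part of the answer:

```python
import lxml.html

# The question's markup as posted: <br> is unclosed, so this is HTML, not XML.
RAW_HTML = """
<p>
    <strong>Date:</strong> 19/11/2010<br>
    <strong>Ref:</strong> AAAAAA/01<br>
    <b>Type:</b> Normal<br>
    <b>Country:</b> United Kingdom<br>
</p>
<hr>
"""

# The HTML parser wraps the fragment in <html><body>, so a relative //p path
# is used here instead of the absolute /html/p path (an assumption of this sketch).
root = lxml.html.document_fromstring(RAW_HTML)
matches = root.xpath("//p/strong[.='Date:']/following-sibling::text()[1]")

# The selected text node keeps its leading space, so strip it in Python.
date_value = matches[0].strip()
print(date_value)
```

The stripping is done in Python because the text node itself still contains the space that follows the closing </strong> tag.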

All of those in one:

/html/p/*[self::strong[.='Date:' or .='Ref:']|
          self::b[.='Type:' or .='Country:']]
         /following-sibling::text()[1]
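
For what it's worth, here is a minimal sketch of evaluating that combined expression with lxml, assuming the well-formed input above is parsed with etree.fromstring. The helper name extract_fields and the constant names are made up for illustration:

```python
from lxml import etree

DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
</html>"""

# Same expression as above, split across Python string literals.
FIELDS_XPATH = (
    "/html/p/*[self::strong[.='Date:' or .='Ref:']"
    " | self::b[.='Type:' or .='Country:']]"
    "/following-sibling::text()[1]"
)


def extract_fields(xml_text):
    """Return the text that follows each label, in document order."""
    root = etree.fromstring(xml_text)
    # lxml returns text nodes as string-like objects; strip the
    # leading space that follows each closing label tag.
    return [str(value).strip() for value in root.xpath(FIELDS_XPATH)]


print(extract_fields(DOC))
```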

The complex one:

/html/p/node()[preceding-sibling::b[1][.='2. Description: ']]
              [following-sibling::b[1][.='3. Date:']]
              [not(self::br)]

街角卖回忆 2024-10-10 13:40:19

This isn't so difficult.

Given this XML document:

<html> 
<p> 
    <strong>Date:</strong> 19/11/2010<br/> 
    <strong>Ref:</strong> AAAAAA/01<br/> 
    <b>Type:</b> Normal<br/> 
    <b>Country:</b> United Kingdom<br/> 
</p> 
<hr/> 
<p> 
    <br/> 
    <b>1. Title:</b> The Title<br/> 
    <b>2. Description: </b> This description may contains <bold>html</bold> but i still need all of it!<br/> 
    <b>3. Date:</b> 25th October<br/> 
</p> 
</html> 

I need...

  • 19/11/2010
  • AAAAAA/01
  • Normal
  • United Kingdom

This XPath expression selects all of the above text nodes (together with the whitespace-only text nodes between the elements):

/*/p[1]/text()
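
A small sketch of how this might look from lxml, assuming the document above is parsed as XML. The extra [normalize-space()] predicate is an addition of this sketch (not part of the expression above) that drops the whitespace-only text nodes:

```python
from lxml import etree

DOC = """<html>
<p>
    <strong>Date:</strong> 19/11/2010<br/>
    <strong>Ref:</strong> AAAAAA/01<br/>
    <b>Type:</b> Normal<br/>
    <b>Country:</b> United Kingdom<br/>
</p>
</html>"""

root = etree.fromstring(DOC)

# Every text node child of the first <p>, including the indentation
# between the elements.
all_text = root.xpath("/*/p[1]/text()")

# Keep only text nodes that contain something other than whitespace.
non_blank = root.xpath("/*/p[1]/text()[normalize-space()]")

values = [str(t).strip() for t in non_blank]
print(values)
```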

  • This description may contains html but i still need all of it!

Use this:

/*/p[2]/b[2]/following-sibling::node()
                 [count(.|/*/p[2]/b[2]/following-sibling::br[1]/preceding-sibling::node()) 
                = 
                  count((/*/p[2]/b[2]/following-sibling::br[1]/preceding-sibling::node()))
                 ]