Extracting context from a set point in the middle of an HTML file

Posted on 2024-08-20 06:38:36

I have some HTML, and I'm extracting a snippet at a certain point (an inline image), but I'd like to show some context around this image.

I'm using PHP, and I know that both Symfony and WordPress provide functions for handling text that has been cut off in the middle of some HTML (they close all open tags), but nothing for the other direction, where a snippet starts partway through an element.

So, in the case of:

 'Snippet of text and a <a href="#moo">link right her'

I can use the above-mentioned functions to fix it, but what about:

'nk right here</a> and then more text after the link.'

I've considered the possibility that even the tag-closing approach is the wrong way to go about this, and that I should instead parse the HTML and use XPath. However, I can't find any examples or mentions of using XPath to create snippets like this.
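To make the parse-first alternative concrete, here is a minimal sketch, written in Python with only the standard library for illustration. It is not XPath and not the Symfony or WordPress helpers: it streams through the markup, keeps only the text, and records the text offset at which the image appears, so the context can be cut from plain text and no tag is ever split. The names `ContextCollector`, `text_context`, and `width`, and the `cat.png` source, are assumptions made up for this example.

```python
from html.parser import HTMLParser


class ContextCollector(HTMLParser):
    """Collect plain text and remember where the target <img> occurs."""

    def __init__(self, target_src):
        super().__init__(convert_charrefs=True)
        self.target_src = target_src
        self.parts = []          # text chunks in document order
        self.length = 0          # number of characters collected so far
        self.img_offset = None   # text offset of the first matching image

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> tags.
        if tag == "img" and dict(attrs).get("src") == self.target_src:
            if self.img_offset is None:
                self.img_offset = self.length

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)


def text_context(html, target_src, width=40):
    """Return (before, after) plain-text context around the image, or None."""
    collector = ContextCollector(target_src)
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    if collector.img_offset is None:
        return None
    text = "".join(collector.parts)
    at = collector.img_offset
    return text[max(0, at - width):at], text[at:at + width]


sample = (
    '<div class="post"><p>Snippet of text and a '
    '<a href="#moo">link right here</a> and then</p>'
    '<p><img src="cat.png"> more text after the link.</p></div>'
)
print(text_context(sample, "cat.png", width=20))
```

The trade-off is that the context comes back as plain text, so links and other inline markup around the image are lost. Whitespace between block elements is also not normalized here.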

Update:

So my current idea is:

Move up the parse tree until I reach the tag that encloses all the content (div class=post in my case). The last node I visited before reaching this div is the starting point (most likely a p tag).
From here, get the previous sibling (which should again be a p tag).
Descend into this node and take its last child, saving the text content to a temporary string. Keep stepping back through the children until there is enough text for a snippet.

This still isn't ideal, as I'm not sure how far I'll have to step down to get the text content.
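The steps above can be sketched roughly as follows. This is a minimal illustration in Python using the standard-library `xml.dom.minidom`, so it assumes well-formed markup; real-world HTML would need a more forgiving parser. It is a variation on the idea rather than an exact transcription: it takes the full text content of each preceding sibling instead of descending child by child, which sidesteps the question of how far down to step. The helper names `node_text` and `text_before` and the `min_chars` parameter are hypothetical. PHP's DOM extension exposes the same navigation through `previousSibling`, `parentNode`, and `textContent`.

```python
from xml.dom import Node, minidom


def node_text(node):
    """Concatenate all text beneath a node, in document order."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def text_before(image, container, min_chars=40):
    """Walk backward from `image` and return the last `min_chars` of text."""
    chunks = []      # text chunks, nearest to the image first
    collected = 0
    node = image
    # Climb toward the container; at each level, read the siblings that
    # precede the current node, nearest one first.
    while node is not None and node is not container and collected < min_chars:
        sibling = node.previousSibling
        while sibling is not None and collected < min_chars:
            text = node_text(sibling)
            chunks.append(text)
            collected += len(text)
            sibling = sibling.previousSibling
        node = node.parentNode
    if min_chars <= 0:
        return ""
    # Chunks were gathered back to front, so reverse before joining.
    return "".join(reversed(chunks))[-min_chars:]


doc = minidom.parseString(
    '<div class="post">'
    '<p>First paragraph with a <a href="#moo">link right here</a> and more.</p>'
    '<p>Lead-in text <img src="cat.png"/> trailing text.</p>'
    '</div>'
)
image = doc.getElementsByTagName("img")[0]
print(text_before(image, doc.documentElement, min_chars=30))
```

The same loop run over `nextSibling` would collect the text that follows the image. Text from adjacent paragraphs is joined without inserting a separator, so a real implementation would probably want to normalize whitespace at block boundaries.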

Does anyone know of an implementation of this idea anywhere?
