Finding the parent tag of a text string with ElementTree/lxml

Posted 2024-07-24 09:33:13

I'm trying to take a string of text, and "extract" the rest of the text in the paragraph/document from the html.

My current approach is to find the "parent tag" of the string in HTML that has been parsed with lxml. (If you know of a better way to tackle this problem, I'm all ears!)

For example, search the tree for "TEXT STRING HERE" and return the "p" tag. (note that I won't know the exact layout of the html beforehand)

<html>
<head>
...
</head>
<body>
.... 
<div>
...
<p>TEXT STRING HERE ......</p>
...
</html>

Thanks for your help!

娇妻 2024-07-31 09:33:13

This is a simple way to do it with ElementTree. It does require that your HTML input is valid XML (so I have added the appropriate end tags to your HTML):

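import xml.etree.ElementTree as ET

html = """<html>
<head>
</head>
<body>
<div>
<p>TEXT STRING HERE ......</p>
</div>
</body>
</html>"""

# Walk every element in the tree and report those whose text contains the string.
for e in ET.fromstring(html).iter():
    # e.text is None for elements with no leading text, so guard before searching.
    if e.text and 'TEXT STRING HERE' in e.text:
        print("Found string %r, element = %r" % (e.text, e))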
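The question also asks for the enclosing ("parent") tag and the rest of the paragraph text. The following is a minimal sketch of one way to get both with the standard library, assuming the input is well-formed XML as in the answer above. The helper name `find_with_parents` is hypothetical, and the parent lookup is built by hand because standard-library ElementTree elements keep no reference to their parent.

```python
import xml.etree.ElementTree as ET

HTML = """<html>
<head>
</head>
<body>
<div>
<p>TEXT STRING HERE ......</p>
</div>
</body>
</html>"""


def find_with_parents(root, needle):
    """Yield (element, parent) pairs for elements whose own text contains needle.

    Hypothetical helper for illustration; only elem.text is checked, not elem.tail.
    """
    # Map every child element to its parent so we can look upward in the tree.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in root.iter():
        # elem.text is None when an element has no leading text, hence `or ""`.
        if needle in (elem.text or ""):
            yield elem, parent_of.get(elem)


root = ET.fromstring(HTML)
matches = list(find_with_parents(root, "TEXT STRING HERE"))
for elem, parent in matches:
    # The matching element's tag, and the tag of the element that contains it.
    print(elem.tag, "inside", parent.tag if parent is not None else None)
    # All text nested in the matching element, i.e. the rest of the paragraph.
    print("".join(elem.itertext()))
```

If the document is parsed with lxml instead, elements provide a `getparent()` method, so the hand-built lookup should not be needed. This sketch will not find text that follows a child element (stored in `tail`), which may matter for paragraphs containing inline tags.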
