Regex replacement of text nodes in an HTML document
I have a string that represents an HTML document. I'm trying to replace text in that document with some replacement HTML, excluding the markup and attribute values of course. I thought it would be simple, but it is incredibly tedious when you want to replace the text with markup. For example, to replace somekeyword with <a href="link">somekeyword</a>.
from lxml.html import fragments_fromstring, fromstring, tostring
from re import compile

def markup_aware_sub(pattern, repl, text):
    exp = compile(pattern)
    root = fromstring(text)
    els = [el for el in root.getiterator() if el.text]
    els = [el for el in els if el.text.strip()]
    for el in els:
        text = exp.sub(repl, el.text)
        if text == el.text:
            continue
        parent = el.getparent()
        new_el = fromstring(text)
        new_el.tag = el.tag
        for k, v in el.attrib.items():
            new_el.attrib[k] = v
        parent.replace(el, new_el)
    return tostring(root)

markup_aware_sub('keyword', '<a>blah</a>', '<div><p>Text with keyword here</p></div>')
It works, but only if the keyword is exactly two "nestings" down. There has to be a better way to do this, but after many hours of googling I can't find anything.
Comments (1)
This might be the solution you are looking for:
This will give you the following output:
The problem with your lxml approach seems to occur only when the keyword has a single level of nesting. It seems to work fine with multiple levels of nesting. So I added an if condition to catch this case.
Not very elegant, but it seems to work. Please check it out.
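For comparison, here is a minimal sketch of a different technique from the lxml one discussed above. It streams the document through the standard library's html.parser, writes the markup back out as it was read, and applies the regex only to text nodes, so nesting depth does not matter and attribute values are never touched. The class name TextNodeSubber is a hypothetical name chosen for illustration; markup_aware_sub keeps the signature from the question. Treat it as a sketch under those assumptions, not a drop-in replacement. A match that spans an entity such as &amp; will not be found, because the text is delivered in pieces around each entity.

```python
import re
from html.parser import HTMLParser


class TextNodeSubber(HTMLParser):
    """Hypothetical helper: re-emit markup, substitute only in text nodes."""

    # Script and style contents are code, not prose, so they are left alone.
    RAW_TEXT_TAGS = {"script", "style"}

    def __init__(self, pattern, repl):
        # convert_charrefs=False reports entities as separate events,
        # so they can be written back the way they appeared in the input.
        super().__init__(convert_charrefs=False)
        self.exp = re.compile(pattern)
        self.repl = repl
        self.out = []
        self.in_raw = False

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag as written, attributes included.
        self.out.append(self.get_starttag_text())
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw = False

    def handle_data(self, data):
        # The only place where the substitution is applied.
        if self.in_raw:
            self.out.append(data)
        else:
            self.out.append(self.exp.sub(self.repl, data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def markup_aware_sub(pattern, repl, text):
    """Apply pattern -> repl to the text nodes of an HTML string."""
    parser = TextNodeSubber(pattern, repl)
    parser.feed(text)
    parser.close()  # flush any text still buffered at the end of input
    return "".join(parser.out)


print(markup_aware_sub("keyword", '<a href="link">keyword</a>',
                       "<div><p>Text with keyword here</p></div>"))
# prints: <div><p>Text with <a href="link">keyword</a> here</p></div>
```

Because the output is rebuilt from the parser's events rather than from a tree, end tags come back lowercased and malformed input is passed through rather than repaired. If a cleaned-up tree is wanted, a tree-based parser remains the better fit.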