Question about the PHP function preg_replace
I want to dynamically remove specific tags and their content from an HTML file and thought of using preg_replace, but I can't get the syntax right. Basically it should, for example, do something like this:
Replace everything between (and including) "" with nothing.
Could anybody help me out with this, please?
Comments (5)
Easy, dude.
To get an ungreedy regular expression, use the U modifier.
To let . match newlines as well, so that a match can span several lines, use the s modifier.
Knowing that, to remove all paragraphs, use a pattern made of these three parts, with both modifiers applied:
<p[^>]*> : the part detecting an opening paragraph tag (possibly with attributes, such as a hypothetical style)
(.*)? : everything in between (in "ungreedy mode")
</p> : obviously, the closing paragraph tag
Hope that helps!
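The three parts explained above can be put together roughly as follows. This is only a sketch: the # delimiter, the variable names, and the sample HTML are assumptions, not taken from the original answer.

```php
<?php
// Pattern assembled from the three parts described above.
// Assumptions: "#" as delimiter, and $html / $result as variable names.
//   s : "." also matches newlines, so a paragraph may span several lines
//   U : quantifiers are ungreedy by default, so the match stops at the first </p>
$pattern = '#<p[^>]*>(.*)?</p>#sU';

// Hypothetical sample input.
$html = "<h1>Title</h1>\n<p class=\"intro\">First\nparagraph</p>\n<p>Second</p>\n<div>kept</div>";

$result = preg_replace($pattern, '', $html);

echo $result;
```

Note that <p[^>]* also matches any other tag whose name starts with "p" (for example <pre>), so a stricter pattern such as <p\b[^>]*> may be preferable.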
If you are trying to sanitize your data, it is usually recommended to use a whitelist of allowed tags rather than blacklisting certain terms and tags. A whitelist is easier to get right and is more effective at preventing XSS attacks. There's a well-known library called HTML Purifier that, although large and somewhat slow, gives excellent results when purifying your data.
I would suggest not trying to do this with a regular expression. A safer approach would be to use something like Simple HTML DOM.
Here is the link to the API Reference: Simple HTML DOM API Reference
Another option would be to use DOMDocument.
The idea here is to use a real HTML parser to parse the data; you can then traverse the tree and remove whichever elements, attributes, or text you need to. This is a much cleaner approach than trying to use a regular expression to replace data within the HTML.
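As a rough illustration of the parser-based approach, here is a minimal sketch using PHP's built-in DOMDocument. The function name removeElements and the sample markup are hypothetical, and the sketch assumes the input is an HTML fragment in plain ASCII (loadHTML needs extra care for other encodings).

```php
<?php
// Hypothetical helper: removes every <$tagName> element together with its content.
function removeElements(string $html, string $tagName): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
    foreach ($nodes as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize only the children of <body>, dropping the wrapper added above.
    $body = $doc->getElementsByTagName('body')->item(0);
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}

// Usage with assumed sample input: the <div> and everything inside it is dropped.
$clean = removeElements('<p>keep</p><div class="ad">drop <b>me</b></div><p>also keep</p>', 'div');
echo $clean;
```

Unlike a regular expression, this handles nested tags and attributes containing ">" correctly, because the parser builds the tree before anything is removed.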
If you don't know what is between the tags, Phill's response won't work.
The case where there are no other tags in between is definitely the easier one to handle with a regular expression. You can replace div with whatever tag you need, obviously.
If there could be other tags in the middle, a regular expression should still work, but it could cause problems. If so, you're probably better off going with the DOM solution above.
Neither approach is tested.
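The answer's own snippets did not survive, so the following is only a sketch of what the two cases might look like. The patterns, variable names, and sample input are assumptions, not the answerer's code.

```php
<?php
// Assumed sample input; the tag name "div" is only an example.
$html = '<p>keep</p><div class="x">plain text</div><div>with <b>nested</b> tag</div>';

// Case 1: no other tags between the opening and closing tag.
// [^<]* refuses to cross any "<", so a div containing another tag is left alone.
$simple = preg_replace('#<div\b[^>]*>[^<]*</div>#i', '', $html);

// Case 2: other tags may appear in between.
// .*? is lazy and the s modifier lets "." match newlines.
// Caveat: with nested <div> elements the match ends at the first </div>,
// leaving stray markup behind, which is why a DOM parser is safer.
$lazy = preg_replace('#<div\b[^>]*>.*?</div>#is', '', $html);

echo $simple, "\n", $lazy, "\n";
```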
PSEUDO CODE
HTML Before
HTML After
I know it's a hack job