Need to modify non-greedy grep behavior

Posted on 2024-12-04 18:20:26

I am attempting to clean out a ton of spam that was injected into a client's blog. One of the issues is that the hack that did the injection left behind malformed links nested inside other links, so I am having trouble grabbing them in a concise way.

My thought was to dump all of the links in the posts table into a text file, remove the valid ones from that list, and from there create a bash script that removes the malicious ones one line at a time. I was trying to use a non-greedy grep to dump the links, because a greedy match on a post with more than one link runs from the start of the first link to the end of the last one. This is the line I was using:

grep -Po "<a href=[\'\"][^\'\"]*[\'\"]>.*?</a>" wp_posts.sql>full-link-list.txt

The problem is happening when it tries to parse links embedded within other links. For instance, I get this:

<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a>

from a section like this:

<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a>  do you buy viagra | buy cialis phentermine | cheap levitra online</a>

Not all links are broken like this, though, and if I remove only the partial matches output by the command above, I think it will be very difficult to track down the leftover debris. What I think I need is either something that grabs the whole block (i.e. matches each opening <a href with its own closing </a>, so the counts balance), or something that grabs just the smallest inner match possible (i.e. works from the inside out), which I would then run in multiple passes. I am open to other suggestions too. Any thoughts on this? Thanks!
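
To make the two ideas concrete, here is a minimal sketch of both, not a tested fix for the real dump. It assumes GNU grep built with PCRE support (the same -P switch used above) and runs against the sample line from this post rather than wp_posts.sql. The variable names sample, innermost_re, balanced_re, innermost and balanced are made up for illustration, and \x27 is just the PCRE hex escape for a single quote so the patterns can sit inside single-quoted shell strings.

```shell
#!/usr/bin/env bash
# Sample line from the post: one spam link nested inside another.
sample='<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a>  do you buy viagra | buy cialis phentermine | cheap levitra online</a>'

# Innermost match: (?:(?!<a ).)*? only consumes a character when no new
# "<a " starts there, so a match cannot begin at a link that wraps another.
innermost_re='<a href=[\x27"][^\x27"]*[\x27"]>(?:(?!<a ).)*?</a>'

# Whole block: (?R) recurses into the full pattern for each nested link,
# so every opening <a href is paired with its own closing </a>.
balanced_re='<a href=[\x27"][^\x27"]*[\x27"]>(?:(?!</?a[ >]).|(?R))*</a>'

innermost=$(printf '%s\n' "$sample" | grep -Po "$innermost_re")
balanced=$(printf '%s\n' "$sample" | grep -Po "$balanced_re")

printf 'innermost: %s\n' "$innermost"
printf 'balanced:  %s\n' "$balanced"
```

On the sample, the first pattern yields only the inner clinesite.com link and the second yields the whole nested block. Both are still line-based, so a link that spans several lines of the dump would not be caught by either one.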