I am attempting to clean out a ton of spam that was injected into a client's blog. One of the issues is that the hack that did the injection left behind malformed, multiply nested links, so I am having trouble grabbing them in a concise way.
My thought was to dump all of the links in the posts table into a text file, then remove the valid ones from that list, and from there create a bash script that removed the malicious ones one line at a time. I was trying to use a non-greedy grep to dump the links, otherwise in cases where there was more than one link in the post it would go from the start of the first link to the end of the last one. This is the line I was using:
grep -Po "<a href=[\'\"][^\'\"]*[\'\"]>.*?</a>" wp_posts.sql>full-link-list.txt
The problem is happening when it tries to parse links embedded within other links. For instance, I get this:
<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a>
from a section like this:
<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a> do you buy viagra | buy cialis phentermine | cheap levitra online</a>
Not all links are broken like this, though, and if I clean out only the ones output by the command above, I think it will be very difficult to track down the debris left behind. What I think I need is either something that grabs the whole block (i.e. matching each opening <a href with the same number of closing </a>), or just the smallest inner match possible (i.e. working from the inside out), which I would then run in multiple passes. I am open to other suggestions too. Any thoughts on this? Thanks!
Comments (1)
I think the inside-out approach is your best bet. Assuming there are no other tags inside the <a> elements, it should be as simple as changing the .*? to [^<>]* and, as you said, making multiple passes.
While it is possible in many regex flavors to match the whole nested structure in one pass, every flavor does it differently, and it's always ugly.
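The multiple passes could look roughly like the sketch below. It is only an illustration under a few assumptions: the sample text is the nested link from the question held in a variable rather than read from wp_posts.sql, the names pattern, text and links are made up for the example, and -E is used instead of -P because [^<>]* no longer needs a lazy quantifier. Each pass lists the innermost links, then strips them from a working copy so that the enclosing links become innermost on the next pass.

```shell
#!/bin/sh
# Innermost-link pattern: the link text may not contain < or >,
# so a match can never run across a nested <a ...> tag.
pattern="<a href=['\"][^'\"]*['\"]>[^<>]*</a>"

# Sample input copied from the question (assumption: a real run would
# work on a copy of wp_posts.sql instead of a variable).
text='<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a> do you buy viagra | buy cialis phentermine | cheap levitra online</a>'

nl='
'
links=""

# Repeat until no innermost link is left in the working copy.
while printf '%s\n' "$text" | grep -qE "$pattern"; do
    # Pass step 1: collect the links that are innermost right now.
    found=$(printf '%s\n' "$text" | grep -oE "$pattern")
    links="${links}${found}${nl}"
    # Pass step 2: strip them so the enclosing links are exposed.
    text=$(printf '%s\n' "$text" | sed -E "s#${pattern}##g")
done

# One collected link per line, innermost first.
printf '%s' "$links"
```

With this input the first pass picks up the clinesite.com link and the second pass picks up the blogtorn.com link, which by then no longer contains the inner one. Whatever remains in text after the loop is material that was not inside any well-formed link, so it is a reasonable place to look for leftover debris.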