Is there a regular expression to remove specific query variables from a URI?

Posted on 2024-07-23 08:28:41

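I have a bunch of HTML that is generated by a daemon using C, XML, and XSL. A PHP script then picks up the HTML markup and displays it on screen.

It is a large amount of XHTML 1 compliant markup. I need to modify every link in the markup to remove &utm_source=report&utm_medium=email&utm_campaign=report.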

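So far I have considered two options.

Do a regex search in the PHP back end that strips out the Analytics code.
Write some jQuery that loops through the links and removes the Analytics code from each href.

Hurdles:

The HTML can be big, i.e. over 4 MB (I ran some tests; the average is about 100 KB).
It has to be fast. We get around 3K. Thoughts?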

Right now I am trying str_replace('&utm_source=report&utm_medium=email&utm_campaign=report','',$html); but it is not working.
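
Two possible reasons the call above appears to do nothing, offered as guesses rather than a diagnosis: str_replace returns the modified string instead of changing $html in place, and valid XHTML writes the ampersands inside attribute values as &amp;, so the raw string may never occur in the markup. A minimal sketch, assuming a made-up sample link on example.com:

```php
<?php
// Hypothetical sample markup; valid XHTML writes '&' inside attributes as '&amp;'.
$html = '<a href="http://example.com/page?id=7&amp;utm_source=report&amp;utm_medium=email&amp;utm_campaign=report">link</a>';

$raw     = '&utm_source=report&utm_medium=email&utm_campaign=report';
$encoded = str_replace('&', '&amp;', $raw);

// Remove both spellings; str_replace accepts an array of search strings
// and returns the result, so it has to be assigned.
$clean = str_replace(array($encoded, $raw), '', $html);

echo $clean, "\n";
```

The entity-encoded form is listed first so that it is tried before the raw one.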

莫相离 2024-07-30 08:28:41

You could use sed or some other low-level tool to remove that part:

find /path/to/dir -type f -name '*.html' -exec sed -i 's/&utm_source=report&utm_medium=email&utm_campaign=report//g' {} \;

But that would remove the string anywhere it occurs, not just in URLs. So be careful.

梦年海沫深 2024-07-30 08:28:41

If the string is always the same, the fastest PHP function I've found for this is strtr.

PHP strtr

string strtr ( string $str , string $from , string $to )
string strtr ( string $str , array $replace_pairs )

$html = strtr($html, array('&utm_source=report&utm_medium=email&utm_campaign=report' => ''));

Obviously you'll need to benchmark the speed, but it should be up there.

风尘浪孓 2024-07-30 08:28:41

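For HTML blocks that large, I would farm this out to an external process, probably a Perl script.

I'm not positive about this, since I've never tried to parse anywhere near that much text, but I'd be willing to bet that PHP is not going to be very quick at it.

What is your expected load? How often do you need to do this processing? It sounds like something you would run as a batch operation, which, in my limited experience with this kind of task, does not need to be ultra fast, just fast enough to finish in a reasonable time (i.e., you are not waiting overnight for it or anything).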

南巷近海 2024-07-30 08:28:41

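A regular expression is one way to do it. Alternatively, you could use XPath to find all the links in the document and then loop over them, processing each one. Since this is an XHTML document, and assuming it is well formed, that approach seems reasonable.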
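
The XPath idea could look roughly like the sketch below. It assumes PHP's DOM extension is available and that the markup parses as well-formed XML; the sample document and its example.com URLs are made up for illustration.

```php
<?php
// Hypothetical, minimal XHTML fragment used only for illustration.
$source = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        . '<a href="http://example.com/a?x=1&amp;utm_source=report&amp;utm_medium=email&amp;utm_campaign=report">A</a>'
        . '<a href="http://example.com/b">B</a>'
        . '</body></html>';

$dom = new DOMDocument();
$dom->loadXML($source);

$xpath = new DOMXPath($dom);
// local-name() sidesteps the XHTML default namespace.
$links = $xpath->query('//*[local-name()="a"][contains(@href, "utm_")]');

$tracking = '&utm_source=report&utm_medium=email&utm_campaign=report';
foreach ($links as $link) {
    // getAttribute() returns the decoded value, so '&amp;' is already '&' here.
    $link->setAttribute('href', str_replace($tracking, '', $link->getAttribute('href')));
}

// Serialize from the root element so no XML declaration is emitted.
$result = $dom->saveXML($dom->documentElement);
echo $result, "\n";
```

Only links whose href contains "utm_" are visited, which keeps the loop short when most links are already clean. Whether this beats a plain string replacement on a 4 MB document would need to be measured.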

土豪 2024-07-30 08:28:41

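If you run PHP's preg_replace() (which replaces every match by default) on the back end in CGI mode, it will do this very quickly. Why not run a PHP script from a cron job every so often to process all the HTML? That way your front-end PHP script only sends the already processed content to the browser and does no computation itself.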

羅雙樹 2024-07-30 08:28:41

I ended up falling back to str_replace and replacing the string across the entire contents of the document :(.

弱骨蛰伏 2024-07-30 08:28:41

I encountered this problem a couple of years ago and came up with the following regex to replace any instances of those utm variables in URLs:

/(\?|\&)?utm_[a-z]+=[^\&]+/

An example usage:

preg_replace('/(\?|\&)?utm_[a-z]+=[^\&]+/', '', 'http://mashable.com/2010/12/14/android-quick-start-guide/?utm_source=twitterfeed&utm_medium=twitter&utm_campaign=Feed%3A+Mashable+%28Mashable%29');

I blogged about the experience.
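
The pattern above is written for a single URL. Run over a whole HTML document, [^\&]+ would keep matching past the closing quote of the attribute, and removing a leading ?utm_... pair that is followed by other parameters leaves them without a "?". The sketch below is one way to work around both, not the author's method: it only touches double-quoted href values and also accepts &amp; as a separator. The helper names stripUtm and stripUtmFromHrefs are hypothetical, as is the sample markup.

```php
<?php
// Remove utm_* parameters from one href value (hypothetical helper name).
function stripUtm($url)
{
    // Drop each utm_* pair plus the separator that follows it, if any.
    $url = preg_replace('/(?<=\?|&|&amp;)utm_[a-z]+=[^&#]*(?:&amp;|&)?/i', '', $url);
    // Remove a '?' or '&' left dangling at the end of the query string.
    return preg_replace('/(?:\?|&amp;|&)+(?=#|$)/', '', $url);
}

// Only touch text inside double-quoted href attributes (hypothetical helper name).
function stripUtmFromHrefs($html)
{
    return preg_replace_callback(
        '/(\bhref=")([^"]*)(")/i',
        function ($m) {
            return $m[1] . stripUtm($m[2]) . $m[3];
        },
        $html
    );
}

// Made-up sample: the paragraph text must survive, the link must be cleaned.
$html = '<p>Text &amp;utm_source=report stays.</p>'
      . '<a href="http://example.com/?utm_source=report&amp;utm_medium=email&amp;id=7">x</a>';

$out = stripUtmFromHrefs($html);
echo $out, "\n";
```

Because the callback runs once per href, the cost grows with the number of links rather than with every occurrence of the string in the page, but it is still worth benchmarking against the plain str_replace approach on the larger documents.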

何必那么矫情 2024-07-30 08:28:41

Not really a RegExp, but it may help you (not tested):

$source = '...'; // your business

$dom = new DOMDocument();
$dom->loadXML($source);

$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    // skip anchors that have no href at all
    if (!$link->hasAttribute('href')) {
        continue;
    }

    // split once on '?'; pad so links without a query string still yield two values
    list($base, $queryString) = array_pad(explode('?', $link->getAttribute('href'), 2), 2, '');

    // read GET parameters into an array (the second argument is assigned by reference)
    parse_str($queryString, $params);

    // get rid of unwanted GET params
    unset($params['utm_source'], $params['utm_medium'], $params['utm_campaign']);

    // recompose the query string; the DOM escapes '&' as '&amp;' when serializing
    $queryString = http_build_query($params, '', '&');

    // assign the newly cleaned href attribute, without a dangling '?'
    $link->setAttribute('href', $queryString === '' ? $base : $base . '?' . $queryString);
}

$html = $dom->saveXML();

// strip the XML declaration. Puts IE in quirks mode
$html = trim(preg_replace('/^<\?xml[^>]*\?>/', '', $html));

echo $html;
