Is there a regular expression to strip specific query variables from a URI?

Posted on 2024-07-23 08:28:41

I have a bunch of HTML that is generated by a daemon using C, XML and XSL. Then I have a PHP script which picks up the HTML markup and displays it on the screen.

I have a huge swathe of XHTML 1 compliant markup. I need to modify all of the links in the markup to remove &utm_source=report&utm_medium=email&utm_campaign=report.

So far I've considered two options.

  1. Do a regex search in the PHP backend which trims out the Analytics code
  2. Write some jQuery to loop through the links and then trim out the Analytics code from the href.

Hurdles:

  1. The HTML can be HUGE, i.e. more than 4 MB (I ran some tests; they average about 100 KB).
  2. It has to be fast. We get approximately 3K

Thoughts?

Right now I'm trying to use str_replace('&utm_source=report&utm_medium=email&utm_campaign=report','',$html); but it's not working.
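
One possible reason the literal str_replace call finds nothing, assuming the markup really is valid XHTML 1: ampersands inside attribute values have to be written as `&amp;`, so the text in the file would be `&amp;utm_source=report&amp;utm_medium=email&amp;utm_campaign=report` rather than the raw string. The sketch below strips both spellings. The function name `stripTracking` and the sample markup are made up for illustration and are not part of the original question.

```php
<?php
// Hypothetical helper: remove the fixed tracking suffix in both its raw and
// its XHTML-escaped (&amp;) spelling.
function stripTracking(string $html): string
{
    $raw = '&utm_source=report&utm_medium=email&utm_campaign=report';
    $escaped = str_replace('&', '&amp;', $raw);

    // str_replace accepts an array of needles; each one is replaced with ''.
    return str_replace([$escaped, $raw], '', $html);
}

// Sample markup (made up for illustration).
$html = '<a href="http://example.com/page?id=7&amp;utm_source=report'
      . '&amp;utm_medium=email&amp;utm_campaign=report">Report</a>';

echo stripTracking($html), "\n";
```

If the escaped form is what the daemon emits, this keeps the approach as cheap as the original one-liner while still matching.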

Comments (8)

莫相离 2024-07-30 08:28:41

You could use sed or some other low-level tool to remove that part:

find /path/to/dir -type f -name '*.html' -exec sed -i 's/&utm_source=report&utm_medium=email&utm_campaign=report//g' {} \;

But that would remove this string anywhere it occurs, not just in URLs. So be careful.

梦年海沫深 2024-07-30 08:28:41

If the string is always the same, the fastest PHP function I've found for that is strtr:

PHP strtr

string strtr ( string $str , string $from , string $to )

$html = strtr($html, array('&utm_source=report&utm_medium=email&utm_campaign=report' => ''));

Obviously you'll need to benchmark the speed, but that should be up there.
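
A rough way to run that benchmark is sketched below. Note that the call above uses the two-argument form of strtr (an array of replacement pairs), not the three-argument character-translation form shown in the signature. The synthetic 4 MB document and the helper name `bestOf` are assumptions for illustration; real timings depend on the PHP version and on the actual markup.

```php
<?php
// Build a synthetic document of roughly 4 MB (made-up content).
$needle = '&utm_source=report&utm_medium=email&utm_campaign=report';
$link = '<a href="http://example.com/?id=1' . $needle . '">x</a>' . "\n";
$html = str_repeat($link, intdiv(4 * 1024 * 1024, strlen($link)));

// Hypothetical helper: run a callable several times and return the best
// wall-clock time in milliseconds.
function bestOf(callable $fn, int $runs = 5): float
{
    $best = INF;
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);                       // monotonic nanoseconds
        $fn();
        $best = min($best, (hrtime(true) - $start) / 1e6);
    }
    return $best;
}

// Both calls should produce the same output; only the speed may differ.
$viaStrtr = strtr($html, [$needle => '']);
$viaStrReplace = str_replace($needle, '', $html);

printf("strtr:       %.2f ms\n", bestOf(fn() => strtr($html, [$needle => ''])));
printf("str_replace: %.2f ms\n", bestOf(fn() => str_replace($needle, '', $html)));
```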

风尘浪孓 2024-07-30 08:28:41

With HTML chunks that big, I'd farm this out to an external process, probably a Perl script.

I'm not positive, since I've never attempted to parse anywhere near that much text, but I'm willing to bet that PHP is not going to do this quickly.

What is your expected load? How often are you going to have to do this type of processing? This sounds like something that you'd do as a batch operation, which, in my admittedly limited experience with such tasks, doesn't necessarily need to be super fast, just fast enough to execute in a reasonable amount of time (i.e., you're not waiting for it overnight or whatever).

南巷近海 2024-07-30 08:28:41

Regex is one way. Alternatively, you could use XPath to find all links within the document and then work on each of them in a loop. Since this is an XHTML document, and assuming it is well formed, this approach seems reasonable.
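
A minimal sketch of the XPath route, assuming the markup is well-formed XHTML in the standard `http://www.w3.org/1999/xhtml` namespace. The sample document and the `x` prefix are made up for illustration. Because XHTML elements are namespaced, a bare `//a` would match nothing, so the prefix has to be registered first.

```php
<?php
// Made-up, well-formed XHTML fragment for illustration.
$source = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a href="http://example.com/a?id=1&amp;utm_source=report&amp;utm_medium=email&amp;utm_campaign=report">A</a>
    <a href="http://example.com/b">B</a>
  </body>
</html>
XML;

$dom = new DOMDocument();
$dom->loadXML($source);

// Bind a prefix to the XHTML namespace so the XPath expression can match.
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

$suffix = '&utm_source=report&utm_medium=email&utm_campaign=report';

// Visit only the links whose href actually contains tracking parameters.
foreach ($xpath->query('//x:a[contains(@href, "utm_")]') as $link) {
    // getAttribute() returns the decoded value, so '&' is a plain ampersand.
    $href = str_replace($suffix, '', $link->getAttribute('href'));
    $link->setAttribute('href', $href);
}

// Collect the resulting hrefs to inspect the outcome.
$hrefs = [];
foreach ($xpath->query('//x:a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```

Parsing a multi-megabyte document into a DOM costs more time and memory than a plain string replacement, so this trades speed for the guarantee that only href attributes are touched.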

土豪 2024-07-30 08:28:41

PHP's preg_replace() will do this quite fast if you run it in CGI mode on the back end (note that PHP has no preg_replace_all(); preg_replace() already replaces every match). Why not use a cron job to run a PHP script periodically and process all your HTML files in advance? Then your front-end PHP script only sends the already-processed content to the browser, without doing any computation.

羅雙樹 2024-07-30 08:28:41

In the end I fell back on str_replace, replacing the string throughout the entire contents of the document :(.

弱骨蛰伏 2024-07-30 08:28:41

I encountered this problem a couple of years ago and came up with the following regex to remove any instances of those utm variables in URLs:

/(\?|\&)?utm_[a-z]+=[^\&]+/

An example usage:

preg_replace('/(\?|\&)?utm_[a-z]+=[^\&]+/', '', 'http://mashable.com/2010/12/14/android-quick-start-guide/?utm_source=twitterfeed&utm_medium=twitter&utm_campaign=Feed%3A+Mashable+%28Mashable%29');

I blogged about the experience here
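
One caveat with the pattern above: when a non-utm parameter follows, the leading `?` is removed along with the first pair, so `http://example.com/?utm_source=a&id=1` becomes `http://example.com/&id=1`. Also, `[^\&]+` only stops at the next ampersand, so on a whole HTML document it can run past the closing quote of the href. The following is a hedged variant for a single URL string, not the answerer's original code; the function name `stripUtm` is hypothetical.

```php
<?php
// Hypothetical helper: remove every utm_* pair from one URL while keeping
// the remaining query string well formed.
function stripUtm(string $url): string
{
    // Remove "utm_x=value" plus one following '&', but only directly after
    // '?' or '&'. The value stops at '&', '#', quotes or whitespace.
    $url = preg_replace('/(?<=[?&])utm_[a-z]+=[^&#"\'\s]*&?/i', '', $url);

    // Drop a '?' or '&' left dangling at the end or just before a fragment.
    return preg_replace('/[?&]+(?=#|$)/', '', $url);
}

echo stripUtm('http://example.com/?utm_source=a&utm_medium=b&id=1'), "\n";
echo stripUtm('http://example.com/?id=1&utm_campaign=c#top'), "\n";
```

This works on one URL at a time, for example an href value read through the DOM; applying it to raw XHTML would additionally require handling the `&amp;` spelling of the separator.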

何必那么矫情 2024-07-30 08:28:41

Not really a RegExp but it may help you (not tested):

$source = '...'; // your XHTML markup

$dom = new DOMDocument();
$dom->loadXML($source);

$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    $href = $link->getAttribute('href');

    // leave links without tracking parameters untouched
    if (strpos($href, 'utm_') === false) {
        continue;
    }

    // split the href into the part before '?' and the query string (if any)
    $parts = explode('?', $href, 2);
    $base = $parts[0];
    $queryString = isset($parts[1]) ? $parts[1] : '';

    // read GET parameters into an array ($params is assigned by reference)
    parse_str($queryString, $params);

    // get rid of unwanted GET params
    unset($params['utm_source'], $params['utm_medium'], $params['utm_campaign']);

    // recompose the query string with a plain '&';
    // the DOM escapes it as &amp; when the document is serialized
    $queryString = http_build_query($params, '', '&');

    // assign the newly cleaned href attribute, omitting '?' when nothing is left
    $link->setAttribute('href', $queryString === '' ? $base : $base . '?' . $queryString);
}

$html = $dom->saveXML();

// strip the XML declaration; it puts IE in quirks mode
$html = preg_replace('/^<\?xml[^>]*\?>\s*/', '', $html);
$html = trim($html);

echo $html;