preg_replace puzzle: replacing zero or more of a character at the end of the subject

Posted on 2024-09-12 17:41:07

Say $d is a directory path and I want to ensure that it starts and ends with exactly one slash (/). It may initially have zero, one or more leading and/or trailing slashes.

I tried:

preg_replace('%^/*|/*$%', '/', $d);

which works for the leading slash but to my surprise yields two trailing slashes if $d has at least one trailing slash. If the subject is, e.g., 'foo///' then preg_replace() first matches and replaces the three trailing slashes with one slash, and then it matches zero slashes at the end and replaces that with a slash. (You can verify this by replacing the second argument with '[$0]'.) I find this rather counterintuitive.

While there are many other ways to solve the underlying problem (and I implemented one), this became a PCRE puzzle for me: what (scalar) pattern in a single preg_replace does this job?

ADDITIONAL QUESTION (edit)

Can anyone explain why this pattern matches the way it does at the end of the string but does not behave similarly at the start?


Comments (5)

拥抱没勇气 2024-09-19 17:41:07
$path = '/' . trim($path, '/') . '/';

This first removes all slashes at beginning or end and then adds single ones again.
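
As a rough sketch of how that one-liner behaves on a few inputs (the function name normalize_dir_path and the sample paths are made up for illustration, they are not from the answer):

```php
<?php
// Hypothetical helper wrapping the trim() one-liner above.
function normalize_dir_path(string $path): string
{
    // Strip every leading and trailing slash, then add exactly one on each side.
    return '/' . trim($path, '/') . '/';
}

foreach (['foo', '/foo', 'foo///', '//foo/bar//'] as $d) {
    printf("%-14s => %s\n", var_export($d, true), normalize_dir_path($d));
}
```

One thing to keep in mind: an empty string, or a string made only of slashes, comes out as '//' with this approach, so that case may need separate handling if it can occur.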

笑梦风尘 2024-09-19 17:41:07

Given a regex like /* that can legitimately match zero characters, the regex engine has to make sure that it never matches more than once in the same spot, or it would get stuck in an infinite loop. Thus, if it does consume zero characters, the engine jumps forward one position before attempting another match. As far as I know, that's the only situation in which the regex engine does anything on its own initiative.

What you're seeing is the opposite situation: the regex consumes one or more characters, then on the next go-round it tries to start matching at the spot where it left off. Never mind that this particular regex can't match anything but the one character, and it already matched as many of those as it could; it still has the option of matching nothing, so that's what it does.

So, why doesn't your regex match twice at the beginning, like it does at the end? Because of the start anchor (^). If the subject starts with one or more slashes, it consumes them and then tries to match zero slashes, but it fails because it's not at the beginning of the string any more. And if there are no slashes at the beginning, the manual bump-along has the same effect.

At the end of the subject it's a different story. If there are no slashes there, it matches nothing, tries to bump along and fails; end of story. But if it does match one or more slashes, it consumes them and tries to match again--and succeeds because the $ anchor still matches.
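
To make the sequence of matches visible, here is a small sketch that lists every match of the question's pattern on 'foo///' together with its offset. It only uses preg_match_all() with PREG_OFFSET_CAPTURE and the '[$0]' trick from the question; the variable names are arbitrary.

```php
<?php
$pattern = '%^/*|/*$%';   // the pattern from the question
$subject = 'foo///';

// Collect every match together with its byte offset in the subject.
preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
foreach ($m[0] as [$text, $offset]) {
    printf("offset %d: %s\n", $offset, var_export($text, true));
}
// offset 0: ''     (empty match, allowed by ^)
// offset 3: '///'  (the trailing slashes)
// offset 6: ''     (second, empty match at the very end; $ still holds)

echo preg_replace($pattern, '[$0]', $subject), "\n"; // []foo[///][]
echo preg_replace($pattern, '/', $subject), "\n";    // /foo//
```

The third match is the one that produces the extra trailing slash.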

So in general, if you want to prevent this kind of double match, you can either add a condition to the beginning of the match to prevent it, like the ^ anchor does for the first alternative:

preg_replace('%^/*|(?<!/)/*$%', '/', $d);

...or make sure that part of the regex has to consume at least one character:

preg_replace('%^/*|([^/])/*$%', '$1/', $d);

But in this case you have a much simpler option, as demonstrated by John Kugelman: just capture the part you want to keep and chuck the rest.
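
For what it's worth, here is a quick way to try the two patterns above side by side. The patterns are copied from this answer; the function names and test paths are only illustrative.

```php
<?php
// Variant 1: the lookbehind stops the second alternative from matching
// right after a slash, so the empty match at the end is rejected.
function fix_with_lookbehind(string $d): string
{
    return preg_replace('%^/*|(?<!/)/*$%', '/', $d);
}

// Variant 2: the second alternative must consume one non-slash character,
// which is put back via $1.
function fix_with_consumed_char(string $d): string
{
    return preg_replace('%^/*|([^/])/*$%', '$1/', $d);
}

foreach (['foo', '/foo/', '///foo', 'foo///', '//foo/bar//'] as $d) {
    printf(
        "%-14s => %-12s %s\n",
        var_export($d, true),
        fix_with_lookbehind($d),
        fix_with_consumed_char($d)
    );
}
```

On these inputs both variants should give the same normalized path, with exactly one slash at each end.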

So尛奶瓶 2024-09-19 17:41:07
preg_replace('%^/*(.*?)/*$%', '/\1/', $d)
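
A short sketch of why this version does not double up, assuming single-line paths (the function name is hypothetical): the pattern is anchored at both ends, so the whole subject is consumed in one match and there is nothing left for a second, empty match.

```php
<?php
// Hypothetical wrapper around the capture-and-rebuild pattern above.
function wrap_in_slashes(string $d): string
{
    // ^/* eats leading slashes, (.*?) lazily captures the middle,
    // /*$ eats trailing slashes; the replacement rebuilds the path.
    return preg_replace('%^/*(.*?)/*$%', '/\1/', $d);
}

// The anchored pattern matches the subject exactly once.
$count = preg_match_all('%^/*(.*?)/*$%', 'foo///', $matches);

echo $count, "\n";                          // 1
echo wrap_in_slashes('foo///'), "\n";       // /foo/
echo wrap_in_slashes('//foo/bar//'), "\n";  // /foo/bar/
```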
葬﹪忆之殇 2024-09-19 17:41:07

It can be done in a single preg_replace:

preg_replace('/^\/{2,}|\/{2,}$|^([^\/])|([^\/])$/', '\2/\1', $d);
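
A quick check of that pattern on a few sample paths (the function name and inputs are chosen here for illustration). As far as I can tell it handles the usual cases, but a one-character path only gets the leading slash, because the single character is consumed by the third alternative and nothing is left for the fourth.

```php
<?php
// Hypothetical wrapper around the four-alternative pattern above.
function four_way_fix(string $d): string
{
    return preg_replace('/^\/{2,}|\/{2,}$|^([^\/])|([^\/])$/', '\2/\1', $d);
}

foreach (['foo', '/foo', 'foo///', '//foo//', '/foo/', 'a'] as $d) {
    printf("%-10s => %s\n", var_export($d, true), four_way_fix($d));
}
// The first five inputs give /foo/; the last one gives /a (no trailing slash).
```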
债姬 2024-09-19 17:41:07

A small change to your pattern would be to separate out the two key concerns at the end of the string:

  1. Replace multiple slashes with one slash
  2. Replace no slashes with one slash

A pattern for that (and the existing part for matching at the start of the string) would look like:

#^/*|/+$|$(?<!/)#

A slightly less concise, but more precise, option would be to be very explicit about only matching zero or two-or-more slashes; the notion being, why replace one slash with one slash?

#^(?!/)|^/{2,}|/{2,}$|$(?<!/)#

Aside: nikic's suggestion to use trim (to remove leading/trailing slashes, then add your own) is a good one.
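
To compare the two patterns above, here is a sketch that uses preg_replace()'s optional count argument to show how many replacements each one performs. The variable names and sample paths are assumptions for illustration; the replacement string '/' is the one from the question.

```php
<?php
$concise = '#^/*|/+$|$(?<!/)#';
$precise = '#^(?!/)|^/{2,}|/{2,}$|$(?<!/)#';

foreach (['foo', '//foo', 'foo///', '/foo/'] as $d) {
    // The fifth argument receives the number of replacements made.
    $a = preg_replace($concise, '/', $d, -1, $countConcise);
    $b = preg_replace($precise, '/', $d, -1, $countPrecise);
    printf(
        "%-10s concise: %s (%d)  precise: %s (%d)\n",
        var_export($d, true), $a, $countConcise, $b, $countPrecise
    );
}
```

Both patterns should produce /foo/ for every input here. The difference shows up on an already normalized path such as '/foo/': the concise pattern still performs two replacements (a slash for a slash at each end), while the more explicit one performs none.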
