We have a PHP app with a dynamic URL scheme that requires characters to be percent-encoded, even "unreserved characters" like parentheses or apostrophes, which don't actually need to be encoded. URLs that the app deems to be encoded the "wrong" way are canonicalized and then redirected to the "right" encoding.
But Google and other user agents canonicalize percent-encoding differently, so when Googlebot requests the page it asks for the "wrong" URL, and when it gets back a redirect to the "right" URL, Googlebot refuses to follow the redirect and refuses to index the page.
Yes, this is a bug on our end. The HTTP specs require that servers treat percent-encoded and non-percent-encoded unreserved characters identically. But fixing the problem in the app code isn't straightforward right now, so I was hoping to avoid a code change by using an Apache rewrite rule that would ensure URLs are encoded "properly" from the app's point of view, meaning that apostrophes, parentheses, etc. are all percent-encoded and that spaces are encoded as + and not %20.
Here's one example, where I want to rewrite the first and end up with the second form:
- www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+(Linux)
- www.splunkbase.com/apps/All/4.x/Add-On/app:OPSEC+LEA+for+Check+Point+%28Linux%29
Here's another:
- www.splunkbase.com/apps/All/4.x/app:Benford's+Law+Fraud+Detection+Add-on
- www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on
Here's another:
- www.splunkbase.com/apps/All/4.x/app:Benford%27s%20Law%20Fraud%20Detection%20Add-on
- www.splunkbase.com/apps/All/4.x/app:Benford%27s+Law+Fraud+Detection+Add-on
If the app sees only the second form of these URLs, then it won't send any redirects and Google will be able to index the page.
I'm a newbie with rewrite rules, and it was clear from my read of the mod_rewrite documentation that mod_rewrite does some automatic encoding/decoding which may help or hurt what I want to do, although I'm not sure which.
Any advice for rewrite rules to handle the above cases? I'm OK with a rule for each special character since there aren't many of them, but a single rule (if possible) would be ideal.
Comments (2)
The solution actually may be fairly simple, though it will only work in Apache 2.2 and later due to the use of the B flag. I'm not sure whether it takes care of every case correctly (admittedly, I'm a bit skeptical that it doesn't involve more work than this), but the source code leads me to believe it should.
Keep in mind too that the value of REQUEST_URI is not updated by mod_rewrite transformations, so if your application relies on that value to determine the requested URL, the changes you make won't be visible anyway.
The good news is that this can be done in .htaccess, so you have the option of leaving the main configuration untouched if that works better for you.
So, why is there a need to use the B flag instead of letting mod_rewrite escape the rewritten URL automatically? When mod_rewrite automatically escapes the URL, it uses ap_escape_uri (which apparently has been turned into a macro for ap_os_escape_path for some reason...), a function that escapes a limited subset of characters. The B flag, however, uses an internal module function called escape_uri, which is modeled on PHP's urlencode function.
The implementation of escape_uri in the module suggests that alphanumeric characters and underscores are left as-is, spaces are converted to +, and everything else is converted to its escaped equivalent. This seems to be the behaviour that you want, so presumably it should work.
If not, you do have the option of setting up an external program RewriteMap that could manipulate your incoming URLs into the correct format. This requires manipulating the Apache configuration though, and a renegade script could cause problems for the server on the whole, so I don't consider it an ideal solution if it can be avoided.
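To make the escaping rule described above concrete, here is a minimal Python sketch of it: alphanumerics and underscores pass through, spaces become +, and every other byte is percent-encoded. It is only an illustration of the behaviour as described in this answer, not Apache's actual code. The function names are hypothetical, and the use of uppercase hex digits is an assumption. The second function shows the line-in, line-out loop that an external-program RewriteMap expects (one key per line on stdin, one result per line on stdout, flushed immediately).

```python
import sys


def escape_like_b_flag(text: str) -> str:
    """Escape text following the rule described in the answer (hypothetical helper).

    ASCII letters, digits and underscores are kept, a space becomes '+',
    and every other byte becomes %XX (uppercase hex is an assumption).
    """
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch == "_" or (ch.isascii() and ch.isalnum()):
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


def serve_rewrite_map(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Answer lookups line by line, as an external-program RewriteMap would need."""
    for line in stdin:
        stdout.write(escape_like_b_flag(line.rstrip("\n")) + "\n")
        stdout.flush()  # the server waits for each reply, so never buffer
```

A script used as a map program would simply call serve_rewrite_map() at the bottom. Note that, taken literally, this rule also escapes characters such as /, : and +, so the sketch would likely need to be applied to individual path segments, or adjusted, before it produced the URLs in the question.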
mod_rewrite is not the best tool for this kind of work, because with mod_rewrite you can only replace a fixed number of occurrences at a time. But it is possible:
This will replace one %20, ', (, or ) at a time and respond with a 301 redirect. So if a URL path contains 10 characters that need to be replaced, it takes 10 redirects to do so.
Since this might not be the best solution, it is possible to do all replacements except the last one internally using the N flag, and only the last replacement externally with a redirect:
But using the N flag can be dangerous, as it doesn't increment the internal recursion counter and thus can easily lead to infinite recursion.
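The one-replacement-per-pass behaviour can be simulated outside Apache to see how many passes a given URL needs. The Python sketch below is not a rewrite rule and does not model mod_rewrite itself. It only mimics the logic described above, using a hypothetical mapping built from the four tokens this answer names. Each pass stands for one 301 redirect in the first variant, or one internal restart in the N-flag variant, and the limit parameter is an assumed safety guard of the kind the N flag lacks.

```python
# Hypothetical mapping built from the tokens named in the answer.
REPLACEMENTS = {"%20": "+", "'": "%27", "(": "%28", ")": "%29"}


def replace_one(path: str) -> str:
    """Replace only the leftmost offending token, like a single rule pass."""
    hits = [(path.find(tok), tok) for tok in REPLACEMENTS if tok in path]
    if not hits:
        return path
    pos, tok = min(hits)
    return path[:pos] + REPLACEMENTS[tok] + path[pos + len(tok):]


def count_passes(path: str, limit: int = 100):
    """Return (final_path, passes), stopping at limit to avoid looping forever."""
    passes = 0
    while passes < limit:
        rewritten = replace_one(path)
        if rewritten == path:
            break  # nothing left to replace
        path, passes = rewritten, passes + 1
    return path, passes
```

For the third example URL in the question, count_passes reports four passes, one per %20, which matches the point that every offending character costs its own round trip unless the intermediate passes are kept internal.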