I'm creating a site that allows users to add Keyword --> URL links. I want multiple users to be able to link to the same url (exactly the same, same object instance).
So if user 1 types in "http://www.facebook.com/index.php" and user 2 types in "http://facebook.com" and user 3 types in "www.facebook.com" how do I best "convert" them to what these all resolve to: "http://www.facebook.com/"
The back end is in Python...
How does a search engine keep track of URLs? Do they keep a URL and then take whatever it resolves to, or do they toss URLs that are different from what they resolve to and just care about the resolved version?
Thanks!!!
You'd resolve user 3 by fixing up invalid URLs. www.facebook.com isn't a URL, but you can guess that http:// should go on the start. An empty path part is the same as the / path, so you can be sure that needs to go on the end too. A good URL parser should be able to do this bit.
You could resolve user 2 by making an HTTP HEAD request to the URL. If it comes back with a status code of 301, you've got a permanent redirect to the real URL in the Location response header. Facebook does this to send facebook.com traffic to www.facebook.com, and it's definitely something that sites should be doing (even though in the real world many aren't). You might consider allowing other redirect status codes in the 3xx family to do the same; it's not really the right thing to do, but some sites use 302 instead of 301 for the redirect because they're a bit thick.
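Here's one way that HEAD check might look in Python with the standard-library http.client; it's only a sketch under simplifying assumptions (no error handling, no redirect chains, and real code might just use the requests library instead):

    import http.client
    from urllib.parse import urlsplit, urljoin

    def follow_permanent_redirect(url):
        """HEAD the URL; if the server answers 301, return the URL it points at."""
        parts = urlsplit(url)
        conn_cls = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
        conn = conn_cls(parts.netloc, timeout=10)
        try:
            conn.request("HEAD", parts.path or "/")
            response = conn.getresponse()
            if response.status == 301:  # permanent redirect only
                location = response.getheader("Location")
                if location:
                    return urljoin(url, location)  # Location may be relative
            return url
        finally:
            conn.close()

    # follow_permanent_redirect("http://facebook.com/") would return the www URL if the site sends a 301.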
If you have the time and network resources (plus more code to prevent the feature being abused to DoS you or others), you could also consider GETting the target web page and parsing it (assuming it turns out to be HTML). If there is a <link rel="canonical" href="..." /> element in the page, you should also treat that URL as being the proper one. (View Source: Stack Overflow does this.)
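If you do go down the fetch-and-parse route, the canonical link can be pulled out with the standard-library html.parser; this is just a sketch (a real crawler would probably reach for BeautifulSoup or lxml and be far more defensive about broken markup):

    from html.parser import HTMLParser

    class CanonicalLinkFinder(HTMLParser):
        """Remembers the href of the first <link rel="canonical"> element seen."""
        def __init__(self):
            super().__init__()
            self.canonical = None

        def handle_starttag(self, tag, attrs):
            if tag == "link" and self.canonical is None:
                attrs = dict(attrs)
                if (attrs.get("rel") or "").lower() == "canonical":
                    self.canonical = attrs.get("href")

    def find_canonical(html_text):
        parser = CanonicalLinkFinder()
        parser.feed(html_text)
        return parser.canonical  # None if the page doesn't declare one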
However, unfortunately, user 1's case cannot be resolved. Facebook is serving a page at / and a page at /index.php, and though we can look at them and say they're the same, there is no technical method to describe that relationship. In an ideal world Facebook would include either a 301 redirect response or a <link rel="canonical" /> to tell people that / was the proper format URL to access a particular resource rather than /index.php (or vice versa). But they don't, and in fact most database-driven web sites don't do this yet either.

To get around this, some search engines(*) compare the content at different [sub]domains, and to a limited extent also different paths on the same host, and guess that they're the same if the content is sufficiently similar. Of course this is a lot of work, requires a lot of storage and processing, and is ultimately not terribly reliable.
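Purely to illustrate the kind of "sufficiently similar" comparison being described (real engines obviously use far more sophisticated signals), a naive Python version could be as little as this, with the 0.9 threshold being an arbitrary assumption:

    import difflib

    def looks_like_same_page(html_a, html_b, threshold=0.9):
        """Crude similarity guess between two fetched pages."""
        return difflib.SequenceMatcher(None, html_a, html_b).ratio() >= threshold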
I wouldn't really bother with much of this, beyond fixing up URLs like in the user 3 case. From your description it doesn't seem that essential that pages that “are the same” have to share actual identity, unless there's a particular use-case you haven't mentioned.
(*: well, Google anyway; more traditional ones traditionally didn't and would happily serve up multiple links for the same page, but I'd assume the other majors are doing something similar now.)
There's no way to know, other than "magic" knowledge about the particular website, that fetching "/index.php" is the same as fetching "/".
So, your problem, as stated, is impossible.
I'd save the 3 links separately, since you can never reliably tell that they resolve to the same page. It all depends on how the server (out of our control) resolves the URL.