I'm creating a site that allows users to add Keyword --> URL links. I want multiple users to be able to link to the same url (exactly the same, same object instance).
So if user 1 types in "http://www.facebook.com/index.php" and user 2 types in "http://facebook.com" and user 3 types in "www.facebook.com" how do I best "convert" them to what these all resolve to: "http://www.facebook.com/"
The back end is in Python...
How does a search engine keep track of URLs? Do they keep a URL and then take whatever it resolves to, or do they toss URLs that are different from what they resolve to and just care about the resolved version?
Thanks!!!
You'd resolve user 3 by fixing up invalid URLs. www.facebook.com isn't a URL, but you can guess that http:// should go on the start. An empty path part is the same as the / path, so you can be sure that needs to go on the end too. A good URL parser should be able to do this bit.
You could resolve user 2 by making an HTTP HEAD request to the URL. If it comes back with a status code of 301, you've got a permanent redirect to the real URL in the Location response header. Facebook does this to send facebook.com traffic to www.facebook.com, and it's definitely something that sites should be doing (even though in the real world many aren't). You might consider allowing other redirect status codes in the 3xx family to do the same; it's not really the right thing to do, but some sites use 302 instead of 301 for the redirect because they're a bit thick.
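Here's one way that HEAD check might look in Python with the standard-library http.client; it's only a sketch under simplifying assumptions (no error handling, no redirect chains, and real code might just use the requests library instead):

    import http.client
    from urllib.parse import urlsplit, urljoin

    def follow_permanent_redirect(url):
        """HEAD the URL; if the server answers 301, return the URL it points at."""
        parts = urlsplit(url)
        conn_cls = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
        conn = conn_cls(parts.netloc, timeout=10)
        try:
            conn.request("HEAD", parts.path or "/")
            response = conn.getresponse()
            if response.status == 301:  # permanent redirect only
                location = response.getheader("Location")
                if location:
                    return urljoin(url, location)  # Location may be relative
            return url
        finally:
            conn.close()

    # follow_permanent_redirect("http://facebook.com/") would return the www URL if the site sends a 301.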
If you have the time and network resources (plus more code to prevent the feature being abused to DoS you or others), you could also consider GETting the target web page and parsing it (assuming it turns out to be HTML). If there is a <link rel="canonical" href="..." /> element in the page, you should also treat that URL as being the proper one. (View Source: Stack Overflow does this.)
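If you do go down the fetch-and-parse route, the canonical link can be pulled out with the standard-library html.parser; this is just a sketch (a real crawler would probably reach for BeautifulSoup or lxml and be far more defensive about broken markup):

    from html.parser import HTMLParser

    class CanonicalLinkFinder(HTMLParser):
        """Remembers the href of the first <link rel="canonical"> element seen."""
        def __init__(self):
            super().__init__()
            self.canonical = None

        def handle_starttag(self, tag, attrs):
            if tag == "link" and self.canonical is None:
                attrs = dict(attrs)
                if (attrs.get("rel") or "").lower() == "canonical":
                    self.canonical = attrs.get("href")

    def find_canonical(html_text):
        parser = CanonicalLinkFinder()
        parser.feed(html_text)
        return parser.canonical  # None if the page doesn't declare one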
However, unfortunately, user 1's case cannot be resolved. Facebook is serving a page at / and a page at /index.php, and though we can look at them and say they're the same, there is no technical method to describe that relationship. In an ideal world Facebook would include either a 301 redirect response or a <link rel="canonical" /> to tell people that / was the proper format URL to access a particular resource rather than /index.php (or vice versa). But they don't, and in fact most database-driven web sites don't do this yet either.

To get around this, some search engines(*) compare the content at different [sub]domains, and to a limited extent also different paths on the same host, and guess that they're the same if the content is sufficiently similar. Of course this is a lot of work, requires a lot of storage and processing, and is ultimately not terribly reliable.
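Purely to illustrate the kind of "sufficiently similar" comparison being described (real engines obviously use far more sophisticated signals), a naive Python version could be as little as this, with the 0.9 threshold being an arbitrary assumption:

    import difflib

    def looks_like_same_page(html_a, html_b, threshold=0.9):
        """Crude similarity guess between two fetched pages."""
        return difflib.SequenceMatcher(None, html_a, html_b).ratio() >= threshold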
I wouldn't really bother with much of this, beyond fixing up URLs like in the user 3 case. From your description it doesn't seem that essential that pages that “are the same” have to share actual identity, unless there's a particular use-case you haven't mentioned.
(*: well, Google anyway; more traditional ones traditionally didn't and would happily serve up multiple links for the same page, but I'd assume the other majors are doing something similar now.)
There's no way to know, other than "magic" knowledge about the particular website, that fetching "/index.php" is the same as fetching "/".
So, your problem, as stated, is impossible.
I'd save the 3 links separately, since you can never reliably tell that they resolve to the same page. It all depends on how the server (out of our control) resolves the URL.