Using PHP to normalize the URI part of a URL given a base URL

Posted 2024-09-06 00:39:54 · 685 characters · 4 views · 0 comments

First off, I'm doing this for a web crawler (aka spider aka worm...).

Given two strings (base url and relative url), I need to determine the absolute url.
It is especially confusing when it comes to "SEO friendly" crap, such as:

Base url: http://aaa.com/january/15/test
Found url: /test.php?aaa

How would I know whether the above are folders or not?
E.g., would the absolute path be:

http://aaa.com/january/15/test/test.php?aaa

Or:

http://aaa.com/january/15/test.php?aaa

?

The confusion stems from whether there is an index in action or not. "/test/index.php" or "/index.php"?


Comments (1)

吾家有女初长成 2024-09-13 00:39:54

You can't solve this problem by examining only the URL.

You say you need the absolute URL given a base URL and a relative URL. The full URL is, roughly speaking, the base URL combined with the relative URL. As you've seen, knowing this alone doesn't tell you which resource you end up with.
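
For the purely mechanical step of combining the two strings, RFC 3986 section 5.2 defines the rules, and they are not quite plain concatenation: a reference beginning with `/` replaces the whole base path, and otherwise everything after the last `/` of the base path is dropped before the reference is appended. Below is a minimal sketch of those rules, assuming an absolute http(s) base URL. The name `resolve_url` is hypothetical (it is not a PHP built-in), and userinfo and fragments are ignored.

```php
<?php
// Hypothetical helper (not a PHP built-in): resolve $rel against $base,
// loosely following RFC 3986 section 5.2.
// Assumes $base is an absolute http(s) URL; userinfo and fragments are ignored.
function resolve_url(string $base, string $rel): string
{
    // A reference that carries its own scheme is already absolute.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;
    }

    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host']
        . (isset($b['port']) ? ':' . $b['port'] : '');

    // "//host/path" keeps only the scheme of the base.
    if (strncmp($rel, '//', 2) === 0) {
        return $b['scheme'] . ':' . $rel;
    }

    $rel = explode('#', $rel, 2)[0];            // drop any fragment
    $parts = explode('?', $rel, 2);
    $relPath = $parts[0];
    $query = isset($parts[1]) ? '?' . $parts[1] : '';
    $basePath = $b['path'] ?? '/';

    if ($relPath === '') {
        // Empty path: same document, keep the base query unless overridden.
        $path = $basePath;
        if ($query === '' && isset($b['query'])) {
            $query = '?' . $b['query'];
        }
    } elseif ($relPath[0] === '/') {
        // Leading slash: the reference replaces the whole base path.
        $path = $relPath;
    } else {
        // Otherwise drop everything after the last "/" of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;
    }

    // Remove "." and ".." segments.
    $segments = explode('/', $path);
    $out = [];
    foreach ($segments as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';                            // keep the trailing slash
    }

    return $origin . implode('/', $out) . $query;
}
```

Under these rules the question's pair (base http://aaa.com/january/15/test, found /test.php?aaa) resolves to http://aaa.com/test.php?aaa, because the found URL starts with a slash. Without the leading slash it would resolve to http://aaa.com/january/15/test.php?aaa, and only a base ending in `/` would yield http://aaa.com/january/15/test/test.php?aaa. Whether any of those addresses point at the same resource is a separate matter.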

http://example.com/directory/index.php and
http://example.com/directory/ can legitimately refer to two different resources.

http://example.com/directory/index.php and http://example.com/directory/foo/bar/baz.php can legitimately refer to the same ultimate resource.

In the second example above, which is the canonical URL? This is not necessarily something that can be determined computationally. The canonical URL is the one you choose to be the canonical URL.

You're actually facing two problems here:

  1. When do two different URLs refer to the same resource?
  2. Which URL is the canonical URL?

1. When do two different URLs refer to the same resource?

This can't be determined by comparing URLs in any way. It can only be determined by comparing the resources themselves, i.e. the content and the HTTP headers.

ETag - http://en.wikipedia.org/wiki/HTTP_ETag

In short, the ETag is an HTTP header that identifies a specific version of a resource. Its intent is cache validation, i.e. is the content I have in my cache the same as the content at http://example.com/content?

Two identical resources, at least from the same host, will typically have the same ETag header value, though this is not guaranteed across different URLs. Use this if possible (not all web servers will return an ETag header).
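
A minimal sketch of the ETag check, assuming the raw header lines come from something like PHP's `get_headers()`. The helper names `etag_from_headers` and `same_etag` are hypothetical, and a match should be treated as a hint rather than proof, since servers are free to generate ETags however they like.

```php
<?php
// Hypothetical helper: pull the ETag value out of raw response header lines
// (e.g. the list returned by get_headers($url)). If redirects produced
// several header sets, the last ETag seen wins.
function etag_from_headers(array $headerLines): ?string
{
    $etag = null;
    foreach ($headerLines as $line) {
        if (is_string($line) && stripos($line, 'ETag:') === 0) {
            $etag = trim(substr($line, 5));
        }
    }
    return $etag;
}

// Hypothetical helper: compare two ETags, ignoring the weak-validator
// prefix "W/". A missing ETag never counts as a match.
function same_etag(?string $a, ?string $b): bool
{
    if ($a === null || $b === null) {
        return false;
    }
    $strip = fn(string $t): string => strncmp($t, 'W/', 2) === 0 ? substr($t, 2) : $t;
    return $strip($a) === $strip($b);
}

// Possible usage (performs a network request, so left commented out):
// $lines = get_headers('http://example.com/content') ?: [];
// $etag  = etag_from_headers($lines);
```

When either response lacks an ETag, `same_etag` returns false and the crawler would need to fall back to comparing headers and content.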

HTTP header and content comparison

When are two resources identical? When the content type and content are the same.

Compare the content type using the Content-Type header. Comparing the content itself is a simple case of string comparison.

If you're storing properties of previously found resources and comparing these to newly found resources, you don't need to consider the full text of the resource for the purposes of comparison; a hash will do.

As far as PHP is concerned, the HTTP extension will give you all you need, with a very convenient OO API for examining the HTTP headers and full content of a resource. The md5() function is one option for generating a hash that is unique for practical purposes. There are others.
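
A minimal sketch of the hash comparison, assuming the Content-Type value and body have already been fetched by whatever HTTP client you use. The name `resource_fingerprint` is hypothetical, and dropping media-type parameters such as `charset` before hashing is an assumption of this sketch, not something the comparison requires.

```php
<?php
// Hypothetical helper: reduce a fetched resource to a short fingerprint
// built from its media type and its body.
function resource_fingerprint(string $contentType, string $body): string
{
    // Assumption: ignore parameters such as "; charset=UTF-8" and letter case.
    $type = strtolower(trim(explode(';', $contentType, 2)[0]));
    return md5($type . "\n" . $body);
}

// Remember which URL first produced each fingerprint.
$seen = [];
$fp = resource_fingerprint('text/html; charset=UTF-8', '<html>hello</html>');
if (!isset($seen[$fp])) {
    $seen[$fp] = 'http://example.com/directory/index.php';
}
```

A later URL whose fingerprint is already a key in `$seen` can then be treated as another address for the same resource.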

2. Which URL is the canonical URL?

Pick one and stick with it. By default one URL is no more canonical than another for the same resource. For simplicity, you might consider the shortest of two URLs to be the canonical form.
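
The "shortest wins" rule can be sketched as follows. The name `pick_canonical` is hypothetical, and breaking ties alphabetically is an assumption added here so that the choice is stable from run to run.

```php
<?php
// Hypothetical helper: among URLs known to refer to the same resource,
// choose the shortest; ties are broken alphabetically (an assumption).
function pick_canonical(array $urls): string
{
    if ($urls === []) {
        throw new InvalidArgumentException('Need at least one URL.');
    }
    usort($urls, fn(string $a, string $b): int => [strlen($a), $a] <=> [strlen($b), $b]);
    return $urls[0];
}
```

Any other deterministic rule would do just as well, as long as the crawler applies the same one every time.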
