
Converting relative URLs to absolute URLs with Simple HTML DOM?

Posted 2024-09-10 23:57:02 · 62 characters · 4 views · 0 comments

When I scrape content from some pages, the script gives me relative URLs. Is it possible to get absolute URLs with Simple HTML DOM?

楠木可依 2024-09-17 23:57:02

I don’t think that the Simple HTML DOM Parser can do that.

But you can do it yourself. First, determine the base URI: it is the URI of the document itself unless the document declares a different one (see the BASE element). Then take each URI reference and resolve it against the base URI using the algorithm described in RFC 3986 (there are already classes you can use for that, such as the PEAR package Net_URL2).

So, using Simple HTML DOM together with Net_URL2, you could do something like this:

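require_once 'simple_html_dom.php'; // Simple HTML DOM (include path is an assumption)
require_once 'Net/URL2.php';        // PEAR Net_URL2 (include path is an assumption)

$uri = new Net_URL2('http://example.com/foo/bar'); // URI of the resource
$html = file_get_html($uri->getURL());             // assumed: the page is loaded like this

// The base URI is the document URI unless a BASE element overrides it.
$baseURI = $uri;
foreach ($html->find('base[href]') as $elem) {
    $baseURI = $uri->resolve($elem->href);
    break; // only the first BASE element with an href counts
}

// Rewrite every src, href and form action to its absolute form.
foreach ($html->find('*[src]') as $elem) {
    $elem->src = $baseURI->resolve($elem->src)->__toString();
}
foreach ($html->find('*[href]') as $elem) {
    if (strtoupper($elem->tag) === 'BASE') continue; // leave the BASE element itself alone
    $elem->href = $baseURI->resolve($elem->href)->__toString();
}
foreach ($html->find('form[action]') as $elem) {
    $elem->action = $baseURI->resolve($elem->action)->__toString();
}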

Repeat the substitution for any other attribute containing a URI like background, cite, classid, codebase, data, longdesc, profile and usemap (see index of attributes in HTML 4.01).
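
If Net_URL2 is not available, the reference resolution described in RFC 3986, section 5.2 can be sketched in plain PHP. This is only a minimal sketch: the function names split_uri, remove_dot_segments and resolve_reference are hypothetical helpers, not part of Simple HTML DOM or Net_URL2, and no validation or percent-encoding normalisation is attempted.

```php
<?php
// Sketch of RFC 3986 section 5.2 reference resolution. All function names
// here are hypothetical helpers, not part of any library.

// Split a URI reference into components (regex from RFC 3986, Appendix B).
function split_uri(string $uri): array
{
    preg_match('~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~', $uri, $m);
    return [
        'scheme'    => ($m[1] ?? '') !== '' ? $m[2] : null,
        'authority' => ($m[3] ?? '') !== '' ? $m[4] : null,
        'path'      => $m[5] ?? '',
        'query'     => ($m[6] ?? '') !== '' ? $m[7] : null,
        'fragment'  => ($m[8] ?? '') !== '' ? $m[9] : null,
    ];
}

// Remove "." and ".." segments from a path (RFC 3986, section 5.2.4).
function remove_dot_segments(string $path): string
{
    $out = '';
    while ($path !== '') {
        if (strpos($path, '../') === 0) {
            $path = substr($path, 3);
        } elseif (strpos($path, './') === 0 || strpos($path, '/./') === 0) {
            $path = substr($path, 2);
        } elseif ($path === '/.') {
            $path = '/';
        } elseif (strpos($path, '/../') === 0 || $path === '/..') {
            // Step back: drop the last segment already written to the output.
            $path = $path === '/..' ? '/' : substr($path, 3);
            $out = substr($out, 0, (int) strrpos($out, '/'));
        } elseif ($path === '.' || $path === '..') {
            $path = '';
        } else {
            // Move the first segment (with its leading "/") to the output.
            $next = strpos($path, '/', 1);
            if ($next === false) {
                $next = strlen($path);
            }
            $out .= substr($path, 0, $next);
            $path = substr($path, $next);
        }
    }
    return $out;
}

// Resolve $ref against $base (RFC 3986, section 5.2.2) and recompose it.
function resolve_reference(string $base, string $ref): string
{
    $b = split_uri($base);
    $t = split_uri($ref); // start from the reference, fill gaps from the base
    if ($t['scheme'] !== null || $t['authority'] !== null) {
        // Absolute or network-path reference: only normalise its path.
        $t['scheme'] = $t['scheme'] ?? $b['scheme'];
        $t['path'] = remove_dot_segments($t['path']);
    } else {
        $t['scheme'] = $b['scheme'];
        $t['authority'] = $b['authority'];
        if ($t['path'] === '') {
            $t['path'] = $b['path'];
            $t['query'] = $t['query'] ?? $b['query'];
        } elseif ($t['path'][0] === '/') {
            $t['path'] = remove_dot_segments($t['path']);
        } else {
            // Merge: base path up to its last "/" plus the reference path.
            $pos = strrpos($b['path'], '/');
            $dir = $pos === false ? '' : substr($b['path'], 0, $pos + 1);
            if ($b['authority'] !== null && $b['path'] === '') {
                $dir = '/';
            }
            $t['path'] = remove_dot_segments($dir . $t['path']);
        }
    }
    return ($t['scheme'] !== null ? $t['scheme'] . ':' : '')
        . ($t['authority'] !== null ? '//' . $t['authority'] : '')
        . $t['path']
        . ($t['query'] !== null ? '?' . $t['query'] : '')
        . ($t['fragment'] !== null ? '#' . $t['fragment'] : '');
}
```

For example, resolve_reference('http://example.com/foo/bar', '../img/logo.png') yields http://example.com/img/logo.png. A function like this could stand in for the resolve() calls when rewriting src, href and action attributes.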


猥琐帝 2024-09-17 23:57:02

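In addition to @Artefacto's answer: if you are outputting the scraped HTML somewhere, you can simply add a <base> element with an href attribute to the document's head. This establishes the given href as the base URL for all relative URLs in the document. Take a look at http://www.w3schools.com/tags/tag_base.asp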
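
A minimal sketch of that idea in PHP, assuming the scraped page is held in a string and contains an opening head tag. The helper name add_base_tag is hypothetical; if no head tag is found, the HTML is returned unchanged.

```php
<?php
// Hypothetical helper: insert a <base> element right after the opening
// <head> tag so relative URLs in the output resolve against $baseUrl.
function add_base_tag(string $html, string $baseUrl): string
{
    $tag = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '">';
    // A callback avoids "$" or "\" in the URL being read as backreferences.
    // "\b" keeps <header> from matching; the limit of 1 touches only the first <head>.
    return preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($tag): string {
            return $m[0] . $tag;
        },
        $html,
        1
    );
}
```

Placing the element first in the head matters because browsers use the first base element they find.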


萌化 2024-09-17 23:57:02

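EDIT: See Gumbo's answer for the formally correct approach. What follows is a simplified algorithm that works in the vast majority of cases but fails in some.

Sure. Do this: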

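Take a relative URL (one that does not start with http://, https:// or any other protocol, and does not start with / either).
Take the URL of the page.
Remove the query string from it, if there is one. A simple way is to explode around ? and take the first element of the resulting array (the element at index 0, or use reset).
If the page URL ends with /, append the relative URL to it and you have the final URL.
If the page URL does not end with /, take its dirname and append / followed by the relative URL (dirname drops the trailing slash). You now have the final URL.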
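
The steps above can be sketched as a small PHP function. The name simple_absolute_url is hypothetical, and rejecting out-of-scope input with an exception is an assumption of this sketch. Like the algorithm itself it is deliberately simplified: it does not collapse ../ segments, and it gives wrong results for a page URL with no path at all, such as http://example.com.

```php
<?php
// Hypothetical helper implementing the simplified steps; not a full
// RFC 3986 resolver.
function simple_absolute_url(string $pageUrl, string $relativeUrl): string
{
    // Step 1: only plain relative URLs (no scheme, no leading "/") are in scope.
    if (preg_match('#^([a-z][a-z0-9+.-]*:|/)#i', $relativeUrl)) {
        throw new InvalidArgumentException('Not a plain relative URL: ' . $relativeUrl);
    }
    // Steps 2 and 3: take the page URL and drop its query string, if any.
    $page = explode('?', $pageUrl)[0];
    // Step 4: a page URL ending in "/" is already a directory.
    if (substr($page, -1) === '/') {
        return $page . $relativeUrl;
    }
    // Step 5: otherwise strip the last segment; dirname() also removes the
    // trailing slash, so it is added back before appending.
    return dirname($page) . '/' . $relativeUrl;
}
```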