ASP.NET web page mirror: replace all relative URLs with absolute paths

Posted on 2024-12-25 01:37:47

I'm trying to build an ASP.NET page that can crawl web pages and display them correctly with all relevant html elements edited to include absolute URLs where appropriate.

This question has been partially answered here https://stackoverflow.com/a/2719712/696638

Using a combination of the answer above and this blog post http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/ I have built the following:

using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public partial class Crawler : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        Response.Clear();

        // URL of the page to mirror, passed as ?path=...
        string url = Request.QueryString["path"];

        string sourceHTML;
        using (WebClient client = new WebClient()) {
            byte[] requestHTML = client.DownloadData(url);
            // Assumes the remote page is UTF-8 encoded
            sourceHTML = Encoding.UTF8.GetString(requestHTML);
        }

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceHTML);

        // SelectNodes returns null (not an empty collection) when nothing matches
        HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) {
            foreach (HtmlNode link in links) {
                HtmlAttribute att = link.Attributes["href"];
                string href = att.Value;
                if (string.IsNullOrEmpty(href)) continue;

                // ignore javascript on buttons using a tags
                if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue;

                // Only rewrite values that are not already absolute
                Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
                if (!urlNext.IsAbsoluteUri) {
                    urlNext = new Uri(new Uri(url), urlNext);
                    att.Value = urlNext.ToString();
                }
            }
        }

        Response.Write(htmlDoc.DocumentNode.OuterHtml);
    }
}

This only replaces the href attribute for links. I'd like to know the most efficient way to extend it to cover:

href attribute for <a> elements
href attribute for <link> elements
src attribute for <script> elements
src attribute for <img> elements
action attribute for <form> elements

And any others people can think of?

Could these be found using a single call to SelectNodes with a monster XPath, or would it be more efficient to call SelectNodes multiple times and iterate through each collection?

迎风吟唱 2025-01-01 01:37:47

The following should work:

SelectNodes("//*[@href or @src or @action]")

and then you'd have to adapt the if statement below.
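
As a rough illustration of what "adapting the if statement" might look like, here is a minimal sketch built around the combined XPath. The class name `UrlRewriter`, the method `MakeUrlsAbsolute`, and the attribute list are hypothetical names chosen for this example, not part of the question's code. The sketch resolves each value with `Uri.TryCreate(baseUri, value, out ...)` rather than the question's two-step check; that is a swapped-in variation, and it also normalizes values that are already absolute.

```csharp
using System;
using HtmlAgilityPack;

public static class UrlRewriter
{
    // Attributes to rewrite (an assumed list; extend as needed).
    private static readonly string[] UrlAttributes = { "href", "src", "action" };

    public static string MakeUrlsAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // One pass over every element carrying at least one URL attribute.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//*[@href or @src or @action]");
        if (nodes == null) return html;

        foreach (HtmlNode node in nodes)
        {
            foreach (string name in UrlAttributes)
            {
                // The indexer yields null when the attribute is absent.
                HtmlAttribute att = node.Attributes[name];
                if (att == null || string.IsNullOrWhiteSpace(att.Value)) continue;

                string value = att.Value.Trim();

                // Leave javascript: pseudo-links and in-page anchors untouched.
                if (value.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase)
                    || value.StartsWith("#")) continue;

                // Resolve against the base URI; skip values that cannot be parsed.
                Uri resolved;
                if (Uri.TryCreate(baseUri, value, out resolved))
                {
                    att.Value = resolved.ToString();
                }
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

In the question's `Page_Load`, the `foreach` loop could then be replaced by something like `Response.Write(UrlRewriter.MakeUrlsAbsolute(sourceHTML, new Uri(url)));`. Whether a single combined query is actually faster than several separate `SelectNodes` calls is not measured here; the main benefit of this shape is that the rewrite logic lives in one loop.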
