ASP.NET web page mirror: replace all relative URLs with absolute paths
I'm trying to build an ASP.NET page that can crawl web pages and display them correctly, with all relevant HTML elements edited to include absolute URLs where appropriate.
This question has been partially answered here: https://stackoverflow.com/a/2719712/696638
Using a combination of the answer above and this blog post, http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/, I have built the following:
    using System;
    using System.Net;
    using System.Text;
    using HtmlAgilityPack;

    public partial class Crawler : System.Web.UI.Page {
        protected void Page_Load(object sender, EventArgs e) {
            Response.Clear();
            string url = Request.QueryString["path"];
            WebClient client = new WebClient();
            byte[] requestHTML = client.DownloadData(url);
            // Assumes the downloaded page is UTF-8 encoded
            string sourceHTML = new UTF8Encoding().GetString(requestHTML);
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(sourceHTML);
            // SelectNodes returns null (not an empty collection) when nothing matches
            HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null) {
                foreach (HtmlNode link in links) {
                    if (!string.IsNullOrEmpty(link.Attributes["href"].Value)) {
                        HtmlAttribute att = link.Attributes["href"];
                        string href = att.Value;
                        // ignore javascript on buttons using a tags
                        if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue;
                        Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
                        if (!urlNext.IsAbsoluteUri) {
                            // Resolve the relative link against the URL of the fetched page
                            urlNext = new Uri(new Uri(url), urlNext);
                            att.Value = urlNext.ToString();
                        }
                    }
                }
            }
            Response.Write(htmlDoc.DocumentNode.OuterHtml);
        }
    }
This only replaces the href attribute for links. By expanding this I'd like to know what the most efficient way would be to include:
- href attribute for <a> elements
- href attribute for <link> elements
- src attribute for <script> elements
- src attribute for <img> elements
- action attribute for <form> elements
And any others people can think of?
Could these be found using a single call to SelectNodes with a monster XPath, or would it be more efficient to call SelectNodes multiple times and iterate through each collection?
Comments (1)
The following should work:
and then you'd have to adapt the if statement below.
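One way a single-query approach could look is sketched below. This is not the snippet from the reply above; it is a minimal illustration that assumes HtmlAgilityPack and .NET Framework, as in the question. The class name `UrlRewriter`, its methods, and the XPath expression `//*[@href or @src or @action]` are all hypothetical choices made for this example.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper: names are illustrative, not taken from the question or the reply.
public static class UrlRewriter
{
    // Attributes that commonly carry URLs; extend the list (and the XPath) as needed.
    private static readonly string[] UrlAttributes = { "href", "src", "action" };

    // One query matching any element that has at least one of those attributes.
    private const string Query = "//*[@href or @src or @action]";

    public static string MakeUrlsAbsolute(string html, Uri baseUri)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(Query);
        if (nodes == null) return html;

        foreach (HtmlNode node in nodes)
        {
            foreach (string name in UrlAttributes)
            {
                HtmlAttribute att = node.Attributes[name];
                if (att == null || string.IsNullOrWhiteSpace(att.Value)) continue;

                Uri resolved;
                if (TryResolve(baseUri, att.Value.Trim(), out resolved))
                    att.Value = resolved.AbsoluteUri;
            }
        }
        return doc.DocumentNode.OuterHtml;
    }

    // Returns true only when the value was relative and could be resolved against baseUri.
    private static bool TryResolve(Uri baseUri, string value, out Uri resolved)
    {
        resolved = null;

        // Design choice for this sketch: leave in-page anchors and javascript: links untouched.
        if (value.StartsWith("#", StringComparison.Ordinal)) return false;
        if (value.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase)) return false;

        Uri candidate;
        if (!Uri.TryCreate(value, UriKind.RelativeOrAbsolute, out candidate)) return false;

        // Already absolute (http:, https:, mailto:, data:, ...): nothing to rewrite.
        if (candidate.IsAbsoluteUri) return false;

        return Uri.TryCreate(baseUri, candidate, out resolved);
    }
}
```

In the question's `Page_Load`, the call would be something like `Response.Write(UrlRewriter.MakeUrlsAbsolute(sourceHTML, new Uri(url)));`. A single query keeps the loop in one place. Whether it is measurably faster than several `SelectNodes` calls would need profiling on real pages.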