How do I form a complete URL from an incomplete URL found in a web page?
I can retrieve the text of a web page, say https://stackoverflow.com/questions, which contains some real and some made-up links:
/questions
/tags
/questions?sort=votes
/questions?sort=active
randompage.aspx
../coolhomepage.aspx
Knowing that my originating page was https://stackoverflow.com/questions, is there a way in .NET to resolve the links to the following?
https://stackoverflow.com/questions
https://stackoverflow.com/tags
https://stackoverflow.com/questions?sort=votes
https://stackoverflow.com/questions?sort=active
https://stackoverflow.com/questions/randompage.aspx
https://stackoverflow.com/coolhomepage.aspx
Kind of like the way a browser is smart enough to resolve the links.
===========================
Update - Using David's solution:
'Regex to match all <a ... /a> links
Dim myRegEx As New Regex("\<\s*a (?# Find opening <a tag) " & _
    ".+?href\s*=\s*['""] (?# Then all to href=' or "" ) " & _
    "(?<href>.*?)['""] (?# Then all to the next ' or "" ) " & _
    ".*?\> (?# Then all to > ) " & _
    "(?<name>.*?)\<\s*/a\s*\> (?# Then all to </a> ) ", _
    RegexOptions.IgnoreCase Or _
    RegexOptions.IgnorePatternWhitespace Or _
    RegexOptions.Multiline)

'MatchCollection to hold all the links that are matched
Dim myMatchCollection As MatchCollection
myMatchCollection = myRegEx.Matches(Me._RawPageText)

'Loop through all matches and evaluate the value of the href attribute.
For i As Integer = 0 To myMatchCollection.Count - 1
    Dim thisLink As String = ""
    thisLink = myMatchCollection(i).Groups("href").Value()

    'This checks for Javascript and Mailto links.
    'This is not complete. There are others to check I just haven't encountered them yet.
    If thisLink.ToLower.StartsWith("javascript") Then
        thisLink = "JAVASCRIPT: " & thisLink
    ElseIf thisLink.ToLower.StartsWith("mailto") Then
        thisLink = "MAILTO: " & thisLink
    Else
        Dim baseUri As New Uri(Me.URL)
        If Not thisLink.ToLower.StartsWith("http") Then
            'This is a partial URL so we will assume that it's relative to our originating URL
            Dim myUri As New Uri(baseUri, thisLink)
            thisLink = "RELATIVE LOCAL LINK: RESOLVED: " & myUri.ToString() & " ORIGINAL: " & thisLink
        Else
            'The link starts with HTTP, determine if part of base host or is outside host.
            Dim ThisUri As New Uri(thisLink)
            If ThisUri.Host.ToLower = baseUri.Host.ToLower Then
                thisLink = "INSIDE COMPLETE LINK: " & thisLink
            Else
                thisLink = "OUTSIDE LINK: " & thisLink
            End If
        End If
    End If

    'I'm storing the found links into a Generic.List(Of String)
    'This link has descriptive text added to it.
    'TODO: Make collection to hold only unique internal links.
    Me._Links.Add(thisLink)
Next
Comments (3)
You mean like this?
Sample comes from http://msdn.microsoft.com/en-us/library/9hst1w91.aspx
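The sample itself did not survive on this page. Judging from the asker's update, which calls New Uri(baseUri, thisLink), the approach is presumably the Uri constructor that takes a base Uri plus a relative string. A minimal sketch under that assumption follows; the module name and the link list are illustrative, not taken from the MSDN sample.

```vbnet
Imports System

Module ResolveLinksSketch
    Sub Main()
        ' Originating page from the question. Note there is no trailing slash.
        Dim baseUri As New Uri("https://stackoverflow.com/questions")

        ' Partial links as they might appear in href attributes.
        Dim links() As String = {"/questions", "/tags", "/questions?sort=votes",
                                 "randompage.aspx", "../coolhomepage.aspx"}

        For Each link As String In links
            Dim resolved As Uri = Nothing
            ' TryCreate returns False instead of throwing when the link is malformed.
            If Uri.TryCreate(baseUri, link, resolved) Then
                Console.WriteLine(link & " -> " & resolved.AbsoluteUri)
            Else
                Console.WriteLine(link & " -> could not be resolved")
            End If
        Next
    End Sub
End Module
```

One thing to watch: under the standard resolution rules (RFC 3986), a relative reference replaces the last path segment of the base. With the base above, randompage.aspx should therefore resolve to https://stackoverflow.com/randompage.aspx. To get https://stackoverflow.com/questions/randompage.aspx, the base would need to end with a slash (https://stackoverflow.com/questions/).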
If you mean server-side, you can use ResolveUrl().
I don't understand what you mean by "resolve" in this context, but since you asked how the browser would handle it, you can try inserting a base HTML element.
"The <base> tag specifies a default address or a default target for all links on a page." http://www.w3schools.com/TAGS/tag_base.asp
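If the goal is to resolve links the way a browser does, one consequence of the base element is that a page can override the address its relative links are resolved against. A rough sketch of honoring that when scraping is below. GetEffectiveBase is a hypothetical helper, the HTML string and example.com address are made up for illustration, and a regex is only a crude stand-in for a real HTML parser.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BaseTagSketch
    ' Hypothetical helper: returns the URI that relative links should be
    ' resolved against, preferring a <base href="..."> found in the page.
    Function GetEffectiveBase(pageUri As Uri, html As String) As Uri
        Dim m As Match = Regex.Match(html,
            "<base\b[^>]*?\bhref\s*=\s*['""](?<href>[^'""]*)['""]",
            RegexOptions.IgnoreCase)

        If m.Success Then
            Dim declared As Uri = Nothing
            ' The base href may itself be relative, so resolve it against the page URL.
            If Uri.TryCreate(pageUri, m.Groups("href").Value, declared) Then
                Return declared
            End If
        End If

        ' No usable <base> element: fall back to the page's own URL.
        Return pageUri
    End Function

    Sub Main()
        Dim pageUri As New Uri("https://stackoverflow.com/questions")
        Dim html As String = "<html><head><base href=""https://example.com/docs/""></head>" &
                             "<body><a href=""page.aspx"">x</a></body></html>"

        Dim effectiveBase As Uri = GetEffectiveBase(pageUri, html)
        ' Prints https://example.com/docs/page.aspx
        Console.WriteLine(New Uri(effectiveBase, "page.aspx").AbsoluteUri)
    End Sub
End Module
```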