How can I form a complete URL from a partial URL found in a web page?

Posted 2024-07-18 10:49:13

I can retrieve the text of a web page, say https://stackoverflow.com/questions, which contains some real and made-up links:

    /questions
    /tags
    /questions?sort=votes
    /questions?sort=active
    randompage.aspx
    ../coolhomepage.aspx

Knowing that my originating page was https://stackoverflow.com/questions, is there a way in .NET to resolve the links to the following?

    https://stackoverflow.com/questions
    https://stackoverflow.com/tags
    https://stackoverflow.com/questions?sort=votes
    https://stackoverflow.com/questions?sort=active
    https://stackoverflow.com/questions/randompage.aspx
    https://stackoverflow.com/coolhomepage.aspx

Kind of like the way a browser is smart enough to resolve the links.

=========================== Update - Using David's solution:

    'Regex to match all <a ... /a> links
    Dim myRegEx As New Regex("\<\s*a                   (?# Find opening <a tag)           " & _
                             ".+?href\s*=\s*['""]      (?# Then all to href=' or "" )     " & _
                             "(?<href>.*?)['""]        (?# Then all to the next ' or "" ) " & _
                             ".*?\>                    (?# Then all to > )                " & _
                             "(?<name>.*?)\<\s*/a\s*\> (?# Then all to </a> )             ", _
                             RegexOptions.IgnoreCase Or _
                             RegexOptions.IgnorePatternWhitespace Or _
                             RegexOptions.Multiline)

    'MatchCollection to hold all the links that are matched
    Dim myMatchCollection As MatchCollection
    myMatchCollection = myRegEx.Matches(Me._RawPageText)

    'Loop through all matches and evaluate the value of the href attribute.
    For i As Integer = 0 To myMatchCollection.Count - 1
        Dim thisLink As String = myMatchCollection(i).Groups("href").Value
        'This checks for javascript: and mailto: links.
        'This is not complete. There are other schemes to check; I just haven't encountered them yet.
        'Ordinal, case-insensitive comparisons avoid the culture-sensitive lowercasing done by ToLower.
        If thisLink.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase) Then
            thisLink = "JAVASCRIPT: " & thisLink
        ElseIf thisLink.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase) Then
            thisLink = "MAILTO: " & thisLink
        Else
            Dim baseUri As New Uri(Me.URL)

            If Not thisLink.StartsWith("http", StringComparison.OrdinalIgnoreCase) Then
                'This is a partial URL so we will assume that it's relative to our originating URL
                Dim myUri As New Uri(baseUri, thisLink)
                thisLink = "RELATIVE LOCAL LINK: RESOLVED: " & myUri.ToString() & " ORIGINAL: " & thisLink
            Else
                'The link starts with HTTP, determine if part of base host or is outside host.
                Dim thisUri As New Uri(thisLink)
                If String.Equals(thisUri.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase) Then
                    thisLink = "INSIDE COMPLETE LINK: " & thisLink
                Else
                    thisLink = "OUTSIDE LINK: " & thisLink
                End If
            End If

        End If

        'I'm storing the found links into a Generic.List(Of String)
        'This link has descriptive text added to it.
        'TODO: Make collection to hold only unique internal links.
        Me._Links.Add(thisLink)
    Next
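
The TODO above (keeping only unique internal links) could be approached roughly as follows. This is a minimal C# sketch, not part of the original post: the class name `InternalLinkCollector` and the method `CollectInternal` are hypothetical, and it assumes "internal" means an http or https link whose host matches the originating page.

```csharp
using System;
using System.Collections.Generic;

public static class InternalLinkCollector
{
    // Returns the distinct absolute URLs among hrefs that point at the same host as pageUrl.
    public static HashSet<string> CollectInternal(string pageUrl, IEnumerable<string> hrefs)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        var unique = new HashSet<string>(StringComparer.Ordinal);

        foreach (string href in hrefs)
        {
            // Relative hrefs are resolved against baseUri; absolute hrefs are parsed as they are.
            if (!Uri.TryCreate(baseUri, href, out Uri resolved))
            {
                continue; // skip hrefs that cannot be turned into a valid URI
            }

            // Keeping only http/https drops mailto:, javascript: and similar schemes.
            bool isWeb = resolved.Scheme == Uri.UriSchemeHttp || resolved.Scheme == Uri.UriSchemeHttps;
            if (isWeb && string.Equals(resolved.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase))
            {
                unique.Add(resolved.AbsoluteUri); // the set ignores repeats
            }
        }

        return unique;
    }

    public static void Main()
    {
        var links = CollectInternal(
            "https://stackoverflow.com/questions",
            new[] { "/tags", "/tags", "https://stackoverflow.com/tags", "http://example.com/x", "mailto:a@b.com" });

        foreach (string link in links)
        {
            Console.WriteLine(link);
        }
    }
}
```

Links that differ only by a fragment (for example `#top`) are treated as distinct here; stripping the fragment before adding would be a reasonable variation.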

Comments (3)

吾性傲以野 2024-07-25 10:49:13

You mean like this?

    Uri baseUri = new Uri("http://www.contoso.com");
    Uri myUri = new Uri(baseUri, "catalog/shownew.htm");

    Console.WriteLine(myUri.ToString());

Sample comes from http://msdn.microsoft.com/en-us/library/9hst1w91.aspx
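
Applied to the links in the question, the same approach might look like the sketch below. The `LinkResolver` class and its `Resolve` method are hypothetical names used only for illustration. One detail worth checking: relative references are merged with the base path by replacing its last segment, so `randompage.aspx` only lands under `/questions/` if the base URL ends with a slash.

```csharp
using System;

public static class LinkResolver
{
    // Resolves href against the page it was found on.
    // Returns null when the combination does not form a valid URI.
    public static Uri Resolve(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        return Uri.TryCreate(baseUri, href, out Uri resolved) ? resolved : null;
    }

    public static void Main()
    {
        string[] hrefs =
        {
            "/questions", "/tags", "/questions?sort=votes", "/questions?sort=active",
            "randompage.aspx", "../coolhomepage.aspx"
        };

        // With the trailing slash, "randompage.aspx" resolves under /questions/.
        foreach (string href in hrefs)
        {
            Console.WriteLine(Resolve("https://stackoverflow.com/questions/", href));
        }

        // Without it, the last path segment ("questions") is replaced instead.
        Console.WriteLine(Resolve("https://stackoverflow.com/questions", "randompage.aspx"));
        // prints https://stackoverflow.com/randompage.aspx
    }
}
```

`Uri.TryCreate` is used instead of the constructor so that a malformed href yields null rather than a `UriFormatException`.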

七颜 2024-07-25 10:49:13

If you mean server-side, you can use ResolveUrl():

    string url = ResolveUrl("~/questions");

神回复 2024-07-25 10:49:13

I don't understand what you mean by "resolve" in this context, but since you asked how a browser would handle it, you can try inserting an HTML <base> element.

"The <base> tag specifies a default address or a default target for all links on a page."

http://www.w3schools.com/TAGS/tag_base.asp
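
If the goal is to mimic the browser when scraping, a page's `<base href>` would need to be taken into account before resolving its links. A rough sketch under that assumption: `BaseAwareResolver` is a hypothetical helper, and extracting the `href` value of the `<base>` tag from the HTML is left to the caller (pass null when the page has none).

```csharp
using System;

public static class BaseAwareResolver
{
    // baseHref is the value of the page's <base href="..."> attribute, or null if absent.
    // Returns null when href cannot be resolved into a valid URI.
    public static Uri Resolve(string pageUrl, string baseHref, string href)
    {
        var pageUri = new Uri(pageUrl, UriKind.Absolute);

        // The <base> href may itself be relative, so resolve it against the page URL first.
        Uri effectiveBase = pageUri;
        if (!string.IsNullOrWhiteSpace(baseHref) &&
            Uri.TryCreate(pageUri, baseHref.Trim(), out Uri declaredBase))
        {
            effectiveBase = declaredBase;
        }

        return Uri.TryCreate(effectiveBase, href, out Uri resolved) ? resolved : null;
    }

    public static void Main()
    {
        // The page declares a base, so links resolve against it rather than the page URL.
        Console.WriteLine(Resolve("https://stackoverflow.com/questions", "https://example.com/docs/", "page.html"));

        // No <base> element: links resolve against the page URL itself.
        Console.WriteLine(Resolve("https://stackoverflow.com/questions", null, "/tags"));
    }
}
```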
