HttpURLConnection returns "Moved Permanently" when fetching the page title

Posted on 2024-11-29 21:49:36

This is the code I've written in Groovy to get the page title from a URL. For some websites, however, I get "Moved Permanently" as the title, which I think is caused by a 301 redirect. How do I avoid this and have HttpURLConnection follow the redirect to the right URL so that I get the correct page title?

For example, for this page I get "Moved Permanently" instead of the correct page title:
http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html


        def con = (HttpURLConnection) new URL(url).openConnection()
        con.connect()

        def inputStream = con.inputStream

        HtmlCleaner cleaner = new HtmlCleaner()
        CleanerProperties props = cleaner.getProperties()

        TagNode node = cleaner.clean(inputStream)
        TagNode titleNode = node.findElementByName("title", true);

        def title = titleNode.getText().toString()
        title = StringEscapeUtils.unescapeHtml(title).trim()
        title = title.replace("\n", "");
        return title

未央 2024-12-06 21:49:36

I can get this to work if I manage the redirects myself...

I think the issue is that the site sets cookies partway through the redirect chain and expects to get them back on the following requests; if it doesn't get them, it sends you to a log-in page.

This code obviously needs some cleaning up (and there is probably a better way to do this), but it shows how I can extract the title:

@Grab( 'net.sourceforge.htmlcleaner:htmlcleaner:2.2' )
@Grab( 'commons-lang:commons-lang:2.6' )
import org.apache.commons.lang.StringEscapeUtils
import org.htmlcleaner.*

String location = 'http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html'
String cookie = null
String pageContent = ''

while( location ) {
  new URL( location ).openConnection().with { con ->
    // We'll do redirects ourselves
    con.instanceFollowRedirects = false

    // If we got a cookie last time round, then add it to our request
    if( cookie ) con.setRequestProperty( 'Cookie', cookie )
    con.connect()

    // Get the response code, and the location to jump to (in case of a redirect)
    int responseCode = con.responseCode
    location = con.getHeaderField( "Location" )

    // Try and get a cookie the site will set, we will pass this next time round
    cookie = con.getHeaderField( "Set-Cookie" )

    // Read the HTML and close the inputstream
    pageContent = con.inputStream.withReader { it.text }
  }
}

// Then, clean pageContent and get the title
HtmlCleaner cleaner = new HtmlCleaner()
CleanerProperties props = cleaner.getProperties()

TagNode node = cleaner.clean( pageContent )
TagNode titleNode = node.findElementByName("title", true);

def title = titleNode.text.toString()
title = StringEscapeUtils.unescapeHtml( title ).trim()
title = title.replace( "\n", "" )

println title
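
As one possible clean-up of the loop above: `getHeaderField( "Set-Cookie" )` returns a single value even when the server sets several cookies, and that value still carries attributes such as `Path` or `Expires`, which are not meant to be sent back. A `Location` header may also be relative. The two helpers below are only a sketch of how that could be handled; `cookieHeaderFrom` and `resolveLocation` are hypothetical names I made up for illustration, and the example URLs and cookie values are invented.

```groovy
// Hypothetical helpers; the names are illustrative and not part of any library.

// Build a Cookie request header from raw Set-Cookie response header values.
// Only the leading name=value pair of each value is kept; attributes such as
// Path, Expires or HttpOnly are dropped.
String cookieHeaderFrom( List<String> setCookieValues ) {
  def nonEmpty = setCookieValues.findAll { it }                    // skip null or empty entries
  def pairs = nonEmpty.collect { it.split( ';' )[ 0 ].trim() }     // keep "name=value" only
  return pairs.findAll { it.contains( '=' ) }.join( '; ' )
}

// Resolve a Location header against the URL that produced it, so that a
// relative redirect such as "/login" becomes an absolute URL.
String resolveLocation( String currentUrl, String location ) {
  return new URI( currentUrl ).resolve( location ).toString()
}

// Example with made-up values
def headers = [ 'SESSION=abc123; Path=/; HttpOnly',
                'lang=en; Expires=Wed, 21 Oct 2015 07:28:00 GMT' ]
println cookieHeaderFrom( headers )
// prints: SESSION=abc123; lang=en
println resolveLocation( 'http://example.com/a/b.html', '/login' )
// prints: http://example.com/login
```

Inside the loop, the list of values could come from `con.getHeaderFields()`, which maps each header name to all of its values. It is probably also worth counting the hops and giving up after a fixed number, so that a redirect loop cannot keep the `while` running forever.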

Hope it helps!
