HttpURLConnection returns "Moved Permanently" instead of the page title
This is the Groovy code I wrote to get the page title from a URL. For some websites, however, I get "Moved Permanently" as the title, which I think is caused by a 301 redirect. How do I avoid this and make HttpURLConnection follow the redirect to the right URL so that I get the correct page title?
For example, for this page I get "Moved Permanently" instead of the correct page title:
http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html
def con = (HttpURLConnection) new URL(url).openConnection()
con.connect()
def inputStream = con.inputStream
HtmlCleaner cleaner = new HtmlCleaner()
CleanerProperties props = cleaner.getProperties()
TagNode node = cleaner.clean(inputStream)
TagNode titleNode = node.findElementByName("title", true);
def title = titleNode.getText().toString()
title = StringEscapeUtils.unescapeHtml(title).trim()
title = title.replace("\n", "");
return title
Comments (2)
I can get this to work if I manage the redirects myself...
I think the issue is that the site sets cookies partway down the redirect chain and expects them back on later requests; if it doesn't get them, it sends you to a log-in page.
This code obviously needs some cleaning up (and there is probably a better way to do this), but it shows how I can extract the title:
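The answer's own snippet is not reproduced here. As a rough sketch of the approach it describes (not the answerer's actual code), the following plain Java class, callable from Groovy, follows redirects by hand and sends back any cookies it receives along the way. The class name `RedirectFollower`, its method names, and the limit of 10 hops are assumptions made for illustration, and the cookie handling is deliberately simplified: it ignores domain, path, and expiry attributes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class RedirectFollower {
    // Arbitrary safety limit (assumption) so a redirect loop cannot run forever.
    private static final int MAX_REDIRECTS = 10;

    /** Stores the name=value part of each Set-Cookie header; later values overwrite earlier ones. */
    public static void storeCookies(Map<String, String> jar, List<String> setCookieHeaders) {
        if (setCookieHeaders == null) return;
        for (String header : setCookieHeaders) {
            String pair = header.split(";", 2)[0].trim(); // drop attributes such as Path or HttpOnly
            int eq = pair.indexOf('=');
            if (eq > 0) jar.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
    }

    /** Formats the stored cookies as the value of a Cookie request header. */
    public static String cookieHeader(Map<String, String> jar) {
        StringJoiner joiner = new StringJoiner("; ");
        jar.forEach((name, value) -> joiner.add(name + "=" + value));
        return joiner.toString();
    }

    /** Follows redirects manually and returns the body stream of the first non-redirect response. */
    public static InputStream openFollowingRedirects(String startUrl) throws IOException {
        Map<String, String> jar = new LinkedHashMap<>();
        URL url = new URL(startUrl);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setInstanceFollowRedirects(false); // we handle 3xx responses ourselves
            if (!jar.isEmpty()) con.setRequestProperty("Cookie", cookieHeader(jar));
            int status = con.getResponseCode();
            // Header names are matched case-insensitively; the status line has a null key.
            for (Map.Entry<String, List<String>> e : con.getHeaderFields().entrySet()) {
                if ("Set-Cookie".equalsIgnoreCase(e.getKey())) storeCookies(jar, e.getValue());
            }
            String location = con.getHeaderField("Location");
            boolean redirect = location != null
                    && (status == 301 || status == 302 || status == 303
                        || status == 307 || status == 308);
            if (!redirect) return con.getInputStream();
            url = new URL(url, location); // also resolves relative Location values
            con.disconnect();
        }
        throw new IOException("Too many redirects starting from " + startUrl);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java RedirectFollower.java <url>");
            return;
        }
        try (InputStream in = openFollowingRedirects(args[0])) {
            System.out.println("Read " + in.readAllBytes().length + " bytes");
        }
    }
}
```

With this on the classpath, the question's code could pass `RedirectFollower.openFollowingRedirects(url)` to `cleaner.clean(...)` in place of `con.inputStream`; the title extraction itself stays unchanged.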
Hope it helps!
You need to call setInstanceFollowRedirects(true) on the HttpURLConnection, i.e. insert the following after the first line:
con.setInstanceFollowRedirects(true)
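In context, and written as plain Java that Groovy can call, the suggestion might look like the sketch below. The class name `FollowRedirectsExample` and the helper `open` are made up for illustration. Two caveats, as far as I know: redirect following is normally already enabled by default, and HttpURLConnection does not automatically follow a redirect that switches between http and https, so this call alone may not be enough for every site.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class FollowRedirectsExample {
    /** Creates a connection (without connecting yet) that follows HTTP redirects. */
    public static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setInstanceFollowRedirects(true); // has to be set before the connection is made
        return con;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java FollowRedirectsExample.java <url>");
            return;
        }
        HttpURLConnection con = open(args[0]);
        // getResponseCode() connects implicitly and reports the status of the final response.
        System.out.println("HTTP status: " + con.getResponseCode());
        con.disconnect();
    }
}
```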