Crawl only HTML pages by checking the response headers
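I am trying to get all the URLs whose response header is Content-Type: text/html. I check the response headers of each URL, and if they include Content-Type: text/html, I want to print that URL. But in my code, when I check whether the header is Content-Type, nothing is displayed. If I remove the if statement, it prints every link related to the URL I want to crawl, along with its response headers.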
public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
    /*
    Pattern filters = Pattern.compile("(\\.(html))");
    */

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        //System.out.println("Href: " +href);
        /*
        if (filters.matcher(href).matches()) {
            return false;
        }*/
        if (href.startsWith("http://www.somehost.com/")) {
            return true;
        }
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();
        //HttpGet httpget = new HttpGet(url);
        try {
            URL url1 = new URL(url);
            URLConnection connection = url1.openConnection();
            Map responseMap = connection.getHeaderFields();
            for (Iterator iterator = responseMap.keySet().iterator(); iterator.hasNext();) {
                String key = (String) iterator.next();
                if (key == ("Content-Type")) // (Anything wrong with this if loop)
                {
                    System.out.print(key + " = ");
                    List values = (List) responseMap.get(key);
                    for (int i = 0; i < values.size(); i++) {
                        Object o = values.get(i);
                        System.out.print(o + ", ");
                    }
                    System.out.println("");
                    System.out.println(url1);
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        //System.out.println("Docid: " + docid);
        //System.out.println("URL: " + url);
        //System.out.println("Text length: " + text.length());
        //System.out.println("Number of links: " + links.size());
        //System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }
}
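The key variable contains:
Content-Type=[text/html; charset=ISO-8859-1]
and therefore can't be caught with == or .equals("Content-Type").
If you try to run the following code, see what it prints out.
Here is the output:
It looks like you could also just use a single if statement, as follows:
BTW, string comparison in Java is very intuitive, gets me all the time!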
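The header check discussed above can be tried without a network call. What follows is only a minimal sketch, assuming a hand-built map that stands in for URLConnection.getHeaderFields(); the class HeaderCheck and the method isHtml are hypothetical names, not part of the original post or of crawler4j. It prints a whole map entry (which takes the key=[values] form) next to the bare key, and it compares the key with equalsIgnoreCase called on the literal, so a null key (typically the HTTP status line) does not throw.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of crawler4j or the original post.
class HeaderCheck {

    // Returns true when some Content-Type value starts with "text/html".
    static boolean isHtml(Map<String, List<String>> headers) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            // Calling equalsIgnoreCase on the literal compares characters
            // (not references) and tolerates a null key.
            if ("Content-Type".equalsIgnoreCase(entry.getKey())) {
                for (String value : entry.getValue()) {
                    if (value != null
                            && value.toLowerCase(Locale.ROOT).startsWith("text/html")) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed, illustrative stand-in for connection.getHeaderFields().
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        headers.put("Content-Type", Arrays.asList("text/html; charset=ISO-8859-1"));

        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            // Printing the entry shows key=[values]; the key alone is just the name.
            System.out.println("entry: " + entry);
            System.out.println("key:   " + entry.getKey());
        }
        System.out.println("HTML page? " + isHtml(headers)); // prints "HTML page? true"
    }
}
```

Inside visit(), the same method could be handed connection.getHeaderFields(); URLConnection.getContentType() is another way to read the header value directly.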
For string comparison, use .equals().
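To make the difference concrete, here is a small sketch of reference comparison versus content comparison. The class and method names are hypothetical and chosen only for illustration. The == operator asks whether two variables point to the same String object, while equals() and equalsIgnoreCase() compare the characters.

```java
// Hypothetical demo class; the names are illustrative only.
class StringCompareDemo {

    // Reference comparison: true only when both variables refer to the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Content comparison: true when both strings hold the same characters.
    static boolean sameContent(String a, String b) {
        return a != null && a.equals(b);
    }

    // Case-insensitive content comparison, useful for HTTP header names.
    static boolean sameHeaderName(String a, String b) {
        return a != null && a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        String literal = "Content-Type";
        // new String(...) creates a distinct object holding the same characters,
        // much like a string that arrives at runtime from a response.
        String runtime = new String("Content-Type");

        System.out.println(sameReference(runtime, literal));         // false
        System.out.println(sameContent(runtime, literal));           // true
        System.out.println(sameHeaderName("content-type", literal)); // true
    }
}
```

Two literals with the same text may still compare equal with == because the compiler interns them, which is one reason this bug can stay hidden until the string comes from somewhere else.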
It should work with