JSoup and a site I can't scrape
I'm learning to use JSoup, and to practice I started with a site that lists football matches. I'm not good at HTML, which may be the problem; in any case, I inspected the site's source to work out how to retrieve the information I need. It works fine on the initial page. Here's my code:
    import java.net.URL;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JSoup_example {

        public static void main(String[] args) {
            try {
                // Fetch and parse the page; the second argument is the timeout in milliseconds.
                Document doc = Jsoup.parse(new URL("https://www.bwin.it/sportsbook.aspx"), 1000000);

                // Each match is a table row with class "normal".
                Elements resultLinks = doc.select("tr.normal");
                //System.out.println(doc);
                System.out.println(resultLinks.size());

                // Print the first and last "label" cells of every row.
                for (Element link : resultLinks) {
                    System.out.println(link.select("td.label").first().text() + " - " + link.select("td.label").last().text());
                    System.out.println();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
But I run into problems when I try to scrape another page of the same site, one whose markup looks exactly like the first page when I inspect it.
After many attempts I inspected the central element with my browser (Chrome) and found that the table containing the data I want has its "baseURI" attribute set as follows:
https://www.bwin.it/betViewIframe.aspx?SportID=4&bv=bb&selectedLeagues=0
If I put that link in my code it works as expected, but I don't understand why it doesn't work with the previous link. So my questions are:
What am I doing wrong?
Is there a better way to retrieve the link of the table instead of doing it manually?
Thanks for the help.
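
To make the second question concrete: the differing "baseURI" suggests the table sits in a separate document embedded in the page, most likely through an iframe. That is an assumption, not something confirmed above. Under that assumption, a minimal sketch of looking up the embedded URL with JSoup could look like the following. The class name `IframeSourceSketch`, the helper `iframeUrls`, and the inline HTML are all hypothetical stand-ins, not the site's actual markup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IframeSourceSketch {

    // Hypothetical helper: returns the absolute URL of every <iframe src="...">
    // found in the given HTML. Relative src values are resolved against baseUri.
    public static List<String> iframeUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element frame : doc.select("iframe[src]")) {
            // absUrl returns an empty string when the URL cannot be resolved.
            urls.add(frame.absUrl("src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed markup for illustration only: an outer page that embeds the
        // bet table through an iframe with a relative src.
        String html = "<html><body>"
                + "<iframe src=\"betViewIframe.aspx?SportID=4&amp;bv=bb&amp;selectedLeagues=0\"></iframe>"
                + "</body></html>";

        for (String url : iframeUrls(html, "https://www.bwin.it/sportsbook.aspx")) {
            System.out.println(url);
        }
    }
}
```

Against the live site, the same selector could be run on the Document returned by Jsoup.parse(new URL(...), timeout), and each resulting URL fetched in turn before applying the "tr.normal" selector. If the iframe is added by JavaScript after the page loads, it will not show up this way, because JSoup parses the HTML it receives and does not execute scripts.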