Collecting data from inconsistent HTML pages - JSoup
I'm trying to get a lot of data from multiple pages, but it's not always consistent. Here is an example of the HTML I am working with:
I need to get something like Team | Team | Result, all into different variables or lists.
I just need some help on where to start, because the main table I'm working with isn't the same on every page.
Here's my Java so far:
try {
    Document team_page = Jsoup.connect("http://www.soccerstats.com/team.asp?league=" + league + "&teamid=" + teamNumber).get();
    // first() returns null when nothing matches, so guard before calling text()
    Element home_team = team_page.select(".homeTitle").first();
    if (home_team != null) {
        String teamName = home_team.text();
        System.out.println(teamName + "'s Latest Results: ");
    }
    Elements main_page = team_page.select(".stat");
    System.out.println(main_page);
} catch (IOException e) {
    System.out.println("unable to parse content");
}
I am getting the league and teamid from different methods of my program.
Thanks!
1 Answer
Yes. This is one of the problems with webpage scraping.
You have to figure out one or more heuristics that will extract the information that you need across all of the pages that you need to access. There's no magic bullet. Just hard work. (And you'll have to do it all over again if the site changes its page layout.)
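One simple heuristic is to try several candidate selectors in order and take the first that matches anything. A minimal sketch — the selector names and HTML here are made up for illustration, not the real soccerstats.com markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorFallback {
    // Try each candidate selector in turn and return the first hit, or null.
    static Element selectFirstMatch(Document doc, String... selectors) {
        for (String sel : selectors) {
            Element e = doc.select(sel).first();
            if (e != null) {
                return e;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in page: no .homeTitle here, so the fallback kicks in.
        Document doc = Jsoup.parse(
            "<html><body><h2 class='teamTitle'>Arsenal</h2></body></html>");
        Element title = selectFirstMatch(doc, ".homeTitle", ".teamTitle");
        System.out.println(title == null ? "not found" : title.text());
    }
}
```

This keeps the per-page differences in one place: when you hit a page variant that breaks, you add one more selector to the list instead of rewriting the scraper.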
A better idea is to request the information as XML or JSON using the site's RESTful API ... assuming one exists and is available to you.
(And if you continue with the web-scraping approach, check the site's Terms of Service to make sure that your activity is acceptable.)
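If you do stay with scraping, the Team | Team | Result split the question asks for usually comes down to walking the table rows and skipping any row that doesn't have the cells you expect — that guard is what absorbs most of the page-to-page inconsistency. A rough sketch against a hypothetical three-column table (the real page's classes and layout will differ):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ResultsScraper {
    public static void main(String[] args) {
        // Stand-in HTML for illustration; not the actual soccerstats.com markup.
        String html = "<table class='stat'>"
            + "<tr><th>Home</th><th>Away</th><th>Score</th></tr>"
            + "<tr><td>Arsenal</td><td>Chelsea</td><td>2-1</td></tr>"
            + "<tr><td>Leeds</td><td>Everton</td><td>0-0</td></tr>"
            + "</table>";
        Document doc = Jsoup.parse(html);

        List<String> homeTeams = new ArrayList<>();
        List<String> awayTeams = new ArrayList<>();
        List<String> results = new ArrayList<>();

        for (Element row : doc.select("table.stat tr")) {
            // Header rows and malformed rows won't have three <td> cells;
            // skipping them keeps the three lists aligned.
            if (row.select("td").size() < 3) {
                continue;
            }
            homeTeams.add(row.select("td").get(0).text());
            awayTeams.add(row.select("td").get(1).text());
            results.add(row.select("td").get(2).text());
        }
        System.out.println(homeTeams + " " + awayTeams + " " + results);
    }
}
```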