JSoup and a website I can't scrape

Posted on 2024-12-17 15:07:09

I'm learning to use JSoup, and to practice I started with a site that lists football matches. I'm not good at HTML, which may be the problem. Anyhow, I inspected the site's markup to work out how to retrieve the information I need. It works fine with the initial page. Here's my code:

import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class JSoup_example {

    public static void main(String[] args) {

        try {
            Document doc = Jsoup.parse(new URL("https://www.bwin.it/sportsbook.aspx"), 1000000);

            Elements resultLinks = doc.select("tr.normal");

            //System.out.println(doc);
            System.out.println(resultLinks.size());

            for (Element link : resultLinks) {
                System.out.println(link.select("td.label").first().text() + " - " + link.select("td.label").last().text());
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

But I run into problems when I try to scrape another page of the same site, for example:

https://www.bwin.it/calcio

whose markup, as far as I can tell, looks exactly like the first page's. After many attempts I inspected the central element with my browser (Chrome) and found that the table containing the data I want has its "baseURI" property set as follows:

https://www.bwin.it/betViewIframe.aspx?SportID=4&bv=bb&selectedLeagues=0

If I put that link in my code, it works as expected, but I don't understand why it doesn't work with the previous link. So my questions are:

What am I doing wrong?

Is there a better way to retrieve the table's link instead of finding it manually?
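
To make the second question concrete, here is a minimal sketch of what an automatic lookup might look like. It rests on an assumption the post does not confirm: that the table lives in an `<iframe>` whose `src` is present in the static HTML of the outer page. If the frame is injected by JavaScript, JSoup would not see it. The class name `IframeSources`, the helper `iframeUrls`, and the inline HTML string are hypothetical stand-ins for illustration, not the site's real markup.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IframeSources {

    // Collects the absolute URL of every <iframe> that carries a src attribute.
    static List<String> iframeUrls(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element frame : doc.select("iframe[src]")) {
            // absUrl resolves a relative src against the document's base URI
            String url = frame.absUrl("src");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the outer page (assumption)
        String html = "<html><body>"
                + "<iframe src=\"/betViewIframe.aspx?SportID=4&amp;bv=bb&amp;selectedLeagues=0\"></iframe>"
                + "</body></html>";

        // The second argument is the base URI used to resolve relative links
        Document outer = Jsoup.parse(html, "https://www.bwin.it/calcio");

        for (String url : iframeUrls(outer)) {
            System.out.println(url);
        }
    }
}
```

Each URL found this way could then be fetched and queried with the same `tr.normal` selector used in the code above.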

Thanks for the help.
