JSoup and a site I can't scrape

Posted on 2024-12-17 15:07:09

I'm learning to use JSoup, and to practice I started with a site that lists football matches. I'm not good at HTML, which may be the problem. In any case, I inspected the site's source to work out how to retrieve the information I need. It works fine with the initial page. Here's my code:

import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class JSoup_example {

    public static void main(String[] args) {

        try {
            Document doc = Jsoup.parse(new URL("https://www.bwin.it/sportsbook.aspx"), 1000000);

            Elements resultLinks = doc.select("tr.normal");

            //System.out.println(doc);
            System.out.println(resultLinks.size());

            for (Element link : resultLinks) {
                System.out.println(link.select("td.label").first().text() + " - " + link.select("td.label").last().text());
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

But I'm having problems if I try to scrape another page of the same site, for example:

https://www.bwin.it/calcio

which, judging by its source, looks exactly like the first one.
After many attempts I inspected the central element with my browser (Chrome) and found that the table containing the data I want has the attribute "baseURI" set as follows:

https://www.bwin.it/betViewIframe.aspx?SportID=4&bv=bb&selectedLeagues=0

If I put that link in my code it works as expected, but I don't understand why it doesn't work with the previous link. So my questions are:

What am I doing wrong?

Is there a better way to retrieve the link to the table instead of doing it manually?

Thanks for the help.
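
A note on the second question. Jsoup parses only the HTML returned for the URL it is given; it does not load the contents of an iframe and it does not run JavaScript. The path betViewIframe.aspx suggests the table lives in an iframe, so one possible approach is to read the iframe's src from the outer page and fetch that URL in a second request. The following is a minimal sketch under that assumption. The class name IframeSourceFinder, the method iframeUrls, and the inline markup are hypothetical stand-ins, and the real page may differ; if the iframe is injected by a script, it will not appear in the static HTML and this approach will find nothing.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IframeSourceFinder {

    // Collects the absolute URL of every <iframe> that carries a src attribute.
    public static List<String> iframeUrls(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element frame : doc.select("iframe[src]")) {
            // absUrl resolves a relative src against the document's base URI
            // and returns an empty string if no absolute URL can be built.
            String url = frame.absUrl("src");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the outer page (assumption, not the real source).
        String html = "<html><body>"
                + "<iframe src=\"/betViewIframe.aspx?SportID=4&amp;bv=bb&amp;selectedLeagues=0\"></iframe>"
                + "</body></html>";

        // The second argument is the base URI used to resolve relative links.
        Document outer = Jsoup.parse(html, "https://www.bwin.it/calcio");

        for (String url : iframeUrls(outer)) {
            System.out.println(url);
        }
    }
}
```

Against the live site, the outer document would come from Jsoup.connect("https://www.bwin.it/calcio").get() instead of the inline string, and each URL returned could then be fetched with Jsoup.connect(url).get() and queried with the same tr.normal selector as in the code above.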
