Why does JSoup return an empty string when I try to scrape this data?

Posted on 2025-01-29 21:32:28

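import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JSoupBundesliga {

    private static final String URL = "https://www.flashscore.com/football/england/premier-league-2019-2020/";

    public static void main(String[] args) {

        Document doc = null;

        try {
            // Fetch and parse the HTML exactly as the server sends it; no JavaScript is run.
            doc = Jsoup.connect(URL).timeout(6000).get();
        }
        catch (IOException e) {
            e.printStackTrace();
            return;
        }

        // CSS selector matching elements that carry all three classes.
        Elements awayScore = doc.select(".leagues--static.event--leagues.summary-results");
        System.out.println(awayScore);

    }

}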

Above is my full code, the website looks like this:

Link To Inspect Element Image

I am trying to scrape the part I highlighted in blue, but I can't scrape anything. If someone could help with even just the names, like "Arsenal", that would be much appreciated as well.

Comments (1)

疾风者 2025-02-05 21:32:28

That page cannot be scraped like this. The actual HTML content is very barebones; instead, everything you see in the web browser when you visit it is filled in by JavaScript.

JSoup isn't a JavaScript runner and isn't a browser. It simply loads the HTML you ask for, does not run JavaScript (and cannot be made to do so), and then lets you scrape whatever you want from it. In this case that is basically nothing, because the HTML served by that site doesn't contain anything useful.

Many websites work like this: all actual content is written in by JavaScript.

You're stuck in a dead end: JSoup cannot be used here. What you can try instead is Selenium or a similar tool. These start a real browser and use it to render the page just as a real browser does, and then let you interact with the DOM programmatically, much as JSoup does. Getting this to run on headless servers can be mildly challenging, and it's incredibly heavy: JSoup can trivially run in a tiny memory footprint, and very quickly, whereas the Selenium route cannot, because it needs to fire up an entire browser, and browsers are, relatively speaking, humongous beasts. Selenium isn't meant for this purpose; it's meant for testing your own websites. But it can do the job, more or less.

Many browsers have a 'disable JavaScript' option, possibly in the devtools. Turn that on and load the page. What you see is what you can scrape with JSoup. If the content you want disappears when you do this, then you cannot scrape it with JSoup.
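
The same check can be roughly approximated from code, without a browser. The sketch below is only an illustration under a few assumptions: `RawHtmlCheck`, `fetchRawHtml` and `presentWithoutJs` are hypothetical names that are not part of JSoup or of the question, it uses only the JDK's `java.net.http` client (Java 11+), and it strips `<script>` blocks with a crude regex rather than a real HTML parser. It downloads the HTML as the server sends it and reports whether a given piece of text appears outside of any script.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of JSoup or the original post.
class RawHtmlCheck {

    // Crude pattern for <script>...</script> blocks. Good enough for a quick
    // check, not a substitute for a real HTML parser.
    private static final Pattern SCRIPT_BLOCK =
            Pattern.compile("<script\\b[^>]*>.*?</script>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Downloads the page body as sent by the server; no JavaScript is executed.
    static String fetchRawHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(6))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // True if the text occurs in the served HTML outside of script blocks.
    static boolean presentWithoutJs(String rawHtml, String text) {
        String withoutScripts = SCRIPT_BLOCK.matcher(rawHtml).replaceAll("");
        return withoutScripts.contains(text);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 2) {
            System.out.println("usage: java RawHtmlCheck.java <url> <text>");
            return;
        }
        String html = fetchRawHtml(args[0]);
        System.out.println(presentWithoutJs(html, args[1])
                ? "Present in the served HTML: JSoup should be able to reach it."
                : "Not in the served HTML: likely filled in by JavaScript.");
    }
}
```

Running it with the page URL and a team name such as "Arsenal" as the two arguments gives a quick hint before you write any selectors. A "present" result only means the text is somewhere in the markup (possibly in an attribute or a comment), so the browser check described above remains the more reliable test.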
