Jsoup SocketTimeoutException: Read timed out

Posted on 2024-11-18 12:25:20

I get a SocketTimeoutException when I try to parse a lot of HTML documents using Jsoup.

For example, I have a list of links:

<a href="www.domain.com/url1.html">link1</a>
<a href="www.domain.com/url2.html">link2</a>
<a href="www.domain.com/url3.html">link3</a>
<a href="www.domain.com/url4.html">link4</a>

For each link, I parse the document linked to the URL (from the href attribute) to get other pieces of information in those pages.

I can imagine that this takes a lot of time, but how can I prevent this exception? Here is the whole stack trace:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
    at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at java.net.HttpURLConnection.getResponseCode(Unknown Source)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
    at app.ForumCrawler.crawl(ForumCrawler.java:50)
    at Main.main(Main.java:15)

Comments (6)

与往事干杯 2024-11-25 12:25:21

This should work:
Jsoup.connect(url.toLowerCase()).timeout(0);

偏闹i 2024-11-25 12:25:21

Set a timeout when connecting with jsoup.

百思不得你姐 2024-11-25 12:25:20

I think you can do

Jsoup.connect("...").timeout(10 * 1000).get(); 

which sets the timeout to 10 seconds.
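
Even with a longer timeout, a crawl over many links can still hit one slow page. A minimal sketch of one way to handle that, assuming you would rather retry a timed-out fetch than abort the whole crawl: the class TimeoutRetry, the method withRetries, and the attempt and backoff values are hypothetical names chosen for illustration, and the lambda in main is a stand-in for the real jsoup call so the example runs without any library.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

final class TimeoutRetry {

    /** Runs the fetch and retries only when it fails with a socket timeout. */
    static <T> T withRetries(Callable<T> fetch, int maxAttempts, long backoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait a little longer before each new attempt.
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last; // every attempt timed out
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for: () -> Jsoup.connect(url).timeout(10 * 1000).get()
        // It simulates two timeouts followed by a successful fetch.
        String page = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "<html>ok</html>";
        }, 5, 50);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

Other exceptions are deliberately not retried here, so a malformed URL or an HTTP error still fails immediately.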

画离情绘悲伤 2024-11-25 12:25:20

Ok - so, I tried to offer this as an edit to MarcoS's answer, but the edit was rejected. Nevertheless, the following information may be useful to future visitors:

According to the javadocs, the default timeout for an org.jsoup.Connection is 30 seconds.

As has already been mentioned, this can be changed using timeout(int millis).

Also, as the OP notes in the edit, timeout(0) can be used. However, as the javadocs state:

A timeout of zero is treated as an infinite timeout.
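
To make the timeout behaviour concrete: the stack trace in the question shows the exception surfacing from java.net.HttpURLConnection, so a "Read timed out" can be reproduced with the JDK alone. The sketch below is not jsoup code. The class ReadTimeoutDemo, the helper timesOut, and the 300 ms value are made up for illustration. It connects to a local socket that never answers and reports whether the read timed out.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

final class ReadTimeoutDemo {

    /** Returns true if waiting for the response from the URL hit the read timeout. */
    static boolean timesOut(URL url, int readTimeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(1000);
        conn.setReadTimeout(readTimeoutMillis); // 0 would mean "wait forever"
        try {
            conn.getResponseCode(); // blocks until headers arrive or the timeout fires
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // A listening socket that never reads the request or sends a reply.
        try (ServerSocket silent = new ServerSocket(0)) {
            URL url = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/").toURL();
            System.out.println("timed out: " + timesOut(url, 300));
        }
    }
}
```

Passing 0 as the read timeout in this sketch would block indefinitely against such a server, which is the practical risk of the "infinite timeout" quoted above.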

灼疼热情 2024-11-25 12:25:20

I had the same error:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)

and only setting .userAgent("Opera") worked for me.

So I used the userAgent(String userAgent) method of the Connection class to set the jsoup user agent.

Something like:

Jsoup.connect("link").userAgent("Opera").get();
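
The answer above does not say why the user agent mattered. One possible reading, offered only as an assumption, is that the server responded differently depending on the User-Agent request header. The sketch below uses only the JDK (not jsoup) to show that a client-chosen value is what the server receives in that header. The class UserAgentEcho and the method echoedUserAgent are hypothetical names for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class UserAgentEcho {

    /** Starts a throwaway local server, sends one GET, and returns the User-Agent the server saw. */
    static String echoedUserAgent(String userAgent) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            // Echo the received User-Agent header back as the response body.
            String seen = String.valueOf(exchange.getRequestHeaders().getFirst("User-Agent"));
            byte[] body = seen.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setReadTimeout(5000);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("server saw User-Agent: " + echoedUserAgent("Opera"));
    }
}
```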

爱的故事 2024-11-25 12:25:20

There is a mistake at https://jsoup.org/apidocs/org/jsoup/Connection.html.
The default timeout is not 30 seconds. It is 3 seconds.
Just look at the javadoc in the source code. It says 3000 ms.
