Getting the source code of the following page using Java

Posted 2024-11-03 13:38:53

I am trying to get the source code for the following page: http://www.amazon.com/gp/offer-listing/082470732X/ref=dp_olp_0?ie=UTF8&redirect=true&condition=all
(Please note that Amazon takes you to another page if you click on the link. To get to the page that I am interested in reading please copy the link and paste it to an empty tab in your browser. Thanks!)

Normally, using the java.net API, I can get the source of most URLs with almost no problem; for the above link, however, I get nothing. It turned out that the input stream returned by the connection is gzip-encoded, so I tried the following:

URL url = new URL(urlString);
HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
InputStream is = urlConnection.getInputStream();
HttpURLConnection.setFollowRedirects(true);
urlConnection.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = urlConnection.getContentEncoding();
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
     is = new GZIPInputStream(is);
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
     is = new InflaterInputStream((is), new Inflater(true));
}

However, this time I consistently get the following error:

java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:249)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:239)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:142)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:67)
at domain.logic.ItemScraper.loadURL(ItemScraper.java:405)
at domain.logic.ItemScraper.main(ItemScraper.java:510)

Can anybody see my mistake? Is there another way to read this particular page? Can somebody explain to me why my browser (Firefox) can read it, while I cannot read the source using Java?

Thanks in advance, best regards,

Comments (2)

幸福%小乖 2024-11-10 13:38:53

Instead of

is = new GZIPInputStream(is);

try

is = new GZIPInputStream(urlConnection.getInputStream());

As for the EOFException, it goes away if you add a browser-like User-Agent header:

urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24");
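Putting the two suggestions together, one possible arrangement is sketched below. It also moves the header calls ahead of getInputStream(): in the snippet as posted in the question, the stream is opened first, and HttpURLConnection does not accept request properties once it is connected. The class name PageFetcher, the helper names decode, readAll and fetch, the choice to request only gzip, and the UTF-8 charset are assumptions made for illustration; they do not come from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

class PageFetcher {

    // Wraps the raw body in a GZIPInputStream only when the server
    // reports "gzip" as the Content-Encoding; otherwise returns it unchanged.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Reads the stream to the end and decodes the bytes as UTF-8
    // (the charset is an assumption; a real page may declare another one).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Hypothetical helper: fetches a page, sending the given User-Agent.
    static String fetch(String urlString, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        // Request headers have to be set before the connection is opened.
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setRequestProperty("Accept-Encoding", "gzip");
        // getInputStream() connects; the response headers are available afterwards.
        try (InputStream in = decode(conn.getInputStream(), conn.getContentEncoding())) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling PageFetcher.fetch(urlString, "Mozilla/5.0 ...") with the User-Agent string from this answer would then return the page body as a String, assuming the server responds as described above.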

水晶透心 2024-11-10 13:38:53

You can use a standard BufferedReader to read the response of a webserver of a given URL.

URLIn = new BufferedReader(new InputStreamReader(new URL(URLOrFilename).openStream()));

Then use ...

while ((incomingLine = URLIn.readLine()) != null) {
 ...
}

... to get the response.
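A self-contained version of this approach might look like the sketch below. The names LineReaderSketch, readLines and readUrl are hypothetical, and UTF-8 is an assumed charset. Note that URL.openStream() sends no custom User-Agent and does not decompress a gzip-encoded body, so on its own it may not be enough for the page in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class LineReaderSketch {

    // Reads the stream line by line and joins the lines with '\n'.
    // try-with-resources closes the reader (and the underlying stream).
    static String readLines(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Hypothetical convenience wrapper around URL.openStream().
    static String readUrl(String urlOrFilename) throws IOException {
        return readLines(new URL(urlOrFilename).openStream());
    }
}
```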
