A way to reliably fetch the contents of www.bing.com/

Posted on 2024-12-25 16:26:26

I have been working on a program that gets the contents of www.bing.com and saves them to a file. Of the two ways I have tried, one using sockets and the other using HtmlUnit, neither shows the contents 100% correctly when I open the file. I know there are other options out there, but I am looking for one that is guaranteed to get the contents of www.bing.com/ correctly. I would therefore appreciate it if someone could point me to a means of accomplishing this.

Comments (3)

傲娇萝莉攻 2025-01-01 16:26:26

The differences you see are likely due to the web server providing different content to different browsers based on the user agent string and other request headers.

Try setting the User-Agent header in your socket and HtmlUnit strategies to the one you are comparing against and see if the result is as expected. Moreover, you will likely have to replicate the request headers exactly as they are sent by your target browser.

丢了幸福的猪 2025-01-01 16:26:26

What is "incorrect" about what is returned? Keep in mind, Bing is probably generating some of the content via JavaScript; your client will need to make additional requests to retrieve the JavaScript files, run the JavaScript, etc.

薄荷→糖丶微凉 2025-01-01 16:26:26

You can use URL.openConnection() to create a URLConnection and then call URLConnection.getInputStream(). You can read the InputStream contents and write them to a file.
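As a minimal sketch of that approach (the class name PageSaver, the method name save, and the output file name bing.html are illustrative assumptions, not part of the original answer), the response body can be copied to disk byte for byte. Copying raw bytes rather than decoding them into a String avoids one possible source of "incorrect" content, namely a charset mismatch when re-encoding the text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class PageSaver {
    /** Copies the raw response body of the given address into target and returns the byte count. */
    static long save(String address, Path target) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        // try-with-resources closes the stream even if the copy fails part-way.
        try (InputStream in = conn.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // "bing.html" is an arbitrary output file name chosen for this sketch.
        long bytes = save("https://www.bing.com/", Path.of("bing.html"));
        System.out.println("Saved " + bytes + " bytes");
    }
}
```

This saves only the HTML the server returns for this one request; anything the page builds later with JavaScript will not be in the file.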

If you need to override the User-Agent because the server is using it to serve different content, you can do so by first setting the http.agent system property to an empty string.

/* Somewhere in your code before you make requests */
System.setProperty("http.agent", ""); 

or using -Dhttp.agent= on your java command line

and then setting the User-Agent to something useful on the connection before you get the InputStream.

URLConnection conn = ... //Create your URL connection as described above.
String userAgent = ... //Some user-agent string here.
conn.setRequestProperty("User-Agent", userAgent);
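Putting these pieces together, a hedged end-to-end sketch might look like the following. The class name HeaderAwareSaver, the User-Agent value ExampleBrowser/1.0, the Accept-Language value, and the timeouts are all placeholders assumed for illustration; in practice you would copy the header values your target browser actually sends. Accept-Encoding is deliberately not copied here, because URLConnection does not decompress gzip responses on its own and the saved file would then contain compressed bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

final class HeaderAwareSaver {
    /** Fetches address with the given request headers and writes the raw body to target. */
    static long save(String address, Map<String, String> headers, Path target) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        // Headers must be set before the connection is opened by getInputStream().
        headers.forEach(conn::setRequestProperty);
        conn.setConnectTimeout(10_000); // milliseconds; arbitrary value for this sketch
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // As suggested above: clear the default agent before any request is made.
        System.setProperty("http.agent", "");
        // Placeholder header values; replace them with what your browser sends.
        Map<String, String> headers = Map.of(
                "User-Agent", "ExampleBrowser/1.0",
                "Accept-Language", "en-US,en;q=0.9");
        long bytes = save("https://www.bing.com/", headers, Path.of("bing.html"));
        System.out.println("Saved " + bytes + " bytes");
    }
}
```

Even with matching headers, the saved file may still differ from what a browser shows if the server varies content by cookies or if the page is assembled by JavaScript after loading.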