How do I crawl a WordPress blog?

Posted on 2024-10-19 23:37:07

I wrote a C program to crawl blogs. It worked well until it hit this blog: www.ipujia.com. I send the HTTP request:

GET http://www.ipujia.com/ HTTP/1.0

to the website and get the following response:

HTTP/1.1 301 Moved Permanently
Date: Sun, 27 Feb 2011 13:15:26 GMT
Server: Apache/2.2.16 (Unix) mod_ssl/2.2.16 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.14
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Pragma: no-cache
Last-Modified: Sun, 27 Feb 2011 13:15:27 GMT
Location: http://http/www.ipujia.com/
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

This is strange, because following the Location header does not get me the index page. Does anyone have any ideas?

Comments (1)

感情废物 2024-10-26 23:37:07

The Location field in the response contains a malformed URI.

Location: http://http/www.ipujia.com/ (note the duplicated scheme: "http" sits where the host name belongs)
It should be:

Location: http://www.ipujia.com/

Unless you control the server, there is little you can do about the response itself.

As a workaround, could you parse the Location value and attempt to extract a valid URI from it?
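
Since the crawler is written in C, here is a minimal sketch of that workaround. It assumes the only defect is the one seen in this response: a scheme name ("http" or "https") followed by "/" sitting where the host should be. The function name `repair_location` and its return codes are hypothetical, made up for illustration. Note the caveat that a server whose host is genuinely named "http" would also be rewritten, so you may want to apply this only after the original Location has failed to resolve.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any library).
 * If the text right after "://" is a bare scheme name followed by '/',
 * e.g. "http://http/www.ipujia.com/", drop that bogus host so the next
 * segment becomes the host: "http://www.ipujia.com/".
 *
 * Returns  1 if the value was repaired,
 *          0 if it was copied to `out` unchanged,
 *         -1 if `out` is too small to hold the result.
 */
int repair_location(const char *location, char *out, size_t out_size)
{
    static const char *schemes[] = { "http", "https" };
    const char *sep = strstr(location, "://");
    int written;

    if (sep != NULL) {
        const char *after = sep + 3;   /* first character of the host part */
        size_t i;

        for (i = 0; i < sizeof schemes / sizeof schemes[0]; i++) {
            size_t n = strlen(schemes[i]);

            /* Host part is exactly a scheme name followed by '/'. */
            if (strncmp(after, schemes[i], n) == 0 && after[n] == '/') {
                /* Keep the original scheme, skip the bogus "http/" host. */
                written = snprintf(out, out_size, "%.*s://%s",
                                   (int)(sep - location), location,
                                   after + n + 1);
                return (written < 0 || (size_t)written >= out_size) ? -1 : 1;
            }
        }
    }

    /* Nothing suspicious found: pass the value through untouched. */
    written = snprintf(out, out_size, "%s", location);
    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

You would call it with the raw header value and a buffer, for example `repair_location(value, fixed, sizeof fixed)`, and then request `fixed` instead of `value`. Because the repaired URL here is the same one you originally requested, the crawler should also cap the number of redirects it follows so it does not loop.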
