Scrapy, hash tag on URLs

Posted 2024-11-18 17:15:19

I'm in the middle of a scraping project using Scrapy.

I noticed that Scrapy drops everything in the URL from the hash sign (#) to the end.

Here's the output from the shell:

[s]   request    <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s]   response   <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>

This really affects my scraping: after a couple of hours trying to find out why some items were not being selected, I realized that the HTML served for the long URL differs from the HTML served for the short one. On closer inspection, the content changes in some critical parts.

Is there a way to modify this behavior so Scrapy keeps the whole URL?

Thanks for your feedback and suggestions.

酒中人 2024-11-25 17:15:19

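This isn't something Scrapy itself can change. The part of the URL after the hash is the fragment identifier, which is used by the client (Scrapy here, usually a browser) and is never sent to the server.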
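
To see which part of the URL actually reaches the server, you can split it with Python's standard library. This is only a sketch: `REQUEST_URL` is the request URL from the shell output in the question, and `split_fragment` is a helper name made up for illustration, not part of Scrapy.

```python
from urllib.parse import urldefrag

# Request URL copied from the shell output in the question.
REQUEST_URL = (
    "http://www.domain.com/b?ie=UTF8&node=3006339011"
    "&ref_=pe_112320_20310580%5C"
    "#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011"
    "%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011"
    "&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011"
)


def split_fragment(url):
    """Return (url_without_fragment, fragment).

    Hypothetical helper: the first element is what an HTTP client sends
    to the server, the second stays on the client side.
    """
    result = urldefrag(url)
    return result.url, result.fragment


if __name__ == "__main__":
    sent, fragment = split_fragment(REQUEST_URL)
    print("sent to server:  ", sent)
    print("client-side only:", fragment)
```

The first value should match the URL shown on the response line of the shell output, so the shorter URL is what any HTTP client would request.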

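What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier, loads additional data via AJAX, and updates the page. You'll need to look at what the browser does and see if you can emulate it. Developer tools such as Firebug or the Chrome or Safari inspector make this easy.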

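For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data ( http://twitter.com/users/show_for_profile.json?screen_name=also ) which is used to generate the page and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.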
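
Emulating the browser then comes down to turning the fragment into the URL the page's JavaScript would request. Here is a rough sketch for the Twitter example above. The endpoint path comes from that example and may no longer exist. The helper name `data_url_from_fragment` and the assumption that the fragment always has the form `!/<screen_name>` are mine, so check the real requests in your browser's network inspector before relying on it.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# Endpoint path taken from the Twitter example above (historical, may be gone).
PROFILE_ENDPOINT = "/users/show_for_profile.json"


def data_url_from_fragment(page_url):
    """Build the AJAX data URL for a hash-bang page URL.

    Assumes the fragment looks like "!/<screen_name>", as in
    http://twitter.com/#!/also. Hypothetical helper for illustration.
    """
    parts = urlsplit(page_url)
    # "!/also" -> "also"
    screen_name = parts.fragment.lstrip("!/")
    if not screen_name:
        raise ValueError("no screen name found in fragment: %r" % page_url)
    query = urlencode({"screen_name": screen_name})
    return urlunsplit((parts.scheme, parts.netloc, PROFILE_ENDPOINT, query, ""))


if __name__ == "__main__":
    print(data_url_from_fragment("http://twitter.com/#!/also"))
```

In a spider you would request the resulting URL instead of the original one, for instance with `scrapy.Request(url, callback=...)`, and decode the body in the callback with `json.loads(response.text)` rather than using HTML selectors.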