Scrapy, hash tag on URLs
I'm in the middle of a scraping project using Scrapy.
I noticed that Scrapy strips everything from the hash sign (#) to the end of the URL.
Here's the output from the shell:
[s] request <GET http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cn%3A%212334086011%2Cn%3A%212334148011%2Cn%3A3006339011%2Cp_8%3A2229010011&bbn=3006339011&ie=UTF8&qid=1309631658&rnid=598357011>
[s] response <200 http://www.domain.com/b?ie=UTF8&node=3006339011&ref_=pe_112320_20310580%5C>
This really affects my scraping: after a couple of hours trying to find out why some items were not being selected, I realized that the HTML served for the long URL differs from the HTML served for the short one. On closer inspection, the content changes in some critical parts.
Is there a way to modify this behavior so Scrapy keeps the whole URL?
Thanks for your feedback and suggestions.
Comments (3)
This isn't something Scrapy itself can change. The portion of the URL following the hash is the fragment identifier, which is used by the client (Scrapy here, usually a browser) rather than the server.
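To make the split concrete, here is a minimal sketch using only Python's standard library. It is not Scrapy's internal code, just an illustration of which part of a URL goes to the server. The URL is a shortened version of the one in the question.

```python
from urllib.parse import urldefrag

# Shortened version of the URL from the question (some parameters omitted).
url = ("http://www.domain.com/b?ie=UTF8&node=3006339011"
       "#/ref=sr_nr_p_8_0?rh=n%3A165796011&bbn=3006339011")

# urldefrag separates the part that is sent to the server from the fragment,
# which stays on the client.
sent_to_server, fragment = urldefrag(url)

print(sent_to_server)  # http://www.domain.com/b?ie=UTF8&node=3006339011
print(fragment)        # /ref=sr_nr_p_8_0?rh=n%3A165796011&bbn=3006339011
```

Note that everything after the first `#` belongs to the fragment, including the second `?` and the parameters that follow it.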
What probably happens when you fetch the page in a browser is that the page includes some JavaScript that looks at the fragment identifier, loads some additional data via AJAX, and updates the page. You'll need to look at what the browser does and see if you can emulate it. Developer tools like Firebug or the Chrome or Safari inspector make this easy.
For example, if you navigate to http://twitter.com/also, you are redirected to http://twitter.com/#!/also. The actual URL loaded by the browser here is just http://twitter.com/, but that page then loads data (http://twitter.com/users/show_for_profile.json?screen_name=also) which is used to generate the page and is, in this case, just JSON data you could parse yourself. You can see this happen using the Network Inspector in Chrome.
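As a rough sketch of emulating that step, the function below rebuilds the data URL from the hash-bang URL. The mapping is taken only from the Twitter example above and may no longer match the live site. The function name `profile_data_url` is hypothetical, and for any other site the rule has to be worked out with the network inspector.

```python
from urllib.parse import urlsplit, urlencode

def profile_data_url(hashbang_url):
    """Map http://twitter.com/#!/<name> to the JSON URL from the example.

    Hypothetical helper: the endpoint path comes from the example above,
    not from any documented API.
    """
    parts = urlsplit(hashbang_url)
    fragment = parts.fragment  # e.g. "!/also"
    # Drop the "!/" prefix that hash-bang URLs put before the real path.
    screen_name = fragment[2:] if fragment.startswith("!/") else fragment
    query = urlencode({"screen_name": screen_name})
    return f"{parts.scheme}://{parts.netloc}/users/show_for_profile.json?{query}"

data_url = profile_data_url("http://twitter.com/#!/also")
print(data_url)
# http://twitter.com/users/show_for_profile.json?screen_name=also
```

In a spider, a URL built this way could be requested with `scrapy.Request` and the body decoded with `json.loads(response.text)` instead of being parsed as HTML.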
It looks like this is not possible. The problem is not in the response but in the request, which drops that part of the URL.
Related question: "Can I read the hash portion of the URL on my server-side application (PHP, Ruby, Python, etc.)?"
Why do you need the stripped part, given that the server never receives it from a browser either?
If you are working with Amazon, I haven't seen any problems with such URLs.
Actually, when you enter that URL in a web browser, the browser also sends only the part before the hash sign to the web server. If the content is different, it's probably because some JavaScript on the page, based on the content of the fragment, changes the page after it has loaded (most likely by making an XMLHttpRequest that loads additional content).
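To see what such a script has to work with, the fragment can be parsed on its own. This is a sketch under one assumption: that the page treats the fragment as a path followed by a query string. The question does not confirm this, and the URL is shortened from the one shown there.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened version of the URL from the question (some parameters omitted).
url = ("http://www.domain.com/b?ie=UTF8&node=3006339011"
       "#/ref=sr_nr_p_8_0?rh=n%3A165796011%2Cp_8%3A2229010011"
       "&bbn=3006339011&qid=1309631658")

fragment = urlsplit(url).fragment

# Assumption: the fragment has the shape "path?query".
frag_path, _, frag_query = fragment.partition("?")
params = parse_qs(frag_query)  # also decodes %3A and %2C

print(frag_path)      # /ref=sr_nr_p_8_0
print(params["rh"])   # ['n:165796011,p_8:2229010011']
print(params["bbn"])  # ['3006339011']
```

These values are likely what the follow-up request carries, so comparing them with the requests shown in the browser's network inspector is a reasonable next step.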