Gap.com is redirecting me when I try to screen scrape

Posted 2024-09-07 09:05:00


We are building a site that allows users to collect and store their favorite products from all over the Internet in one spot. We have an algorithm that filters out and finds the correct image by reading the source code. 80% of the sites work correctly, but 2 large companies are blocking us by redirecting us from a product page to their homepage.

For example, this product http://www.gap.com/browse/product.do?pid=741123&kwid=1&sem=false&sdReferer=http://www.gap.com/products/graphic-ts-toddler-boy-clothing-C35792.jsp# picks up the header for the gap.com main page and not for the product at hand.

How do we get around this redirect and allow our algorithm to collect the correct image by reading the correct source code?
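For reference, a minimal sketch of the kind of image-extraction step the question describes, using only the standard library. The sample HTML and image URL below are made up for illustration; product pages commonly expose their primary image in an `og:image` meta tag:

```python
from html.parser import HTMLParser

class OGImageParser(HTMLParser):
    """Collects the og:image meta tag, a common home for a page's product image."""
    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("property") == "og:image":
                self.og_image = d.get("content")

# Hypothetical snippet standing in for a fetched product page.
sample_html = """
<html><head>
  <meta property="og:image" content="http://example.com/img/product-741123.jpg">
  <title>Graphic T-Shirt | Product Page</title>
</head><body></body></html>
"""

parser = OGImageParser()
parser.feed(sample_html)
print(parser.og_image)  # the extracted product-image URL
```

This only works, of course, if the fetch returns the real product page rather than the homepage redirect the question is about.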


Comments (2)

爱的十字路口 2024-09-14 09:05:00


First, you might ask a lawyer to study the terms of service of your target web sites, and make sure that you won't run into legal problems.

On the technical side, set the Referer [sic] header when requesting the image. The referrer for an image should be the page in which it is embedded. The server may check that to ensure that the image is being requested to satisfy a page render by a browser, rather than an image-harvesting screen scraper.


After a bit of testing with the image in question, it doesn't look like the Referer header is required. Perhaps it is simply rejecting an unfamiliar user-agent, or is keying off some other oddity in the request, like a missing Accept header, etc.
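A rough sketch of what such a browser-like request might look like with the standard library. The exact header set that satisfies the server is a guess; the Referer value is simply the sdReferer page from the question's URL, and the User-Agent string is a generic example:

```python
import urllib.request

# Product page from the question; the header values below are guesses at
# what a server checking for "real browser" traffic might want to see.
product_url = "http://www.gap.com/browse/product.do?pid=741123&kwid=1&sem=false"

req = urllib.request.Request(
    product_url,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
        "Referer": "http://www.gap.com/products/graphic-ts-toddler-boy-clothing-C35792.jsp",
    },
)

# html = urllib.request.urlopen(req, timeout=10).read()  # uncomment to actually fetch
print(req.get_header("User-agent"))  # Request stores header keys capitalized
```

If the User-Agent alone isn't enough, add or drop headers one at a time until the redirect stops, which tells you exactly what the server is keying on.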

赠我空喜 2024-09-14 09:05:00


I'd imagine you need to change your scraper's user agent string to something that looks like a normal browser (you're probably sending a string like curl or wget by default).

There's a good chance, though, that if you're sending enough traffic their way they'll eventually notice and shut you down in a harder-to-circumvent manner.
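One way to lower the odds of being noticed and blocked is to honor robots.txt and pace requests. A small sketch with the standard library; the robots.txt lines here are hypothetical, not the site's actual file:

```python
import urllib.robotparser

# A polite scraper checks robots.txt before fetching. We parse a made-up
# snippet directly rather than fetching a real file over the network.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /checkout/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("MyScraper/1.0", "http://www.gap.com/browse/product.do?pid=741123"))  # True
print(rp.can_fetch("MyScraper/1.0", "http://www.gap.com/checkout/cart"))                 # False
print(rp.crawl_delay("MyScraper/1.0"))  # seconds to sleep between requests
```

Sleeping `crawl_delay` seconds between requests (e.g. with `time.sleep`) keeps traffic low enough that you're less likely to trip rate-based blocking.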
