Screen scraping a website that blocks my IP

Posted on 2024-09-17 00:44:19, 237 words, 7 views, 0 comments

Hello, I want to screen scrape a site like Yelp to get the phone numbers of Italian restaurants. I wrote a simple program that does just what I wanted, but they blocked my server's IP.

I am using PHP to do it. How can I get past the IP block?

I've heard about programs like screen-scraper, but I haven't used one yet.

What is the best way to do it? And is it possible to use screen-scraper with PHP?

Please note: this is a personal project I'm working on; I'm not trying to build a business out of it.

Comments (2)

烈酒灼喉 2024-09-24 00:44:19

If you're doing this for commercial gain, stop right where you are. See if you can find licensed means to get at the same data, or pound the pavement yourself. Some companies intentionally inject mistakes or identifiable typos into their information as a way to catch people like you and will take legal steps to protect their intellectual property (even though that info is completely free if collected any other way). Being cheap can sometimes end up being very expensive.

If you're not doing this for commercial gain (and you just really love Italian food), move servers or wait until the IP block lifts (which may be never). Rewrite your code and put a massive rate limiter on your requests: emulate a user and fetch one page every 5-10 seconds or so. Scrape the site over several days in short sessions. If they see too many requests from a single IP in too short a time, they will blacklist you again. If you were them, you would too.
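A rough sketch of the rate limiting described above, written in Python for brevity (the same pattern carries over to PHP). The names `polite_fetch`, `fetch_page`, and `session_limit` are made up for illustration, and the 50-page session cap is an arbitrary assumption; only the 5-10 second pause comes from the advice itself.

```python
import random
import time
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple


def fetch_page(url: str) -> str:
    """Download one page and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_page,
    min_delay: float = 5.0,
    max_delay: float = 10.0,
    session_limit: int = 50,  # assumed cap on pages per short session
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, body) pairs, pausing a random interval between requests."""
    for count, url in enumerate(urls):
        if count >= session_limit:
            break  # end this session; resume with the remaining URLs later
        if count > 0:
            # Random pause so requests look like one person browsing.
            sleep(random.uniform(min_delay, max_delay))
        yield url, fetch(url)
```

Running one such session a few times a day, each over a different slice of the URL list, matches the "several days in short sessions" suggestion. The `fetch` and `sleep` parameters are injectable only so the loop can be tried out without touching the network.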

满身野味 2024-09-24 00:44:19

If you only want phone numbers, there's probably an easier way to get that info, all on one page. Try a Yellow Pages sort of site. Look up Italian restaurants in your area. Save the whole page. You then have the numbers.

There may be another site that has this info available via an API, too - that way you don't have to break any terms of service. Poorly written or aggressive scraping scripts can temporarily damage webservers - there IS a reason sites block these actions.
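For the saved-page approach, pulling the numbers out of the HTML file can be as simple as the sketch below. It assumes ten-digit North American numbers such as `(415) 555-0101` or `415-555-0102`; the function names and the regular expression are illustrative, and other formats would need a different pattern.

```python
import re
from html import unescape
from typing import List

# Assumed format: 3-3-4 digits, optional parentheses, space/dot/dash separators.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")
TAG_RE = re.compile(r"<[^>]+>")


def extract_phone_numbers(html: str) -> List[str]:
    """Return the unique phone numbers found in an HTML string, in page order."""
    text = unescape(TAG_RE.sub(" ", html))  # drop tags, decode entities
    matches = PHONE_RE.findall(text)
    return list(dict.fromkeys(matches))  # de-duplicate, keep first-seen order


def extract_from_file(path: str) -> List[str]:
    """Read a saved page from disk and extract its phone numbers."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return extract_phone_numbers(handle.read())
```

Calling `extract_from_file` on the page saved from the browser gives the list of numbers in one step, with no repeated requests to the site.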
