Making multiple requests to a server

Posted 2024-08-30 07:52:51

I have a DB with user account information.
I've scheduled a CRON job which updates the DB with any new user data it fetches from their accounts.
I was thinking that this may cause a problem since all requests are coming from the same IP address and the server may block requests from that IP address.

Is this the case?
If so, how do I avoid being banned? Should I be using a proxy?

Thanks

2 Answers

柠檬色的秋千 2024-09-06 07:52:51

You get banned for suspicious (or malicious) activity.

If you are running a normal business application inside a normal company intranet you are unlikely to get banned.

Since you have access to user account information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.

Should I be using a proxy?

A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.

我也只是我 2024-09-06 07:52:51

Is the cron job that fetches data from this "database" on the same server? Are you fetching data for a user from a remote server using screen scraping or something?

If this is the case, you may want to set up a few different cron jobs and do it in batches. That way you reduce the load on the remote server and lower the chance that wherever you are getting this data from will block your access.
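
A minimal sketch of that batching idea, assuming the cron job is a Python script; fetch_account(), update_db(), and load_users() are hypothetical stand-ins for however you already fetch and store the data:

    import sys
    import time

    BATCH_COUNT = 4  # e.g. cron invokes this script at 00:00, 06:00, 12:00 and 18:00

    def users_for_batch(all_users, batch_index):
        # Every user lands in exactly one of the BATCH_COUNT batches.
        return [u for i, u in enumerate(all_users) if i % BATCH_COUNT == batch_index]

    def run_batch(all_users, batch_index):
        for user in users_for_batch(all_users, batch_index):
            data = fetch_account(user)  # hypothetical: one remote request per user
            update_db(user, data)       # hypothetical: local DB write
            time.sleep(2)               # pause so the requests aren't back-to-back

    if __name__ == "__main__":
        run_batch(load_users(), int(sys.argv[1]))  # batch index passed by cron

Each crontab entry would then run the script with a different batch index, for example "0 0 * * * python3 sync.py 0" and "0 6 * * * python3 sync.py 1", so each batch hits the remote server at a different time of day.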

Edit

Okay, so if you have not got permission to do scraping, obviously you are going to want to do it responsibly (no matter the site). Try to gather as much data as you can from as few requests as possible, and spread them out over the course of the whole day, or even during times that are likely to be low load. I wouldn't try to use a proxy; that wouldn't really help the remote server, and it would be a pain in the ass for you.

I'm no iPhone programmer, and this might not be possible, but you could try having the individual iPhones grab the data so all the source traffic isn't from the same IP. Just an idea, otherwise just try to be a bit discreet.

Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site; a short sketch applying a few of them follows the list.

  1. Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.

  2. Identify yourself. Add something useful to the user-agent (ideally, a link to an URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."

  3. Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??

  4. Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.

  5. Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.
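
For what it's worth, tips 1, 2 and 4 come down to a few lines in Python with the requests library; the URL, the User-Agent string and the process() function below are made-up placeholders:

    import time
    import requests

    session = requests.Session()
    session.headers.update({
        # Tip 1: requests negotiates gzip by default; being explicit documents the intent.
        "Accept-Encoding": "gzip",
        # Tip 2: identify the bot and give the operator a way to contact you.
        "User-Agent": "AccountSyncBot/1.0 (+https://example.com/bot-info)",
    })

    PULL_INTERVAL = 15 * 60  # Tip 4: no more than one pull every 15 minutes.

    while True:
        resp = session.get("https://example.com/api/users.json", timeout=30)
        resp.raise_for_status()
        process(resp.json())       # Tip 3: consume the JSON feed rather than scraping HTML.
        time.sleep(PULL_INTERVAL)  # wait out the interval before the next pull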
