How do I prepare for data loss in a production website?

Posted 2024-11-06 09:37:44 · 682 words · 8 views · 0 comments

I am building an app that is moving quickly into production, and I am concerned that hacking, a silly personal error (like running rake db:schema:load or rake db:rollback), or some other circumstance could cause data loss in one database table or even across the whole system.

While I don't think any of the above is likely, it would be remiss of me not to be prepared in case it ever happens.

I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.

What is the correct way to deal with data loss on a production app?

  1. How would I restore the .dump file in case I need to? Can I do a selective restore if a small part of the system is hit?
  2. In case a selective restore is not possible: assume one table loses data 4 hours after the last backup. Result => would fixing the lost table require rolling back 4 hours of users' activity? Any good solution to this?
  3. What is the best way to support users through the inconvenience if something like this happens?

Comments (4)

预谋 2024-11-13 09:37:44

A full DR (disaster recovery) solution requires the following:

  1. Multisite. If a fire, a flood, Osama Bin Laden or what have you strikes the Amazon (or is it Salesforce?) data center that Heroku uses, you want to be sure that your data is safe elsewhere.
  2. Ongoing replication of the data to a separate site (or sites). That means that every transaction written to your database on one site is replicated within seconds to the mirror database on the other site. Most RDBMSs have mechanisms that let you set up master-slave replication like that.
  3. The same goes for anything you put on a filesystem outside of the database, such as images, XML configuration files, etc. S3 is a good solution here: they replicate everything to multiple data centers for you.
  4. It won't hurt to create periodic (daily or so) dumps of the database and store them separately (e.g. on S3). This helps you recover from data corruption that propagates to the slave DBs.
  5. Automate the process of data recovery. You want this to just work when you need it.
  6. Test everything. Ideally, you want to automate the test process and run it periodically to ensure that your backups can restore. Netflix Chaos Monkey is an extreme example of this.
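
Points 5 and 6 can be sketched as a small restore drill. This is only a minimal sketch under stated assumptions: it assumes the backup is a pg_dump custom-format file that pg_restore can read, that a throwaway database is available, and that the 24-hour freshness limit suits your app. The function names and parameters are hypothetical and are not part of Heroku or any other tool.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def restore_command(dump_path: str, scratch_db: str) -> list[str]:
    """Build a pg_restore call that loads a dump into a scratch database."""
    return [
        "pg_restore",
        "--clean",      # drop existing objects before recreating them
        "--if-exists",  # do not fail when an object to drop is missing
        "--no-owner",   # ignore the ownership recorded in the dump
        "--dbname", scratch_db,
        dump_path,
    ]


def backup_is_fresh(taken_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the newest backup is no older than max_age."""
    return now - taken_at <= max_age


def run_drill(dump_path: str, scratch_db: str, taken_at: datetime,
              max_age: timedelta = timedelta(hours=24)) -> None:
    """Fail loudly if the backup is stale or cannot be restored."""
    if not backup_is_fresh(taken_at, datetime.now(timezone.utc), max_age):
        raise RuntimeError("newest backup is older than the allowed age")
    # check=True raises CalledProcessError when pg_restore exits non-zero.
    subprocess.run(restore_command(dump_path, scratch_db), check=True)
```

Running something like run_drill on a schedule against a scratch database, never against production, is one way to find out early that a backup cannot be restored.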

I'm not sure how you'd implement all this on Heroku. A complete solution is still priced out of reach for most companies: we're running this across our own data centers (one in the US, one in the EU) and it costs many millions. Work according to the 80-20 rule: ongoing backup to a separate site, plus a well-tested recovery plan (continuously test your ability to recover from backups), covers 80% of what you need.

As for supporting users, the best solution is simply to communicate promptly and truthfully when trouble happens, and to make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.

余生再见 2024-11-13 09:37:44

About backups: you can never be 100 percent sure that no data will be lost. The best approach is to test your backups on another server. You should have at least two types of backup:

  • A database backup, such as one made with pg_dump. A plain-format dump consists solely of SQL commands, so you can use it to recreate the whole database, just one table, or just a few rows. You lose any data added after the dump was taken.

  • A code backup, for example a git repository.
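
If the .dump file is in pg_dump's custom format rather than plain SQL (an assumption about the asker's backups), pg_restore can restore selected tables directly. The sketch below only builds the command line. The helper name, database name, and table names are hypothetical.

```python
def selective_restore_command(dump_path: str, target_db: str,
                              tables: list[str],
                              data_only: bool = True) -> list[str]:
    """Build a pg_restore call limited to the named tables."""
    cmd = ["pg_restore", "--dbname", target_db]
    if data_only:
        cmd.append("--data-only")  # restore rows only, not table definitions
    for table in tables:
        cmd += ["--table", table]  # --table may be repeated, once per table
    cmd.append(dump_path)
    return cmd


if __name__ == "__main__":
    # Hypothetical names; point target_db at a scratch database first.
    print(" ".join(selective_restore_command("latest.dump", "scratch_db", ["users"])))
```

Note that with --table, pg_restore does not try to restore other objects that the selected tables depend on, so it is safer to restore into a scratch database and copy the missing rows across from there.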

弃爱 2024-11-13 09:37:44

In addition to Hartator's answer:

  • use replication if your DB offers it, e.g. at least master/slave replication with one slave

  • do database backups on a slave DB server and store them externally (e.g. scp or rsync them out of your server)

  • use a good version control system for your source code, e.g. Git

  • use a solid deploy mechanism, such as Capistrano, and write your own custom tasks so that nobody needs to run DB migrations by hand

  • have somebody you trust check your firewall setup and the security of your system in general

DB dumps contain the SQL commands to recreate all tables and all data. If you want to restore only one table, you can extract that portion from a copy of the dump file, (very carefully) edit it, and then restore that one table from the modified dump file.
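
Extracting one table's data from a plain-text dump can be sketched as below. This assumes the dump was written by pg_dump in plain format, where each table's rows sit in a block that starts with a COPY ... FROM stdin; line and ends with a line containing only \. The function name and sample data are hypothetical. Dumps made with INSERT statements would need a different approach.

```python
import re

# Tiny hand-written sample in the shape pg_dump's plain format uses.
SAMPLE_DUMP = (
    "CREATE TABLE public.users (id integer, name text);\n"
    "COPY public.users (id, name) FROM stdin;\n"
    "1\talice\n"
    "2\tbob\n"
    "\\.\n"
    "COPY public.orders (id, user_id) FROM stdin;\n"
    "10\t1\n"
    "\\.\n"
)


def extract_table_data(dump_text: str, table: str) -> str:
    """Return the COPY block for one table, or '' if the table is absent."""
    # The schema prefix and the quotes around the table name are optional.
    header = re.compile(
        r'^COPY (?:\w+\.)?"?' + re.escape(table) + r'"? \(.*\) FROM stdin;$'
    )
    block = []
    inside = False
    for line in dump_text.splitlines():
        if not inside and header.match(line):
            inside = True
        if inside:
            block.append(line)
            if line == "\\.":  # end-of-data marker for a COPY block
                break
    return "\n".join(block) + ("\n" if block else "")


if __name__ == "__main__":
    print(extract_table_data(SAMPLE_DUMP, "orders"), end="")
```

The extracted block can be saved to its own file and loaded with psql into a scratch database, which avoids hand-editing the full dump.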

Always restore to an independent machine first and check that the data looks right. For example, you could take one slave server offline, restore to it locally, and check the data. It helps to have two slaves in your system: while you restore to the second slave, the rest of the system still has one master and one slave.
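
One simple way to "check that the data looks right" is to compare per-table row counts from the restored copy against counts recorded when the backup was taken. This is a minimal sketch. It assumes the counts were collected elsewhere (for example with SELECT count(*) per table), and the function and table names are hypothetical.

```python
def count_mismatches(expected: dict[str, int], restored: dict[str, int],
                     tolerance: int = 0) -> dict[str, tuple[int, int]]:
    """Return {table: (expected, restored)} for tables whose counts differ."""
    problems = {}
    for table, want in expected.items():
        got = restored.get(table, 0)  # a missing table counts as zero rows
        if abs(want - got) > tolerance:
            problems[table] = (want, got)
    return problems


if __name__ == "__main__":
    at_backup = {"users": 1200, "orders": 5400}
    after_restore = {"users": 1200, "orders": 5100}
    print(count_mismatches(at_backup, after_restore))
```

Row counts only catch gross problems such as an empty or truncated table. They do not prove that the contents are correct.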

白馒头 2024-11-13 09:37:44

To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (but with a different custom domain name).

You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.
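
Adding a second git remote for the replica can be sketched as follows. The remote name, app URL, and branch name are all hypothetical placeholders. The sketch covers only the code push. Loading the database backup into the replica is a separate step and is not shown.

```python
import subprocess


def dr_git_commands(remote_name: str, remote_url: str,
                    branch: str = "main") -> list[list[str]]:
    """Commands that register a second remote and push the current code to it."""
    return [
        ["git", "remote", "add", remote_name, remote_url],
        ["git", "push", remote_name, f"{branch}:{branch}"],
    ]


def push_to_replica(remote_name: str, remote_url: str, branch: str = "main") -> None:
    """Run the commands in the current repository, stopping at the first failure."""
    for cmd in dr_git_commands(remote_name, remote_url, branch):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical replica app; print the commands instead of running them.
    for cmd in dr_git_commands("dr", "https://git.heroku.com/example-app-dr.git"):
        print(" ".join(cmd))
```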

The only step missing from this exercise versus a real disaster recovery is assigning your production domain to the replicated Heroku project.

If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.
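
The link between replication frequency and data loss tolerance can be made concrete with a little arithmetic. This is a rough sketch. It assumes each backup reflects the data as of the moment it started, and that a failure can strike just before the next backup finishes. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_loss(backup_interval: timedelta,
                    backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Longest stretch of activity that could be lost under this schedule."""
    # The last finished backup started one interval before the running one,
    # and the running one is not usable until it completes.
    return backup_interval + backup_duration


def schedule_ok(backup_interval: timedelta, tolerance: timedelta,
                backup_duration: timedelta = timedelta(0)) -> bool:
    """True when the schedule keeps worst-case loss within the tolerance."""
    return worst_case_loss(backup_interval, backup_duration) <= tolerance


if __name__ == "__main__":
    four_hours = timedelta(hours=4)
    print(schedule_ok(timedelta(hours=24), four_hours))  # daily backups
    print(schedule_ok(timedelta(hours=1), four_hours, timedelta(minutes=10)))
```

Under these assumptions, daily backups cannot meet a four-hour tolerance, while hourly backups can.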
