How do I prepare for data loss on a production website?
I am building an app that is moving quickly into production, and I am concerned that hacking, a silly personal error (like running rake db:schema:load or rake db:rollback), or some other circumstance could cause data loss in one database table or even across the whole system.
While I don't find it likely that the above will happen, I would be remiss in not being prepared in case it ever does.
I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.
What is the correct way to deal with data loss on a production app?
- How would I restore the .dump file in case I need to? Can I do a selective restore if only a small part of the system is hit?
- In case a selective restore is not possible: assume one table loses data 4 hours after the last backup. Would fixing the lost table require rolling back 4 hours of users' activity? Is there a good solution to this?
- What is the best way to support users through the inconvenience if something like this happens?
Comments (4)
A full DR (disaster recovery) solution requires the following:
I'm not sure how you'd implement all this on Heroku. A complete solution is still priced out of reach for most companies: we're running this across our own data centers (one in the US, one in the EU) and it costs many millions. Work according to the 80-20 rule: ongoing backup to a separate site, plus a well-tested recovery plan (continuously test your ability to recover from backups), covers 80% of what you need.
As for supporting users, the best solution is simply to communicate promptly and truthfully when trouble happens and to make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.
About backups: you can never be 100 percent sure that no data will be lost. The best you can do is test restoring on another server. You must have at least two types of backup:
- A database backup, for example one made with pg_dump. A plain-format dump consists of nothing but SQL commands, so you can use it to recreate the whole database, just a table, or just a few rows. You lose the data added since the dump was taken.
- A code backup, for example a git repository.
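For the selective-restore question, here is a minimal sketch that only builds a pg_restore command line. It assumes the .dump file is a custom-format archive (the kind pg_dump -Fc writes, which pg_restore can read) and that a scratch database already exists. The helper name build_restore_command, the database name scratch_restore, and the table name users are hypothetical.

```python
import shlex


def build_restore_command(dump_path, dbname, table=None, data_only=False):
    """Build a pg_restore argument list for a custom-format dump.

    Nothing is executed here; the caller decides when to run the command.
    """
    # --no-owner avoids failures when the scratch database has different roles.
    cmd = ["pg_restore", "--no-owner", f"--dbname={dbname}"]
    if table is not None:
        # Restrict the restore to a single table (hypothetical name).
        cmd.append(f"--table={table}")
    if data_only:
        # Restore rows only, leaving the existing schema untouched.
        cmd.append("--data-only")
    cmd.append(dump_path)
    return cmd


if __name__ == "__main__":
    cmd = build_restore_command(
        "latest.dump", "scratch_restore", table="users", data_only=True
    )
    # Print the command for review instead of running it blindly.
    print(shlex.join(cmd))
```

Restoring into a scratch database first, rather than straight into production, leaves room to inspect the recovered rows and copy back only what is missing.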
In addition to Hartator's answer:
- use replication if your DB offers it, e.g. at least master/slave replication with one slave
- do database backups on a slave DB server and store them externally (e.g. scp or rsync them off your server)
- use a good version control system for your source code, e.g. Git
- use a solid deploy mechanism, such as Capistrano, and write custom tasks so nobody needs to run DB migrations by hand
- have somebody you trust check your firewall setup and the security of your system in general
The DB dumps contain SQL commands to recreate all tables and all data. If you want to restore only one table, you can extract that portion from a copy of the dump file, edit it (very carefully), and then restore from the modified dump file (for that one table).
Always restore to an independent machine first and check that the data looks right. For example, you could take one slave server offline, restore there locally, and check the data. It helps to have two slaves in your system: while you restore to the second slave, the remaining system still has one master and one slave.
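Extracting one table's portion from a copy of the dump can be sketched as follows. This assumes a plain-format dump in which table data appears as COPY ... FROM stdin blocks terminated by a line containing only \. (pg_dump's default output, not the --inserts variant), and a simple unquoted table name. The function name extract_table_data is hypothetical.

```python
import re


def extract_table_data(dump_text, table):
    """Return the COPY block for `table` from a plain-format dump, or None."""
    # Matches "COPY users (" as well as the schema-qualified "COPY public.users (".
    start = re.compile(r"^COPY (?:\w+\.)?" + re.escape(table) + r" \(")
    block = []
    inside = False
    for line in dump_text.splitlines():
        if not inside and start.match(line):
            inside = True
        if inside:
            block.append(line)
            # A line holding only backslash-dot ends the COPY data.
            if line == "\\.":
                return "\n".join(block) + "\n"
    return None


if __name__ == "__main__":
    sample = (
        "COPY public.users (id, name) FROM stdin;\n"
        "1\talice\n"
        "\\.\n"
    )
    print(extract_table_data(sample, "users"))
```

The extracted block should still be loaded into an independent database first and checked there before anything is copied back.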
To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (except use a different custom domain name).
You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.
The only step missing from this exercise versus a real disaster recovery is assigning your production domain to the replicated Heroku project.
If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.
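To make "data loss tolerance" concrete, the sketch below checks whether the last successful replication is older than the tolerance and how much activity an incident would cost. The function names and the timestamps are hypothetical; how the last run time is recorded is left open.

```python
from datetime import datetime, timedelta


def replication_overdue(last_run, now, tolerance):
    """True when the time since the last successful replication exceeds the tolerance."""
    return now - last_run > tolerance


def worst_case_loss(last_run, incident_time):
    """User activity that would be lost if an incident hit at incident_time."""
    # Clamp at zero in case the incident predates the last replication.
    return max(incident_time - last_run, timedelta(0))


if __name__ == "__main__":
    last_run = datetime(2024, 1, 1, 0, 0)
    incident = datetime(2024, 1, 1, 4, 0)  # four hours after the last run
    print(replication_overdue(last_run, incident, timedelta(hours=1)))
    print(worst_case_loss(last_run, incident))
```

With daily replication, the four-hour gap from the question is within tolerance but still costs four hours of activity; an hourly schedule caps the loss at roughly one hour.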