您在生产中遇到的最严重的数据库事故是什么?

发布于 2024-07-04 05:22:19 字数 1450 浏览 8 评论 0原文

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(18

云淡风轻 2024-07-11 05:22:22

一些效果:

更新电子邮件集processedTime=null,sentTime=null

生产新闻通讯数据库上的电子邮件集processedTime=null,sentTime=null,重新发送数据库中的每封电子邮件。

Something to the effect of:

update email set processedTime=null,sentTime=null

on a production newsletter database, resending every email in the database.

春花秋月 2024-07-11 05:22:22

我曾经成功地编写了一个从未退出的更新游标。 在 2M+ 行的表上。 锁定不断升级,直到这个 16 核、8GB RAM(2002 年!)的盒子实际上停止运行(属于蓝屏类型)。

I once managed to write an updating cursor that never exited. On a 2M+ row table. The locks just escalated and escalated until this 16-core, 8GB RAM (in 2002!) box actually ground to a halt (of the blue screen variety).

意中人 2024-07-11 05:22:22
update Customers set ModifyUser = 'Terrapin'

我忘记了 where 子句 - 很无辜,但是在一个拥有 5000 多个客户的表上,我的名字将在每条记录上出现一段时间......

经验教训:使用事务提交和回滚!

update Customers set ModifyUser = 'Terrapin'

I forgot the where clause - pretty innocent, but on a table with 5000+ customers, my name will be on every record for a while...

Lesson learned: use transaction commit and rollback!

并安 2024-07-11 05:22:22

我们试图修复 Oracle 集群上损坏的节点。

存储管理模块出现问题,因此我们单击卸载按钮,打算重新安装并从另一个节点复制配置。

嗯,原来卸载按钮适用于整个集群,于是高高兴兴地将存储管理模块从系统中的所有节点上删除了。

导致生产集群中的每个节点崩溃。 由于所有节点都没有存储管理器,因此它们不会出现!

关于备份,有一个有趣的事实...最旧的备份会异地轮换,您知道数据库上最旧的文件是什么吗? 系统安装时设置的配置文件。

因此,我们必须让异地人员将磁带寄给快递员,几个小时后,我们重新安装并运行了所有内容。 现在我们保留安装和配置文件的本地副本!

We were trying to fix a busted node on an Oracle cluster.

The storage management module was having problems, so we clicked the un-install button with the intention of re-installing and copying the configuration over from another node.

Hmm, it turns out the un-install button applied to the entire cluster, so it cheerfully removed the storage management module from all the nodes in the system.

Causing every node in the production cluster to crash. And since none of the nodes had a storage manager, they wouldn't come up!

Here's an interesting fact about backups... the oldest backups get rotated off-site, and you know what your oldest files on a database are? The configuration files that got set up when the system was installed.

So we had to have the offsite people send a courier with that tape, and a couple of hours later we had everything reinstalled and running. Now we keep local copies of the installation and configuration files!

盗心人 2024-07-11 05:22:22

我以为我正在测试数据库中工作(显然情况并非如此),因此当我完成“测试”时,我运行一个脚本将所有数据重置回我们使用的标准测试数据。 ..哎呀!
幸运的是,这种情况发生在有备份的数据库上,因此在发现我做错了什么之后,我们可以轻松恢复原始数据库。

然而,这次事件确实教会了我工作的公司真正将生产环境和测试环境分开。

I thought I was working in the testing DB (which wasn't the case apparently), so when I finished 'testing' I run a script to reset all data back to the standard test data we use... ouch!
Luckily this happened on a database that had backups in place, so after figuring out I did something wrong we could easily bring back the original database.

However this incident did teach the company I worked for to realy seperate the production and the test environment.

一身仙ぐ女味 2024-07-11 05:22:22

我不记得所有失控的 sql 语句,但我学到了一个教训 - 如果可以的话,在事务中执行(注意大日志文件!)。

在生产中,如果可以的话,请继续采用老式方法:

  1. 使用维护时段
  2. 备份
  3. 请执行更改
  4. 验证
  5. 如果出现问题,

恢复非常不酷,但通常有效,甚至可以将此过程交给其他人在夜班期间运行它,同时您得到应有的睡眠:-)

I don't remember all the sql statements that ran out of control but I have one lesson learned - do it in a transaction if you can (beware of the big logfiles!).

In production, if you can, proceed the old fashioned way:

  1. Use a maintenance window
  2. Backup
  3. Perform your change
  4. verify
  5. restore if something went wrong

Pretty uncool, but generally working and even possible to give this procedure to somebody else to run it during their night shift while you're getting your well deserved sleep :-)

-柠檬树下少年和吉他 2024-07-11 05:22:22

我完全按照你的建议做了。 我更新了保存客户文档的表中的所有行,因为我忘记在末尾添加“where ID = 5”。 那是个错误。

但我很聪明,也很偏执。 我知道有一天我会搞砸。 我已经发出了“开始交易”。 我发出回滚,然后检查表是否正常。

事实并非如此。

生产中的经验教训:尽管事实上我们出于多种原因喜欢在 MySQL 中使用 InnoDB 表...请确保您还没有找到少数 MyISAM 表之一这不尊重事务,并且您无法回滚。 任何情况下都不要信任MySQL,习惯性地发出“开始事务”是一件好事。 即使在最坏的情况下(这里发生的情况),它也不会造成任何伤害,并且它会在 InnoDB 表上保护我。

我必须从备份中恢复该表。 幸运的是,我们有夜间备份,数据几乎从不改变,而且表有几十行,所以它几乎是瞬时的。 作为参考,没有人知道我们仍然有非 InnoDB 表,我们以为我们很久以前就将它们全部转换了。 没有人告诉我要留意这个陷阱,没有人知道它在那里。 我的老板也会做同样的事情(如果他在输入 where 子句之前太早按下回车键的话)。

I did exactly what you suggested. I updated all the rows in a table that held customer documents because I forgot to add the "where ID = 5" at the end. That was a mistake.

But I was smart and paranoid. I knew I would screw up one day. I had issued a "start transaction". I issued a rollback and then checked the table was OK.

It wasn't.

Lesson learned in production: despite the fact we like to use InnoDB tables in MySQL for many MANY reasons... be SURE you haven't managed to find one of the few MyISAM tables that doesn't respect transactions and you can't roll back on. Don't trust MySQL under any circumstances, and habitually issuing a "start transaction" is a good thing. Even in the worst case scenario (what happened here) it didn't hurt anything and it would have protected me on the InnoDB tables.

I had to restore the table from a backup. Luckily we have nightly backups, the data almost never changes, and the table is a few dozen rows so it was near instantaneous. For reference, no one knew that we still had non-InnoDB tables around, we thought we converted them all long ago. No one told me to look out for this gotcha, no one knew it was there. My boss would have done the same exact thing (if he had hit enter too early before typing the where clause too).

高冷爸爸 2024-07-11 05:22:22

我发现我不理解 Oracle 重做日志文件(术语?那是很久以前的事了)并且丢失了一周的交易数据,这些数据必须从纸质票据中手动重新输入。

还有一线希望 - 在我花在输入上的周末期间,我学到了很多关于交易输入屏幕的可用性的知识,此后它得到了显着改善。

I discovered I didn't understand Oracle redo log files (terminology? it was a long time ago) and lost a weeks' trade data, which had to be manually re-keyed from paper tickets.

There was a silver lining - during the weekend I spent inputting, I learned a lot about the useability of my trade input screen, which improved dramatically thereafter.

星星的轨迹 2024-07-11 05:22:22

对于大多数人来说,最糟糕的情况是生产数据丢失,但如果他们不运行夜间备份或将数据复制到灾难恢复站点,那么他们应得的一切!

T-SQL 中的 @Keith 不是FROM 关键字对于 DELETE 是可选的吗? 这两个语句做了完全相同的事情......

Worst case scenario for most people is production data loss, but if they're not running nightly backups or replicating data to a DR site, then they deserve everything they get!

@Keith in T-SQL, isn't the FROM keyword optional for a DELETE? Both of those statements do exactly the same thing...

攒一口袋星星 2024-07-11 05:22:22

我遇到的最糟糕的事情是生产服务器占用了硬盘中的所有空间。 我使用的是 SQL Server,所以我看到数据库文件并看到日志约为 10 GB,因此我决定执行当我想要截断日志文件时总是执行的操作。 我做了一个分离删除日志文件,然后再次附加。 好吧,我意识到如果日志文件没有正确关闭,这个过程就不起作用。 所以我最终得到一个 mdf 文件,没有日志文件。 值得庆幸的是,我访问了 Microsoft 网站,找到了一种将数据库恢复为恢复并移动到另一个数据库的方法。

The worst thing that happened to me was that a Production server consume all the space in the HD. I was using SQL Server so I see the database files and see that the log was about 10 Gb so I decide to do what I always do when I want to trunc a Log file. I did a Detach the delete the log file and then attach again. Well I realize that if the log file is not close properly this procedure does not work. so I end up with a mdf file and no log file. Thankfully I went to the Microsoft site I get a way to restore the database as recovery and move to another database.

来世叙缘 2024-07-11 05:22:22

更新客户表的所有行,因为您忘记添加 where 子句。

这正是我所做的:| 。 我已将所有用户的密码列更新为我在控制台上输入的示例字符串。 最糟糕的部分是我正在访问生产服务器,并且在执行此操作时正在检查一些查询。 然后,我的上级不得不恢复旧的备份,并不得不接听一些非常不满的客户打来的电话。 当然,还有一次我确实使用了删除语句,我什至不想谈论;-)

Updating all rows of the customer table because you forgot to add the where clause.

That was exactly i did :| . I had updated the password column for all users to a sample string i had typed onto the console. The worst part of it was i was accessing the production server and i was checking out some queries when i did this. My seniors then had to revert an old backup and had to field some calls from some really disgruntled customers. Ofcourse there is another time when i did use the delete statement, which i don't even want to talk about ;-)

木格 2024-07-11 05:22:22

我删除了实时数据库并将其删除。

经验教训:确保您了解 SQL - 并确保在接触内容之前进行备份。

I dropped the live database and deleted it.

Lesson learned: ensure you know your SQL - and make sure that you back up before you touch stuff.

要走就滚别墨迹 2024-07-11 05:22:22

这不是发生在我身上,只是我们的一个客户,我必须清理他的烂摊子。

他们有一个在 RAID5 磁盘阵列上运行的 SQL 服务器 - 漂亮的热插拔驱动器,配有亮起的磁盘状态指示灯。 绿色=好,红色=坏。

他们的一个驱动器从绿色变成红色,被告知拔出并更换(红色)坏驱动器的天才取而代之的是(绿色)好的驱动器。 好吧,这并没有完全摧毁raid集 - 选择稍微可读(红色)与不可用(绿色)几分钟..在意识到错误并将驱动器交换回在此期间写入的任何数据块之后由于磁盘同步丢失,时间变得混乱)...连续 24 小时后编写元程序来恢复可读数据并重建他们备份和运行的中等大小的模式。

这个故事的寓意包括……永远不要使用 RAID5,始终维护备份,小心你雇用的人。

我曾经在客户的生产系统上犯过一个重大错误——幸运的是,当我想知道为什么命令需要这么长时间才能执行时,我意识到我做了什么,并在世界末日之前取消了它。

这个故事的寓意包括……在更改任何内容之前始终启动一个新事务,测试结果是否符合您的预期,然后才提交事务。

作为一般观察,通过在模式上正确定义外键约束并远离任何标记为“CASCADE”的命令,可以防止许多类 rm -rf / 类型错误

This didn't happen to me, just a customer of ours whos mess I had to clean up.

They had a SQL server running on a RAID5 disk array - nice hotswap drives complete with lighted disk status indicators. Green = Good, Red = Bad.

One of their drives turned from green to red and the genius who was told to pull and replace the (Red) bad drive takes a (Green) good one out instead. Well this didn't quite manage to bring down the raid set completely - opting for the somewhat readable (Red) vs unavaliable (Green) for several minutes.. after realizing the mistake and swapping the drives back any data blocks that were written during this time became jyberish as disk synchronization was lost) ... 24-straight hours later writing meta programs to recover readable data and reconstruct a medium sized schema they were back up and running.

Morals of this story include...Never use RAID5, always maintain backups, careful who you hire.

I made a major mistake on a customers production system once -- luckily while wondering why the command was taking so long to execute realized what I had done and canceled it before the world came to an end.

Moral of this story include ... always start a new transaction before changing ANYTHING, test the results are what you expect and then and only then commit the transaction.

As a general observation many classes of rm -rf / type errors can be prevented by properly defining foreign key constraints on your schema and staying far away from any command labled 'CASCADE'

ˇ宁静的妩媚 2024-07-11 05:22:22

截断表 T_DAT_STORE

T_DAT_STORE是我所在部门的事实表。我认为我已连接到开发数据库。 幸运的是,我们有每日备份,直到那天才使用,六个小时后数据就恢复了。

从那时起,我在截断之前修改所有内容,并定期要求对小表进行备份恢复,只是为了检查备份是否正常(备份不是由我的部门完成的)

Truncate table T_DAT_STORE

T_DAT_STORE was the fact table of the department I work in. I think I was connected to the development database. Fortunately, we have a daily backup, which hasn't been used until that day, and the data was restored in six hours.

Since then I revise everything before a truncate, and periodically I ask for a backup restoration of minor tables only to check the backup is doing well (Backup isn't done by my department)

孤独患者 2024-07-11 05:22:21

初级 DBA 的本意是:

delete from [table] where [condition]

相反,他们输入:

delete [table] where [condition]

Which is valid T-Sql 但基本上完全忽略了 where [condition] 位(至少当时在 MSSQL 2000/97 上是这样做的 - 我忘了是哪个)并擦除整个表。

蛮好玩的 :-/

A junior DBA meant to do:

delete from [table] where [condition]

Instead they typed:

delete [table] where [condition]

Which is valid T-Sql but basically ignores the where [condition] bit completely (at least it did back then on MSSQL 2000/97 - I forget which) and wipes the entire table.

That was fun :-/

满地尘埃落定 2024-07-11 05:22:21

大约 7 年前,我工作到很晚,正在为客户的数据库生成更改脚本。 我只更改了存储过程,但是当我生成 SQL 时,我检查了“脚本相关对象”。 我在本地计算机上运行它,一切似乎都运行良好。 我在客户端的服务器上运行它并且脚本成功了。

然后我加载了该网站,但该网站是空的。 令我恐惧的是,“脚本相关对象”设置为我的存储过程触及的每个表执行了一个DROP TABLE

我立即打电话给首席开发人员和老板,让他们知道发生了什么事,并询问数据库的最新备份位于哪里。 另外 2 名开发人员参加了会议,我们得出的结论是,甚至没有备份系统,也无法恢复数据。 客户丢失了整个网站的内容,而我就是根本原因。 结果是我们的客户获得了$5000的信用。

对我来说这是一个很好的教训,现在我对运行任何更改脚本并首先备份数据库都非常谨慎。 今天我仍然在同一家公司工作,每当有关备份或数据库脚本的笑话出现时,总会有人提起著名的“DROP TABLE”事件。

About 7 years ago, I was generating a change script for a client's DB after working late. I had only changed stored procedures but when I generated the SQL I had "script dependent objects" checked. I ran it on my local machine and all appeared to work well. I ran it on the client's server and the script succeeded.

Then I loaded the web site and the site was empty. To my horror, the "script dependent objects" setting did a DROP TABLE for every table that my stored procedures touched.

I immediately called the lead dev and boss letting them know what happened and asking where the latest backup of the DB could be located. 2 other devs were conferenced in and the conclusion we came to was that no backup system was even in place and no data could be restored. The client lost their entire website's content and I was the root cause. The result was a $5000 credit given to our client.

For me it was a great lesson, and now I am super-cautious about running any change scripts, and backing up DBs first. I'm still with the same company today, and whenever the jokes come up about backups or database scripts someone always brings up the famous "DROP TABLE" incident.

a√萤火虫的光℡ 2024-07-11 05:22:20

我在一家小型电子商务公司工作,有 2 名开发人员和一名 DBA,我是开发人员之一。 我通常不习惯动态更新生产数据,如果我们更改了存储过程,我们会将它们通过源代码控制并进行正式的部署例程设置。

无论如何,一位用户来找我,需要对我们的联系人数据库进行更新,批量更新一堆设施。 所以我在我们的测试环境中写出了查询,类似

update facilities set address1 = '123 Fake Street'
    where facilityid in (1, 2, 3)

这样的东西。 在测试中运行它,更新了 3 行。 将其复制到剪贴板,将其粘贴到我们的生产 SQL 框上的终端服务中,运行它,惊恐地看着它花了 5 秒钟执行并更新了 100000 行。 不知何故,我复制了第一行而不是第二行,并且在我 CTRL + VCTRL + E 时没有注意'd。

我的 DBA 是一位年长的希腊绅士,他可能是我见过的最脾气暴躁的人,但他并不感到兴奋。 幸运的是,我们有一个备份,并且它没有破坏任何页面,幸运的是该字段仅用于显示目的(以及计费/运输)。

吸取的教训是要注意您复制和粘贴的内容,可能还有其他一些内容。

I work for a small e-commerce company, there's 2 developers and a DBA, me being one of the developers. I'm normally not in the habit of updating production data on the fly, if we have stored procedures we've changed we put them through source control and have an officially deployment routine setup.

Well anyways a user came to me needing an update done to our contact database, batch updating a bunch of facilities. So I wrote out the query in our test environment, something like

update facilities set address1 = '123 Fake Street'
    where facilityid in (1, 2, 3)

Something like that. Ran it in test, 3 rows updated. Copied it to clipboard, pasted it in terminal services on our production sql box, ran it, watched in horror as it took 5 seconds to execute and updated 100000 rows. Somehow I copied the first line and not the second, and wasn't paying attention as I CTRL + V, CTRL + E'd.

My DBA, an older Greek gentleman, probably the grumpiest person I've met was not thrilled. Luckily we had a backup, and it didn't break any pages, luckily that field is only really for display purposes (and billing/shipping).

Lesson learned was pay attention to what you're copying and pasting, probably some others too.

飘过的浮云 2024-07-11 05:22:19

我认为我最严重的错误是

truncate table Customers
truncate table Transactions

我没有看到我登录的 MSSQL 服务器,我想清除我的本地副本...熟悉的“OH s**t”,当它需要比大约半秒更长的时间来删除时,我的老板注意到我脸色明显变白,并问我刚刚做了什么。 大约半分钟后,我们的网站监视器发疯了,开始给我们发电子邮件说网站已关闭。

学过的知识? 永远不要保持与活动数据库的连接打开时间超过绝对需要的时间。

直到凌晨 4 点才从备份中恢复数据! 老板心疼我,还请我吃饭……

I think my worst mistake was

truncate table Customers
truncate table Transactions

I didnt see what MSSQL server I was logged into, I wanted to clear my local copy out...The familiar "OH s**t" when it was taking significantly longer than about half a second to delete, my boss noticed I went visibily white, and asked what I just did. About half a mintue later, our site monitor went nuts and started emailing us saying the site was down.

Lesson learned? Never keep a connection open to live DB longer than absolutly needed.

Was only up till 4am restoring the data from the backups too! My boss felt sorry for me, and bought me dinner...

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文