How can I simulate a failed disk during testing?

Posted 2024-08-03 23:23:48

In a Linux VM (VMware Workstation or similar), how can I simulate a failure on a previously working disk?

I have a situation in production where a disk fails (probably a controller, cable or firmware problem). Obviously this is neither predictable nor reproducible, so I want to test my monitoring to ensure that it alerts correctly.

Ideally, I'd like to be able to simulate a situation where writes fail but reads succeed, as well as a complete failure, i.e. the SCSI interface reports errors back to the kernel.

Comments (7)

那请放手 2024-08-10 23:23:49

There are several layers at which a disk error can be simulated. If you are testing a single user-space program, probably the simplest approach is to interpose the appropriate calls (e.g. write()) and have them sometimes return an error. The libfiu fault-injection library can do this using its fiu-run tool.
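
As a rough sketch of the libfiu route: the helper below builds the control string that fiu-run takes. It assumes libfiu is installed and that its failure point for write() is named posix/io/rw/write, which you should check against your libfiu version. The function name fiu_write_fault and the program ./my_monitor_test are made up for illustration.

```shell
#!/bin/sh
# Build a libfiu control command that makes write() fail at random.
# $1 = failure probability between 0 and 1 (e.g. 0.05 for 5%).
# Assumption: the failure point name posix/io/rw/write matches your libfiu.
fiu_write_fault() {
    printf 'enable_random name=posix/io/rw/write,probability=%s\n' "$1"
}

# Show the full invocation instead of running it, so this is safe to paste.
# -x loads the POSIX preload library, -c passes the control command.
echo "fiu-run -x -c \"$(fiu_write_fault 0.05)\" ./my_monitor_test"
```

Only the process started by fiu-run sees the injected errors, so this suits testing one program rather than system-wide monitoring.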

Another approach is to use a kernel driver that passes data through to and from another device but injects faults along the way. You can then mount the device and use it from any application as if it were a faulty disk. The fsdisk driver is an example of this.

There is also a fault injection infrastructure that has been merged into the Linux kernel, although you will probably need to reconfigure your kernel to enable it. It is documented in Documentation/fault-injection/fault-injection.txt. This is useful for testing kernel code.

It is also possible to use SystemTap to inject faults at the kernel level. See "The SCSI fault injection test" and "Kernel Fault injection using SystemTap".

居里长安 2024-08-10 23:23:49

To add to mark4o's answer, you can also use Linux's Device Mapper to generate failing devices.

Device Mapper's delay device can be used to send read and write I/O of the same block to different underlying devices (it can also delay that I/O as its name suggests). Device Mapper's error device can be used to generate permanent errors when a particular block is accessed. By combining the two you can create a device where writes always fail but reads always succeed for a given area.

The above is a more complicated version of what is described in the question "Simulate a faulty block device with read errors?" (see https://stackoverflow.com/a/1871029 for a simple Device Mapper example).

There is also a list of Linux disk fault injection mechanisms in the Unix & Linux question "Special File that causes I/O error".
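
The delay-plus-error combination could be sketched roughly as follows. This assumes the dm-delay target is available in your kernel; the names errdev and wfail, the /dev/sdX placeholder, and the helper functions are all invented for this example. By default the script only prints the dmsetup commands.

```shell
#!/bin/sh
# Sketch: Device Mapper tables for a device whose reads succeed and
# whose writes always fail. All names here are placeholders.

# Table for an "error" target covering $1 sectors (512 bytes each).
dm_error_table() {
    printf '0 %s error\n' "$1"
}

# Table for a "delay" target that sends reads to $2 and writes to $3,
# both at sector offset 0 with a 0 ms delay. $1 = length in sectors.
dm_split_table() {
    printf '0 %s delay %s 0 0 %s 0 0\n' "$1" "$2" "$3"
}

# Print commands by default; set RUN=1 (as root) to execute them.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}

DISK=${DISK:-/dev/sdX}       # placeholder backing disk
SECTORS=${SECTORS:-2048}     # in practice: blockdev --getsz "$DISK"

run dmsetup create errdev --table "$(dm_error_table "$SECTORS")"
run dmsetup create wfail --table \
    "$(dm_split_table "$SECTORS" "$DISK" /dev/mapper/errdev)"
```

With the devices created, reads from /dev/mapper/wfail come from the real disk while writes are routed to the error target and fail. Remove both with dmsetup remove when finished.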

尤怨 2024-08-10 23:23:49

A simple way to make a SCSI disk disappear with a 2.6 kernel is:

echo 1 > /sys/bus/scsi/devices/H:B:T:L/delete

(H:B:T:L is host, bus, target, LUN). To simulate the read-only case you'll have to use the fault injection methods that mark4o mentioned, though.
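
For reference, a small sketch of building the sysfs paths involved, including the host scan file that can bring the disk back afterwards. The helper names and the example address 2:0:1:0 are made up; on a live system the address of sdb can be read with basename "$(readlink -f /sys/block/sdb/device)".

```shell
#!/bin/sh
# Path of the delete attribute for a SCSI device. $1 = H:B:T:L address.
scsi_delete_path() {
    printf '/sys/bus/scsi/devices/%s/delete\n' "$1"
}

# Path of the scan attribute of the host that owns the device.
# The host number is the part of H:B:T:L before the first colon.
scsi_scan_path() {
    printf '/sys/class/scsi_host/host%s/scan\n' "${1%%:*}"
}

# Print the two commands for an example address rather than running them.
hbtl=2:0:1:0
echo "echo 1 > $(scsi_delete_path "$hbtl")"
echo "echo '- - -' > $(scsi_scan_path "$hbtl")"
```

Writing "- - -" to the scan file asks the host to rescan all channels, targets and LUNs, which should rediscover the deleted disk.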

辞取 2024-08-10 23:23:49

The Linux kernel provides a nice feature called "fault injection" (it requires a kernel built with CONFIG_FAIL_MAKE_REQUEST). To make I/O requests to a partition fail:

echo 1 > /sys/block/vdd/vdd2/make-it-fail

To set up some of the options:

mkdir /debug
mount debugfs /debug -t debugfs
cd /debug/fail_make_request
echo 10 > interval # fail only every 10th eligible request
echo 100 > probability # 100% probability
echo -1 > times # how many times: -1 means no limit

https://lxadm.com/Using_fault_injection

你的呼吸 2024-08-10 23:23:49

You can use the scsi_debug kernel module to simulate a RAM-backed SCSI disk; it can inject a range of SCSI errors through its opts and every_nth options.

See http://sg.danny.cz/sg/sdebug26.html for details.

Example of a medium error on sector 4656:

[fge@Gris-Laptop ~]$ sudo modprobe scsi_debug opts=2 every_nth=1
[fge@Gris-Laptop ~]$ sudo dd if=/dev/sdb of=/dev/null
dd: error reading ‘/dev/sdb’: Input/output error
4656+0 records in
4656+0 records out
2383872 bytes (2.4 MB) copied, 0.021299 s, 112 MB/s
[fge@Gris-Laptop ~]$ dmesg|tail
[11201.454332] blk_update_request: critical medium error, dev sdb, sector 4656
[11201.456292] sd 5:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11201.456299] sd 5:0:0:0: [sdb] Sense Key : Medium Error [current] 
[11201.456303] sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error
[11201.456308] sd 5:0:0:0: [sdb] CDB: Read(10) 28 00 00 00 12 30 00 00 08 00
[11201.456312] blk_update_request: critical medium error, dev sdb, sector 4656

You can change the opts and every_nth options at runtime via sysfs:

echo 2 | sudo tee /sys/bus/pseudo/drivers/scsi_debug/opts
echo 1 | sudo tee /sys/bus/pseudo/drivers/scsi_debug/every_nth

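
Since opts is a bitmask, several error types can be combined. The sketch below adds up flag values; the flag names are invented for this example, and the numeric values are the ones I understand the scsi_debug documentation to list, so verify them against the page linked above.

```shell
#!/bin/sh
# Combine scsi_debug "opts" flags into one number.
# Flag names are hypothetical; values should be checked against the docs.
scsi_debug_opts() {
    total=0
    for flag in "$@"; do
        case "$flag" in
            noise)           bit=1 ;;
            medium_error)    bit=2 ;;
            timeout)         bit=4 ;;
            recovered_error) bit=8 ;;
            transport_error) bit=16 ;;
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
        total=$((total | bit))
    done
    echo "$total"
}

# Medium errors plus timeouts; the result is the value to write to opts.
scsi_debug_opts medium_error timeout
```

The printed number can then be passed as opts= to modprobe or written to the sysfs opts file shown above.
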
星軌x 2024-08-10 23:23:49

You can also use methods provided by the disks themselves to test media errors. SCSI has a WRITE LONG command that can corrupt a block by writing data with invalid ECC. SATA and NVMe have similar commands.

For the most common case (SATA), you can use hdparm with --make-bad-sector to issue that command; for SCSI you can use sg_write_long, and for NVMe you can use nvme-cli with the write-uncor command.

The big advantage of these commands over other injection methods is that the result behaves just like a real failing drive, with the full latency impact and with recovery by reallocation when the sector is written again. The drive's error counters also go up.

The disadvantage is that if you do this too often on the same drive, its error counters will rise, and SMART may flag the disk as bad or you may exhaust its reallocation table. So do use it for manual testing, but if you run it in automated tests, don't do it too often.
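
A hedged sketch of what the three invocations might look like. The helper bad_sector_cmd, the device names and the LBA are placeholders, and the option spellings are as I recall them from the hdparm, sg3_utils and nvme-cli man pages, so confirm them locally. The script only prints the commands because they destroy the data in the target sector.

```shell
#!/bin/sh
# Print (do not run) the command that marks one sector unreadable.
# $1 = bus type (sata|scsi|nvme), $2 = device, $3 = LBA
bad_sector_cmd() {
    case "$1" in
        sata) printf 'hdparm --yes-i-know-what-i-am-doing --make-bad-sector %s %s\n' "$3" "$2" ;;
        scsi) printf 'sg_write_long --wr_uncor --lba=%s %s\n' "$3" "$2" ;;
        nvme) printf 'nvme write-uncor %s --start-block=%s --block-count=0\n' "$2" "$3" ;;
        *)    printf 'unknown bus type: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Placeholder devices and LBA; never point these at a disk holding real data.
bad_sector_cmd sata /dev/sdX 12345
bad_sector_cmd scsi /dev/sdY 12345
bad_sector_cmd nvme /dev/nvme0n1 12345
```

Rewriting the affected sector normally clears the error again, which is the recovery path the answer describes.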

温折酒 2024-08-10 23:23:49

You can also use a low-level SCSI utility (sg3-utils) to stop the drive. It will still respond to Inquiry, so its state will still be "running" but reads and writes will fail until it is started again. I've tested RAID drive removal and recovery using mdadm this way.

sg_start --stop /dev/sdb
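
To bring the drive back, sg_start also accepts --start. A minimal sketch of a full stop, check, restart cycle follows; the helper name spin_cycle_cmds is made up, and it prints the commands rather than running them. Checking /proc/mdstat is only one way to watch an md array react, and the array may not notice the failure until it next issues I/O to the disk.

```shell
#!/bin/sh
# Print the stop/check/start sequence for one member disk of an md array.
# $1 = disk to spin down (defaults to /dev/sdb, as in the answer).
spin_cycle_cmds() {
    dev=${1:-/dev/sdb}
    printf 'sg_start --stop %s\n' "$dev"    # reads and writes now fail
    printf 'cat /proc/mdstat\n'             # see whether md marked it faulty
    printf 'sg_start --start %s\n' "$dev"   # spin the drive up again
}

spin_cycle_cmds /dev/sdb
```
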