Lustre 1.4.6 on RedHat AS4 U2 problem [solved]
Problem solved: it was probably a Lustre version issue; with lustre 1.4.6.2 I no longer see the problem.
I am testing Lustre on VMware GSX. The system is RedHat AS4 U2 with kernel 2.6.9-22.ELsmp, and the Lustre packages are as follows:
kernel-smp-2.6.9-22.0.2.EL_lustre.1.4.6
lustre-1.4.6-2.6.9_22.0.2.EL_lustre.1.4.6smp
lustre-debuginfo-1.4.6-2.6.9_22.0.2.EL_lustre.1.4.6smp
lustre-modules-1.4.6-2.6.9_22.0.2.EL_lustre.1.4.6smp
The layout is 1 OST, 1 MDS, and 1 client. Every machine has identical /etc/hosts contents, as follows:
127.0.0.1 localhost.localdomain localhost
192.168.0.162 n01
192.168.0.164 n03
192.168.0.165 n04
Note: n01 (OST), n03 (MDS), n04 (client). On the OST and MDS machines I added a second disk (sdb1), 1 GB, as the Lustre partition; it shows as follows:
[root@n01 ~]# fdisk -l

Disk /dev/sda: 4294 MB, 4294967296 bytes
255 heads, 63 sectors/track, 522 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          16      128488+  83  Linux
/dev/sda2              17         143     1020127+  82  Linux swap
/dev/sda3             144         522     3044317+  83  Linux

Disk /dev/sdb: 4294 MB, 4294967296 bytes
255 heads, 63 sectors/track, 522 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         123      987966   83  Linux
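One thing worth checking with a hosts file like the one above: LNET derives each node's NID from its hostname, so a hostname that resolves to 127.0.0.1 in /etc/hosts is a classic source of connection trouble. A minimal sketch of such a check, assuming nothing beyond /etc/hosts itself (the helper name and messages are invented for illustration, not part of Lustre):

```shell
#!/bin/sh
# check_nid_hosts FILE NAME: fail if NAME sits on a loopback line in FILE,
# otherwise print the resolved address. (Helper is invented for this sketch.)
check_nid_hosts() {
    file=$1; name=$2
    # take the first hosts line that mentions the name as a whole word
    line=$(grep -w "$name" "$file" | head -n 1)
    case $line in
        "")    echo "BAD: $name not found in $file"; return 1 ;;
        127.*) echo "BAD: $name resolves to loopback"; return 1 ;;
        *)     echo "OK: $name -> ${line%% *}"; return 0 ;;
    esac
}

check_nid_hosts /etc/hosts "$(hostname)" || true   # run on each node
```

Running this on n01, n03, and n04 with the hosts file above should print the 192.168.0.x address for each node, not 127.0.0.1.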
Every machine's /etc/modprobe.conf is as follows:
[root@n01 ~]# cat /etc/modprobe.conf
alias eth0 pcnet32
alias scsi_hostadapter mptbase
alias scsi_hostadapter1 mptscsih
install kptlrouter modprobe portals ; modprobe --ignore-install kptlrouter
#install ptlrpc modprobe ksocknal ; modprobe --ignore-install ptlrpc
install llite modprobe lov osc ; modprobe --ignore-install llite
alias lustre llite
options lnet networks=tcp0
The Lustre config script is as follows:
[root@n01 ~]# cat newconfig.sh
#!/bin/sh
#config.sh
#Create nodes
rm -f newconfig.xml
lmc -m newconfig.xml --add net --node n03 --nid n03 --nettype lnet
lmc -m newconfig.xml --add net --node n01 --nid n01 --nettype lnet
lmc -m newconfig.xml --add net --node generic-client --nid '*' --nettype lnet
#Configure mds
lmc -m newconfig.xml --add mds --node n03 --mds n03-mds1 --fstype ldiskfs --dev /dev/sdb1 --journal_size 400
#Configure ost
lmc -m newconfig.xml --add lov --lov lov1 --mds n03-mds1 --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m newconfig.xml --add ost --node n01 --lov lov1 --ost n01-ost1 --fstype ldiskfs --dev /dev/sdb1
#Configure client
lmc -m newconfig.xml --add mtpt --node generic-client --path /mnt/lustre --mds n03-mds1 --lov lov1
I generated newconfig.xml with sh newconfig.sh and distributed it to n03 and n04. On the OST, running lconf --reformat --node n01 newconfig.xml started the OST successfully with no errors.
On the MDS, running lconf --reformat --node n03 newconfig.xml produced a kernel: <0>Fatal exception: panic in 5 seconds error and the machine hung; the OST hung at the same time. After rebooting, the log shows the following:
- Apr 14 13:30:14 n03 sshd(pam_unix)[2681]: session opened for user root by (uid=0)
- Apr 14 13:30:50 n03 kernel: Lustre: 2701:0:(module.c:381:init_libcfs_module()) maximum lustre stack 8192
- Apr 14 13:30:52 n03 kernel: Lustre: OBD class driver Build Version: 1.4.6-19691231190000-PRISTINE-.tmp.lbuild.lbuild-v1_4_6_RC3-2.6-rhel4-i686.lbuild.BUILD.lustre-kernel-2.6.9.lustre.linux-2.6.9-22.0.2.EL_lustre.1.4.6smp, [email]info@clusterfs.com[/email]
- Apr 14 13:30:53 n03 kernel: Lustre: Added LNI 192.168.0.164@tcp [8/256]
- Apr 14 13:30:54 n03 kernel: Lustre: Accept secure, port 988
- Apr 14 13:31:02 n03 kernel: kjournald starting. Commit interval 5 seconds
- Apr 14 13:31:02 n03 kernel: LDISKFS FS on sdb1, internal journal
- Apr 14 13:31:02 n03 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
- Apr 14 13:31:03 n03 kernel: Lustre: 2762:0:(mds_fs.c:239:mds_init_server_data()) n03-mds1: initializing new last_rcvd
- Apr 14 13:31:03 n03 kernel: Lustre: MDT n03-mds1 now serving /dev/sdb1 (c4d604e8-2506-42cc-97c3-c0c436f1440e) with recovery enabled
- Apr 14 13:31:16 n03 kernel: Lustre: MDT n03-mds1 has stopped.
- Apr 14 13:31:23 n03 kernel: loop: loaded (max 8 devices)
- Apr 14 13:31:23 n03 hald[2237]: Timed out waiting for hotplug event 269. Rebasing to 273
- Apr 14 13:31:29 n03 kernel: kjournald starting. Commit interval 5 seconds
- Apr 14 13:31:29 n03 kernel: LDISKFS FS on sdb1, internal journal
- Apr 14 13:31:29 n03 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
- Apr 14 13:31:30 n03 kernel: eip: c8a1e7aa
- Apr 14 13:31:30 n03 kernel: ------------[ cut here ]------------
- Apr 14 13:31:30 n03 kernel: kernel BUG at include/asm/spinlock.h:146!
- Apr 14 13:31:30 n03 kernel: invalid operand: 0000 [#1]
- Apr 14 13:31:30 n03 kernel: SMP
- Apr 14 13:31:30 n03 kernel: Modules linked in: loop(U) fsfilt_ldiskfs(U) ldiskfs(U) mds(U) lov(U) osc(U) mdc(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) md5(U) ipv6(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) iptable_filter(U) ip_tables(U) dm_mirror(U) dm_mod(U) button(U) battery(U) ac(U) shpchp(U) pcnet32(U) mii(U) floppy(U) ext3(U) jbd(U) mptscsih(U) mptbase(U) sd_mod(U) scsi_mod(U)
- Apr 14 13:31:30 n03 kernel: CPU: 0
- Apr 14 13:31:30 n03 kernel: EIP: 0060:[<c02d1a41>] Not tainted VLI
- Apr 14 13:31:30 n03 kernel: EFLAGS: 00010016 (2.6.9-22.0.2.EL_lustre.1.4.6smp)
- Apr 14 13:31:30 n03 kernel: EIP is at _spin_lock_irqsave+0x20/0x45
- Apr 14 13:31:30 n03 kernel: eax: c8a1e7aa ebx: 00000002 ecx: c02e8bea edx: c02e8bea
- Apr 14 13:31:30 n03 kernel: esi: c5c44960 edi: c0a800a4 ebp: c7a1a000 esp: c415be98
- Apr 14 13:31:30 n03 kernel: ds: 007b es: 007b ss: 0068
- Apr 14 13:31:30 n03 kernel: Process socknal_cd00 (pid: 2736, threadinfo=c415a000 task=c4d4ebb0)
- Apr 14 13:31:30 n03 kernel: Stack: c6d57f40 c5c44960 c8a1e7aa c7edb8a8 c7a1a000 c0a800a4 00000001 c8a17dab
- Apr 14 13:31:30 n03 kernel: 0000bc84 00000000 c0a800a4 c0a800a2 00000000 00000000 000000b1 00000246
- Apr 14 13:31:30 n03 kernel: c5422100 c6d57f00 c5bb1c80 c7edb880 4a9b692d 0004115d c415bef0 c415bef0
- Apr 14 13:31:30 n03 kernel: Call Trace:
- Apr 14 13:31:30 n03 kernel: [<c8a1e7aa>] ksocknal_queue_tx_locked+0x11e/0x1f7 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c8a17dab>] ksocknal_create_conn+0xd7c/0x1454 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c8dab11c>] lnet_connect+0x277/0x2c7 [lnet]
- Apr 14 13:31:30 n03 kernel: [<c8a232e3>] ksocknal_connect+0xd8/0x254 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c8a23638>] ksocknal_connd+0x1d9/0x326 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c011eb7c>] autoremove_wake_function+0x0/0x2d
- Apr 14 13:31:30 n03 kernel: [<c011eb7c>] autoremove_wake_function+0x0/0x2d
- Apr 14 13:31:30 n03 kernel: [<c02d2d5a>] ret_from_fork+0x6/0x14
- Apr 14 13:31:30 n03 kernel: [<c8a2345f>] ksocknal_connd+0x0/0x326 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c8a2345f>] ksocknal_connd+0x0/0x326 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
- Apr 14 13:31:30 n03 kernel: Code: 81 00 00 00 00 01 c3 f0 ff 00 c3 56 89 c6 53 9c 5b fa 81 78 04 ad 4e ad de 74 18 ff 74 24 08 68 ea 8b 2e c0 e8 0f f5 e4 ff 59 58 <0f> 0b 92 00 a4 7c 2e c0 f0 fe 0e 79 13 f7 c3 00 02 00 00 74 01
- Apr 14 13:31:30 n03 kernel: <0>Fatal exception: panic in 5 seconds
Could someone please take a look at what this problem is? Thanks in advance.
[ Last edited by suran007 on 2006-4-25 17:48 ]
Comments (8)
Bump.
Moderator, please help~
I tested 1.4.6 on SLES9 with no problems, and also tested 1.4.5 on AS4 U2 with no problems; I don't know why 1.4.6 fails here. The log shows "kernel BUG", so could the kernel itself be the problem?
Hi suran007,
My test machines were tied up with other work for a while, so I couldn't free them to test for you. I finished that yesterday, and this morning I set up an environment similar to yours to test; I did not run into any problems.
==========================================================================
Environment:
Server1: AMD64, host OS SLES9SP3 + errata x86-64 version, vmware server beta (latest build)
vmware guestOS 1: RHEL4U3 x86-64 version => mds
vmware guestOS 2: RHEL4U3 x86-64 version => client
The Lustre RPMs I installed here are the x86-64 versions.
Server2: Intel dual core EM64T, host OS SLES9SP3 + errata x86-64 version, vmware server beta (latest build)
vmware guestOS 1: RHEL4U3 x86 version => ost1
vmware guestOS 2: RHEL4U3 x86 version => ost2
The Lustre RPMs I installed here are the x86 versions.
============================================================================
Contents of the config.sh file:
#!/bin/sh
# config.sh
# Create nodes
rm -f config.xml
lmc -m config.xml --add net --node node-mds --nid n1 --nettype tcp
lmc -m config.xml --add net --node node-ost1 --nid n3 --nettype tcp
lmc -m config.xml --add net --node node-ost2 --nid n4 --nettype tcp
lmc -m config.xml --add net --node client --nid n2 --nettype tcp
# Configure MDS
lmc -m config.xml --add mds --node node-mds --mds mds-test --fstype ldiskfs --dev /tmp/mds-test --size 50000
# Configure OSTs
lmc -m config.xml --add lov --lov lov-test --mds mds-test --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m config.xml --add ost --node node-ost1 --lov lov-test --ost ost1-test --fstype ldiskfs --dev /tmp/ost1-test --size 100000
lmc -m config.xml --add ost --node node-ost2 --lov lov-test --ost ost2-test --fstype ldiskfs --dev /tmp/ost2-test --size 100000
# Configure client (this is a 'generic' client used for all client mounts)
lmc -m config.xml --add mtpt --node client --path /mnt/lustre --mds mds-test --lov lov-test
===============================================================================
Contents of /etc/hosts on all 4 nodes:
[root@n1 ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost
192.168.0.31 n1
192.168.0.32 n2
192.168.0.33 n3
192.168.0.34 n4
===============================================================================
After generating the xml file, scp it to the /root directory on each node.
===============================================================================
Startup
Log in to ost1 as root and run: lconf --reformat --node node-ost1 config.xml
Log in to ost2 as root and run: lconf --reformat --node node-ost2 config.xml
Log in to mds as root and run: lconf --reformat --node node-mds config.xml
Log in to client as root and run: lconf --node client config.xml
At this point, running df -hT as root on the client node shows that /mnt/lustre has been mounted.
===============================================================================
Screen output
Screen output when starting OST1:
[root@n3 ~]# lconf --reformat --node node-ost1 config.xml
loading module: libcfs srcdir None devdir libcfs
loading module: lnet srcdir None devdir lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: ost srcdir None devdir ost
loading module: ldiskfs srcdir None devdir ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
loading module: obdfilter srcdir None devdir obdfilter
NETWORK: NET_node-ost1_tcp NET_node-ost1_tcp_UUID tcp n3
OSD: ost1-test ost1-test_UUID obdfilter /tmp/ost1-test 100000 ldiskfs no 0 0
OST mount options: errors=remount-ro
[root@n3 ~]#
Screen output when starting OST2:
[root@n4 ~]# lconf --reformat --node node-ost2 config.xml
loading module: libcfs srcdir None devdir libcfs
loading module: lnet srcdir None devdir lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: ost srcdir None devdir ost
loading module: ldiskfs srcdir None devdir ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
loading module: obdfilter srcdir None devdir obdfilter
NETWORK: NET_node-ost2_tcp NET_node-ost2_tcp_UUID tcp n4
OSD: ost2-test ost2-test_UUID obdfilter /tmp/ost2-test 100000 ldiskfs no 0 0
OST mount options: errors=remount-ro
[root@n4 ~]#
Screen output when starting the MDS:
[root@n1 ~]# lconf --reformat --node node-mds config.xml
loading module: libcfs srcdir None devdir libcfs
loading module: lnet srcdir None devdir lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: mdc srcdir None devdir mdc
loading module: osc srcdir None devdir osc
loading module: lov srcdir None devdir lov
loading module: mds srcdir None devdir mds
loading module: ldiskfs srcdir None devdir ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
NETWORK: NET_node-mds_tcp NET_node-mds_tcp_UUID tcp n1
MDSDEV: mds-test mds-test_UUID /tmp/mds-test ldiskfs no
recording clients for filesystem: FS_fsname_UUID
Recording log mds-test on mds-test
LOV: lov_mds-test 4f3bf_lov_mds-test_f34d7ba738 mds-test_UUID 0 1048576 0 0 [u'ost1-test_UUID', u'ost2-test_UUID'] mds-test
OSC: OSC_n1_ost1-test_mds-test 4f3bf_lov_mds-test_f34d7ba738 ost1-test_UUID
OSC: OSC_n1_ost2-test_mds-test 4f3bf_lov_mds-test_f34d7ba738 ost2-test_UUID
End recording log mds-test on mds-test
MDSDEV: mds-test mds-test_UUID /tmp/mds-test ldiskfs 50000 no
MDS mount options: errors=remount-ro
[root@n1 ~]#
Screen output when starting the client:
[root@n2 ~]# lconf --node client config.xml
loading module: libcfs srcdir None devdir libcfs
loading module: lnet srcdir None devdir lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: osc srcdir None devdir osc
loading module: lov srcdir None devdir lov
loading module: mdc srcdir None devdir mdc
loading module: llite srcdir None devdir llite
NETWORK: NET_client_tcp NET_client_tcp_UUID tcp n2
LOV: lov-test e0002_lov-test_a77190f32b mds-test_UUID 0 1048576 0 0 [u'ost1-test_UUID', u'ost2-test_UUID'] mds-test
OSC: OSC_n2_ost1-test_MNT_client e0002_lov-test_a77190f32b ost1-test_UUID
OSC: OSC_n2_ost2-test_MNT_client e0002_lov-test_a77190f32b ost2-test_UUID
MDC: MDC_n2_mds-test_MNT_client 96c98_MNT_client_1567f4dc95 mds-test_UUID
MTPT: MNT_client MNT_client_UUID /mnt/lustre mds-test_UUID lov-test_UUID
Checking the filesystem mount on the client node:
[root@n2 ~]# df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
ext3 4.1G 3.0G 920M 77% /
/dev/sda1 ext3 99M 14M 80M 15% /boot
none tmpfs 187M 0 187M 0% /dev/shm
config lustre_lite 190M 8.5M 171M 5% /mnt/lustre
[root@n2 ~]#
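Once df shows the mount as above, a small write test confirms the client can actually reach the MDS and the OSTs, not just see the mount point. A rough sketch, with an invented helper name and test file; the commented lfs line is only meaningful on a real Lustre mount:

```shell
#!/bin/sh
# sanity_check MNT: write and remove a small file under MNT; on a real
# Lustre mount this exercises the full client -> MDS -> OST path.
sanity_check() {
    mnt=$1
    df -hT "$mnt" || return 1                                   # anything mounted there?
    dd if=/dev/zero of="$mnt/lustre-sanity" bs=1M count=4 2>/dev/null || return 1
    # lfs getstripe "$mnt/lustre-sanity"   # on Lustre: which OSTs hold the stripes
    rm -f "$mnt/lustre-sanity"
    echo "sanity check passed on $mnt"
}

sanity_check /mnt/lustre || true    # the mount point from the config above
```

If the write hangs rather than failing, that usually points at a network problem between the client and an OST rather than at the mount itself.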
Due to time constraints, I can't do more variations or tests for the moment.
Hope this helps.
Thanks for your help, moderator. When you have time, I'd be grateful if you could also test automatic failover of the OSS or MDS.
You should also CC your question to the lustre maillist.
You mean the panic problem above? I already sent it to the maillist; maybe my written English is too poor, as nobody has replied. :(
"loading module: libcfs srcdir None devdir libcfs
Bad module options? Check dmesg.
! modprobe (error 1):
> FATAL: Module libcfs not found."
Why do I keep getting this error?
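On the "Module libcfs not found" error in the last reply: modprobe only searches /lib/modules/$(uname -r), so this message usually means the machine booted a kernel other than the *_lustre one the lustre-modules RPM was built for. A hedged checklist sketch; the commands are standard, but the exact module-tree name is whatever your RPM installed:

```shell
#!/bin/sh
# "FATAL: Module libcfs not found" checklist: the running kernel must match
# the /lib/modules tree that the lustre-modules RPM installed into.
KVER=$(uname -r)
echo "running kernel: $KVER"
ls /lib/modules/ 2>/dev/null || true        # is a *_lustre.1.4.6smp tree listed?
find "/lib/modules/$KVER" -name 'libcfs*' 2>/dev/null || true   # present for THIS kernel?
# If the module file exists but modprobe still fails, rebuild the module
# dependency map as root and retry:
#   depmod -a "$KVER"
#   modprobe libcfs
```

If the *_lustre tree exists but uname -r shows the stock 2.6.9-22.ELsmp kernel, reboot into the Lustre kernel (or fix the default entry in the bootloader) before retrying lconf.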