Generally, Lustre should start on its own via heartbeat. Heartbeat is chkconfig'd on across all the Lustre nodes, so when the nodes reboot they negotiate between themselves and start all the services.
It is advised to start Lustre using only
service heartbeat start
and then just wait. The act of mounting the partitions loads the Lustre-related modules; nothing further is needed.
ALWAYS REMEMBER - be patient! It takes about 3 times longer than you want it to!
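If you want to confirm that heartbeat really did bring everything up, a quick loop like this (a minimal sketch; the node list is an assumption based on the layout below) can be run from a management node:

# Sketch: check heartbeat and the Lustre mounts on every Lustre server
for node in mds1 mds2 oss11 oss12 oss21 oss22 oss31 oss32 oss41 oss42; do
  echo "=== $node ==="
  ssh $node 'service heartbeat status; mount -t lustre'
done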
* LustreDiskChangeProcedures
* LustreFreezeUpProcedures

OST Layout
|      | oss11   | oss21   | oss31   | oss41   |
| md11 | OST0000 | OST0002 | OST0004 | OST0006 |
| md13 | OST0008 | OST000a | OST000c | OST000e |
| md15 | OST0010 | OST0012 | OST0014 | OST0016 |
| md17 | OST0018 | OST001a | OST001c | OST001e |

|      | oss12   | oss22   | oss32   | oss42   |
| md10 | OST0001 | OST0003 | OST0005 | OST0007 |
| md12 | OST0009 | OST000b | OST000d | OST000f |
| md14 | OST0011 | OST0013 | OST0015 | OST0017 |
| md16 | OST0019 | OST001b | OST001d | OST001f |
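If you need to double-check which OSTs a given OSS is actually serving, a quick look on the OSS itself (a sketch, using oss11 and /dev/md11 as the example) should tell you:

# List the Lustre targets currently mounted on this OSS
mount -t lustre
# Read the target label straight off a raid device without mounting it
tunefs.lustre --dryrun /dev/md11 | grep Target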
--
JasonTemple - 2010-04-13
Lustre FSCK procedure:
3 Steps:
1. e2fsck on the mds
- unmount lustre everywhere, including on the lustre servers:
- phoenix lustre "service heartbeat stop"
- start up the raids on the mds:
- mdadm --assemble -c /etc/mdadm.conf.local /dev/md10
- next, mount gpfs so you have a workspace
- phoenix lustre "/usr/lpp/mmfs/bin/mmstartup;sleep 2;/usr/lpp/mmfs/bin/mmmount scratch"
- now, run the e2fsck:
- e2fsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb /dev/md10
- this will output a metadata database file (mdsdb) which you use in the next steps; a consolidated sketch of this step follows below
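For reference, the mds-side part of step 1 can be sketched as a single script (a sketch only; it assumes heartbeat has already been stopped everywhere and gpfs is mounted as above):

#!/bin/bash
# Sketch of step 1, run on the active mds.
mkdir -p /gpfs/lustre_fsck
# assemble the MDT raid
mdadm --assemble -c /etc/mdadm.conf.local /dev/md10
# read-only check of the MDT; writes the mdsdb used by steps 2 and 3
e2fsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb /dev/md10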
2. e2fsck on the oss machines
- first, start up the raids using this script (/gpfs/jason/start_raid.sh):
#!/bin/bash
# Assemble the md raids for the OSTs served by this OSS (without starting Lustre).
OSSNAME=`uname -n | cut -d. -f1`
# ossX1 nodes serve the odd-numbered raids, ossX2 nodes the even-numbered ones
case ${OSSNAME} in
oss?1) OSTs="1 3 5 7" ;;
oss?2) OSTs="0 2 4 6" ;;
*) echo "Wrong node, exiting!"; exit 1 ;;
esac
for i in $OSTs; do
    # assemble the bitmap device (md3x) and the raid devices (md2x, md1x) for each OST
    mdadm -A -c /etc/mdadm.conf.local /dev/md3$i 2>&1 | /usr/bin/logger -t "initlustre"
    mount /lustre/scratch/bmp1$i && /usr/bin/logger -t "initlustre" "mounted bitmap device /dev/md3$i"
    mdadm -A -c /etc/mdadm.conf.local /dev/md2$i 2>&1 | /usr/bin/logger -t "initlustre"
    mdadm -A -c /etc/mdadm.conf.local /dev/md1$i 2>&1 | /usr/bin/logger -t "initlustre"
done
- And you run it like this:
- dsh -w oss[11-12,21-22,31-32,41-42] /gpfs/jason/start_raid.sh
- Then, you run the e2fsck on the servers, like this:
- dsh -w oss[11-12,21-22,31-32,41-42] sh /gpfs/jason/lustre_e2fsck.sh
- using this script:
#!/bin/bash
# Run a read-only e2fsck of every OST raid on this OSS, writing one ostdb per raid.
OSSNAME=`uname -n | cut -d. -f1`
case ${OSSNAME} in
oss?1) OSTs="1 3 5 7" ;;
oss?2) OSTs="0 2 4 6" ;;
*) echo "Wrong node, exiting!"; exit 1 ;;
esac
typeset LOG="/gpfs/lustre_fsck/$OSSNAME.out"
# announce the logfile if we are on a terminal, then send all output to it
[ -t 1 ] && echo "Writing to logfile '$LOG'."
exec > $LOG 2>&1
exec < /dev/null
for i in $OSTs; do
    e2fsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb --ostdb /gpfs/lustre_fsck/${OSSNAME}.ostdb.${i} /dev/md1${i}
done
which generates a logfile for each oss, as well as an ostdb file for each raid (a quick completeness check is sketched after the listing below):
Mar 16 14:30 [root@oss11:~]# ls /gpfs/lustre_fsck/
mdsdb oss11.ostdb.3 oss11.out oss12.ostdb.4 oss21.ostdb.1 oss21.ostdb.7 oss22.ostdb.2 oss22.out oss31.ostdb.5 oss32.ostdb.0 oss32.ostdb.6 oss41.ostdb.3 oss41.out oss42.ostdb.4
mdsdb.mdshdr oss11.ostdb.5 oss12.ostdb.0 oss12.ostdb.6 oss21.ostdb.3 oss21.out oss22.ostdb.4 oss31.ostdb.1 oss31.ostdb.7 oss32.ostdb.2 oss32.out oss41.ostdb.5 oss42.ostdb.0 oss42.ostdb.6
oss11.ostdb.1 oss11.ostdb.7 oss12.ostdb.2 oss12.out oss21.ostdb.5 oss22.ostdb.0 oss22.ostdb.6 oss31.ostdb.3 oss31.out oss32.ostdb.4 oss41.ostdb.1 oss41.ostdb.7 oss42.ostdb.2 oss42.out
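Before moving on to step 3, it is worth checking that every ostdb actually got written (a quick sketch; the expected count of 32 matches the OST layout above):

# One ostdb per OST (32 here) plus the mdsdb must be present
ls /gpfs/lustre_fsck/*.ostdb.* | wc -l   # expect 32
ls -l /gpfs/lustre_fsck/mdsdb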
3. run lfsck from a client:
- stop the raids on all the servers
- ssh mds1 mdadm --stop /dev/md10
- dsh -w oss[11-12,21-22,31-32,41-42] /gpfs/jason/stop_raid.sh
- start lustre on all the servers
- dsh -w oss[11-12,21-22,31-32,41-42] service heartbeat start
- dsh -w mds[1,2] service heartbeat start
- Make sure gpfs and lustre are running on a client and that e2fsprogs is installed, then start the lfsck (I run it from a script, /gpfs/jason/lfsck.sh, which logs its output with the same typeset/exec trick as above; a sketch appears after the command below):
- lfsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb --ostdb /gpfs/lustre_fsck/oss11.ostdb.1 /gpfs/lustre_fsck/oss11.ostdb.3 /gpfs/lustre_fsck/oss11.ostdb.5 /gpfs/lustre_fsck/oss11.ostdb.7 /gpfs/lustre_fsck/oss12.ostdb.0 /gpfs/lustre_fsck/oss12.ostdb.2 /gpfs/lustre_fsck/oss12.ostdb.4 /gpfs/lustre_fsck/oss12.ostdb.6 /gpfs/lustre_fsck/oss21.ostdb.1 /gpfs/lustre_fsck/oss21.ostdb.3 /gpfs/lustre_fsck/oss21.ostdb.5 /gpfs/lustre_fsck/oss21.ostdb.7 /gpfs/lustre_fsck/oss22.ostdb.0 /gpfs/lustre_fsck/oss22.ostdb.2 /gpfs/lustre_fsck/oss22.ostdb.4 /gpfs/lustre_fsck/oss22.ostdb.6 /gpfs/lustre_fsck/oss31.ostdb.1 /gpfs/lustre_fsck/oss31.ostdb.3 /gpfs/lustre_fsck/oss31.ostdb.5 /gpfs/lustre_fsck/oss31.ostdb.7 /gpfs/lustre_fsck/oss32.ostdb.0 /gpfs/lustre_fsck/oss32.ostdb.2 /gpfs/lustre_fsck/oss32.ostdb.4 /gpfs/lustre_fsck/oss32.ostdb.6 /gpfs/lustre_fsck/oss41.ostdb.1 /gpfs/lustre_fsck/oss41.ostdb.3 /gpfs/lustre_fsck/oss41.ostdb.5 /gpfs/lustre_fsck/oss41.ostdb.7 /gpfs/lustre_fsck/oss42.ostdb.0 /gpfs/lustre_fsck/oss42.ostdb.2 /gpfs/lustre_fsck/oss42.ostdb.4 /gpfs/lustre_fsck/oss42.ostdb.6 /lustre/scratch/
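A sketch of what /gpfs/jason/lfsck.sh can look like, building the --ostdb list with a glob instead of typing all 32 paths (the log path is an assumption):

#!/bin/bash
# Sketch of a wrapper for the lfsck command above
typeset LOG="/gpfs/lustre_fsck/lfsck.out"
exec > $LOG 2>&1
# the glob expands to the 32 ostdb files; the last argument is the client mountpoint
lfsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb \
  --ostdb /gpfs/lustre_fsck/oss*.ostdb.* \
  /lustre/scratch/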
Here is the documentation for the Lustre filesystem: CSCS_Lustre_Runbook_v0.8.pdf
Common Problems / Troubleshooting
1. Worker node goes nuts
- System load increases dramatically (>100) and stays there
- System becomes unresponsive
You can check the logs on xen02:/var/log/phoenix/<hostname>/messages, and you should see something like:
May 29 04:02:14 wn161 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_sa_31265:17598]
May 29 04:02:14 wn161 kernel: CPU 1:
May 29 04:02:14 wn161 kernel: Modules linked in: loop mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) lockd sunrpc ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) bonding ipv6 xfrm_nalgo crypto_api ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables x_tables dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg i2c_i801 i2c_core shpchp e1000e pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
May 29 04:02:14 wn161 kernel: Pid: 17598, comm: ll_sa_31265 Tainted: G 2.6.18-194.32.1.el5 #1
May 29 04:02:14 wn161 kernel: RIP: 0010:[<ffffffff80064bfc>] [<ffffffff80064bfc>] .text.lock.spinlock+0x2/0x30
May 29 04:02:14 wn161 kernel: RSP: 0018:ffff8101b9c47cf8 EFLAGS: 00000286
May 29 04:02:14 wn161 kernel: RAX: 0000000000000001 RBX: ffff81013c528c80 RCX: 0000000000000000
May 29 04:02:14 wn161 kernel: RDX: 0000000000000022 RSI: 0000000004024cc0 RDI: ffff81039c6d58c0
May 29 04:02:14 wn161 kernel: RBP: 0000000000000282 R08: ffffc20012a3a000 R09: 0000000000000000
May 29 04:02:14 wn161 kernel: R10: ffff8101253a8c00 R11: 0000000000000248 R12: ffff810232b40200
May 29 04:02:14 wn161 kernel: R13: ffffffff887da24a R14: ffff81020f2d8a80 R15: 0000000000000000
May 29 04:02:14 wn161 kernel: FS: 0000000000000000(0000) GS:ffff81010c499440(0000) knlGS:0000000000000000
May 29 04:02:14 wn161 kernel: CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
May 29 04:02:14 wn161 kernel: CR2: 00000000f7f4e000 CR3: 0000000000201000 CR4: 00000000000006e0
This means there is a problem with the node that most likely can only be cured by a reset.
Once the node has come back up, make sure lustre is mounted, then restart the grid services:
- ssh wn161
- service lustre start
- service grid-service restart
- service gmond restart
That should get everything going again.
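To spot these soft lockups across many nodes at once, a grep over the collected logs on xen02 (a sketch) does the trick:

# Run on xen02: list the nodes that have logged a Lustre soft lockup
grep -l "soft lockup" /var/log/phoenix/*/messages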
2. Lustre itself has gone nuts
The best way to check is with lfs df. If all is good, it will return this (a scripted version of the check is sketched after the sample output):
May 30 16:49 [root@xen11:~]# ssh wn161 lfs df
UUID 1K-blocks Used Available Use% Mounted on
scratch-MDT0000_UUID 1537958380 3162932 1446904876 0% /lustre/scratch[MDT:0]
scratch-OST0000_UUID 3845783560 219166780 3431255004 5% /lustre/scratch[OST:0]
scratch-OST0001_UUID 3845783560 227456072 3422966860 5% /lustre/scratch[OST:1]
scratch-OST0002_UUID 3845783560 221470296 3428956596 5% /lustre/scratch[OST:2]
scratch-OST0003_UUID 3845783560 222771872 3427653524 5% /lustre/scratch[OST:3]
scratch-OST0004_UUID 3845783560 217679228 3432744644 5% /lustre/scratch[OST:4]
scratch-OST0005_UUID 3845783560 226387856 3424037128 5% /lustre/scratch[OST:5]
scratch-OST0006_UUID 3845783560 212193564 3438228380 5% /lustre/scratch[OST:6]
scratch-OST0007_UUID 3845783560 216765892 3433663092 5% /lustre/scratch[OST:7]
scratch-OST0008_UUID 3845783560 220501124 3429923164 5% /lustre/scratch[OST:8]
scratch-OST0009_UUID 3845783560 222424724 3427998656 5% /lustre/scratch[OST:9]
scratch-OST000a_UUID 3845783560 221762460 3428662804 5% /lustre/scratch[OST:10]
scratch-OST000b_UUID 3845783560 220137692 3430291280 5% /lustre/scratch[OST:11]
scratch-OST000c_UUID 3845783560 215120080 3435303932 5% /lustre/scratch[OST:12]
scratch-OST000d_UUID 3845783560 212610320 3437818128 5% /lustre/scratch[OST:13]
scratch-OST000e_UUID 3845783560 218839300 3431585980 5% /lustre/scratch[OST:14]
scratch-OST000f_UUID 3845783560 215957740 3434471216 5% /lustre/scratch[OST:15]
scratch-OST0010_UUID 3845783560 227552128 3422870716 5% /lustre/scratch[OST:16]
scratch-OST0011_UUID 3845783560 228407164 3422015724 5% /lustre/scratch[OST:17]
scratch-OST0012_UUID 3845783560 213511964 3436911568 5% /lustre/scratch[OST:18]
scratch-OST0013_UUID 3845783560 223095692 3427327972 5% /lustre/scratch[OST:19]
scratch-OST0014_UUID 3845783560 217036892 3433386504 5% /lustre/scratch[OST:20]
scratch-OST0015_UUID 3845783560 219465120 3430956096 5% /lustre/scratch[OST:21]
scratch-OST0016_UUID 3845783560 220117284 3430305440 5% /lustre/scratch[OST:22]
scratch-OST0017_UUID 3845783560 218675392 3431746676 5% /lustre/scratch[OST:23]
scratch-OST0018_UUID 3845783560 237427616 3412995440 6% /lustre/scratch[OST:24]
scratch-OST0019_UUID 3845783560 223293232 3427129084 5% /lustre/scratch[OST:25]
scratch-OST001a_UUID 3845783560 215128968 3435296652 5% /lustre/scratch[OST:26]
scratch-OST001b_UUID 3845783560 238067324 3412361636 6% /lustre/scratch[OST:27]
scratch-OST001c_UUID 3845783560 215975200 3434453780 5% /lustre/scratch[OST:28]
scratch-OST001d_UUID 3845783560 219035464 3431387460 5% /lustre/scratch[OST:29]
scratch-OST001e_UUID 3845783560 225213428 3425210188 5% /lustre/scratch[OST:30]
scratch-OST001f_UUID 3845783560 216786528 3433638072 5% /lustre/scratch[OST:31]
filesystem summary: 123065073920 7070034396 109743553396 5% /lustre/scratch
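A scripted version of the same check (a sketch; it expects the 32 OSTs of this layout and uses wn161 as the example client):

# Count the OSTs that lfs df can see from a client
n=$(ssh wn161 lfs df 2>/dev/null | grep -c '\[OST:')
[ "$n" -eq 32 ] || echo "WARNING: only $n of 32 OSTs visible"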
Otherwise, you will notice that some of the mounts are missing - typically the 4 OSTs served by the failed OSS. In that case, find out which OSS caused the problem:
May 30 16:49 [root@xen11:~]# phoenix lustre "mount |grep lustre"
----------------
mds1
----------------
/dev/md10 on /lustre/scratch/mdt0 type lustre (rw)
----------------
oss[11,21,31,41]
----------------
/dev/md31 on /lustre/scratch/bmp11 type ext3 (rw)
/dev/md11 on /lustre/scratch/ost11 type lustre (rw)
/dev/md33 on /lustre/scratch/bmp13 type ext3 (rw)
/dev/md13 on /lustre/scratch/ost13 type lustre (rw)
/dev/md35 on /lustre/scratch/bmp15 type ext3 (rw)
/dev/md15 on /lustre/scratch/ost15 type lustre (rw)
/dev/md37 on /lustre/scratch/bmp17 type ext3 (rw)
/dev/md17 on /lustre/scratch/ost17 type lustre (rw)
----------------
oss[12,22,32,42]
----------------
/dev/md30 on /lustre/scratch/bmp10 type ext3 (rw)
/dev/md10 on /lustre/scratch/ost10 type lustre (rw)
/dev/md32 on /lustre/scratch/bmp12 type ext3 (rw)
/dev/md12 on /lustre/scratch/ost12 type lustre (rw)
/dev/md34 on /lustre/scratch/bmp14 type ext3 (rw)
/dev/md14 on /lustre/scratch/ost14 type lustre (rw)
/dev/md36 on /lustre/scratch/bmp16 type ext3 (rw)
/dev/md16 on /lustre/scratch/ost16 type lustre (rw)
This is the ideal situation, with everything mounted just fine. If something is awry, like an oss or mds being down, go to the corresponding partner server (if oss11 is down, go to oss12) and issue the following command:
- /usr/lib64/heartbeat/hb_takeover foreign
That should assemble the raids and mount everything for you; it takes about 5 minutes, and you can verify it as sketched below. (See LustreFreezeUpProcedures.)
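After the takeover, the surviving OSS carries both sets of raids; a quick way to verify (a sketch, run on the OSS that did the takeover):

# Normally an OSS holds 4 lustre mounts; after a takeover it should hold 8
mount -t lustre | wc -l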
2a. Another issue - frozen IB card
In a perfect world, this will be the response:
May 30 16:55 [root@xen11:~]# phoenix lustre "ifconfig ib0|grep Bcast"
----------------
mds1
----------------
inet addr:148.187.70.34 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
mds2
----------------
inet addr:148.187.70.35 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss11
----------------
inet addr:148.187.70.3 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss12
----------------
inet addr:148.187.70.4 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss21
----------------
inet addr:148.187.70.9 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss22
----------------
inet addr:148.187.70.10 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss31
----------------
inet addr:148.187.70.15 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss32
----------------
inet addr:148.187.70.16 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss41
----------------
inet addr:148.187.70.21 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss42
----------------
inet addr:148.187.70.22 Bcast:148.187.71.255 Mask:255.255.252.0
Otherwise, if an OSS is missing its IB device, reset it with ireset; once the OSS has rebooted, balance out the mount points with hb_takeover local. A scripted version of the ib0 check is sketched below.
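The same check can be scripted so that a server with a dead ib0 stands out immediately (a sketch; the node list is an assumption):

# Flag any Lustre server whose ib0 has no address
for node in mds1 mds2 oss11 oss12 oss21 oss22 oss31 oss32 oss41 oss42; do
  ssh $node "ifconfig ib0 2>/dev/null | grep -q 'inet addr'" \
    || echo "$node: ib0 has no address - candidate for ireset"
done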
2b. How to Disable a Bad OST
When an OST (a raid on an OSS) is unavailable, it causes the clients to freeze on disk access - df, ls, and so on.
To mitigate this, you need to disable the OST so that the clients ignore it and jobs can continue to run.
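The usual way to do this on this Lustre version is to deactivate the OSC that points at the bad OST (a sketch, using OST0007 as the example; the device number will differ):

# On the mds: find the osc device number for the bad OST, then deactivate it
lctl dl | grep OST0007
lctl --device <devno> deactivate   # substitute the device number printed above
# On the clients: mark the same OST inactive so df/ls stop hanging on it
lctl set_param osc.scratch-OST0007-osc-*.active=0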