Generally, Lustre should start on its own via heartbeat. Heartbeat is chkconfig'd on across all the Lustre nodes, so when the nodes reboot they negotiate between themselves and start all the services.
It is advised to start Lustre using only
service heartbeat start
and then just wait. The act of mounting the partitions loads the Lustre-related modules; nothing further is needed.
ALWAYS REMEMBER - be patient! It takes about 3 times longer than you want it to!
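If you want to confirm that heartbeat really did bring everything up, a quick loop like this (a minimal sketch; the node list is an assumption based on the layout below) can be run from a management node:

# Sketch: check heartbeat and the Lustre mounts on every Lustre server
for node in mds1 mds2 oss11 oss12 oss21 oss22 oss31 oss32 oss41 oss42; do
  echo "=== $node ==="
  ssh $node 'service heartbeat status; mount -t lustre'
done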
* LustreDiskChangeProcedures
* LustreFreezeUpProcedures

OST Layout
|      | oss11   | oss21   | oss31   | oss41   |
| md11 | OST0000 | OST0002 | OST0004 | OST0006 |
| md13 | OST0008 | OST000a | OST000c | OST000e |
| md15 | OST0010 | OST0012 | OST0014 | OST0016 |
| md17 | OST0018 | OST001a | OST001c | OST001e |

|      | oss12   | oss22   | oss32   | oss42   |
| md10 | OST0001 | OST0003 | OST0005 | OST0007 |
| md12 | OST0009 | OST000b | OST000d | OST000f |
| md14 | OST0011 | OST0013 | OST0015 | OST0017 |
| md16 | OST0019 | OST001b | OST001d | OST001f |
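If you need to double-check which OSTs a given OSS is actually serving, a quick look on the OSS itself (a sketch, using oss11 and /dev/md11 as the example) should tell you:

# List the Lustre targets currently mounted on this OSS
mount -t lustre
# Read the target label straight off a raid device without mounting it
tunefs.lustre --dryrun /dev/md11 | grep Target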
--
JasonTemple - 2010-04-13
Lustre FSCK procedure:
3 Steps:
1. e2fsck on the mds
- unmount lustre everywhere, including on the lustre servers:
- phoenix lustre "service heartbeat stop"
- start up the raids on the mds:
- mdadm --assemble -c /etc/mdadm.conf.local /dev/md10
- next, mount gpfs so you have a workspace
- phoenix lustre "/usr/lpp/mmfs/bin/mmstartup;sleep 2;/usr/lpp/mmfs/bin/mmmount scratch"
- now, run the e2fsck:
- e2fsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb /dev/md10
- this will output a metadata database file (mdsdb) which you use in the next steps; a consolidated sketch of this step follows below
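For reference, the mds-side part of step 1 can be sketched as a single script (a sketch only; it assumes heartbeat has already been stopped everywhere and gpfs is mounted as above):

#!/bin/bash
# Sketch of step 1, run on the active mds.
mkdir -p /gpfs/lustre_fsck
# assemble the MDT raid
mdadm --assemble -c /etc/mdadm.conf.local /dev/md10
# read-only check of the MDT; writes the mdsdb used by steps 2 and 3
e2fsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb /dev/md10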
2. e2fsck on the oss machines
- first, start up the raids using this script (/gpfs/jason/start_raid.sh):
#!/bin/bash
# Assemble the md raids for the OSTs served by this OSS (without starting Lustre).
OSSNAME=`uname -n | cut -d. -f1`
# ossX1 nodes serve the odd-numbered raids, ossX2 nodes the even-numbered ones
case ${OSSNAME} in
oss?1) OSTs="1 3 5 7" ;;
oss?2) OSTs="0 2 4 6" ;;
*) echo "Wrong node, exiting!"; exit 1 ;;
esac
for i in $OSTs; do
    # assemble the bitmap device (md3x) and the raid devices (md2x, md1x) for each OST
    mdadm -A -c /etc/mdadm.conf.local /dev/md3$i 2>&1 | /usr/bin/logger -t "initlustre"
    mount /lustre/scratch/bmp1$i && /usr/bin/logger -t "initlustre" "mounted bitmap device /dev/md3$i"
    mdadm -A -c /etc/mdadm.conf.local /dev/md2$i 2>&1 | /usr/bin/logger -t "initlustre"
    mdadm -A -c /etc/mdadm.conf.local /dev/md1$i 2>&1 | /usr/bin/logger -t "initlustre"
done
- And you run it like this:
- dsh -w oss[11-12,21-22,31-32,41-42] /gpfs/jason/start_raid.sh
- Then, you run the e2fsck on the servers, like this:
- dsh -w oss[11-12,21-22,31-32,41-42] sh /gpfs/jason/lustre_e2fsck.sh
- using this script:
#!/bin/bash
# Run a read-only e2fsck of every OST raid on this OSS, writing one ostdb per raid.
OSSNAME=`uname -n | cut -d. -f1`
case ${OSSNAME} in
oss?1) OSTs="1 3 5 7" ;;
oss?2) OSTs="0 2 4 6" ;;
*) echo "Wrong node, exiting!"; exit 1 ;;
esac
typeset LOG="/gpfs/lustre_fsck/$OSSNAME.out"
# announce the logfile if we are on a terminal, then send all output to it
[ -t 1 ] && echo "Writing to logfile '$LOG'."
exec > $LOG 2>&1
exec < /dev/null
for i in $OSTs; do
    e2fsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb --ostdb /gpfs/lustre_fsck/${OSSNAME}.ostdb.${i} /dev/md1${i}
done
which generates a logfile for each oss, as well as an ostdb file for each raid (a quick completeness check is sketched after the listing below):
Mar 16 14:30 [root@oss11:~]# ls /gpfs/lustre_fsck/
mdsdb oss11.ostdb.3 oss11.out oss12.ostdb.4 oss21.ostdb.1 oss21.ostdb.7 oss22.ostdb.2 oss22.out oss31.ostdb.5 oss32.ostdb.0 oss32.ostdb.6 oss41.ostdb.3 oss41.out oss42.ostdb.4
mdsdb.mdshdr oss11.ostdb.5 oss12.ostdb.0 oss12.ostdb.6 oss21.ostdb.3 oss21.out oss22.ostdb.4 oss31.ostdb.1 oss31.ostdb.7 oss32.ostdb.2 oss32.out oss41.ostdb.5 oss42.ostdb.0 oss42.ostdb.6
oss11.ostdb.1 oss11.ostdb.7 oss12.ostdb.2 oss12.out oss21.ostdb.5 oss22.ostdb.0 oss22.ostdb.6 oss31.ostdb.3 oss31.out oss32.ostdb.4 oss41.ostdb.1 oss41.ostdb.7 oss42.ostdb.2 oss42.out
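Before moving on to step 3, it is worth checking that every ostdb actually got written (a quick sketch; the expected count of 32 matches the OST layout above):

# One ostdb per OST (32 here) plus the mdsdb must be present
ls /gpfs/lustre_fsck/*.ostdb.* | wc -l   # expect 32
ls -l /gpfs/lustre_fsck/mdsdb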
3. run lfsck from a client:
- stop the raids on all the servers
- ssh mds1 mdadm --stop /dev/md10
- dsh -w oss[11-12,21-22,31-32,41-42] /gpfs/jason/stop_raid.sh
- start lustre on all the servers
- dsh -w oss[11-12,21-22,31-32,41-42] service heartbeat start
- dsh -w mds[1,2] service heartbeat start
- Make sure gpfs and lustre are running on a client and that e2fsprogs is installed, then start the lfsck (I run it from a script, /gpfs/jason/lfsck.sh, which logs its output with the same typeset/exec trick as above; a sketch appears after the command below):
- lfsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb --ostdb /gpfs/lustre_fsck/oss11.ostdb.1 /gpfs/lustre_fsck/oss11.ostdb.3 /gpfs/lustre_fsck/oss11.ostdb.5 /gpfs/lustre_fsck/oss11.ostdb.7 /gpfs/lustre_fsck/oss12.ostdb.0 /gpfs/lustre_fsck/oss12.ostdb.2 /gpfs/lustre_fsck/oss12.ostdb.4 /gpfs/lustre_fsck/oss12.ostdb.6 /gpfs/lustre_fsck/oss21.ostdb.1 /gpfs/lustre_fsck/oss21.ostdb.3 /gpfs/lustre_fsck/oss21.ostdb.5 /gpfs/lustre_fsck/oss21.ostdb.7 /gpfs/lustre_fsck/oss22.ostdb.0 /gpfs/lustre_fsck/oss22.ostdb.2 /gpfs/lustre_fsck/oss22.ostdb.4 /gpfs/lustre_fsck/oss22.ostdb.6 /gpfs/lustre_fsck/oss31.ostdb.1 /gpfs/lustre_fsck/oss31.ostdb.3 /gpfs/lustre_fsck/oss31.ostdb.5 /gpfs/lustre_fsck/oss31.ostdb.7 /gpfs/lustre_fsck/oss32.ostdb.0 /gpfs/lustre_fsck/oss32.ostdb.2 /gpfs/lustre_fsck/oss32.ostdb.4 /gpfs/lustre_fsck/oss32.ostdb.6 /gpfs/lustre_fsck/oss41.ostdb.1 /gpfs/lustre_fsck/oss41.ostdb.3 /gpfs/lustre_fsck/oss41.ostdb.5 /gpfs/lustre_fsck/oss41.ostdb.7 /gpfs/lustre_fsck/oss42.ostdb.0 /gpfs/lustre_fsck/oss42.ostdb.2 /gpfs/lustre_fsck/oss42.ostdb.4 /gpfs/lustre_fsck/oss42.ostdb.6 /lustre/scratch/
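A sketch of what /gpfs/jason/lfsck.sh can look like, building the --ostdb list with a glob instead of typing all 32 paths (the log path is an assumption):

#!/bin/bash
# Sketch of a wrapper for the lfsck command above
typeset LOG="/gpfs/lustre_fsck/lfsck.out"
exec > $LOG 2>&1
# the glob expands to the 32 ostdb files; the last argument is the client mountpoint
lfsck -n -v --mdsdb /gpfs/lustre_fsck/mdsdb \
  --ostdb /gpfs/lustre_fsck/oss*.ostdb.* \
  /lustre/scratch/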
Here is the documentation for the Lustre filesystem: CSCS_Lustre_Runbook_v0.8.pdf
Common Problems / Troubleshooting
1. Worker node goes nuts
- System load increases dramatically (>100) and stays there
- System becomes unresponsive
You can check the logs on xen02:/var/log/phoenix/<hostname>/messages, and you should see something like:
May 29 04:02:14 wn161 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_sa_31265:17598]
May 29 04:02:14 wn161 kernel: CPU 1:
May 29 04:02:14 wn161 kernel: Modules linked in: loop mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs fscache nfs_acl rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) lockd sunrpc ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) bonding ipv6 xfrm_nalgo crypto_api ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink iptable_mangle ip_tables x_tables dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg i2c_i801 i2c_core shpchp e1000e pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
May 29 04:02:14 wn161 kernel: Pid: 17598, comm: ll_sa_31265 Tainted: G 2.6.18-194.32.1.el5 #1
May 29 04:02:14 wn161 kernel: RIP: 0010:[<ffffffff80064bfc>] [<ffffffff80064bfc>] .text.lock.spinlock+0x2/0x30
May 29 04:02:14 wn161 kernel: RSP: 0018:ffff8101b9c47cf8 EFLAGS: 00000286
May 29 04:02:14 wn161 kernel: RAX: 0000000000000001 RBX: ffff81013c528c80 RCX: 0000000000000000
May 29 04:02:14 wn161 kernel: RDX: 0000000000000022 RSI: 0000000004024cc0 RDI: ffff81039c6d58c0
May 29 04:02:14 wn161 kernel: RBP: 0000000000000282 R08: ffffc20012a3a000 R09: 0000000000000000
May 29 04:02:14 wn161 kernel: R10: ffff8101253a8c00 R11: 0000000000000248 R12: ffff810232b40200
May 29 04:02:14 wn161 kernel: R13: ffffffff887da24a R14: ffff81020f2d8a80 R15: 0000000000000000
May 29 04:02:14 wn161 kernel: FS: 0000000000000000(0000) GS:ffff81010c499440(0000) knlGS:0000000000000000
May 29 04:02:14 wn161 kernel: CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
May 29 04:02:14 wn161 kernel: CR2: 00000000f7f4e000 CR3: 0000000000201000 CR4: 00000000000006e0
This means there is a problem with the node that most likely can only be cured by a reset.
Once the node has come back up, make sure lustre is mounted, then restart the grid services:
- ssh wn161
- service lustre start
- service grid-service restart
- service gmond restart
That should get everything going again.
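To spot these soft lockups across many nodes at once, a grep over the collected logs on xen02 (a sketch) does the trick:

# Run on xen02: list the nodes that have logged a Lustre soft lockup
grep -l "soft lockup" /var/log/phoenix/*/messages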
2. Lustre itself has gone nuts
The best way to check is with lfs df. If all is good, it will return this (a scripted version of the check is sketched after the sample output):
May 30 16:49 [root@xen11:~]# ssh wn161 lfs df
UUID 1K-blocks Used Available Use% Mounted on
scratch-MDT0000_UUID 1537958380 3162932 1446904876 0% /lustre/scratch[MDT:0]
scratch-OST0000_UUID 3845783560 219166780 3431255004 5% /lustre/scratch[OST:0]
scratch-OST0001_UUID 3845783560 227456072 3422966860 5% /lustre/scratch[OST:1]
scratch-OST0002_UUID 3845783560 221470296 3428956596 5% /lustre/scratch[OST:2]
scratch-OST0003_UUID 3845783560 222771872 3427653524 5% /lustre/scratch[OST:3]
scratch-OST0004_UUID 3845783560 217679228 3432744644 5% /lustre/scratch[OST:4]
scratch-OST0005_UUID 3845783560 226387856 3424037128 5% /lustre/scratch[OST:5]
scratch-OST0006_UUID 3845783560 212193564 3438228380 5% /lustre/scratch[OST:6]
scratch-OST0007_UUID 3845783560 216765892 3433663092 5% /lustre/scratch[OST:7]
scratch-OST0008_UUID 3845783560 220501124 3429923164 5% /lustre/scratch[OST:8]
scratch-OST0009_UUID 3845783560 222424724 3427998656 5% /lustre/scratch[OST:9]
scratch-OST000a_UUID 3845783560 221762460 3428662804 5% /lustre/scratch[OST:10]
scratch-OST000b_UUID 3845783560 220137692 3430291280 5% /lustre/scratch[OST:11]
scratch-OST000c_UUID 3845783560 215120080 3435303932 5% /lustre/scratch[OST:12]
scratch-OST000d_UUID 3845783560 212610320 3437818128 5% /lustre/scratch[OST:13]
scratch-OST000e_UUID 3845783560 218839300 3431585980 5% /lustre/scratch[OST:14]
scratch-OST000f_UUID 3845783560 215957740 3434471216 5% /lustre/scratch[OST:15]
scratch-OST0010_UUID 3845783560 227552128 3422870716 5% /lustre/scratch[OST:16]
scratch-OST0011_UUID 3845783560 228407164 3422015724 5% /lustre/scratch[OST:17]
scratch-OST0012_UUID 3845783560 213511964 3436911568 5% /lustre/scratch[OST:18]
scratch-OST0013_UUID 3845783560 223095692 3427327972 5% /lustre/scratch[OST:19]
scratch-OST0014_UUID 3845783560 217036892 3433386504 5% /lustre/scratch[OST:20]
scratch-OST0015_UUID 3845783560 219465120 3430956096 5% /lustre/scratch[OST:21]
scratch-OST0016_UUID 3845783560 220117284 3430305440 5% /lustre/scratch[OST:22]
scratch-OST0017_UUID 3845783560 218675392 3431746676 5% /lustre/scratch[OST:23]
scratch-OST0018_UUID 3845783560 237427616 3412995440 6% /lustre/scratch[OST:24]
scratch-OST0019_UUID 3845783560 223293232 3427129084 5% /lustre/scratch[OST:25]
scratch-OST001a_UUID 3845783560 215128968 3435296652 5% /lustre/scratch[OST:26]
scratch-OST001b_UUID 3845783560 238067324 3412361636 6% /lustre/scratch[OST:27]
scratch-OST001c_UUID 3845783560 215975200 3434453780 5% /lustre/scratch[OST:28]
scratch-OST001d_UUID 3845783560 219035464 3431387460 5% /lustre/scratch[OST:29]
scratch-OST001e_UUID 3845783560 225213428 3425210188 5% /lustre/scratch[OST:30]
scratch-OST001f_UUID 3845783560 216786528 3433638072 5% /lustre/scratch[OST:31]
filesystem summary: 123065073920 7070034396 109743553396 5% /lustre/scratch
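A scripted version of the same check (a sketch; it expects the 32 OSTs of this layout and uses wn161 as the example client):

# Count the OSTs that lfs df can see from a client
n=$(ssh wn161 lfs df 2>/dev/null | grep -c '\[OST:')
[ "$n" -eq 32 ] || echo "WARNING: only $n of 32 OSTs visible"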
Otherwise, you will notice that some of the mounts are missing - typically the 4 OSTs served by the failed OSS. In that case, find out which OSS caused the problem:
May 30 16:49 [root@xen11:~]# phoenix lustre "mount |grep lustre"
----------------
mds1
----------------
/dev/md10 on /lustre/scratch/mdt0 type lustre (rw)
----------------
oss[11,21,31,41]
----------------
/dev/md31 on /lustre/scratch/bmp11 type ext3 (rw)
/dev/md11 on /lustre/scratch/ost11 type lustre (rw)
/dev/md33 on /lustre/scratch/bmp13 type ext3 (rw)
/dev/md13 on /lustre/scratch/ost13 type lustre (rw)
/dev/md35 on /lustre/scratch/bmp15 type ext3 (rw)
/dev/md15 on /lustre/scratch/ost15 type lustre (rw)
/dev/md37 on /lustre/scratch/bmp17 type ext3 (rw)
/dev/md17 on /lustre/scratch/ost17 type lustre (rw)
----------------
oss[12,22,32,42]
----------------
/dev/md30 on /lustre/scratch/bmp10 type ext3 (rw)
/dev/md10 on /lustre/scratch/ost10 type lustre (rw)
/dev/md32 on /lustre/scratch/bmp12 type ext3 (rw)
/dev/md12 on /lustre/scratch/ost12 type lustre (rw)
/dev/md34 on /lustre/scratch/bmp14 type ext3 (rw)
/dev/md14 on /lustre/scratch/ost14 type lustre (rw)
/dev/md36 on /lustre/scratch/bmp16 type ext3 (rw)
/dev/md16 on /lustre/scratch/ost16 type lustre (rw)
This is the ideal situation, with everything mounted just fine. If something is awry, like an oss or mds being down, go to the corresponding partner server (if oss11 is down, go to oss12) and issue the following command:
- /usr/lib64/heartbeat/hb_takeover foreign
That should assemble the raids and mount everything for you; it takes about 5 minutes, and you can verify it as sketched below. (See LustreFreezeUpProcedures.)
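After the takeover, the surviving OSS carries both sets of raids; a quick way to verify (a sketch, run on the OSS that did the takeover):

# Normally an OSS holds 4 lustre mounts; after a takeover it should hold 8
mount -t lustre | wc -l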
2a. Another issue - frozen IB card
In a perfect world, this will be the response:
May 30 16:55 [root@xen11:~]# phoenix lustre "ifconfig ib0|grep Bcast"
----------------
mds1
----------------
inet addr:148.187.70.34 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
mds2
----------------
inet addr:148.187.70.35 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss11
----------------
inet addr:148.187.70.3 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss12
----------------
inet addr:148.187.70.4 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss21
----------------
inet addr:148.187.70.9 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss22
----------------
inet addr:148.187.70.10 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss31
----------------
inet addr:148.187.70.15 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss32
----------------
inet addr:148.187.70.16 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss41
----------------
inet addr:148.187.70.21 Bcast:148.187.71.255 Mask:255.255.252.0
----------------
oss42
----------------
inet addr:148.187.70.22 Bcast:148.187.71.255 Mask:255.255.252.0
Otherwise, if an OSS is missing its IB device, reset it with ireset; once the OSS has rebooted, balance out the mount points with hb_takeover local. A scripted version of the ib0 check is sketched below.
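The same check can be scripted so that a server with a dead ib0 stands out immediately (a sketch; the node list is an assumption):

# Flag any Lustre server whose ib0 has no address
for node in mds1 mds2 oss11 oss12 oss21 oss22 oss31 oss32 oss41 oss42; do
  ssh $node "ifconfig ib0 2>/dev/null | grep -q 'inet addr'" \
    || echo "$node: ib0 has no address - candidate for ireset"
done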
2b. How to Disable a Bad OST
When an OST (a raid on an OSS) is unavailable, it causes the clients to freeze on disk access - df, ls, and so on.
To mitigate this, you need to disable the OST so that the clients ignore it and jobs can continue to run.
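The usual way to do this on this Lustre version is to deactivate the OSC that points at the bad OST (a sketch, using OST0007 as the example; the device number will differ):

# On the mds: find the osc device number for the bad OST, then deactivate it
lctl dl | grep OST0007
lctl --device <devno> deactivate   # substitute the device number printed above
# On the clients: mark the same OST inactive so df/ls stop hanging on it
lctl set_param osc.scratch-OST0007-osc-*.active=0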