Lustre OSS Freeze Up Procedures

Occasionally, the hardware that Lustre runs on shows its true lame colors, and the controller becomes overwhelmed, which sometimes causes Lustre to get stuck on an OSS. This does not happen on the MDS.

How to Diagnose the Problem

I have written a script which checks for this condition, and will send an email alert every 30 minutes or so until the problem is dealt with. The script is very simple - it just checks the messages logs for two conditions:

grep mptbase /var/log/messages |grep Abort

This is generally an 'Abort' notification on writes that are supposed to be completed on a controller, but fail because it is in a frozen state. mptbase is the driver for the controller.

The second message it looks for is this:

grep "LustreError: dumping log" /var/log/messages

Finding this in the logs does not necessarily mean that Lustre is frozen. Occasionally, I see this as an isolated event in the logs, about once every two weeks. If it only happens once or twice, then there was just a temporary error that is not a show stopper. However, if the error keeps appearing in the logs over and over, then you have hit the freeze-up problem that this page describes.
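Since the difference between an isolated event and a real freeze is how often the messages repeat, it can help to count them. The following is a minimal sketch of the two checks combined, not the actual watcher script; the function name count_freeze_signs and its output format are made up for illustration.

```shell
# Hypothetical helper, NOT the real watcher script.
# Usage: count_freeze_signs [logfile]   (defaults to /var/log/messages)
count_freeze_signs() {
  local log
  local aborts
  local dumps
  log="${1:-/var/log/messages}"
  # grep -c prints the number of matching lines (0 when nothing matches).
  aborts=$(grep mptbase "$log" | grep -c Abort || true)
  dumps=$(grep -c "LustreError: dumping log" "$log" || true)
  echo "mptbase_aborts=$aborts lustre_dumps=$dumps"
}
```

Running count_freeze_signs with no argument a few minutes apart shows whether the counts are still climbing.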

When this condition is triggered, there are several signs that you will see that will immediately catch your eye:

1. The affected OSS will usually have a very high, unnatural load. I think it likes to go up to 199, if I'm not mistaken. The ganglia graph will show the obvious problem, because the value gets stuck, and it doesn't change.

2. All job scheduling will freeze.

3. You will get emails from my watcher program.

4. The running jobs chart will get stuck and not move either.
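The first sign can also be checked from a shell on the OSS instead of from ganglia. This is only a sketch: the function name load_is_high is hypothetical, and the default threshold of 100 is an arbitrary assumption, not a value taken from the watcher.

```shell
# Hypothetical helper: succeed (exit 0) if the 1-minute load average is at
# or above a threshold. The default of 100 is an arbitrary assumption.
# Usage: load_is_high [threshold] [loadavg_file]
load_is_high() {
  # The first field of /proc/loadavg is the 1-minute load average.
  awk -v limit="${1:-100}" \
    'NR == 1 { high = ($1 >= limit) } END { exit !high }' \
    "${2:-/proc/loadavg}"
}
```

For example, load_is_high 150 && echo "load looks stuck" prints the message only when the load is 150 or more.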

How to Deal with This:

The first thing you need to do is ssh to the OSS and look at the latest messages logs. You will probably see the mptbase:Abort errors, as well as the LustreError, over and over again.

The OSS machines work in pairs, so oss11 goes with oss12, oss21 with oss22, oss31 with oss32, and oss41 goes with oss42. Typically the mounts look like this:

[2010-07-26|10:38][root@oss41:~]# mount
/dev/sda3 on / type ext3 (rw)
...
/dev/md31 on /lustre/scratch/bmp11 type ext3 (rw)
/dev/md11 on /lustre/scratch/ost11 type lustre (rw)
/dev/md33 on /lustre/scratch/bmp13 type ext3 (rw)
/dev/md13 on /lustre/scratch/ost13 type lustre (rw)
/dev/md35 on /lustre/scratch/bmp15 type ext3 (rw)
/dev/md15 on /lustre/scratch/ost15 type lustre (rw)
/dev/md37 on /lustre/scratch/bmp17 type ext3 (rw)
/dev/md17 on /lustre/scratch/ost17 type lustre (rw)

[2010-07-26|10:38][root@oss42:~]# mount
/dev/sda3 on / type ext3 (rw)
...
/dev/md30 on /lustre/scratch/bmp10 type ext3 (rw)
/dev/md10 on /lustre/scratch/ost10 type lustre (rw)
/dev/md32 on /lustre/scratch/bmp12 type ext3 (rw)
/dev/md12 on /lustre/scratch/ost12 type lustre (rw)
/dev/md34 on /lustre/scratch/bmp14 type ext3 (rw)
/dev/md14 on /lustre/scratch/ost14 type lustre (rw)
/dev/md36 on /lustre/scratch/bmp16 type ext3 (rw)
/dev/md16 on /lustre/scratch/ost16 type lustre (rw)

Here, you can see the two OSSs have balanced mounts: 4 Lustre OSTs per OSS.

So, if the error happens on oss41, open another terminal and ssh to oss42.
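If you don't want to work out the partner in your head at 3am, the pairing rule above is simple enough to script. This is a sketch that assumes the ossN1/ossN2 naming holds for every pair; the function name partner_of is made up for illustration.

```shell
# Hypothetical helper: print the failover partner of an OSS, assuming the
# ossN1 <-> ossN2 naming described above.
# Usage: partner_of oss41
partner_of() {
  case "$1" in
    oss?1) echo "${1%1}2" ;;   # strip the trailing 1, append 2
    oss?2) echo "${1%2}1" ;;   # strip the trailing 2, append 1
    *) echo "unknown OSS name: $1" >&2; return 1 ;;
  esac
}
```

On an OSS itself, partner_of "$(uname -n | cut -d. -f1)" prints the name of the node to ssh to.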

On the system that is NOT affected, issue this command to force a failover:

/usr/lib64/heartbeat/hb_takeover foreign

There are three versions of this: hb_takeover local|foreign|all. Since we are trying to take over the disks from oss41, you run hb_takeover foreign on oss42.

Typically, when the system gets in this state, this command won't work. How do you know? Well, you tail -f the logs on the affected node, and watch to see if Lustre is setting the OST's to be read only, then turning off the different raids that make up the 4 OSTs.

When it works, it will mark all the OST's read-only, then disable the raids. When this completes, the other system is informed and starts to assemble the raids and mount the OST's.

The problem here is that this doesn't always work. When it doesn't work, you will see continued LustreErrors in the logs, and the raids will never be stopped.
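Besides tailing the logs, you can check whether the raids on the affected node have actually been stopped by looking at /proc/mdstat. A minimal sketch, assuming the usual "mdNN : active ..." line format; the function name active_raids is hypothetical.

```shell
# Hypothetical helper: print the names of md arrays that are still active.
# Usage: active_raids [mdstat_file]   (defaults to /proc/mdstat)
active_raids() {
  # Array lines look like "md11 : active raid6 ...", so field 3 is the state.
  awk '$1 ~ /^md[0-9]+$/ && $3 == "active" { print $1 }' "${1:-/proc/mdstat}"
}
```

If the handover is working, repeated runs of active_raids on the affected node should list fewer and fewer arrays. If the list never shrinks, the raids are not being stopped.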

When there isn't a problem, the failover takes between 2 and 10 minutes to initiate, depending on how busy the system is. Then, on the other OSS, the process from start to finish can take anywhere from 5 to 40 minutes. You will see the other server start the raids, then mount the Lustre partitions. Normally after a failover, the OSS will not allow the clients to connect to each OST for about 5-10 minutes while it replays all the transactions that were in flight when the problem happened.

So if you attempted the hb_takeover command and it doesn't work after about 20 or 30 minutes (BE PATIENT - this process isn't fast even when there isn't a problem), then you need to reset the hung server.
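The "wait up to 20 or 30 minutes, then reset" rule can be written as a polling loop with a timeout, so you don't have to watch the clock. This is a sketch only; the function name wait_until and the choice of check command are assumptions for illustration.

```shell
# Hypothetical helper: run a check command every INTERVAL seconds until it
# succeeds (return 0) or at least TIMEOUT seconds have passed (return 1).
# Usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
  local timeout
  local interval
  local waited
  timeout=$1
  interval=$2
  shift 2
  waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

For example, on oss42, wait_until 1800 60 grep -q ' /lustre/scratch/ost11 ' /proc/mounts checks once a minute for up to 30 minutes whether one of oss41's OSTs has been mounted. A non-zero exit status means it is time to consider the reset.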

[2010-07-20|09:25][root@xen02:~]# ireset oss41

After you do this, watch the logs on oss42. After the server goes down, it takes about 5 minutes for heartbeat to recognize that the other host disappeared, and then it will start the takeover of the OST's at that point. After about 20-30 minutes, it will complete. Then you should see this on oss42:

[2010-07-26|10:38][root@oss42:~]# mount
/dev/sda3 on / type ext3 (rw)
...
/dev/md31 on /lustre/scratch/bmp11 type ext3 (rw)
/dev/md11 on /lustre/scratch/ost11 type lustre (rw)
/dev/md33 on /lustre/scratch/bmp13 type ext3 (rw)
/dev/md13 on /lustre/scratch/ost13 type lustre (rw)
/dev/md35 on /lustre/scratch/bmp15 type ext3 (rw)
/dev/md15 on /lustre/scratch/ost15 type lustre (rw)
/dev/md37 on /lustre/scratch/bmp17 type ext3 (rw)
/dev/md17 on /lustre/scratch/ost17 type lustre (rw)
/dev/md30 on /lustre/scratch/bmp10 type ext3 (rw)
/dev/md10 on /lustre/scratch/ost10 type lustre (rw)
/dev/md32 on /lustre/scratch/bmp12 type ext3 (rw)
/dev/md12 on /lustre/scratch/ost12 type lustre (rw)
/dev/md34 on /lustre/scratch/bmp14 type ext3 (rw)
/dev/md14 on /lustre/scratch/ost14 type lustre (rw)
/dev/md36 on /lustre/scratch/bmp16 type ext3 (rw)
/dev/md16 on /lustre/scratch/ost16 type lustre (rw)
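Rather than counting the lines by eye, you can count the Lustre mounts. A minimal sketch; the function name count_lustre_osts is made up for illustration.

```shell
# Hypothetical helper: count Lustre mounts in `mount`-style output on stdin.
# Usage: mount | count_lustre_osts
count_lustre_osts() {
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  grep -c ' type lustre ' || true
}
```

After a successful takeover, mount | count_lustre_osts on the surviving OSS should print 8, which is the 4 OSTs from each node of the pair.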

What Happens if Failover Doesn't Work

Once in a while, failover doesn't work. What do you do? Don't cry. Don't stress, it's going to be ok.

There are basically two types of failure: 1) heartbeat won't start the Raid sets because the raid is not perfect, or 2) heartbeat just doesn't feel like working.

For situation 1)

Heartbeat has the tolerance for disk failure hard-coded into the binaries. I haven't yet found a way to circumvent this.

If there is a failed disk in your array (it has an (F) next to the disk name in /proc/mdstat), you must first replace the disk or remove it from the raid before attempting to start heartbeat again.
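To find the affected arrays quickly, you can filter /proc/mdstat for the (F) marker. A minimal sketch, assuming the usual "mdNN : active ..." line format; the function name failed_raids is hypothetical.

```shell
# Hypothetical helper: print each md array that has a member marked (F).
# Usage: failed_raids [mdstat_file]   (defaults to /proc/mdstat)
failed_raids() {
  # The member list is on the same line as the array name,
  # e.g. "md17 : active raid6 sdh[7](F) sdg[6] ...".
  awk '$1 ~ /^md[0-9]+$/ && /\(F\)/ { print $1 }' "${1:-/proc/mdstat}"
}
```

No output from failed_raids means no array currently has a disk marked as failed.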

This process is detailed here --> LustreDiskChangeProcedures. Once you've removed the disk from the raid and/or replaced the faulty disk, start lustre normally.

Which brings us to situation 2)

Heartbeat doesn't feel like starting. This can be either because of the above failed disk issue, or it is because of something I haven't figured out yet. But, in the above case #1, if you are in a big hurry and want to get lustre back up and running, the solution will be the same as it will be for case #2.

Starting Lustre by Hand

So you are tired of waiting for heartbeat to start lustre. Perhaps it's been 30 minutes since a full reboot, or something just hasn't come up the way you like. Here is how to start lustre by hand:

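# Work out which OSTs belong to this node from its short hostname.
OSSNAME=$(uname -n | cut -d. -f1)
case ${OSSNAME} in
oss?1)  OSTs="1 3 5 7" ;;
oss?2)  OSTs="0 2 4 6" ;;
*)  echo "Wrong node, exiting!"; exit 1 ;;
esac

# Assemble the raids for each OST and mount its bmp filesystem.
for i in $OSTs; do
  mdadm -A -c /etc/mdadm.conf.local /dev/md3$i
  mount /lustre/scratch/bmp1$i
  mdadm -A -c /etc/mdadm.conf.local /dev/md2$i
  mdadm -A -c /etc/mdadm.conf.local /dev/md1$i
done
# Mount each OST at the mount point listed in /etc/fstab, pausing between mounts.
for i in $OSTs; do
  mount "$(grep "^/dev/md1$i" /etc/fstab | awk '{print $2}')"
  sleep 10
done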

If a raid is missing a disk, then you need to start it by hand instead of using this fancy script.

You can tell this is happening when, in the messages log, as heartbeat is trying to assemble the raids, it will tell you something like "can't start raid due to missing disk". This can also happen if the mdadm.conf has incorrect information.

For example, if /dev/md17 is not starting:

 mdadm -A -c /etc/mdadm.conf.local /dev/md37
 mount /lustre/scratch/bmp17
 mdadm -A -c /etc/mdadm.conf.local /dev/md27
 # Here is the important line: --run starts the array even though a disk is missing.
 mdadm -A --run -c /etc/mdadm.conf.local /dev/md17
 mount /dev/md17

-- JasonTemple - 2010-07-26