Solutions to Operational Issues
Daily check of Slurm
Even though many problems are signalled through mails by Icinga, I also prefer to have a look each morning on the Slurm master, t3slurm.psi.ch. These are the kinds of queries I usually run to get a short overview:
sinfo -o "%.12P %.16F %.16C %.14l %.16L %.12G %N"
sinfo -R
squeue -o "%.18i %.8P %.15j %.12u (%a) %.2t %.10M %.6Dn %.6Ccpu %A %.20S %.30R" | less
sprio -l -S Y | less # only if needed
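If wanted, the overview queries can be bundled into a small wrapper (a hypothetical sketch, not an existing script on t3slurm) so the whole morning check can be paged through in one go:
#!/bin/bash
# Morning overview of the cluster state (hypothetical helper script).
{
  echo "### Partition overview (nodes and CPUs as allocated/idle/other/total)"
  sinfo -o "%.12P %.16F %.16C %.14l %.16L %.12G %N"
  echo; echo "### Down/drained nodes and their reasons"
  sinfo -R
  echo; echo "### Job queue"
  squeue -o "%.18i %.8P %.15j %.12u (%a) %.2t %.10M %.6Dn %.6Ccpu %A %.20S %.30R"
} | less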
To make investigations easier, these alias definitions exist for the root account on t3slurm:
alias df_sinfo='sinfo -o "%.12P %.16F %.16C %.14l %.16L %.12G %N"'
alias df_squeue='squeue -o "%.18i %.2t %.8P %.15j %.12u (%7a) %.10M %.6Dn %.6Ccpu %.10m %.20S %.20R %Z"'
alias df_sacct='sacct --format="JobID%16,User%12,State%16,partition,time,elapsed,TotalCPU,UserCPU,ReqMem,MaxRss,MaxVMSize,ncpus,nnodes,reqcpus,reqnode,Start,End,NodeList"'
Especially the df_sacct alias is useful for a fast investigation of past jobs.
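For example, to inspect only the job allocations of a particular user over the last two days (the user name here is just a placeholder), the alias can be combined with sacct's usual selection options:
df_sacct -X -u someuser -S now-2days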
Slurm: Nodes drained due to Jobs not ending with signals
This can happen for various reasons, e.g. when the processes are blocked by tasks in kernel space, or if the Slurm controller is so busy that communication is not possible. Such occurrences often happen in bunches, so a large number of nodes may be affected. The timeout before a node is identified as problematic is already 3 minutes, so enlarging it probably does not make sense.
- Check on at least one of the nodes whether it is ok and responsive (e.g., run the nhc health check utility, or list some path from /pnfs/psi.ch/cms/trivcat/store); see the sketch after this list.
- If the nodes are ok, use a command like the following from the Slurm master node to put them back to work:
scontrol update NodeName=t3wn[56,71,73] State=RESUME Reason=""
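A minimal sketch of such a check, assuming passwordless ssh from t3slurm to the worker nodes (the node names below are only examples taken from above):
# List drained nodes together with the recorded reason
sinfo -R --states=DRAIN

# Quick sanity check on each suspect node: run nhc and stat a dCache path
for n in t3wn56 t3wn71 t3wn73; do
    echo "=== $n ==="
    ssh "$n" 'nhc && ls /pnfs/psi.ch/cms/trivcat/store >/dev/null && echo OK'
done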
Users banned by fail2ban on one of the UI nodes
We are running fail2ban on the UI nodes to provide at least a minimum of protection against brute force attacks. If a genuine user reports via our admin list that they have been banned, you need to identify the user and source address from the journal logs and then check whether that address has indeed been banned. Ideally the user can provide the address themselves, but frequently they do not know it.
There is a little script to help with discovering the address associated with the banning of a particular user. It is found on the UI nodes at /root/bin/is-user-banned.sh. You run it with the user name like this:
/root/bin/is-user-banned noehte_l
banned: 129.129.71.65
The script greps through the sshd logs to identify lines where the user failed authentication and composes a host list. It then checks whether those host addresses are on fail2ban's banned list. This can fail if the failed logins happened a longer time ago, or if another account from the same IP address triggered the banning. You can use -s 30 to extend the log search to 30 days back.
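The following is only a rough illustration of the idea behind the script, not its actual implementation; it assumes the sshd log lines are readable through journalctl and that the fail2ban-client "banned" query (fail2ban >= 0.11) is available:
#!/bin/bash
# Sketch: find source IPs of failed sshd logins for a user and ask fail2ban about them.
user=noehte_l     # example user
days=30           # how far back to search the journal
ips=$(journalctl -u sshd --since "${days} days ago" \
        | grep -E "Failed (password|publickey) for (invalid user )?${user} " \
        | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort -u)
for ip in $ips; do
    echo "$ip -> $(fail2ban-client banned "$ip")"
done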
If you know the IP address, you can check whether it is banned by querying fail2ban explicitly:
fail2ban-client banned 81.221.220.179
[['sshd']]
In the above case, the address is indeed banned from ssh (it's in the 'sshd' jail).
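The complete set of addresses currently banned in the sshd jail can also be listed via the jail status:
fail2ban-client status sshd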
In order to unban an address, you need to execute
fail2ban-client unban 129.129.71.65
--
DerekFeichtinger - 2023-08-15