Solutions to Operational Issues
Nodes drained due to jobs not ending on signals
This can happen for various reasons, e.g., when the processes are blocked by tasks in kernel space, or when the Slurm controller is so busy that communication is not possible. Such occurrences often come in bunches, so a large number of nodes may be affected. The timeout before a node is flagged as problematic is already 3 minutes, so increasing it further probably does not make sense.
- Check on at least one of the nodes whether the node is ok and responsive (e.g., run the nhc health check utility, or list some path from /pnfs/psi.ch/cms/trivcat/store).
- If the nodes are ok, use a command like the following from the slurm master node to put them back to work:
scontrol update NodeName=t3wn[56,71,73] State=RESUME Reason=""
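Before resuming, it helps to see which nodes are drained and why; sinfo -R lists the drain Reason. A minimal sketch for looping the health check over the affected nodes follows; the expand_hostlist helper is a hypothetical convenience for illustration (on the Slurm master, scontrol show hostnames 't3wn[56,71,73]' performs the proper expansion):

```shell
# List drained nodes together with the Reason that drained them:
#   sinfo -R --states=drain

# Hypothetical helper: expand a simple Slurm hostlist (comma lists only,
# no ranges) so the health check can be looped over each node. On the
# master, "scontrol show hostnames 't3wn[56,71,73]'" does this properly.
expand_hostlist() {
  case "$1" in
    *\[*\]*)
      prefix=${1%%\[*}           # part before the bracket, e.g. t3wn
      inner=${1#*\[}             # strip up to and including the "["
      inner=${inner%\]}          # strip the trailing "]"
      printf '%s\n' "$inner" | tr ',' '\n' \
        | while read -r n; do printf '%s%s\n' "$prefix" "$n"; done
      ;;
    *) printf '%s\n' "$1" ;;     # plain hostname, pass through
  esac
}

nodes=$(expand_hostlist 't3wn[56,71,73]')
# for n in $nodes; do ssh "$n" nhc; done   # run the health check on each node
```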
Users banned by fail2ban on one of the UI nodes
We run fail2ban on the UI nodes to provide at least minimal protection against brute-force attacks. If a genuine user reports via our admin list that they have been banned, identify the user and source address from the journal logs, and then check whether that address has indeed been banned.
There is a small helper script for this in the /root home directory of the UI nodes: is-user-banned.sh. Run it with the user name like this:
/root/is-user-banned.sh noehte_l
banned: 129.129.71.65
To unban an address, execute
fail2ban-client unban 129.129.71.65
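For illustration, here is a minimal sketch of the kind of lookup such a script might perform: collect the source addresses a user connected from (as seen in the sshd journal) and check each against fail2ban's ban list. The sample log lines, jail name (sshd), and parsing below are assumptions for this sketch, not the actual script; in real use the input would come from journalctl and fail2ban-client rather than sample variables.

```shell
user="noehte_l"

# Sample journal lines; in reality: journalctl -u sshd (format varies by system)
journal='Aug 15 10:01:02 t3ui01 sshd[123]: Failed password for noehte_l from 129.129.71.65 port 4242 ssh2
Aug 15 10:01:09 t3ui01 sshd[124]: Accepted password for other_u from 192.0.2.10 port 4243 ssh2'

# Sample ban list; in reality: fail2ban-client status sshd
banned='129.129.71.65'

# Source addresses this user connected from (field after "from" on matching lines)
addrs=$(printf '%s\n' "$journal" \
  | awk -v u="$user" '$0 ~ (" for " u " from ") {
      for (i = 1; i <= NF; i++) if ($i == "from") print $(i+1)
    }' | sort -u)

# Report any of those addresses that appear in the ban list
result=$(for a in $addrs; do
  printf '%s\n' "$banned" | grep -qx "$a" && echo "banned: $a"
done)
echo "$result"
```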
--
DerekFeichtinger - 2023-08-15